Previously: One-Paragraph Reviews, Vol. I

I didn’t manage to stick to the one-paragraph format this time. I’m trying to write down:

  1. everything that was novel and notable to me when reading the paper
  2. as concisely as possible

… but (1) can be a lot of stuff because the things I’m reading about are generally things on which I’m not an expert! I’ll try moving stuff that isn’t related to the main point into footnotes to cheat. If the trend continues, though, I’ll have to think of how to make things more concise…

“Efficient Natural Language Response Suggestion for Smart Reply” is a paper by Matthew Henderson, Rami Al-Rfou, Brian Strope, Yun-hsuan Sung, Laszlo Lukacs, Ruiqi Guo, Sanjiv Kumar, Balint Miklos, and Ray Kurzweil (2017) on the algorithm behind Google’s pre-LLM1 “Smart Reply” feature, which suggests short replies like “I think it’s fine” or “It needs some work.”

The authors train a model composed of two neural network “towers”, one for the input email and one for the reply: each takes a vector representing an email, encoded as the sum2 of the n-gram embeddings of its words. The model learns to compute two vectors, $h_x$ for input emails and $h_y$ for response emails, such that the dot product $h_x \cdot h_y$ scores the probability $P(y|x)$ that an email $y$ is the reply to an email $x$.
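As a toy sketch of the two-tower scoring setup, with everything hypothetical: random weights, a tiny unigram vocabulary standing in for the paper’s n-gram features, and a single projection per “tower” where the paper uses deeper feedforward networks:

```python
import numpy as np

rng = np.random.default_rng(0)
D_EMBED, D_OUT = 64, 32

# Toy vocabulary and embedding table (stand-ins for learned n-gram embeddings).
vocab = {"i": 0, "think": 1, "it's": 2, "fine": 3, "needs": 4, "work": 5}
embeddings = rng.normal(size=(len(vocab), D_EMBED))

def encode(text):
    """Represent an email as the sum of its word embeddings."""
    ids = [vocab[w] for w in text.lower().split() if w in vocab]
    return embeddings[ids].sum(axis=0)

# Two separate "towers": here, one random projection each.
W_x = rng.normal(size=(D_EMBED, D_OUT))
W_y = rng.normal(size=(D_EMBED, D_OUT))

def tower_x(v):  # input-email tower -> h_x
    return np.tanh(v @ W_x)

def tower_y(v):  # response-email tower -> h_y
    return np.tanh(v @ W_y)

def score(email, reply):
    """Dot product h_x . h_y; training pushes this up for true (email, reply) pairs."""
    return tower_x(encode(email)) @ tower_y(encode(reply))

print(score("I think", "it's fine"))
```

The point of the two-tower shape is that all $h_y$ for the canned replies can be precomputed, so serving reduces to one tower pass plus a nearest-neighbor search over dot products.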

There are a few post-processing steps:

  1. adding $\alpha \log P_{\text{LM}}(y)$ to the score, where $\alpha$ is an arbitrary constant and $P_{\text{LM}}$ is computed by a language model, because the learned score $h_x \cdot h_y$ is biased towards “specific and long responses instead of short and generic ones;”
  2. a “diversification” stage where responses are clustered, to “omit redundant suggestions… and ensure a negative suggestion is given if the other two are affirmative and vice-versa”; and
  3. instead of computing full dot products when searching for appropriate response emails, computing smaller quantized3 dot products.
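Step 1 can be sketched with made-up numbers (the candidate scores, LM probabilities, and `ALPHA` below are all hypothetical, chosen to show the effect):

```python
import numpy as np

# Hypothetical per-candidate data: the tower score h_x . h_y and a
# language-model probability P_LM(y) for each canned reply.
candidates = {
    "I think it's fine": {"dot": 4.0, "p_lm": 1e-4},
    "Sounds good to me, thanks for the detailed write-up!": {"dot": 4.6, "p_lm": 1e-9},
}

ALPHA = 0.5  # trade-off constant; tuned on held-out data in practice

def final_score(c):
    # Adding alpha * log P_LM(y) penalizes long, specific replies that the
    # LM considers unlikely, counteracting the dot-product bias.
    return c["dot"] + ALPHA * np.log(c["p_lm"])

best = max(candidates, key=lambda y: final_score(candidates[y]))
print(best)  # -> "I think it's fine": the short generic reply wins after rescoring
```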

These days, you might use someone else’s text embedding model for $h_x$ and $h_y$, but you’d still need the post-processing steps; you would also need some transformation from input vectors to reply vectors so that $h_x \cdot h_y$ represents “$h_y$ is a reply to $h_x$” rather than just “$h_y$ is similar to $h_x$.” I wonder if LLMs might become cheap enough that $P_{\text{LM}}$ becomes all you need, similar to how spellchecking used to be an engineering feat but is now “3-5 lines of Python.”


1. Seq2Seq, the direct ancestor of the current generation of GPT-style LLMs, already existed at the time, but the authors wanted something more efficient.


2. Cf. the sinusoidal or learned positional encodings used in many current models. Sinusoidal positional encodings have a vaguely geometric interpretation: a token at a given position in a sentence is represented by the token’s embedding vector with a position-dependent translation applied, chosen so that the relationship between the translations at any two positions is “symmetrical and decays nicely with time”.
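A minimal sketch of Transformer-style sinusoidal encodings, illustrating that the dot product between two positions’ encodings depends only on their offset and shrinks as the offset grows:

```python
import numpy as np

def sinusoidal_encoding(positions, d_model):
    """Transformer-style sinusoidal positional encodings."""
    pos = np.asarray(positions)[:, None]   # (n, 1)
    i = np.arange(0, d_model, 2)[None, :]  # (1, d_model/2)
    angles = pos / (10000 ** (i / d_model))
    enc = np.zeros((len(positions), d_model))
    enc[:, 0::2] = np.sin(angles)  # even dimensions
    enc[:, 1::2] = np.cos(angles)  # odd dimensions
    return enc

pe = sinusoidal_encoding(range(100), 16)

# The dot product between two encodings depends only on the offset
# between the positions (symmetry) ...
print(pe[0] @ pe[1], pe[40] @ pe[41])  # equal: same offset
# ... and shrinks as the offset grows (decay):
print(pe[0] @ pe[1], pe[0] @ pe[50])
```

The symmetry falls out of the trig identity $\sin p \sin q + \cos p \cos q = \cos(p - q)$ applied per frequency.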


3. They learn a “hierarchical quantization” for each vector such that $h_y \approx \text{VQ}(h_y) + \text{R}^T \text{PQ}(r_y)$, where $\text{VQ}$ is a vector quantization, $\text{R}$ is a learned rotation, $r_y$ is the (rotated) residual $\text{R}(h_y - \text{VQ}(h_y))$, and $\text{PQ}$ is a product quantization (the Cartesian product of $\mathcal K$ independent vector quantizers). Vector quantization means replacing an $a$-dimensional vector with the nearest of $b$ “codebook” vectors; compression comes from storing only the index, $\log_2 b$ bits instead of $a$ floats. The codebook itself is learned much like $k$-means clustering: each codebook vector is a centroid standing in for the inputs closest to it.