The 2017 paper that every AI chatbot is built on

June 26, 2026 · 7 min

It gets to me when people turn a language model into a miracle. Go read how it works, it's all out there. So one day I sat down and went through the papers it stands on. There aren't many. The whole industry, every chatbot, every gadget with "AI" stuck on it, rides on a short chain of work, and almost all of it fits in twenty years. You can walk it fast. And once you do, the miracle evaporates, and the fear that's off the mark goes with it, and so do the inflated expectations.

Here's the chain, link by link.

The idea is old: language is prediction

The thought that language can be predicted isn't new. Claude Shannon, back in 1948, in the paper that started information theory, measured how predictable an English letter is when you know the ones before it. He showed that text is full of redundancy: once you know the start, the next character is easier to guess than you'd think. There's the seed of everything that comes later. Language is statistically predictable. The only question was what to predict it with.
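You can feel Shannon's point on a scrap of text. What follows is a rough sketch, not his procedure: it measures, in bits, how uncertain the next character is when you guess blind, and how uncertain it is when you know the one character before it. The sample sentence and the function names are made up for illustration, and a sample this small only shows the direction of the effect, not a real figure for English.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy, in bits, of the distribution behind a Counter."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def next_char_entropies(text):
    """Return (blind, informed): uncertainty about the next character
    without context, and given the single character before it."""
    pairs = list(zip(text, text[1:]))
    blind = entropy(Counter(nxt for _, nxt in pairs))

    # Group the "next" characters by the character that preceded them.
    by_prev = {}
    for prev, nxt in pairs:
        by_prev.setdefault(prev, Counter())[nxt] += 1

    # Average the per-context entropy, weighted by how often each context occurs.
    informed = sum(
        sum(c.values()) / len(pairs) * entropy(c) for c in by_prev.values()
    )
    return blind, informed

# Hypothetical sample text, far too small for a serious estimate.
sample = "the quick brown fox jumps over the lazy dog and then the dog sleeps"
blind, informed = next_char_entropies(sample)
print(f"no context: {blind:.2f} bits, one letter of context: {informed:.2f} bits")
```

The second number can never come out higher than the first. Context can only help, and that gap is the redundancy.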

For decades they counted crudely. They took frequencies, how often a word followed the two or three before it, and called them n-grams. It worked on short phrases and fell apart on long ones. The machine had no idea that "cat" and "kitten" sit close together. To it they were two different marks, and that was that.
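The counting really is this crude. Here's a minimal sketch of the simplest case, a bigram model that looks only one word back (the text above describes two or three words back, same idea with longer keys). The tiny corpus and the function names are invented for the example.

```python
from collections import Counter, defaultdict

def train_bigrams(text):
    """Count, for every word, which words followed it and how often."""
    words = text.lower().split()
    follows = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        follows[prev][nxt] += 1
    return follows

def predict_next(follows, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    counts = follows.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Hypothetical toy corpus.
corpus = "the cat sat on the mat . the cat ate . the kitten slept ."
model = train_bigrams(corpus)

print(predict_next(model, "the"))     # cat (seen twice after "the")
print(predict_next(model, "kitten"))  # slept
print(predict_next(model, "puppy"))   # None (never seen, nothing to count)
```

Notice what the table can't do. Everything it learned about "cat" is useless for "kitten," because the two are just different dictionary keys.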

Words get a geometry

The first shift came from Yoshua Bengio and his team in 2003, the paper "A Neural Probabilistic Language Model." Simple idea: represent a word not as a mark but as a set of numbers, coordinates in a space. Then words close in meaning sit near each other, and the machine, seeing one, already knows a little about the neighbor. That's how neural networks came into language models, and how embeddings came in, words as vectors.

In 2013 Tomas Mikolov's team at Google made it vivid with the word2vec model. It turned out meaning sits in that space as geometry. Their textbook example: take the vector for "king," subtract "man," add "woman," and you land almost exactly on "queen." Relations between words became arithmetic on numbers. It's still in the foundation today: inside any current model, text is first turned into vectors like these.
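The arithmetic is easy to try by hand. In the sketch below the vectors are hand-made, three numbers each, with axes I picked to mean roughly royalty, maleness, femaleness. That's an assumption for the demo: real word2vec vectors are learned from text, run to hundreds of dimensions, and their axes don't come with labels.

```python
import math

# Hand-made toy "embeddings" (assumption: not learned, not from word2vec).
# Axes, loosely: royalty, maleness, femaleness.
VECTORS = {
    "king":  (0.9, 0.8, 0.1),
    "queen": (0.9, 0.1, 0.8),
    "man":   (0.1, 0.9, 0.1),
    "woman": (0.1, 0.1, 0.9),
    "apple": (0.0, 0.2, 0.2),
}

def cosine(a, b):
    """Cosine similarity: 1.0 means the two vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def analogy(a, b, c):
    """Nearest word to vec(a) - vec(b) + vec(c), skipping the three inputs."""
    target = tuple(
        x - y + z for x, y, z in zip(VECTORS[a], VECTORS[b], VECTORS[c])
    )
    candidates = (w for w in VECTORS if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(VECTORS[w], target))

print(analogy("king", "man", "woman"))  # queen
```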

Machines learn to read in order

Text is a sequence, and order is everything in it. "Dog bites man" and "man bites dog" are different news from the same words. To hold the order they used recurrent networks, in particular LSTM (Hochreiter and Schmidhuber, 1997, long before the hype). They read text word by word and dragged a memory of what they'd read along with them.

In 2014 they built translation on that. Ilya Sutskever's team (seq2seq) showed how one network could squeeze a sentence in one language into a vector, and another could unfold it into a translation. Around the same time Dzmitry Bahdanau and his coauthors added the key piece, the attention mechanism. As the model produces each word of the translation, it learns to look at the words of the original that matter most right now. Instead of dragging everything along in one squeezed vector, it picks at each step what to look at.

Remember the word "attention." In three years it becomes the name of everything.

The turn: 2017

In 2017 eight people at Google put out a paper with a bold title, "Attention Is All You Need." And they threw out recurrence entirely. They kept the one attention mechanism and built a new architecture out of it, the transformer.

Here's what they did. Instead of reading text word by word in order, the transformer looks at all the words at once and, for each one, works out how related it is to every other. The word "it" in a sentence scans every other word and weighs which one it most likely refers to. That's self-attention.
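Stripped to the bone, self-attention is a few lines. The sketch below is a simplification under loud assumptions: the two-number word vectors are hand-picked so that "it" lands near "dog," and there are no learned weights at all. A real transformer first passes each word through learned query, key and value projections, runs many attention heads side by side, and adds position information. None of that is here, only the core move: score every pair, turn the scores into weights, mix.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with no learned weights.
    Each output row is a weighted average of all the input rows."""
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # How related is this word to every word in the sentence, itself included?
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        mixed = [
            sum(w * vec[i] for w, vec in zip(weights, vectors))
            for i in range(dim)
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Hand-picked toy vectors (assumption: a trained model learns these itself).
words = ["the", "dog", "barked", "it"]
vectors = [(0.1, 0.1), (2.0, 0.1), (0.1, 1.5), (1.0, 0.2)]

outputs, weights = self_attention(vectors)
for word, weight in zip(words, weights[words.index("it")]):
    print(f'"it" -> "{word}": {weight:.2f}')
```

With these numbers the largest weight for "it" falls on "dog." And note that the loop over words has no step that waits on the previous one, which is exactly what the next point is about.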

Sounds like a small detail. It's actually the foundation of the whole industry, for two reasons.

First, quality. The model finally sees links at any distance, between the start and end of a long paragraph, not just between neighbors.

Second, and this one matters more for everything that followed, speed. Once the words no longer have to be read strictly in order, all the work can run in parallel, on thousands of graphics cards at once. The old recurrent networks couldn't do that; they were stuck on their own sequence. The transformer lifted that ceiling. Which meant it could be blown up to sizes nobody had been talking about.

Almost everything you call "AI" today is a transformer. GPT, Claude, Gemini, the open models, translators, code generators. The T in GPT is Transformer. The architecture is almost ten years old, and nothing has pushed it aside yet.

Two branches

From that 2017 paper, two branches grew.

In 2018 Google made BERT, a model that reads text in both directions at once and understands it well, for search and classification. OpenAI went toward generation and made the GPT line, a model that reads left to right and continues text. The same expensive autocomplete from the first piece. GPT-1 in 2018, GPT-2 in 2019, GPT-3 in 2020.

GPT-3 is worth a stop. In the 2020 paper (Brown and coauthors, titled "Language Models are Few-Shot Learners") OpenAI showed a strange thing. Make the model big enough and feed it enough text, and it starts solving tasks nobody trained it on directly. Show it two or three examples of translation in the prompt, and it translates the fourth. Nobody put a "translation skill" in it separately. It surfaced on its own, a byproduct of predicting text at huge scale. That both scared and electrified the industry: looks like scale, by itself, gives you something.
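"Examples in the prompt" is meant literally: it's just text glued together. Here's a sketch of such a prompt. The "English:/French:" layout and the function name are my assumptions for illustration, not the exact format from the paper, and no model is called here. The model's whole job would be to continue the string.

```python
def build_few_shot_prompt(examples, query):
    """Lay out solved examples, then the unsolved one, as plain text.
    A text-continuation model is left to fill in the last line."""
    blocks = [f"English: {en}\nFrench: {fr}" for en, fr in examples]
    blocks.append(f"English: {query}\nFrench:")
    return "\n\n".join(blocks)

# Hypothetical examples; any consistent pattern would do.
examples = [
    ("cheese", "fromage"),
    ("good morning", "bonjour"),
    ("the cat", "le chat"),
]
prompt = build_few_shot_prompt(examples, "the dog")
print(prompt)
```

Nothing in the prompt says "translate." The pattern is the instruction, and the most plausible continuation of the pattern happens to be the translation.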

The law of scale

How much that "something" is, they went and measured. In 2020 Kaplan and coauthors at OpenAI worked out the scaling laws: quality climbs predictably, along a smooth curve, as you turn three knobs, model size, amount of data, and compute. Not in jumps, not by magic. By a formula, a power law.
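The shape of that formula is a power law, and it's worth seeing what that means in practice. The sketch below uses one knob only, model size, and the constants are placeholders I chose for illustration, not the values fitted in the paper. The point is the shape: every tenfold jump in size cuts the loss by the same fixed fraction.

```python
def loss_from_size(n_params, n_c=1e14, alpha=0.08):
    """Power-law loss curve: loss = (n_c / n_params) ** alpha.
    n_c and alpha are illustrative placeholders, not fitted values."""
    return (n_c / n_params) ** alpha

sizes = (1e8, 1e9, 1e10, 1e11)
for n in sizes:
    print(f"{n:.0e} parameters -> loss {loss_from_size(n):.3f}")

# The ratio between neighboring rows is the same at every scale.
step = loss_from_size(1e9) / loss_from_size(1e8)
print(f"each 10x in size multiplies the loss by {step:.3f}")
```

That constant ratio is why the curve could be used for planning: measure a few small models, draw the line, and read off roughly what a much bigger one should give you.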

In 2022 DeepMind sharpened it (the paper on the Chinchilla model, Hoffmann and coauthors). It turned out the industry was chasing model size and underfeeding the model on data. On the same budget, a smaller model trained on more text comes out stronger. That rewrote the recipe and explains why the models of recent years are trained on such monstrous amounts of text.
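The Chinchilla result is often boiled down to two rules of thumb, and with them the recipe fits in a few lines. Both are approximations I'm assuming here, not exact statements from the paper: training costs about 6 floating-point operations per parameter per token, and the compute-optimal mix is about 20 training tokens per parameter.

```python
import math

# Rules of thumb (assumptions, rounded): not exact figures.
FLOPS_PER_PARAM_PER_TOKEN = 6
TOKENS_PER_PARAM = 20

def compute_optimal(flops_budget):
    """Split a training budget into (parameters, tokens).
    Solves budget = 6 * N * D with D = 20 * N."""
    n_params = math.sqrt(
        flops_budget / (FLOPS_PER_PARAM_PER_TOKEN * TOKENS_PER_PARAM)
    )
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens

# A budget of roughly Chinchilla's size: 70 billion parameters, 1.4 trillion tokens.
budget = 6 * 70e9 * 1.4e12
n_params, n_tokens = compute_optimal(budget)
print(f"{n_params:.2e} parameters, {n_tokens:.2e} tokens")
```

Feed it Chinchilla's own budget and it hands back Chinchilla's own proportions, which is all this sketch is good for: a sanity check on the ratio, not a planning tool.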

Worth naming the thing plainly here. The leap of recent years isn't a new idea about mind. It's the engineering of scale on top of the 2017 architecture. The same text-prediction machine, blown up to the limit the hardware will carry.

The last link

One snag remained. Raw GPT-3 continued text but didn't obey. You'd ask it for one thing and get a plausible continuation, often beside the point. For a regular person it was hard to use.

They closed that with training on human ratings. The method is called RLHF (reinforcement learning from human feedback). Its roots are in the Christiano and coauthors paper from 2017, and it was carried to a product at OpenAI in the InstructGPT paper (Ouyang and coauthors, 2022). The scheme is simple at heart: the model puts out several candidate answers, live people mark which is better, and on those ratings the model is tuned to put out more of what people find useful and fitting. It doesn't add intelligence. It puts manners on the prediction machine: follow the instruction, hold the format, don't be rude.
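The "people mark which is better" step turns into math in a plain way. A separate scoring model gives each answer a number, and it's penalized whenever the answer people preferred doesn't score higher than the one they rejected. Below is a sketch of that pairwise penalty only. The scores are made-up numbers, the function name is mine, and the later stage, where the chat model itself is tuned with reinforcement learning to chase high scores, is left out entirely.

```python
import math

def preference_loss(score_chosen, score_rejected):
    """Pairwise penalty: -log(sigmoid(chosen - rejected)).
    Small when the human-preferred answer scores clearly higher."""
    diff = score_chosen - score_rejected
    # -log(sigmoid(diff)) rewritten as log(1 + exp(-diff)).
    return math.log1p(math.exp(-diff))

# Hypothetical scores from a scoring model for two candidate answers.
print(f"agrees with the rater:    {preference_loss(2.0, 0.0):.3f}")
print(f"can't tell them apart:    {preference_loss(0.0, 0.0):.3f}")
print(f"disagrees with the rater: {preference_loss(0.0, 2.0):.3f}")
```

Training pushes that penalty down across many thousands of rated pairs. What comes out is a number that stands in for "a person would like this answer better," which is a statement about taste, not about truth.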

That last link, the training in obedience, is what turned a research curiosity into a product. In November 2022 OpenAI wrapped a model like that in a chat window and shipped it as ChatGPT. The architecture inside was five years old. The only new thing was an easy door in.

Lay the chain out

Put it all together. The old thought that language is predictable (Shannon, 1948). Words as vectors (Bengio 2003, word2vec 2013). The attention mechanism for translation (2014). The transformer, which kept attention alone and gave parallelism (2017). Scale as a predictable lever (GPT-3 2020, the scaling laws, Chinchilla 2022). Training in obedience (InstructGPT 2022). At the end of the chain, the chat in your browser.

Not one link needs a mind sitting inside. Each is engineering: get text into numbers, compute the links in parallel, blow up the scale, comb the output to suit people. Knowing the chain is worth more than trivia. When you understand what the machine is made of, you see where its edges are. And you stop overrating it, and you stop fearing the wrong thing.
