A number of people have asked me for this, so I’m posting it here:
- Bob Carpenter. 2023. Transformer decoding in fifty lines of pseudocode.
[Edit: revised original draft (twice) to fix log density context variable and width of multi-head attention values.]
This is a short note that provides complete and relatively simple pseudocode for the neural network architecture behind the current crop of large language models (LLMs), the generative pretrained transformers (GPT). These models are built from layers that each apply (multi-head) attention followed by a feedforward neural network, and the layers are stacked to form a deep architecture.
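To make "attention followed by a feedforward network" concrete, here is a minimal pure-Python sketch of one such layer with single-head causal attention. It is not the pseudocode from the note. The function names, the parameter dictionary, and the use of ReLU are my own assumptions for illustration (GPT-2 uses GELU), and layer normalization is left out to keep it short.

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, x):
    # W is a list of rows; returns the matrix-vector product W x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def causal_attention(xs, Wq, Wk, Wv):
    # Single-head self-attention: position n attends only to positions 0..n.
    qs = [matvec(Wq, x) for x in xs]
    ks = [matvec(Wk, x) for x in xs]
    vs = [matvec(Wv, x) for x in xs]
    d = len(qs[0])
    out = []
    for n in range(len(xs)):
        scores = [sum(q * k for q, k in zip(qs[n], ks[m])) / math.sqrt(d)
                  for m in range(n + 1)]
        weights = softmax(scores)
        out.append([sum(w * vs[m][j] for m, w in enumerate(weights))
                    for j in range(len(vs[0]))])
    return out

def feedforward(x, W1, b1, W2, b2):
    # Two-layer network; ReLU is an assumption here (GPT-2 uses GELU).
    hidden = [max(0.0, h + b) for h, b in zip(matvec(W1, x), b1)]
    return [o + b for o, b in zip(matvec(W2, hidden), b2)]

def block(xs, params):
    # Attention, then feedforward, each added back to its input (residual).
    # Layer normalization is omitted in this sketch.
    attn = causal_attention(xs, params["Wq"], params["Wk"], params["Wv"])
    xs = [[a + b for a, b in zip(x, u)] for x, u in zip(xs, attn)]
    ff = [feedforward(x, params["W1"], params["b1"], params["W2"], params["b2"])
          for x in xs]
    return [[a + b for a, b in zip(x, f)] for x, f in zip(xs, ff)]
```

A deep model would apply `block` repeatedly with different parameters per layer. The causal loop over `range(n + 1)` is what keeps a position from seeing later tokens.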
I simplified the pseudocode compared to things like Karpathy’s nanoGPT repository in Python (great, but it’s tensorized and batched PyTorch code for GPU efficiency) or Phuong and Hutter’s pseudocode, which is more general and covers encoding and multiple different architectures. I also start from scratch with the basic notions of tokenization and language modeling.
I include the pseudocode for evaluating the objective function for training and the pseudocode for generating responses. The initial presentation uses single-head attention to make the attention stage clearer, followed by a note with pseudocode that generalizes it to multi-head attention.
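For a rough idea of what those two pieces look like, here is a small Python sketch of a log-density objective and a sampling loop. It is an illustration under assumptions, not the note's pseudocode: `next_token_logits` stands for any function mapping a token context to one unnormalized score per vocabulary item, `toy_logits` is a made-up stand-in for a transformer, and the first token is treated as given context rather than scored.

```python
import math
import random

def log_softmax(logits):
    # Normalize unnormalized scores into log probabilities.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def sequence_log_density(tokens, next_token_logits):
    # Sum of log p(tokens[n] | tokens[:n]) for n >= 1.
    # Training would maximize this over the model parameters.
    total = 0.0
    for n in range(1, len(tokens)):
        log_probs = log_softmax(next_token_logits(tokens[:n]))
        total += log_probs[tokens[n]]
    return total

def generate(prompt, next_token_logits, max_new_tokens, rng):
    # Sample one token at a time, feeding each sample back in as context.
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        log_probs = log_softmax(next_token_logits(tokens))
        probs = [math.exp(lp) for lp in log_probs]
        next_token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        tokens.append(next_token)
    return tokens

def toy_logits(context):
    # Hypothetical model over a 3-token vocabulary:
    # it favors (last token + 1) mod 3 and ignores everything else.
    vocab_size = 3
    favored = (context[-1] + 1) % vocab_size
    return [2.0 if v == favored else 0.0 for v in range(vocab_size)]

if __name__ == "__main__":
    print(sequence_log_density([0, 1, 2], toy_logits))
    print(generate([0], toy_logits, 5, random.Random(0)))
```

In a real model, `next_token_logits` would run the stacked attention and feedforward layers over the context and return the scores for the final position.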
I also include references to other basic presentations, including Daniel Lee’s version coded in Stan.
If this is confusing or you think I got a detail wrong, please let me know—I want to make this as clear and correct (w.r.t. GPT-2) as possible.