«I made this video where I walk through GPT-2’s architecture, show how to count its learnable parameters, and compare it with LLaMA 3.1. Check it out and let me know what you think!»
What is GPT-2 and why should I care?
GPT-2, introduced by OpenAI in 2019, was groundbreaking in its ability to capture long-range context in text and produce coherent, fluent passages. Sure, 2019 feels like eons ago in LLM Land, but today’s state-of-the-art models aren’t fundamentally that different from GPT-2. It is complex enough to be interesting, but small enough to be analyzed deeply. If you grasp how GPT-2 works, you’re well on your way to understanding how its successors function under the hood. Here are some benefits for AI students, enthusiasts, and even AI application developers:
You can build an intuition for how large language models “think” by learning GPT-2’s fundamental components: token embeddings, position embeddings, multi-head attention, and the feed-forward networks that sit alongside the attention blocks. (A rough PyTorch skeleton of these pieces follows this list.)
You can learn about and evaluate new models more easily, because they are often scaled-up versions of the same fundamental architecture with modifications in a few areas. You can also understand innovations in this space more easily. For example, if you hear about DeepSeek creating a “distilled” version of a Llama model, a quick read on “knowledge distillation” in LLMs will give you an accurate, intuitive grasp of what that means. Without foundational knowledge, you might be left with a superficial and potentially incorrect understanding of distillation.
You can play a confident and useful part in discussions around LLMs (which seem to suck up all the AI oxygen these days, but that’s a topic for another day :)). As you deepen your knowledge and expand it to adjacent areas, you’ll also gain the ability to critically analyze broader AI trends and advancements.
If you’re an LLM application developer, you can move beyond black-box usage and engage with these systems more effectively. A foundational understanding of LLMs can help developers estimate infrastructure needs, predict performance, fine-tune models for specific tasks, and debug when results go awry. More adventurous individuals can even experiment with creative customizations of an LLM!
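To make that first benefit concrete, here is a rough PyTorch skeleton of GPT-2 small’s component stack. The hyperparameters are GPT-2 small’s published values, but the class itself and the use of nn.TransformerEncoderLayer as a stand-in for GPT-2’s own attention + feed-forward block are my simplifications; treat it as an illustration, not a faithful reimplementation.

```python
import torch
import torch.nn as nn

class TinyGPT2Skeleton(nn.Module):
    """Illustrative skeleton of GPT-2 small's component stack (not a full reimplementation)."""
    def __init__(self, vocab_size=50257, n_ctx=1024, d_model=768, n_layers=12, n_heads=12):
        super().__init__()
        self.wte = nn.Embedding(vocab_size, d_model)   # token embeddings
        self.wpe = nn.Embedding(n_ctx, d_model)        # learned position embeddings
        self.blocks = nn.ModuleList([                  # stand-ins for GPT-2's attention + MLP blocks
            nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model,
                                       activation="gelu", batch_first=True, norm_first=True)
            for _ in range(n_layers)
        ])
        self.ln_f = nn.LayerNorm(d_model)              # final layer norm
        self.head = nn.Linear(d_model, vocab_size, bias=False)  # projects back to vocabulary logits

    def forward(self, token_ids):                      # token_ids: (batch, sequence_length)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.wte(token_ids) + self.wpe(positions)  # add token and position embeddings
        causal_mask = nn.Transformer.generate_square_subsequent_mask(token_ids.size(1)).to(x.device)
        for block in self.blocks:
            x = block(x, src_mask=causal_mask)         # masked self-attention + feed-forward
        return self.head(self.ln_f(x))                 # logits for the next token at each position

# e.g. TinyGPT2Skeleton()(torch.randint(0, 50257, (1, 16))).shape -> torch.Size([1, 16, 50257])
```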
How do I dive deeper?
Unlike open-weight LLMs such as Llama or DeepSeek, which release their weights but not their full training code, GPT-2’s code is readily available in multiple forms — using PyTorch, TensorFlow, etc. The code is also very well documented, and there are tons of resources online covering it in many different ways (I list some of my favorites later in this article). This makes GPT-2 an ideal entry point to study the code and experiment with the model’s behavior. Here are some things you can try with GPT-2:
Write all the code from scratch. Train your own model using datasets from earlier research. Then fine-tune it with different types of data and observe the results.
See if you can explain GPT-2’s architecture to someone else. Can you count the parameters for the small version? Try to compare it with a later model — say, Llama 3.1 8B (a quick counting script follows this list). You can even make a video sharing your learnings :).
Change or replace different components and see how training and performance change. For example: replace GELU with ReLU or SwiGLU in the feed-forward layers (see the sketch after this list); try rotary or fixed (e.g., sinusoidal) position embeddings instead of learned embeddings; reduce or increase the number of attention heads; move LayerNorm relative to the residual connections (pre-LN vs. post-LN) and see how this affects training stability.
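To give a flavor of the parameter-counting exercise above, here is a short script that tallies GPT-2 small’s learnable parameters from its published hyperparameters (50,257-token vocabulary, 1,024-token context, 12 layers, d_model of 768, and an output head tied to the token embedding). It lands on roughly 124M parameters, the commonly quoted size for GPT-2 small.

```python
# Parameter count for GPT-2 small, from its published hyperparameters.
vocab_size, n_ctx, d_model, n_layers, d_ff = 50257, 1024, 768, 12, 3072

tok_emb = vocab_size * d_model                      # token embedding (weights tied with the output head)
pos_emb = n_ctx * d_model                           # learned position embeddings

attn = (d_model * 3 * d_model + 3 * d_model) \
     + (d_model * d_model + d_model)                # QKV projection + output projection, with biases
mlp = (d_model * d_ff + d_ff) \
    + (d_ff * d_model + d_model)                    # up-projection + down-projection, with biases
layer_norms = 2 * (2 * d_model)                     # two LayerNorms per block (scale + shift each)
per_block = attn + mlp + layer_norms

total = tok_emb + pos_emb + n_layers * per_block + 2 * d_model  # plus the final LayerNorm
print(f"{total:,}")                                 # 124,439,808 (~124M)
```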
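And for the last bullet, a minimal sketch of one such swap: a GPT-2-style feed-forward block whose activation is a constructor argument, so you can train otherwise-identical models with GELU versus ReLU and compare. The class name and defaults are my own choices for illustration, not anything from the GPT-2 codebase.

```python
import torch.nn as nn

class FeedForward(nn.Module):
    """GPT-2-style MLP: expand to 4 * d_model, apply a nonlinearity, project back down."""
    def __init__(self, d_model=768, activation=nn.GELU):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            activation(),                            # swap in nn.ReLU here to run the comparison
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        return self.net(x)

# FeedForward() uses GELU (GPT-2's choice); FeedForward(activation=nn.ReLU) is the variant to test.
```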
What are the prerequisites?
Getting started with GPT-2’s architecture does require some background knowledge/skills, but it’s within reach if you have a bit of programming and deep learning experience. Here are a few recommended prerequisites to make your learning journey smoother:
Math: You don’t need to be a math guru, but understanding linear algebra basics (especially matrix multiplication) is quite helpful. Transformers rely on matrix operations to compute attention and the other neural network transformations. If you know how these operations work, even at a high level, you’ll find it easier to follow the architecture’s logic (there’s a tiny worked example after this list). Of course, an intuitive understanding of introductory calculus, probability, and statistics is essential if you want a deeper understanding of LLMs and neural networks in general.
Programming and ML frameworks: You should be comfortable programming in Python and ideally familiar with a deep learning library such as PyTorch or TensorFlow. This will allow you to read example code, use GPT-2 via libraries, or even implement it yourself for learning purposes.
Basic neural network knowledge: It helps to understand the basics of neural networks – things like layers, weights, activations, backpropagation, gradient descent, and how training works. GPT-2 is essentially a very deep neural network with many repeated layers, so knowing how a simpler network learns will make it easier to grasp GPT-2.
NLP fundamentals: Familiarity with natural language processing concepts like tokenization (breaking text into words/subwords) and word embeddings (numeric representations of words) is useful. LLMs process text as sequences of tokens and convert them to embeddings, so these concepts are part of their foundation (see the tokenizer example after this list).
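Here’s the worked example promised under the math prerequisite: scaled dot-product attention, the heart of GPT-2’s attention heads, written as a few matrix operations in plain PyTorch. The shapes below are illustrative, not GPT-2’s exact dimensions.

```python
import math
import torch

def attention(Q, K, V):
    # Q, K, V: (sequence_length, head_dim). Every step below is a matrix operation.
    scores = Q @ K.T / math.sqrt(K.size(-1))   # how similar each query is to every key
    weights = torch.softmax(scores, dim=-1)    # each row sums to 1: how much a token attends to the others
    return weights @ V                         # weighted mix of the value vectors

Q, K, V = (torch.randn(5, 64) for _ in range(3))
print(attention(Q, K, V).shape)                # torch.Size([5, 64])
```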
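And to make tokenization concrete, here is what GPT-2’s byte-pair-encoding tokenizer does to a sentence, assuming you have the tiktoken package installed. The model then looks up an embedding vector for each of these integer IDs.

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")            # GPT-2's byte-pair-encoding tokenizer
ids = enc.encode("GPT-2 is small enough to study deeply.")
print(ids)                                     # a list of integer token IDs
print([enc.decode([i]) for i in ids])          # the text piece each ID stands for
# Each ID indexes one row of the model's 50,257 x 768 token-embedding matrix.
```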
Some resources that I found to be quite useful
ARENA was created with mechanistic interpretability research in mind, but it contains the version of GPT-2 code that I liked the best. Specifically, here’s the exercises colab version, and here’s the solutions colab version.
Jay Alammar’s The Illustrated GPT-2 post was the first one I read a couple of years ago, and it still remains the best written explanation of GPT-2 that I know of.
3Blue1Brown has a fantastic series of videos on the basics of neural networks and a spectacularly intuitive explanation of how LLMs work (attention mechanism in particular). In fact, check out his other videos on linear algebra and calculus if you want to refresh and solidify your understanding of those topics.
Here’s a great post on how to count the trainable parameters in GPT-2.
You can try this 5-course specialization in deep learning or find something similar.
What’s next?
OK, enough about GPT-2. You can code it in your sleep. You understand the math, the concepts, the architectural trade-offs, etc. Here are some things you can do to expand your knowledge:
Start looking at later open source models - say Llama, DeepSeek, or Qwen. How are they different? What innovations did they add? Research and learn about them.
Learn about training — a lot of progress in LLMs over the past couple of years has actually been in this area. The terminology has evolved over the years and can be confusing to newcomers (doesn’t it make you wonder: if there’s a pre-training phase and a post-training phase, why isn’t there just a “training” phase?), but this is an area worth exploring.
In fact, all the action in LLM Land has now shifted to reasoning models — some have even started calling them LRMs (Large Reasoning Models)! The idea is to use reinforcement learning to improve the reasoning capabilities of these models (there are debates raging about whether it is true reasoning, and about how one even defines reasoning).
Look at the open-source models from Ai2 - they have the phenomenal mission of creating fully open-source, reproducible LLMs. Here’s a description of OLMo 2, their latest model, in Ai2’s own words - “OLMo 2 is a family of fully-open language models, developed start-to-finish with open and accessible training data, open-source training code, reproducible training recipes, transparent evaluations, intermediate checkpoints, and more.” Maybe they should really be the ones using the name OpenAI :). You could even make a case for starting your learning journey with OLMo instead of GPT-2!
Contribute to open source projects or launch your own - the world’s your oyster!