Computer Science

How ChatGPT Actually Works: LLMs Explained

Demystify transformers, training, fine-tuning, and hallucinations — how large language models really think.

Apr 22, 2026 · 7 min listen · 5 chapters

What you'll learn

The transformer architecture in plain language
How pre-training and fine-tuning shape model behavior
What "emergent abilities" actually means
Limitations: hallucination, reasoning gaps, context windows

1. What a large language model is

Large language model basics

A large language model predicts the next token from the tokens before it.

A token is a chunk of text. In English, a token averages roughly 3 to 4 characters, though this varies a lot. The word “unbelievable” may split into multiple tokens. That matters because the model does not see text the way humans do.
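To make the splitting concrete, here is a minimal sketch of a greedy longest-match tokenizer. The vocabulary and the function name toy_tokenize are hypothetical and chosen only for illustration. Real tokenizers learn their vocabularies from data and may split the same word differently.

```python
# Toy greedy longest-match tokenizer.
# TOY_VOCAB is a made-up vocabulary, not one used by any real model.
TOY_VOCAB = {"un", "believ", "able", "the", "cat"}


def toy_tokenize(word, vocab=TOY_VOCAB):
    """Split a word into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest candidate first, shrinking until something matches.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No vocabulary piece matches: fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces


print(toy_tokenize("unbelievable"))  # ['un', 'believ', 'able']
```

The model only ever sees the pieces, never the original spelling as a whole.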

What the model actually stores

The model stores parameters, often billions of them. GPT-3, released by OpenAI in 2020, had 175 billion parameters. Those parameters are learned weights inside a neural network. They are not a fact table.
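A quick calculation gives a feel for that scale. This sketch assumes each parameter is stored as a 16-bit number (2 bytes). That storage format is an assumption for illustration, not something stated above.

```python
PARAMS = 175_000_000_000  # GPT-3 parameter count from the text
BYTES_PER_PARAM = 2       # assumption: 16-bit weights

# Size of the raw weights alone, in decimal gigabytes.
size_gb = PARAMS * BYTES_PER_PARAM / 1e9
print(size_gb)  # 350.0
```

Under that assumption, the weights alone take about 350 GB. All of it is numbers, with no rows or columns of facts to look up.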

Why next-token prediction is powerful

If a model gets very good at predicting the next token across huge amounts of text, it must learn patterns in:

syntax
semantics
style
common facts
code structure
question answering patterns

Analogy

Think of it like a musician who has heard millions of songs. They are not reciting a database of melodies. They have internalized patterns so deeply that they can improvise in many styles.

[Bar chart: Approximate scale of notable language models]

2. The transformer and attention

Transformer architecture in plain language

A transformer is a neural network built to process tokens in parallel and use attention to decide which tokens matter most.

The core pieces

Token embeddings: turn tokens into vectors
Positional information: tells the model token order
Self-attention: lets each token look at other tokens
Feed-forward layers: transform the attended information
Many stacked layers: build higher-level patterns

Why attention matters

Attention gives the model a way to connect distant parts of text. That is much better than relying only on nearby words.
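The idea can be sketched in a few lines of plain Python. This is a simplified illustration, not a full transformer layer. It assumes each token's vector acts as its own query, key, and value, and it uses made-up 2-dimensional embeddings. Real models add learned projections, multiple heads, and usually a mask so tokens cannot look ahead.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention with Q = K = V = vectors (a simplification)."""
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Score this token against every token, including itself.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        # Each output is a weighted mix of all token vectors.
        mixed = [
            sum(w * value[i] for w, value in zip(weights, vectors))
            for i in range(dim)
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Made-up embeddings: tokens 0 and 2 point in similar directions.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]]
outputs, weights = self_attention(embeddings)
for row in weights:
    print([round(w, 2) for w in row])
```

In the first row, token 0 gives more weight to token 2 than to token 1, because their vectors are more alike. Distance in the sequence plays no role in the score.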

Analogy

Imagine a meeting where every person can glance at every other person’s notes before speaking. Self-attention is that shared glance. It helps the model decide which earlier words deserve focus right now.

[Illustration: A transformer model pipeline showing tokens entering embeddings, then self-attention, then feed-forward layers, then output probabilities]

3. Pre-training and fine-tuning

Pre-training vs fine-tuning

Pre-training teaches broad language and world patterns.

Fine-tuning teaches behavior: how to answer, what tone to use, and which tasks to prioritize.

Common fine-tuning methods

Supervised fine-tuning: train on example prompts and ideal answers
Instruction tuning: train on many instruction-following examples
Reinforcement learning from human feedback: optimize toward human preferences
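One way to turn human preferences into a training signal is a pairwise loss: given two answers to the same prompt, a reward model is penalized when it scores the human-preferred answer lower. The sketch below is a minimal illustration of that loss. The function name and the reward values are hypothetical, and a real pipeline involves much more than this one formula.

```python
import math


def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss: -log(sigmoid(chosen - rejected)).

    Small when the preferred answer scores higher, large when it scores lower.
    """
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))


# Made-up reward scores for two answers to the same prompt.
agrees_with_human = preference_loss(2.0, 0.0)
disagrees_with_human = preference_loss(0.0, 2.0)
print(agrees_with_human < disagrees_with_human)  # True
```

Minimizing this loss pushes the reward model to rank answers the way people did.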

Real example

OpenAI’s InstructGPT paper, published in 2022, showed that human labelers preferred outputs from a 1.3-billion-parameter model tuned with human feedback over outputs from the 175-billion-parameter GPT-3 base model.

Analogy

Pre-training is like studying the whole library. Fine-tuning is like taking a specialized seminar on how to answer questions in the style your teacher wants.

python

import math

# Tiny next-token example with made-up probabilities
probs = {"Paris": 0.72, "London": 0.12, "Berlin": 0.08, "Rome": 0.08}

# Choose the most likely token
prediction = max(probs, key=probs.get)
print(prediction)

# Entropy shows uncertainty; lower means more confident
entropy = -sum(p * math.log2(p) for p in probs.values())
print(round(entropy, 3))

4. Emergent abilities and context windows

Emergent abilities

Emergent abilities are capabilities that seem to appear suddenly on an evaluation as model scale grows, even when the underlying improvement is gradual.

That does not mean the model has consciousness or hidden intent. It means the measurement crossed a threshold.
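A small calculation shows how a smooth improvement can look like a sudden jump. This sketch assumes an answer counts as correct only if every one of its 10 tokens is right, and that each token is right independently with the same probability. Both assumptions are simplifications chosen for illustration.

```python
# Assumption: all-or-nothing scoring on a 10-token answer.
ANSWER_LENGTH = 10


def exact_match_rate(per_token_accuracy, length=ANSWER_LENGTH):
    """Chance that every token in the answer is correct."""
    return per_token_accuracy ** length


# Per-token accuracy rises steadily; the exact-match score does not.
for accuracy in [0.5, 0.7, 0.9, 0.95, 0.99]:
    print(accuracy, round(exact_match_rate(accuracy), 3))
```

The exact-match score stays near zero for the lower accuracies and then climbs steeply. The model improved steadily the whole time. Only the strict metric makes it look abrupt.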

Context window

The context window is the maximum amount of text, measured in tokens, that the model can use at one time.

If the needed fact is outside the window, the model cannot directly consult it.

Why this matters

Long documents, large codebases, and multi-turn chats can exceed the context window. Then the model may forget early details or contradict itself.
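Here is a minimal sketch of why early details get lost. It assumes the simplest possible strategy, dropping the oldest tokens once the window is full, and it treats whole words as tokens. Real systems may handle overflow differently.

```python
def fit_to_window(tokens, window_size):
    """Keep only the most recent tokens that fit in the window."""
    if window_size <= 0:
        return []
    return tokens[-window_size:]


# A toy conversation, one word per token for readability.
conversation = ["my", "name", "is", "Ada", ".", "what", "is", "my", "name", "?"]
visible = fit_to_window(conversation, window_size=6)

print(visible)           # ['.', 'what', 'is', 'my', 'name', '?']
print("Ada" in visible)  # False
```

The question is still visible, but the answer fell off the front of the window, so the model has nothing to consult.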

Analogy

A context window is like the size of the whiteboard in front of a student. A bigger whiteboard lets them keep more of the problem visible at once.

5. Hallucinations, reasoning gaps, and safe use

Hallucination

A hallucination is a generated statement that is fluent but false, ungrounded, or misleading.

Why hallucinations happen

The model is trained to predict likely text, not to inspect reality directly.

Common failure modes

Invented citations
Confident but wrong factual claims
Broken multi-step reasoning
Forgetting earlier details beyond the context window

Best practice

Use the model for drafting, brainstorming, and pattern recognition. Verify important claims with trusted sources, execution, or calculation.
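Verification by calculation can be very simple. The sketch below recomputes a multiplication instead of trusting the number in a draft. The function name and the claimed values are hypothetical examples.

```python
def check_product_claim(a, b, claimed):
    """Recompute a multiplication instead of trusting the stated result."""
    actual = a * b
    return {"claimed": claimed, "actual": actual, "ok": claimed == actual}


# Two made-up claims from a draft: one right, one off by a digit.
print(check_product_claim(17, 23, 391))  # ok is True
print(check_product_claim(17, 23, 381))  # ok is False
```

The wrong claim looks just as plausible as the right one on the page. Only the recomputation tells them apart.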

Analogy

A language model is like a brilliant improv actor. It can stay in character and keep the scene moving, but it may invent details that were never in the script.

Practical checklist for using ChatGPT well

Ask for the reasoning or assumptions when the answer matters
Verify names, dates, numbers, and citations
Use retrieval or search for fresh facts
Test code instead of trusting it blindly
Keep the context focused when accuracy matters

Bottom line

ChatGPT is not a database and not a person. It is a transformer trained on next-token prediction, then tuned to follow instructions. Its strength is fluent pattern generation. Its weakness is that fluency can outrun truth.

Transcript

Welcome to Slate. Today we're looking at How ChatGPT Actually Works: LLMs Explained. We'll cover the transformer architecture in plain language, how pre-training and fine-tuning shape model behavior, what "emergent abilities" actually means, and the main limitations: hallucination, reasoning gaps, and context windows. Let's get into it.

A large language model, or L-L-M, is a system trained to predict the next token. A token is usually a word piece, not always a whole word. If you feed it “The capital of France is,” it learns that “Paris” is a high-probability next step. That simple task sounds narrow. But when a model trains on trillions of tokens, it absorbs patterns in grammar, facts, style, code, and even some reasoning shortcuts. Think of it like a giant autocomplete that has read a huge fraction of the internet, books, and code. The key idea is this: it does not store a neat database of facts. It stores numbers called parameters. Those numbers shape how likely each next token is in each context. During training, the model keeps adjusting those numbers to reduce prediction error. The result is a compressed statistical map of language. That is why ChatGPT can write a poem, explain a bug, or answer a history question in the same system. It is using one engine: next-token prediction. The visuals here show the pipeline from text to tokens to probabilities. Notice the difference between memorizing a sentence and learning the patterns that generate many sentences. That distinction is the heart of the whole subject.

The transformer is the architecture that made modern language models practical at scale. Before transformers, many language models read text step by step with recurrent networks. That was like reading a long sentence through a narrow straw. Transformers use attention instead. Attention lets the model compare each token with other tokens in the context and decide what matters most right now. If the sentence says, “The trophy did not fit in the suitcase because it was too big,” attention helps the model connect “it” to “the trophy,” not “the suitcase.” The same mechanism helps with code, where a variable name introduced earlier can matter many lines later. Here’s the big picture. Tokens are turned into vectors, which are lists of numbers. The model adds positional information so order is not lost. Then attention scores how much each token should influence every other token. Multiple attention heads let the model look for different patterns at once. One head might track subject-verb agreement. Another might track quotation marks. Another might track code indentation. After attention, feed-forward layers transform the information further. Stacking many layers lets the model build from simple patterns to more abstract ones. The diagram shows this flow. It is less like a single rule engine and more like a team of specialists, each passing notes to the next.

Pre-training is where the model learns broad language patterns from massive text data. Fine-tuning is where we shape that general model toward a job. The difference is like training a medical student first on all of biology, then on emergency medicine. Both stages matter, but they do different work. During pre-training, the objective is usually simple: predict the next token. That objective is easy to state and incredibly hard to master at scale. The model sees examples from books, websites, articles, code, and other sources. It learns statistical structure, not a hand-written rulebook. Then comes fine-tuning. Supervised fine-tuning uses example prompt-response pairs so the model learns the style and format we want. In instruction tuning, the model learns to follow directions more reliably. A further step called reinforcement learning from human feedback, or R-L-H-F, uses human preferences to make responses more helpful and less harmful. OpenAI described this approach in 2022 for InstructGPT. This is why ChatGPT feels more conversational than a raw base model. The base model may know a lot, but the tuned model is better at answering as an assistant. The flowchart shows the stages. Notice that fine-tuning does not erase pre-training. It steers it. The core knowledge still comes from the huge first phase.

People often say models develop emergent abilities. That phrase can be misleading if you picture a sudden magical spark. What usually happens is more gradual. As models get larger and training data gets richer, some capabilities become visible only after crossing a threshold in evaluation. For example, a model may appear unable to do multi-step arithmetic at one scale, then perform much better at a larger scale. That can look abrupt on a chart, even if the underlying improvement is smooth. So emergent ability often means a capability that becomes measurable only past a certain scale, not something supernatural. Context windows are another limit that matters a lot. The context window is how much text the model can consider at once. When GPT-4 launched in 2023, it came with an 8,192-token context window, and a separate variant offered 32,768 tokens. Newer systems have gone much larger, but the basic issue remains: if information falls outside the window, the model cannot directly attend to it. That is like trying to solve a puzzle while only seeing part of the table. You can still do a lot, but not everything. The chart shows how performance can jump as scale increases. The important lesson is that capability depends on both model size and the evaluation you choose.

A hallucination is a confident answer that is wrong or unsupported. It happens because the model is optimizing for plausible next tokens, not truth in the human sense. If the training data contains many similar patterns, the model may generate a response that sounds right even when it is not. That is why a model can cite a fake paper, invent a date, or mix up two people with similar names. Reasoning gaps show up when a task needs exact step-by-step consistency. The model may do well on one step and slip on the next. In benchmarks, chain-of-thought style prompting can help some problems, but it does not guarantee correctness. The safest way to use these systems is to treat them like fast, fluent assistants that need verification. For factual work, ask for sources and check them. For code, run the code. For math, verify the derivation. For high-stakes decisions, use the model as support, not authority. The sequence diagram shows a better workflow: the model drafts, the human checks, and external tools verify. That is the practical lesson. ChatGPT is powerful because it learned patterns at scale. It is limited because pattern prediction is not the same thing as grounded truth.
