LLM Basic Notes
Introduction to Generative AI and Large Language Models
This comprehensive lecture covers fundamental concepts of Large Language Models (LLMs), their architecture, training methods, tooling frameworks, and practical applications through AI agents. The material explores both theoretical foundations and practical implementations.
1. Generative Models: Autoregressive vs Non-Autoregressive
Autoregressive (AR) Models
Autoregressive models generate output one token at a time, with each token conditioned on all previously generated tokens. This sequential approach creates a dependency chain where the model predicts the next element based on historical context.[1][2]
Key Characteristics:
- Sequential Generation: Must wait for each token before generating the next
- High Quality: Strong dependency modeling produces coherent, contextually appropriate outputs
- Slow Speed: Sequential nature limits parallelization
- Examples: GPT-3, GPT-4, and most modern LLMs[3][1]
The mathematical formulation represents the joint probability as a product of conditional probabilities, where each variable depends on all preceding variables.[3]
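Concretely, for a sequence of tokens $x_1, \dots, x_T$ this factorization is

$$
P(x_1, \dots, x_T) = \prod_{t=1}^{T} P\big(x_t \mid x_1, \dots, x_{t-1}\big)
$$

and generation samples each $x_t$ from its conditional distribution in turn, which is why the process cannot be parallelized across positions.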
Non-Autoregressive (NAR) Models
Non-autoregressive models generate all tokens simultaneously in parallel, without conditioning on previous outputs.[1]
Key Characteristics:
- Parallel Generation: All outputs produced at once
- High Speed: Significantly faster inference due to parallelization
- Lower Quality: Lack of dependency modeling can reduce coherence
- Applications: Real-time translation, low-latency tasks[1]
Comparison Table
| Feature | AR Model | NAR Model |
|---|---|---|
| Parallelism | Low | High |
| Speed | Slow | Fast |
| Quality | High | Lower |
| Use Case | High-quality text generation | Real-time applications |
Hybrid Approaches
Modern architectures combine both methods to balance speed and quality:[4][5]
- Encoder-Decoder Architecture: AR model generates intermediate representations, NAR model produces final output
- Iterative Refinement: NAR model generates multiple drafts that are progressively refined
- Speculative Decoding: NAR model rapidly generates candidate tokens, AR model verifies them in parallel
2. Speculative Decoding: Accelerating LLM Inference
Speculative decoding is an optimization technique that achieves 2-3× speedup without sacrificing output quality.[6][7][4]
How It Works
The technique employs two models working in tandem:[5][8][4]
- Draft Model: A smaller, faster model proposes multiple candidate tokens
- Target Model: The larger, high-quality model verifies candidates in parallel
Process Flow:
- Draft model generates K speculative tokens (fast)
- Target model verifies all tokens simultaneously (parallel verification)
- Accepted tokens become output; rejected tokens are discarded
- Target model generates one additional token
- Process repeats with new context
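A minimal sketch of one such step in TypeScript, using a simplified greedy acceptance rule (production systems use a probabilistic accept/reject test so the output distribution exactly matches the target model). The `DraftModel` and `TargetModel` interfaces are hypothetical and for illustration only, not a real library API:

```ts
// Hypothetical interfaces for illustration only; not a real library API.
interface DraftModel {
  greedyNext(prefix: number[]): number // one cheap forward pass -> next token id
}
interface TargetModel {
  // One forward pass over the whole sequence; entry i is the model's greedy
  // prediction for the token that should follow seq[0..i].
  greedyNextAtEachPosition(seq: number[]): number[]
}

// One speculative-decoding step: propose k tokens, verify them in parallel,
// keep the longest accepted run plus one "free" token from the target model.
function speculativeStep(
  draft: DraftModel,
  target: TargetModel,
  prefix: number[],
  k: number,
): number[] {
  // 1. Draft model proposes k candidate tokens sequentially (cheap).
  const candidates: number[] = []
  const context = [...prefix]
  for (let i = 0; i < k; i++) {
    const token = draft.greedyNext(context)
    candidates.push(token)
    context.push(token)
  }

  // 2. Target model scores prefix + candidates in a single parallel pass.
  const targetNext = target.greedyNextAtEachPosition([...prefix, ...candidates])

  // 3. Accept candidates as long as they match what the target would emit.
  const accepted: number[] = []
  for (let i = 0; i < k; i++) {
    const expected = targetNext[prefix.length - 1 + i]
    if (candidates[i] !== expected) break
    accepted.push(candidates[i])
  }

  // 4. The target's own prediction after the accepted run comes from the same
  //    pass, so every step emits at least one token.
  const bonus = targetNext[prefix.length - 1 + accepted.length]
  return [...accepted, bonus]
}
```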
Performance Gains
Real-world implementations show impressive results:[7][6]
- Llama3 8B: 2× speedup
- Granite 20B code models: 3× speedup
- Guaranteed equivalence: Output distribution identical to standard decoding
The speedup depends heavily on the acceptance rate of draft tokens. If 2 out of 4 draft tokens are accepted on average, the effective time per token can be halved.[8]
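A rough way to quantify this (an illustrative assumption, not taken from the cited sources): if each draft token is accepted independently with probability $\alpha$ and $K$ draft tokens are proposed per step, the expected number of tokens emitted per target-model pass is

$$
\mathbb{E}[\text{tokens per target pass}] = \frac{1 - \alpha^{K+1}}{1 - \alpha}
$$

so, for example, $\alpha = 0.8$ and $K = 4$ yields roughly $3.4$ tokens per pass instead of $1$.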
Key Insight
Speculative decoding exploits two observations:[4][6]
- LLM inference is memory-bound, leaving GPU compute underutilized
- Many tokens are easy to predict and can be accurately proposed by smaller models
3. Scaling Laws: Predicting LLM Performance
Chinchilla Scaling Laws
The Chinchilla scaling laws provide data-optimal guidelines for training LLMs:[9][10][11]
Key Formula: For optimal performance, use approximately 20 tokens per parameter[10]
Examples:
- A 70B parameter model should be trained on ~1.4 trillion tokens
- GPT-3 (175B parameters) was undertrained with only 300B tokens
- To be data-optimal, GPT-3 should have used ~3.5 trillion tokens (11× more data)[10]
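The arithmetic behind these figures, applying the 20-tokens-per-parameter rule:

$$
D_{\text{opt}} \approx 20N:\qquad 20 \times 70 \times 10^{9} = 1.4 \times 10^{12}\ \text{tokens},
\qquad 20 \times 175 \times 10^{9} = 3.5 \times 10^{12}\ \text{tokens} \approx 11.7 \times 300\text{B}.
$$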
Beyond Chinchilla: Inference Considerations
Recent research shows the original Chinchilla laws don't account for inference costs. When considering deployment with high inference demand:[11][12]
- Train smaller models longer than Chinchilla-optimal
- Inference efficiency becomes more important than training efficiency
- Models like Llama are deliberately trained beyond the Chinchilla-optimal point[13][11]
Important Caveat
In practice, frontier models are now pre-trained on far more data than scaling-law estimates suggest is necessary. Models continue to improve as datasets grow, and no saturation point has been definitively observed; Transformer architectures exhibit excellent data scalability.[14]
4. Emergent Abilities of Large Language Models
Definition
Emergent abilities are capabilities that are absent in smaller models but present in larger models. These abilities cannot be predicted by extrapolating performance from smaller-scale systems.[15][16]
Key Emergent Abilities
In-Context Learning: The ability to learn from example demonstrations without parameter updates (see the prompt sketch after this list)[17][16]
Instruction Following: Understanding and executing complex instructions
Step-by-Step Reasoning: Chain-of-thought problem-solving[17]
Specific Examples:[17]
- Arithmetic reasoning
- Decoding International Phonetic Alphabet
- Unscrambling words
- Understanding spatial relationships and cardinal directions
- Multi-language understanding (e.g., Hinglish offensive content detection)
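To make in-context learning concrete, here is a hypothetical few-shot prompt; the model is expected to continue the pattern using only the demonstrations in the prompt, with no parameter updates:

```
Review: "The plot was predictable and dull."               Sentiment: negative
Review: "A beautiful, moving film."                        Sentiment: positive
Review: "I couldn't stop laughing the whole way through."  Sentiment:
```

A sufficiently large model completes this with "positive", having inferred the task purely from the two demonstrations.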
The Debate
The emergence phenomenon is contested:[18][19]
- Proponents: Truly emergent capabilities arise unpredictably at scale
- Skeptics: "Emergence" may result from metric choice, in-context learning, and model memory rather than fundamental phase transitions[19][18]
Recent research suggests emergent abilities might be better explained as combinations of in-context learning, model memory, and linguistic knowledge rather than genuinely novel capabilities.[19]
5. ChatGPT Training Pipeline
ChatGPT is built through a three-stage training process:
Stage 1: Pre-Training
Objective: Learn language patterns through next-token prediction
- Method: Self-supervised learning on massive web-scale corpora
- Task: Predict the next word given previous context
- Result: Foundation model with broad language understanding
Stage 2: Instruction Tuning (IT)
Objective: Guide model toward helpful, instruction-following behavior
- Method: Supervised learning on human-annotated question-answer pairs
- Data: Carefully curated examples of desired behaviors
- Result: Model aligned with user intent
Stage 3: Reinforcement Learning from Human Feedback (RLHF)
Objective: Optimize outputs based on human preferences[20][21][22][23]
Process:[21][24]
- Reward Model Training:
  - Collect human rankings of multiple model outputs
  - Train a separate reward model to predict human preferences
  - The reward model then scores output quality automatically
- RL Optimization:
  - Use the reward model as the objective function
  - Apply the Proximal Policy Optimization (PPO) algorithm
  - Iteratively improve the policy to maximize reward
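In the common InstructGPT-style formulation, the reward model $r_\phi$ is fit to pairwise human preferences, and PPO then maximizes that reward with a KL penalty keeping the policy $\pi_\theta$ close to the instruction-tuned reference model $\pi_{\text{ref}}$:

$$
\mathcal{L}_{\text{RM}}(\phi) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[\log \sigma\!\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\right]
$$

$$
\max_{\theta}\;\; \mathbb{E}_{x,\; y \sim \pi_\theta(\cdot \mid x)}\!\left[r_\phi(x, y)\right] \;-\; \beta\, D_{\mathrm{KL}}\!\big(\pi_\theta(\cdot \mid x)\,\big\|\,\pi_{\text{ref}}(\cdot \mid x)\big)
$$

where $y_w$ is the human-preferred response and $y_l$ the rejected one.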
Three H's of Alignment:[22][14]
- Helpfulness: Provides useful, relevant information
- Honesty: Truthful and accurate responses
- Harmlessness: Avoids harmful, biased, or unsafe content
Alignment
The combination of Instruction Tuning and RLHF is called alignment—ensuring AI systems behave in accordance with human values and intentions.[23][14][22]
6. Diffusion Models for Image Generation
Diffusion models are generative models that create images through a two-stage process.[25][26][27]
Forward Process (Diffusion)
Gradually adds Gaussian noise to data over T timesteps:[26][27]
- Start with real image x₀
- Add small amounts of noise at each step: x₁, x₂, ..., xT
- After T steps, image becomes pure Gaussian noise
- The forward process is fixed in advance (a predefined noise schedule, with no learned parameters)
Mathematical Property: Can sample noisy image at any timestep t using closed-form equation[27]
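In the standard DDPM parameterization (noise schedule $\beta_t$, $\alpha_t = 1 - \beta_t$, $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$), that closed form is:

$$
q(x_t \mid x_0) = \mathcal{N}\!\big(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1 - \bar{\alpha}_t)\,I\big),
\qquad
x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1 - \bar{\alpha}_t}\,\varepsilon,\quad \varepsilon \sim \mathcal{N}(0, I)
$$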
Reverse Process (Denoising)
Neural network learns to reverse the diffusion and remove noise:[28][25][26]
- Start with pure noise xT ~ N(0, I)
- Predict and remove noise step-by-step
- Gradually reconstruct coherent image
- Result: New sample from learned data distribution
Key Challenge: The reverse distribution q(xₜ₋₁|xₜ) cannot be computed directly[28][29]. Instead, a neural network is trained to approximate it by predicting the noise that was added.
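In the DDPM training scheme, the network $\varepsilon_\theta$ learns this by minimizing a simple noise-prediction loss:

$$
\mathcal{L}_{\text{simple}} = \mathbb{E}_{t,\,x_0,\,\varepsilon}\!\left[\big\|\varepsilon - \varepsilon_\theta(x_t, t)\big\|^2\right]
$$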
Stable Diffusion
Stable Diffusion operates in latent space rather than pixel space:[14]
- Encoder compresses images to lower-dimensional latent representations
- Diffusion process occurs in compact latent space (more efficient)
- Decoder reconstructs final image from denoised latent
This architecture dramatically reduces computational requirements while maintaining quality.
7. LangChain: Framework for LLM Applications
LangChain provides modular components to simplify LLM application development.[30][31][32]
Core Architecture
LangChain follows a pipeline approach with distinct modules:[30]
- User Query → Input processing
- Vector Representation → Semantic search in vector database
- Information Retrieval → Fetch relevant context
- LLM Processing → Generate response with context
- Output → Formatted response to user
Key Modules
1. Model I/O Module[32][33]
Normalizes interaction with different LLMs through a unified interface:
Components:
- LLMs/Chat Models: Text completion or conversational interfaces
- Prompt Templates: Reusable, parameterized prompts
- Output Parsers: Structure and format model outputs
Example (import paths may differ across LangChain JS versions):

```ts
import { OpenAI } from '@langchain/openai'
import { PromptTemplate } from '@langchain/core/prompts'
import { CommaSeparatedListOutputParser } from '@langchain/core/output_parsers'
import { RunnableSequence } from '@langchain/core/runnables'

// Prompt template -> LLM -> output parser, composed into one runnable chain.
const template = PromptTemplate.fromTemplate('List 10 {subject}.\n{format_instructions}')
const model = new OpenAI({ temperature: 0 })
const listParser = new CommaSeparatedListOutputParser()
const chain = RunnableSequence.from([template, model, listParser])

// Returns an array of strings, e.g. ['France', 'Japan', ...].
const result = await chain.invoke({
  subject: 'countries',
  format_instructions: listParser.getFormatInstructions(),
})
```
2. Retrieval Module[32][30]
Implements Retrieval-Augmented Generation (RAG):
Components:
- Document Loaders: Ingest data from various sources (CSV, PDF, databases)
- Text Splitters: Chunk documents into manageable pieces
- Embeddings: Convert text to vector representations
- Vector Stores: Efficient semantic search databases
- Retrievers: Query and fetch relevant information
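A minimal in-memory RAG sketch with LangChain JS; import paths and method names vary across versions, so treat this as an outline rather than exact API:

```ts
import { OpenAI, OpenAIEmbeddings } from '@langchain/openai'
import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter'
import { MemoryVectorStore } from 'langchain/vectorstores/memory'

// Stand-in for text produced by a Document Loader (CSV, PDF, database, ...).
const rawText = '... full policy document text ...'

// 1. Split the document into overlapping chunks.
const splitter = new RecursiveCharacterTextSplitter({ chunkSize: 500, chunkOverlap: 50 })
const docs = await splitter.createDocuments([rawText])

// 2. Embed the chunks and index them in an in-memory vector store.
const store = await MemoryVectorStore.fromDocuments(docs, new OpenAIEmbeddings())

// 3. Retrieve the chunks most relevant to the user's question.
const retriever = store.asRetriever(3)
const question = 'What does the refund policy say?'
const relevant = await retriever.invoke(question)

// 4. Generate an answer grounded in the retrieved context.
const model = new OpenAI({ temperature: 0 })
const answer = await model.invoke(
  `Answer using only this context:\n${relevant.map((d) => d.pageContent).join('\n')}\n\nQuestion: ${question}`,
)
```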
3. Chains Module[33]
Link multiple tasks into sequences using LangChain Expression Language (LCEL):
- Compose operations with the pipe operator (|)
- Each component implements the Runnable interface
- Automatic handling of data flow between stages
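In the JavaScript/TypeScript library the same composition is written with .pipe() (the bare | operator is the Python LCEL syntax); reusing template, model, and listParser from the Model I/O example above:

```ts
// Equivalent to RunnableSequence.from([template, model, listParser]).
const pipedChain = template.pipe(model).pipe(listParser)

const countries = await pipedChain.invoke({
  subject: 'countries',
  format_instructions: listParser.getFormatInstructions(),
})
```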
4. Agents Module
Dynamic systems that choose actions based on LLM reasoning:
- Access to tools (functions the agent can execute)
- LLM decides which tools to use and when
- Unlike hardcoded chains, agents adapt behavior dynamically
8. AI Agents: Advanced LLM Applications
AI agents are autonomous systems, driven by an LLM and sophisticated prompting techniques, that can interact with their environment and use tools.
Agent Instruction Design
Best Practices:[14]
- Use Existing Documents: Convert procedures and policies to LLM-friendly routines
- Break Down Tasks: Smaller, clearer steps reduce ambiguity
- Define Clear Actions: Each step corresponds to specific output
- Capture Edge Cases: Include conditional logic and alternative paths
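A hypothetical illustration of these practices: a short support policy rewritten as an agent routine (the steps and tool names such as lookup_order are invented for this example):

```
Routine: handle a refund request
1. Ask the user for their order number.
2. Call lookup_order(order_number).
   - If no order is found, apologize and ask the user to double-check the number.
3. If the order is within the 30-day refund window, call issue_refund(order_number).
4. Otherwise, explain the policy and offer to escalate to a human agent.
5. Confirm the outcome with the user before ending the conversation.
```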