LLMs Learning Path
Fundamentals
From Seq-to-Seq and RNNs to Attention and Transformers
Vasilev: Chapters 2, 3, 6, 7, 8
Attention Variants Papers
Fine-Tuning (QA Chatbot)
Gemma-2
Llama 3.2
Mistral 7B
Zephyr
Low Latency Deployment Chatbot
Jan
Hugging Face (HF)
Ollama
GPT4All
Fine-Tuning, PEFT, Quantization
Prompt Engineering
LangChain
Agentic CoQA
Sentiment, 220,000 Reviews
Seconds to hours
- BytePairEncoding
- WordPieceEncoding
- SentencePieceEncoding
- Absolute Positional Embeddings
- Relative Positional Embeddings
- Rotary Position Embeddings
- Relative Positional Bias
- Decoder-Only
- Encoder-Decoder
- Hybrid
- Supervised Fine-tuning
- General Fine-tuning
- Multi-turn Instructions
- Instruction Following
Text Embedding
- Masked Language Modeling
- Causal Language Modeling
- Next Sentence Prediction
- Mixture of Experts
LLM Capabilities
Basic
Coding
World Knowledge
Multilingual
Translation
Crosslingual Tasks
Crosslingual QA
Comprehension
Summarization
Simplification
Reading Comprehension
Emerging
In-context learning
Step-by-step solving
Symbolic reference
Pos/Neg example
Instruction following
Task definition
Turn based
Few-shot
Task definition
Reasoning
Logical
Common Sense
Symbolic
Arithmetic
Augmented
Self-improvement
Self-criticism
Self-refinement
Tool utilization
Tool planning
Knowledge base utilization
Task decomposition
Interacting with users
Assignment planning
Virtual acting
LLM Components
Tokenization
Positional Encoding
LLM Architectures
Model Pre-training
Fine-tuning and Instruction Tuning
Alignment
Decoding Strategies
Adaptation
LLM Essentials
Attention In LLMs
- Self-Attention: Calculates attention using queries, keys, and values that all come from the same block (encoder or decoder); see the sketch after this list.
- Cross-Attention: Used in encoder-decoder architectures, where the queries come from the decoder and the key-value pairs come from the encoder outputs.
- Sparse Attention: Speeds up self-attention by restricting it to a sparse pattern, e.g., computing attention within sliding windows instead of over the full sequence.
- Flash Attention: Speeds up attention on GPUs by tiling the inputs to minimize memory reads and writes between high-bandwidth memory (HBM) and on-chip SRAM.
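Below is a minimal NumPy sketch of scaled dot-product self-attention (single head, no masking), purely to make the query/key/value flow concrete; the shapes, random weights, and names are illustrative assumptions rather than any specific implementation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                     # (seq_q, seq_k) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
    return weights @ V                                  # weighted sum of values

# Self-attention: Q, K, V are all projections of the same sequence.
x = np.random.randn(5, 16)                              # 5 tokens, d_model = 16
Wq, Wk, Wv = (np.random.randn(16, 16) for _ in range(3))
out = scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv)
print(out.shape)                                        # (5, 16)

# For cross-attention, the queries would instead be projections of the decoder
# states, while the keys and values are projections of the encoder outputs.
```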
NLP Fundamentals
Tokenization
- Wordpiece
- Byte pair encoding (BPE)
- UnigramLM
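To make BPE concrete, here is a toy merge loop in plain Python: count adjacent symbol pairs across the corpus and merge the most frequent pair. The miniature corpus is the classic textbook example; real tokenizers (SentencePiece, Hugging Face tokenizers) implement this at scale with boundary-safe matching and many more details.

```python
from collections import Counter

def most_frequent_pair(corpus):
    """Count adjacent symbol pairs; corpus maps space-separated symbols -> word frequency."""
    pairs = Counter()
    for word, freq in corpus.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(pair, corpus):
    """Naive merge: replace 'a b' with 'ab' everywhere (fine for this toy corpus)."""
    old, new = " ".join(pair), "".join(pair)
    return {word.replace(old, new): freq for word, freq in corpus.items()}

# Words pre-split into characters, with an end-of-word marker.
corpus = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}
for _ in range(3):                       # perform three merge steps
    pair = most_frequent_pair(corpus)
    corpus = merge_pair(pair, corpus)
    print("merged:", pair)
```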
Encoding Positions
- ALiBi
- RoPE
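A rough NumPy sketch of rotary position embeddings (RoPE): pairs of query/key channels are rotated by angles that grow with token position, so relative offsets show up in the attention dot products. The base frequency and split-halves pairing follow one common convention and should be treated as assumptions.

```python
import numpy as np

def rope(x, base=10000.0):
    """Apply rotary position embeddings to x of shape (seq_len, d), d even."""
    seq_len, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)            # per-pair rotation frequency
    angles = np.arange(seq_len)[:, None] * freqs[None]   # (seq_len, half) angles
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # Rotate each (x1, x2) channel pair by its position-dependent angle.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

q, k = np.random.randn(8, 64), np.random.randn(8, 64)    # 8 tokens, head dim 64
scores = rope(q) @ rope(k).T                              # scores now encode relative position
```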
Fine Tuning
- Instruction-tuning
- Alignment-tuning
- Transfer Learning
Transformers Architectures
- Encoder-Decoder: Processes the input through the encoder and passes the intermediate representation to the decoder to generate the output.
- Causal Decoder: Has no encoder; a decoder both processes the input and generates the output, and each predicted token depends only on previous time steps (see the mask sketch after this list).
- Prefix Decoder: Attention over the prefix (input) part is bidirectional rather than strictly causal, while generation of the continuation remains causal.
- Mixture-of-Experts: A transformer variant with parallel independent experts and a router that routes tokens to experts.
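In practice the causal decoder and prefix decoder differ mainly in their attention masks, as sketched below (True = may attend); the prefix length is an arbitrary value chosen for illustration.

```python
import numpy as np

def causal_mask(n):
    """Causal (decoder-only) mask: token i attends only to positions <= i."""
    return np.tril(np.ones((n, n), dtype=bool))

def prefix_mask(n, prefix_len):
    """Prefix-LM mask: bidirectional attention within the prefix,
    causal attention for the generated continuation."""
    mask = causal_mask(n)
    mask[:prefix_len, :prefix_len] = True    # prefix tokens see each other fully
    return mask

print(causal_mask(5).astype(int))
print(prefix_mask(5, prefix_len=3).astype(int))
```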
Language Modeling
- Full Language Modeling
- Prefix Language Modeling
- Masked Language Modeling
- Unified Language Modeling
Prompting
- Zero-Shot Prompting
- In-context Learning (see the prompt examples after this list)
- Single- and Multi-Turn Instructions
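Zero-shot versus few-shot (in-context) prompting, written out as chat-style message lists; the role/content schema mirrors the common chat-completions format, but no specific API is assumed.

```python
# Zero-shot: the task is described, no examples are given.
zero_shot = [
    {"role": "system", "content": "Classify the sentiment of the review as positive or negative."},
    {"role": "user", "content": "Review: The battery died after two days."},
]

# Few-shot / in-context learning: labeled examples precede the actual query.
few_shot = [
    {"role": "system", "content": "Classify the sentiment of the review as positive or negative."},
    {"role": "user", "content": "Review: Absolutely love this phone."},
    {"role": "assistant", "content": "positive"},
    {"role": "user", "content": "Review: Stopped working after a week."},
    {"role": "assistant", "content": "negative"},
    {"role": "user", "content": "Review: The battery died after two days."},
]
```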
Background
Attention in LLMs
Architecture
Language Modeling
LLMs Adaptation Stages
Pre-Training
Fine-Tuning
Alignment-tuning
RLHF
Transfer Learning
Instruction-tuning
Prompting
Zero-Shot
In-context
Reasoning in LLMs
Single-Turn Instructions
Multi-Turn Instructions
Fine-Tuning
Fine-Tuning I
A Large Labeled Dataset is Available
Fine-Tuning II
Our Dataset Differs from the Pre-Training Data
PEFT
Limited Computational Resources (see the LoRA sketch below)
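With limited compute, parameter-efficient fine-tuning such as LoRA trains small low-rank adapter matrices instead of all the weights. A minimal sketch assuming the Hugging Face transformers and peft libraries; the checkpoint name and hyperparameters are placeholders.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "mistralai/Mistral-7B-v0.1"               # placeholder; any causal LM checkpoint works
tokenizer = AutoTokenizer.from_pretrained(base)  # needed later to prepare the training data
model = AutoModelForCausalLM.from_pretrained(base)

# LoRA: freeze the base weights and learn low-rank updates on the attention projections.
config = LoraConfig(
    r=8,                                      # rank of the low-rank update
    lora_alpha=16,                            # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()            # typically well under 1% of the base parameters
```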
Distributed LLM Training
Data Parallelism
Replicates the entire model across devices; easy to implement but limited by per-device memory (see the DDP sketch after this list).
Model Parallelism
Splits the model itself across devices; an umbrella term covering tensor and pipeline parallelism, highly scalable but complex to implement.
Pipeline Parallelism
Divides the model into stages (groups of layers) and assigns each stage to a different device; reduces memory usage but introduces pipeline latency.
Tensor Parallelism
Shards individual tensors within a layer across devices; efficient for computation but requires careful communication management.
Hybrid Parallelism
Combines pipeline and tensor parallelism for the best performance given the model architecture and available resources.
Optimizer Parallelism
Partitions optimizer state and gradients across devices to reduce per-device memory consumption.
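Data parallelism is the simplest of these to try in practice. A minimal PyTorch DistributedDataParallel sketch with a stand-in model and random batches, assuming the script is launched with torchrun (e.g. `torchrun --nproc_per_node=2 ddp_sketch.py`).

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK / LOCAL_RANK / WORLD_SIZE in the environment.
    dist.init_process_group(backend="gloo")      # use "nccl" on multi-GPU nodes
    model = DDP(torch.nn.Linear(16, 2))          # full model replicated on every rank
    optim = torch.optim.SGD(model.parameters(), lr=0.1)

    for _ in range(10):
        # Each rank would normally load a different shard of the data;
        # random tensors stand in for a real DataLoader here.
        x, y = torch.randn(32, 16), torch.randint(0, 2, (32,))
        loss = F.cross_entropy(model(x), y)
        optim.zero_grad()
        loss.backward()                          # DDP all-reduces gradients here
        optim.step()

    if dist.get_rank() == 0:
        print("done")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```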
PaLM Family
Med-PaLM
Med-PaLM2
Med-PaLM M
Flan-PaLM
PaLM
PaLM2
PaLM-E
U-PaLM
Transformer
AI Strategy
Applying LLMs
(Requires Large Data, Mostly QA)
API (e.g., OpenAI)
Some Use Cases
Run-Time
Cost (Short Term)
Cost (Long Term)
Output Quality
AI Wow
Some Use Cases
LLMs on Cloud
All Use Cases
Run-Time
Cost (Short Term)
Cost (Long Term)
Output Quality
AI Wow
MLOps / CI-CD (Level 2)
All Use Cases
Pretrained Models (CPU) Local
(Moderate Data, Predictive Modeling)
Run-Time
Cost
Output Quality
AI Wow
MLOps / CI-CD
Some Use Cases
Traditional Models (CPU) Local
(With Less Data, Predictive Modeling)
Run-Time
Cost (Short Term)
Cost (Long Term)
Output Quality
AI Wow
MLOps / CI-CD
All Use Cases
AI/ML Projects Types
Strategic Categorization
Organizational goals, market positioning, and industry-specific needs
Optimization and Efficiency Projects
Customer Experience Enhancement
Risk Management and Compliance
Product and Service Innovation
Data-Driven Decision Support
Social Impact and Sustainability
Training and Development
Technical Categorization
Predictive Modeling
Time-Series Forecasting
Supervised/Unsupervised
Signal
Text Mining and Natural Language Processing (NLP)
Recommendation Systems
Generative AI (LLMs)
Computer Vision
Speech Recognition and Audio Analysis
Domain-Specific
Healthcare and Medicine
Finance and Banking
Manufacturing and Logistics
E-commerce
Insurance
Transportation and Logistics
Energy and Utilities
Education
Agriculture
Public Safety
Entertainment and Media
Technology and Software Development
Innovation and R&D Projects
ML Scenarios & Tasks
Supervised Learning
Uses labeled datasets to train algorithms to predict outcomes and recognize patterns (see the example after this list)
Classification
Binary
MultiClass
MultiLabel
Regression
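A tiny scikit-learn example of supervised classification on synthetic labeled data, just to ground the definition above; the dataset and model choice are arbitrary placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic labeled data: features X and binary labels y.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```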
Unsupervised Learning
The model is given raw, unlabeled data and has to infer
its own rules and structure the information.
Clustering
Dimensionality Reduction
PCA, t-SNE
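And a matching unsupervised example: dimensionality reduction with PCA in scikit-learn (t-SNE has a similar interface in sklearn.manifold); the data here is random and only illustrates the shapes.

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.randn(200, 50)                 # 200 unlabeled samples, 50 features
X_2d = PCA(n_components=2).fit_transform(X)  # project onto the top 2 principal components
print(X_2d.shape)                            # (200, 2), ready for plotting or clustering
```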
Semi-Supervised Learning
This combines both labeled and unlabeled data to improve learning accuracy. It’s often used in cases where obtaining a large amount of labeled data is expensive or time-consuming.
Reinforcement Learning
Involves training an agent to make a sequence of decisions by learning from interactions with an environment. The agent receives rewards or penalties and aims to maximize cumulative rewards. (game playing, robotics, and autonomous vehicles)
Multi-Task Learning
Involves training a model on multiple related tasks
simultaneously, sharing representations between
tasks to improve generalization
Self-Supervised Learning
A form of unsupervised learning where
the data itself provides the supervision.
Transfer Learning
Involves leveraging knowledge from one task to
improve learning in a related but different task.
This is particularly useful when there is limited
labeled data in the target domain.
Active Learning
Meta-Learning (Learning to Learn)
Federated Learning
Basic LLMs Tasks
Question Answering
Conversational AI
Text Summarization
Language Translation
Paraphrasing
Ethical and Bias Evaluation
Content Personalization
Sentiment Analysis
Semantic Search
Text-to-Text Transformation
Information Extraction
Content Generation and Correction
Business Sectors
e-Business
Finance and Banking
Sales and Marketing
Customer Relationship Management (CRM)
Regulatory Compliance
Education and Training
Healthcare
Research and Development
Human Resources and Talent Management
Supply Chain Management
Knowledge Management
Manufacturing
Technology and Software Development
UC - Data
Similarity Search
Semantic Search
Tag Analysis and/or Generation
Tag Based Search
Reviews Summarization
Reviews Sentiment Labeling
Popularity-Based Recommendation
Category Recommendation
Semi-Personalized Recommendation
Personalized Recommendation
QA Types
Methodological Distinctions
(Based on Answer Generation)
Abstractive QA
Generative QA
Extractive QA
Retrieval-Augmented QA (see the sketch below)
Rule-Based QA
Knowledge-Based QA
QA Types Based on Interaction
Conversational QA
Contextual QA (Clarification-Based)
Yes/No QA
Multiple-Choice QA
Special Types of QA
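Retrieval-augmented QA in its simplest form retrieves a few relevant passages and prepends them to the prompt. The sketch below uses a toy keyword-overlap retriever, and `generate_answer` is a hypothetical stand-in for whatever LLM client is actually used.

```python
def retrieve(question, passages, k=2):
    """Toy retriever: rank passages by word overlap with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(passages, key=lambda p: len(q_words & set(p.lower().split())), reverse=True)
    return ranked[:k]

def build_prompt(question, context):
    return ("Answer using only the context below.\n\nContext:\n"
            + "\n".join(context)
            + f"\n\nQuestion: {question}\nAnswer:")

passages = [
    "Mistral 7B is a 7-billion-parameter open-weight language model.",
    "LoRA fine-tunes a model by learning low-rank weight updates.",
    "RoPE encodes positions by rotating query and key vectors.",
]
prompt = build_prompt("What is LoRA?", retrieve("What is LoRA?", passages))
# answer = generate_answer(prompt)   # hypothetical LLM call; substitute your own client
print(prompt)
```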
LLM-Based Agent Brain
LLM as a Main Part
LLM-Based Agent
Perception
Context Integration
Input Modalities
Preprocessing
Brain
Core LLM Capabilities
Memory Capabilities and Retrieval
Reasoning &
Planning Layer
Transferability &
Generalization
Knowledge Integration
Tool Interface
Action
Response Generation
Environment Interaction
Feedback Loop
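The perception / brain / action split above maps onto a simple agent loop: observe, let the model decide, call a tool if needed, and feed the result back. A framework-agnostic sketch; `llm_decide` is a hypothetical stand-in for a real LLM call.

```python
# Toy tool registry (the "action" side of the agent).
TOOLS = {
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),  # toy only, not safe for untrusted input
}

def llm_decide(observation, memory):
    """Hypothetical 'brain': a real agent would call an LLM here and parse its
    reply into either a tool invocation or a final answer."""
    if memory:                                        # a tool result is already available
        return {"answer": f"The result is {memory[-1][1]}."}
    if any(ch.isdigit() for ch in observation):
        return {"tool": "calculator", "input": observation}
    return {"answer": f"(answered directly: {observation})"}

def run_agent(observation, max_steps=3):
    memory = []                                       # short-term memory / feedback loop
    for _ in range(max_steps):
        decision = llm_decide(observation, memory)
        if "answer" in decision:                      # the brain chose to respond directly
            return decision["answer"]
        result = TOOLS[decision["tool"]](decision["input"])   # act in the environment
        memory.append((decision, result))             # feed the outcome back as context
        observation = result
    return observation

print(run_agent("2 * 21"))
```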
AI Agent System Abilities
Self-learning and Continuous Improvement
Perceiving and Predictive Modeling
Planning and Decision Making
Execution and Interaction
Personal and Collaborative
Planning / Estimation
AI Agent + Knowledge Extraction
Maturity Level #1
Modeling/ Prototyping Phase
OpenAI Services
Edge
Open LLMs
Edge Decentralized
Inference / Deployment Phase
Edge
Plan
Modeling Feature Set 1
Deploy Feature Set 1
Modeling Feature Set 2
Deploy Feature Set 2
Modeling Feature Set 3
Deploy Feature Set 3