LLMs Learning Path

Fundamentals

From Seq-to-Seq and RNN
to Attention and Transformers

Vasilev: Chapters 2, 3, 6, 7, 8

Attention Variants Papers

Fine-Tuning (QA Chatbot)

Gemma-2

Llama 3.2

Mistral 7B

Zephyr

Low Latency Deployment Chatbot

Jan

HF

Ollama

GPT4All

Fine-Tuning, PEFT, Quantization

Prompt Engineering

LangChain

Agentic CoQA

Sentiment, 220,000 Reviews
seconds ~ hours

- BytePairEncoding
- WordPieceEncoding
- SentencePieceEncoding

- Absolute Positional Embeddings
- Relative Positional Embeddings
- Rotary Position Embeddings
- Relative Positional Bias

- Decoder-Only
- Encoder-Decoder
- Hybrid

- Supervised Fine-tuning
- General Fine-tuning
- Multi-turn Instructions
- Instruction Following

Text Embedding

- Masked Language Modeling
- Causal Language Modeling
- Next Sentence Prediction
- Mixture of Experts

LLMs Capabilities

Basic

Coding

World
Knowledge

Multilingual

Translation

Crosslingual Tasks

Crosslingual QA

Comprehension

Summarization

Simplification

Reading Comprehension

Emerging

In-context learning

Step by step
solving

Symbolic reference

Pos/Neg example

Instruction
following

Task definition

Turn based

Few-shot

Task definition

Reasoning

Logical

Common Sense

Symbolic

Arithmetic

Augmented

Self-improvement

Self-criticism

Self-refinement

Tool
utilization

Tool planning

Knowledge base utilization

Task decomposition

Interacting
with users

Assignment planning

Virtual acting

LLM Components

Tokenizations

Positional Encoding

LLM Architectures

Model Pre-training

Fine-tuning and Instruction Tuning

Alignment

Decoding Strategies

Adaptation

LLM Essentials

Attention In LLMs

- Self-Attention: Calculates attention using queries, keys, and values that all come from the same block (encoder or decoder).
- Cross Attention: Used in encoder-decoder architectures, where the queries come from the decoder and the key-value pairs come from the encoder outputs.
- Sparse Attention: Speeds up self-attention by calculating attention only within sliding windows instead of over every pair of tokens.
- Flash Attention: Speeds up attention on GPUs by tiling the inputs to minimize memory reads and writes between the GPU high-bandwidth memory (HBM) and the on-chip SRAM.
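
The difference between self-attention and cross attention can be sketched with plain scaled dot-product attention. This is a minimal illustration, assuming toy two-dimensional hidden states and omitting the learned query/key/value projections; the names `attention`, `decoder_states`, and `encoder_states` are hypothetical and not from the original material.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        # One similarity score per key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return outputs


# Toy hidden states (assumption: projections omitted for brevity).
decoder_states = [[1.0, 0.0], [0.0, 1.0]]
encoder_states = [[1.0, 1.0], [0.5, -0.5], [2.0, 0.0]]

# Self-attention: queries, keys, and values come from the same block.
self_attn = attention(decoder_states, decoder_states, decoder_states)

# Cross attention: queries from the decoder, keys and values from the encoder.
cross_attn = attention(decoder_states, encoder_states, encoder_states)
```

In both calls the output has one vector per query, so cross attention returns as many rows as there are decoder positions regardless of the encoder length.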

NLP Fundamentals

Tokenization

- Wordpiece
- Byte pair encoding (BPE)
- UnigramLM

Encoding Positions

- ALiBi
- RoPE

Fine Tuning

- Instruction-tuning
- Alignment-tuning
- Transfer Learning

Transformers Architectures

- Encoder Decoder : This architecture processes inputs through
the encoder and passes the intermediate representation to the
decoder to generate the output.
- Causal Decoder : A type of architecture that does not have an
encoder and processes and generates output using a decoder,
where the predicted token depends only on the previous time
steps.
- Prefix Decoder : A non-causal decoder where attention over the
input prefix is bidirectional, so the attention calculation is not
strictly dependent on past information; generated tokens still
attend only to earlier positions.
- Mixture-of-Experts: It is a variant of transformer architecture
with parallel independent experts and a router to route tokens
to experts.
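
The causal and prefix decoders differ mainly in their attention mask. The sketch below builds both masks as boolean matrices, where `mask[i][j]` being `True` means position `i` may attend to position `j`. The function names and the `prefix_len` parameter are illustrative assumptions, not part of the original material.

```python
def causal_mask(n):
    """Each position attends only to itself and earlier positions."""
    return [[j <= i for j in range(n)] for i in range(n)]


def prefix_mask(n, prefix_len):
    """Bidirectional attention inside the prefix, causal attention after it."""
    return [[j <= i or j < prefix_len for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    for name, mask in (("causal", causal_mask(4)), ("prefix", prefix_mask(4, 2))):
        print(name)
        for row in mask:
            # Print 1 where attention is allowed, 0 where it is masked out.
            print(" ".join("1" if allowed else "0" for allowed in row))
```

With a prefix of length 2, the first token can already see the second one, which a causal mask forbids.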

Language Modeling

- Full Language Modeling
- Prefix Language Modeling
- Masked Language Modeling
- Unified Language Modeling

Prompting

- Zero-Shot Prompting
- In-context Learning
- Single- and Multi-Turn Instructions

Background

Attention in LLMs

Architecture

Language Modeling

LLMs Adaptation Stages

Pre-Training

Fine-Tuning

Alignment-tuning

RLHF

Transfer Learning

Instruction-tuning

Prompting

Zero-Shot

In-context

Reasoning in LLMs

Single-Turn Instructions

Multi-Turn Instructions

Fine-Tuning

Fine-Tuning I

Large Labeled
Dataset is Available

Fine-Tuning II

Our Dataset is
Different from the
Pre-Trained Data

PEFT

Limited Computational
Resource

Distributed LLM Training

Data Parallelism
Replicates the entire model across
devices; easy to implement
but limited by memory constraints.

Model Parallelism
Combines aspects of tensor and
pipeline parallelism for high scalability
but requires complex implementation.

Pipeline Parallelism
Divides the model itself into stages
(layers) and assigns each stage to a
different device, reduces memory
usage but introduces latency.

Tensor Parallelism
Shards a single tensor within
a layer across devices, efficient
for computation but requires
careful communication management.

Hybrid Parallelism
Combine pipeline and tensor
parallelism for optimal performance
based on the model architecture
and available resources.

Optimizer Parallelism
Focuses on partitioning optimizer
state and gradients to reduce memory
consumption on individual devices.
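
Two of the schemes above can be sketched in a few lines, with ordinary lists standing in for devices. This is a toy illustration under stated assumptions: `tensor_parallel_matmul` shards a weight matrix by columns (one common way to shard a tensor, assumed here), and `all_reduce_mean` mimics the gradient averaging step of data parallelism. All function names are hypothetical, and no real communication library is involved.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split_columns(w, parts):
    """Shard a weight matrix into equal column blocks (assumes divisibility)."""
    step = len(w[0]) // parts
    return [[row[p * step:(p + 1) * step] for row in w] for p in range(parts)]


def tensor_parallel_matmul(x, w, parts):
    """Each 'device' multiplies by its own column shard; results are concatenated."""
    partials = [matmul(x, shard) for shard in split_columns(w, parts)]
    # Stand-in for an all-gather: join the column blocks row by row.
    return [sum((p[i] for p in partials), []) for i in range(len(x))]


def all_reduce_mean(per_device_grads):
    """Data parallelism: average the gradients computed on each model replica."""
    n = len(per_device_grads)
    return [sum(g) / n for g in zip(*per_device_grads)]


x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
sharded = tensor_parallel_matmul(x, w, parts=2)
averaged = all_reduce_mean([[1.0, 2.0], [3.0, 4.0]])
```

The sharded product matches the unsharded `matmul(x, w)`, which is the property that makes tensor parallelism transparent to the rest of the model; the concatenation step is where the communication cost arises.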

PaLM Family

Med-PaLM

Med-PaLM2

Med-PaLM M

Flan-PaLM

PaLM

PaLM2

PaLM-E

U-PaLM

Transformer

AI Strategy

Applying LLMs
(Required Large Data
Mostly QA)

API (e.g. OpenAI),
Some UCs

Run-Time

Cost (Short Term)

Cost (Long Term)

Output Quality

AI Wow

Some Use Cases

LLMs on Cloud
All UCs

Run-Time

Cost (Short Term)

Cost (Long Term)

Output Quality

AI Wow

MLOps / CI-CD (Level 2)

All Use Cases

Pretrained Models (CPU) Local
(Moderate Data, Predictive Modeling)

Run-Time

Cost

Output Quality

AI Wow

MLOps / CI-CD

Some Use Cases

Traditional Models (CPU) Local
(With Less Data, Predictive Modeling)

Run-Time

Cost (Short Term)

Cost (Long Term)

Output Quality

AI Wow

MLOps / CI-CD

All Use Cases

AI/ML Projects Types

Strategic Categorization
Organizational goals, market positioning
and industry-specific needs

Optimization and Efficiency Projects

Customer Experience Enhancement

Risk Management and Compliance

Product and Service Innovation

Data-Driven Decision Support

Social Impact and Sustainability

Training and Development

Technical Categorization

Predictive Modeling

Time-Series Forecasting

Supervised/Unsupervised

Signal

Text Mining and Natural Language Processing (NLP)

Recommendation Systems

Generative AI (LLMs)

Computer Vision

Speech Recognition and Audio Analysis

Domain-Specific

Healthcare and Medicine

Finance and Banking

Manufacturing and Logistics

E-commerce

Insurance

Transportation and Logistics

Energy and Utilities

Education

Agriculture

Public Safety

Entertainment and Media

Technology and Software Development

Innovation and R&D Projects

ML Scenarios & Tasks

Supervised Learning
Uses labeled datasets to train algorithms
to predict outcomes and recognize patterns

Classification

Binary
MultiClass
MultiLabel

Regression

Unsupervised Learning
The model is given raw, unlabeled data and has to infer
its own rules and structure the information.

Clustering

Dimensionality Reduction

PCA, t-SNE

Semi-Supervised Learning
This combines both labeled and unlabeled data to improve learning accuracy. It’s often used in cases where obtaining a large amount of labeled data is expensive or time-consuming.

Reinforcement Learning
Involves training an agent to make a sequence of decisions by learning from interactions with an environment. The agent receives rewards or penalties and aims to maximize cumulative rewards. Typical applications include game playing, robotics, and autonomous vehicles.
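
The cumulative reward an agent tries to maximize is often written as a discounted return. The sketch below computes it for one episode; the discount factor `gamma` and the function name are assumptions added for illustration, since the text above only speaks of cumulative rewards in general.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of rewards where a reward k steps ahead is weighted by gamma**k."""
    total = 0.0
    # Work backwards so each step adds its reward to the discounted future.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical episode: a penalty followed by two rewards.
episode_return = discounted_return([-1.0, 2.0, 4.0], gamma=0.5)
```

Setting `gamma=1.0` recovers the plain undiscounted sum of rewards.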

Multi-Task Learning
Involves training a model on multiple related tasks
simultaneously, sharing representations between
tasks to improve generalization

Self-Supervised Learning
A form of unsupervised learning where
the data itself provides the supervision.
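
One way to see how the data provides its own supervision is next-token prediction, where every token in raw text becomes the label for the tokens before it. This is a minimal sketch assuming whitespace tokenization; `next_token_pairs` and `context_size` are hypothetical names used only for illustration.

```python
def next_token_pairs(tokens, context_size):
    """Turn an unlabeled token sequence into (context, target) training pairs."""
    pairs = []
    for i in range(1, len(tokens)):
        # The target is the token itself; the context is what precedes it.
        context = tokens[max(0, i - context_size):i]
        pairs.append((context, tokens[i]))
    return pairs


raw_text = "the cat sat on the mat"
pairs = next_token_pairs(raw_text.split(), context_size=2)
```

No human annotation is involved: six raw tokens yield five labeled examples.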

Transfer Learning
Involves leveraging knowledge from one task to
improve learning in a related but different task.
This is particularly useful when there is limited
labeled data in the target domain.

Active Learning

Meta-Learning (Learning to Learn)

Federated Learning

Basic LLMs Tasks

Question Answering

Conversational AI

Text Summarization

Language Translation

Paraphrasing

Ethical and Bias Evaluation

Content Personalization

Sentiment Analysis

Semantic Search

Text-to-Text Transformation

Information Extraction

Content Generation and Correction

Business Sectors

e-Business

Finance and Banking

Sales and Marketing

Customer Relationship Management (CRM)

Regulatory Compliance

Education and Training

Healthcare

Research and Development

Human Resources and Talent Management

Supply Chain Management

Knowledge Management

Manufacturing

Technology and Software Development

UC - Data

Similarity Search

Semantic Search

Tag Analysis and/or Generation

Tag Based Search

Reviews Summarization

Reviews Sentiment Labeling

Popularity Based
Recommendation

Category Recommendation

Semi-Personalized Rec

Personalized Rec

QA Types

Methodological Distinctions
(Based on Answer Generation)

Abstractive QA

Generative QA

Extractive QA

Retrieval-Augmented QA

Rule Based QA

Knowledge Based QA

QA Types Based on Interaction

Conversational QA

Contextual QA (Clarification-Based)

Yes/No QA

Multiple-Choice QA

Special Types of QA

LLM-Based Agent Brain
LLM as a Main Part

LLM-Based Agent

Perception

Context Integration

Input Modalities

Preprocessing

Brain

Core LLM Capabilities

Memory Capabilities
and Retrieval

Reasoning &
Planning Layer

Transferability &
Generalization

Knowledge Integration

Tool Interface

Action

Response Generation

Environment Interaction

Feedback Loop

AI Agent System Abilities

Self-learning and
Continuous
Improvement

Perceiving and
Predictive
Modeling

Planning and
Decision Making

Execution and
Interaction

Personal and
Collaborative

Planning / Estimation
AI Agent + Knowledge Extraction
Maturity Level #1

Modeling / Prototyping Phase

OpenAI Services

Edge

Open LLMs

Edge Decentralized

Inference / Deployment Phase

Edge

Plan

Modeling Feat Set 1

Deploy Feat Set 1

Modeling Feat Set 2

Deploy Feat Set 2

Modeling Feat Set 3

Deploy Feat Set 3

Prompt Engineering

New Tasks Without Extensive
Training

Zero-shot Prompting [Radford et al., 2019]

Few-shot Prompting [Brown et al., 2020]
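
Zero-shot and few-shot prompting differ only in whether worked examples precede the query. The sketch below builds both prompts for a review-sentiment task; the `Review:` / `Sentiment:` template, the function name, and the example reviews are assumptions made for illustration and do not reflect any particular model's required format.

```python
def build_prompt(instruction, query, examples=()):
    """Assemble a prompt; with no examples it is zero-shot, otherwise few-shot."""
    lines = [instruction, ""]
    for text, label in examples:
        # Each demonstration shows the model an input and its expected output.
        lines += [f"Review: {text}", f"Sentiment: {label}", ""]
    lines += [f"Review: {query}", "Sentiment:"]
    return "\n".join(lines)


instruction = "Classify the sentiment of the review as positive or negative."
zero_shot = build_prompt(instruction, "Battery died after two days.")
few_shot = build_prompt(
    instruction,
    "Battery died after two days.",
    examples=[("Works perfectly, very happy.", "positive"),
              ("Arrived broken and late.", "negative")],
)
```

Either string can be sent to a model as-is; the few-shot version costs more input tokens in exchange for showing the expected label format.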

Reasoning and Logic

Chain-of-Thought (CoT) Prompting [Wei et al., 2022]

Automatic Chain-of-Thought (Auto-CoT) [Zhang et al., 2022]

Self-Consistency [Wang et al., 2022]

Tree-of-Thoughts (ToT) Prompting [Yao et al., 2023a]

Least-to-Most Prompting [Zhou et al., 2023]

Graph-of-Thought (GoT) Prompting [Yao et al., 2023b]
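
Of the methods in this group, self-consistency is the simplest to sketch: several reasoning paths are sampled for the same question and the most frequent final answer is kept. The code below covers only the voting step, assuming the final answers have already been extracted from the sampled completions; the function name and the sample answers are hypothetical.

```python
from collections import Counter


def self_consistent_answer(sampled_answers):
    """Majority vote over final answers taken from several sampled reasoning paths."""
    # Normalize lightly so trivially different strings count as the same answer.
    counts = Counter(answer.strip().lower() for answer in sampled_answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(sampled_answers)


# Hypothetical final answers from four sampled chains of thought.
answer, agreement = self_consistent_answer(["18", "18 ", "17", "18"])
```

The agreement ratio is a rough confidence signal: a low value suggests the sampled reasoning paths disagree and the answer deserves a second look.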

Reduce Hallucination

Retrieval Augmented Generation (RAG) [Lewis et al., 2020]

ReAct Prompting [Yao et al., 2022]

User Interaction

Active-Prompt [Diao et al., 2023]

Fine-Tuning and Optimization

Automatic Prompt Engineer (APE) [Zhou et al., 2022]

Code Generation and Execution

Program of Thoughts (PoT) Prompting [Chen et al., 2022]

Structured Chain-of-Thought
(SCoT) Prompting [Li et al., 2023c]

Chain of Code (CoC) Prompting [Li et al., 2023b]

Optimization and Efficiency

Optimization by Prompting [Yang et al., 2023]
