Application of Next Generation Sequencing (NGS)

Objectives

Background on NGS

Sequencing strategies

Introduce the Linux environment

Background

What is NGS?

a broad term that encompasses several modern sequencing technologies

can be broadly divided into two groups based on the length of the sequencing output

short-reads

Illumina-based platform

MiniSeq

MiSeq

NextSeq

HiSeq

NovaSeq

MGIseq

G50

G400

Ion Torrent

Gene Studio S5

Genexus System

long-reads

PacBio

RS II

Sequel

Sequel II

Oxford Nanopore

MinION

GridION

PromethION

A major advantage of NGS compared with PCR is that prior knowledge of the target organism (target-specific primers) is not required.

How is it commonly used in clinical settings?

Risk assessment and screening

e.g. Non-Invasive Prenatal Testing (NIPT)

common trisomy syndromes in fetuses

e.g. Carrier screening for recessive genetic disorders

cystic fibrosis, Mendelian inherited disorders

e.g. Germline cancer risk testing

Diagnosis

Diagnosis of specific clinical presentations of suspected genetic diseases

e.g. Neuromuscular disorder

Diagnosis of infectious diseases

e.g. Initial whole-genome sequencing and analysis of the host genetic contribution to COVID-19 severity and susceptibility

e.g. SARS-CoV-2

Prognosis

Therapy selection

What kind of experiments can you do?

Genomics

Whole Genome Sequencing (WGS)

Whole Exome Sequencing (WES)

Targeted Gene

Transcriptomics

Total RNA

mRNA

Targeted

non-coding RNA

Epigenomics

Methylation

ChIPSeq

Many more

Common pipeline

Sequencing strategies

Read depth coverage

What is it and why is it important?

During library prep, the genome is fragmented into short random fragments.

Sequencing adapters are added to these fragments

purpose

Allow sequencer to recognize the fragments

Multiplexing sequencing

allow sequencing multiple samples in one run

The libraries are PCR amplified. ONLY the fragments with the adapters attached are amplified

These random fragments are then sequenced

The reads are then aligned to a reference genome to create longer stretches of sequence

example

To make the tiling process a success

need to have many fragments that overlap with each other

the more overlaps, the higher the alignment confidence

sequencing errors will always occur

result in the tiling process not working properly

result in erroneous variant calling

but if we have many copies, correct reads will outweigh bad reads

result in high confidence variant calling

e.g. sometimes adapters attach to each other instead of to a fragment

"low confidence" error

"low diversity" error

occurs at the beginning or end of a sequencing read
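
The point that correct reads outweigh bad reads at sufficient depth can be sketched as a majority vote over the bases covering one reference position. This is a minimal illustration only, not how a production variant caller works; the function name `consensus_base` and the 0.5 threshold are assumptions made for the example.

```python
from collections import Counter


def consensus_base(pileup: str, min_fraction: float = 0.5):
    """Return (base, fraction) for the most common base at one position.

    `pileup` is a string of the bases that overlapping reads report at a
    single reference position. If no base exceeds `min_fraction`
    (an assumed threshold), "N" is returned instead of a base.
    """
    if not pileup:
        return ("N", 0.0)
    counts = Counter(pileup.upper())
    base, n = counts.most_common(1)[0]
    fraction = n / len(pileup)
    return (base if fraction > min_fraction else "N", fraction)


# Ten overlapping reads, one of which carries a sequencing error (T)
print(consensus_base("AAAAAAAAAT"))
# Only two reads that disagree: no confident call is possible
print(consensus_base("AT"))
```

With ten reads the single erroneous base is outvoted; with only two reads that disagree, there is no way to tell which one is correct.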

Depth plays an important role for heterogeneous samples

e.g. cancer

In a tumor sample, normal cells tend to be observed together with tumor cells

2 populations: Normal and Tumor

We do not know the ratio of each

With high depth, modern bioinformatics tools are able to detect differences in read coverage

fewer reads

might be deletion

more reads

might be duplication
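
The coverage reasoning above (fewer reads suggests a deletion, more reads suggests a duplication) can be sketched as a simple ratio of observed to expected depth. The function name `classify_depth` and the cut-offs 0.7 and 1.3 are assumptions chosen for illustration; real copy-number callers also model sample purity, GC bias, and noise.

```python
def classify_depth(observed: float, expected: float,
                   low: float = 0.7, high: float = 1.3) -> str:
    """Classify a region by its observed/expected read depth ratio.

    `low` and `high` are illustrative thresholds, not recommended values.
    """
    if expected <= 0:
        raise ValueError("expected depth must be positive")
    ratio = observed / expected
    if ratio < low:
        return "possible deletion"
    if ratio > high:
        return "possible duplication"
    return "no change detected"


# Assumed example: the genome-wide average depth is 30x
for region_depth in (15, 31, 45):
    print(region_depth, classify_depth(region_depth, 30))
```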

Comprehensiveness

WGS > WES > Gene Panel

Type of variants

WGS LR

SNPs/short InDels

Gene Panel = WES = WGS

Turnaround time

Gene Panel > WES > WGS

Cost

Gene Panel ~ WES > WGS

PRE-ALIGNMENT QC

The most important step

Common questions

Do I have enough sequencing reads?

Depth = (Read length x Number of reads) / Genome size
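
The depth formula can be applied directly, or rearranged to estimate how many reads a run needs. The sketch below uses assumed example values (150 bp reads, a 3.1 Gb genome); the function names are hypothetical.

```python
import math


def sequencing_depth(read_length_bp: int, num_reads: int,
                     genome_size_bp: int) -> float:
    """Depth = (read length x number of reads) / genome size."""
    if genome_size_bp <= 0:
        raise ValueError("genome_size_bp must be positive")
    return read_length_bp * num_reads / genome_size_bp


def reads_needed(target_depth: float, read_length_bp: int,
                 genome_size_bp: int) -> int:
    """Same formula solved for the number of reads, rounded up."""
    return math.ceil(target_depth * genome_size_bp / read_length_bp)


# Assumed example: 620 million reads of 150 bp on a 3.1 Gb genome
print(sequencing_depth(150, 620_000_000, 3_100_000_000))  # 30.0
print(reads_needed(30, 150, 3_100_000_000))               # 620000000
```

For paired-end data, each read of a pair counts separately in `num_reads`.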

Did the sequencing work?

During sequencing

e.g. Sequencing Real Time Analysis

https://supportassets.illumina.com/content/dam/illumina-support/images/featured-training/sav-overview.png

Number of reads

platform specific

NextSeq

> 300M

HiSeq

> 100M

MiSeq

>25M

Percentage >Q30

in 70% of reads

Error rate

Illumina

less than 0.5%

PacBio

HiFi/Sequel II

less than 1%

Sequel/RSII

less than 15%

Demultiplexing

all libraries should be well-balanced
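
One way to check whether libraries are well-balanced after demultiplexing is to count the reads assigned to each sample and compare their share of the run. The sketch below assumes exact index matching and uses made-up index sequences and sample names; it is not how the instrument software performs demultiplexing.

```python
from collections import Counter


def demultiplex(read_indexes, sample_sheet):
    """Count reads per sample by exact match of the index sequence.

    `sample_sheet` maps index sequence -> sample name (hypothetical data).
    Reads whose index is not in the sheet are counted as "undetermined".
    """
    counts = Counter({sample: 0 for sample in sample_sheet.values()})
    for index in read_indexes:
        counts[sample_sheet.get(index, "undetermined")] += 1
    return counts


def read_fractions(counts):
    """Convert per-sample read counts to fractions of the whole run."""
    total = sum(counts.values())
    if total == 0:
        return {sample: 0.0 for sample in counts}
    return {sample: n / total for sample, n in counts.items()}


# Hypothetical sample sheet and index reads
sheet = {"ACGTAC": "sample_A", "TGCATG": "sample_B"}
indexes = ["ACGTAC"] * 5 + ["TGCATG"] * 4 + ["NNNNNN"]
print(read_fractions(demultiplex(indexes, sheet)))
```

A sample with a much smaller fraction than the others will end up with lower depth, and a large "undetermined" fraction suggests a problem with the indexes or the sample sheet.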

Post-sequencing level

Reads quality assessment

FastQC

https://www.bioinformatics.babraham.ac.uk/projects/fastqc/

All are important but the most important ones are

Per base sequence quality

check quality of your sequence

Per base GC content

check whether there is a problem with the library
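
To make these two checks concrete, the sketch below computes a mean quality score and a GC fraction for each read position. It assumes Phred+33 encoded quality strings and takes the reads as already-parsed (sequence, quality) pairs; FastQC itself reports richer statistics, so this only shows the underlying idea. The function name `per_base_stats` is hypothetical.

```python
def per_base_stats(records):
    """Return (mean_quality, gc_fraction) lists, one value per position.

    `records` is a list of (sequence, quality_string) tuples where the
    quality string is assumed to use the Phred+33 ASCII encoding.
    """
    max_len = max(min(len(seq), len(qual)) for seq, qual in records)
    qual_sum = [0] * max_len
    gc_count = [0] * max_len
    n_reads = [0] * max_len
    for seq, qual in records:
        for i, (base, q_char) in enumerate(zip(seq.upper(), qual)):
            qual_sum[i] += ord(q_char) - 33   # ASCII character -> Phred score
            gc_count[i] += base in "GC"       # True counts as 1
            n_reads[i] += 1
    mean_quality = [s / n for s, n in zip(qual_sum, n_reads)]
    gc_fraction = [g / n for g, n in zip(gc_count, n_reads)]
    return mean_quality, gc_fraction


# Two made-up reads; quality drops towards the end of the second read
reads = [("ACGT", "IIII"), ("GCGA", "II5#")]
mean_q, gc = per_base_stats(reads)
print(mean_q)  # [40.0, 40.0, 30.0, 21.0]
print(gc)      # [0.5, 1.0, 1.0, 0.0]
```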

MultiQC

an "advanced" version of FastQC

Can view QC of multiple samples at the same time

can also be used for post-alignment QC

Sample contamination/swap

contamination

reads contain mixture of DNA from different samples

swap

reads contain DNA from another sample

How could this happen?

many reasons

mislabeling of sample sheets

rotated sample plate

What are the effects?

e.g. in Trio analysis

mistakenly identify variants in VCF as de novo mutations, when the variants actually came from someone else

e.g. in Cancer analysis

calling contaminant germline variants as somatic

tools

verifyBamID

Picard CrosscheckFingerprints

Sample relatedness

check how different samples are related

especially important for Trio-based analysis

verify that the proband dataset is indeed the child of the parental dataset

tools

KING
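
A basic parent-child sanity check can be sketched without a dedicated tool: a child inherits one allele from each parent, so at any site the child and a true parent should share at least one allele. The sketch below counts sites where two samples share no allele (one is homozygous reference, the other homozygous alternate). This is a simplification for illustration, not the KING algorithm; genotypes are assumed to be coded as alternate-allele counts (0, 1, 2), with `None` for missing calls.

```python
def ibs0_fraction(genotypes_a, genotypes_b):
    """Fraction of sites where two samples share no allele.

    Genotypes are alternate-allele counts (0, 1 or 2); None means missing.
    A true parent-child pair is expected to give a value near 0, apart
    from genotyping errors.
    """
    pairs = [(a, b) for a, b in zip(genotypes_a, genotypes_b)
             if a is not None and b is not None]
    if not pairs:
        raise ValueError("no sites with genotypes in both samples")
    ibs0 = sum(1 for a, b in pairs if abs(a - b) == 2)
    return ibs0 / len(pairs)


# Made-up genotypes at six sites
child = [0, 1, 2, 1, 0, 1]
mother = [0, 1, 1, 2, 0, 0]
stranger = [2, 1, 0, 1, 2, 1]
print(ibs0_fraction(child, mother))    # 0.0
print(ibs0_fraction(child, stranger))  # 0.5
```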

ALIGNMENT

Koboldt, D.C. Best practices for variant calling in clinical sequencing. Genome Med 12, 91 (2020)

Critical phase

all downstream analyses and interpretation result from the quality of alignment

Raw sequence data are aligned to the reference sequence

e.g. of reference genome

GRCh37/38

CHM13

the output is a mapping file in SAM/BAM format

records which reads mapped where in the reference genome
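
To show what a mapping file records, the sketch below reads the first six mandatory tab-separated fields of a SAM alignment line (read name, flag, reference name, 1-based position, mapping quality, CIGAR). The example line is made up, and the names `Alignment`, `parse_sam_line`, and `is_mapped` are hypothetical; in practice a library such as pysam or the samtools command line would be used.

```python
from typing import NamedTuple, Optional


class Alignment(NamedTuple):
    read_name: str
    flag: int
    reference: str
    position: int   # 1-based leftmost mapping position
    mapq: int
    cigar: str


def parse_sam_line(line: str) -> Optional[Alignment]:
    """Parse one SAM line; header lines (starting with '@') return None."""
    if line.startswith("@"):
        return None
    f = line.rstrip("\n").split("\t")
    return Alignment(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5])


def is_mapped(aln: Alignment) -> bool:
    """Flag bit 0x4 is set when the read is unmapped."""
    return not (aln.flag & 0x4)


# Made-up alignment line for illustration
line = "read1\t0\tchr1\t10468\t60\t4M\t*\t0\t0\tACGT\tIIII"
aln = parse_sam_line(line)
print(aln.reference, aln.position, is_mapped(aln))  # chr1 10468 True
```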

Tools

BWA-MEM

for short-reads

Minimap2

for long-reads

Alignment QC

Identify PCR duplicates

What are PCR duplicates?

Reads copied from the same original fragment during PCR amplification

if one small error is introduced during the PCR process

This error will be amplified

leaving lots of reads that contain this error

tools

Picard MarkDuplicates

Samblaster
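
The core idea behind duplicate marking can be sketched as follows: reads that map to the same reference, position, and strand are treated as copies of one original fragment, and all but the first are flagged. This is a deliberately simplified illustration with made-up reads; Picard MarkDuplicates and Samblaster use more information (for example both ends of a read pair) than this sketch does.

```python
def mark_duplicates(reads):
    """Return the names of reads flagged as duplicates.

    `reads` is a list of (name, reference, position, strand) tuples.
    The first read seen at each (reference, position, strand) is kept;
    later reads with the same key are flagged.
    """
    seen = set()
    duplicates = set()
    for name, reference, position, strand in reads:
        key = (reference, position, strand)
        if key in seen:
            duplicates.add(name)
        else:
            seen.add(key)
    return duplicates


# Made-up reads: r1, r2 and r3 start at the same place on the same strand
reads = [
    ("r1", "chr1", 1000, "+"),
    ("r2", "chr1", 1000, "+"),
    ("r3", "chr1", 1000, "+"),
    ("r4", "chr1", 1000, "-"),
    ("r5", "chr1", 1042, "+"),
]
print(sorted(mark_duplicates(reads)))  # ['r2', 'r3']
```

Flagged reads are usually marked rather than deleted, so that downstream tools can choose to ignore them.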

additional pipeline

Pathway

once we have identified the variants, we want to look at the pathways they are involved in

Free/Open-source

BioCyc

HumanCyc

REACTOME

KEGG

Paid

Ingenuity Pathway Analysis

Pathway Studio

Human Genetics Reference

e.g. identify mode of inheritance (MOI) of the disease/phenotype

e.g. associated genes and known pathogenic variants

Genetics Home Reference

Raredisease.gov

OMIM

Other resources

Disease specific DB

Cancer

TCGA

COSMIC

co-localization GWAS and eQTL

QTLtools

COLOC

eQTL/sQTL database

GTEx

eQTL

locus affecting expression

sQTL

variant affecting splicing

eQTL catalogue

other DB

gnomAD

combine all publicly available WGS and WES

Good to see AF for a SNP or SV

dbSNP

all known SNPs as reported by GRC, NCBI, HapMap and the 1000 Genomes Project

Good for seeing AF for SNPs

Gene Set Enrichment Analysis

given a set of genes, expression data and list of phenotypes

identify statistically significant, concordant differences between two phenotypic states

Enrichment Map

Cytoscape

ClusterProfiler

Functional annotation

Rank

SIFT

POLYPHEN

CADD

DANN

Annotate & Rank

Annovar

Exomiser

PVP

Random Forest Classifier

AnnotSV

for SVs

Annotate

Variant Effect Predictor

SNPeffect

effect on protein structure & interactions

LS-SNP/PDB

Missense3D

SuSpect

SAAPdap

6,216 samples

69,830,209 SNPs

characteristics

throughput

very high, 25-100Gb

error rate

<= 1%

read length

shorter, 100-250 bp

cost

cheaper than long-reads

characteristics

read length

very long

PacBio 10-15 kb

ONT 5-10 kb

error rate

PacBio

1-10%

ONT

5-15%

cost

PacBio

~$3000 per human sample

ONT

~$1800-2000 per human sample

throughput

low to medium

PacBio

5-10Gb

ONT

~5Gb

How does it work?

Sample Prep

During sample prep, the genome is fragmented into short random fragments.

Then sequencing adapters are added to these fragments

What is the purpose of these adapters?

Allow the fragments to bind to the oligonucleotides on the flow cell

Multiplexing sequencing

allow sequencing multiple samples in one run

save $$$

Cluster Generation

the fragments hybridize to the surface of the flow cell

DNA polymerase binds to the hybridized fragment and creates a complementary strand

The original template/fragment is washed away

the newly created fragment then hybridizes with a neighbouring oligonucleotide attached to the flow cell

remember there are 2 types of oligonucleotides on the flow cell

one type is complementary to the start adapter, the other to the end adapter

this amplification process will be repeated

ONLY the fragments with the adapters attached are amplified

Sequencing

The fragments are then sequenced using fluorescently tagged nucleotides

What are fluorescent probes?

molecules that absorb light of a specific wavelength and emit light of a different wavelength

How does PacBio sequencing work?

How does Oxford Nanopore sequencing work?

Looking from the above

the probability that the intensity represents an incorrect base is stored as the Phred score

e.g. if Phred assigns a quality score of 30 to a base, the chances that this base is called incorrectly are 1 in 1000.
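
The relationship in this example is Q = -10 x log10(P), where P is the probability that the base call is wrong. The sketch below converts between the two, decodes a quality string, and computes the share of bases at or above Q30. It assumes the Phred+33 ASCII encoding for the quality string; the function names are hypothetical.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Probability that a base with Phred score q is called incorrectly."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


def decode_quality(quality_string: str, offset: int = 33):
    """Convert a quality string to a list of Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def fraction_at_least(scores, threshold: int = 30) -> float:
    """Share of bases whose Phred score is at or above the threshold."""
    return sum(s >= threshold for s in scores) / len(scores)


for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))   # Q30 -> about 0.001, i.e. 1 in 1000

scores = decode_quality("II?5")        # made-up quality string
print(scores)                          # [40, 40, 30, 20]
print(fraction_at_least(scores))       # 0.75
```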