LYC project
Annotation
ensembl pipeline
raw compute
repeat masking
ab initio prediction
others
similarity search
based on online LYC ESTs
based on UniProt vertebrate proteins
integration
maker pipeline
use ensembl results as a starting point
add transcriptome data
Evolution analyses
gene families
treefam method
Database preparation
download protein sequences of 6 fishes from Ensembl
together with LYC proteins, build a wublast db
wublastp
blastp the db sequences against themselves
use solar to combine gene-to-gene blastp scores
compute edge weight (g1*g2)/max(g1,g2)
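The edge weight step above can be sketched as follows. This is a minimal sketch under a stated assumption: the note's "(g1*g2)" is read here as the combined blastp score between gene 1 and gene 2, and "g1", "g2" in the denominator as each gene's self-hit score. That reading is not confirmed by the outline, and the function and parameter names are hypothetical.

```python
def edge_weight(pair_score, self_score_1, self_score_2):
    """Normalise a gene-to-gene score by the larger of the two self-hit scores.

    ASSUMPTION: pair_score stands for the note's "(g1*g2)" and the two
    self scores for "g1" and "g2"; the outline does not spell this out.
    """
    denom = max(self_score_1, self_score_2)
    if denom <= 0:
        raise ValueError("self-hit scores must be positive")
    return pair_score / denom


def weighted_edges(pair_scores, self_scores):
    """Build (gene_a, gene_b, weight) tuples from score dictionaries.

    pair_scores: dict mapping (gene_a, gene_b) -> combined score
    self_scores: dict mapping gene -> self-hit score
    """
    edges = []
    for (a, b), score in pair_scores.items():
        if a == b:
            continue  # self hits are only used for normalisation
        edges.append((a, b, edge_weight(score, self_scores[a], self_scores[b])))
    return edges
```

Under this reading the weight falls between 0 and 1 whenever the pair score does not exceed the larger self score, which makes scores from genes of different lengths comparable before clustering.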
call gene clusters
run hcluster_sg
ortholog groups
get protein seqs for each cluster
run muscle to get a multiple alignment
get the cds seqs for the protein multiple alignment
run 'treebest best' on the cds for each cluster
run 'treebest nj' to infer ortholog relations
draw ortholog groups
draw venn diagram
based on the ortholog groups
family expansion/contraction
get single copy families
run modeltest and mrbayes to get the overall phylogenetic tree
run CAFE to get the expansion/contraction
GO/Pathway analyses
based on the CAFE results
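The "get single copy families" step above can be sketched as a simple filter over the gene clusters. This is an illustrative sketch, assuming a family counts as single copy when every species contributes exactly one gene; the data layout (a dict of family id to gene ids, plus a gene-to-species map) and all names are hypothetical.

```python
from collections import Counter


def single_copy_families(clusters, species_of, all_species):
    """Return the families in which each species has exactly one gene.

    clusters:    dict mapping family id -> list of gene ids
    species_of:  dict mapping gene id -> species name
    all_species: iterable of every species expected in a family
    """
    expected = set(all_species)
    selected = {}
    for family, genes in clusters.items():
        counts = Counter(species_of[g] for g in genes)
        # Keep the family only if all species are present, once each.
        if set(counts) == expected and all(n == 1 for n in counts.values()):
            selected[family] = genes
    return selected
```

Families that pass this filter would be the input for the species tree and for the per-family rate analyses.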
evolution speed
run codeml on the single copy gene families
draw Ka/Ks ratio graph between lyc and another fish
positive selection
run GO for the PSGs, relate them with some environmental factors
Sequencing
genomic DNA
pe libraries
300 bp
600 bp
mp libraries
3k
5k
9k
transcriptome
liver
egg
muscle
Genome Assembling
Reads cleaning
Based on quality score
Trim adaptor contamination
Delete PCR duplicates
K-mer correction
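The "Delete PCR duplicates" step above can be sketched as below. This is a minimal sketch, assuming a duplicate is a read pair whose two sequences are both identical to those of an earlier pair; the outline does not state the criterion actually used, and the function name and tuple layout are hypothetical.

```python
def drop_pcr_duplicates(read_pairs):
    """Keep the first occurrence of each (read1, read2) sequence pair.

    read_pairs: iterable of (name, seq1, seq2) tuples
    ASSUMPTION: exact sequence identity on both ends defines a duplicate.
    """
    seen = set()
    kept = []
    for name, seq1, seq2 in read_pairs:
        key = (seq1, seq2)
        if key in seen:
            continue  # same pair seen before, treat as a PCR duplicate
        seen.add(key)
        kept.append((name, seq1, seq2))
    return kept
```

For real libraries the set of seen pairs can become large, so hashing the pair or processing sorted input may be preferable.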
Build contigs
Based on reads from PE libraries
Use ABySS
Do not apply pairing
Test on a broad range of K-mers
choose the best N50
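The K-mer scan above comes down to computing N50 per assembly and keeping the K with the largest value. A small sketch, assuming the contig lengths for each K have already been collected into a dict (the names here are hypothetical):

```python
def n50(lengths):
    """Return the N50: the length L such that contigs of length >= L
    cover at least half of the total assembly size."""
    if not lengths:
        return 0
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return ordered[-1]


def best_k(contig_lengths_by_k):
    """Pick the K whose assembly has the largest contig N50.

    contig_lengths_by_k: dict mapping K -> list of contig lengths
    """
    return max(contig_lengths_by_k, key=lambda k: n50(contig_lengths_by_k[k]))
```

Note that this selection rule applies to the contig stage only; the outline explicitly avoids choosing by N50 at the scaffolding stage.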
Build scaffolds
Based on trimmed reads (36 x 2) from both PE and MP libraries
Use SSPACE
Do not choose the best N50 (containing more linking errors)
Close gaps
GapCloser from SOAPdenovo2
Transcriptome Assembling
mapping reads on genome with tophat
integrate gene set with cufflinks