LYC project

Annotation

ensembl pipeline

raw compute

repeat masking

ab initio prediction

others

similarity search

based on online LYC ESTs

based on uniprot vertebrate proteins

integration

maker pipeline

use ensembl results as a start point

add transcriptome data

Evolution analyses

gene families

treefam method

Database preparation

download protein sequences of 6 fish species from Ensembl

together with LYC proteins, build wublast db

wublastp

blastp the db sequences against themselves

use solar to combine gene-to-gene blastp scores

compute edge weight (g1*g2)/max(g1,g2)
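
The edge-weight formula above can be sketched in Python. This is only an illustration: the outline does not define g1 and g2, so the sketch assumes they are the two directional blastp scores for a gene pair (A vs. B and B vs. A) after solar has combined the hits. The function names and the input layout are hypothetical.

```python
def edge_weight(g1: float, g2: float) -> float:
    """Edge weight as written in the outline: (g1 * g2) / max(g1, g2)."""
    largest = max(g1, g2)
    if largest == 0:
        return 0.0  # avoid division by zero when both scores are zero
    return (g1 * g2) / largest


def build_edges(scores):
    """Turn directional scores into undirected weighted edges.

    scores: dict mapping (query, subject) -> combined score (assumed layout).
    Returns a dict mapping a sorted gene pair -> edge weight. Self-hits and
    pairs that were only hit in one direction are skipped.
    """
    edges = {}
    for (query, subject), g1 in scores.items():
        if query == subject:
            continue
        g2 = scores.get((subject, query))
        if g2 is None:
            continue
        pair = tuple(sorted((query, subject)))
        edges[pair] = edge_weight(g1, g2)
    return edges
```

For positive scores this expression reduces to the smaller of the two values, so the weaker direction of a reciprocal hit determines the edge weight.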

call gene clusters

run hcluster_sg

ortholog groups

get protein seqs for each cluster

run muscle to get multi-alignment

get the cds seqs for the protein multi-alignment

run 'treebest best' on the cds for each cluster

run 'treebest nj' to infer ortholog relations

draw ortholog groups

draw venn diagram

based on the ortholog groups

family expansion/contraction

get single copy families

run modeltest and mrbayes to get the overall phylogenetic tree

run CAFE to get the expansion/contraction

GO/Pathway analyses

based on the CAFE results
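
The "get single copy families" step can be sketched as a filter over the gene clusters. This is a minimal illustration, assuming each family is available as a list of (species, gene) pairs; that layout, the function name and the species labels are hypothetical and not part of the pipeline described here.

```python
from collections import Counter


def single_copy_families(families, species):
    """Keep families in which every listed species has exactly one gene.

    families: dict mapping family id -> list of (species, gene) pairs.
    species:  iterable of species names that must each appear exactly once.
    """
    wanted = set(species)
    kept = {}
    for family_id, members in families.items():
        counts = Counter(sp for sp, _gene in members)
        # Every wanted species must be present once, and no others allowed.
        if set(counts) == wanted and all(n == 1 for n in counts.values()):
            kept[family_id] = members
    return kept
```

The same set of families can then feed both the species-tree step and the per-family rate analysis.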

evolution speed

run codeml on the single copy gene families

draw Ka/Ks ratio graph between lyc and another fish

positive selection

run GO analysis on the positively selected genes (PSGs) and relate them to environmental factors

Sequencing

genomic DNA

pe libraries

300 bp

600 bp

mp libraries

3k

5k

9k

transcriptome

liver

egg

muscle

Genome Assembly

Reads cleaning

Based on quality score

Trim adaptor contamination

Delete PCR duplicates

K-mer correction
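
Two of the cleaning steps above, quality filtering and PCR duplicate removal, can be sketched as follows. This is an illustration only: it assumes Phred+33 quality strings and treats read pairs with identical sequences on both mates as PCR duplicates. The quality cutoff and the function names are hypothetical, since the outline does not give thresholds.

```python
def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean Phred score of a FASTQ quality string (Phred+33 assumed)."""
    if not quality:
        return 0.0
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def passes_quality(quality: str, min_mean: float = 20.0) -> bool:
    """True if the read's mean quality reaches the (assumed) cutoff."""
    return mean_phred(quality) >= min_mean


def remove_pcr_duplicates(pairs):
    """Keep the first occurrence of each distinct (read1, read2) sequence pair.

    pairs: iterable of (read1_sequence, read2_sequence) tuples.
    """
    seen = set()
    unique = []
    for read1, read2 in pairs:
        key = (read1, read2)
        if key in seen:
            continue  # identical pair already kept: treat as PCR duplicate
        seen.add(key)
        unique.append(key)
    return unique
```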

Build contigs

Based on reads from PE libraries

Use ABySS

Do not apply pairing

Test on a broad range of K-mers

choose the best N50
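
Selecting the k-mer size by N50, as described above, can be sketched like this. The sketch assumes the contig lengths for each tested k have already been collected from the assembler output; the function names and the input layout are hypothetical.

```python
def n50(lengths):
    """N50: the length L such that contigs of length >= L cover at least
    half of the total assembled bases."""
    if not lengths:
        return 0
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def best_k_by_n50(assemblies):
    """Return the k whose assembly has the highest N50.

    assemblies: dict mapping k-mer size -> list of contig lengths.
    """
    return max(assemblies, key=lambda k: n50(assemblies[k]))
```

N50 alone does not measure correctness, so it is worth checking the chosen assembly against other statistics such as total length and contig count.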

Build scaffolds

Based on trimmed reads (36 x 2) from both PE and MP libraries

Use SSPACE

Do not simply choose the result with the best N50 (it may contain more linking errors)

Close gaps

GapCloser from SOAPdenovo2

Transcriptome Assembly

map reads to the genome with tophat

integrate gene set with cufflinks