Categories: All - pipeline - genome - evolution

by di zhang 10 years ago

LYC project

This project covers a series of computational biology tasks for assembling and analyzing genomic and transcriptomic data. It starts with transcriptome assembly, using TopHat to map reads to the genome and Cufflinks to integrate the gene set.

Transcriptome Assembling

integrate the gene set with Cufflinks
map reads to the genome with TopHat

Genome Assembling

Close gaps
GapCloser from SOAPdenovo2
Build scaffolds
Do not choose the best N50 (containing more linking errors)
Use SSPACE
Based on trimmed reads (36 x 2) from both PE and MP libraries
Build contigs
choose the best N50
Test on a broad range of K-mers
Do not apply pairing
Use ABySS
Based on reads from PE libraries
Reads cleaning
K-mer correction
Delete PCR duplicates
Trim adaptor contamination
Based on quality score
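
The contig step above picks the assembly with the best N50 across a range of k-mers. A minimal sketch of that selection, assuming contig lengths have already been collected per k-mer (the function names and the input dictionary are hypothetical and not part of ABySS):

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    cover at least half of the total assembled bases."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0  # empty input


def best_kmer_by_n50(contig_lengths_by_k):
    """Pick the k-mer size whose contig set has the largest N50.

    contig_lengths_by_k is a hypothetical dict: k -> list of contig lengths.
    """
    return max(contig_lengths_by_k, key=lambda k: n50(contig_lengths_by_k[k]))
```

This applies to the contig step only. For scaffolds the outline advises against taking the best N50, since it tends to come with more linking errors.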

Sequencing

transcriptome
muscle
egg
liver
genomic DNA
mp libraries

9k

5k

3k

pe libraries

600 bp

300 bp

Evolution analyses

positive selection
run GO for the PSGs, relate them with some environmental factors
evolution speed
draw Ka/Ks ratio graph between lyc and another fish
run codeml on the single-copy gene families
gene families
GO/Pathway analyses

based on the CAFE results

family expansion/contraction

run CAFE to get the expansion/contraction

run modeltest and mrbayes to get the overall phylogenetic tree

get single copy families

draw venn diagram

based on the ortholog groups

treefam method

ortholog groups

draw ortholog groups

run 'treebest nj' to infer ortholog relations

run 'treebest best' on the cds for each cluster

get the cds seqs for the protein multi-alignment

run muscle to get multi-alignment

get protein seqs for each cluster

call gene clusters

run hcluster_sg

wublastp

compute edge weight (g1*g2)/max(g1,g2)

use solar to combine gene-to-gene blastp scores

blastp the db sequences to itself
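
The edge weight used for clustering is given above as (g1*g2)/max(g1,g2). A minimal sketch of that computation, assuming g1 and g2 are the two directional scores for a gene pair (A against B and B against A); the outline does not define them, so this reading, the function names, and the input layout are all assumptions:

```python
def edge_weight(g1, g2):
    """Weight as written in the outline: (g1 * g2) / max(g1, g2).

    For positive scores this equals the smaller of the two values.
    """
    top = max(g1, g2)
    if top <= 0:
        return 0.0
    return (g1 * g2) / top


def build_edges(scores):
    """Turn directional scores into undirected weighted edges.

    scores is a hypothetical dict: (query, subject) -> combined score.
    Only pairs scored in both directions get an edge; self hits are skipped.
    """
    edges = {}
    for (a, b), s_ab in scores.items():
        if a == b:
            continue
        s_ba = scores.get((b, a))
        if s_ba is None:
            continue
        key = tuple(sorted((a, b)))  # one entry per unordered pair
        edges[key] = edge_weight(s_ab, s_ba)
    return edges
```

The resulting pairs and weights would still need to be written out in whatever input format hcluster_sg expects, which is not shown here.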

Database preparation

together with lyc proteins, build wublast db

download protein sequences of 6 fishes from Ensembl
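
Once gene clusters have been called over the combined database (the six fishes plus LYC), the single-copy families used for the phylogenetic tree and the evolution-speed analysis can be picked out by counting genes per species. A minimal sketch, assuming the clusters and a gene-to-species lookup are already in memory (the function name and both data structures are hypothetical):

```python
from collections import Counter


def single_copy_families(clusters, species_of, all_species):
    """Return ids of families with exactly one gene from every species.

    clusters:    hypothetical dict, family id -> list of gene ids
    species_of:  hypothetical dict, gene id -> species name
    all_species: iterable of every species in the database
    """
    wanted = set(all_species)
    selected = []
    for family_id, genes in clusters.items():
        counts = Counter(species_of[gene] for gene in genes)
        # Every species must be present, and none more than once.
        if set(counts) == wanted and all(n == 1 for n in counts.values()):
            selected.append(family_id)
    return sorted(selected)
```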

Annotation

maker pipeline
add transcriptome data
use ensembl results as a start point
ensembl pipeline
integration
similarity search

based on uniprot vertebrate proteins

based on online LYC ESTs

raw compute

others

ab initio prediction

repeat masking