gff3转mysql_PASA的使用

前文讲述了：PASA的安装，配置与主程序使用参数。本文则重点讲述PASA的几种平行的用法：

最简单地理解PASA的用法：PASA是通过调用 blat 或 gsnap 来将 transcripts sequences 比对到 genome 上，从而进行基因的结构注释(gff文件形式)。

1. 普通用法

使用transcripts sequences来进行基因预测，需要2个(或3个)输入的文件：

The genome sequence in a multiFasta file (ie. genome.fasta)

The transcript sequences in a multiFasta file (ie. transcripts.fasta)

Optional: a file containing the list of accessions corresponding to full-length cDNAs (ie. FL_accs.txt)该文件是第二个文件中属于全长CDNAs的序列名的集合，txt文档。

其使用步骤为：

1.1 使用seqclean来对transcript sequences进行clean

通过污染数据库vectors.fasta来对transcript sequences可能被污染的数据进行清除。

$ seqclean transcripts.fasta -v vectors.fasta

1.2 运行PASA

$ cp $PASA_HOME/pasa_conf/pasa.alignAssembly.Template.txt alignAssembly.config

$ perl -p -i -e 's/MYSQLDB=.*/MYSQLDB=sample_mydb_pasa/' alignAssembly.config

$ $PASA_HOME/scripts/Launch_PASA_pipeline.pl -c alignAssembly.config -C -R -g genome_sample.fasta -t all_transcripts.fasta.clean -T -u all_transcripts.fasta -f FL_accs.txt --ALIGNERS blat,gmap --CPU 2

1.3 PASA结果

sample_mydb_pasa.assemblies.fasta : the PASA assemblies in FASTA format.

sample_mydb_pasa.pasa_assemblies.gff3,.gtf,.bed : the PASA assembly structures.

sample_mydb_pasa.pasa_alignment_assembly_building.ascii_illustrations.out : descriptions of alignment assemblies and how they were constructed from the underlying transcript alignments.

sample_mydb_pasa.pasa_assemblies_described.txt : tab-delimited format describing the contents of the PASA assemblies, including the identity of those transcripts that were assembled into the corresponding structure.

2. PASA结合RNA-seq数据来进行基因预测

由RNA-seq数据可以得到transcript sequences，以RNA-seq数据使用Trinity组装过后的转录组结果可以按第一种方法进行PASA分析。在此，更好的方法是：利用PASA，使用De novo转录组和genome来建立一个综合转录组数据库。

需要2个(或3个)的输入文件：

1. Trinity de novo RNA-Seq assemblies (ex. Trinity.fasta)

2. Trinity genome-guided RNA-Seq assemblies (ex. Trinity.GG.fasta)

3. (optionally)cufflinks transcript structures (ex. cufflinks.gtf).

使用步骤：

2.1 Concatenate the Trinity.fasta and Trinity.GG.fasta files into a single transcripts.fasta file.

Trinity.GG.fasta文件可以是 Trinity genome-guided 过程中产生的文件Trinity_GG.fasta, 如果使用最后PASA得到的assembly结果文件，则需要修改fasta文件的头。

cat Trinity.fasta Trinity.GG.fasta > transcripts.fasta

2.2 Create a file containing the list of transcript accessions that correspond to the Trinity de novo assembly (full de novo, not genome-guided).

$PASA_HOME/misc_utilities/accession_extractor.pl < Trinity.fasta > tdn.accs

2.3 Run PASA using RNA-Seq related options as described in the section above, but include the parameter setting –TDN tdn.accs. To (optionally) include cufflinks-generated transcript structures, further include the parameter setting –cufflinks_gtf cufflinks.gtf. Note, cufflinks may not be appropriate for gene-dense targets, such as in fungi; cufflinks excels when applied to vertebrate genomes, so best to include when applying to mouse or human.

$ $PASA_HOME/scripts/Launch_PASA_pipeline.pl -c alignAssembly.config -t transcripts.fasta -C -R -g genome_sample.fasta --ALIGNERS blat,gmap --CPU 2 --TDN tdn.accs --cufflinks_gtf cufflinks.gtf

2.4 After completing the PASA alignment assembly, generate the comprehensive transcriptome database via:

$PASA_HOME/PASA/scripts/build_comprehensive_transcriptome.dbi -c alignAssembly.config -t transcripts.fasta --min_per_ID 95 --min_per_aligned 30

2.5 结果文件

compreh_init_build/compreh_init_build.fasta : the transcript sequences

compreh_init_build/compreh_init_build.geneToTrans_mapping : the gene/transcript mapping file (for use with RSEM, Trinotate, other tools)

compreh_init_build/compreh_init_build.bed : transcript structures in bed format

compreh_init_build/compreh_init_build.gff3 : transcript structures in gff3 format

compreh_init_build/compreh_init_build.details : classifications of transcripts according to genome mapping status.

2.6 使用感受：

建立一个综合的transcritome数据库，起始是没将de novo的transcripts用于PASA的运行，用于PASA运行是的Trinity_GG.fasta。最后的结果fasta文件是单独使用Trinity_GG.fasta运行PASA得到的gff文件，fasta文件是单独运行得到的assembly.fasta文件和Trinity.fasta相加的结果。个人觉得这样做没什么意义！虽然，文档写道这样做的意义是有利于下游基因差异表达的分析。还不如就使用Trinity_GG.fasta单独去运行PASA。

3. Annotation Comparisons and Annotation Updates

Incorporating PASA Assemblies into Existing Gene Predictions, Changing Exons, Adding UTRs and Alternatively Spliced Models

The PASA software can update any preexisting set of protein-coding gene annotations to incorporate the PASA alignment evidence, correcting exon boundaries, adding UTRs, and models for alternative splicing based on the PASA alignment assemblies generated above.

步骤如下：

3.1 将protein-coding gene annotations上载到数据库

验证gff3文件的兼容性，如果会报告错误和警示，则需要修改gff3文件.

$ $PASA_HOME/misc_utilities/pasa_gff3_validator.pl orig_annotations_sample.gff3

将基因注释文件orig_annotations_sample.gff3上载到数据库。note that you should only feed protein-coding genes to PASA using the loader。

$ $PASA_HOME/scripts/Load_Current_Gene_Annotations.dbi -c alignAssembly.config -g genome_sample.fasta -P orig_annotations_sample.gff3

继续检查一遍兼容性，上载之后，则不会有任何返回到标准输出。

3.2 Performing an annotation comparison and generating an updated gene set

运行PASA主程序，使用注释的config文件(将其中的MYSQLDB的值更改的和比对的config文件一致)

$PASA_HOME/scripts/Launch_PASA_pipeline.pl -c annotCompare.config -A -g genome_sample.fasta -t all_transcripts.fasta.clean

3.3 运行结果

Once the annotation comparison is complete, PASA will output a new GFF3 file that contains the PASA-updated version of the genome annotation, including those gene models successfully updated by PASA, and those that remained untouched. This file will be named ${mysql_db}.gene_structures_post_PASA_updates.$pid.gff3, where $pid is the process ID for this annotation comparison computation.

当然，也可以再次看看PASA的Web Portal.报告已经发生了改变。

4. Identification and Classification of All Alternative Splicing Variations

在PASA进行transcripts比对到genome时，可以进行可变剪切预测。加入选项–ALT_SPLICE。

5. Extraction of ORFs from PASA assemblies (reference ORFs for training gene predictors)

PASA可以从用于自动提取编码蛋白基因的基因结构，用于训练基因集合，来用于其它基因预测软件。To do this, the longest ORF (min 100 amino acids, configurable) is found within each PASA alignment assembly, allowing for 5′ and 3′ partials. The top 500 longest ORFs (configurable) are extracted and, using these in addition to the full set of genome sequences, log-likelihood scores for a hexamer-based Markov model are computed. All ORFs are scored using this Markov model, and those ORFs that are most likely coding are reported as best. The scoring method for ORF identification is similar to that used in the GeneID software and nicely described in the methods section of the GeneID publication.

在运行过PASA过后，则可以进行ORFs的提取：

$ export PERL5LIB $PERL5LIB:$PASAHOME/PerlLib/

$ $PASAHOME/scripts/pasa_asmbls_to_training_set.dbi -M "$mysql_db_name:$mysql_server_name" -p "$user:$password" -g genome_sample.fasta

运行结果：

trainingSetCandidates.cds,.pep,.gff3:corresponds to all longest ORFs of minimum length extracted from the PASA alignment assemblies.

trainingSetCandidates.cds.top_500_longest :corresponds to the 500 top longest ORFs, thought to be most likely to represent correct protein-coding genes.

trainingSetCandidates.cds.scores :scores for all ORFs based on the Markov model.

best_candidates.cds,.pep,.gff3,.bed:the best candidates based on their coding scores compared to random sequences of the same nucleotide composition.

gff3转mysql_PASA的使用相关推荐

gff3转mysql_五月 | 2013 | 陈连福的生信博客
1. GBrowse的安装 GBrowse安装说明文档:http://gmod.org/wiki/GBrowse_2.0_Install_HOWTO GBrowse的安装很少有能顺利安装成功的.需要不 ...
gtf与gff3文件【格式】【转换】
GFF3 官方 General Feature Format Version 3 存储序列结构信息的一种数据格式.序列结构就是一个scaffold或者染色体上面每个位置都是什么序列元件. GFF每一行 ...
gff3转mysql_科学网-把GFF3文件导入MySQL数据库-闫双勇的博文
什么是GFF3?这个一种序列注释文件的格式,基因组注释数据常常会用这种格式来记录序列注释信息,关于这种格式的更多信息,可以在这里学习:http://www.sequenceontology.org/g ...
Genbank的gbff格式转gff3格式
使用方法: [以下操作适用于linux和 MacOS,windows暂未测试] 非软件计算机等专业或者搞不定运行环境的同学也可以参考另一篇稍微简单点的python脚本:Genbank的gbff格式转g ...
Genbank的gbff格式转gff3格式(补充)
前言之前写过的Genbank的gbff格式转gff3格式的运行环境对于很多非计算机行业的同学不是很友好,在帮助一位网友处理格式转换之后,整理了下面这个python转换的方式(注:本人的生物信息等知识 ...
用 Python 从 GFF3 格式文件中查找注释信息
作者:SunCOOL ,写python的农学生. 什么是GFF3格式文件 GFF全称为general feature format,3表示是第三个版本,这种格式主要是用来注释基因组,由tab分割,共9 ...
常用生物信息学格式介绍(fasta、fastq、gff2、gtf(gff2.5)、gff3、bed、sam、bam、vcf)
前言在各个行业都是有行业标准的,这样才能统一规范而方便后面的分析,在生物信息学领域中主要是各种大量序列数据.注释数据等,这些都是有特定的格式去表示,下面列举几种常见的格式.了解这些是进行后续生物信息 ...
GFF3提取fasta序列
awk awk '{print ">"$9"\n"$10}' file.gff3 Python fr = open('./file.gff3', 'r') ...
gff3转mysql_科学网—把GFF3文件导入MySQL数据库 - 闫双勇的博文
什么是GFF3?这个一种序列注释文件的格式,基因组注释数据常常会用这种格式来记录序列注释信息,关于这种格式的更多信息,可以在这里学习:http://www.sequenceontology.org/g ...

gff3转mysql_PASA的使用

gff3转mysql_PASA的使用相关推荐

最新文章

热门文章