测试数据下载

wget -c ftp://ftp.ensembl.org/pub/release-87/gtf/homo_sapiens/Homo_sapiens.GRCh38.87.chr.gtf.gz

先验知识

代码解读

实现了哪些功能？

1）读取GTF文件

2）根据feature，将其分为3种数据类型进行保存，即gene、transcript、exon。

每一种数据类型，分别构建一种对象，每种对象所具有的属性定义如下，

Object gene包含的属性有：所属的染色体，起始位置，终止位置，正负链信息，gene id（对应GTF文件中的gene_id）
Object transcript包含的属性有：所属的染色体，起始位置，终止位置，transcript id（对应GTF文件中的transcript_id），parent id（对应GTF文件中的gene_id）
Object exon包含的属性有：所属的染色体，起始位置，终止位置，parent id（对应GTF文件中的transcript_id）

3）数据处理

统计每一条染色体上的gene总数
计算每一个gene的长度
统计每个gene的转录本数
记录每个exon的坐标

代码设计思路

将GTF文件的每一行看作一个对象来处理，构建不同类型的对象对信息进行存储，同时由于不同的对象之间存在信息交集，这就是类继承的应用之处。

Python代码

在正则表达式上，我进行了一定修改。

'''
Date: 2022.8.14
Description: Kind of a collection of tools to parse and analyze GTF
Main Function:
- count gene number of each chromosome
- count transcritp number of each gene
- calculate the total number of exon and intron of each transcript
- plot the result
'''
import sys
import reargs = sys.argv  # Set argv to grep the ipnut. argv is a list, e.g. [, , ]class Genome_info():def __init__(self):self.chr = ""self.start = 0self.end = 0class Gene(Genome_info):'''Note: Object Gene is inherited from Genome_infoGene has two attributes,1) orientation2) id (gene id)'''def __init__(self):Genome_info.__init__(self)self.orientation = ""self.id = ""class Transcript(Genome_info):'''Note: Object Transcript is inherited from Genome_infoTranscript has two attributes,1) id (transcript id)2) parent (gene id)'''def __init__(self):Genome_info.__init__(self)self.id = ""self.parent = ""class Exon(Genome_info):'''Note: Object Exon is inherited from Genome_infoTranscript has one attributes,2) parent (?)'''def __init__(self):Genome_info.__init__(self)self.parent = ""'''
Define a main function to handle gtf file
'''
def main(args):list_chr = []list_gene = {}list_transcript = {}list_exon = []# l_n = 0with open(args[1]) as fp_gtf:for line in fp_gtf:if line.startswith("#"):continue# print ("in %s" % l_n)# l_n += 1lines = line.strip("\n").split("\t")chr = lines[0]type = lines[2]start = int(lines[3])end = int(lines[4])orientation = lines[6]attr = lines[8]if not re.search(r'protein_coding', attr):   # 只针对protein_coding进行统计分析continueif not chr in list_chr :list_chr.append(chr)if type == "gene":gene = Gene()id = re.search(r'gene_id "([^;]+)";?', attr).group(1)  # Original version, could be adapted to r'gene_id "(\S+)";'# id = re.search(r'gene_id "(\S+)";', attr).group(1)      # My adaptationgene.chr = chrgene.start = startgene.end = endgene.id = idgene.orientation = orientationlist_gene[id] = gene  # list_gene is a dict to depost gene id and its correspoding Gene Object# print(id)elif type == "transcript":transcript = Transcript()id = re.search(r'transcript_id "([^;]+)";?', attr).group(1)# id = re.search(r'transcript_id "(\S+)";', attr).group(1)      # My adaptationparent = re.search(r'gene_id "([^;]+)";?', attr).group(1)# id = re.search(r'gene_id "(\S+)";', attr).group(1)      # My adaptationif not parent in list_gene:continuetranscript.chr = chrtranscript.start = starttranscript.end = endtranscript.id = idtranscript.parent = parentlist_transcript[id] = transcriptelif type == "exon":exon = Exon()parent = re.search(r'transcript_id "([^;]+)";?', attr).group(1)# parent = re.search(r'transcript_id "(\S+)";', attr).group(1) # My adaptationif not parent in list_transcript:continueexon.chr = chrexon.start = startexon.end = endexon.parent = parentlist_exon.append(exon)chr_gene(list_gene)gene_len(list_gene)gene_transcript(list_transcript)transcript_exon(list_exon)exon_pos(list_exon)def chr_gene(list_gene):print("-------------------------------- Count gene number of each chromosome --------------------------------")count_gene = {}for info in list_gene.values():chr = info.chrif chr in count_gene:count_gene[info.chr] += 1else:count_gene[info.chr] = 1with open("chr_gene.txt", 'w') as fp_out:for chr, num in count_gene.items():print("\t".join([chr, str(num)]) + "\n")fp_out.write("\t".join([chr, str(num)]) + "\n")def gene_len(list_gene):print("-------------------------------- Calculate the length of each gene --------------------------------")with open("gene_len.txt", 'w') as fp_out:for gene_id, info in list_gene.items():len = info.end - info.start + 1fp_out.write("\t".join([gene_id, str(len)]) + "\n")# print("\t".join([gene_id, str(len)]) + "\n")def gene_transcript(list_transcript):print("-------------------------------- Count transcript number of each gene --------------------------------")count_transcript = {}for info in list_transcript.values():gene_id = info.parentif gene_id in count_transcript:count_transcript[gene_id] += 1else:count_transcript[gene_id] = 1with open("gene_transcript.txt", 'w') as fp_out:for gene_id, num in count_transcript.items():# print("\t".join([gene_id, str(num)]) + "\n")fp_out.write("\t".join([gene_id, str(num)]) + "\n")def transcript_exon(list_exon):print("-------------------------------- Count exon number of each transcript --------------------------------")count_exon = {}for exon in list_exon:transcript_id = exon.parentif transcript_id in count_exon:count_exon[transcript_id] += 1else:count_exon[transcript_id] = 1with open("transcript_exon.txt", 'w') as fp_out:for transcript_id, num in count_exon.items():# print("\t".join([transcript_id, str(num)]) + "\n")fp_out.write("\t".join([transcript_id, str(num)]) + "\n")def exon_pos(list_exon):print("-------------------------------- Save the position info of each exon --------------------------------")count_exon = {}for exon in list_exon:transcript_id = exon.parentif transcript_id in count_exon:count_exon[transcript_id] += ",%s-%s" % (str(exon.start), str(exon.end))else:count_exon[transcript_id] = "%s-%s" % (str(exon.start), str(exon.end))with open("exon_pos.txt", 'w') as fp_out:for transcript_id, pos in count_exon.items():print("\t".join([transcript_id, pos]) + "\n")fp_out.write("\t".join([transcript_id, pos]) + "\n")def gene_exon_pos(list_gene, list_transcript, list_exon):passif __name__ == "__main__":main(args)

代码运行

python save.py input.gtf

小记

暑假在编写脚本的过程中，感觉自己Python其实进步了不少，但是还没能把数据结构、面向对象编程很好的整合起来。

因为本科阶段确实没有什么自己编写脚本去分析数据的需求，就算自己创造需求了，也没有实际应用到工作当中，因此刻意练习是很重要的。

有意思的故事是，去年有幸听到了坦哥面试学生问的问题，“解释一下类”，

要是当时的我，碰到了这个问题，那我是真答不上来。

参考资料

[1] http://www.biotrainee.com/thread-626-1-1.html【作者：dongye】

「像读文献一样读代码」第一期：如何解析GTF文件进行统计分析？相关推荐

读文献先读图——主成分分析 PCA 图
上周五彩斑斓的气泡图有让你眼花缭乱吗? 本周,化繁为简的PCA图你值得拥有! 数据分析| 科研制图﹒PCA 图关键词:主成分分析.降维 1665 年的鼠疫牛顿停课在家提出了万有引力 ; 18 ...
基于蚂蚁金服「如何管理好10万行代码」搭建了 Vue 项目架构
此文是对蚂蚁金服文章的解读,所以要看懂此文一定要先去看原文:如何管理好10万行代码的前端单页面应用. 当时看到蚂蚁金服这篇文章有点茅塞顿开,只不过他们是基于 React 技术栈开发的,但是架构是一种思 ...
人人都能读懂的「以太坊2.0分片设计」
讨论 | 吴为龙.李画撰文 | 李画来源 | 碳链价值封图由 CSDN 下载于东方 IC 当我们在7-11买早餐的时候,如果只有一个收银员,就要排很长的队等待结帐:如果有两个收银员,立刻就会快一 ...
用 Electron 打造 Win/Mac 应用，从「代码」到可下载的「安装包」，可能比你想得麻烦一点...
首发于酷家乐前端博客,作者@摘星(segmentfault @StinsonZhao) 我们能从很多地方学习到怎么起一个 Electron 项目,有些还会介绍怎么打包或构建你的代码,但距离「真正地发行 ...
用 Electron 打造 Win/Mac 应用，从「代码」到可下载的「安装包」，可能比你想得麻烦一点... 1
2019独角兽企业重金招聘Python工程师标准>>> 首发于酷家乐前端博客我们能从很多地方学习到怎么起一个 Electron 项目,有些还会介绍怎么打包或构建你的代码,但距离「真 ...
王垠的「40 行代码」真如他说的那么厉害吗？
"我有什么资格说话呢?如果你要了解我的本事,真的很简单:我最精要的代码都放在 GitHub 上了.但是除非接受过专门的训练,你绝对不会理解它们的价值.你会很难想象,这样一片普通人看起来像是玩 ...
王垠的「40 行代码」
"我有什么资格说话呢?如果你要了解我的本事,真的很简单:我最精要的代码都放在 GitHub 上了.但是除非接受过专门的训练,你绝对不会理解它们的价值.你会很难想象,这样一片普通人看起来像是玩 ...
用 Electron 打造 Win/Mac 应用，从「代码」到可下载的「安装包」，可能比你想得麻烦一点
首发于酷家乐前端博客,作者@摘星(segmentfault @StinsonZhao) 我们能从很多地方学习到怎么起一个 Electron 项目,有些还会介绍怎么打包或构建你的代码,但距离「真正地发行 ...
【读文献笔记】图神经网络加速结构综述
[读文献笔记]图神经网络加速结构综述前言一.图神经网络来源 1.图神经网络用途 2.图神经网络特点 3.图神经网络主要阶段 4.图神经网络加速面临的挑战 5.本笔记内容包含内容二.图与图神经网络 ...

「像读文献一样读代码」第一期：如何解析GTF文件进行统计分析？