read_counts转FPKM（基于gtf和read_counts文件）（exon）

大家可以看最新版https://blog.csdn.net/qq_26012913/article/details/111939262?spm=1001.2014.3001.5501
首先我们要把gtf文件中的exon抓取出来

grep "exon" genome.gtf > genome_exon.gtf

然后提取genome_exon.gtf文件中的gene的exon的长度和得到我们想要的gene的长度

python count_genelen_from_gft.py genome_exon.gtf gene.len

这其中count_genelen_from_gft.py的代码如下：

import sys,re
file1 = sys.argv[1]
file2 = sys.argv[2]
f1 = open(file1,'r')
f2 = open(file2,'w')
flag = "fuck"
exon = []
for i in f1:a = i.split("\"")if flag == a[-2]:pos = i.split("\t")exon.append(abs(int(pos[4])-int(pos[3]))+1)elif flag == "fuck":flag = a[-2]pos = i.split("\t")exon.append(abs(int(pos[4])-int(pos[3]))+1)else:f2.write("{0}\t{1}\n".format(flag,sum(exon)))exon = []flag = a[-2]pos = i.split("\t")exon.append(abs(int(pos[4])-int(pos[3]))+1)
f1.close()
f2.close()

就此我们得到了单个基因的长度，存在gene.len文件中eg:
MIM04M24Gene00599 2898 MIM04M24Gene00600 1035 MIM04M24Gene08324 588 MIM04M24Gene08325 468 MIM04M26Gene00001 1770 MIM04M26Gene00002 930 MIM04M26Gene00003 594 MIM04M26Gene00004 426 MIM04M26Gene00005 1002 MIM04M26Gene00006 792 MIM04M26Gene00007 1125 MIM04M26Gene00008 4041 MIM04M26Gene00009 6537 MIM04M26Gene00010 309 MIM04M26Gene00011 1293 MIM04M26Gene00012 282 MIM04M26Gene00013 765 MIM04M26Gene00014 1680 MIM04M26Gene00015 1134 MIM04M26Gene00016 648
我们还要提取准备一下我们每个样本的mapped_reads数的文件，内容如下：

Total Mapped reads      reads number
A1A     18836863
A1B     15478037
A1C     19394549
A2A     19976617
A2B     15964986
A2C     19685810
A3A     18080220
A3B     16627794
A3C     20205794
A4A     16867356
A4B     16409921
A4C     19966924
A5A     17322230
A5B     15118648
A5C     19086094
A6A     17352130
A6B     16489332
A6C     19940296

然后我再展示一下我的read_counts矩阵文件，我的文件名为：raw_counts.matrix
文件内容eg：

        A1A     A1B     A1C     A2A     A2B     A2C     A3A     A3B     A3C     A4A     A4B     A4C     A5A     A5B     A5C     A6A     A6B     A6C
MIM04M24Gene00599       334     179     300     532     261     376     238     284     312     306     191     260     105     187     191     204     177
MIM04M24Gene00600       98      58      80      134     84      122     44      47      65      20      23      27      9       16      16      51      12
MIM04M24Gene08324       13      7       16      19      11      16      15      12      30      19      16      16      11      8       15      29      16
MIM04M24Gene08325       18      18      13      18      21      25      37      30      45      26      32      36      23      22      28      56      31
MIM04M26Gene00001       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0

以这三个文件作为输入，我们就能通过脚本得到FPKM矩阵

python Caculate_FPKM.py mapped_gene_number.txt gene.len raw_counts.matrix FPKM.matrix

其中的Caculate_FPKM.py脚本内容贴下:

import sys,re
file1 = sys.argv[1]
file2 = sys.argv[2]
file3 = sys.argv[3]
file4 = sys.argv[4]
f1 = open(file1,'r')
f2 = open(file2,'r')
f3 = open(file3,'r')
f4 = open(file4,'w')
a = []
arrf1 = []
dickf2 = {}
dickf3 = {}
for i in f1:i = i.strip("\n")if re.match('A',i):a = i.split("\t")arrf1.append(int(a[1]))else:continue
f1.close()
for i in f2:i = i.strip("\n")a = i.split("\t")dickf2[a[0]] = int(a[1])
f2.close()
for i in f3:i = i.strip("\n")if re.match("M",i):a = i.split("\t")dickf3[a[0]] = a[1:19]else:f4.write(i)
f3.close()
for i in dickf3.keys():f4.write(i+"\t")for j in range(0,18):a = int(dickf3[i][j])#print(a)try:b = (a*1000000.0)/(arrf1[j]*(dickf2[i]/1000.0))except ZeroDivisionError:b = 0except KeyError:continuef4.write("{}".format(b))f4.write("\t")f4.write("\n")
f4.close()

最后我做一个完整的傻瓜式脚本，只要大家准备好gtf文件、mapped_reads文件、read_counts文件和两个python脚本到一个目录下跑就行了

总脚本如下：

grep "exon" genome.gtf > genome_exon.gtf
python count_genelen_from_gft.py genome_exon.gtf gene.len
python Caculate_FPKM.py mapped_gene_number.txt gene.len raw_counts.matrix FPKM.matrix

希望能对大家有所帮助，有困难可以给我发邮件1193226980@qq.com

read_counts转FPKM（基于gtf和read_counts文件）（exon）相关推荐

基于LZ77算法的文件解压缩项目缺陷分析
基于Huffman算法和LZ77算法的文件压缩(七) 基于Huffman算法和LZ77算法的文件压缩(六)已经讲解完文件压缩的过程,本文讲解文件解压缩的过程和大文件处理方式一.解压缩的流程 LZ77 ...
基于LZ77算法的文件压缩铺垫
基于Huffman算法和LZ77算法的文件压缩(四) 本文开始讲解LZ77算法,会用到哈希,哈希原理详解我们在基于Huffman算法和LZ77算法的文件压缩(一)当中总体介绍了Huffman算法和L ...
java socket 传输压缩文件_java基于socket传输zip文件功能示例
本文实例讲述了java基于socket传输zip文件的方法.分享给大家供大家参考,具体如下: 服务器端程序: import java.io.*; import java.net.*; import j ...
python分行_基于python实现对文件进行切分行
针对配置文件进行切分,重组,每隔30行为一段,进行重新生成功能. 代码如下 #!/usr/local/python/bin/python # coding=utf-8 import sys impor ...
python上传文件到onedrive_基于Python的onedrive文件本地化浏览系统–PyOne
基于Python的onedrive文件本地化浏览系统–PyOne PyOne是一款基于Python-Flask的onedrive文件本地化浏览系统,使用MongoDB储存文件列表,使用redis缓存数 ...
dom4j工具类_基于DOM4J的XML文件解析类
XML文件解析分四类方式:DOM解析:SAX解析:JDOM解析:DOM4J解析.其中前两种属于基础方法,是官方提供的平台无关的解析方式:后两种属于扩展方法,它们是在基础的方法上扩展出来的,只适用于ja ...
linux tcp文件分包_在Linux下基于TCP协议的文件传输程序.
[设计目的] 通过 Linux C 编程,设计一个基于 TCP/IP 的文件传输系统,实现网络文件的收发 [设计环境] Ubuntu 12.04 [设计方案] ( 1 )文件读写任意文件都可以二进制 ...
ofd转成html,基于HTML5的OFD文件在线显示的方法以及装置与流程
技术特征: 1.一种基于HTML5的OFD文件在线显示的方法,其特征在于:包括如下步骤: 步骤1.服务器端将OFD文件压缩包进行解压,并将解压后得到的OFD文档目录结构映射至HTML5客户端的URL: ...
基于Java的NetCDF文件解析
近期在做的项目中,需要使用Java语言进行NetCDF文件的解析. 然而,当在寻找资料时,发现基于Java语言的资料相较于Python少了很多,而且现有的基于Java解析NetCDF文件到CSV的资料 ...

read_counts转FPKM（基于gtf和read_counts文件）（exon）

read_counts转FPKM（基于gtf和read_counts文件）（exon）相关推荐

最新文章

热门文章