An Introduction to Bedtools, a Toolkit for High-Throughput Sequencing Analysis

http://blog.genesino.com/2018/04/bedtools/

Bedtools is a powerful collection of tools for analyzing genomic intervals. Its main functions are listed below:

bedtools: flexible tools for genome arithmetic and DNA sequence analysis.
usage:    bedtools <subcommand> [options]

The bedtools sub-commands include:

[ Genome arithmetic ]
    intersect     Find overlapping intervals in various ways.
                  # Intersect regions: annotate peaks, find the genomic regions reads map to,
                  # or measure how peaks overlap between samples.
    window        Find overlapping intervals within a window around an interval.
    closest       Find the closest, potentially non-overlapping interval.
                  # Find the nearest region, which may not overlap.
    coverage      Compute the coverage over defined intervals.
                  # Compute coverage of regions.
    map           Apply a function to a column for each overlapping interval.
    genomecov     Compute the coverage over an entire genome.
    merge         Combine overlapping/nearby intervals into a single interval.
                  # Merge overlapping or adjacent regions.
    cluster       Cluster (but don't merge) overlapping/nearby intervals.
    complement    Extract intervals _not_ represented by an interval file.
                  # Get the complementary regions.
    subtract      Remove intervals based on overlaps b/w two files.
                  # Compute the difference between region sets.
    slop          Adjust the size of intervals.
                  # Resize regions, e.g. take 3 kb upstream and downstream of a transcription start site.
    flank         Create new intervals from the flanks of existing intervals.
    sort          Order the intervals in a file.
                  # Sort; some commands require sorted BED files.
    random        Generate random intervals in a genome.
                  # Generate random regions to use as a background set.
    shuffle       Randomly redistrubute intervals in a genome.
                  # Generate random regions based on a given BED file, to use as a background set.
    sample        Sample random records from file using reservoir sampling.
    spacing       Report the gap lengths between intervals in a file.
    annotate      Annotate coverage of features from multiple files.

[ Multi-way file comparisons ]
    multiinter    Identifies common intervals among multiple interval files.
    unionbedg     Combines coverage intervals from multiple BEDGRAPH files.

[ Paired-end manipulation ]
    pairtobed     Find pairs that overlap intervals in various ways.
    pairtopair    Find pairs that overlap other pairs in various ways.

[ Format conversion ]
    bamtobed      Convert BAM alignments to BED (& other) formats.
    bedtobam      Convert intervals to BAM records.
    bamtofastq    Convert BAM records to FASTQ records.
    bedpetobam    Convert BEDPE intervals to BAM records.
    bed12tobed6   Breaks BED12 intervals into discrete BED6 intervals.

[ Fasta manipulation ]
    getfasta      Use intervals to extract sequences from a FASTA file.
                  # Extract the FASTA sequence at given positions.
    maskfasta     Use intervals to mask sequences from a FASTA file.
    nuc           Profile the nucleotide content of intervals in a FASTA file.

[ BAM focused tools ]
    multicov      Counts coverage from multiple BAMs at specific intervals.
    tag           Tag BAM alignments based on overlaps with interval files.

[ Statistical relationships ]
    jaccard       Calculate the Jaccard statistic b/w two sets of intervals.
                  # Measure the similarity of two datasets.
    reldist       Calculate the distribution of relative distances b/w two files.
    fisher        Calculate Fisher statistic b/w two feature files.

[ Miscellaneous tools ]
    overlap       Computes the amount of overlap from two intervals.
    igv           Create an IGV snapshot batch script.
                  # Generate a script that captures IGV screenshots in batch.
    links         Create a HTML page of links to UCSC locations.
    makewindows   Make interval "windows" across a genome.
                  # Split given regions into small intervals (bins) of a chosen size and step.
    groupby       Group by common cols. & summarize oth. cols. (~ SQL "groupBy")
                  # Grouped computation; not limited to BED files.
    expand        Replicate lines based on lists of values in columns.
    split         Split a file into multiple files with equal records or base pairs.

Installing bedtools

(See: installing software on Linux with Conda)

ct@ehbio:~$ conda install bedtools

Getting the test dataset (http://quinlanlab.org/tutorials/bedtools/bedtools.html)

ct@ehbio:~$ mkdir bedtools
ct@ehbio:~$ cd bedtools
ct@ehbio:~/bedtools$ url=https://s3.amazonaws.com/bedtools-tutorials/web
ct@ehbio:~/bedtools$ curl -O ${url}/maurano.dnaseI.tgz
ct@ehbio:~/bedtools$ curl -O ${url}/cpg.bed
ct@ehbio:~/bedtools$ curl -O ${url}/exons.bed
ct@ehbio:~/bedtools$ curl -O ${url}/gwas.bed
ct@ehbio:~/bedtools$ curl -O ${url}/genome.txt
ct@ehbio:~/bedtools$ curl -O ${url}/hesc.chromHmm.bed

Intersection (intersect)

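Start by looking at the input files. They are in BED format, which has at least three columns: chromosome, start position (0-based, inclusive), and end position (0-based, exclusive; equivalently, 1-based and inclusive). The fourth column is usually the region name, the fifth is a score (often just a placeholder such as 0), and the sixth is the strand. For a more detailed explanation see http://www.genome.ucsc.edu/FAQ/FAQformat.html#format1.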
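The half-open convention is easy to get wrong, so here is a minimal Python sketch of how a BED record maps to its length and to 1-based closed coordinates. The helper names `bed_length` and `bed_to_one_based` are made up for illustration and are not part of bedtools; the coordinates are those of the first record in exons.bed.

```python
def bed_length(start, end):
    """Length of a BED interval: start is 0-based inclusive, end is exclusive."""
    return end - start


def bed_to_one_based(start, end):
    """Convert 0-based half-open BED coordinates to 1-based closed coordinates."""
    return start + 1, end


# First record of exons.bed: chr1 11873 12227 NR_046018_exon_0_0_chr1_11874_f
print(bed_length(11873, 12227))        # 354
print(bed_to_one_based(11873, 12227))  # (11874, 12227)
```

Note that the 1-based start, 11874, is the number embedded in the exon name.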

For your own research, CpG island annotations can be downloaded from the UCSC Table Browser; for the steps see http://blog.genesino.com/2013/05/ucsc-usages/.

ct@ehbio:~/bedtools$ head -n 3 cpg.bed exons.bed
==> cpg.bed <==
chr1    28735   29810   CpG:_116
chr1    135124  135563  CpG:_30
chr1    327790  328229  CpG:_29

==> exons.bed <==
chr1    11873   12227   NR_046018_exon_0_0_chr1_11874_f 0   +
chr1    12612   12721   NR_046018_exon_1_0_chr1_12613_f 0   +
chr1    13220   14409   NR_046018_exon_2_0_chr1_13221_f 0   +

Get the overlapping regions (regions that are both exon and CpG island)

ct@ehbio:~/bedtools$ bedtools intersect -a cpg.bed -b exons.bed | head -5
chr1    29320   29370   CpG:_116
chr1    135124  135563  CpG:_30
chr1    327790  328229  CpG:_29
chr1    327790  328229  CpG:_29
chr1    327790  328229  CpG:_29
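The core of an intersection is simple: the overlap of two half-open intervals starts at the larger start and ends at the smaller end. The sketch below shows only that idea, not how bedtools is implemented; `overlap` is a hypothetical helper, and the exon coordinates are taken from the -wa -wb output shown further down in this post.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Return the overlapping (start, end) of two half-open intervals, or None."""
    start = max(a_start, b_start)
    end = min(a_end, b_end)
    return (start, end) if start < end else None


# CpG:_116 (28735-29810) against exon NR_024540_exon_10 (29320-29370)
region = overlap(28735, 29810, 29320, 29370)
print(region)                 # (29320, 29370)
print(region[1] - region[0])  # 50

# CpG:_30 (135124-135563) lies entirely inside exon NR_039983_exon_0 (134772-139696)
print(overlap(135124, 135563, 134772, 139696))  # (135124, 135563)
```

The first result matches the first line of the bedtools output above; the length, 50, is what -wo reports as the number of overlapping bases.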

Report the original records behind each overlap (each CpG island that intersects an exon, together with that exon)

ct@ehbio:~/bedtools$ bedtools intersect -a cpg.bed -b exons.bed -wa -wb | head -5

chr1 28735 29810 CpG:_116 chr1 29320 29370 NR_024540_exon_10_0_chr1_29321_r 0 -
chr1 135124 135563 CpG:_30 chr1 134772 139696 NR_039983_exon_0_0_chr1_134773_r 0 -
chr1 327790 328229 CpG:_29 chr1 324438 328581 NR_028322_exon_2_0_chr1_324439_f 0 +
chr1 327790 328229 CpG:_29 chr1 324438 328581 NR_028325_exon_2_0_chr1_324439_f 0 +
chr1 327790 328229 CpG:_29 chr1 327035 328581 NR_028327_exon_3_0_chr1_327036_f 0 +

Count the overlapping bases (the last column of the -wo output)

ct@ehbio:~/bedtools$ bedtools intersect -a cpg.bed -b exons.bed -wo | head -10

chr1 28735 29810 CpG:_116 chr1 29320 29370 NR_024540_exon_10_0_chr1_29321_r 0 - 50
chr1 135124 135563 CpG:_30 chr1 134772 139696 NR_039983_exon_0_0_chr1_134773_r 0 - 439
chr1 327790 328229 CpG:_29 chr1 324438 328581 NR_028322_exon_2_0_chr1_324439_f 0 + 439
chr1 327790 328229 CpG:_29 chr1 324438 328581 NR_028325_exon_2_0_chr1_324439_f 0 + 439
chr1 327790 328229 CpG:_29 chr1 327035 328581 NR_028327_exon_3_0_chr1_327036_f 0 + 439
chr1 713984 714547 CpG:_60 chr1 713663 714068 NR_033908_exon_6_0_chr1_713664_r 0 - 84
chr1 762416 763445 CpG:_115 chr1 761585 762902 NR_024321_exon_0_0_chr1_761586_r 0 - 486
chr1 762416 763445 CpG:_115 chr1 762970 763155 NR_015368_exon_0_0_chr1_762971_f 0 + 185
chr1 762416 763445 CpG:_115 chr1 762970 763155 NR_047519_exon_0_0_chr1_762971_f 0 + 185
chr1 762416 763445 CpG:_115 chr1 762970 763155 NR_047520_exon_0_0_chr1_762971_f 0 + 185

For each region in the first (-a) BED file, count how many regions in the second (-b) BED file overlap it

ct@ehbio:~/bedtools$ bedtools intersect -a cpg.bed -b exons.bed -c | head
chr1    28735   29810   CpG:_116    1
chr1    135124  135563  CpG:_30 1
chr1    327790  328229  CpG:_29 3
chr1    437151  438164  CpG:_84 0
chr1    449273  450544  CpG:_99 0
chr1    533219  534114  CpG:_94 0
chr1    544738  546649  CpG:_171    0
chr1    713984  714547  CpG:_60 1
chr1    762416  763445  CpG:_115    10
chr1    788863  789211  CpG:_28 9

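In addition, -v reports the regions in -a that overlap nothing in -b, -f sets the minimum overlap as a fraction of the -a region, and -sorted speeds things up for files already sorted with sort -k1,1 -k2,2n.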
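To make the meaning of these options concrete, here is a small Python sketch of the logic behind -c, -v and -f for a single -a region. It illustrates the idea only and makes no claim about bedtools internals; `overlap_len` and `count_overlaps` are hypothetical names, and the exon list is a hand-picked subset of the distinct exon coordinates shown in the -wo output above.

```python
def overlap_len(a, b):
    """Number of bases shared by two (start, end) half-open intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def count_overlaps(a, b_list, min_frac=0.0):
    """Count intervals in b_list that overlap a by at least one base and by
    at least min_frac of a's length (the idea behind -c combined with -f)."""
    a_len = a[1] - a[0]
    count = 0
    for b in b_list:
        shared = overlap_len(a, b)
        if shared > 0 and shared >= min_frac * a_len:
            count += 1
    return count


exons = [(713663, 714068), (761585, 762902), (762970, 763155)]
cpg_115 = (762416, 763445)   # 1029 bp
cpg_84 = (437151, 438164)

print(count_overlaps(cpg_115, exons))                # 2
print(count_overlaps(cpg_115, exons, min_frac=0.4))  # 1 (only the 486 bp overlap passes)
print(count_overlaps(cpg_84, exons) == 0)            # True: -v would report this region
```

The counts differ from the bedtools output above because the real exons.bed contains many more records, often several transcripts sharing the same exon coordinates.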

Intersecting against several files at once (useful for annotating peaks along multiple dimensions)

# -names: label each hit with the annotation source it came from
# -sorted: if this option is used, the BED files supplied must already be sorted
ct@ehbio:~/bedtools$ bedtools intersect -a exons.bed \
    -b cpg.bed gwas.bed hesc.chromHmm.bed -sorted -wa -wb -names cpg gwas chromhmm \
    | head -10000  | tail -10

chr1 27632676 27635124 NM_001276252_exon_15_0_chr1_27632677_f chromhmm chr1 27633213 27635013 5_Strong_Enhancer
chr1 27632676 27635124 NM_001276252_exon_15_0_chr1_27632677_f chromhmm chr1 27635013 27635413 7_Weak_Enhancer
chr1 27632676 27635124 NM_015023_exon_15_0_chr1_27632677_f chromhmm chr1 27632613 27632813 6_Weak_Enhancer
chr1 27632676 27635124 NM_015023_exon_15_0_chr1_27632677_f chromhmm chr1 27632813 27633213 7_Weak_Enhancer
chr1 27632676 27635124 NM_015023_exon_15_0_chr1_27632677_f chromhmm chr1 27633213 27635013 5_Strong_Enhancer
chr1 27632676 27635124 NM_015023_exon_15_0_chr1_27632677_f chromhmm chr1 27635013 27635413 7_Weak_Enhancer
chr1 27648635 27648882 NM_032125_exon_0_0_chr1_27648636_f cpg chr1 27648453 27649006 CpG:_63
chr1 27648635 27648882 NM_032125_exon_0_0_chr1_27648636_f chromhmm chr1 27648613 27649413 1_Active_Promoter
chr1 27648635 27648882 NR_037576_exon_0_0_chr1_27648636_f cpg chr1 27648453 27649006 CpG:_63
chr1 27648635 27648882 NR_037576_exon_0_0_chr1_27648636_f chromhmm chr1 27648613 27649413 1_Active_Promoter

Merging regions (merge)

bedtools merge expects a BED file sorted with sort -k1,1 -k2,2n.

A single sorted BED file is all it needs. By default it merges overlapping or adjacent (book-ended) regions.

ct@ehbio:~/bedtools$ bedtools merge -i exons.bed | head -n 5
chr1    11873   12227
chr1    12612   12721
chr1    13220   14829
chr1    14969   15038
chr1    15795   15947

Merge regions and report how many input regions each merged region was built from

ct@ehbio:~/bedtools$ bedtools merge -i exons.bed -c 1 -o count | head -n 5
chr1    11873   12227   1
chr1    12612   12721   1
chr1    13220   14829   2
chr1    14969   15038   1
chr1    15795   15947   1

Merge regions that lie within 340 nt of each other, and report which regions were merged

# -c: the columns to operate on
# -o: paired with -c; the operation to apply to each of those columns
# Here, column 1 is counted, giving the number of regions that were merged into each output region,
# and column 4 is collapsed, listing the names of the merged regions separated by commas.
ct@ehbio:~/bedtools$ bedtools merge -i exons.bed -d 340 -c 1,4 -o count,collapse | head -4
chr1    11873   12227   1   NR_046018_exon_0_0_chr1_11874_f
chr1    12612   12721   1   NR_046018_exon_1_0_chr1_12613_f
chr1    13220   15038   3   NR_046018_exon_2_0_chr1_13221_f,NR_024540_exon_0_0_chr1_14362_r,NR_024540_exon_1_0_chr1_14970_r
chr1    15795   15947   1   NR_024540_exon_2_0_chr1_15796_r
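The merging logic, including the effect of a distance threshold like -d, can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the bedtools implementation: `merge_intervals` is a hypothetical helper that handles one chromosome only, and the six exon records are reconstructed from the outputs above (the start of NR_024540_exon_0, 14361, is inferred from the 1-based position in its name).

```python
def merge_intervals(records, max_gap=0):
    """Merge sorted (start, end, name) records from a single chromosome.

    A record is joined to the current merged region when its start lies within
    max_gap bases of that region's end (the idea behind `bedtools merge -d`).
    Returns (start, end, count, comma-joined names) tuples.
    """
    merged = []
    for start, end, name in records:
        if merged and start <= merged[-1][1] + max_gap:
            current = merged[-1]
            current[1] = max(current[1], end)
            current[2] += 1
            current[3].append(name)
        else:
            merged.append([start, end, 1, [name]])
    return [(s, e, n, ",".join(names)) for s, e, n, names in merged]


exons = [
    (11873, 12227, "NR_046018_exon_0_0_chr1_11874_f"),
    (12612, 12721, "NR_046018_exon_1_0_chr1_12613_f"),
    (13220, 14409, "NR_046018_exon_2_0_chr1_13221_f"),
    (14361, 14829, "NR_024540_exon_0_0_chr1_14362_r"),  # start inferred from the name
    (14969, 15038, "NR_024540_exon_1_0_chr1_14970_r"),
    (15795, 15947, "NR_024540_exon_2_0_chr1_15796_r"),
]

for row in merge_intervals(exons):                # default: overlapping or adjacent only
    print(row[:3])
for row in merge_intervals(exons, max_gap=340):   # like -d 340
    print(row)
```

With the default gap the third merged region is (13220, 14829, 2); with max_gap=340 it grows to (13220, 15038, 3), because the 140 nt gap between 14829 and 14969 is now bridged.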

Computing complementary regions (complement)

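Given a whole set and one subset of it, find everything else. For example, given the length of each chromosome and the exon regions, find the non-exonic regions; given the genic regions, find the intergenic regions; given the repeats, find the non-repetitive sequence; and so on.

Repeat regions can also be obtained by following the link given earlier: http://blog.genesino.com/2013/05/ucsc-usages/.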

ct@ehbio:~/bedtools$ head genome.txt
chr1    249250621
chr10   135534747
chr11   135006516
chr11_gl000202_random   40103
chr12   133851895
chr13   115169878
chr14   107349540
chr15   102531392
ct@ehbio:~/bedtools$ bedtools complement -i exons.bed -g genome.txt | head -n 5
chr1    0   11873
chr1    12227   12612
chr1    12721   13220
chr1    14829   14969
chr1    15038   15795
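The complement is just the set of gaps between merged regions, bounded by the chromosome length from genome.txt. A minimal Python sketch of that idea follows; `complement` is a hypothetical helper for a single chromosome, fed with the first five merged exon regions from the merge output above.

```python
def complement(intervals, chrom_length):
    """Return the gaps not covered by sorted (start, end) intervals on one chromosome."""
    gaps = []
    cursor = 0  # end of the region covered so far
    for start, end in intervals:
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < chrom_length:
        gaps.append((cursor, chrom_length))
    return gaps


merged_exons = [(11873, 12227), (12612, 12721), (13220, 14829),
                (14969, 15038), (15795, 15947)]
for gap in complement(merged_exons, 249250621)[:5]:
    print(gap)
```

The first five gaps printed agree with the bedtools complement output above. The sketch's final gap, running from 15947 to the end of chr1, only appears because the input here stops after five regions.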

Genome coverage: breadth and depth (genomecov)

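genomecov reports whether each part of the genome is covered and at what depth. It offers several output formats (illustrated by a figure in the original post) and also supports RNA-seq data, where it can compute coverage from junction (spliced) reads.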

genome.txt simply lists each chromosome and its length.

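# For single-line FASTA (each sequence on one line), the lengths can be computed as follows.
# For multi-line FASTA, the lengths of the individual lines have to be summed.
ct@ehbio:~/bedtools$ awk 'BEGIN{OFS=FS="\t"}{\
if($0~/>/) {seq_name=$0;sub(">","",seq_name);} \
else {print seq_name,length;} }' ../bio/genome.fa | tee ../bio/genome.txt
chr1    60001
chr2    54001
chr3    54001
chr4    60001
ct@ehbio:~/bedtools$ bedtools genomecov -ibam ../bio/map.sortP.bam -bga \
-g ../bio/genome.txt | head
# This warning is worth noting: the BAM file already carries this information, so -g does not need to be supplied.
*****
*****WARNING: Genome (-g) files are ignored when BAM input is provided. 
*****
# bedgraph output: the first 3 columns are the same as in BED; the last column is the coverage depth of the region they define.
chr1    0   11  0
chr1    11  17  1
chr1    17  20  2
chr1    20  31  3
chr1    31  36  4
chr1    36  43  6
chr1    43  44  7
chr1    44  46  8
chr1    46  48  9
chr1    48  54  10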
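The awk one-liner above assumes each sequence sits on a single line. For FASTA files whose sequences wrap across several lines, the line lengths must be accumulated per record. A minimal Python sketch of that accumulation follows; `fasta_lengths` is a hypothetical helper, and the toy sequences are made up rather than taken from the genome.fa used above.

```python
import io


def fasta_lengths(handle):
    """Return {sequence name: length}, summing lengths across wrapped lines."""
    lengths = {}
    name = None
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            name = line[1:]          # keep the whole header, as the awk command does
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


# Toy multi-line FASTA (made-up sequences)
toy = io.StringIO(">chr1\nACGTAC\nGT\n>chr2\nGGCC\n")
for name, length in fasta_lengths(toy).items():
    print(f"{name}\t{length}")
```

With a real file, pass `open("genome.fa")` instead of the `io.StringIO` object and redirect the output to genome.txt.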

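Two questions to think about:

How would you compute how much of the genome was covered by sequencing?

How would you compute the average sequencing depth?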
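One possible way to approach both questions is to work from the -bga bedgraph output, which also lists zero-coverage regions: breadth is the number of bases with depth above zero, and mean depth is the length-weighted average of the depth column. The sketch below assumes the rows have already been parsed into tuples; `coverage_summary` is a hypothetical helper, and the data are just the ten bedgraph rows shown above, not a whole genome.

```python
def coverage_summary(bedgraph_rows):
    """Return (covered bases, total bases, mean depth) for -bga style rows."""
    covered = total = depth_sum = 0
    for chrom, start, end, depth in bedgraph_rows:
        length = end - start
        total += length
        depth_sum += length * depth
        if depth > 0:
            covered += length
    return covered, total, depth_sum / total


rows = [
    ("chr1", 0, 11, 0), ("chr1", 11, 17, 1), ("chr1", 17, 20, 2),
    ("chr1", 20, 31, 3), ("chr1", 31, 36, 4), ("chr1", 36, 43, 6),
    ("chr1", 43, 44, 7), ("chr1", 44, 46, 8), ("chr1", 46, 48, 9),
    ("chr1", 48, 54, 10),
]
covered, total, mean_depth = coverage_summary(rows)
print(covered, total)        # 43 54
print(round(mean_depth, 2))  # 3.85
```

So within these first 54 bases, 43 are covered and the mean depth is about 3.85.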

Dataset similarity (jaccard)

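For two BED files, bedtools jaccard reports the number of bases in their intersection (intersection), the number of bases in their union (the column labeled union-intersection, i.e. the combined length of both files minus the intersection), the ratio of the two (jaccard), and the number of intersecting regions (n_intersections).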

ct@ehbio:~/bedtools$ bedtools jaccard \
    -a fHeart-DS16621.hotspot.twopass.fdr0.05.merge.bed \
    -b fHeart-DS15839.hotspot.twopass.fdr0.05.merge.bed
intersection    union-intersection  jaccard n_intersections
81269248    160493950   0.50637 130852
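As a rough check on what the statistic means, the sketch below computes a base-level Jaccard index for two small interval lists on one chromosome, using the identity intersection = |A| + |B| - |A ∪ B|. The helper names and the toy intervals are made up for illustration; this is not the bedtools implementation.

```python
def merge(intervals):
    """Merge overlapping or adjacent (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def total_length(intervals):
    """Total number of bases spanned by non-overlapping intervals."""
    return sum(end - start for start, end in intervals)


def jaccard(a, b):
    """Bases in the intersection divided by bases in the union."""
    a, b = merge(a), merge(b)
    union = total_length(merge(a + b))
    intersection = total_length(a) + total_length(b) - union
    return intersection / union


# Toy data: A covers 20 bases, B covers 20 bases, they share 10, the union is 30
print(jaccard([(0, 10), (20, 30)], [(5, 25)]))

# The ratio of the first two columns of the bedtools output above
print(round(81269248 / 160493950, 5))  # 0.50637
```

The toy example gives 10/30, and dividing the first two columns reported by bedtools reproduces the jaccard column.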

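Food for thought: 1. How could you compute this result with other bedtools tools? 2. If there are many files to compare, how can you make full use of the available computing resources?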

One approach is a for loop, nested two levels deep. This pattern is very common; whether single or nested, for loops simplify repetitive computations.

ct@ehbio:~/bedtools$ for i in *.merge.bed; do \
    for j in *.merge.bed; do \
        bedtools jaccard -a $i -b $j | cut -f3 | tail -n +2 | sed "s/^/$i\t$j\t/"; \
    done; done >total.similarity

Another approach is GNU parallel, which not only handles the batch but also runs the jobs in parallel.

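root@ehbio:~# yum install parallel.noarch
# The text inside the double quotes ("") after parallel is the command parallel should run;
# it is written just like an ordinary Linux command.
# The three adjacent colons (:::) after the quoted command pass arguments; several ::: groups can be chained.
# The arguments after each ::: are looped over and referenced in the command by position:
# {1} is an argument from the first ::: group, {2} one from the second.
#
# This command can replace the merging and substitution steps of the original tutorial. Whereas the
# original commands produce many files, here each result is first tagged with the names of the two
# files being compared, so everything can be written to a single file.
#
ct@ehbio:~/bedtools$ parallel "bedtools jaccard -a {1} -b {2} | awk 'NR>1' | cut -f 3 \
    | sed 's/^/{1}\t{2}\t/'" ::: `ls *.merge.bed` ::: `ls *.merge.bed`  >totalSimilarity.2
# The command above has a small pitfall: output from jobs running in parallel may collide.
# A safer variant writes each job's output to its own file and then concatenates them with cat.
ct@ehbio:~/bedtools$ parallel "bedtools jaccard -a {1} -b {2} | awk 'NR>1' | cut -f 3 \
    | sed 's/^/{1}\t{2}\t/' >{1}.{2}.totalSimilarity_tmp" ::: `ls *.merge.bed` ::: `ls *.merge.bed`
ct@ehbio:~/bedtools$ cat *.totalSimilarity_tmp >totalSimilarity.2
# Strip the uninformative parts of the file names (g is needed because each line holds two names)
ct@ehbio:~/bedtools$ sed -i -e 's/.hotspot.twopass.fdr0.05.merge.bed//g' \
    -e 's/.hg19//g' totalSimilarity.2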

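The commands from the original tutorial are somewhat more involved, which makes them good practice for combining different commands. For actual use, the commands above are recommended.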

ct@ehbio:~/bedtools$ parallel "bedtools jaccard -a {1} -b {2} \
    | awk 'NR>1' | cut -f 3 \
    > {1}.{2}.jaccard" \
    ::: `ls *.merge.bed` ::: `ls *.merge.bed`

This command will create a single file containing the pairwise Jaccard measurements from all 400 tests.

find . \
| grep jaccard \
| xargs grep "" \
| sed -e s"/\.\///" \
| perl -pi -e "s/.bed./.bed\t/" \
| perl -pi -e "s/.jaccard:/\t/" \
> pairwise.dnase.txt

A bit of cleanup to use more intelligible names for each of the samples.

cat pairwise.dnase.txt \
| sed -e 's/.hotspot.twopass.fdr0.05.merge.bed//g' \
| sed -e 's/.hg19//g' \
> pairwise.dnase.shortnames.txt

Now let’s make a 20x20 matrix of the Jaccard statistic. This will allow the data to play nicely with R.

awk 'NF==3' pairwise.dnase.shortnames.txt \
| awk '$1 ~ /^f/ && $2 ~ /^f/' \
| python make-matrix.py \
> dnase.shortnames.distance.matrix
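The make-matrix.py script itself is not shown in this post. As a rough idea of what such a step has to do, the sketch below pivots three-column lines (sample A, sample B, value) into a square, tab-separated matrix. It is an assumption-based illustration, not the tutorial's actual script; `make_matrix` and the two-sample toy input with made-up values are hypothetical.

```python
def make_matrix(lines):
    """Pivot 'row<TAB>col<TAB>value' lines into a tab-separated square matrix."""
    values = {}
    labels = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip malformed lines, like the awk 'NF==3' filter above
        row, col, value = fields
        for label in (row, col):
            if label not in labels:
                labels.append(label)
        values[(row, col)] = value
    out = ["\t".join([""] + labels)]  # header row with an empty corner cell
    for row in labels:
        cells = [values.get((row, col), "NA") for col in labels]
        out.append("\t".join([row] + cells))
    return "\n".join(out)


# Toy input with made-up values for two samples
toy = [
    "fHeart\tfHeart\t1\n",
    "fHeart\tfSkin\t0.5\n",
    "fSkin\tfHeart\t0.5\n",
    "fSkin\tfSkin\t1\n",
]
print(make_matrix(toy))
```

To use it in a pipeline like the one above, one would pass `sys.stdin` as `lines`; pairs missing from the input are filled with NA.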
