BIOM: the Biological Observation Matrix, a common data format for microbiome data

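Contents

Introduction
Installing the biom tool
File format
Tips and FAQs
Quick Start
File format conversion
Adding sample groups and taxonomy annotations to a biom file
Summarizing BIOM tables
Reference
Appendix 1. Functions for working with biom interactively in Python
- Functions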

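Introduction

http://biom-format.org/

BIOM is the most commonly used format for storing results in the microbiome field. Its advantage is that the OTU or feature table, the sample metadata, the taxonomy annotations and other tables can all be kept in a single file, in a uniform format and with a smaller file size. It is currently supported by nearly all mainstream microbiome software: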

QIIME
MG-RAST
PICRUSt
Mothur
phyloseq
MEGAN
VAMPS
metagenomeSeq
Phinch
RDP Classifier
USEARCH
PhyloToAST
EBI Metagenomics
GCModeller
MetaPhlAn 2

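The BIOM format was first published in 2012 by Rob Knight and colleagues in GigaScience, a journal co-published by BGI in China; the paper had been cited 242 times when this post was written.

The Biological Observation Matrix (or BIOM, canonically pronounced "biome") is a core data type for microbiome analysis.

This post covers three topics:

The definition of the BIOM file format;
Using the biom command to convert between file formats, add metadata, summarize tables, and so on;
Working with BIOM files from Python and R.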

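Installing the biom tool

The usual tool for working with BIOM files is a Python package, which can be installed with pip, conda, and so on.

# Install the scientific-computing dependency
pip install numpy
# Install the biom package
pip install biom-format
# Install support for the BIOM 2.0 (HDF5) format
pip install h5py
# Show the command-line help
biom

The more recommended route is to install the Python and R packages with conda.

Look up the names and versions of the corresponding bioconda packages at https://bioconda.github.io/recipes.html

# Install the Python package
conda install biom-format # 2.1.7
# Install the R biom package
conda install r-biom
# Or install the R microbiome package, which pulls in r-biom
conda install bioconductor-microbiome

Its main functions are as follows: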

Usage: biom [OPTIONS] COMMAND [ARGS]...

Options:
  --version   Show the version and exit.
  -h, --help  Show this message and exit.

Commands:
  add-metadata       Add metadata to a BIOM table.
  convert            Convert to/from the BIOM table format.
  from-uc            Create a BIOM table from a vsearch/uclust/usearch BIOM...
  head               Dump the first bit of a table.
  normalize-table    Normalize a BIOM table.
  show-install-info  Provide information about the biom-format installation.
  subset-table       Subset a BIOM table.
  summarize-table    Summarize sample or observation data in a BIOM table.
  table-ids          Dump IDs in a table.
  validate-table     Validate a BIOM-formatted file.

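File format

http://biom-format.org/documentation/biom_format.html

BIOM currently comes in two versions: 1.0 (JSON) and 2.0 (HDF5).

Version 1.0 is based on JSON, a format widely supported across programming languages and similar to a hash of key-value pairs. Depending on how sparse the data is, it uses a different storage structure (dense or sparse) to save space.

Version 2.0 is based on HDF5, a binary format supported by many programming languages; it is faster to read and more space-efficient.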
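To make the JSON variant more concrete, here is a minimal sketch of what a sparse BIOM 1.0 document for a small 4 x 3 OTU table could look like, built with Python's standard json module only. The key names follow my reading of the BIOM 1.0 specification linked above; the id, generated_by and date values are placeholders, and the to_dense helper is a hypothetical name introduced for illustration. Treat it as a picture of the layout, not as a validated file.

```python
import json

# A minimal sparse BIOM 1.0-style document for a 4 x 3 OTU table.
# The id, generated_by and date values are placeholders for illustration.
biom_doc = {
    "id": None,
    "format": "Biological Observation Matrix 1.0.0",
    "format_url": "http://biom-format.org",
    "type": "OTU table",
    "generated_by": "hand-written example",
    "date": "2012-01-01T00:00:00",
    "rows": [{"id": f"OTU{i}", "metadata": None} for i in range(4)],
    "columns": [{"id": s, "metadata": None} for s in ("PC.354", "PC.355", "PC.356")],
    "matrix_type": "sparse",
    "matrix_element_type": "int",
    "shape": [4, 3],
    # Sparse entries are [row index, column index, value]; zeros are omitted.
    "data": [[0, 2, 4], [1, 0, 6], [2, 0, 1], [2, 2, 7], [3, 2, 3]],
}


def to_dense(doc):
    """Expand the sparse 'data' triples into a full rows x columns matrix."""
    n_rows, n_cols = doc["shape"]
    dense = [[0] * n_cols for _ in range(n_rows)]
    for row, col, value in doc["data"]:
        dense[row][col] = value
    return dense


text = json.dumps(biom_doc)      # serialize to a JSON string
restored = json.loads(text)      # any JSON-aware language can read it back
dense = to_dense(restored)
for row_info, counts in zip(restored["rows"], dense):
    print(row_info["id"], counts)
```

Only five of the twelve cells are stored in "data", which is where the space saving for sparse tables comes from.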

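Tips and FAQs

BIOM has three goals: to store and process large, sparse tables; to keep a study's core information in a single file; and to provide a format that is interchangeable between software packages.

Below are the two layouts commonly used to store an OTU table.

A dense representation of an OTU table: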

OTU ID PC.354  PC.355  PC.356
OTU0   0   0   4
OTU1   6   0   0
OTU2   1   0   7
OTU3   0   0   3

A sparse representation of an OTU table:

PC.354 OTU1 6
PC.354 OTU2 1
PC.356 OTU0 4
PC.356 OTU2 7
PC.356 OTU3 3

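OTU tables are often 90% zeros, and sometimes as much as 99%. BIOM 1.0 supports both the sparse and the dense representation; BIOM 2.x supports only the sparse one.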
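The relationship between the two layouts can be sketched in a few lines of plain Python. The snippet below takes the dense example table shown above, keeps only the non-zero cells, and reports the fraction of non-zero values. The dense_to_sparse function is a hypothetical helper written for this illustration and is not part of the biom package.

```python
# Dense OTU table from the example above: rows are OTUs, columns are samples.
samples = ["PC.354", "PC.355", "PC.356"]
dense = {
    "OTU0": [0, 0, 4],
    "OTU1": [6, 0, 0],
    "OTU2": [1, 0, 7],
    "OTU3": [0, 0, 3],
}


def dense_to_sparse(dense, samples):
    """Return (sample, otu, count) triples for every non-zero cell."""
    triples = []
    for col, sample in enumerate(samples):
        for otu, counts in dense.items():
            if counts[col] != 0:
                triples.append((sample, otu, counts[col]))
    return triples


sparse = dense_to_sparse(dense, samples)
for sample, otu, count in sparse:
    print(sample, otu, count)

# Fraction of non-zero cells, i.e. the table density.
density = len(sparse) / (len(dense) * len(samples))
print(f"density = {density:.3f}")
```

The printed triples match the sparse listing above, and sample PC.355 disappears entirely because all of its counts are zero.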

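The core study data (OTU table, sample metadata and OTU taxonomy annotations) are bundled into a single file.

Quick Start

This section covers working with BIOM files interactively in Python. I rarely use that approach, so the details are in Appendix 1.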

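File format conversion

The convert command converts freely between text tables and the BIOM format. It can:

Convert to a tab-delimited table, which is convenient to view in Excel and similar programs;
Convert between sparse and dense biom representations (the dense representation is only supported in BIOM 1.0).

Tab-delimited tables are usually called classic-format tables, and BIOM-format tables are called biom tables.

Convert a classic table to HDF5 or JSON format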

biom convert -i table.txt -o table.from_txt_json.biom --table-type="OTU table" --to-json
biom convert -i table.txt -o table.from_txt_hdf5.biom --table-type="OTU table" --to-hdf5

Convert biom to classic format

biom convert -i table.biom -o table.from_biom.txt --to-tsv

Convert biom to classic format and include the taxonomy annotation as the last column

biom convert -i table.biom -o table.from_biom_w_taxonomy.txt --to-tsv --header-key taxonomy

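Convert biom to classic format, include the taxonomy annotation as the last column, and rename that column to ConsensusLineage

This is useful when downstream software requires a specific column name.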

biom convert -i table.biom -o table.from_biom_w_consensuslineage.txt --to-tsv --header-key taxonomy --output-metadata-id "ConsensusLineage"

Round-trip conversion of tables with taxonomy annotations

biom convert -i table.biom -o table_tax.txt --to-tsv --header-key taxonomy
biom convert -i table_tax.txt -o new_table.biom --to-hdf5 --table-type="OTU table" --process-obs-metadata taxonomy
biom convert -i table_tax.txt -o new_table.biom --to-json --table-type="OTU table" --process-obs-metadata taxonomy

Convert an early QIIME 1.4 table to BIOM format (rarely needed)

sed 's/Consensus Lineage/ConsensusLineage/' < otu_table.txt | sed 's/ConsensusLineage/taxonomy/' > otu_table.taxonomy.txt
biom convert -i otu_table.taxonomy.txt -o otu_table.from_txt.biom --table-type="OTU table" --process-obs-metadata taxonomy --to-hdf5
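The sed pipeline above only renames the taxonomy column header so that biom convert can find it. For readers who prefer Python, the same renaming can be sketched as follows. The rename_lineage_header function and the three-line legacy table are made up for this illustration; unlike sed, this version deliberately touches header lines only.

```python
def rename_lineage_header(lines):
    """Rename the legacy taxonomy column header to 'taxonomy'.

    Only header lines (starting with '#') are modified, so a data row
    that happens to contain the same text is left alone.
    """
    out = []
    for line in lines:
        if line.startswith("#"):
            line = line.replace("Consensus Lineage", "taxonomy")
            line = line.replace("ConsensusLineage", "taxonomy")
        out.append(line)
    return out


# Illustrative legacy-style table; the values are invented.
legacy = [
    "# QIIME v1.4.0 OTU table\n",
    "#OTU ID\tPC.354\tPC.355\tConsensus Lineage\n",
    "0\t0\t4\tRoot;Bacteria\n",
]
converted = rename_lineage_header(legacy)
print(converted[1], end="")
```

The converted lines can then be written to otu_table.taxonomy.txt and passed to the biom convert command shown above.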

Adding sample groups and taxonomy annotations to a biom file

biom add-metadata -h # show the help message

Usage: biom add-metadata [OPTIONS]

  Add metadata to a BIOM table.

  Add sample and/or observation metadata to BIOM-formatted files. See
  examples here: http://biom-format.org/documentation/adding_metadata.html

  Example usage:

  Add sample metadata to a BIOM table:

  $ biom add-metadata -i otu_table.biom -o table_with_sample_metadata.biom
  -m sample_metadata.txt

Options:
  -i, --input-fp FILE             The input BIOM table  [required]
  -o, --output-fp FILE            The output BIOM table  [required]
  -m, --sample-metadata-fp FILE   The sample metadata mapping file (will add
                                  sample metadata to the input BIOM table, if
                                  provided).
  --observation-metadata-fp FILE  The observation metadata mapping file (will
                                  add observation metadata to the input BIOM
                                  table, if provided).
  --sc-separated TEXT             Comma-separated list of the metadata fields
                                  to split on semicolons. This is useful for
                                  hierarchical data such as taxonomy or
                                  functional categories.
  --sc-pipe-separated TEXT        Comma-separated list of the metadata fields
                                  to split on semicolons and pipes ("|"). This
                                  is useful for hierarchical data such as
                                  functional categories with one-to-many
                                  mappings (e.g. x;y;z|x;y;w)).
  --int-fields TEXT               Comma-separated list of the metadata fields
                                  to cast to integers. This is useful for
                                  integer data such as "DaysSinceStart".
  --float-fields TEXT             Comma-separated list of the metadata fields
                                  to cast to floating point numbers. This is
                                  useful for real number data such as "pH".
  --sample-header TEXT            Comma-separated list of the sample metadata
                                  field names. This is useful if a header line
                                  is not provided with the metadata, if you
                                  want to rename the fields, or if you want to
                                  include only the first n fields where n is
                                  the number of entries provided here.
  --observation-header TEXT       Comma-separated list of the observation
                                  metadata field names. This is useful if a
                                  header line is not provided with the
                                  metadata, if you want to rename the fields,
                                  or if you want to include only the first n
                                  fields where n is the number of entries
                                  provided here.
  --output-as-json                Write the output file in JSON format.
  -h, --help                      Show this message and exit.

Your sample grouping (metadata) file looks like this:

head sample.txt

#SampleID       BarcodeSequence genotype
KO1     TAGCTT  KO
KO2     GGCTAC  KO
KO3     CGCGCG  KO

Your taxonomy annotation file looks like this:

head taxonomy.txt

#OTUID  taxonomy        confidence
OTU_325 k__Bacteria;p__Bacteroidetes;c__Flavobacteriia;o__Flavobacteriales;f__Cryomorphaceae;g__;s__    0.880
OTU_324 k__Bacteria;p__Chlorobi;c__SJA-28;o__;f__;g__;s__       1.000

Add sample grouping information

biom add-metadata -i table.biom -o table.w_smd.biom --sample-metadata-fp sample.txt

Add OTU annotations

biom add-metadata -i table.biom -o table.w_omd.biom --observation-metadata-fp taxonomy.txt

Add both sample and OTU annotations

biom add-metadata -i table.biom -o table.w_md.biom --observation-metadata-fp taxonomy.txt --sample-metadata-fp sample.txt

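Add row and column metadata at once, with typed fields
You can specify how metadata columns are parsed: as integers (--int-fields), as floating-point numbers (--float-fields), or as hierarchical taxonomy split on semicolons (--sc-separated).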

biom add-metadata -i table.biom -o table.w_md.biom --observation-metadata-fp taxonomy.txt --sample-metadata-fp sample.txt --sc-separated taxonomy --float-fields confidence
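To show what these options mean for the stored metadata, here is a rough sketch in plain Python that parses the taxonomy.txt example from above, splitting the taxonomy column on semicolons and casting confidence to a float. It only approximates the effect of --sc-separated, --float-fields and --int-fields; the parse_metadata function and its parameters are hypothetical and are not how the biom package is implemented.

```python
import csv
import io

# The taxonomy.txt example from above, as tab-separated text.
TAXONOMY_TXT = (
    "#OTUID\ttaxonomy\tconfidence\n"
    "OTU_325\tk__Bacteria;p__Bacteroidetes;c__Flavobacteriia;"
    "o__Flavobacteriales;f__Cryomorphaceae;g__;s__\t0.880\n"
    "OTU_324\tk__Bacteria;p__Chlorobi;c__SJA-28;o__;f__;g__;s__\t1.000\n"
)


def parse_metadata(handle, sc_separated=(), float_fields=(), int_fields=()):
    """Read a tab-separated mapping file into {id: {field: value}}."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    fields = header[1:]              # the first column holds the IDs
    metadata = {}
    for row in reader:
        if not row:
            continue                 # skip blank lines
        entry = {}
        for name, value in zip(fields, row[1:]):
            if name in sc_separated:
                entry[name] = [part.strip() for part in value.split(";")]
            elif name in float_fields:
                entry[name] = float(value)
            elif name in int_fields:
                entry[name] = int(value)
            else:
                entry[name] = value  # untyped fields stay as strings
        metadata[row[0]] = entry
    return metadata


obs_md = parse_metadata(
    io.StringIO(TAXONOMY_TXT),
    sc_separated=("taxonomy",),
    float_fields=("confidence",),
)
print(obs_md["OTU_324"]["taxonomy"])
print(obs_md["OTU_324"]["confidence"])
```

Without the typing options every field would remain a plain string, which makes rank-by-rank taxonomy filtering or numeric comparisons on confidence awkward later on.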

--observation-header and --sample-header can rename the columns:

biom add-metadata -i min_sparse_otu_table.biom -o table.w_smd.biom --sample-metadata-fp sam_md.txt --sample-header SampleID,BarcodeSequence,DateOfBirth
biom add-metadata -i min_sparse_otu_table.biom -o table.w_omd.biom --observation-metadata-fp obs_md.txt --observation-header OTUID,taxonomy,confidence

Listing fewer names reads in only those leading columns:

biom add-metadata -i min_sparse_otu_table.biom -o table.w_omd.biom --observation-metadata-fp obs_md.txt --observation-header OTUID,taxonomy --sc-separated taxonomy

Summarizing BIOM tables

biom summarize-table -h

Summarize the counts per sample

biom summarize-table -i table.w_md.biom -o table.w_md_summary.txt

Example output:

Num samples: 27
Num observations: 975
Total count: 409647
Table density (fraction of non-zero values): 0.464

Counts/sample summary:
 Min: 2352.0
 Max: 35955.0
 Median: 14851.000
 Mean: 15172.111
 Std. dev.: 10691.823
 Sample Metadata Categories: BarcodeSequence; genotype
 Observation Metadata Categories: taxonomy; confidence

Counts/sample detail:
OE4: 2352.0
OE3: 2353.0
OE8: 3091.0
OE2: 3173.0
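The figures in this report are simple statistics over the per-sample totals. The sketch below recomputes them for a made-up three-sample table using only the standard library. The table contents and the summarize_counts function are hypothetical, and I am assuming a population standard deviation here, which may or may not match what biom reports.

```python
import statistics

# Hypothetical toy table: sample -> {observation: count}.
table = {
    "S1": {"OTU0": 0, "OTU1": 6, "OTU2": 1},
    "S2": {"OTU0": 4, "OTU1": 0, "OTU2": 7},
    "S3": {"OTU0": 0, "OTU1": 0, "OTU2": 3},
}


def summarize_counts(table):
    """Compute summary statistics over the total count of each sample."""
    totals = {sample: sum(counts.values()) for sample, counts in table.items()}
    values = list(totals.values())
    n_cells = sum(len(counts) for counts in table.values())
    n_nonzero = sum(1 for counts in table.values() for v in counts.values() if v != 0)
    return {
        "num_samples": len(table),
        "total_count": sum(values),
        "density": n_nonzero / n_cells,
        "min": min(values),
        "max": max(values),
        "median": statistics.median(values),
        "mean": statistics.mean(values),
        "std": statistics.pstdev(values),  # assumption: population std. dev.
        # Per-sample detail, sorted from the smallest total to the largest.
        "per_sample": sorted(totals.items(), key=lambda kv: kv[1]),
    }


summary = summarize_counts(table)
for key, value in summary.items():
    print(f"{key}: {value}")
```

The sorted per-sample detail is the part I look at first, since the smallest totals decide a sensible rarefaction depth.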

Count the unique observations per sample, i.e. richness (an alpha-diversity measure)

biom summarize-table -i table.w_md.biom --qualitative -o table.w_md_qual_summary.txt

Output:

Num samples: 27
Num observations: 975

Observations/sample summary:
 Min: 222
 Max: 633
 Median: 486.000
 Mean: 452.704
 Std. dev.: 138.713
 Sample Metadata Categories: BarcodeSequence; genotype
 Observation Metadata Categories: taxonomy; confidence

Observations/sample detail:
OE3: 222
OE4: 248
OE8: 261
OE1: 272
OE2: 278
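In qualitative mode the counts themselves are ignored and each sample is scored by how many observations are present at all. A minimal sketch of that idea, again on a made-up table and with a hypothetical richness helper rather than any biom API:

```python
# Hypothetical toy table: sample -> {observation: count}.
table = {
    "S1": {"OTU0": 0, "OTU1": 6, "OTU2": 1},
    "S2": {"OTU0": 4, "OTU1": 0, "OTU2": 7},
    "S3": {"OTU0": 0, "OTU1": 0, "OTU2": 3},
}


def richness(table):
    """Number of observations with a non-zero count in each sample."""
    return {
        sample: sum(1 for count in counts.values() if count > 0)
        for sample, counts in table.items()
    }


per_sample = richness(table)
# Report from the poorest sample to the richest, ties broken by sample name.
for sample, n_obs in sorted(per_sample.items(), key=lambda kv: (kv[1], kv[0])):
    print(f"{sample}: {n_obs}")
```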

Reference

The Biological Observation Matrix (BIOM) format or: how I learned to stop worrying and love the ome-ome.
Daniel McDonald, Jose C. Clemente, Justin Kuczynski, Jai Ram Rideout, Jesse Stombaugh, Doug Wendel, Andreas Wilke, Susan Huse, John Hufnagle, Folker Meyer, Rob Knight, and J. Gregory Caporaso.
GigaScience 2012, 1:7. doi:10.1186/2047-217X-1-7

http://biom-format.org/

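Appendix 1. Functions for working with biom interactively in Python

Functions

Once the biom package is installed, BIOM tables can be read from an interactive Python session.

The load_table(f) function reads a biom file.

Load and display the example table bundled with biom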

>>> from biom import example_table
>>> print(example_table)
# Constructed from biom file
#OTU ID S1  S2  S3
O1  0.0 1.0 2.0
O2  3.0 4.0 5.0

Read a biom table from a file

from biom import load_table
table = load_table('otutab.biom')

The Table class

Table(data, observation_ids, sample_ids[, …])

import numpy as np
from biom.table import Table
data = np.arange(40).reshape(10, 4)
sample_ids = ['S%d' % i for i in range(4)]
observ_ids = ['O%d' % i for i in range(10)]
sample_metadata = [{'environment': 'A'}, {'environment': 'B'},
                   {'environment': 'A'}, {'environment': 'B'}]
observ_metadata = [{'taxonomy': ['Bacteria', 'Firmicutes']},
                   {'taxonomy': ['Bacteria', 'Firmicutes']},
                   {'taxonomy': ['Bacteria', 'Proteobacteria']},
                   {'taxonomy': ['Bacteria', 'Proteobacteria']},
                   {'taxonomy': ['Bacteria', 'Proteobacteria']},
                   {'taxonomy': ['Bacteria', 'Bacteroidetes']},
                   {'taxonomy': ['Bacteria', 'Bacteroidetes']},
                   {'taxonomy': ['Bacteria', 'Firmicutes']},
                   {'taxonomy': ['Bacteria', 'Firmicutes']},
                   {'taxonomy': ['Bacteria', 'Firmicutes']}]
table = Table(data, observ_ids, sample_ids, observ_metadata,
              sample_metadata, table_id='Example Table')
table  # table summary (shown as the repr in an interactive session)
print(table)  # print the table
print(table.ids())  # sample IDs
print(table.ids(axis='observation'))  # observation IDs
print(table.nnz)  # number of nonzero entries

I prefer working on the command line. For interactive use in Python, more code can be found at http://biom-format.org/documentation/table_objects.html
