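What is the difference between identity and similarity? I realized I was not very clear on these concepts myself, so I did some homework. My notes follow.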

My first reaction was to check the BLAST glossary:

Identity
The extent to which two (nucleotide or amino acid) sequences are invariant.
Similarity
The extent to which nucleotide or protein sequences are related. The extent
 of similarity between two sequences can be based on percent sequence identity
 and/or conservation. In BLAST similarity refers to a positive matrix score.

Oddly, though, the BLAST output has no "similarity" field:

>sp|P05120|PAI2_HUMAN PLASMINOGEN ACTIVATOR INHIBITOR-2, PLACENTAL (PAI-2)
                 (MONOCYTE ARG- SERPIN).
                 Length = 415
      Score = 176 (80.2 bits), Expect = 1.8e-65, Sum P(4) = 1.8e-65
      Identities = 38/89 (42%), Positives = 50/89 (56%)
     Query:     1 QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQ
                  +I +LL   S D DT +VLVNA+YFKG WKT F  +     PF V   
     Sbjct:   180 KIPNLLPEGSVDGDTRMVLVNAVYFKGKWKTPFEKKLNGLYPFRVNSA

Then I found the following sentence:

Identities correspond to exact matches and positives are similarities based
 on the scoring matrix used. (from the BLAST tutorial)

So positives are a kind of adjusted similarity. Putting the two together makes things clear:

identities -> exact matches
positives -> similarities based on the scoring matrix

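When comparing nucleotide sequences, the four bases A, T, C and G are treated as equally likely, and the substitution matrix is very simple: any two identical bases score a positive value, and every substitution scores the same non-positive value (zero in the simplest scheme, a negative mismatch penalty in BLAST's nucleotide scoring). Only exact matches can score above zero, so identities and similarities (positives, in BLAST's terms) are the same number.

When comparing protein sequences, the substitution matrix is a BLOSUM matrix. Identical amino acids score high, similar amino acids score lower but still above zero, and dissimilar pairs score zero or below. Positives counts every aligned position with a positive matrix score, that is, identities plus conservative substitutions, so the two numbers differ and positives is normally at least as large as identities.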
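As a small illustration, the two counts can be recomputed from the aligned segment shown in the output above. This is only a sketch: it relies on BLAST's midline convention (a letter for an identical pair, `+` for a non-identical pair with a positive matrix score) rather than on the matrix itself, and the function names are made up for this example.

```python
# Aligned segment copied from the BLAST output above.
# Only 48 of the 89 aligned columns appear in that excerpt.
query   = "QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQ"
sbjct   = "KIPNLLPEGSVDGDTRMVLVNAVYFKGKWKTPFEKKLNGLYPFRVNSA"
midline = "+I +LL   S D DT +VLVNA+YFKG WKT F  +     PF V"


def count_identities(a: str, b: str) -> int:
    """Number of aligned columns whose two residues are the same (gaps excluded)."""
    return sum(1 for x, y in zip(a, b) if x == y and x != "-")


def count_positives(a: str, b: str, mid: str) -> int:
    """Identities plus the columns BLAST marks with '+' on the midline."""
    return count_identities(a, b) + mid.count("+")


length = len(query)
identities = count_identities(query, sbjct)
positives = count_positives(query, sbjct, midline)
print(f"Identities = {identities}/{length} ({identities / length:.0%})")
print(f"Positives = {positives}/{length} ({positives / length:.0%})")
```

On this excerpt the counts come to 23 identities and 28 positives over 48 columns. The header's 38/89 and 50/89 cover the full alignment, most of which is not shown.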

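Statistical similarity and biological homology are different things again. At this point I Googled homology and similarity and found, in very large type, "Similarity is NOT equal to Homology", on a web page made just to stress that the two are not the same thing. Worth keeping in mind.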

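(2010.10.1) I saw another comment on this post and checked again: the link to the "Similarity is NOT equal to Homology" page was dead, so I recovered the page through the Wayback Machine and pasted it below.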

Similarity is NOT equal to Homology

IDENTITY - The extent to which two sequences are invariant.

SIMILARITY - The extent to which sequences are related. Similarity makes no statement about descent from a common ancestor. (Convergent versus Divergent evolution.)

HOMOLOGY - Sequence similarity that can be attributed to descent from a common ancestor.

There are Two Types of Homology

ORTHOLOGOUS - Homologous sequences in different species. These sequences usually retain the same function in the two species.

PARALOGOUS - Homologous sequences in the same species that arose by means of gene duplication. Divergence of function is more common between paralogues.

Why is this important?
Homology is a matter of opinion, not directly measurable or observable.
Similarity is a direct measurement and can be discussed in terms of percentages.

(See Reeck et al. Cell 50(5): 667 (1987))

Also, on the difference between the Score and the bit score:

BLAST Score

The Statistics of Sequence Similarity Scores

BLAST scores rely on extensive theory. We start by making the following assumptions:
The BLAST score is scoring local ungapped alignments. The theory of scoring here is well understood.
The database sequences are assumed to be evolutionarily unrelated, i.e. independent of one another.
The alignment starts at specific positions along the query and the database record.
The score matrix must give, on average, a negative score for an aligned pair (a,b). Were this not the case, long alignments would tend to have high scores regardless of whether the aligned segments were related, and the statistical theory would break down.

Figure 5.10: Random walk: The score for a match is +2 and the penalty for a mismatch is -1. As shown, the expectation for the whole walk is negative. The probability that the top score will be larger than x decreases exponentially with x.
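The random walk in the caption can be sketched with a short simulation. The +2 and -1 scores come from the caption; the match probability of 0.25 (four equally likely bases), the walk length and the number of walks are assumptions chosen only for illustration.

```python
import random

MATCH_SCORE, MISMATCH_SCORE = 2, -1  # values from the figure caption
P_MATCH = 0.25                       # assumption: four equally likely bases


def expected_step(p_match: float = P_MATCH) -> float:
    """Expected score of one aligned column between unrelated sequences."""
    return p_match * MATCH_SCORE + (1 - p_match) * MISMATCH_SCORE


def top_score(steps: int, rng: random.Random) -> int:
    """Highest cumulative score reached by one random walk that starts at 0."""
    total = best = 0
    for _ in range(steps):
        total += MATCH_SCORE if rng.random() < P_MATCH else MISMATCH_SCORE
        best = max(best, total)
    return best


rng = random.Random(0)  # fixed seed so the run is repeatable
tops = [top_score(200, rng) for _ in range(2000)]

print("expected score per step:", expected_step())
for x in (2, 4, 6, 8, 10):
    # Fraction of walks whose top score reached at least x.
    print(x, sum(t >= x for t in tops) / len(tops))
```

With these assumptions the expected step is negative, and the printed fractions should fall off quickly as x grows, which is the behaviour the caption describes.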

When searching a query of length m in a database of total length n one performs m*n random walk experiments, each with exponentially decreasing probability of achieving a score s. Thus, the E-value for score s is:

$$E = K m n e^{-\lambda s}$$

$\lambda$ and K are constants:

$\lambda$ - scaling factor
K - correction for dependency and bias of the scoring scheme.
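A minimal sketch of the E-value calculation, assuming the usual Karlin-Altschul form E = K·m·n·e^(-λs). The values of `LAMBDA` and `K` below are illustrative assumptions (figures of this size are often quoted for ungapped BLOSUM62); the real constants depend on the scoring matrix and residue frequencies. The query and database sizes are hypothetical.

```python
import math

# Illustrative constants only, not taken from the output shown earlier.
LAMBDA = 0.318
K = 0.134


def e_value(raw_score: float, m: int, n: int,
            lam: float = LAMBDA, k: float = K) -> float:
    """Expected number of chance alignments scoring at least raw_score."""
    return k * m * n * math.exp(-lam * raw_score)


# Hypothetical search: 250-residue query, 50,000,000-residue database.
for s in (40, 60, 80):
    print(s, e_value(s, m=250, n=50_000_000))
```

Doubling the database size doubles E for the same raw score, while each extra unit of raw score shrinks E by a constant factor e^(-λ).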

Indeed, the E-score is normalized by the lengths of the query and the database: the same alignment would have a different E-score if these lengths were different. The E-score is also exponential in the raw score, so it is instructive to normalize it onto a logarithmic scale, called the Bit-score.

The Bit-score B is computed from the E-score E by $E = m n 2^{-B}$. Obviously, the Bit-score is linear in the raw score s:

$$B = \frac{\lambda s - \ln K}{\ln 2}$$

In contrast to raw scores, which have little meaning without K and $\lambda$, the Bit-score is measured in standard units (see e.g. [17]). Naturally, the meaning of the Bit-score depends on the sizes of the query and the database.
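The relation between raw score, bit score and E-value can be checked numerically. The sketch below assumes E = K·m·n·e^(-λs) and E = m·n·2^(-B), which together give B = (λs - ln K) / ln 2. The λ and K values, the raw score of 100 and the search sizes are illustrative assumptions, not the parameters behind the sample output earlier in this post.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Convert a raw score to bits: B = (lambda * s - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_from_raw(raw_score: float, m: int, n: int, lam: float, k: float) -> float:
    """E-value from the raw score: K * m * n * exp(-lambda * s)."""
    return k * m * n * math.exp(-lam * raw_score)


def e_from_bits(bits: float, m: int, n: int) -> float:
    """E-value from the bit score: m * n * 2^(-B)."""
    return m * n * 2.0 ** (-bits)


lam, k = 0.318, 0.134      # assumed constants for illustration
m, n = 250, 50_000_000     # hypothetical query and database lengths
s = 100                    # hypothetical raw score

b = bit_score(s, lam, k)
print("bits:", round(b, 1))
print("E from raw score:", e_from_raw(s, m, n, lam, k))
print("E from bit score:", e_from_bits(b, m, n))
```

Both routes give the same E-value, which is the point of the bit score: once a score is expressed in bits, λ and K are no longer needed to interpret it.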

Again, as mentioned before, one can ask for the P-value: the probability of observing at least one record with a given E-value or lower.
Define the random variable Y to be the observed number of pairs achieving E-value E or better (smaller).
Y is Poisson distributed with mean E. The probability that Y equals r is

$$P(Y = r) = \frac{e^{-E} E^{r}}{r!}$$

and the probability that Y equals 0 is equivalent to the probability that the best score stays below S, which is $e^{-E}$. Specifically, the chance of finding zero alignments with score >= S is $e^{-E}$, so the probability of finding at least one such alignment is $1 - e^{-E}$. This is the P-value associated with the score S (see e.g. [17]). Note that this model assumes an i.i.d. trial for each database position.
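A short sketch of the E-value to P-value conversion, assuming the number of chance hits is Poisson distributed with mean E so that P = 1 - e^(-E). The function names are made up for this example.

```python
import math


def poisson_pmf(r: int, e: float) -> float:
    """Probability of exactly r chance hits when the expected number is e."""
    return math.exp(-e) * e ** r / math.factorial(r)


def p_value(e: float) -> float:
    """Probability of at least one chance hit: 1 - exp(-E).

    math.expm1 keeps precision when E is tiny, where 1 - exp(-E)
    would otherwise round to zero.
    """
    return -math.expm1(-e)


for e in (10.0, 1.0, 0.01, 1.8e-65):
    print(e, p_value(e))
```

For small E the P-value is practically equal to E, which appears consistent with the sample output earlier in this post, where Expect and the P value are both reported as 1.8e-65. For large E the P-value saturates near 1.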
