Editor's note

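RetroXpert: Decompose Retrosynthesis Prediction Like a Chemist is a NeurIPS 2020 paper on retrosynthesis. Among template-free approaches to end-to-end retrosynthesis prediction, RetroXpert reported the highest top-n accuracy at the time of publication. This article summarizes and analyzes the paper's main experimental methods, prediction results, and prediction visualizations, to help readers understand the paper.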

1. Introduction

The work, led by Tencent AI Lab in November 2020, was published at NeurIPS 2020 together with the University of Texas at Arlington, Tsinghua University, and Sun Yat-sen University. It decomposes retrosynthesis into two steps. First, a new graph neural network identifies the potential reaction center of the target molecule and generates intermediate synthons. Second, a robust reactant generation network (RGN) generates the reactants associated with those synthons. The method differs from G2Gs in two respects: G2Gs can predict at most one bond disconnection, and G2Gs generates multiple reactants independently, ignoring the relationship between them. To overcome these two difficulties, the paper makes three contributions:

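1) Potential reaction centers are identified with an Edge-enhanced Graph Attention Network (EGAT) that is strengthened with chemical knowledge.

2) The training data are further augmented with unsuccessfully predicted synthons, which makes the RGN robust.

3) On the standard USPTO-50K dataset, the model reaches a top-1 accuracy of 70.4% (reaction type known) and 65.5% (reaction type unknown).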

1.1 Method

1.1.1 Reaction center identification

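Reaction center identification takes a graph representation of the product as input: a node feature matrix X for its N atoms, an edge feature tensor E, and the adjacency matrix A. From these, the model identifies the bonds likely to be disconnected during retrosynthesis. The product P can then be split into a set of intermediate synthons, each of which can be viewed as a substructure of one reactant, as shown in Figure 1.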

Figure 1: Reaction center identification
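The step from predicted disconnections to synthons can be sketched as a connected-components search on the product graph after the predicted bonds are removed. This is a minimal illustration in plain Python, not the authors' code: the function name `split_into_synthons` and the toy chain molecule are assumptions, and atoms are represented only by integer indices.

```python
from collections import defaultdict


def split_into_synthons(num_atoms, bonds, broken_bonds):
    """Remove the predicted reaction-center bonds and return the connected
    components of what is left, each as a sorted list of atom indices.

    bonds and broken_bonds are iterables of (i, j) atom-index pairs.
    (Hypothetical helper for illustration only.)
    """
    broken = {frozenset(b) for b in broken_bonds}
    adjacency = defaultdict(set)
    for i, j in bonds:
        if frozenset((i, j)) not in broken:
            adjacency[i].add(j)
            adjacency[j].add(i)

    seen, synthons = set(), []
    for start in range(num_atoms):
        if start in seen:
            continue
        # Depth-first search collects one connected component (one synthon).
        stack, component = [start], []
        seen.add(start)
        while stack:
            atom = stack.pop()
            component.append(atom)
            for nxt in adjacency[atom]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        synthons.append(sorted(component))
    return synthons


# Toy product: a 5-atom chain 0-1-2-3-4 with bond (2, 3) predicted as broken.
chain_bonds = [(0, 1), (1, 2), (2, 3), (3, 4)]
synthons = split_into_synthons(5, chain_bonds, [(2, 3)])
```

With one broken bond in an acyclic chain the product falls into two synthons; a bond inside a ring would leave a single component, which is one reason the number of synthons is not simply the number of disconnections plus one.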

Introducing EGAT (Edge-enhanced Graph Attention Network):

Because a product can be made by different reactions, a given product may have several possible reaction centers, each corresponding to a different reaction. An MPNN captures only the local structure around each node, and without global information it is hard to tell multiple candidate reaction centers apart. G2Gs, for its part, can predict only a single bond disconnection. RetroXpert therefore introduces a graph-level auxiliary task that predicts the total number of disconnected bonds. Unlike GAT, EGAT takes edge features as well as node features as input and also produces a graph-level embedding. As shown in Figure 2, EGAT concatenates the input node and edge feature vectors, and several EGAT layers are stacked to obtain the final-layer node and edge representations.

Figure 2: EGAT model
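To make the idea of edge-aware attention concrete, here is a heavily simplified sketch of one attention step in which each attention score is computed from the concatenation of the two node vectors and the edge vector between them. It is an assumption-laden illustration rather than the EGAT layer from the paper: the learned linear transforms, multi-head attention, and the edge-embedding update are all omitted, and the names `edge_aware_attention` and `attn_vector` are hypothetical.

```python
import math


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def edge_aware_attention(node_feats, edge_feats, neighbors, attn_vector):
    """One simplified attention step whose scores depend on edge features.

    node_feats:  list of node feature vectors (lists of floats)
    edge_feats:  dict mapping (i, j) -> edge feature vector
    neighbors:   dict mapping i -> list of neighbor indices j
    attn_vector: weights applied to concat(h_i, h_j, e_ij)
    """
    updated = []
    for i, h_i in enumerate(node_feats):
        nbrs = neighbors.get(i, [])
        if not nbrs:
            updated.append(list(h_i))  # isolated node: keep its features
            continue
        scores = []
        for j in nbrs:
            # Concatenate node i, node j, and the edge between them.
            joint = list(h_i) + list(node_feats[j]) + list(edge_feats[(i, j)])
            raw = sum(w * x for w, x in zip(attn_vector, joint))
            scores.append(leaky_relu(raw))
        # Softmax over the neighbors of node i (shifted for stability).
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        alphas = [e / total for e in exps]
        # New node vector: attention-weighted sum of neighbor vectors.
        updated.append([
            sum(a * node_feats[j][d] for a, j in zip(alphas, nbrs))
            for d in range(len(h_i))
        ])
    return updated


# Toy graph: node 0 is bonded to nodes 1 and 2; edges carry one feature each.
nodes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
neighbors = {0: [1, 2], 1: [0], 2: [0]}
edges = {(0, 1): [0.0], (1, 0): [0.0], (0, 2): [2.0], (2, 0): [2.0]}
edge_only = [0.0, 0.0, 0.0, 0.0, 1.0]  # score depends on the edge feature only
updated = edge_aware_attention(nodes, edges, neighbors, edge_only)
```

In this toy setting node 0 attends more strongly to node 2 because the edge (0, 2) has the larger feature value, which is the behavior a plain node-only attention score could not express.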

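In addition, the final embedding of each bond is used to predict whether that bond is disconnected. This main task is trained by minimizing the cross-entropy loss, that is, the negative log-likelihood of the true bond labels under the predicted probabilities.

In this loss, K is the total number of training reactions, and the bond labels are obtained by comparing the molecular graphs of the target product and the reactants. A graph-level representation is additionally computed as the output of a readout operation over all learned node representations. From it, the number of bond disconnections for each target is predicted; the maximum number of disconnections is implicit in the shape of the weight matrix W.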
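The labeling and the main-task loss described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: it assumes the product and reactants share atom-mapped indices so that bonds can be compared as index pairs, and the names `bond_labels` and `binary_cross_entropy` are hypothetical.

```python
import math


def bond_labels(product_bonds, reactant_bonds):
    """Label a product bond 1 if it is absent from the reactants
    (it was disconnected), else 0. Assumes shared atom-map indices."""
    reactant_set = {frozenset(b) for b in reactant_bonds}
    return {
        tuple(sorted(b)): int(frozenset(b) not in reactant_set)
        for b in product_bonds
    }


def binary_cross_entropy(probs, labels, eps=1e-12):
    """Mean negative log-likelihood of the true bond labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


# Toy example: bond (1, 2) exists in the product but not in the reactants.
product_bonds = [(0, 1), (1, 2), (2, 3)]
reactant_bonds = [(1, 0), (2, 3), (1, 4)]  # (1, 4) is a leaving-group bond
labels = bond_labels(product_bonds, reactant_bonds)

# Target for the graph-level auxiliary task: number of disconnected bonds.
num_disconnections = sum(labels.values())

# Main-task loss for some made-up predicted disconnection probabilities.
predicted = [0.1, 0.9, 0.1]
loss = binary_cross_entropy(predicted, [labels[b] for b in product_bonds])
```

The auxiliary target `num_disconnections` is a class index; a classifier over the graph-level embedding would need one output per possible count, which is how the maximum number of disconnections ends up encoded in the shape of its weight matrix.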

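1.1.2 Generating the required reactant set from synthons

The input to the RGN is a text sequence: the product and the generated synthons are joined by a link token and combined with the reaction type in text form. From this input the RGN generates the corresponding reactants, as shown in Figure 3.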

Figure 3: Reactant generation
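The construction of the RGN input sequence can be sketched as simple string assembly. The token spellings `[RXN_k]` and `[LINK]`, the ordering of the parts, and the function name `build_rgn_input` are assumptions made for illustration; the SMILES strings are toy values, and SMILES tokenization is omitted.

```python
def build_rgn_input(product_smiles, synthon_smiles, reaction_type=None):
    """Assemble a source sequence for a sequence-to-sequence reactant generator.

    reaction_type is an optional class id; when it is None the sequence
    has no reaction-type token (the "type unknown" setting).
    """
    tokens = []
    if reaction_type is not None:
        tokens.append(f"[RXN_{reaction_type}]")  # hypothetical type token
    tokens.append(product_smiles)
    tokens.append("[LINK]")                      # hypothetical separator token
    tokens.append(".".join(synthon_smiles))      # synthons joined SMILES-style
    return " ".join(tokens)


# Toy strings, for illustration only.
with_type = build_rgn_input("CC(=O)OCC", ["CC=O", "OCC"], reaction_type=1)
without_type = build_rgn_input("CC(=O)OCC", ["CC=O", "OCC"])
```

Keeping the product in the input alongside the synthons lets the generator see the full context, which matters when the synthons themselves come from an imperfect reaction center prediction.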

1.2 Experiments

1.2.1 Data

The experiments use the USPTO-50K dataset, together with a larger set of 1,808,937 raw USPTO reactions without reaction types. The data are split 8:1:1 into training, validation, and test sets.
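An 8:1:1 split can be sketched as below. This is only an illustration of the ratio: whether the original work shuffles randomly or reuses a previously published split is not stated here, so the seeded shuffle and the function name `split_8_1_1` are assumptions.

```python
import random


def split_8_1_1(items, seed=0):
    """Shuffle deterministically and split into train/valid/test at 8:1:1."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10
    n_valid = len(shuffled) // 10
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # remainder goes to the test set
    return train, valid, test


# Toy data: 100 reaction ids.
train, valid, test = split_8_1_1(range(100))
```

Integer arithmetic is used for the split points so that the sizes do not depend on floating-point rounding.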

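1.2.2 Reaction center prediction results

· An ablation removes the edge embeddings from the computation of the node-to-node correlation (attention) coefficients in order to observe their effect.

· The improvement is more pronounced when the reaction type is unknown, which indicates that the reaction type plays an important role in retrosynthesis. Reactions of the same type usually share similar reaction patterns (the atoms, bonds, and functional groups involved), so the reaction center is easier to identify when the reaction type is known in advance.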

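1.2.3 Reactant generation results

An ablation study examines the effect of data augmentation, as shown in Figure 4:

Figure 4: Top-n results with the reaction type known and unknown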
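For reference, top-n accuracy counts a test reaction as correct when the recorded reactants appear among the first n ranked predictions. A minimal sketch, assuming predictions and ground truths are already canonicalized strings so that exact string comparison is meaningful (the function name and toy data are illustrative):

```python
def top_n_accuracy(ranked_predictions, ground_truths, n):
    """Fraction of examples whose ground truth is among the top-n predictions."""
    hits = sum(
        1
        for preds, truth in zip(ranked_predictions, ground_truths)
        if truth in preds[:n]
    )
    return hits / len(ground_truths)


# Toy ranked predictions for three products (placeholder strings).
predictions = [["A", "B", "C"], ["X", "Y", "Z"], ["M", "N", "O"]]
truths = ["A", "Y", "Q"]
top1 = top_n_accuracy(predictions, truths, 1)
top3 = top_n_accuracy(predictions, truths, 3)
```

Because only the single recorded reaction counts as correct, a chemically valid alternative route is scored as a miss under this metric.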

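The RGN [2] is strengthened by adding the synthons that EGAT predicted unsuccessfully to the RGN training data. When evaluating on the test data, the synthons predicted by EGAT are likewise used to form the RGN input sequences, regardless of whether reaction center identification succeeded. In general, using only training augmentation or only test augmentation has little effect on retrosynthesis performance in either setting (with or without the reaction type). When both training and test augmentation are used, retrosynthesis performance improves markedly, which shows that training data augmentation makes the RGN robust.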
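The training-side augmentation can be sketched as follows. The data layout (dictionaries with `product`, `true_synthons`, `predicted_synthons`, and `reactants` keys) and the function name are assumptions for illustration, as is the choice to pair each mispredicted synthon set with the ground-truth reactants as the target.

```python
def augment_training_pairs(examples):
    """Return (product, synthons, reactants) training triples.

    Every example contributes its ground-truth synthons. When the
    reaction-center model predicted different synthons, that wrong
    prediction is added as an extra input with the same target, so the
    generator can learn to recover from reaction-center errors.
    """
    pairs = []
    for ex in examples:
        pairs.append((ex["product"], ex["true_synthons"], ex["reactants"]))
        if set(ex["predicted_synthons"]) != set(ex["true_synthons"]):
            pairs.append((ex["product"], ex["predicted_synthons"], ex["reactants"]))
    return pairs


# Toy examples with placeholder strings instead of real SMILES.
examples = [
    {"product": "P1", "true_synthons": ["s1", "s2"],
     "predicted_synthons": ["s2", "s1"], "reactants": ["r1", "r2"]},
    {"product": "P2", "true_synthons": ["s3", "s4"],
     "predicted_synthons": ["s5", "s6"], "reactants": ["r3", "r4"]},
]
augmented = augment_training_pairs(examples)
```

At test time the analogous step is simply to feed the generator whatever synthons the reaction-center model predicted, correct or not.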

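1.2.4 Prediction visualization

The auxiliary task helps identify the reaction center. In the visualized example, the bonds highlighted in the two colors and their surrounding structures are very similar, so a shallow network that considers only local information [3] cannot distinguish the true reaction center. Guided by the auxiliary task, EGAT is able to identify the true reaction center.

Note: Pink indicates the disconnection probability predicted by EGAT's main task, reaction center identification. Blue marks the position of the true disconnection.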

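1.3 Discussion and summary

1) A major shared limitation of current retrosynthesis work is the lack of a reasonable evaluation metric. Current metrics consider only the given reaction, so metrics beyond top-n will need to be proposed. In addition, to address issues such as the poor handling of stereochemistry and tautomerization in SMILES, reaction data need not be limited to the USPTO database; the Reaxys database or in-house data can also be considered.

2) The reaction type is very important in the training samples. Because the augmentation strategy proposed for reactant generation introduces noisy training samples, the RGN generally performs worse with augmentation when the reaction type is not given. When the reaction type is given, however, the augmentation improves prediction accuracy.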

References

[1] Wengong Jin, Connor Coley, Regina Barzilay, and Tommi Jaakkola. Predicting organic reaction outcomes with Weisfeiler-Lehman network. In Advances in Neural Information Processing Systems, 2017.

[2] Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M. Rush. OpenNMT: Open-source toolkit for neural machine translation. In Proc. ACL, 2017.

[3] Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1263–1272. JMLR.org, 2017.

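The RetroXpert model is now available on the 极链AI云 platform. Readers who want to reproduce the results can follow the link to view the model.

Details: model details page - retroexpert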
