Training intelligent adversaries using self-play with ML-Agents

In the latest release of the ML-Agents Toolkit (v0.14), we have added a self-play feature that provides the capability to train competitive agents in adversarial games (as in zero-sum games, where one agent’s gain is exactly the other agent’s loss). In this blog post, we provide an overview of self-play and demonstrate how it enables stable and effective training on the Soccer demo environment in the ML-Agents Toolkit.

The Tennis and Soccer example environments of the Unity ML-Agents Toolkit pit agents against one another as adversaries. Training agents in this type of adversarial scenario can be quite challenging. In fact, in previous releases of the ML-Agents Toolkit, reliably training agents in these environments required significant reward engineering. In version 0.14, we have enabled users to train agents in games via reinforcement learning (RL) from self-play, a mechanism fundamental to a number of the most high-profile results in RL, such as OpenAI Five and DeepMind’s AlphaStar. Self-play uses the agent’s current and past ‘selves’ as opponents. This provides a naturally improving adversary against which our agent can gradually improve using traditional RL algorithms. The fully trained agent can be used as competition for advanced human players.

Self-play provides a learning environment analogous to how humans structure competition. For example, a human learning to play tennis would train against opponents of similar skill level because an opponent that is too strong or too weak is not as conducive to learning the game. From the standpoint of improving one’s skills, it would be far more valuable for a beginner-level tennis player to compete against other beginners than, say, against a newborn child or Novak Djokovic. The former couldn’t return the ball, and the latter wouldn’t serve them a ball they could return. When the beginner has achieved sufficient strength, they move on to the next tier of tournament play to compete with stronger opponents.

In this blog post, we give some technical insight into the dynamics of self-play as well as provide an overview of our Tennis and Soccer example environments that have been refactored to showcase self-play.

History of self-play in games

The notion of self-play has a long history in the practice of building artificial agents to solve and compete with humans in games. One of the earliest uses of this mechanism was Arthur Samuel’s checker playing system, which was developed in the ’50s and published in 1959. This system was a precursor to the seminal result in RL, Gerald Tesauro’s TD-Gammon published in 1995. TD-Gammon used the temporal difference learning algorithm TD(λ) with self-play to train a backgammon agent that nearly rivaled human experts. In some cases, it was observed that TD-Gammon had a superior positional understanding to world-class players.

Self-play has been instrumental in a number of contemporary landmark results in RL. Notably, it facilitated the learning of super-human Chess and Go agents, elite DOTA 2 agents, as well as complex strategies and counter strategies in games like wrestling and hide and seek. In results using self-play, the researchers often point out that the agents discover strategies which surprise human experts.

Self-play in games imbues agents with a certain creativity, independent of that of the programmers. The agent is given just the rules of the game and told when it wins or loses. From these first principles, it is up to the agent to discover competent behavior. In the words of the creator of TD-Gammon, this framework for learning is liberating “…in the sense that the program is not hindered by human biases or prejudices that may be erroneous or unreliable.”  This freedom has led agents to uncover brilliant strategies that have changed the way human experts view certain games.

Reinforcement Learning in adversarial games

In a traditional RL problem, an agent tries to learn a behavior policy that maximizes some accumulated reward. The reward signal encodes an agent’s task, such as navigating to a goal state or collecting items. The agent’s behavior is subject to the constraints of the environment. For example, gravity, the presence of obstacles, and the relative influence of the agent’s own actions, such as applying force to move itself, are all environmental constraints. These limit the viable agent behaviors and are the environmental forces the agent must learn to deal with to obtain a high reward. That is, the agent contends with the dynamics of the environment so that it may visit the most rewarding sequences of states.
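
To make this loop concrete, here is a minimal sketch of the agent-environment interaction described above. The ToyEnv and RandomPolicy classes are hypothetical stand-ins for illustration only; they are not part of the ML-Agents Toolkit.

```python
import random

class ToyEnv:
    """A trivial environment: the state is a number the agent tries to drive to zero."""
    def reset(self):
        self.state = random.uniform(-1.0, 1.0)
        return self.state

    def step(self, action):
        # Environment dynamics constrain how the agent's action changes the state.
        self.state += 0.1 * action
        reward = -abs(self.state)            # more reward the closer we are to the goal state 0
        done = abs(self.state) < 0.05
        return self.state, reward, done

class RandomPolicy:
    def act(self, state):
        return random.choice([-1.0, 1.0])    # placeholder for a learned behavior policy

env, policy = ToyEnv(), RandomPolicy()
state, total_reward = env.reset(), 0.0
for _ in range(100):
    action = policy.act(state)               # the agent acts in the environment...
    state, reward, done = env.step(action)   # ...and receives the next state and a reward
    total_reward += reward
    if done:
        break
```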

On the left is the typical RL scenario: an agent acts in the environment and receives the next state and a reward. On the right is the learning scenario wherein the agent competes with an adversary who, from the agent’s perspective, is effectively part of the environment.

In the case of adversarial games, the agent contends not only with the environment dynamics, but also another (possibly intelligent) agent. You can think of the adversary as being embedded in the environment since its actions directly influence the next state the agent observes as well as the reward it receives.

The Tennis example environment from the ML-Agents Toolkit

Let’s consider the ML-Agents Tennis demo. The blue racquet (left) is the learning agent, and the purple racquet (right) is the adversary. To hit the ball over the net, the agent must consider the trajectory of the incoming ball and adjust its angle and speed accordingly to contend with gravity (the environment). However, just getting the ball over the net is only half the battle when there is an adversary. A strong adversary may return a winning shot, causing the agent to lose. A weak adversary may hit the ball into the net. An equal adversary may return the ball, thereby continuing the game. In any case, the next state and reward are determined by both the environment and the adversary. However, in all three situations, the agent hit the same shot. This makes learning in adversarial games and training competitive agent behaviors a difficult problem.

The considerations around an appropriate opponent are not trivial. As demonstrated by the preceding discussion, the relative strength of the opponent has a significant impact on the outcome of an individual game. If an opponent is too strong, it may be too difficult for an agent starting from scratch to improve. On the other hand, if an opponent is too weak, an agent may learn to win, but the learned behavior may not be useful against a different or stronger opponent. Therefore, we need an opponent that is roughly equal in skill (challenging but not too challenging). Additionally, since our agent is improving with each new game, we need the opponent to improve at an equivalent rate.

In self-play, a past snapshot or the current agent is the adversary embedded in the environment.

Self-play to the rescue! The agent itself satisfies both requirements for a fitting opponent. It is certainly roughly equal in skill (to itself) and also improves over time. In this case, it is the agent’s own policy that is embedded in the environment (see figure). For those familiar with curriculum learning, you can think of this as a naturally evolving (also referred to as an auto-curricula) curriculum for training our agent against opponents of increasing strength. Thus, self-play allows us to bootstrap an environment to train competitive agents for adversarial games!

In the following two subsections, we consider more technical aspects of training competitive agents, as well as some details surrounding the usage and implementation of self-play in the ML-Agents Toolkit. These two subsections may be skipped without loss to the main point of this blog post.

Practical considerations

Some practical issues arise from the self-play framework. Specifically, the agent may overfit to defeating a particular playstyle, and training can become unstable due to the non-stationarity of the transition function (i.e., the constantly shifting opponent). The former is an issue because we want our agents to be general competitors and robust to different types of opponents. To illustrate the latter, in the Tennis environment, a different opponent will return the ball at a different angle and speed. From the perspective of the learning agent, this means the same decisions will lead to different next states as training progresses. Traditional RL algorithms assume stationary transition functions. Unfortunately, by supplying the agent with a diverse set of opponents to address the former, we may exacerbate the latter if we are not careful.

To address this, we maintain a buffer of the agent’s past policies, from which we sample the opponents against which the learner competes for a longer duration. By sampling from the agent’s past policies, the agent will see a diverse set of opponents. Furthermore, letting the agent train against a fixed opponent for a longer duration stabilizes the transition function and creates a more consistent learning environment. Additionally, these algorithmic aspects can be managed with the hyperparameters discussed in the next section.
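
As an illustration of this idea (and not the toolkit’s actual internals), the opponent pool can be sketched as follows: snapshots of the learner’s policy are saved periodically, and the opponent for the next stretch of training is either the current self or a sampled past snapshot.

```python
import copy
import random

class OpponentPool:
    """Illustrative sketch of a pool of past policy snapshots used as opponents."""
    def __init__(self, window=10):
        self.window = window              # number of past snapshots to keep
        self.snapshots = []

    def save(self, policy):
        """Snapshot the current policy so it can later serve as a fixed opponent."""
        self.snapshots.append(copy.deepcopy(policy))
        if len(self.snapshots) > self.window:
            self.snapshots.pop(0)         # discard the oldest snapshot

    def sample(self, current_policy, play_current_prob=0.5):
        """Return either the current self or a past snapshot to play against."""
        if not self.snapshots or random.random() < play_current_prob:
            return current_policy
        return random.choice(self.snapshots)
```

The snapshot and swap frequencies, together with how long the learner stays matched against each fixed opponent, are what keep the transition function stable for long enough for traditional RL updates to work.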

Implementation and usage details

With self-play hyperparameter selection, the main consideration is the tradeoff between the skill level and generality of the final policy, and the stability of learning. Training against a set of slowly changing or unchanging adversaries with low diversity results in a more stable learning process than training against a set of quickly changing adversaries with high diversity. The available hyperparameters control how often an agent’s current policy is saved to be used later as a sampled adversary, how often a new adversary is sampled, the number of opponents saved, and the probability of playing against the agent’s current self versus an opponent sampled from the pool. For usage guidelines of the available self-play hyperparameters, please see the self-play documentation in the ML-Agents GitHub repository.
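
For concreteness, these four hyperparameters roughly correspond to a configuration like the sketch below. The key names follow the documentation at the time of writing but may differ between releases, so treat them as illustrative and defer to the self-play docs in the repository.

```python
# Hedged sketch of a self-play configuration; exact key names and default values
# may differ between ML-Agents releases.
self_play_config = {
    "save_steps": 20000,                     # how often the current policy is snapshotted into the pool
    "swap_steps": 25000,                     # how often a new opponent is sampled from the pool
    "window": 10,                            # how many saved snapshots are kept as potential opponents
    "play_against_current_self_ratio": 0.5,  # probability of facing the current self vs. a sampled snapshot
}
```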

In adversarial games, the cumulative environment reward may not be a meaningful metric by which to track learning progress. This is because the cumulative reward is entirely dependent on the skill of the opponent. An agent at a particular skill level will get more or less reward against a worse or better agent, respectively. We provide an implementation of the ELO rating system, a method for calculating the relative skill level between two players from a given population in a zero-sum game. In a given training run, this value should steadily increase. You can track this using TensorBoard along with other training metrics e.g. cumulative reward.
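
For reference, the standard Elo update works as in the sketch below (shown as an illustration of the metric the toolkit reports, not as the toolkit’s own implementation): each player’s rating moves in proportion to how surprising the game’s outcome was.

```python
def expected_score(rating_a, rating_b):
    """Probability that player A beats player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update_elo(rating_a, rating_b, score_a, k=16.0):
    """score_a is 1.0 if A wins, 0.5 for a draw, 0.0 if A loses."""
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta   # zero-sum: B's change mirrors A's

# Example: two evenly rated players; the winner gains exactly what the loser gives up.
print(update_elo(1200.0, 1200.0, score_a=1.0))  # -> (1208.0, 1192.0)
```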

Self-play and the Soccer environment

In recent releases, we have not included an agent policy for our Soccer example environment because it could not be reliably trained. However, with self-play and some refactoring, we are now able to train non-trivial agent behaviors. The most significant change is the removal of “player positions” from the agents. Previously, there was an explicit goalie and striker, which we used to make the gameplay look reasonable. In the video of the new environment below, we actually see role-like, cooperative behavior emerge along these same goalie and striker lines. Now the agents learn to play these positions on their own! The reward function for all four agents is defined as +1.0 for scoring a goal and -1.0 for getting scored on, with an additional per-timestep penalty of -0.0003 to encourage agents to score.
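
As a hedged sketch (the actual rewards are assigned inside the environment’s agent code in Unity, not in Python), the per-agent reward described above can be written as:

```python
PER_TIMESTEP_PENALTY = -0.0003   # small existential penalty that encourages scoring quickly

def soccer_reward(scored_goal: bool, conceded_goal: bool) -> float:
    """Reward for one agent at one timestep, per the scheme described above."""
    reward = PER_TIMESTEP_PENALTY
    if scored_goal:
        reward += 1.0            # the agent's team scored
    if conceded_goal:
        reward -= 1.0            # the agent's team was scored on
    return reward
```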

We emphasize the point that training agents in the Soccer environment led to cooperative behavior without an explicit multi-agent algorithm or assigning roles. This result shows that we can train complicated agent behaviors with simple algorithms as long as we take care in formulating our problem. The key to achieving this is that agents can observe their teammates—that is, agents receive information about their teammate’s relative position as observations. By making an aggressive play toward the ball, the agent implicitly communicates to its teammate that it should drop back on defense. Alternatively, by dropping back on defense, it signals to its teammate that it can move forward on offense. The video above shows the agents picking up on these cues as well as demonstrating general offensive and defensive positioning!

The self-play feature will enable you to train new and interesting adversarial behaviors in your game. If you do use the self-play feature, please let us know how it goes!

Next steps

If you’d like to work on this exciting intersection of machine learning and games, we are hiring for several positions, so please apply!

If you use any of the features provided in this release, we’d love to hear from you. For any feedback regarding the Unity ML-Agents Toolkit, please fill out the following survey and feel free to email us directly. If you encounter any bugs, please reach out to us on the ML-Agents GitHub issues page.  For any general issues and questions, please reach out to us on the Unity ML-Agents forums.

Translated from: https://blogs.unity3d.com/2020/02/28/training-intelligent-adversaries-using-self-play-with-ml-agents/
