Improvements in Deep Q Learning: Dueling Double DQN, Prioritized Experience Replay, and fixed Q-targets

by Thomas Simonini

This article is part of the Deep Reinforcement Learning Course with Tensorflow. Check the syllabus here.

In our last article about Deep Q Learning with Tensorflow, we implemented an agent that learns to play a simple version of Doom. In the video version, we trained a DQN agent that plays Space Invaders.

However, during training, we saw that there was a lot of variability.

Deep Q-Learning was introduced in 2014. Since then, a lot of improvements have been made. So today we'll see four strategies that dramatically improve the training and the results of our DQN agents:

  • fixed Q-targets
  • double DQNs
  • dueling DQN (aka DDQN)
  • Prioritized Experience Replay (aka PER)

We’ll implement an agent that learns to play Doom Deadly Corridor. Our AI must navigate towards the fundamental goal (the vest) while staying alive by killing enemies along the way.

Fixed Q-targets

Theory

We saw in the Deep Q Learning article that, when we want to calculate the TD error (aka the loss), we calculate the difference between the TD target (Q_target) and the current Q value (estimation of Q).

But we don’t have any idea of the real TD target. We need to estimate it. Using the Bellman equation, we saw that the TD target is just the reward of taking that action at that state plus the discounted highest Q value for the next state.

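In standard notation (with w the network weights and γ the discount factor), the TD target described in the sentence above is:

$$
Q_{\text{target}} = R_{t+1} + \gamma \, \max_{a'} Q(s_{t+1}, a'; w)
$$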

However, the problem is that we are using the same parameters (weights) for estimating the target and the Q value. As a consequence, there is a big correlation between the TD target and the parameters (w) we are changing.

Therefore, at every step of training, our Q values shift, but the target value shifts too. We’re getting closer to our target, but the target is also moving. It’s like chasing a moving target! This leads to big oscillations in training.

It’s as if you were a cowboy (the Q estimation) trying to catch a cow (the Q-target): you must get closer to it (reduce the error).

At each time step, you’re trying to approach the cow, which also moves at each time step (because you use the same parameters).

This leads to a very strange path of chasing (a big oscillation in training).

Instead, we can use the idea of fixed Q-targets introduced by DeepMind:

  • Using a separate network with fixed parameters (let’s call it w-) for estimating the TD target (see the formula below).
  • Every tau steps, we copy the parameters from our DQN network to update the target network.
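
With this fixed-parameter network, the TD target from the previous section becomes (w⁻ denoting the frozen copy of the weights):

$$
Q_{\text{target}} = R_{t+1} + \gamma \, \max_{a'} \hat{Q}(s_{t+1}, a'; w^{-})
$$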

Thanks to this procedure, we’ll have more stable learning because the target function stays fixed for a while.

Implementation

Implementing fixed q-targets is pretty straightforward:

  • First, we create two networks (DQNetwork, TargetNetwork).

  • Then, we create a function that takes our DQNetwork parameters and copies them to our TargetNetwork.

  • Finally, during training, we calculate the TD target using our target network. We update the target network with the DQNetwork parameters every tau steps (tau is a hyper-parameter that we define), as in the sketch below.
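
A minimal sketch of that copy function, assuming TF1-style graph code where the two networks were built inside variable scopes named "DQNetwork" and "TargetNetwork" (the scope names are illustrative):

```python
import tensorflow as tf

def update_target_graph():
    # Grab the trainable variables of each network by variable scope name
    from_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, "DQNetwork")
    to_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, "TargetNetwork")

    # One assign op per (source, destination) pair: TargetNetwork <- DQNetwork
    return [to_var.assign(from_var) for from_var, to_var in zip(from_vars, to_vars)]
```

During training, we would run these ops (for example, sess.run(update_target_graph())) every tau steps.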

Double DQNs

Theory

Double DQN, or Double Learning, was introduced by Hado van Hasselt. This method handles the problem of the overestimation of Q-values.

To understand this problem, remember how we calculate the TD Target:

By calculating the TD target, we face a simple problem: how are we sure that the best action for the next state is the action with the highest Q-value?

We know that the accuracy of q values depends on what action we tried and what neighboring states we explored.

As a consequence, at the beginning of training we don’t have enough information about the best action to take. Therefore, taking the maximum Q-value (which is noisy) as the best action can lead to false positives. If non-optimal actions are regularly given a higher Q-value than the optimal action, learning becomes complicated.

The solution is: when we compute the Q target, we use two networks to decouple the action selection from the target Q value generation. We:

  • use our DQN network to select the best action to take for the next state (the action with the highest Q value);
  • use our target network to calculate the target Q value of taking that action at the next state.

Therefore, Double DQN helps us reduce the overestimation of q values and, as a consequence, helps us train faster and have more stable learning.

Implementation
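
A minimal sketch of the decoupled target computation, assuming we already have the Q-value predictions of both networks for the next states of a sampled minibatch (the array names are hypothetical):

```python
import numpy as np

def double_dqn_targets(rewards, dones, q_next_online, q_next_target, gamma=0.95):
    """q_next_online: Q(s', .; w)  from the online DQNetwork, shape [batch, n_actions]
       q_next_target: Q(s', .; w-) from the target network,   shape [batch, n_actions]"""
    # 1) The online network SELECTS the best next action...
    best_actions = np.argmax(q_next_online, axis=1)
    # 2) ...and the target network EVALUATES that action.
    q_eval = q_next_target[np.arange(len(best_actions)), best_actions]
    # TD target; (1 - done) removes the bootstrap term at the end of an episode.
    return rewards + gamma * (1.0 - dones) * q_eval
```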

Dueling DQN (aka DDQN)

Theory

Remember that Q-values correspond to how good it is to be at a given state and take a given action at that state: Q(s,a).

So we can decompose Q(s,a) as the sum of:

  • V(s): the value of being at that state

  • A(s,a): the advantage of taking that action at that state (how much better it is to take this action versus all other possible actions at that state).

With DDQN, we want to separate the estimator of these two elements, using two new streams:

  • one that estimates the state value V(s)

  • one that estimates the advantage of each action A(s,a)

And then we combine these two streams through a special aggregation layer to get an estimate of Q(s,a).

Wait, but why do we need to calculate these two elements separately if we then combine them?

By decoupling the estimation, intuitively our DDQN can learn which states are (or are not) valuable without having to learn the effect of each action at each state (since it’s also calculating V(s)).

With a normal DQN, we need to calculate the value of each action at that state. But what’s the point if the value of the state is bad? What’s the point of calculating the value of every action at a state when all of those actions lead to death?

As a consequence, by decoupling we’re able to calculate V(s). This is particularly useful for states whose actions do not affect the environment in a relevant way. In that case, it’s unnecessary to calculate the value of each action. For instance, moving right or left only matters if there is a risk of collision, and in most states the choice of action has no effect on what happens.

It will be clearer if we take the example in the paper Dueling Network Architectures for Deep Reinforcement Learning.

We see that the value stream pays attention (the orange blur) to the road, and in particular to the horizon where new cars are spawned. It also pays attention to the score.

On the other hand, the advantage stream in the first frame on the right does not pay much attention to the road, because there are no cars in front (so the action choice is practically irrelevant). But, in the second frame it pays attention, as there is a car immediately in front of it, and making a choice of action is crucial and very relevant.

Concerning the aggregation layer, we want to generate the q values for each action at that state. We might be tempted to combine the streams as follows:

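That is, simply summing the two streams:

$$
Q(s,a) = V(s) + A(s,a)
$$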

But if we do that, we’ll fall into the issue of identifiability: given Q(s,a), we’re unable to recover A(s,a) and V(s) uniquely.

And not being able to find V(s) and A(s,a) given Q(s,a) is a problem for our backpropagation. To avoid it, we can force our advantage function estimator to have 0 advantage at the chosen action.

To do that, we subtract the average advantage over all possible actions of the state.
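
In equation form (as in the dueling architecture paper, with |A| the number of possible actions):

$$
Q(s,a) = V(s) + \Big( A(s,a) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s,a') \Big)
$$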

Therefore, this architecture helps us accelerate the training. We can calculate the value of a state without calculating the Q(s,a) for each action at that state. And it can help us find much more reliable Q values for each action by decoupling the estimation between two streams.

Implementation

The only thing to do is to modify the DQN architecture by adding these new streams:

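A minimal sketch of those two streams and the aggregation layer, in TF1-style code (the layer sizes and names are illustrative, not the exact ones from the notebook):

```python
import tensorflow as tf

def dueling_head(features, n_actions):
    """features: flattened conv output, shape [batch, n_features]."""
    # Value stream: V(s), one scalar per state
    value_fc = tf.layers.dense(features, 512, activation=tf.nn.elu, name="value_fc")
    value = tf.layers.dense(value_fc, 1, activation=None, name="value")

    # Advantage stream: A(s, a), one value per action
    adv_fc = tf.layers.dense(features, 512, activation=tf.nn.elu, name="advantage_fc")
    advantage = tf.layers.dense(adv_fc, n_actions, activation=None, name="advantage")

    # Aggregation layer: Q(s,a) = V(s) + (A(s,a) - mean_a' A(s,a'))
    return value + (advantage - tf.reduce_mean(advantage, axis=1, keepdims=True))
```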

Prioritized Experience Replay

Theory

Prioritized Experience Replay (PER) was introduced in 2015 by Tom Schaul. The idea is that some experiences may be more important than others for our training, but might occur less frequently.

Because we sample the batch uniformly (selecting the experiences randomly), these rich but rare experiences have practically no chance of being selected.

That’s why, with PER, we try to change the sampling distribution by using a criterion to define the priority of each tuple of experience.

We want to give priority to experiences where there is a big difference between our prediction and the TD target, since that means we have a lot to learn from them.

We use the magnitude of our TD error, that is, its absolute value:
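
Concretely (as in the PER paper, proportional variant), the priority of experience t is:

$$
p_t = |\delta_t| + e
$$

where δ_t is the TD error and e is a small constant that guarantees every experience has a nonzero chance of being sampled.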

And we store that priority together with each experience in the replay buffer.

But we can’t just do greedy prioritization, because it would lead to always training on the same experiences (those with a big priority), and thus to overfitting.

So we introduce stochastic prioritization, which generates, for each experience, the probability of being chosen for replay.
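
Each experience i is then sampled with probability

$$
P(i) = \frac{p_i^a}{\sum_k p_k^a}
$$

where the hyper-parameter a controls how much prioritization is used (a = 0 brings us back to uniform sampling).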

As a consequence, at each time step we sample a batch according to this probability distribution and train our network on it.

But, we still have a problem here. Remember that with normal Experience Replay, we use a stochastic update rule. As a consequence, the way we sample the experiences must match the underlying distribution they came from.

With normal Experience Replay, we select our experiences uniformly, that is, completely at random. There is no bias, because each experience has the same chance of being selected, so we can update our weights normally.

But, because we use priority sampling, purely random sampling is abandoned. As a consequence, we introduce bias toward high-priority samples (more chances to be selected).

And, if we update our weights normally, we risk overfitting. Samples that have a high priority are likely to be used for training many times compared with low-priority experiences (that’s the bias). As a consequence, we would update our weights using only a small portion of the experiences, the ones we consider really interesting.

To correct this bias, we use importance sampling (IS) weights that adjust the update by reducing the contribution of the often-seen samples.

The weights corresponding to high-priority samples have very little adjustment (because the network will see these experiences many times), whereas those corresponding to low-priority samples will have a full update.

The role of b is to control how much these importance sampling weights affect learning. In practice, the b parameter is annealed up to 1 over the duration of training, because these weights are more important at the end of learning, when our Q-values begin to converge. The unbiased nature of the updates matters most near convergence, as explained in this article.
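
Concretely, the importance sampling weight of experience i is (with N the size of the replay buffer):

$$
w_i = \left( \frac{1}{N} \cdot \frac{1}{P(i)} \right)^b
$$

In practice these weights are also normalized by their maximum, so they only scale updates downwards.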

Implementation

This time, the implementation will be a little bit fancier.

First of all, we can’t implement PER by simply sorting the experience replay buffer according to the priorities. That would not be efficient at all: O(n log n) for insertion and O(n) for sampling.

As explained in this really good article, instead of sorting an array we need to use another data structure: an unsorted sumtree.

A sumtree is a binary tree, that is, a tree with at most two children per node. The leaves (deepest nodes) contain the priority values, and a data array aligned with the leaves contains the experiences.

Updating the tree and sampling will be really efficient (O(log n)).

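A minimal sketch of such a sumtree (the API, names, and storage layout are illustrative; the notebook's version differs in details):

```python
import numpy as np

class SumTree:
    """Leaves hold the priorities; each internal node holds the sum of its children."""

    def __init__(self, capacity):
        self.capacity = capacity                       # max number of experiences
        self.tree = np.zeros(2 * capacity - 1)         # internal nodes + leaves
        self.data = np.zeros(capacity, dtype=object)   # experiences, aligned with the leaves
        self.write = 0                                 # index of the next leaf to fill

    def add(self, priority, experience):
        leaf = self.write + self.capacity - 1
        self.data[self.write] = experience
        self.update(leaf, priority)
        self.write = (self.write + 1) % self.capacity  # overwrite the oldest when full

    def update(self, leaf, priority):
        change = priority - self.tree[leaf]
        self.tree[leaf] = priority
        while leaf != 0:                               # propagate the change upward: O(log n)
            leaf = (leaf - 1) // 2
            self.tree[leaf] += change

    def get_leaf(self, value):
        """Find the leaf whose cumulative priority range contains `value`."""
        parent = 0
        while True:
            left, right = 2 * parent + 1, 2 * parent + 2
            if left >= len(self.tree):                 # no children: we reached a leaf
                leaf = parent
                break
            if value <= self.tree[left]:
                parent = left
            else:
                value -= self.tree[left]
                parent = right
        return leaf, self.tree[leaf], self.data[leaf - self.capacity + 1]

    @property
    def total_priority(self):
        return self.tree[0]                            # the root stores the total sum
```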

Then, we create a memory object that will contain our sumtree and data.

Next, to sample a minibatch of size k, the range [0, total_priority] will be divided into k ranges. A value is uniformly sampled from each range.

Finally, the transitions (experiences) that correspond to each of these sampled values are retrieved from the sumtree.

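Putting the last two steps together, a sketch of the sampling routine (using the hypothetical SumTree above):

```python
import numpy as np

def sample_minibatch(tree, k):
    """Split [0, total_priority] into k equal segments and draw one value from each."""
    leaf_indices, priorities, batch = [], [], []
    segment = tree.total_priority / k
    for i in range(k):
        value = np.random.uniform(segment * i, segment * (i + 1))
        leaf, priority, experience = tree.get_leaf(value)
        leaf_indices.append(leaf)
        priorities.append(priority)
        batch.append(experience)
    return leaf_indices, priorities, batch
```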

It will be much clearer when we dive into the complete details in the notebook.

Doom Deathmatch agent

This agent is a Dueling Double Deep Q Learning agent with PER and fixed Q-targets.

We made a video tutorial of the implementation:

The notebook is here

That’s all! You’ve just created a smarter agent that learns to play Doom. Awesome! Remember that if you want an agent with really good performance, you need many more GPU hours (about two days of training)!

However, with only 2–3 hours of training on CPU (yes, CPU), our agent understood that it needed to kill enemies before being able to move forward. If it moves forward without killing enemies, it gets killed before reaching the vest.

Don’t forget to implement each part of the code by yourself. It’s really important to try to modify the code I gave you. Try to add epochs, change the architecture, add fixed Q-values, change the learning rate, use a harder environment…and so on. Experiment, have fun!

Remember that this was a big article, so be sure to really understand why we use these new strategies, how they work, and the advantages of using them.

In the next article, we’ll learn about an awesome hybrid method between value-based and policy-based reinforcement learning algorithms. It is a baseline for state-of-the-art algorithms: Advantage Actor Critic (A2C). You’ll implement an agent that learns to play Outrun!

If you liked my article, please clap below as many times as you liked it, so other people will see this here on Medium. And don’t forget to follow me!

If you have any thoughts, comments, questions, feel free to comment below or send me an email: hello@simoninithomas.com, or tweet me @ThomasSimonini.

Keep learning, stay awesome!

Deep Reinforcement Learning Course with Tensorflow

Syllabus

Video version

Part 1: An introduction to Reinforcement Learning

Part 2: Diving deeper into Reinforcement Learning with Q-Learning

Part 3: An introduction to Deep Q-Learning: let’s play Doom

Part 3+: Improvements in Deep Q Learning: Dueling Double DQN, Prioritized Experience Replay, and fixed Q-targets

Part 4: An introduction to Policy Gradients with Doom and Cartpole

Part 5: An intro to Advantage Actor Critic methods: let’s play Sonic the Hedgehog!

Part 6: Proximal Policy Optimization (PPO) with Sonic the Hedgehog 2 and 3

Part 7: Curiosity-Driven Learning made easy Part I

Translated from: https://www.freecodecamp.org/news/improvements-in-deep-q-learning-dueling-double-dqn-prioritized-experience-replay-and-fixed-58b130cc5682/
