Assignment #2 - solution    By Jonariguez

All code for the programming problems has been uploaded to github/CS224n/Jonariguez.

Solution:
(Hint: the keepdims parameter makes this more convenient.)

def softmax(x):
    """Compute the softmax function in tensorflow.

    You might find the tensorflow functions tf.exp, tf.reduce_max,
    tf.reduce_sum, tf.expand_dims useful. (Many solutions are possible, so you may
    not need to use all of these functions). Recall also that many common
    tensorflow operations are sugared (e.g. x * y does a tensor multiplication
    if x and y are both tensors). Make sure to implement the numerical stability
    fixes as in the previous homework!

    Args:
        x:   tf.Tensor with shape (n_samples, n_features). Note feature vectors are
             represented by row-vectors. (For simplicity, no need to handle 1-d
             input as in the previous homework)
    Returns:
        out: tf.Tensor with shape (n_samples, n_features). You need to construct this
             tensor in this problem.
    """
    ### YOUR CODE HERE
    # As in Assignment 1, subtract the row-wise maximum before exponentiating.
    # keepdims=True preserves the original rank so broadcasting works,
    # instead of collapsing the result into a row vector.
    x_max = tf.reduce_max(x, axis=1, keepdims=True)
    x = tf.exp(x - x_max)
    x_sum = tf.reduce_sum(x, axis=1, keepdims=True)
    out = x / x_sum
    # out = x / tf.reshape(tf.reduce_sum(x, axis=1), (x.shape[0], 1))
    ### END YOUR CODE
    return out
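A quick sanity check (a minimal sketch, assuming the TF 1.x session API and the softmax defined above; the large inputs are chosen so that the stability fix matters):

import tensorflow as tf

logits = tf.constant([[1001., 1002.], [3., 4.]])
with tf.Session() as sess:
    print(sess.run(softmax(logits)))
# [[0.2689 0.7311]
#  [0.2689 0.7311]]  -- each row sums to 1 and exp() does not overflow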

Solution:
(Useful to remember:

  • tf.multiply() is element-wise multiplication and requires matching shapes (up to broadcasting).
  • tf.matmul() is matrix multiplication.
    Both require the two operands to have the same element dtype. A small demo follows.
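A small demo of the difference (a minimal sketch, assuming TF 1.x):

import tensorflow as tf

a = tf.constant([[1., 2.], [3., 4.]])
b = tf.constant([[5., 6.], [7., 8.]])
with tf.Session() as sess:
    print(sess.run(tf.multiply(a, b)))  # element-wise: [[5, 12], [21, 32]]
    print(sess.run(tf.matmul(a, b)))    # matrix product: [[19, 22], [43, 50]]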
def cross_entropy_loss(y, yhat):
    """Compute the cross entropy loss in tensorflow.

    The loss should be summed over the current minibatch.

    y is a one-hot tensor of shape (n_samples, n_classes) and yhat is a tensor
    of shape (n_samples, n_classes). y should be of dtype tf.int32, and yhat should
    be of dtype tf.float32.

    The functions tf.to_float, tf.reduce_sum, and tf.log might prove useful. (Many
    solutions are possible, so you may not need to use all of these functions).

    Note: You are NOT allowed to use the tensorflow built-in cross-entropy
    functions.

    Args:
        y:    tf.Tensor with shape (n_samples, n_classes). One-hot encoded.
        yhat: tf.Tensor with shape (n_samples, n_classes). Each row encodes a
              probability distribution and should sum to 1.
    Returns:
        out:  tf.Tensor with shape (1,) (Scalar output). You need to construct this
              tensor in the problem.
    """
    ### YOUR CODE HERE
    # The first dimension of y and yhat is n_samples, i.e. the batch size.
    # For each sample we compute one cross-entropy term (a scalar), then sum
    # the n_samples terms, so the final result is also a scalar.
    single_CE = tf.multiply(tf.log(yhat), tf.to_float(y))
    out = tf.negative(tf.reduce_sum(single_CE))
    ### END YOUR CODE
    return out

Solution:
Placeholders and feed_dict let us feed data into the computation graph dynamically at run time. (TensorFlow builds a static graph.) A minimal example follows.
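A minimal sketch of the mechanism, assuming TF 1.x:

import tensorflow as tf

x = tf.placeholder(tf.float32, shape=[None, 2])  # the graph is built once...
y = tf.reduce_sum(x * 2.0, axis=1)
with tf.Session() as sess:
    # ...and data is "fed" in at run time through feed_dict.
    print(sess.run(y, feed_dict={x: [[1., 2.], [3., 4.]]}))  # [ 6. 14.]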

def add_placeholders(self):
    """Generates placeholder variables to represent the input tensors.

    These placeholders are used as inputs by the rest of the model building
    and will be fed data during training.

    Adds following nodes to the computational graph
        input_placeholder:  Input placeholder tensor of shape
                            (batch_size, n_features), type tf.float32
        labels_placeholder: Labels placeholder tensor of shape
                            (batch_size, n_classes), type tf.int32

    Add these placeholders to self as the instance variables
        self.input_placeholder
        self.labels_placeholder
    """
    ### YOUR CODE HERE
    self.input_placeholder = tf.placeholder(
        tf.float32,
        shape=[self.config.batch_size, self.config.n_features],
        name='input_placeholder')
    self.labels_placeholder = tf.placeholder(
        tf.int32,
        shape=[self.config.batch_size, self.config.n_classes],
        name='labels_placeholder')
    ### END YOUR CODE
def create_feed_dict(self, inputs_batch, labels_batch=None):
    """Creates the feed_dict for training the given step.

    A feed_dict takes the form of:
    feed_dict = {
        <placeholder>: <tensor of values to be passed for placeholder>,
        ....
    }

    If labels_batch is None, then no labels are added to feed_dict.

    Hint: The keys for the feed_dict should be the placeholder
          tensors created in add_placeholders.

    Args:
        inputs_batch: A batch of input data.
        labels_batch: A batch of label data.
    Returns:
        feed_dict: The feed dictionary mapping from placeholders to values.
    """
    ### YOUR CODE HERE
    # A feed_dict is just a Python dict. Note that its keys are the
    # tf.placeholder objects defined earlier, not their string names.
    # Per the docstring, skip the labels entry when labels_batch is None.
    feed_dict = {self.input_placeholder: inputs_batch}
    if labels_batch is not None:
        feed_dict[self.labels_placeholder] = labels_batch
    ### END YOUR CODE
    return feed_dict

def add_prediction_op(self):
    """Adds the core transformation for this model which transforms a batch of input
    data into a batch of predictions. In this case, the transformation is a linear layer plus a
    softmax transformation:

    y = softmax(Wx + b)

    Hint: Make sure to create tf.Variables as needed.
    Hint: For this simple use-case, it's sufficient to initialize both weights W
          and biases b with zeros.

    Args:
        input_data: A tensor of shape (batch_size, n_features).
    Returns:
        pred: A tensor of shape (batch_size, n_classes)
    """
    ### YOUR CODE HERE
    # x is the input, i.e. the placeholder input_placeholder.
    # W and b are the variables we define, and the ones to be trained:
    #     pred = softmax(xW + b)
    with tf.variable_scope('softmax_classifier'):
        W = tf.Variable(tf.zeros([self.config.n_features, self.config.n_classes], dtype=tf.float32))
        b = tf.Variable(tf.zeros([self.config.n_classes], dtype=tf.float32))
        Z = tf.matmul(self.input_placeholder, W) + b
        pred = softmax(Z)
    ### END YOUR CODE
    return pred
def add_loss_op(self, pred):
    """Adds cross_entropy_loss ops to the computational graph.

    Hint: Use the cross_entropy_loss function we defined. This should be a very
          short function.

    Args:
        pred: A tensor of shape (batch_size, n_classes)
    Returns:
        loss: A 0-d tensor (scalar)
    """
    ### YOUR CODE HERE
    # cross_entropy_loss() is already implemented in q1_softmax.py, so we can
    # call it directly. self.labels_placeholder holds the true labels that get
    # fed in; pred holds our predictions.
    loss = cross_entropy_loss(self.labels_placeholder, pred)
    ### END YOUR CODE
    return loss

def add_training_op(self, loss):
    """Sets up the training Ops.

    Creates an optimizer and applies the gradients to all trainable variables.
    The Op returned by this function is what must be passed to the
    `sess.run()` call to cause the model to train. See
    https://www.tensorflow.org/versions/r0.7/api_docs/python/train.html#Optimizer
    for more information.

    Hint: Use tf.train.GradientDescentOptimizer to get an optimizer object.
          Calling optimizer.minimize() will return a train_op object.

    Args:
        loss: Loss tensor, from cross_entropy_loss.
    Returns:
        train_op: The Op for training.
    """
    ### YOUR CODE HERE
    # Simply return the op produced by the optimizer.
    train_op = tf.train.GradientDescentOptimizer(self.config.lr).minimize(loss)
    ### END YOUR CODE
    return train_op

Solution:
TensorFlow's automatic gradients mean that we only need to define the nodes of the computation graph; we never implement gradient computation ourselves. Backpropagation and differentiation are carried out automatically by TensorFlow, as the sketch below shows.
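A minimal sketch, assuming TF 1.x:

import tensorflow as tf

x = tf.Variable(3.0)
loss = x * x + 2.0 * x              # we only define the forward graph
grad = tf.gradients(loss, [x])[0]   # TensorFlow derives d(loss)/dx = 2x + 2 itself
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(grad))  # 8.0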


Solution:

stack                          | buffer                | new dependency      | transition
-------------------------------+-----------------------+---------------------+-----------
[ROOT, parsed, this]           | [sentence, correctly] |                     | SHIFT
[ROOT, parsed, this, sentence] | [correctly]           |                     | SHIFT
[ROOT, parsed, sentence]       | [correctly]           | sentence -> this    | LEFT-ARC
[ROOT, parsed]                 | [correctly]           | parsed -> sentence  | RIGHT-ARC
[ROOT, parsed, correctly]      | []                    |                     | SHIFT
[ROOT, parsed]                 | []                    | parsed -> correctly | RIGHT-ARC
[ROOT]                         | []                    | ROOT -> parsed      | RIGHT-ARC


Solution:
2n steps in total (for a concrete count: the four-word sentence above takes 4 SHIFTs plus 4 ARCs, i.e. 8 transitions):

  • Every word must enter the stack exactly once, so there are n SHIFT operations.
  • At the end only ROOT remains on the stack, and each ARC removes exactly one word from the stack, so there are n LEFT-ARC/RIGHT-ARC operations in total.

def __init__(self, sentence):
    """Initializes this partial parse.

    Your code should initialize the following fields:
        self.stack: The current stack represented as a list with the top of the stack as the
                    last element of the list.
        self.buffer: The current buffer represented as a list with the first item on the
                     buffer as the first item of the list
        self.dependencies: The list of dependencies produced so far. Represented as a list of
                           tuples where each tuple is of the form (head, dependent).
                           Order for this list doesn't matter.

    The root token should be represented with the string "ROOT"

    Args:
        sentence: The sentence to be parsed as a list of words.
                  Your code should not modify the sentence.
    """
    # The sentence being parsed is kept for bookkeeping purposes. Do not use it in your code.
    self.sentence = sentence
    ### YOUR CODE HERE
    self.stack = ['ROOT']
    # Do not write self.buffer = sentence: that would make self.buffer a
    # reference to sentence, so mutating self.buffer would mutate sentence,
    # which the docstring forbids. Copy it instead.
    self.buffer = [word for word in self.sentence]
    self.dependencies = []
    ### END YOUR CODE

def parse_step(self, transition):
    """Performs a single parse step by applying the given transition to this partial parse

    Args:
        transition: A string that equals "S", "LA", or "RA" representing the shift, left-arc,
                    and right-arc transitions.
    """
    ### YOUR CODE HERE
    # To restate the transitions:
    #   S:  pop the leftmost word off the buffer and push it onto the stack.
    #   LA: the top word of the stack is the head, the second word the dependent.
    #   RA: the top word of the stack is the dependent, the second word the head.
    # LA and RA both append a (head, dependent) tuple to self.dependencies.
    # Using the return value of list.pop() keeps the code short.
    if transition == 'S':
        self.stack.append(self.buffer.pop(0))
    elif transition == 'LA':
        self.dependencies.append((self.stack[-1], self.stack.pop(-2)))
    else:
        self.dependencies.append((self.stack[-2], self.stack.pop(-1)))
    ### END YOUR CODE
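As a quick check (not part of the assignment code), stepping through the example sentence from the table above reproduces exactly the dependencies listed there:

pp = PartialParse(["parsed", "this", "sentence", "correctly"])
for t in ["S", "S", "S", "LA", "RA", "S", "RA", "RA"]:
    pp.parse_step(t)
print(pp.dependencies)
# [('sentence', 'this'), ('parsed', 'sentence'), ('parsed', 'correctly'), ('ROOT', 'parsed')]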

def minibatch_parse(sentences, model, batch_size):
    """Parses a list of sentences in minibatches using a model.

    Args:
        sentences: A list of sentences to be parsed (each sentence is a list of words)
        model: The model that makes parsing decisions. It is assumed to have a function
               model.predict(partial_parses) that takes in a list of PartialParses as input and
               returns a list of transitions predicted for each parse. That is, after calling
                   transitions = model.predict(partial_parses)
               transitions[i] will be the next transition to apply to partial_parses[i].
        batch_size: The number of PartialParses to include in each minibatch
    Returns:
        dependencies: A list where each element is the dependencies list for a parsed sentence.
                      Ordering should be the same as in sentences (i.e., dependencies[i] should
                      contain the parse for sentences[i]).
    """
    ### YOUR CODE HERE
    # Create one PartialParse per sentence.
    partial_parses = [PartialParse(sentence) for sentence in sentences]
    dependencies = []
    start_idx, end_idx = 0, 0
    while end_idx < len(sentences):
        end_idx = min(start_idx + batch_size, len(sentences))
        # Take a minibatch of batch_size parses.
        batch_parses = partial_parses[start_idx:end_idx]
        # Ask the model for the next transition of every parse in the batch.
        # Note: model.predict(x) only advances each parse in x by ONE step.
        while len(batch_parses) > 0:
            transitions = model.predict(batch_parses)
            for i in range(len(transitions)):
                batch_parses[i].parse_step(transitions[i])
            # Drop the parses that are already finished and keep the rest.
            # A parse is finished when its buffer is empty and its stack holds
            # only ROOT, i.e. len(buffer) == 0 and len(stack) == 1.
            batch_parses = [parse for parse in batch_parses
                            if len(parse.buffer) > 0 or len(parse.stack) > 1]
        dependencies.extend([parse.dependencies for parse in partial_parses[start_idx:end_idx]])
        # Remember to advance start_idx.
        start_idx += batch_size
    ### END YOUR CODE
    return dependencies
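To exercise minibatch_parse without a trained model, one can use a stub in the spirit of the assignment's own tests (this particular DummyModel is hypothetical): it predicts SHIFT while a parse's buffer is non-empty and RIGHT-ARC afterwards, producing right-branching parses.

class DummyModel(object):
    """Hypothetical stub: SHIFT until the buffer is empty, then RIGHT-ARC."""
    def predict(self, partial_parses):
        return ["S" if len(p.buffer) > 0 else "RA" for p in partial_parses]

deps = minibatch_parse([["right", "arcs", "only"],
                        ["again", "right", "arcs", "only"]],
                       DummyModel(), batch_size=2)
print(deps[0])  # [('arcs', 'only'), ('right', 'arcs'), ('ROOT', 'right')]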

def xavier_weight_init():
    """Returns function that creates random tensor.

    The specified function will take in a shape (tuple or 1-d array) and
    returns a random tensor of the specified shape drawn from the
    Xavier initialization distribution.

    Hint: You might find tf.random_uniform useful.
    """
    def _xavier_initializer(shape, **kwargs):
        """Defines an initializer for the Xavier distribution.

        Specifically, the output should be sampled uniformly from [-epsilon, epsilon] where
            epsilon = sqrt(6) / <sum of the sizes of shape's dimensions>
        e.g., if shape = (2, 3), epsilon = sqrt(6 / (2 + 3))

        This function will be used as a variable initializer.

        Args:
            shape: Tuple or 1-d array that specifies the dimensions of the requested tensor.
        Returns:
            out: tf.Tensor of specified shape sampled from the Xavier distribution.
        """
        ### YOUR CODE HERE
        epsilon = tf.sqrt(6.0 / tf.to_float(tf.reduce_sum(shape)))
        # Return the sampled tensor directly; an initializer should produce a
        # tf.Tensor, not a tf.Variable.
        out = tf.random_uniform(shape, minval=-epsilon, maxval=epsilon)
        ### END YOUR CODE
        return out
    # Returns defined initializer function.
    return _xavier_initializer
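For example (a minimal sketch, assuming TF 1.x and the function above in scope):

import tensorflow as tf

init_fn = xavier_weight_init()
W_init = init_fn((3, 2))  # entries lie in [-eps, eps], eps = sqrt(6/(3+2)) ≈ 1.095
with tf.Session() as sess:
    print(sess.run(W_init))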


Solution:
$$\mathbb{E}_{p_{drop}}[\mathbf{h}_{drop}]=\mathbb{E}_{p_{drop}}[\gamma\,\mathbf{d}\circ \mathbf{h}]=p_{drop}\cdot \vec{0}+(1-p_{drop})\cdot\gamma\cdot\mathbf{h}=\mathbf{h}$$
from which we get:
$$\gamma=\frac{1}{1-p_{drop}}$$
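A quick numerical check of this expectation (a sketch in numpy):

import numpy as np

p_drop, n_trials = 0.5, 200000
h = np.array([1.0, 2.0, 3.0])
gamma = 1.0 / (1.0 - p_drop)
# each entry of d is 0 with probability p_drop and 1 otherwise
samples = [gamma * (np.random.rand(3) >= p_drop) * h for _ in range(n_trials)]
print(np.mean(samples, axis=0))  # ≈ [1. 2. 3.] = h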

Solution:
Because $\mathbf{m}$ is a weighted average of all previous gradients (update directions), it reflects the overall trend of the gradient better than any single gradient does: it reduces the variance of the updates and thus avoids oscillation.
$\beta_1$ is usually chosen close to 1.


Solution:

  • Update term $\mathbf{m}$: a rolling (exponential moving) average of the gradients.
  • Learning-rate term $\mathbf{v}$: a rolling average of the squared gradients.

The parameters whose rolling average of squared gradients is smallest receive the largest relative updates. In other words, even where the loss surface is flat and the gradients with respect to those parameters are small, they can still move quickly toward the optimum. A sketch of the update rule follows.
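A minimal sketch of one Adam step in numpy (hypothetical variable names; bias correction omitted, as in the assignment's description):

import numpy as np

def adam_step(theta, grad, m, v, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # m: rolling average of gradients; v: rolling average of squared gradients
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # parameters with small v (consistently small gradients) get relatively larger steps
    theta = theta - lr * m / (np.sqrt(v) + eps)
    return theta, m, v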


Solution:
My results:

Epoch 10 out of 10
924/924 [============================>.] - ETA: 0s - train loss: 0.0654
Evaluating on dev set - dev UAS: 88.37
New best dev UAS! Saving model in ./data/weights/parser.weights
===========================================================================
TESTING
===========================================================================
Restoring the best model weights found on the dev set
Final evaluation on test set
- test UAS: 88.84

Running time: about 15 minutes.


Problem interpretation
First, pin down the dimensions of every quantity in the problem.
From the problem statement, $x^{(t)}$ is a one-hot row vector, and the hidden layer is also a row vector, so:
$$x^{(t)}\in \mathbb{R}^{1\times |V|}$$
$$h^{(t)}\in \mathbb{R}^{1\times D_h}$$

$\hat{y}^{(t)}$ is the output, i.e. the probability distribution over the vocabulary (after the softmax), so:
$$\hat{y}^{(t)}\in \mathbb{R}^{1\times |V|}$$
From this we can derive:
$$L\in \mathbb{R}^{|V|\times d}$$
$$e^{(t)}\in \mathbb{R}^{1\times d}$$
$$I\in \mathbb{R}^{d\times D_h}$$
$$H\in \mathbb{R}^{D_h\times D_h}$$
$$b_1\in \mathbb{R}^{1\times D_h}$$
$$U\in \mathbb{R}^{D_h\times |V|}$$
$$b_2\in \mathbb{R}^{1\times |V|}$$

where $d$ is the dimensionality of the word vectors, i.e. embed_size in the code.
With these dimensions settled, the derivatives below become much clearer; a quick shape check follows.
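A numpy sketch of the forward pass that verifies these shapes (all sizes here are hypothetical):

import numpy as np

V, d, Dh = 10, 4, 5                   # hypothetical sizes
x = np.eye(V)[[3]]                    # (1, |V|) one-hot row vector
L = np.random.randn(V, d)
I, H = np.random.randn(d, Dh), np.random.randn(Dh, Dh)
b1, U, b2 = np.zeros((1, Dh)), np.random.randn(Dh, V), np.zeros((1, V))
h_prev = np.zeros((1, Dh))

e = x @ L                                           # (1, d)
h = 1 / (1 + np.exp(-(h_prev @ H + e @ I + b1)))    # (1, Dh)
theta = h @ U + b2                                  # (1, |V|)
yhat = np.exp(theta) / np.exp(theta).sum()          # (1, |V|), sums to 1
print(e.shape, h.shape, yhat.shape)                 # (1, 4) (1, 5) (1, 10)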

Because sentences vary in length, while the loss is defined per word and then summed over the whole sentence, we must divide by the sentence length to obtain the average loss per word; otherwise the losses of different sentences are not comparable.


Solution:
Since the label $y^{(t)}$ is a one-hot vector, suppose its true class is $k$.
Then:
$$J^{(t)}(\theta)=CE(y^{(t)},\hat{y}^{(t)})=-\log \hat{y}_k^{(t)}=\log \frac{1}{\hat{y}_k^{(t)}}$$
$$PP^{(t)}(y^{(t)},\hat{y}^{(t)})=\frac{1}{\hat{y}_k^{(t)}}$$
It follows immediately that:
$$CE(y^{(t)},\hat{y}^{(t)})=\log PP^{(t)}(y^{(t)},\hat{y}^{(t)})$$

This is a very common identity and well worth remembering.

For $|V|=10000$, a uniformly random guess is correct with probability $\frac{1}{|V|}=\frac{1}{10000}$, so the perplexity is $\frac{1}{1/|V|}=10000$ and $CE=\log 10000\approx 9.21$.
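Checking the arithmetic (natural log, as in the cross-entropy definition):

import numpy as np
V = 10000
print(np.log(V))  # 9.2103...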


Solution:
From the problem statement, $L_{x^{(t)}}=e^{(t)}$.
Define:
$$v^{(t)}=h^{(t-1)}H+e^{(t)}I+b_1$$
$$\theta^{(t)}=h^{(t)}U+b_2$$

The forward pass is then:
$$e^{(t)}=x^{(t)}L$$
$$v^{(t)}=h^{(t-1)}H+e^{(t)}I+b_1$$
$$h^{(t)}=\mathrm{sigmoid}(v^{(t)})$$
$$\theta^{(t)}=h^{(t)}U+b_2$$
$$\hat{y}^{(t)}=\mathrm{softmax}(\theta^{(t)})$$
$$J^{(t)}=CE(y^{(t)},\hat{y}^{(t)})$$

Backward pass. Intermediate quantities:
$$\delta_1^{(t)}=\frac{\partial J^{(t)}}{\partial \theta^{(t)}}=\hat{y}^{(t)}-y^{(t)}$$
$$\delta_2^{(t)}=\frac{\partial J^{(t)}}{\partial v^{(t)}}=\frac{\partial J^{(t)}}{\partial \theta^{(t)}}\cdot\frac{\partial \theta^{(t)}}{\partial h^{(t)}}\cdot\frac{\partial h^{(t)}}{\partial v^{(t)}}=\left((\hat{y}^{(t)}-y^{(t)})\cdot U^{T}\right)\circ h^{(t)}\circ (1-h^{(t)})$$

Then:
$$\frac{\partial J^{(t)}}{\partial b_2}=\frac{\partial J^{(t)}}{\partial \theta^{(t)}}\cdot\frac{\partial \theta^{(t)}}{\partial b_2}=\delta_1^{(t)}$$
$$\frac{\partial J^{(t)}}{\partial H}\Big\rvert_t = \frac{\partial J^{(t)}}{\partial v^{(t)}}\frac{\partial v^{(t)}}{\partial H}\Big\rvert_t=(h^{(t-1)})^T\cdot\delta_2^{(t)}$$
$$\frac{\partial J^{(t)}}{\partial I}\Big\rvert_t = \frac{\partial J^{(t)}}{\partial v^{(t)}}\frac{\partial v^{(t)}}{\partial I}\Big\rvert_t=(e^{(t)})^T\cdot\delta_2^{(t)}$$
$$\frac{\partial J^{(t)}}{\partial L_{x^{(t)}}} =\frac{\partial J^{(t)}}{\partial e^{(t)}}= \frac{\partial J^{(t)}}{\partial v^{(t)}}\cdot \frac{\partial v^{(t)}}{\partial e^{(t)}}=\delta_2^{(t)}\cdot I^T$$
$$\frac{\partial J^{(t)}}{\partial h^{(t-1)}}= \frac{\partial J^{(t)}}{\partial v^{(t)}}\cdot \frac{\partial v^{(t)}}{\partial h^{(t-1)}}=\delta_2^{(t)}\cdot H^T$$
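These formulas are easy to verify numerically. A minimal numpy sketch with hypothetical small dimensions, checking the gradient of $H$ against finite differences:

import numpy as np

np.random.seed(0)
d, Dh, V = 4, 5, 6  # hypothetical small sizes

e = np.random.randn(1, d)        # e^(t)
h_prev = np.random.randn(1, Dh)  # h^(t-1)
H, I = np.random.randn(Dh, Dh), np.random.randn(d, Dh)
b1 = np.random.randn(1, Dh)
U, b2 = np.random.randn(Dh, V), np.random.randn(1, V)
y = np.zeros((1, V)); y[0, 2] = 1.0  # one-hot label

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def loss(H_):
    # forward pass as defined above, viewed as a function of H
    h = sigmoid(h_prev @ H_ + e @ I + b1)
    theta = h @ U + b2
    yhat = np.exp(theta - theta.max()); yhat = yhat / yhat.sum()
    return -np.sum(y * np.log(yhat))

# analytic gradient: dJ/dH |_t = (h^(t-1))^T . delta2
h = sigmoid(h_prev @ H + e @ I + b1)
theta = h @ U + b2
yhat = np.exp(theta - theta.max()); yhat = yhat / yhat.sum()
delta2 = ((yhat - y) @ U.T) * h * (1 - h)
dH = h_prev.T @ delta2

# finite-difference check
num, epsf = np.zeros_like(H), 1e-6
for i in range(Dh):
    for j in range(Dh):
        Hp, Hm = H.copy(), H.copy()
        Hp[i, j] += epsf; Hm[i, j] -= epsf
        num[i, j] = (loss(Hp) - loss(Hm)) / (2 * epsf)
print(np.abs(num - dH).max())  # ~1e-9: the formula checks out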

If the derivatives in the backward pass above are unclear, here is a short explanation.

Consider the following derivative:
$$\frac{\partial J}{\partial x}=\frac{\partial J}{\partial u_1}\frac{\partial u_1}{\partial u_2}\cdots\frac{\partial u_{m}}{\partial v}\frac{\partial v}{\partial x}$$

Suppose everything except $\frac{\partial v}{\partial x}$ has already been computed:
$$\frac{\partial J}{\partial u_1}\frac{\partial u_1}{\partial u_2}\cdots\frac{\partial u_{m}}{\partial v}=\delta$$

Only $\frac{\partial v}{\partial x}$ remains, and there are two cases to consider.

  1. $v$ is a row vector $r$ times a matrix $M$, and we differentiate with respect to the matrix $M$:
     $$\frac{\partial v}{\partial x}=\frac{\partial }{\partial M}(rM)$$

The result is $r^T$ left-multiplied onto the accumulated factor $\delta$, i.e.:

$$\frac{\partial J}{\partial x}=r^T\cdot\delta$$

In this problem, concretely:
$$\frac{\partial J^{(t)}}{\partial v^{(t)}}=\delta_2^{(t)}$$
$$\frac{\partial v^{(t)}}{\partial H}=\frac{\partial }{\partial H}(h^{(t-1)}H+e^{(t)}I+b_1)=\frac{\partial }{\partial H}(h^{(t-1)}H)=(h^{(t-1)})^T$$
so:
$$\frac{\partial J^{(t)}}{\partial H}\Big\rvert_t=\frac{\partial J^{(t)}}{\partial v^{(t)}}\frac{\partial v^{(t)}}{\partial H}\Big\rvert_t=(h^{(t-1)})^T\cdot\delta_2^{(t)}$$

  2. $v$ is a row vector $r$ times a matrix $M$, and we differentiate with respect to the row vector $r$:

$$\frac{\partial v}{\partial x}=\frac{\partial }{\partial r}(rM)$$

The result is $M^T$ right-multiplied onto the accumulated factor $\delta$, i.e.:

$$\frac{\partial J}{\partial x}=\delta\cdot M^T$$

In this problem, concretely:
$$\frac{\partial J^{(t)}}{\partial h^{(t-1)}}= \frac{\partial J^{(t)}}{\partial v^{(t)}}\cdot \frac{\partial v^{(t)}}{\partial h^{(t-1)}}=\delta_2^{(t)}\cdot H^T$$


Solution:
Backpropagation in an RNN is backpropagation through time: the loss $J^{(t)}$ at time step $t$ must be propagated backwards through the earlier time steps. For convenience, define the error term of $J^{(t)}$ with respect to each time step's hidden state:
$$\delta^{(t)}=\frac{\partial J^{(t)}}{\partial h^{(t)}}$$

Now derive how the error term propagates:
$$\delta^{(t)}=\frac{\partial J^{(t)}}{\partial h^{(t)}}=(\hat{y}^{(t)}-y^{(t)})\cdot U^T$$
$$h^{(t)}=\mathrm{sigmoid}(h^{(t-1)}H+e^{(t)}I+b_1)$$
$$\frac{\partial h^{(t)}}{\partial h^{(t-1)}}=h^{(t)}\circ (1-h^{(t)})\circ H^T$$
This gives the recurrence:
$$\delta^{(t-1)}=\frac{\partial J^{(t)}}{\partial h^{(t-1)}}=\frac{\partial J^{(t)}}{\partial h^{(t)}}\cdot\frac{\partial h^{(t)}}{\partial h^{(t-1)}}=\left(\delta^{(t)}\circ h^{(t)}\circ(1-h^{(t)})\right)\cdot H^T$$
and hence:
$$\frac{\partial J^{(t)}}{\partial L_{x^{(t-1)}}}=\frac{\partial J^{(t)}}{\partial h^{(t-1)}}\cdot\frac{\partial h^{(t-1)}}{\partial L_{x^{(t-1)}}}=\left(\delta^{(t-1)}\circ h^{(t-1)}\circ(1-h^{(t-1)})\right)\cdot I^T$$

$$\frac{\partial J^{(t)}}{\partial I}\Big\rvert_{t-1}=\frac{\partial J^{(t)}}{\partial h^{(t-1)}}\cdot\frac{\partial h^{(t-1)}}{\partial I}\Big\rvert_{t-1}=(e^{(t-1)})^T\cdot\left(\delta^{(t-1)}\circ h^{(t-1)}\circ(1-h^{(t-1)})\right)$$

$$\frac{\partial J^{(t)}}{\partial H}\Big\rvert_{t-1}=\frac{\partial J^{(t)}}{\partial h^{(t-1)}}\cdot\frac{\partial h^{(t-1)}}{\partial H}\Big\rvert_{t-1}=(h^{(t-2)})^T\cdot\left(\delta^{(t-1)}\circ h^{(t-1)}\circ(1-h^{(t-1)})\right)$$

Note that the derivation above uses the derivative of the $\mathrm{sigmoid}$ function:
$$\sigma'(x)=\sigma(x)\circ(1-\sigma(x))$$
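The recurrence translates directly into code. A numpy sketch (hypothetical small sizes) that backpropagates the loss at the final step through T time steps, accumulating the gradient of H along the way:

import numpy as np

np.random.seed(1)
d, Dh, V, T = 3, 4, 5, 3  # hypothetical sizes and number of unrolled steps
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

I, H, b1 = np.random.randn(d, Dh), np.random.randn(Dh, Dh), np.zeros((1, Dh))
U, b2 = np.random.randn(Dh, V), np.zeros((1, V))
es = [np.random.randn(1, d) for _ in range(T)]  # embeddings e^(1..T)

# forward: unroll T steps
hs = [np.zeros((1, Dh))]  # h^(0)
for t in range(T):
    hs.append(sigmoid(hs[-1] @ H + es[t] @ I + b1))
theta = hs[-1] @ U + b2
yhat = np.exp(theta - theta.max()); yhat /= yhat.sum()
y = np.eye(V)[[2]]  # one-hot label at the last step

# backward through time: delta^(T) = (yhat - y) U^T, then the recurrence
delta = (yhat - y) @ U.T
dH = np.zeros_like(H)
for t in range(T, 0, -1):
    dv = delta * hs[t] * (1 - hs[t])  # dJ^(T)/dv^(t)
    dH += hs[t - 1].T @ dv            # accumulate dJ^(T)/dH |_t
    delta = dv @ H.T                  # delta^(t-1), via the recurrence above
print(dH)  # gradient of H summed over all T time steps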


The complexities of the forward-pass steps are:
$$e^{(t)}=x^{(t)}L \longrightarrow O(|V|)$$
$$v^{(t)}=h^{(t-1)}H+e^{(t)}I+b_1 \longrightarrow O(D_h^2)+O(dD_h)$$
$$h^{(t)}=\mathrm{sigmoid}(v^{(t)}) \longrightarrow O(D_h)$$
$$\theta^{(t)}=h^{(t)}U+b_2 \longrightarrow O(|V|D_h)$$
$$\hat{y}^{(t)}=\mathrm{softmax}(\theta^{(t)}) \longrightarrow O(|V|)$$
$$J^{(t)}=CE(y^{(t)},\hat{y}^{(t)}) \longrightarrow O(|V|)$$
Keeping only the dominant terms, the forward pass therefore costs:

$$O(D_h^2+dD_h+|V|D_h)$$
By the same reasoning, the backward pass costs:
$$O(D_h^2+dD_h+|V|D_h)$$

The above is the cost for a single time step. For backpropagating through $\tau$ time steps we need:

  • one derivative of the loss with respect to $h^{(t)}$, with complexity $O(|V|D_h)$;
  • $\tau$ steps of backpropagation, with complexity $O(\tau(D_h^2+dD_h))$.

So backpropagation over $\tau$ time steps costs:
$$O(\tau(D_h^2+dD_h)+|V|D_h)$$

And if we run $\tau$ steps of backpropagation for each of the first $\tau$ words, the total cost is roughly:
$$O(\tau^2(D_h^2+dD_h)+\tau|V|D_h)$$
