Assignment #2 - solution    By Jonariguez

All code for the programming problems has been uploaded to github/CS224n/Jonariguez.

Solution:
(Hint: the keepdims parameter makes this more convenient.)

def softmax(x):
    """Compute the softmax function in tensorflow.

    You might find the tensorflow functions tf.exp, tf.reduce_max,
    tf.reduce_sum, tf.expand_dims useful. (Many solutions are possible, so you may
    not need to use all of these functions). Recall also that many common
    tensorflow operations are sugared (e.g. x * y does a tensor multiplication
    if x and y are both tensors). Make sure to implement the numerical stability
    fixes as in the previous homework!

    Args:
        x:   tf.Tensor with shape (n_samples, n_features). Note feature vectors are
             represented by row-vectors. (For simplicity, no need to handle 1-d
             input as in the previous homework)
    Returns:
        out: tf.Tensor with shape (n_samples, n_features). You need to construct this
             tensor in this problem.
    """
    ### YOUR CODE HERE
    # As in Assignment 1, subtract the row-wise maximum before exponentiating.
    # keepdims=True preserves the original rank so broadcasting works,
    # instead of collapsing the result into a row vector.
    x_max = tf.reduce_max(x, axis=1, keepdims=True)
    x = tf.exp(x - x_max)
    x_sum = tf.reduce_sum(x, axis=1, keepdims=True)
    out = x / x_sum
    # out = x / tf.reshape(tf.reduce_sum(x, axis=1), (x.shape[0], 1))
    ### END YOUR CODE
    return out
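A quick sanity check (a minimal sketch, assuming the TF 1.x session API and the softmax defined above; the large inputs are chosen so that the stability fix matters):

import tensorflow as tf

logits = tf.constant([[1001., 1002.], [3., 4.]])
with tf.Session() as sess:
    print(sess.run(softmax(logits)))
# [[0.2689 0.7311]
#  [0.2689 0.7311]]  -- each row sums to 1 and exp() does not overflow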

Solution:
(Useful to remember:

  • tf.multiply() is element-wise multiplication and requires matching shapes (up to broadcasting).
  • tf.matmul() is matrix multiplication.
    Both require the two operands to have the same element dtype. A small demo follows.
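A small demo of the difference (a minimal sketch, assuming TF 1.x):

import tensorflow as tf

a = tf.constant([[1., 2.], [3., 4.]])
b = tf.constant([[5., 6.], [7., 8.]])
with tf.Session() as sess:
    print(sess.run(tf.multiply(a, b)))  # element-wise: [[5, 12], [21, 32]]
    print(sess.run(tf.matmul(a, b)))    # matrix product: [[19, 22], [43, 50]]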
def cross_entropy_loss(y, yhat):
    """Compute the cross entropy loss in tensorflow.

    The loss should be summed over the current minibatch.

    y is a one-hot tensor of shape (n_samples, n_classes) and yhat is a tensor
    of shape (n_samples, n_classes). y should be of dtype tf.int32, and yhat should
    be of dtype tf.float32.

    The functions tf.to_float, tf.reduce_sum, and tf.log might prove useful. (Many
    solutions are possible, so you may not need to use all of these functions).

    Note: You are NOT allowed to use the tensorflow built-in cross-entropy
    functions.

    Args:
        y:    tf.Tensor with shape (n_samples, n_classes). One-hot encoded.
        yhat: tf.Tensor with shape (n_samples, n_classes). Each row encodes a
              probability distribution and should sum to 1.
    Returns:
        out:  tf.Tensor with shape (1,) (Scalar output). You need to construct this
              tensor in the problem.
    """
    ### YOUR CODE HERE
    # The first dimension of y and yhat is n_samples, i.e. the batch size.
    # For each sample we compute one cross-entropy term (a scalar), then sum
    # the n_samples terms, so the final result is also a scalar.
    single_CE = tf.multiply(tf.log(yhat), tf.to_float(y))
    out = tf.negative(tf.reduce_sum(single_CE))
    ### END YOUR CODE
    return out

Solution:
Placeholders and feed_dict let us feed data into the computation graph dynamically at run time. (TensorFlow builds a static graph.) A minimal example follows.
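A minimal sketch of the mechanism, assuming TF 1.x:

import tensorflow as tf

x = tf.placeholder(tf.float32, shape=[None, 2])  # the graph is built once...
y = tf.reduce_sum(x * 2.0, axis=1)
with tf.Session() as sess:
    # ...and data is "fed" in at run time through feed_dict.
    print(sess.run(y, feed_dict={x: [[1., 2.], [3., 4.]]}))  # [ 6. 14.]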

def add_placeholders(self):
    """Generates placeholder variables to represent the input tensors.

    These placeholders are used as inputs by the rest of the model building
    and will be fed data during training.

    Adds following nodes to the computational graph
        input_placeholder:  Input placeholder tensor of shape
                            (batch_size, n_features), type tf.float32
        labels_placeholder: Labels placeholder tensor of shape
                            (batch_size, n_classes), type tf.int32

    Add these placeholders to self as the instance variables
        self.input_placeholder
        self.labels_placeholder
    """
    ### YOUR CODE HERE
    self.input_placeholder = tf.placeholder(
        tf.float32,
        shape=[self.config.batch_size, self.config.n_features],
        name='input_placeholder')
    self.labels_placeholder = tf.placeholder(
        tf.int32,
        shape=[self.config.batch_size, self.config.n_classes],
        name='labels_placeholder')
    ### END YOUR CODE
def create_feed_dict(self, inputs_batch, labels_batch=None):
    """Creates the feed_dict for training the given step.

    A feed_dict takes the form of:
    feed_dict = {
        <placeholder>: <tensor of values to be passed for placeholder>,
        ....
    }

    If labels_batch is None, then no labels are added to feed_dict.

    Hint: The keys for the feed_dict should be the placeholder
          tensors created in add_placeholders.

    Args:
        inputs_batch: A batch of input data.
        labels_batch: A batch of label data.
    Returns:
        feed_dict: The feed dictionary mapping from placeholders to values.
    """
    ### YOUR CODE HERE
    # A feed_dict is just a Python dict. Note that its keys are the
    # tf.placeholder objects defined earlier, not their string names.
    # Per the docstring, skip the labels entry when labels_batch is None.
    feed_dict = {self.input_placeholder: inputs_batch}
    if labels_batch is not None:
        feed_dict[self.labels_placeholder] = labels_batch
    ### END YOUR CODE
    return feed_dict

def add_prediction_op(self):
    """Adds the core transformation for this model which transforms a batch of input
    data into a batch of predictions. In this case, the transformation is a linear layer plus a
    softmax transformation:

    y = softmax(Wx + b)

    Hint: Make sure to create tf.Variables as needed.
    Hint: For this simple use-case, it's sufficient to initialize both weights W
          and biases b with zeros.

    Args:
        input_data: A tensor of shape (batch_size, n_features).
    Returns:
        pred: A tensor of shape (batch_size, n_classes)
    """
    ### YOUR CODE HERE
    # x is the input, i.e. the placeholder input_placeholder.
    # W and b are the variables we define, and the ones to be trained:
    #     pred = softmax(xW + b)
    with tf.variable_scope('softmax_classifier'):
        W = tf.Variable(tf.zeros([self.config.n_features, self.config.n_classes], dtype=tf.float32))
        b = tf.Variable(tf.zeros([self.config.n_classes], dtype=tf.float32))
        Z = tf.matmul(self.input_placeholder, W) + b
        pred = softmax(Z)
    ### END YOUR CODE
    return pred
def add_loss_op(self, pred):
    """Adds cross_entropy_loss ops to the computational graph.

    Hint: Use the cross_entropy_loss function we defined. This should be a very
          short function.

    Args:
        pred: A tensor of shape (batch_size, n_classes)
    Returns:
        loss: A 0-d tensor (scalar)
    """
    ### YOUR CODE HERE
    # cross_entropy_loss() is already implemented in q1_softmax.py, so we can
    # call it directly. self.labels_placeholder holds the true labels that get
    # fed in; pred holds our predictions.
    loss = cross_entropy_loss(self.labels_placeholder, pred)
    ### END YOUR CODE
    return loss

def add_training_op(self, loss):
    """Sets up the training Ops.

    Creates an optimizer and applies the gradients to all trainable variables.
    The Op returned by this function is what must be passed to the
    `sess.run()` call to cause the model to train. See
    https://www.tensorflow.org/versions/r0.7/api_docs/python/train.html#Optimizer
    for more information.

    Hint: Use tf.train.GradientDescentOptimizer to get an optimizer object.
          Calling optimizer.minimize() will return a train_op object.

    Args:
        loss: Loss tensor, from cross_entropy_loss.
    Returns:
        train_op: The Op for training.
    """
    ### YOUR CODE HERE
    # Simply return the op produced by the optimizer.
    train_op = tf.train.GradientDescentOptimizer(self.config.lr).minimize(loss)
    ### END YOUR CODE
    return train_op

Solution:
TensorFlow's automatic gradients mean that we only need to define the nodes of the computation graph; we never implement gradient computation ourselves. Backpropagation and differentiation are carried out automatically by TensorFlow, as the sketch below shows.
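A minimal sketch, assuming TF 1.x:

import tensorflow as tf

x = tf.Variable(3.0)
loss = x * x + 2.0 * x              # we only define the forward graph
grad = tf.gradients(loss, [x])[0]   # TensorFlow derives d(loss)/dx = 2x + 2 itself
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(grad))  # 8.0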


Solution:

stack                          | buffer                | new dependency      | transition
-------------------------------+-----------------------+---------------------+-----------
[ROOT, parsed, this]           | [sentence, correctly] |                     | SHIFT
[ROOT, parsed, this, sentence] | [correctly]           |                     | SHIFT
[ROOT, parsed, sentence]       | [correctly]           | sentence -> this    | LEFT-ARC
[ROOT, parsed]                 | [correctly]           | parsed -> sentence  | RIGHT-ARC
[ROOT, parsed, correctly]      | []                    |                     | SHIFT
[ROOT, parsed]                 | []                    | parsed -> correctly | RIGHT-ARC
[ROOT]                         | []                    | ROOT -> parsed      | RIGHT-ARC


Solution:
2n steps in total (for a concrete count: the four-word sentence above takes 4 SHIFTs plus 4 ARCs, i.e. 8 transitions):

  • Every word must enter the stack exactly once, so there are n SHIFT operations.
  • At the end only ROOT remains on the stack, and each ARC removes exactly one word from the stack, so there are n LEFT-ARC/RIGHT-ARC operations in total.

def __init__(self, sentence):
    """Initializes this partial parse.

    Your code should initialize the following fields:
        self.stack: The current stack represented as a list with the top of the stack as the
                    last element of the list.
        self.buffer: The current buffer represented as a list with the first item on the
                     buffer as the first item of the list
        self.dependencies: The list of dependencies produced so far. Represented as a list of
                           tuples where each tuple is of the form (head, dependent).
                           Order for this list doesn't matter.

    The root token should be represented with the string "ROOT"

    Args:
        sentence: The sentence to be parsed as a list of words.
                  Your code should not modify the sentence.
    """
    # The sentence being parsed is kept for bookkeeping purposes. Do not use it in your code.
    self.sentence = sentence
    ### YOUR CODE HERE
    self.stack = ['ROOT']
    # Do not write self.buffer = sentence: that would make self.buffer a
    # reference to sentence, so mutating self.buffer would mutate sentence,
    # which the docstring forbids. Copy it instead.
    self.buffer = [word for word in self.sentence]
    self.dependencies = []
    ### END YOUR CODE

def parse_step(self, transition):
    """Performs a single parse step by applying the given transition to this partial parse

    Args:
        transition: A string that equals "S", "LA", or "RA" representing the shift, left-arc,
                    and right-arc transitions.
    """
    ### YOUR CODE HERE
    # To restate the transitions:
    #   S:  pop the leftmost word off the buffer and push it onto the stack.
    #   LA: the top word of the stack is the head, the second word the dependent.
    #   RA: the top word of the stack is the dependent, the second word the head.
    # LA and RA both append a (head, dependent) tuple to self.dependencies.
    # Using the return value of list.pop() keeps the code short.
    if transition == 'S':
        self.stack.append(self.buffer.pop(0))
    elif transition == 'LA':
        self.dependencies.append((self.stack[-1], self.stack.pop(-2)))
    else:
        self.dependencies.append((self.stack[-2], self.stack.pop(-1)))
    ### END YOUR CODE
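As a quick check (not part of the assignment code), stepping through the example sentence from the table above reproduces exactly the dependencies listed there:

pp = PartialParse(["parsed", "this", "sentence", "correctly"])
for t in ["S", "S", "S", "LA", "RA", "S", "RA", "RA"]:
    pp.parse_step(t)
print(pp.dependencies)
# [('sentence', 'this'), ('parsed', 'sentence'), ('parsed', 'correctly'), ('ROOT', 'parsed')]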

def minibatch_parse(sentences, model, batch_size):
    """Parses a list of sentences in minibatches using a model.

    Args:
        sentences: A list of sentences to be parsed (each sentence is a list of words)
        model: The model that makes parsing decisions. It is assumed to have a function
               model.predict(partial_parses) that takes in a list of PartialParses as input and
               returns a list of transitions predicted for each parse. That is, after calling
                   transitions = model.predict(partial_parses)
               transitions[i] will be the next transition to apply to partial_parses[i].
        batch_size: The number of PartialParses to include in each minibatch
    Returns:
        dependencies: A list where each element is the dependencies list for a parsed sentence.
                      Ordering should be the same as in sentences (i.e., dependencies[i] should
                      contain the parse for sentences[i]).
    """
    ### YOUR CODE HERE
    # Create one PartialParse per sentence.
    partial_parses = [PartialParse(sentence) for sentence in sentences]
    dependencies = []
    start_idx, end_idx = 0, 0
    while end_idx < len(sentences):
        end_idx = min(start_idx + batch_size, len(sentences))
        # Take a minibatch of batch_size parses.
        batch_parses = partial_parses[start_idx:end_idx]
        # Ask the model for the next transition of every parse in the batch.
        # Note: model.predict(x) only advances each parse in x by ONE step.
        while len(batch_parses) > 0:
            transitions = model.predict(batch_parses)
            for i in range(len(transitions)):
                batch_parses[i].parse_step(transitions[i])
            # Drop the parses that are already finished and keep the rest.
            # A parse is finished when its buffer is empty and its stack holds
            # only ROOT, i.e. len(buffer) == 0 and len(stack) == 1.
            batch_parses = [parse for parse in batch_parses
                            if len(parse.buffer) > 0 or len(parse.stack) > 1]
        dependencies.extend([parse.dependencies for parse in partial_parses[start_idx:end_idx]])
        # Remember to advance start_idx.
        start_idx += batch_size
    ### END YOUR CODE
    return dependencies
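To exercise minibatch_parse without a trained model, one can use a stub in the spirit of the assignment's own tests (this particular DummyModel is hypothetical): it predicts SHIFT while a parse's buffer is non-empty and RIGHT-ARC afterwards, producing right-branching parses.

class DummyModel(object):
    """Hypothetical stub: SHIFT until the buffer is empty, then RIGHT-ARC."""
    def predict(self, partial_parses):
        return ["S" if len(p.buffer) > 0 else "RA" for p in partial_parses]

deps = minibatch_parse([["right", "arcs", "only"],
                        ["again", "right", "arcs", "only"]],
                       DummyModel(), batch_size=2)
print(deps[0])  # [('arcs', 'only'), ('right', 'arcs'), ('ROOT', 'right')]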

def xavier_weight_init():
    """Returns function that creates random tensor.

    The specified function will take in a shape (tuple or 1-d array) and
    returns a random tensor of the specified shape drawn from the
    Xavier initialization distribution.

    Hint: You might find tf.random_uniform useful.
    """
    def _xavier_initializer(shape, **kwargs):
        """Defines an initializer for the Xavier distribution.

        Specifically, the output should be sampled uniformly from [-epsilon, epsilon] where
            epsilon = sqrt(6) / <sum of the sizes of shape's dimensions>
        e.g., if shape = (2, 3), epsilon = sqrt(6 / (2 + 3))

        This function will be used as a variable initializer.

        Args:
            shape: Tuple or 1-d array that specifies the dimensions of the requested tensor.
        Returns:
            out: tf.Tensor of specified shape sampled from the Xavier distribution.
        """
        ### YOUR CODE HERE
        epsilon = tf.sqrt(6.0 / tf.to_float(tf.reduce_sum(shape)))
        # Return the sampled tensor directly; an initializer should produce a
        # tf.Tensor, not a tf.Variable.
        out = tf.random_uniform(shape, minval=-epsilon, maxval=epsilon)
        ### END YOUR CODE
        return out
    # Returns defined initializer function.
    return _xavier_initializer
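For example (a minimal sketch, assuming TF 1.x and the function above in scope):

import tensorflow as tf

init_fn = xavier_weight_init()
W_init = init_fn((3, 2))  # entries lie in [-eps, eps], eps = sqrt(6/(3+2)) ≈ 1.095
with tf.Session() as sess:
    print(sess.run(W_init))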


Solution:
$$\mathbb{E}_{p_{drop}}[\mathbf{h}_{drop}]=\mathbb{E}_{p_{drop}}[\gamma\,\mathbf{d}\circ \mathbf{h}]=p_{drop}\cdot \vec{0}+(1-p_{drop})\cdot\gamma\cdot\mathbf{h}=\mathbf{h}$$
from which we get:
$$\gamma=\frac{1}{1-p_{drop}}$$
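A quick numerical check of this expectation (a sketch in numpy):

import numpy as np

p_drop, n_trials = 0.5, 200000
h = np.array([1.0, 2.0, 3.0])
gamma = 1.0 / (1.0 - p_drop)
# each entry of d is 0 with probability p_drop and 1 otherwise
samples = [gamma * (np.random.rand(3) >= p_drop) * h for _ in range(n_trials)]
print(np.mean(samples, axis=0))  # ≈ [1. 2. 3.] = h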

Solution:
Because $\mathbf{m}$ is a weighted average of all previous gradients (update directions), it reflects the overall trend of the gradient better than any single gradient does: it reduces the variance of the updates and thus avoids oscillation.
$\beta_1$ is usually chosen close to 1.


Solution:

  • Update term $\mathbf{m}$: a rolling (exponential moving) average of the gradients.
  • Learning-rate term $\mathbf{v}$: a rolling average of the squared gradients.

The parameters whose rolling average of squared gradients is smallest receive the largest relative updates. In other words, even where the loss surface is flat and the gradients with respect to those parameters are small, they can still move quickly toward the optimum. A sketch of the update rule follows.
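A minimal sketch of one Adam step in numpy (hypothetical variable names; bias correction omitted, as in the assignment's description):

import numpy as np

def adam_step(theta, grad, m, v, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # m: rolling average of gradients; v: rolling average of squared gradients
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # parameters with small v (consistently small gradients) get relatively larger steps
    theta = theta - lr * m / (np.sqrt(v) + eps)
    return theta, m, v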


Solution:
My results:

Epoch 10 out of 10
924/924 [============================>.] - ETA: 0s - train loss: 0.0654
Evaluating on dev set - dev UAS: 88.37
New best dev UAS! Saving model in ./data/weights/parser.weights
===========================================================================
TESTING
===========================================================================
Restoring the best model weights found on the dev set
Final evaluation on test set
- test UAS: 88.84

Running time: about 15 minutes.


Problem interpretation
First, pin down the dimensions of every quantity in the problem.
From the problem statement, $x^{(t)}$ is a one-hot row vector, and the hidden layer is also a row vector, so:
$$x^{(t)}\in \mathbb{R}^{1\times |V|}$$
$$h^{(t)}\in \mathbb{R}^{1\times D_h}$$

$\hat{y}^{(t)}$ is the output, i.e. the probability distribution over the vocabulary (after the softmax), so:
$$\hat{y}^{(t)}\in \mathbb{R}^{1\times |V|}$$
From this we can derive:
$$L\in \mathbb{R}^{|V|\times d}$$
$$e^{(t)}\in \mathbb{R}^{1\times d}$$
$$I\in \mathbb{R}^{d\times D_h}$$
$$H\in \mathbb{R}^{D_h\times D_h}$$
$$b_1\in \mathbb{R}^{1\times D_h}$$
$$U\in \mathbb{R}^{D_h\times |V|}$$
$$b_2\in \mathbb{R}^{1\times |V|}$$

where $d$ is the dimensionality of the word vectors, i.e. embed_size in the code.
With these dimensions settled, the derivatives below become much clearer; a quick shape check follows.
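A numpy sketch of the forward pass that verifies these shapes (all sizes here are hypothetical):

import numpy as np

V, d, Dh = 10, 4, 5                   # hypothetical sizes
x = np.eye(V)[[3]]                    # (1, |V|) one-hot row vector
L = np.random.randn(V, d)
I, H = np.random.randn(d, Dh), np.random.randn(Dh, Dh)
b1, U, b2 = np.zeros((1, Dh)), np.random.randn(Dh, V), np.zeros((1, V))
h_prev = np.zeros((1, Dh))

e = x @ L                                           # (1, d)
h = 1 / (1 + np.exp(-(h_prev @ H + e @ I + b1)))    # (1, Dh)
theta = h @ U + b2                                  # (1, |V|)
yhat = np.exp(theta) / np.exp(theta).sum()          # (1, |V|), sums to 1
print(e.shape, h.shape, yhat.shape)                 # (1, 4) (1, 5) (1, 10)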

Because sentences vary in length, while the loss is defined per word and then summed over the whole sentence, we must divide by the sentence length to obtain the average loss per word; otherwise the losses of different sentences are not comparable.


Solution:
Since the label $y^{(t)}$ is a one-hot vector, suppose its true class is $k$.
Then:
$$J^{(t)}(\theta)=CE(y^{(t)},\hat{y}^{(t)})=-\log \hat{y}_k^{(t)}=\log \frac{1}{\hat{y}_k^{(t)}}$$
$$PP^{(t)}(y^{(t)},\hat{y}^{(t)})=\frac{1}{\hat{y}_k^{(t)}}$$
It follows immediately that:
$$CE(y^{(t)},\hat{y}^{(t)})=\log PP^{(t)}(y^{(t)},\hat{y}^{(t)})$$

This is a very common identity and well worth remembering.

For $|V|=10000$, a uniformly random guess is correct with probability $\frac{1}{|V|}=\frac{1}{10000}$, so the perplexity is $\frac{1}{1/|V|}=10000$ and $CE=\log 10000\approx 9.21$.
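Checking the arithmetic (natural log, as in the cross-entropy definition):

import numpy as np
V = 10000
print(np.log(V))  # 9.2103...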


Solution:
From the problem statement, $L_{x^{(t)}}=e^{(t)}$.
Define:
$$v^{(t)}=h^{(t-1)}H+e^{(t)}I+b_1$$
$$\theta^{(t)}=h^{(t)}U+b_2$$

The forward pass is then:
$$e^{(t)}=x^{(t)}L$$
$$v^{(t)}=h^{(t-1)}H+e^{(t)}I+b_1$$
$$h^{(t)}=\mathrm{sigmoid}(v^{(t)})$$
$$\theta^{(t)}=h^{(t)}U+b_2$$
$$\hat{y}^{(t)}=\mathrm{softmax}(\theta^{(t)})$$
$$J^{(t)}=CE(y^{(t)},\hat{y}^{(t)})$$

Backward pass. Intermediate quantities:
$$\delta_1^{(t)}=\frac{\partial J^{(t)}}{\partial \theta^{(t)}}=\hat{y}^{(t)}-y^{(t)}$$
$$\delta_2^{(t)}=\frac{\partial J^{(t)}}{\partial v^{(t)}}=\frac{\partial J^{(t)}}{\partial \theta^{(t)}}\cdot\frac{\partial \theta^{(t)}}{\partial h^{(t)}}\cdot\frac{\partial h^{(t)}}{\partial v^{(t)}}=\left((\hat{y}^{(t)}-y^{(t)})\cdot U^{T}\right)\circ h^{(t)}\circ (1-h^{(t)})$$

Then:
$$\frac{\partial J^{(t)}}{\partial b_2}=\frac{\partial J^{(t)}}{\partial \theta^{(t)}}\cdot\frac{\partial \theta^{(t)}}{\partial b_2}=\delta_1^{(t)}$$
$$\frac{\partial J^{(t)}}{\partial H}\Big\rvert_t = \frac{\partial J^{(t)}}{\partial v^{(t)}}\frac{\partial v^{(t)}}{\partial H}\Big\rvert_t=(h^{(t-1)})^T\cdot\delta_2^{(t)}$$
$$\frac{\partial J^{(t)}}{\partial I}\Big\rvert_t = \frac{\partial J^{(t)}}{\partial v^{(t)}}\frac{\partial v^{(t)}}{\partial I}\Big\rvert_t=(e^{(t)})^T\cdot\delta_2^{(t)}$$
$$\frac{\partial J^{(t)}}{\partial L_{x^{(t)}}} =\frac{\partial J^{(t)}}{\partial e^{(t)}}= \frac{\partial J^{(t)}}{\partial v^{(t)}}\cdot \frac{\partial v^{(t)}}{\partial e^{(t)}}=\delta_2^{(t)}\cdot I^T$$
$$\frac{\partial J^{(t)}}{\partial h^{(t-1)}}= \frac{\partial J^{(t)}}{\partial v^{(t)}}\cdot \frac{\partial v^{(t)}}{\partial h^{(t-1)}}=\delta_2^{(t)}\cdot H^T$$
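These formulas are easy to verify numerically. A minimal numpy sketch with hypothetical small dimensions, checking the gradient of $H$ against finite differences:

import numpy as np

np.random.seed(0)
d, Dh, V = 4, 5, 6  # hypothetical small sizes

e = np.random.randn(1, d)        # e^(t)
h_prev = np.random.randn(1, Dh)  # h^(t-1)
H, I = np.random.randn(Dh, Dh), np.random.randn(d, Dh)
b1 = np.random.randn(1, Dh)
U, b2 = np.random.randn(Dh, V), np.random.randn(1, V)
y = np.zeros((1, V)); y[0, 2] = 1.0  # one-hot label

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def loss(H_):
    # forward pass as defined above, viewed as a function of H
    h = sigmoid(h_prev @ H_ + e @ I + b1)
    theta = h @ U + b2
    yhat = np.exp(theta - theta.max()); yhat = yhat / yhat.sum()
    return -np.sum(y * np.log(yhat))

# analytic gradient: dJ/dH |_t = (h^(t-1))^T . delta2
h = sigmoid(h_prev @ H + e @ I + b1)
theta = h @ U + b2
yhat = np.exp(theta - theta.max()); yhat = yhat / yhat.sum()
delta2 = ((yhat - y) @ U.T) * h * (1 - h)
dH = h_prev.T @ delta2

# finite-difference check
num, epsf = np.zeros_like(H), 1e-6
for i in range(Dh):
    for j in range(Dh):
        Hp, Hm = H.copy(), H.copy()
        Hp[i, j] += epsf; Hm[i, j] -= epsf
        num[i, j] = (loss(Hp) - loss(Hm)) / (2 * epsf)
print(np.abs(num - dH).max())  # ~1e-9: the formula checks out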

If the derivatives in the backward pass above are unclear, here is a short explanation.

Consider the following derivative:
$$\frac{\partial J}{\partial x}=\frac{\partial J}{\partial u_1}\frac{\partial u_1}{\partial u_2}\cdots\frac{\partial u_{m}}{\partial v}\frac{\partial v}{\partial x}$$

Suppose everything except $\frac{\partial v}{\partial x}$ has already been computed:
$$\frac{\partial J}{\partial u_1}\frac{\partial u_1}{\partial u_2}\cdots\frac{\partial u_{m}}{\partial v}=\delta$$

Only $\frac{\partial v}{\partial x}$ remains, and there are two cases to consider.

  1. $v$ is a row vector $r$ times a matrix $M$, and we differentiate with respect to the matrix $M$:
     $$\frac{\partial v}{\partial x}=\frac{\partial }{\partial M}(rM)$$

The result is $r^T$ left-multiplied onto the accumulated factor $\delta$, i.e.:

$$\frac{\partial J}{\partial x}=r^T\cdot\delta$$

In this problem, concretely:
$$\frac{\partial J^{(t)}}{\partial v^{(t)}}=\delta_2^{(t)}$$
$$\frac{\partial v^{(t)}}{\partial H}=\frac{\partial }{\partial H}(h^{(t-1)}H+e^{(t)}I+b_1)=\frac{\partial }{\partial H}(h^{(t-1)}H)=(h^{(t-1)})^T$$
so:
$$\frac{\partial J^{(t)}}{\partial H}\Big\rvert_t=\frac{\partial J^{(t)}}{\partial v^{(t)}}\frac{\partial v^{(t)}}{\partial H}\Big\rvert_t=(h^{(t-1)})^T\cdot\delta_2^{(t)}$$

  2. $v$ is a row vector $r$ times a matrix $M$, and we differentiate with respect to the row vector $r$:

$$\frac{\partial v}{\partial x}=\frac{\partial }{\partial r}(rM)$$

The result is $M^T$ right-multiplied onto the accumulated factor $\delta$, i.e.:

$$\frac{\partial J}{\partial x}=\delta\cdot M^T$$

In this problem, concretely:
$$\frac{\partial J^{(t)}}{\partial h^{(t-1)}}= \frac{\partial J^{(t)}}{\partial v^{(t)}}\cdot \frac{\partial v^{(t)}}{\partial h^{(t-1)}}=\delta_2^{(t)}\cdot H^T$$


Solution:
Backpropagation in an RNN is backpropagation through time: the loss $J^{(t)}$ at time step $t$ must be propagated backwards through the earlier time steps. For convenience, define the error term of $J^{(t)}$ with respect to each time step's hidden state:
$$\delta^{(t)}=\frac{\partial J^{(t)}}{\partial h^{(t)}}$$

Now derive how the error term propagates:
$$\delta^{(t)}=\frac{\partial J^{(t)}}{\partial h^{(t)}}=(\hat{y}^{(t)}-y^{(t)})\cdot U^T$$
$$h^{(t)}=\mathrm{sigmoid}(h^{(t-1)}H+e^{(t)}I+b_1)$$
$$\frac{\partial h^{(t)}}{\partial h^{(t-1)}}=h^{(t)}\circ (1-h^{(t)})\circ H^T$$
This gives the recurrence:
$$\delta^{(t-1)}=\frac{\partial J^{(t)}}{\partial h^{(t-1)}}=\frac{\partial J^{(t)}}{\partial h^{(t)}}\cdot\frac{\partial h^{(t)}}{\partial h^{(t-1)}}=\left(\delta^{(t)}\circ h^{(t)}\circ(1-h^{(t)})\right)\cdot H^T$$
and hence:
$$\frac{\partial J^{(t)}}{\partial L_{x^{(t-1)}}}=\frac{\partial J^{(t)}}{\partial h^{(t-1)}}\cdot\frac{\partial h^{(t-1)}}{\partial L_{x^{(t-1)}}}=\left(\delta^{(t-1)}\circ h^{(t-1)}\circ(1-h^{(t-1)})\right)\cdot I^T$$

$$\frac{\partial J^{(t)}}{\partial I}\Big\rvert_{t-1}=\frac{\partial J^{(t)}}{\partial h^{(t-1)}}\cdot\frac{\partial h^{(t-1)}}{\partial I}\Big\rvert_{t-1}=(e^{(t-1)})^T\cdot\left(\delta^{(t-1)}\circ h^{(t-1)}\circ(1-h^{(t-1)})\right)$$

$$\frac{\partial J^{(t)}}{\partial H}\Big\rvert_{t-1}=\frac{\partial J^{(t)}}{\partial h^{(t-1)}}\cdot\frac{\partial h^{(t-1)}}{\partial H}\Big\rvert_{t-1}=(h^{(t-2)})^T\cdot\left(\delta^{(t-1)}\circ h^{(t-1)}\circ(1-h^{(t-1)})\right)$$

Note that the derivation above uses the derivative of the $\mathrm{sigmoid}$ function:
$$\sigma'(x)=\sigma(x)\circ(1-\sigma(x))$$
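The recurrence translates directly into code. A numpy sketch (hypothetical small sizes) that backpropagates the loss at the final step through T time steps, accumulating the gradient of H along the way:

import numpy as np

np.random.seed(1)
d, Dh, V, T = 3, 4, 5, 3  # hypothetical sizes and number of unrolled steps
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

I, H, b1 = np.random.randn(d, Dh), np.random.randn(Dh, Dh), np.zeros((1, Dh))
U, b2 = np.random.randn(Dh, V), np.zeros((1, V))
es = [np.random.randn(1, d) for _ in range(T)]  # embeddings e^(1..T)

# forward: unroll T steps
hs = [np.zeros((1, Dh))]  # h^(0)
for t in range(T):
    hs.append(sigmoid(hs[-1] @ H + es[t] @ I + b1))
theta = hs[-1] @ U + b2
yhat = np.exp(theta - theta.max()); yhat /= yhat.sum()
y = np.eye(V)[[2]]  # one-hot label at the last step

# backward through time: delta^(T) = (yhat - y) U^T, then the recurrence
delta = (yhat - y) @ U.T
dH = np.zeros_like(H)
for t in range(T, 0, -1):
    dv = delta * hs[t] * (1 - hs[t])  # dJ^(T)/dv^(t)
    dH += hs[t - 1].T @ dv            # accumulate dJ^(T)/dH |_t
    delta = dv @ H.T                  # delta^(t-1), via the recurrence above
print(dH)  # gradient of H summed over all T time steps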


The complexities of the forward-pass steps are:
$$e^{(t)}=x^{(t)}L \longrightarrow O(|V|)$$
$$v^{(t)}=h^{(t-1)}H+e^{(t)}I+b_1 \longrightarrow O(D_h^2)+O(dD_h)$$
$$h^{(t)}=\mathrm{sigmoid}(v^{(t)}) \longrightarrow O(D_h)$$
$$\theta^{(t)}=h^{(t)}U+b_2 \longrightarrow O(|V|D_h)$$
$$\hat{y}^{(t)}=\mathrm{softmax}(\theta^{(t)}) \longrightarrow O(|V|)$$
$$J^{(t)}=CE(y^{(t)},\hat{y}^{(t)}) \longrightarrow O(|V|)$$
Keeping only the dominant terms, the forward pass therefore costs:

$$O(D_h^2+dD_h+|V|D_h)$$
By the same reasoning, the backward pass costs:
$$O(D_h^2+dD_h+|V|D_h)$$

The above is the cost for a single time step. For backpropagating through $\tau$ time steps we need:

  • one derivative of the loss with respect to $h^{(t)}$, with complexity $O(|V|D_h)$;
  • $\tau$ steps of backpropagation, with complexity $O(\tau(D_h^2+dD_h))$.

So backpropagation over $\tau$ time steps costs:
$$O(\tau(D_h^2+dD_h)+|V|D_h)$$

And if we run $\tau$ steps of backpropagation for each of the first $\tau$ words, the total cost is roughly:
$$O(\tau^2(D_h^2+dD_h)+\tau|V|D_h)$$
