Dense-Captioning Events in Videos

info

project page http://cs.stanford.edu/people/ranjaykrishna/densevid/

The paper makes the following contributions:

A new model that:
- identifies all events in a single pass of the video
- describes the detected events with natural language
- uses a variant of an existing proposal module, designed to capture both short events and long events that span minutes.
Capturing dependencies between events: a new captioning module that uses contextual information from past and future events to jointly describe all events.
A new dataset: ActivityNet Captions.

Dense-captioning events model

Goal: design an architecture that

jointly localizes temporal proposals of interest
and then describes each with natural language.

Input: a sequence of video frames
Output: a set of sentences, each with start and end times

Event proposal module

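Framework: the video sequence is first passed through C3D to extract features, which are fed to the proposal module (DAPs) to produce proposals. Each proposal carries start and end times, a score, and a hidden representation $h_i$. Proposals whose score exceeds a threshold are passed to the language model, which uses the hidden representation for video captioning and outputs a description of each event.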

Changes to DAPs: We do not modify the training of DAPs and only change the model at inference time by outputting K proposals at every time step, each proposing an event with offsets.

While traditional DAPs uses non-maximum suppression to eliminate overlapping outputs, we keep them separately and treat them as individual events.
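The selection step described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' code: the `Proposal` tuple, the `select_proposals` helper, and the threshold value of 0.5 are all hypothetical. The notes only say that proposals scoring above a threshold go to the language model and that overlapping proposals are kept.

```python
from typing import List, NamedTuple


class Proposal(NamedTuple):
    """Hypothetical container for one event proposal."""
    start: float  # start time in seconds
    end: float    # end time in seconds
    score: float  # confidence assigned by the proposal module


def select_proposals(proposals: List[Proposal], threshold: float) -> List[Proposal]:
    """Keep every proposal whose score exceeds the threshold.

    No non-maximum suppression is applied, so overlapping proposals
    survive and are treated as separate events.
    """
    return [p for p in proposals if p.score > threshold]


candidates = [
    Proposal(0.0, 12.0, 0.9),
    Proposal(2.0, 10.0, 0.8),   # overlaps the first proposal but is kept
    Proposal(30.0, 95.0, 0.3),  # below the (assumed) threshold, dropped
]
kept = select_proposals(candidates, threshold=0.5)
```

With NMS, the second candidate would likely be suppressed by the first; here both reach the captioning stage.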

Captioning module with context

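To draw on temporal context, all other events are split into two groups relative to a given event: past and future. A concurrent event counts as past if it ends before the current event ends, and as future otherwise. The past and future contexts are represented by $h_i^{past}$ and $h_i^{future}$, each computed from the hidden representations $h_j$ of the other events in the corresponding group.

The resulting feature $(h_i^{past}, h_i, h_i^{future})$ is fed into an LSTM, which generates the final description.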
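The past/future split can be illustrated with a small sketch. Note the assumptions: the neighbors' hidden vectors are combined here by a plain average, which is a simplification chosen for illustration and may differ from the weighting used in the paper; `context_features` and `mean_vector` are hypothetical helper names.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def mean_vector(vectors: Sequence[Vector], dim: int) -> Vector:
    """Average a list of vectors; return zeros when the list is empty."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


def context_features(
    i: int, end_times: Sequence[float], hidden: Sequence[Vector]
) -> Tuple[Vector, Vector, Vector]:
    """Return (h_past, h_i, h_future) for event i.

    Another event j is 'past' if it ends before event i ends,
    and 'future' otherwise. Plain averaging is an assumption.
    """
    dim = len(hidden[i])
    others = [j for j in range(len(hidden)) if j != i]
    past = [hidden[j] for j in others if end_times[j] < end_times[i]]
    future = [hidden[j] for j in others if end_times[j] >= end_times[i]]
    return mean_vector(past, dim), hidden[i], mean_vector(future, dim)


end_times = [5.0, 10.0, 20.0]
hidden = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
h_past, h_i, h_future = context_features(1, end_times, hidden)
```

For the middle event, the first event falls into the past group and the last one into the future group; the three vectors would then be concatenated as input to the captioning LSTM.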

Implementation details

Loss: there are two losses, one for the proposal module and one for the captioning model. The total loss is

$$L = \lambda_1 L_{cap} + \lambda_2 L_{prop}$$

where $\lambda_1 = 1.0$ and $\lambda_2 = 0.1$.
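As a quick worked example of the weighted sum, here is a tiny sketch. The function name `total_loss` and the input loss values are made up for illustration; only the two weights come from the notes.

```python
def total_loss(
    l_cap: float,
    l_prop: float,
    lambda_cap: float = 1.0,   # weight of the captioning loss
    lambda_prop: float = 0.1,  # weight of the proposal loss
) -> float:
    """Combine the captioning and proposal losses into one scalar."""
    return lambda_cap * l_cap + lambda_prop * l_prop


# Illustrative values only: a captioning loss of 2.0 and a proposal loss of 5.0.
loss = total_loss(l_cap=2.0, l_prop=5.0)
```

With these weights the proposal loss contributes only a tenth of its raw value, so the captioning term dominates the gradient signal.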

Training and optimization:

train our full dense-captioning model by alternating between training the language model and the proposal module every 500 iterations.
first train the captioning module by masking all neighboring events for 10 epochs before adding in the context features.
initialize all weights using a Gaussian with standard deviation of 0.01.
stochastic gradient descent with momentum 0.9 to train.
learning rate: 0.01 for the language model and 0.001 for the proposal module.
For efficiency, we do not finetune the C3D feature extraction.
training batch-size is set to 1.
We cap all sentences to be a maximum sentence length of 30 words.
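The alternating schedule and the sentence cap listed above might look like the sketch below. The helper names are hypothetical, and which module is trained first is an assumption; the notes only say that training alternates every 500 iterations.

```python
# Learning rates per module, as listed in the notes.
LEARNING_RATES = {"language": 0.01, "proposal": 0.001}


def module_to_train(iteration: int, switch_every: int = 500) -> str:
    """Alternate between modules in blocks of `switch_every` iterations.

    Starting with the language model is an assumption.
    """
    return "language" if (iteration // switch_every) % 2 == 0 else "proposal"


def cap_sentence(sentence: str, max_words: int = 30) -> str:
    """Truncate a caption to at most `max_words` whitespace-separated words."""
    return " ".join(sentence.split()[:max_words])


# Which module (and learning rate) applies at a few sample iterations.
schedule = [(it, module_to_train(it)) for it in (0, 499, 500, 999, 1000)]
lr_at_500 = LEARNING_RATES[module_to_train(500)]
capped = cap_sentence(" ".join(["word"] * 40))
```

Since the batch size is 1, each iteration corresponds to a single video, so the switch happens every 500 videos.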

The model is implemented in PyTorch 0.1.10.
One mini-batch runs in approximately 15.84 ms on a Titan X GPU and it takes 2 days for the model to converge.