
Top-Venue Paper Presentations at the 14th Conference on Image and Graphics Technologies and Applications (IGTA 2019)

Source: Administrator | Published 2019-03-19

Talk 1


Title: Dual Encoding for Zero-Example Video Retrieval

 

Authors: 董建锋, 李锡荣, 徐朝喜, 纪守领, 何源, 杨刚, 王勋

 

Speaker: 李锡荣, Associate Professor, Renmin University of China

 

Summary: This paper studies zero-example video retrieval. In this retrieval paradigm, a user expresses an ad-hoc query solely through a natural-language sentence, with no visual example provided. Since a video is a sequence of frames and a query is a sequence of words, an effective sequence-to-sequence cross-modal matching model is required. Most existing methods are concept based: they extract relevant concepts from the query and the video separately and associate the two modalities through these concepts. In contrast, this paper takes a concept-free approach and proposes a dual deep encoding network that, for the first time, uses multi-level encoding networks of similar architecture to jointly encode and learn representations for both sentences and videos. The proposed method surpasses previously reported results on several challenging benchmarks (MSR-VTT, and the TRECVID 2016 and 2017 Ad-hoc Video Search tasks).

 

Abstract: This paper attacks the challenging problem of zero-example video retrieval. In such a retrieval paradigm, an end user searches for unlabeled videos by ad-hoc queries described in natural language text with no visual example provided. Given videos as sequences of frames and queries as sequences of words, an effective sequence-to-sequence cross-modal matching is required. The majority of existing methods are concept based, extracting relevant concepts from queries and videos and accordingly establishing associations between the two modalities. In contrast, this paper takes a concept-free approach, proposing a dual deep encoding network that encodes videos and queries into powerful dense representations of their own. Dual encoding is conceptually simple, practically effective and end-to-end. As experiments on three benchmarks, i.e., MSR-VTT, TRECVID 2016 and 2017 Ad-hoc Video Search show, the proposed solution establishes a new state-of-the-art for zero-example video retrieval.
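To make the dual-encoding idea above more concrete, here is a minimal PyTorch sketch of a multi-level encoder (mean pooling, a biGRU, and 1-D convolutions over the biGRU outputs) applied to both modalities. The feature dimensions, layer sizes, and the final common-space projection are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class MultiLevelEncoder(nn.Module):
    """Encodes a sequence (video frames or query words) at three levels:
    global mean pooling, a biGRU, and 1-D convolutions over the biGRU
    outputs; the three codes are concatenated."""

    def __init__(self, in_dim, rnn_dim=512, conv_channels=256, kernel_sizes=(2, 3, 4)):
        super().__init__()
        self.rnn = nn.GRU(in_dim, rnn_dim, batch_first=True, bidirectional=True)
        self.convs = nn.ModuleList(
            [nn.Conv1d(2 * rnn_dim, conv_channels, k, padding=k // 2) for k in kernel_sizes]
        )

    def forward(self, x):                       # x: (batch, seq_len, in_dim)
        level1 = x.mean(dim=1)                  # global average pooling
        rnn_out, _ = self.rnn(x)                # (batch, seq_len, 2 * rnn_dim)
        level2 = rnn_out.mean(dim=1)
        h = rnn_out.transpose(1, 2)             # (batch, 2 * rnn_dim, seq_len)
        level3 = torch.cat([torch.relu(c(h)).max(dim=2).values for c in self.convs], dim=1)
        return torch.cat([level1, level2, level3], dim=1)


# One encoder per modality; a learned projection maps both into a common space.
video_enc = MultiLevelEncoder(in_dim=2048)      # assumed 2048-D frame CNN features
text_enc = MultiLevelEncoder(in_dim=300)        # assumed 300-D word embeddings
frames, words = torch.randn(4, 30, 2048), torch.randn(4, 12, 300)
v, t = video_enc(frames), text_enc(words)
proj_v, proj_t = nn.Linear(v.size(1), 512), nn.Linear(t.size(1), 512)
scores = torch.cosine_similarity(proj_v(v), proj_t(t))   # (4,) video-query similarities
```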

 

Talk 2


Title: Multi-Level Context Ultra-Aggregation for Stereo Matching

 

Authors: 聂光宇, 程明明, 刘云, 梁正发, 范登平, 刘越, 王涌天

 

Speaker: 聂光宇, PhD student, Beijing Institute of Technology

 

Summary: Exploiting multi-level context information in the cost volume can improve the performance of CNN-based stereo matching methods. In recent years, 3-D convolutional neural networks have shown advantages in regularizing the cost volume, but they are limited by the unary features learned for matching cost computation. Existing methods compute the cost volume using only features from plain convolution layers, or a simple aggregation of features from different layers, which is insufficient for stereo matching, a task that requires discriminative features to identify corresponding pixels in a rectified stereo pair. In this paper, we propose a unary feature aggregation scheme based on multi-level context ultra-aggregation, which combines features both within and across levels. Specifically, a child module takes low-resolution images as input to capture larger context, and its context-rich features from every layer are densely fused into the main branch of the network. The ultra-aggregation scheme makes full use of multi-level features with richer context and performs image-to-image prediction holistically. We apply the proposed scheme to cost volume computation, test it on PSM-Net, and evaluate it on the Scene Flow and KITTI 2012/2015 datasets. Experimental results show that our method outperforms state-of-the-art approaches and effectively improves the accuracy of stereo matching.

 

Abstract: Exploiting multi-level context information in the cost volume can improve the performance of CNN-based stereo matching methods. In recent years, 3-D Convolutional Neural Networks (3-D CNNs) have shown advantages in regularizing the cost volume, but they are limited by the unary features learned in matching cost computation. Existing methods only use features from plain convolution layers, or a simple aggregation of features from different convolution layers, to calculate the cost volume. These are not sufficient for the stereo matching task, which requires discriminative features to identify corresponding pixels in a rectified stereo image pair. In this paper, we propose a unary feature descriptor using multi-level context ultra-aggregation (MCUA), which encapsulates all convolutional features into a more discriminative representation through intra- and inter-level feature combination. Specifically, a child module that takes low-resolution images as input captures larger context information; this larger-context information from each layer is densely connected to the main branch of the network. MCUA makes good use of multi-level features with richer context and performs image-to-image prediction holistically. We introduce our MCUA scheme for cost volume calculation and test it on PSM-Net. We also evaluate our method on the Scene Flow and KITTI 2012/2015 stereo datasets. Experimental results show that our method outperforms state-of-the-art methods by a notable margin and effectively improves the accuracy of stereo matching.
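The toy PyTorch sketch below illustrates the two kinds of feature combination the abstract describes: DenseNet-style intra-level aggregation, and a half-resolution child branch whose features are upsampled and fused back into the main branch. Channel counts, depths, and module names are assumptions for illustration only, not the MCUA architecture itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class IntraLevelAggregation(nn.Module):
    """Toy intra-level aggregation: every earlier feature map in a level is
    densely fused into each later layer (DenseNet-style combination)."""

    def __init__(self, channels=32, num_layers=3):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.Conv2d(channels * (i + 1), channels, 3, padding=1) for i in range(num_layers)]
        )

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(F.relu(layer(torch.cat(feats, dim=1))))
        return feats[-1]


class ChildModuleFusion(nn.Module):
    """Inter-level idea: a child branch runs on a downsampled copy of the input
    to see larger context; its output is upsampled and fused into the main branch."""

    def __init__(self, channels=32):
        super().__init__()
        self.main = IntraLevelAggregation(channels)
        self.child = IntraLevelAggregation(channels)
        self.fuse = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, x):
        main_feat = self.main(x)
        child_feat = self.child(F.avg_pool2d(x, 2))              # half-resolution branch
        child_feat = F.interpolate(child_feat, size=main_feat.shape[-2:],
                                   mode="bilinear", align_corners=False)
        return self.fuse(torch.cat([main_feat, child_feat], dim=1))


left = torch.randn(1, 32, 64, 128)        # toy unary feature map for one view
print(ChildModuleFusion()(left).shape)    # torch.Size([1, 32, 64, 128])
```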

 

Talk 3

Title: Unsupervised Open Domain Recognition by Semantic Discrepancy Minimization

 

Authors: 卓君宝, 王树徽, 崔书豪, 黄庆明

 

Speaker: 卓君宝, PhD student, Institute of Computing Technology, Chinese Academy of Sciences

 

Summary: Combining the strong feature-learning ability of deep networks with transfer learning to alleviate the scarcity of labeled data for a target task is a problem of great value. Existing settings in deep transfer learning, such as domain adaptation and zero-shot learning, still have limitations: domain adaptation methods remain confined to a closed label set, while generalized zero-shot recognition assumes unknown classes in the target domain but usually assumes no domain gap between training and test data. We propose a new setting, open domain recognition, which assumes a domain gap between a labeled source domain and an unlabeled target domain, with the source label set being a subset of the target label set; the task is to correctly classify samples from every category of the target domain. We present a first solution to this problem. To handle the unknown categories, we build a graph convolutional network over WordNet to propagate the classifiers of known categories to unknown ones, and introduce a balance constraint to prevent unknown-class samples from being assigned to known classes during training. In addition, we first compute an optimal matching between source and target samples and use semantic consistency to guide domain adaptation between the source and target domains, whose label spaces are asymmetric. Finally, the classification network and the graph convolutional network are trained jointly. Experiments verify the effectiveness of the proposed method.

 

Abstract: In this work, we explore unsupervised open domain recognition, a more realistic scenario in which the categories of the labeled source domain are a subset of the categories of the unlabeled target domain and a domain discrepancy exists between the two domains. This is a very difficult setting, as there exist both domain shift and semantic bias between the source and target domains. Directly propagating the classifier trained on the source domain to the unseen categories of the target domain via a graph CNN is suboptimal. It is straightforward to try to reduce the domain discrepancy, but the discrepancy of asymmetric label spaces is hard to estimate, and directly minimizing existing discrepancy measures can result in negative transfer. Therefore, we propose a semantic-guided matching discrepancy that first searches for an optimal matching between source and target instances and then uses the semantic consistency of the coarsely matched pairs to filter noisy matches. In addition, we propose a limited balance constraint to alleviate the semantic embedding bias, i.e., the tendency of the classification network, lacking labels for unknown categories, to misclassify unknown samples into known categories. Finally, we jointly train the classification network and the graph CNN to better preserve the semantic structure encoded in word vectors and the knowledge graph. We collect two datasets for the unsupervised open domain recognition problem, and evaluations on these datasets show the effectiveness of our method.
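A rough NumPy/SciPy sketch of the semantic-guided matching idea is given below: instances are matched one-to-one by feature distance, and pairs with inconsistent semantic predictions are filtered out. All function names, shapes, and the similarity threshold are hypothetical; this is not the paper's code.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def semantic_guided_matching(src_feat, tgt_feat, src_sem, tgt_sem, sem_threshold=0.5):
    """Find an optimal one-to-one matching between source and target instances
    by feature distance, then keep only pairs whose semantic (class-score)
    vectors are consistent. Shapes and threshold are illustrative assumptions."""
    # pairwise Euclidean distances between source and target features
    cost = np.linalg.norm(src_feat[:, None, :] - tgt_feat[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(cost)          # optimal bipartite matching
    # cosine similarity between matched pairs' semantic vectors
    s, t = src_sem[rows], tgt_sem[cols]
    sem_sim = (s * t).sum(1) / (np.linalg.norm(s, axis=1) * np.linalg.norm(t, axis=1) + 1e-8)
    keep = sem_sim > sem_threshold                    # filter noisy matches
    return rows[keep], cols[keep], cost[rows[keep], cols[keep]].mean()


rng = np.random.default_rng(0)
src_f, tgt_f = rng.normal(size=(20, 64)), rng.normal(size=(20, 64))
src_s, tgt_s = rng.random(size=(20, 10)), rng.random(size=(20, 10))   # toy class scores
print(semantic_guided_matching(src_f, tgt_f, src_s, tgt_s))
```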

 

Talk 4


Title: Language-driven Temporal Activity Localization: A Semantic Matching Reinforcement Learning Model

 

Authors: 王卫宁, 黄岩, 王亮

 

Speaker: 王卫宁, PhD student, Institute of Automation, Chinese Academy of Sciences

 

Summary: Most existing work on action detection in untrimmed videos localizes the start and end time of an action and determines its class, where the class is described by a word or short phrase such as jumping, spinning, or diving. This paper studies a more challenging and more practical, yet less explored, task: localizing the moment in a video described by a full sentence. Since the dense frame sampling used by existing methods is very time-consuming, we propose an RNN-based reinforcement learning model that selectively observes a sequence of frames and associates the observed video content with the sentence in a matching-based manner. Because directly matching the sentence with raw video content performs poorly, we further fuse semantic concepts of the video with its visual features, extending the model into a semantic matching reinforcement learning model. Our method surpasses the previous best results on three datasets (TACoS, Charades-STA, and DiDeMo) while greatly improving detection speed.

 

Abstract: Current studies on action detection in untrimmed videos are mostly designed for action classes, where an action is described at word level such as jumping, tumbling, swing, etc. This paper focuses on a rarely investigated problem of localizing an activity via a sentence query which would be more challenging and practical. Considering that current methods are generally time-consuming due to the dense frame-processing manner, we propose a recurrent neural network based reinforcement learning model which selectively observes a sequence of frames and associates the given sentence with video content in a matching-based manner. However, directly matching sentences with video content performs poorly due to the large visual-semantic discrepancy. Thus, we extend the method to a semantic matching reinforcement learning (SM-RL) model by extracting semantic concepts of videos and then fusing them with global context features. Extensive experiments on two benchmark datasets, TACoS and Charades-STA, show that our method achieves the state-of-the-art performance with a high detection speed, demonstrating both effectiveness and efficiency of our method.
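The sketch below illustrates, in PyTorch, the selective-observation loop the abstract describes: a recurrent agent observes one frame at a time, fuses it with the sentence embedding, and chooses how far to jump next, finally emitting a matching score. The discrete jump offsets, feature dimensions, and module names are illustrative assumptions rather than the SM-RL implementation.

```python
import torch
import torch.nn as nn


class SelectiveObserver(nn.Module):
    """Minimal sketch of selective observation: a GRU agent looks at one frame
    at a time, fuses it with the query (and, in the full model, with detected
    semantic concepts), and emits a matching score plus a jump action."""

    def __init__(self, frame_dim=2048, query_dim=512, hidden=512, offsets=(1, 5, 10)):
        super().__init__()
        self.offsets = offsets
        self.cell = nn.GRUCell(frame_dim + query_dim, hidden)
        self.policy = nn.Linear(hidden, len(offsets))   # which jump to take next
        self.score = nn.Linear(hidden, 1)               # sentence-clip matching score

    def forward(self, frames, query, steps=5):
        # frames: (T, frame_dim), query: (query_dim,)
        h = torch.zeros(1, self.cell.hidden_size)
        t, visited = 0, []
        for _ in range(steps):
            visited.append(t)
            obs = torch.cat([frames[t], query]).unsqueeze(0)
            h = self.cell(obs, h)
            action = torch.distributions.Categorical(logits=self.policy(h)).sample()
            t = min(t + self.offsets[action.item()], frames.size(0) - 1)
        return torch.sigmoid(self.score(h)).item(), visited


frames = torch.randn(100, 2048)             # toy per-frame features
query = torch.randn(512)                    # toy sentence embedding
print(SelectiveObserver()(frames, query))   # (matching score, frames actually observed)
```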

 

Talk 5


Title: Deep Embedding Learning with Discriminative Sampling Policy

 

Authors: 段岳圻, 陈磊, 鲁继文, 周杰

 

Speaker: 段岳圻, PhD student, Tsinghua University

 

Summary: Deep metric learning aims to learn an accurate distance metric and is widely used in a variety of vision tasks. Since the vast majority of easy training samples provide little useful gradient for parameter training, hard example mining plays an important role in model training. However, most existing sampling methods rely on hand-crafted exhaustive search, ignoring the intrinsic relations among training samples and thus incurring a large amount of redundant computation. This paper proposes a deep metric learning framework with a discriminative sampling policy, which jointly learns a deep sampler network that provides an efficient sampling strategy and a deep metric network that yields an accurate distance metric. Compared with hand-crafted exhaustive approaches, the deep sampler network fully exploits the correlations among training samples and mines valuable training data at a much lower cost. Experimental results show that the proposed framework achieves faster convergence and stronger discriminative power under several classical loss functions.

 

Abstract: Deep embedding learning aims to learn a distance metric for effective similarity measurement, and has achieved promising performance in various tasks. As the vast majority of training samples produce gradients with magnitudes close to zero, hard example mining is usually employed to improve the effectiveness and efficiency of the training procedure. However, most existing sampling methods are designed by hand, ignore the dependence among examples, and suffer from exhaustive search. In this paper, we propose a deep embedding with discriminative sampling policy (DE-DSP) learning framework that simultaneously trains two models: a deep sampler network that learns effective sampling strategies, and a feature embedding that maps samples into the feature space. Rather than exhaustively computing the hardness of all examples through forward propagation, the deep sampler network exploits the strong prior of relations among samples to learn a discriminative sampling policy in a more efficient manner. Experimental results demonstrate the faster convergence and stronger discriminative power of our DE-DSP framework under different embedding objectives.
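Below is a minimal PyTorch sketch of the learned-sampler idea: a small network scores candidate negatives for an anchor, one negative is sampled from the resulting distribution for a triplet loss, and the sampler is updated with a REINFORCE-style signal. Network sizes, the reward definition, and the training loop are assumptions for illustration, not the DE-DSP framework itself.

```python
import torch
import torch.nn as nn


class DeepSampler(nn.Module):
    """Toy learned sampling policy: scores each candidate negative for a given
    anchor, so one negative can be sampled instead of exhaustively ranked."""

    def __init__(self, dim=128, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, anchor, candidates):
        # anchor: (dim,), candidates: (N, dim) -> sampling logits over candidates
        pairs = torch.cat([anchor.expand_as(candidates), candidates], dim=1)
        return self.net(pairs).squeeze(1)


embed_dim, sampler = 128, DeepSampler()
# Toy embeddings; in the full framework these come from the embedding network.
anchor, positive = torch.randn(embed_dim), torch.randn(embed_dim)
negatives = torch.randn(32, embed_dim)                  # candidate negative pool

logits = sampler(anchor, negatives)
dist = torch.distributions.Categorical(logits=logits)
idx = dist.sample()                                     # pick one negative
triplet = nn.TripletMarginLoss(margin=0.2)
metric_loss = triplet(anchor.unsqueeze(0), positive.unsqueeze(0), negatives[idx].unsqueeze(0))
# REINFORCE-style signal: picks that yield larger metric loss (harder) get reinforced.
sampler_loss = -dist.log_prob(idx) * metric_loss.detach()
(metric_loss + sampler_loss).backward()
```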

 

Talk 6

Title: Object-aware Aggregation with Bidirectional Temporal Graph for Video Captioning

 

Authors: 张俊超, 彭宇新

 

Speaker: 张俊超, PhD student, Peking University

 

Summary: Video captioning aims to automatically generate natural-language sentences describing video content and has received much attention in recent years. Generating accurate descriptions requires not only understanding the global content of a video but also capturing fine-grained object information, and the expressive power of the video representation is another key factor. To address these issues, this paper proposes an object-aware aggregation method based on a bidirectional temporal graph. On one hand, a bidirectional temporal graph is built along both the forward and backward temporal directions to capture the temporal trajectory of each object; on the other hand, an object-aware aggregation model uses learnable VLAD modules to learn locally aggregated features for each object. A hierarchical attention mechanism further guides the decoder to generate sentences that accurately describe the temporal evolution of the objects in the video. The method is validated on two widely used datasets and surpasses previously reported results on the BLEU@4, METEOR, and CIDEr metrics.

 

Abstract: Video captioning aims to automatically generate natural language descriptions of video content, and has drawn a lot of attention in recent years. Generating accurate and fine-grained captions requires not only understanding the global content of a video, but also capturing detailed object information. Meanwhile, video representations have a great impact on the quality of the generated captions. Thus, it is important for video captioning to capture salient objects together with their detailed temporal dynamics, and to represent them with discriminative spatio-temporal representations. In this paper, we propose a new video captioning approach based on object-aware aggregation with a bidirectional temporal graph (OA-BTG), which captures detailed temporal dynamics for salient objects in video, and learns discriminative spatio-temporal representations by performing object-aware local feature aggregation on detected object regions. The main novelties and advantages are: (1) Bidirectional temporal graph: a bidirectional temporal graph is constructed along and reversely along the temporal order, providing complementary ways to capture the temporal trajectory of each salient object. (2) Object-aware aggregation: learnable VLAD (Vector of Locally Aggregated Descriptors) models are constructed on object temporal trajectories and the global frame sequence, performing object-aware aggregation to learn discriminative representations. A hierarchical attention mechanism is also developed to distinguish the different contributions of different object instances. Experiments on two widely used datasets demonstrate that our OA-BTG achieves state-of-the-art performance in terms of the BLEU@4, METEOR and CIDEr metrics.
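As a concrete illustration of the object-aware aggregation component, here is a minimal NetVLAD-style learnable aggregation module in PyTorch, of the kind applied to each object trajectory and to the global frame sequence. The cluster count, feature dimensionality, and normalization choices are illustrative assumptions, not the OA-BTG implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LearnableVLAD(nn.Module):
    """Minimal NetVLAD-style learnable aggregation: soft-assigns each feature
    to learned cluster centers and accumulates the residuals."""

    def __init__(self, dim=512, clusters=8):
        super().__init__()
        self.assign = nn.Linear(dim, clusters)            # soft-assignment weights
        self.centers = nn.Parameter(torch.randn(clusters, dim) * 0.01)

    def forward(self, x):                                 # x: (N, dim) features of one trajectory
        a = F.softmax(self.assign(x), dim=1)              # (N, clusters)
        residuals = x.unsqueeze(1) - self.centers.unsqueeze(0)   # (N, clusters, dim)
        vlad = (a.unsqueeze(2) * residuals).sum(dim=0)    # (clusters, dim)
        vlad = F.normalize(vlad, dim=1)                   # intra-normalization
        return F.normalize(vlad.flatten(), dim=0)         # final descriptor


# One object's features along its (forward or backward) temporal trajectory.
trajectory = torch.randn(30, 512)                         # 30 time steps x 512-D region feature
descriptor = LearnableVLAD()(trajectory)
print(descriptor.shape)                                   # torch.Size([4096]) = clusters * dim
```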