Seminars in 2018
Abstract: We present an accurate stereo matching method using local expansion moves based on graph cuts. This new
move-making scheme is used to efficiently infer per-pixel 3D plane labels on a pairwise Markov random field (MRF) that effectively
combines recently proposed slanted patch matching and curvature regularization terms. The local expansion moves are presented as
many α-expansions defined for small grid regions. The local expansion moves extend traditional expansion moves in two ways:
localization and spatial propagation. By localization, we use different candidate α-labels according to the locations of local
α-expansions. By spatial propagation, we design our local α-expansions to propagate currently assigned labels to nearby regions.
With this localization and spatial propagation, our method can efficiently infer MRF models with a continuous label space using
randomized search. Our method has several advantages over previous approaches that are based on fusion moves or belief
propagation: it produces submodular moves that guarantee subproblem optimality; it helps find good, smooth, piecewise linear disparity
maps; it is suitable for parallelization; and it can use cost-volume filtering techniques to accelerate the matching cost computations. Even
using a simple pairwise MRF, our method is shown to have the best performance on the Middlebury stereo benchmarks V2 and V3.
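To make the idea of a per-pixel 3D plane label concrete, the following minimal sketch assumes the common slanted-plane parameterization in which a label (a, b, c) assigns the disparity d = a*x + b*y + c to pixel (x, y). The names `PlaneLabel`, `disparity`, and `patch_disparities` are hypothetical and are not taken from the attached paper.

```python
from typing import List, NamedTuple


class PlaneLabel(NamedTuple):
    """Hypothetical 3D plane label: disparity = a*x + b*y + c (assumed form)."""
    a: float
    b: float
    c: float


def disparity(label: PlaneLabel, x: int, y: int) -> float:
    """Disparity that the plane label assigns to pixel (x, y)."""
    return label.a * x + label.b * y + label.c


def patch_disparities(label: PlaneLabel, cx: int, cy: int, radius: int) -> List[List[float]]:
    """Disparities over a square patch centred at (cx, cy).

    A slanted patch lets every pixel in the window have its own disparity,
    unlike a fronto-parallel patch where one constant is shared.
    """
    return [
        [disparity(label, x, y) for x in range(cx - radius, cx + radius + 1)]
        for y in range(cy - radius, cy + radius + 1)
    ]


if __name__ == "__main__":
    fronto_parallel = PlaneLabel(0.0, 0.0, 5.0)  # constant disparity
    slanted = PlaneLabel(0.5, 0.0, 2.0)          # disparity grows with x
    print(patch_disparities(fronto_parallel, 4, 4, 1))
    print(patch_disparities(slanted, 4, 4, 1))
```

Because a, b, and c are real-valued, the label space is continuous, which is why candidate labels have to be proposed (for example by randomized search) rather than enumerated.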
Attached files: continuous 3D Label Stereo Matching using Local Expansion moves.pdf
Human actions captured in video sequences are three-dimensional signals characterizing visual appearance and motion dynamics. To learn action patterns, existing methods adopt Convolutional and/or Recurrent Neural Networks (CNNs and RNNs). CNN-based methods are effective in learning spatial appearances, but are limited in modeling long-term motion dynamics. RNNs, especially Long Short-Term Memory (LSTM), are able to learn temporal motion dynamics. However, naively applying RNNs to video sequences in a convolutional manner implicitly assumes that motions in videos are stationary across different spatial locations. This assumption is valid for short-term motions but
invalid when the duration of the motion is long. In this work, we propose Lattice-LSTM (L2STM), which extends LSTM by learning independent hidden state transitions of memory cells for individual spatial locations. This method effectively enhances the ability to model dynamics across time and addresses the non-stationary issue of long-term motion dynamics without significantly increasing
the model complexity. Additionally, we introduce a novel multi-modal training procedure for training our network. Unlike traditional two-stream architectures which use RGB and optical flow information as input, our two-stream model leverages both modalities to jointly train both input gates and both forget gates in the network rather than treating the two streams as separate entities with no information about the other. We apply this end-to-end system to benchmark
datasets (UCF-101 and HMDB-51) of human action recognition. Experiments show that on both datasets, our proposed method outperforms all existing ones that are based on LSTM and/or CNNs of
similar model complexities.
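The difference between a recurrence shared across all spatial locations and one learned independently per location can be illustrated with a toy example. This is not the L2STM cell: it is a plain tanh recurrence with one scalar per location, and the function names and weight values are assumptions chosen only for illustration.

```python
import math
from typing import List, Sequence


def recurrent_step(x: Sequence[float], h: Sequence[float],
                   w_in: float, w_rec: Sequence[float]) -> List[float]:
    """One toy recurrent update; w_rec holds one recurrent weight per location."""
    return [math.tanh(w_in * xi + wr * hi) for xi, hi, wr in zip(x, h, w_rec)]


def run(frames: Sequence[Sequence[float]], w_in: float,
        w_rec: Sequence[float]) -> List[float]:
    """Run the recurrence over all frames, starting from a zero hidden state."""
    h = [0.0] * len(w_rec)
    for x in frames:
        h = recurrent_step(x, h, w_in, w_rec)
    return h


if __name__ == "__main__":
    # Three locations that all see the same input in every frame.
    frames = [[1.0, 1.0, 1.0]] * 3
    shared = run(frames, 1.0, [0.5, 0.5, 0.5])      # same transition everywhere
    per_location = run(frames, 1.0, [0.1, 0.5, 0.9])  # location-specific transitions
    print(shared)
    print(per_location)
```

With a shared weight, identical inputs yield identical hidden states at every location; with per-location weights, the states diverge over time, which is the extra freedom needed when motion statistics differ across the frame.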
Attached files: Lattice Long Short-Term Memory for Human Action Recognition.pdf
Stereo matching is a challenging problem with respect to weak texture, discontinuities, illumination difference and occlusions. Therefore, a deep learning framework is presented in this paper, which focuses on the first and last stage of typical stereo methods: the matching cost computation and the
disparity refinement. For matching cost computation, two patch-based network architectures are exploited to allow the trade-off between speed and accuracy, both of which leverage a multi-size and multi-layer pooling unit with no strides to learn cross-scale feature representations. For disparity refinement, unlike traditional handcrafted refinement algorithms, we incorporate the initial optimal and sub-optimal disparity maps before outlier detection. Furthermore, diverse base learners are encouraged to focus on specific replacement tasks, corresponding to the smooth regions and details. Experiments on different datasets demonstrate the effectiveness of our approach, which is able to obtain sub-pixel accuracy and restore occlusions to a great extent. Specifically, our accurate framework attains near-peak accuracy both in non-occluded and occluded regions and our fast framework achieves competitive performance against the fast algorithms on the Middlebury benchmark.
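One way to read "multi-size pooling with no strides" is pooling at several window sizes with stride 1, so that the output keeps the input resolution and each position collects features at several scales. The sketch below follows that reading on a 1D signal. The use of max pooling, the window sizes (1, 3, 5), and the border handling are assumptions, not details from the abstract.

```python
from typing import List, Sequence, Tuple


def max_pool_stride1(signal: Sequence[float], size: int) -> List[float]:
    """Max pooling with stride 1; the window is clipped at the borders
    (an assumed border rule), so the output has the same length as the input."""
    half = size // 2
    n = len(signal)
    return [max(signal[max(0, i - half): min(n, i + half + 1)]) for i in range(n)]


def multi_size_pool(signal: Sequence[float],
                    sizes: Sequence[int] = (1, 3, 5)) -> List[Tuple[float, ...]]:
    """Concatenate stride-1 pooled responses of several window sizes per position."""
    pooled = [max_pool_stride1(signal, s) for s in sizes]
    return [tuple(p[i] for p in pooled) for i in range(len(signal))]


if __name__ == "__main__":
    for features in multi_size_pool([1, 5, 2, 0, 3]):
        print(features)
```

Since no position is skipped, the pooled features stay aligned with the input pixels, which matters when the output is a dense matching cost.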
A capsule is a group of neurons whose outputs represent different properties of the same entity. We describe a version of capsules in which each capsule has a logistic unit to represent the presence of an entity and a 4x4 pose matrix which could learn to represent the relationship between that entity and the viewer. A capsule in one layer votes for the pose matrix of many different capsules in the layer above by multiplying its own pose matrix by viewpoint-invariant transformation matrices that could learn to represent part-whole relationships. Each of these votes is weighted by an assignment coefficient. These coefficients are iteratively updated using the EM algorithm such that the output of each capsule is routed to a capsule in the layer above that receives a cluster of similar votes. The whole system is trained discriminatively by unrolling 3 iterations of EM between each pair of adjacent layers. On the smallNORB benchmark, capsules reduce the number of test errors by 45% compared to the state-of-the-art. Capsules also show far more resistance to white-box adversarial attacks than our baseline convolutional neural network.
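The voting step described above can be sketched in a few lines: a lower-level capsule multiplies its pose matrix by a transformation matrix to produce a vote, and votes for the same higher-level capsule are combined using the assignment coefficients as weights. The sketch covers only the vote and the weighted mean; the variances, activations, and the update of the coefficients in the EM iterations are omitted, and all function names are hypothetical.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x k) and b (k x m)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def vote(pose: Matrix, transform: Matrix) -> Matrix:
    """Vote of a lower capsule for a higher capsule: pose times transformation."""
    return matmul(pose, transform)


def weighted_mean(votes: Sequence[Matrix], weights: Sequence[float]) -> Matrix:
    """Assignment-weighted mean of the votes (one part of an M-step; simplified)."""
    total = sum(weights)
    rows, cols = len(votes[0]), len(votes[0][0])
    return [
        [sum(w * v[r][c] for v, w in zip(votes, weights)) / total for c in range(cols)]
        for r in range(rows)
    ]


def scaled_identity(scale: float, n: int = 4) -> Matrix:
    """Helper for the demo: scale times the n x n identity matrix."""
    return [[scale if r == c else 0.0 for c in range(n)] for r in range(n)]


if __name__ == "__main__":
    votes = [vote(scaled_identity(1.0), scaled_identity(1.0)),
             vote(scaled_identity(3.0), scaled_identity(1.0))]
    print(weighted_mean(votes, [1.0, 3.0]))
```

A higher-level capsule whose incoming votes agree closely would receive larger assignment coefficients in later iterations, which is how a cluster of similar votes ends up dominating the routing.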
Attached files: MATRIX CAPSULES WITH EM ROUTING.pdf