Seminars in 2018
2018-08-18
    We propose a face detection method based on skin color likelihood via a boosting algorithm which emphasizes skin color information while deemphasizing non-skin color information. A stochastic model is adapted to compute the similarity between a color region and skin color. Both Haar-like features and Local Binary Pattern (LBP) features are utilized to build a cascaded classifier. The boosted classifier is implemented with skin color emphasis to localize the face region in a color image. In our experiments, the proposed method shows good tolerance to face pose variations and complex backgrounds, with significant improvements over classical boosting-based classifiers in terms of total error rate.
    Attached files: Face detection based on skin color likelihood (ori).pdf
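    The stochastic skin model can be illustrated with a short sketch. Below is a minimal Python example, assuming a single Gaussian over the chroma (Cr, Cb) channels; the mean and covariance values are illustrative placeholders, not the trained model from the paper.

        import numpy as np
        import cv2

        SKIN_MEAN = np.array([148.0, 117.0])        # assumed (Cr, Cb) mean
        SKIN_COV = np.array([[130.0, 25.0],
                             [25.0, 100.0]])        # assumed (Cr, Cb) covariance
        SKIN_COV_INV = np.linalg.inv(SKIN_COV)

        def skin_likelihood(bgr_image):
            """Per-pixel skin color likelihood in [0, 1] under a Gaussian model."""
            ycrcb = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2YCrCb).astype(np.float64)
            chroma = ycrcb[:, :, 1:3] - SKIN_MEAN   # drop luma, center (Cr, Cb)
            d2 = np.einsum('hwi,ij,hwj->hw', chroma, SKIN_COV_INV, chroma)
            return np.exp(-0.5 * d2)                # Mahalanobis distance -> likelihood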
2018-09-01
    Research on deep neural networks with discrete parameters and their deployment in embedded systems has been an active and promising topic. Although previous works have successfully reduced precision in inference, transferring both training and inference processes to low-bitwidth integers has not been demonstrated simultaneously. In this work, we develop a new method termed "WAGE" to discretize both training and inference, where weights (W), activations (A), gradients (G) and errors (E) among layers are shifted and linearly constrained to low-bitwidth integers. To perform pure discrete dataflow for fixed-point devices, we further replace batch normalization with a constant scaling layer and simplify other components that are arduous for integer implementation. Improved accuracies can be obtained on multiple datasets, which indicates that WAGE somehow acts as a type of regularization. Empirically, we demonstrate the potential to deploy training in hardware systems such as integer-based deep learning accelerators and neuromorphic chips with comparable accuracy and higher energy efficiency, which is crucial to future AI applications in variable scenarios with transfer and continual learning demands.
    Attached files: paper.pdf
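    As a rough illustration of the constraint to low-bitwidth integers, the sketch below maps values onto a uniform k-bit grid in [-1, 1), in the spirit of WAGE's quantization; the paper's exact shift-based scaling of gradients and errors is omitted.

        import numpy as np

        def quantize(x, bits):
            """Round x onto the uniform k-bit grid with spacing 2**(1 - bits)."""
            step = 2.0 ** (1 - bits)
            return np.clip(np.round(x / step) * step, -1.0, 1.0 - step)

        x = np.random.randn(4) * 0.3
        print(quantize(x, bits=2))   # coarse, ternary-like values
        print(quantize(x, bits=8))   # close to the original values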
2018-08-11
    Modern CNN-based object detectors rely on bounding box regression and non-maximum suppression (NMS) to localize objects. While the probabilities for class labels naturally reflect classification confidence, localization confidence is absent. This makes properly localized bounding boxes degenerate during iterative regression or even become suppressed during NMS. In this paper we propose IoU-Net, which learns to predict the IoU between each detected bounding box and the matched ground truth. The network thus acquires a confidence of localization, which improves the NMS procedure by preserving accurately localized bounding boxes. Furthermore, an optimization-based bounding box refinement method is proposed, where the predicted IoU is formulated as the objective. Extensive experiments on the MS-COCO dataset show the effectiveness of IoU-Net, as well as its compatibility with and adaptivity to several state-of-the-art object detectors.
    Attached files: Acquisition of Localization Confidence for Accurate Object Detection
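    For reference, the quantity IoU-Net is trained to predict is the plain intersection-over-union between a detection and its matched ground truth; in IoU-guided NMS, detections are then ranked by this predicted localization confidence rather than by the classification score. A minimal sketch for axis-aligned boxes (x1, y1, x2, y2):

        def iou(a, b):
            """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
            ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
            ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
            inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
            area_a = (a[2] - a[0]) * (a[3] - a[1])
            area_b = (b[2] - b[0]) * (b[3] - b[1])
            return inter / (area_a + area_b - inter)

        print(iou((0, 0, 10, 10), (5, 5, 15, 15)))   # ~0.143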
2018-08-04
2018-07-28
    We propose AffordanceNet, a new deep learning approach to simultaneously detect multiple objects and their affordances from RGB images. Our AffordanceNet has two branches: an object detection branch to localize and classify the object, and an affordance detection branch to assign each pixel in the object to its most probable affordance label. The proposed framework employs three key components for effectively handling the multiclass problem in the affordance mask: a sequence of deconvolutional layers, a robust resizing strategy, and a multi-task loss function. Experimental results on public datasets show that our AffordanceNet outperforms recent state-of-the-art methods by a fair margin, while its end-to-end architecture allows inference at a speed of 150 ms per image. This makes our AffordanceNet well suited for real-time robotic applications. Furthermore, we demonstrate the effectiveness of AffordanceNet in different testing environments and in real robotic applications. The source code is available at https://github.com/nqanh/affordance-net.
2018-07-07
    A common approach to segmenting moving objects in a scene is to perform background subtraction. Several methods have been proposed in this domain. However, they lack the ability to handle various difficult scenarios such as illumination changes, background or camera motion, camouflage effects, shadows, etc. To address these issues, we propose a robust and flexible encoder-decoder type neural network based approach. We adapt a pretrained convolutional network, i.e. VGG-16 Net, under a triplet framework in the encoder part to embed an image at multiple scales into the feature space, and use a transposed convolutional network in the decoder part to learn a mapping from feature space to image space. We train this network end-to-end using only a few training samples. Our network takes an RGB image at three different scales and produces a foreground segmentation probability mask for the corresponding image. In order to evaluate our model, we entered the Change Detection 2014 Challenge (changedetection.net) and our method outperformed all the existing state-of-the-art methods with an average F-Measure of 0.9770. Our source code will be made publicly available at https://github.com/lim-anggun/FgSegNet.
    Attached files: fgSegNet_triplet.pdf
2018-06-30
    The text data present in overlaid bands convey brief descriptions of news events in broadcast videos. The process of text extraction becomes challenging as overlay text is presented in widely varying formats and often with animation effects. We note that existing edge-density-based methods are well suited to our application on account of their simplicity and speed of operation. However, these methods are sensitive to thresholds and have high false positive rates. In this paper, we present a contrast enhancement based preprocessing stage for overlay text detection and a parameter-free edge-density-based scheme for efficient text band detection. The second contribution of this paper is a novel approach for multiple text region tracking with a formal identification of all possible detection failure cases. The tracking stage enables us to establish the temporal presence of text bands and their linking over time. The third contribution is the adoption of Tesseract OCR for the specific task of overlay text recognition using web news articles. The proposed approach is tested and found superior on news videos acquired from three Indian English television news channels along with benchmark datasets.
    Attached files: Overlay text ectraction.pdf
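    The edge density idea can be sketched in a few lines: text bands produce image rows with unusually many edge pixels. In the sketch below, the cutoff is simply the mean row density, one possible parameter-free choice shown for illustration rather than the paper's exact scheme.

        import cv2

        def text_band_rows(gray):
            """Candidate overlay-text rows of an 8-bit grayscale frame."""
            edges = cv2.Canny(gray, 100, 200)
            density = (edges > 0).mean(axis=1)   # edge-pixel fraction per row
            return density > density.mean()      # boolean mask of candidate rows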
2018-06-23
    Multi-person pose estimation has advanced greatly in recent years, especially with the development of convolutional neural networks. However, there still exist many challenging cases, such as occluded keypoints, invisible keypoints and complex backgrounds, which are not yet well addressed. In this paper, we present a novel network structure called Cascaded Pyramid Network (CPN) which aims to relieve the problem posed by these "hard" keypoints. More specifically, our algorithm includes two stages: GlobalNet and RefineNet. GlobalNet is a feature pyramid network which can successfully localize the "simple" keypoints like eyes and hands but may fail to precisely recognize the occluded or invisible keypoints. Our RefineNet explicitly handles the "hard" keypoints by integrating all levels of feature representations from the GlobalNet together with an online hard keypoint mining loss. In general, to address the multi-person pose estimation problem, a top-down pipeline is adopted to first generate a set of human bounding boxes based on a detector, followed by our CPN for keypoint localization in each human bounding box. Based on the proposed algorithm, we achieve state-of-the-art results on the COCO keypoint benchmark, with average precision of 73.0 on the COCO test-dev dataset and 72.1 on the COCO test-challenge dataset, a 19% relative improvement over the 60.5 of the COCO 2016 keypoint challenge. Code and detection results are publicly available for further research.
    Attached files: Cascaded Pyramid Network for Multi-Person Pose Estimation.pdf
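    The online hard keypoint mining loss can be sketched briefly: compute a per-keypoint heatmap loss, then backpropagate only the hardest keypoints (the paper keeps the top 8 of COCO's 17). A minimal PyTorch version, with an L2 heatmap loss assumed for illustration:

        import torch

        def ohkm_loss(pred, target, topk=8):
            """pred, target: (batch, num_keypoints, H, W) heatmaps."""
            per_kpt = ((pred - target) ** 2).mean(dim=(2, 3))  # loss per keypoint
            hard, _ = per_kpt.topk(topk, dim=1)                # keep the hardest ones
            return hard.mean()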
2018-06-16
    Feature pyramids are a basic component in recognition systems for detecting objects at different scales. But recent deep learning object detectors have avoided pyramid representations, in part because they are compute and memory intensive. In this paper, we exploit the inherent multi-scale, pyramidal hierarchy of deep convolutional networks to construct feature pyramids with marginal extra cost. A top-down architecture with lateral connections is developed for building high-level semantic feature maps at all scales. This architecture, called a Feature Pyramid Network (FPN), shows significant improvement as a generic feature extractor in several applications. Using FPN in a basic Faster R-CNN system, our method achieves state-of-the-art single model results on the COCO detection benchmark without bells and whistles, surpassing all existing single-model entries including those from the COCO 2016 challenge winners. In addition, our method can run at 5 FPS on a GPU and thus is a practical and accurate solution to multi-scale object detection. Code will be made publicly available.
    Attached files: Lin_Feature_Pyramid_Networks_CVPR_2017_paper.pdf
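    The top-down architecture with lateral connections can be sketched compactly: 1x1 convolutions project the backbone maps to a common width, and each level is upsampled and summed with the finer lateral map below it. A minimal PyTorch sketch; channel sizes follow a ResNet-50-style backbone and are assumptions for illustration.

        import torch.nn.functional as F
        from torch import nn

        class TopDownFPN(nn.Module):
            def __init__(self, in_channels=(512, 1024, 2048), width=256):
                super().__init__()
                self.lateral = nn.ModuleList(nn.Conv2d(c, width, 1) for c in in_channels)
                self.smooth = nn.ModuleList(nn.Conv2d(width, width, 3, padding=1)
                                            for _ in in_channels)

            def forward(self, feats):   # feats: [C3, C4, C5], fine to coarse
                laterals = [lat(f) for lat, f in zip(self.lateral, feats)]
                for i in range(len(laterals) - 2, -1, -1):   # top-down pathway
                    laterals[i] = laterals[i] + F.interpolate(
                        laterals[i + 1], scale_factor=2, mode='nearest')
                return [s(x) for s, x in zip(self.smooth, laterals)]   # [P3, P4, P5]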
2018-06-09
    This study proposes an automatic reading approach for a pointer gauge based on computer vision. Moreover, the study aims to highlight the defects of current automatic recognition methods for pointer gauges and introduces a method that uses a coarse-to-fine scheme and has superior performance in the accuracy and stability of its reading identification. First, it uses the region growing method to locate the dial region and its center. Second, it uses an improved central projection method to determine the circular scale region under the polar coordinate system and detect the scale marks. Then, border detection is implemented in the dial image, and the Hough transform method is used to obtain the pointer direction by means of pointer contour fitting. Finally, the reading of the gauge is obtained by comparing the location of the pointer with the scale marks. The experimental results demonstrate the effectiveness of the proposed approach. This approach is applicable for reading gauges whose scale marks are either evenly or unevenly distributed.
    Attached files: Machine Vision Based Automatic Detection Method of Indicating Values of a Pointer Gauge.pdf
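    The final reading step reduces to interpolating the pointer angle between the two scale marks that bracket it, which is why unevenly distributed marks pose no problem. A minimal sketch (angles in degrees, values in gauge units):

        def gauge_reading(pointer_angle, lo_angle, hi_angle, lo_value, hi_value):
            """Linear interpolation between the two bracketing scale marks."""
            t = (pointer_angle - lo_angle) / (hi_angle - lo_angle)
            return lo_value + t * (hi_value - lo_value)

        print(gauge_reading(47.5, 30.0, 60.0, 2.0, 4.0))   # -> about 3.17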
2018-06-02
    We introduce Spatio-Temporal Vector of Locally Max Pooled Features (ST-VLMPF), a super vector-based encoding method specifically designed for local deep features encoding. The proposed method addresses an important problem of video understanding: how to build a video representation that incorporates the CNN features over the entire video. Feature assignment is carried out at two levels, by using the similarity and spatio-temporal information. For each assignment we build a specific encoding, focused on the nature of deep features, with the goal to capture the highest feature responses from the highest neuron activation of the network. Our ST-VLMPF clearly provides a more reliable video representation than some of the most widely used and powerful encoding approaches (Improved Fisher Vectors and Vector of Locally Aggregated Descriptors), while maintaining a low computational complexity. We conduct experiments on three action recognition datasets: HMDB51, UCF50 and UCF101. Our pipeline obtains state-of-the-art results.
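    The core encoding step can be sketched simply: assign each local deep feature to its nearest codeword, then max-pool (rather than sum) the features within each assignment so the highest responses survive. A simplified numpy sketch of this idea, with a random codebook standing in for a learned one:

        import numpy as np

        rng = np.random.default_rng(0)
        features = rng.standard_normal((500, 64))   # local deep features
        codebook = rng.standard_normal((8, 64))     # k codewords (learned in practice)

        assign = np.argmin(((features[:, None] - codebook) ** 2).sum(-1), axis=1)
        encoding = np.concatenate(
            [features[assign == k].max(axis=0) if (assign == k).any()
             else np.zeros(64) for k in range(len(codebook))])
        print(encoding.shape)                       # (8 * 64,) = (512,)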
2018-05-26
    Accurately tracking multiple vehicles plays an important role in intelligent transportation, especially for intelligent vehicles. Due to complicated traffic environments, it is difficult to track multiple vehicles accurately and robustly, especially when there are occlusions among vehicles. To alleviate these problems, a new approach is proposed to track multiple vehicles by combining robust detection with two classifiers. An improved ViBe algorithm is proposed for robust and accurate detection of multiple vehicles. It uses gray-scale spatial information to build a dictionary of pixel life lengths so that ghost shadows and objects' residual shadows are quickly blended into the background samples. The improved algorithm also applies an effective post-processing method to restrain dynamic noise. In this paper, we further design a two-classifier method to address failures in tracking vehicles under occlusion and interference. It classifies tracking rectangles whose confidence values fall between two thresholds by combining local binary patterns with a support vector machine (SVM) classifier, and then applies a convolutional neural network (CNN) classifier a second time to remove interference areas between vehicles and other moving objects. The two-classifier method combines the time efficiency of the SVM with the high accuracy of the CNN. Compared with several existing methods, qualitative and quantitative analysis of our experimental results shows that the proposed method not only effectively removes ghost shadows and improves detection accuracy and real-time performance, but is also robust to occlusions of multiple vehicles in various traffic scenes.
    Attached files: A New Approach to Track Multiple Vehicles With.pdf 20180526-report-Yang Yu.pptx
2018-05-19
    This study proposes an automatic reading approach for a pointer gauge based on computer vision. Moreover, the study aims to highlight the defects of current automatic recognition methods for pointer gauges and introduces a method that uses a coarse-to-fine scheme and has superior performance in the accuracy and stability of its reading identification. First, it uses the region growing method to locate the dial region and its center. Second, it uses an improved central projection method to determine the circular scale region under the polar coordinate system and detect the scale marks. Then, border detection is implemented in the dial image, and the Hough transform method is used to obtain the pointer direction by means of pointer contour fitting. Finally, the reading of the gauge is obtained by comparing the location of the pointer with the scale marks. The experimental results demonstrate the effectiveness of the proposed approach. This approach is applicable for reading gauges whose scale marks are either evenly or unevenly distributed.
    Attached files: Machine Vision Based Automatic Detection Method of Indicating Values of a Pointer Gauge.pdf
2018-05-12
    This paper presents a real-time face detector, named Single Shot Scale-invariant Face Detector (S3FD), which performs superiorly on various scales of faces with a single deep neural network, especially for small faces. Specifically, we try to solve the common problem that anchor-based detectors deteriorate dramatically as the objects become smaller. We make contributions in the following three aspects: 1) proposing a scale-equitable face detection framework to handle different scales of faces well. We tile anchors on a wide range of layers to ensure that all scales of faces have enough features for detection. Besides, we design anchor scales based on the effective receptive field and a proposed equal-proportion interval principle; 2) improving the recall rate of small faces by a scale compensation anchor matching strategy; 3) reducing the false positive rate of small faces via a max-out background label. As a consequence, our method achieves state-of-the-art detection performance on all the common face detection benchmarks, including the AFW, PASCAL face, FDDB and WIDER FACE datasets, and can run at 36 FPS on an Nvidia Titan X (Pascal) for VGA-resolution images.
    Attached files: 1708.05237.pdf
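    The scale-equitable design is easy to picture: each detection layer tiles square anchors whose scale is four times its stride, so anchor scales run from 16 to 512 pixels over strides 4 to 128 and every face scale is covered by a similar number of anchors. A minimal sketch of the tiling:

        strides = [4, 8, 16, 32, 64, 128]
        anchor_scales = [4 * s for s in strides]   # 16, 32, 64, 128, 256, 512 px

        def tile_anchors(stride, scale, feat_h, feat_w):
            """One square anchor of `scale` px centered on each feature cell."""
            half = scale / 2
            return [(x * stride + stride / 2 - half, y * stride + stride / 2 - half,
                     x * stride + stride / 2 + half, y * stride + stride / 2 + half)
                    for y in range(feat_h) for x in range(feat_w)]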
2018-05-05
    We present a conceptually simple, flexible, and general framework for object instance segmentation. Our approach efficiently detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance. The method, called Mask R-CNN, extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps. Moreover, Mask R-CNN is easy to generalize to other tasks, e.g., allowing us to estimate human poses in the same framework. We show top results in all three tracks of the COCO suite of challenges, including instance segmentation, bounding-box object detection, and person keypoint detection. Without bells and whistles, Mask R-CNN outperforms all existing single-model entries on every task, including the COCO 2016 challenge winners. We hope our simple and effective approach will serve as a solid baseline and help ease future research in instance-level recognition. Code has been made available at: https://github.com/facebookresearch/Detectron.
    Attached files: MaskRCNN.pdf
2018-04-21
    We present an improved three-step pipeline for the stereo matching problem and introduce multiple novelties at each stage. We propose a new highway network architecture for computing the matching cost at each possible disparity, based on multilevel weighted residual shortcuts, trained with a hybrid loss that supports multilevel comparison of image patches. A novel post-processing step is then introduced, which employs a second deep convolutional neural network for pooling global information from multiple disparities. This network outputs both the image disparity map, which replaces the conventional "winner takes all" strategy, and a confidence in the prediction. The confidence score is achieved by training the network with a new technique that we call the reflective loss. Lastly, the learned confidence is employed in order to better detect outliers in the refinement step. The proposed pipeline achieves state-of-the-art accuracy on the largest and most competitive stereo benchmarks, and the learned confidence is shown to outperform all existing alternatives.
2018-04-07
    We introduce the dense captioning task, which requires a computer vision system to both localize and describe salient regions in images in natural language. The dense captioning task generalizes object detection when the descriptions consist of a single word, and image captioning when one predicted region covers the full image. To address the localization and description task jointly, we propose a Fully Convolutional Localization Network (FCLN) architecture that processes an image with a single, efficient forward pass, requires no external region proposals, and can be trained end-to-end with a single round of optimization. The architecture is composed of a Convolutional Network, a novel dense localization layer, and a Recurrent Neural Network language model that generates the label sequences. We evaluate our network on the Visual Genome dataset, which comprises 94,000 images and 4,100,000 region-grounded captions. We observe both speed and accuracy improvements over baselines based on current state-of-the-art approaches in both generation and retrieval settings.
    Attached files: DenseCap-Fully Convolutional Localization Networks for Dense Captioning.pdf
2018-03-31
    Articulated human pose estimation is a fundamental yet challenging task in computer vision. The difficulty is particularly pronounced in scale variations of human body parts when the camera view changes or severe foreshortening happens. Although pyramid methods are widely used to handle scale changes at inference time, learning feature pyramids in deep convolutional neural networks (DCNNs) is still not well explored. In this work, we design Pyramid Residual Modules (PRMs) to enhance the scale invariance of DCNNs. Given input features, the PRMs learn convolutional filters on various scales of input features, which are obtained with different subsampling ratios in a multi-branch network. Moreover, we observe that it is inappropriate to adopt existing methods to initialize the weights of multi-branch networks, which have recently achieved superior performance to plain networks in many tasks. Therefore, we provide a theoretical derivation to extend the current weight initialization scheme to multi-branch network structures. We investigate our method on two standard benchmarks for human pose estimation. Our approach obtains state-of-the-art results on both benchmarks. Code is available at https://github.com/bearpaw/PyraNet.
    Attached files: Learning Feature Pyramids for Human Pose Estimation.pdf
2018-03-17
    We introduce the notion of semantic background subtraction, a novel framework for motion detection in video sequences. The key innovation consists in leveraging object-level semantics to address the variety of challenging scenarios for background subtraction. Our framework combines the information of a semantic segmentation algorithm, expressed by a probability for each pixel, with the output of any background subtraction algorithm to reduce false positive detections produced by illumination changes, dynamic backgrounds, strong shadows, and ghosts. In addition, it maintains a fully semantic background model to improve the detection of camouflaged foreground objects. Experiments conducted on the CDNet dataset show that we significantly improve almost all background subtraction algorithms of the CDNet leaderboard, and reduce the mean overall error rate of all 34 algorithms (resp. of the best 5 algorithms) by roughly 50% (resp. 20%). Note that a C++ implementation of the framework is available at http://www.telecom.ulg.ac.be/semantic.
    Attached files: Braham2017Semantic.pdf
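    The combination rule itself is compact: pixels with a low semantic foreground probability are forced to background, pixels whose probability rises well above the semantic background model are forced to foreground, and the background subtraction result is kept otherwise. A minimal sketch; the threshold values below are illustrative.

        import numpy as np

        def combine(bgs_mask, p_semantic, p_bg_model, tau_bg=0.1, tau_fg=0.3):
            """bgs_mask: binary BGS output; probabilities are per-pixel arrays."""
            out = bgs_mask.copy()
            out[p_semantic <= tau_bg] = 0                  # surely background
            out[p_semantic - p_bg_model >= tau_fg] = 1     # surely foreground
            return out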
2018-03-10
    Accurate and fast detection of moving targets from a moving camera is an important yet challenging problem, especially when computational resources are limited. In this paper, we propose an effective, efficient, and robust method to accurately detect and segment multiple independently moving foreground targets from a video sequence taken by a monocular moving camera (e.g., onboard an unmanned aerial vehicle (UAV)). Our proposed method advances the existing methods in a number of ways: 1) camera motion is estimated by tracking background keypoints using pyramidal Lucas-Kanade at every detection interval, for efficiency; 2) foreground segmentation is applied by integrating a local motion history function with spatio-temporal differencing over a sliding window for detecting multiple moving targets, while a perspective homography is used for image registration, for effectiveness; and 3) the detection interval is adjusted dynamically based on a rule-of-thumb technique and camera setup parameters, for robustness. The proposed method has been tested on a variety of scenarios using a UAV camera, as well as publicly available datasets. Based on the reported results and through comparison with existing methods, the accuracy of the proposed method in detecting multiple moving targets, as well as its capability for real-time implementation, has been successfully demonstrated. Our method is also robustly applicable to ground-level cameras for ITS applications, as confirmed by the experimental results. More specifically, the proposed method shows promising performance compared with the literature in terms of quantitative metrics, while the run-time measures are significantly improved for real-time implementation.
    Attached files: Effective and Efficient Detection of Moving.pdf 20180310-report-Yang Yu.pptx
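    The registration-and-differencing core of such pipelines can be sketched with OpenCV: track background keypoints with pyramidal Lucas-Kanade, fit a perspective homography, warp the previous frame onto the current one, and difference. This is a generic sketch of the idea, not the authors' full method.

        import cv2

        def motion_residual(prev_gray, curr_gray):
            """Residual motion after compensating camera motion by homography."""
            pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=400,
                                          qualityLevel=0.01, minDistance=7)
            nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, pts, None)
            ok = status.ravel() == 1
            H, _ = cv2.findHomography(pts[ok], nxt[ok], cv2.RANSAC, 3.0)
            h, w = curr_gray.shape
            warped = cv2.warpPerspective(prev_gray, H, (w, h))   # register frames
            return cv2.absdiff(curr_gray, warped)                # moving pixels stand out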
2018-03-03
    We propose Efficient Neural Architecture Search (ENAS), a fast and inexpensive approach for automatic model design. In ENAS, a controller learns to discover neural network architectures by searching for an optimal subgraph within a large computational graph. The controller is trained with policy gradient to select a subgraph that maximizes the expected reward on the validation set. Meanwhile, the model corresponding to the selected subgraph is trained to minimize a canonical cross entropy loss. Thanks to parameter sharing between child models, ENAS is fast: it delivers strong empirical performance using far fewer GPU-hours than all existing automatic model design approaches, and is notably 1000x less expensive than standard Neural Architecture Search. On the Penn Treebank dataset, ENAS discovers a novel architecture that achieves a test perplexity of 55.8, establishing a new state of the art among all methods without post-training processing. On the CIFAR-10 dataset, ENAS designs novel architectures that achieve a test error of 2.89%, which is on par with NASNet (Zoph et al., 2018), whose test error is 2.65%.
    Attached files: enas.pdf
2018-02-24
    This paper presents an automatic segmentation system for characters in text color images cropped from natural images or videos, based on a new neural architecture ensuring fast processing and robustness against noise, variations in illumination, complex backgrounds and low resolution. An off-line training phase on a set of synthetic text color images, where the exact character positions are known, allows adjusting the neural parameters and thus building an optimal nonlinear filter which extracts the best features in order to robustly detect the border positions between characters. The proposed method is tested on a set of synthetic text images to precisely evaluate its performance with respect to noise, and on a set of complex text images collected from video frames and web pages to evaluate its performance on real images. The results are encouraging, with a good segmentation rate of 89.12% and a recognition rate of 81.94% on a set of difficult text images collected from video frames and from web pages.
    Attached files: [22]An_Automatic_Method_for_Video_Character_Segmentati.pdf
2018-02-10
    The ability to identify and temporally segment fine-grained human actions throughout a video is crucial for robotics, surveillance, education, and beyond. Typical approaches decouple this problem by first extracting local spatiotemporal features from video frames and then feeding them into a temporal classifier that captures high-level temporal patterns. We describe a class of temporal models, which we call Temporal Convolutional Networks (TCNs), that use a hierarchy of temporal convolutions to perform fine-grained action segmentation or detection. Our Encoder-Decoder TCN uses pooling and upsampling to efficiently capture long-range temporal patterns, whereas our Dilated TCN uses dilated convolutions. We show that TCNs are capable of capturing action compositions, segment durations, and long-range dependencies, and are over a magnitude faster to train than competing LSTM-based Recurrent Neural Networks. We apply these models to three challenging fine-grained datasets and show large improvements over the state of the art.
    Attached files: Feb saturday seminar final.pptx cand1_Lea_Temporal_Convolutional_Networks_CVPR_2017_paper.pdf
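    The Dilated TCN building block is easy to sketch: stacked 1D convolutions whose dilation doubles at each layer, so the temporal receptive field grows exponentially with depth. A minimal PyTorch sketch with illustrative channel counts:

        import torch
        from torch import nn

        channels, layers = 64, []
        for i in range(4):                     # dilations 1, 2, 4, 8
            d = 2 ** i
            layers += [nn.Conv1d(channels, channels, kernel_size=3,
                                 padding=d, dilation=d), nn.ReLU()]
        tcn = nn.Sequential(*layers)

        x = torch.randn(1, channels, 100)      # (batch, features, time)
        print(tcn(x).shape)                    # torch.Size([1, 64, 100])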
2018-02-03
    We present an accurate stereo matching method using local expansion moves based on graph cuts. This new move-making scheme is used to efficiently infer per-pixel 3D plane labels on a pairwise Markov random field (MRF) that effectively combines recently proposed slanted patch matching and curvature regularization terms. The local expansion moves are presented as many α-expansions defined for small grid regions. The local expansion moves extend traditional expansion moves in two ways: localization and spatial propagation. By localization, we use different candidate α-labels according to the locations of local α-expansions. By spatial propagation, we design our local α-expansions to propagate currently assigned labels for nearby regions. With this localization and spatial propagation, our method can efficiently infer MRF models with a continuous label space using randomized search. Our method has several advantages over previous approaches that are based on fusion moves or belief propagation: it produces submodular moves deriving a subproblem optimality; it helps find good, smooth, piecewise linear disparity maps; it is suitable for parallelization; and it can use cost-volume filtering techniques for accelerating the matching cost computations. Even using a simple pairwise MRF, our method is shown to have the best performance in the Middlebury stereo benchmarks V2 and V3.
    Attached files: TPAMI-Contiuous 3D Label Stereo Matching using Local Expansion.pdf
2018-01-27
    We present an accurate stereo matching method using local expansion moves based on graph cuts. This new move-making scheme is used to efficiently infer per-pixel 3D plane labels on a pairwise Markov random field (MRF) that effectively combines recently proposed slanted patch matching and curvature regularization terms. The local expansion moves are presented as many α-expansions defined for small grid regions. The local expansion moves extend traditional expansion moves in two ways: localization and spatial propagation. By localization, we use different candidate α-labels according to the locations of local α-expansions. By spatial propagation, we design our local α-expansions to propagate currently assigned labels for nearby regions. With this localization and spatial propagation, our method can efficiently infer MRF models with a continuous label space using randomized search. Our method has several advantages over previous approaches that are based on fusion moves or belief propagation: it produces submodular moves deriving a subproblem optimality; it helps find good, smooth, piecewise linear disparity maps; it is suitable for parallelization; and it can use cost-volume filtering techniques for accelerating the matching cost computations. Even using a simple pairwise MRF, our method is shown to have the best performance in the Middlebury stereo benchmarks V2 and V3.
    Attached files: continuous 3D Label Stereo Matching using Local Expansion moves.pdf
2018-01-20
    Human actions captured in video sequences are three dimensional signals characterizing visual appearance and motion dynamics. To learn action patterns, existing methods adopt Convolutional and/or Recurrent Neural Networks (CNNs and RNNs). CNN based methods are effective in learning spatial appearances, but are limited in modeling long-term motion dynamics. RNNs, especially Long Short Term Memory (LSTM), are able to learn temporal motion dynamics. However, naively applying RNNs to video sequences in a convolutional manner implicitly assumes that motions in videos are stationary across different spatial locations. This assumption is valid for short-term motions but invalid when the duration of the motion is long. In this work, we propose Lattice-LSTM (L2STM), which extends LSTM by learning independent hidden state transitions of memory cells for individual spatial locations. This method effectively enhances the ability to model dynamics across time and addresses the non-stationary issue of long-term motion dynamics without significantly increasing the model complexity. Additionally, we introduce a novel multi-modal training procedure for training our network. Unlike traditional two-stream architectures which use RGB and optical flow information as input, our two-stream model leverages both modalities to jointly train both input gates and both forget gates in the network rather than treating the two streams as separate entities with no information about the other. We apply this end-to-end system to benchmark datasets (UCF-101 and HMDB-51) of human action recognition. Experiments show that on both datasets, our proposed method outperforms all existing ones that are based on LSTM and/or CNNs of similar model complexities.
    Attached files: Lattice Long Short-Term Memory for Human Action Recognition.pdf
2018-01-13
    Stereo matching is a challenging problem with respect to weak texture, discontinuities, illumination differences and occlusions. Therefore, a deep learning framework is presented in this paper, which focuses on the first and last stages of typical stereo methods: the matching cost computation and the disparity refinement. For matching cost computation, two patch-based network architectures are exploited to allow a trade-off between speed and accuracy, both of which leverage multi-size and multi-layer pooling units with no strides to learn cross-scale feature representations. For disparity refinement, unlike traditional handcrafted refinement algorithms, we incorporate the initial optimal and sub-optimal disparity maps before outlier detection. Furthermore, diverse base learners are encouraged to focus on specific replacement tasks, corresponding to smooth regions and details. Experiments on different datasets demonstrate the effectiveness of our approach, which is able to obtain sub-pixel accuracy and restore occlusions to a great extent. Specifically, our accurate framework attains near-peak accuracy in both non-occluded and occluded regions, and our fast framework achieves competitive performance against fast algorithms on the Middlebury benchmark.
2018-01-06
    A capsule is a group of neurons whose outputs represent different properties of the same entity. We describe a version of capsules in which each capsule has a logistic unit to represent the presence of an entity and a 4x4 pose matrix which could learn to represent the relationship between that entity and the viewer. A capsule in one layer votes for the pose matrices of many different capsules in the layer above by multiplying its own pose matrix by viewpoint-invariant transformation matrices that could learn to represent part-whole relationships. Each of these votes is weighted by an assignment coefficient. These coefficients are iteratively updated using the EM algorithm such that the output of each capsule is routed to a capsule in the layer above that receives a cluster of similar votes. The whole system is trained discriminatively by unrolling 3 iterations of EM between each pair of adjacent layers. On the smallNORB benchmark, capsules reduce the number of test errors by 45% compared to the state of the art. Capsules also show far more resistance to white-box adversarial attacks than our baseline convolutional neural network.
    Attached files: MATRIX CAPSULES WITH EM ROUTING.pdf