Seminars in 2018
Convolutional neural networks (CNNs) have achieved state-of-the-art performance for automatic medical image segmentation. However, they have not demonstrated sufficiently accurate and robust results for clinical use. In addition, they are limited by the lack of image-specific adaptation and the lack of generalizability to previously unseen object classes (a.k.a. zero-shot learning). To address these problems, we propose a novel deep learning-based interactive segmentation framework by incorporating CNNs into a bounding box and scribble based segmentation pipeline. We propose image-specific fine tuning to make a CNN model adaptive to a specific test image, which can be either unsupervised (without additional user interactions) or supervised (with additional scribbles). We also propose a weighted loss function considering network and interaction-based uncertainty for the fine tuning. We applied this framework to two applications: 2-D segmentation of multiple organs from fetal magnetic resonance (MR) slices, where only two types of these organs were annotated for training, and 3-D segmentation of brain tumor core (excluding edema) and whole brain tumor (including edema) from different MR sequences, where only the tumor core in one MR sequence was annotated for training. Experimental results show that: 1) our model is more robust to segment previously unseen objects than state-of-the-art CNNs; 2) image-specific fine tuning with the proposed weighted loss function significantly improves segmentation accuracy; and 3) our method leads to accurate results with fewer user interactions and less user time than traditional interactive segmentation methods.
Attached files: segmentation medical image.pdf
Clustering is a class of unsupervised learning methods that has been extensively applied and studied in computer vision. Little work has been done to adapt it to the end-to-end training of visual features on large-scale datasets. In this work, we present DeepCluster, a clustering method that jointly learns the parameters of a neural network
and the cluster assignments of the resulting features. DeepCluster iteratively groups the features with a standard clustering algorithm, k-means, and uses the subsequent assignments as supervision to update the weights of the network. We apply DeepCluster to the unsupervised training of convolutional neural networks on large datasets like ImageNet and YFCC100M. The resulting model outperforms the current state of the art by a significant margin on all the standard benchmarks.
Attached files: Mathilde_Caron_Deep_Clustering_for_ECCV_2018_paper.pdf
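The alternation DeepCluster describes (cluster the features, then train on the cluster assignments) can be sketched in plain NumPy. The feature matrix below is a toy stand-in for CNN features, and `kmeans_assign` is an illustrative Lloyd's-iteration implementation, not the paper's code:

```python
import numpy as np

def kmeans_assign(X, k, iters=10, seed=0):
    """Lloyd's k-means; returns the cluster id of each row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Squared distances of every point to every center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(axis=1)
        for j in range(k):
            if (assign == j).any():
                centers[j] = X[assign == j].mean(axis=0)
    return assign

# One DeepCluster-style round: cluster the current features and treat the
# cluster ids as classification targets for the next weight update.
rng = np.random.default_rng(1)
features = np.vstack([rng.normal(3, 1, (50, 8)), rng.normal(-3, 1, (50, 8))])
pseudo_labels = kmeans_assign(features, k=2)
```

In the full method, `pseudo_labels` would supervise a classification loss for the next epoch, after which features are re-extracted and re-clustered.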
Despite the large number of both commercial and academic methods for Automatic License Plate Recognition (ALPR), most existing approaches are focused on a specific license plate (LP) region (e.g. European, US, Brazilian, Taiwanese, etc.), and frequently explore datasets containing approximately frontal images. This work proposes a complete ALPR system focusing on unconstrained capture scenarios, where the LP might be considerably distorted due to oblique views. Our main contribution is the introduction of a novel Convolutional Neural Network (CNN) capable of detecting and rectifying multiple distorted license plates in a single image, which are fed to an Optical Character Recognition (OCR) method to obtain the final result. As an additional contribution, we also present manual annotations for a challenging set of LP images from different regions and acquisition conditions. Our experimental results indicate that the proposed method, without any parameter adaptation or fine tuning for a specific scenario, performs similarly to state-of-the-art commercial systems in traditional scenarios, and outperforms both academic and commercial approaches in challenging ones.
Attached files: Sergio_Silva_License_Plate_Detection_ECCV_2018_paper.pdf
The highest accuracy object detectors to date are based on a two-stage approach popularized by R-CNN, where a classifier is applied to a sparse set of candidate object locations. In contrast, one-stage detectors that are applied over a regular, dense sampling of possible object locations have the potential to be faster and simpler, but have trailed the accuracy of two-stage detectors thus far. In this paper, we investigate why this is the case. We discover that the extreme foreground-background class imbalance encountered during training of dense detectors is the central cause. We propose to address this class imbalance by reshaping the standard cross entropy loss such that it down-weights the loss assigned to well-classified examples. Our novel Focal Loss focuses training on a sparse set of hard examples and prevents the vast number of easy negatives from overwhelming the detector during training. To evaluate the effectiveness of our loss, we design and train a simple dense detector we call RetinaNet. Our results show that when trained with the focal loss, RetinaNet is able to match the speed of previous
one-stage detectors while surpassing the accuracy of all existing state-of-the-art two-stage detectors.
Attached files: 1708.02002.pdf
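The reshaped cross entropy the abstract refers to has a closed form, FL(p_t) = -α_t (1 - p_t)^γ log(p_t); below is a minimal NumPy sketch with the paper's defaults γ = 2 and α = 0.25 (the helper name is ours):

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t).

    p: predicted foreground probabilities, y: binary labels (1 = object).
    gamma down-weights well-classified examples; alpha balances classes.
    """
    p = np.clip(p, 1e-7, 1 - 1e-7)
    p_t = np.where(y == 1, p, 1 - p)              # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)  # class-balancing weight
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)

# An easy negative (p = 0.01) contributes almost nothing to the loss,
# while a hard negative (p = 0.9) is barely down-weighted.
easy = focal_loss(np.array([0.01]), np.array([0]))
hard = focal_loss(np.array([0.9]), np.array([0]))
```

With γ = 0 the loss reduces to α-weighted cross entropy; increasing γ shrinks the contribution of the vast number of easy negatives.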
In this work, we tackle the problem of instance segmentation, the task of simultaneously solving object detection and semantic segmentation. Towards this goal, we present a model, called MaskLab, which produces three outputs: box detection, semantic segmentation, and direction prediction.
Building on top of the Faster-RCNN object detector, the predicted boxes provide accurate localization of object instances. Within each region of interest, MaskLab performs foreground/background segmentation by combining semantic and direction prediction. Semantic segmentation assists the model in distinguishing between objects of different semantic classes including background, while the direction prediction, estimating each pixel's direction towards its corresponding center, allows separating instances of the same semantic class. Moreover, we explore the effect of incorporating recent successful methods from both segmentation and detection (e.g., atrous convolution and hypercolumn). Our proposed model is evaluated on the COCO instance segmentation benchmark and shows comparable performance with other state-of-the-art models.
Attached files: 1712.04837v1.pdf
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy, and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Attached files: Distilling the Knowledge in a Neural Network.pdf
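The compression technique trains the single model on the ensemble's softened output distribution. A NumPy sketch of the usual temperature-scaled formulation (the T² scaling follows the paper; the α and T values here are illustrative choices, not the paper's):

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Soft-target cross-entropy at temperature T, mixed with hard-label CE.

    The T**2 factor keeps the soft-target gradients on the same scale as
    the hard-label term when T changes.
    """
    soft_t = softmax(teacher_logits, T)
    soft_s = softmax(student_logits, T)
    soft_loss = -(soft_t * np.log(soft_s)).sum(axis=-1).mean() * T ** 2
    hard_s = softmax(student_logits)
    hard_loss = -np.log(hard_s[np.arange(len(labels)), labels]).mean()
    return alpha * soft_loss + (1 - alpha) * hard_loss

teacher = np.array([[5.0, 1.0, 0.0]])   # stand-in for the ensemble's logits
student = np.array([[4.0, 2.0, 0.0]])
loss = distillation_loss(student, teacher, labels=np.array([0]))
```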
Optical Character Recognition (OCR) aims to recognize text in natural images. Inspired by a recently proposed model for general image classification, Recurrent Convolution Neural Network (RCNN), we propose a new architecture named Gated RCNN (GRCNN) for solving this problem. Its critical component, Gated Recurrent Convolution Layer (GRCL), is constructed by adding a gate to the Recurrent Convolution Layer (RCL), the critical component of RCNN. The gate controls the context modulation in RCL and balances the feed-forward information and the recurrent information. In addition, an efficient Bidirectional Long Short-Term Memory (BLSTM) is built for sequence modeling. The GRCNN is combined with BLSTM to recognize text in natural images. The entire GRCNN-BLSTM model can be trained end-to-end. Experiments show that the proposed model outperforms existing methods on several benchmark datasets including the IIIT-5K, Street View Text (SVT) and ICDAR.
Attached files: 6637-gated-recurrent-convolution-neural-network-for-ocr.pdf
We investigate omni-supervised learning, a special regime of semi-supervised learning in which the learner exploits all available labeled data plus internet-scale sources of unlabeled data. Omni-supervised learning is lower-bounded by performance on existing labeled datasets, offering the potential to surpass state-of-the-art fully supervised methods. To exploit the omni-supervised setting, we propose data distillation, a method that ensembles predictions from multiple transformations of unlabeled data, using a single model, to automatically generate new training annotations. We argue that visual recognition models have recently become accurate enough that it is now possible to apply classic ideas about self-training to challenging real-world data. Our experimental results show that in the cases of human keypoint detection and general object detection, state-of-the-art models trained with data distillation surpass the performance of using labeled data from the COCO dataset alone.
Attached files: Data Distillation Towards Omni-Supervised Learning.pdf
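As a toy illustration of the ensembling step, the sketch below averages a single model's predictions over two transforms of an unlabeled image (identity and horizontal flip) and thresholds the result into pseudo-annotations. The transform set, the 0.5 threshold, and the `model` callable are placeholder assumptions, not the paper's configuration:

```python
import numpy as np

def data_distill(model, image):
    """Ensemble one model over multiple transforms of an unlabeled image,
    then keep confident averaged predictions as generated annotations."""
    preds = [
        model(image),
        np.fliplr(model(np.fliplr(image))),  # flip the input, un-flip the output
    ]
    avg = np.mean(preds, axis=0)
    return (avg > 0.5).astype(np.uint8)      # pseudo-labels for retraining

# With an identity 'model' over per-pixel scores, the ensemble is a no-op
# and thresholding alone decides the pseudo-labels.
pseudo = data_distill(lambda x: x, np.array([[0.2, 0.9, 0.4]]))
```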
In the research area of computer vision and artificial intelligence, learning the relationships of objects is an important way to deeply understand images. Most recent works detect visual relationships by learning objects and predicates separately at the feature level, but the dependencies between objects and predicates have not been fully considered. In this paper, we introduce deep structured learning for visual relationship detection. Specifically, we propose a deep structured model, which learns relationships by using feature-level prediction and label-level prediction to improve the learning ability of using feature-level prediction alone. The feature-level prediction learns relationships from discriminative features, and the label-level prediction learns relationships by capturing dependencies between objects and predicates based on the learnt relationship of the feature level. Additionally, we use the structured SVM (SSVM) loss function as our optimization goal, and decompose this goal into subject, predicate, and object optimizations which become simpler and more independent. Our experiments on the Visual Relationship Detection (VRD) dataset and the large-scale Visual Genome (VG) dataset validate the effectiveness of our method, which outperforms state-of-the-art methods.
Attached files: 2018-Zhu-AAAI-Deep Structured Learning for Visual Relationship Detection.pdf
Background modeling and subtraction based on change detection are the first step in many high-level computer vision applications. Many background subtraction methods have been proposed in the recent past and their efforts mainly focus on two aspects: more advanced background models and more complex feature representations. Recently, hierarchical features learned from deep convolutional neural networks have been shown to be effective for many computer vision tasks, such as classification and recognition. However, few researchers have tried to learn deep features to address the background subtraction problem. Therefore, in this paper, we propose a novel multiscale fully convolutional network (MFCN) architecture which takes advantage of different layer features for background subtraction. We show that the foreground detection accuracy can be greatly improved by using the deep features learned from the MFCN instead of building highly complex background models, and that the complexity of the background subtraction process can be easily handled during the subtraction operation itself. Experimental results on the CDnet 2014 data set and the SBM-RGBD data set show that the proposed MFCN-based method achieves state-of-the-art performance while operating in real time.
Attached files: multiscale CNN-foreground detection.pdf
We propose a background subtraction algorithm using hierarchical superpixel segmentation, spanning trees and optical flow. First, we generate superpixel segmentation trees using a number of Gaussian Mixture Models (GMMs) by treating each GMM as one vertex to construct spanning trees. Next, we use the M-smoother to enhance the spatial consistency on the spanning trees and estimate optical flow to extend the M-smoother to the temporal domain. Experimental results on synthetic and real-world benchmark datasets show that the proposed algorithm performs favorably for background subtraction in videos against the state-of-the-art methods in spite of frequent and sudden changes of pixel values.
Attached files: 20180916-report-Yang Yu.pptx Spatiotemporal GMM for Background.pdf
Person Re-identification (ReID) is an important yet challenging task in computer vision. Due to diverse background clutters and variations in viewpoints and body poses, it is far from solved. How to extract discriminative and robust features invariant to background clutters is the core problem. In this paper, we first introduce binary segmentation masks to construct synthetic RGB-Mask pairs as inputs, then we design a mask-guided contrastive attention model (MGCAM) to learn features separately from the body and background regions. Moreover, we propose a novel region-level triplet loss to restrain the features learnt from different regions, i.e., pulling the features from the full image and the body region close, while pushing the features from backgrounds away. We may be the first to introduce the binary mask into the person ReID task and the first to propose region-level contrastive learning. We evaluate the proposed method on three public datasets, including MARS, Market-1501 and CUHK03. Extensive experimental results show that the proposed method is effective and achieves state-of-the-art results. Mask and code will be released upon request.
Attached files: Song_Mask-Guided_Contrastive_Attention_CVPR_2018_paper.pdf
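The region-level loss builds on the standard triplet loss, stated below in NumPy. Mapping anchor/positive/negative to full-image/body/background features follows the abstract; the margin value is an illustrative assumption:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Standard triplet loss: pull d(anchor, positive) below
    d(anchor, negative) by at least `margin`. In the region-level variant,
    the anchor is the full-image feature, the positive the body-region
    feature, and the negative the background-region feature."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)
```

When the full-image and body features already coincide and the background feature is far away, the hinge is inactive and the loss is zero.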
We propose a face detection method based on skin color likelihood via a boosting algorithm which emphasizes skin color information while deemphasizing non-skin color information. A stochastic model is adapted to compute the similarity between a color region and the skin color. Both Haar-like features and Local Binary Pattern (LBP) features are utilized to build a cascaded classifier. The boosted classifier is
implemented based on skin color emphasis to localize the face region from a color image. Based on our experiments, the proposed method shows good tolerance to face pose variation and complex background with significant improvements over classical boosting-based classifiers in terms of total error rate performance.
Attached files: Face detection based on skin color likelihood (ori).pdf
Research on deep neural networks with discrete parameters and their deployment in embedded systems has been an active and promising topic. Although previous works have successfully reduced precision in inference, transferring both training and inference processes to low-bitwidth integers has not been demonstrated simultaneously. In this work, we develop a new method termed "WAGE" to discretize both training and inference, where weights (W), activations (A), gradients (G) and errors (E) among layers are shifted and linearly constrained to low-bitwidth integers. To perform pure discrete dataflow for fixed-point devices, we further replace batch normalization by a constant scaling layer and simplify other components that are arduous for integer implementation. Improved accuracies can be obtained on multiple datasets, which indicates that WAGE somehow acts as a type of regularization. Empirically, we demonstrate the potential to deploy training in hardware systems such as integer-based deep learning accelerators and neuromorphic chips with comparable accuracy and higher energy efficiency, which is crucial to future AI applications in variable scenarios with transfer and continual learning demands.
Attached files: paper.pdf
Modern CNN-based object detectors rely on bounding box regression and non-maximum suppression (NMS) to localize objects. While the probabilities for class labels naturally reflect classification confidence, localization confidence is absent. This makes properly localized bounding boxes degenerate during iterative regression or even suppressed during NMS. In this paper, we propose IoU-Net, which learns to predict the IoU between each detected bounding box and the matched ground truth. The network acquires this confidence of localization, which improves the NMS procedure by preserving accurately localized bounding boxes. Furthermore, an optimization-based bounding box refinement method is proposed, where the predicted IoU is formulated as the objective. Extensive experiments on the MS-COCO dataset show the effectiveness of IoU-Net, as well as its compatibility with and adaptivity to several state-of-the-art object detectors.
Attached files: Acquisition of Localization Confidence for Accurate Object Detection
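The quantity IoU-Net learns to predict is the ordinary intersection-over-union between two boxes, which can be computed directly (boxes as (x1, y1, x2, y2) corner tuples):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    iw, ih = max(0.0, ix2 - ix1), max(0.0, iy2 - iy1)  # clamp empty overlap
    inter = iw * ih
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```

In the IoU-guided NMS of the paper, a predicted version of this value, rather than the classification score, decides which of two overlapping boxes survives.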
We propose AffordanceNet, a new deep learning approach to simultaneously detect multiple objects and their affordances from RGB images. Our AffordanceNet has two branches: an object detection branch to localize and classify the object, and an affordance detection branch to assign each
pixel in the object to its most probable affordance label. The proposed framework employs three key components for effectively handling the multiclass problem in the affordance mask: a sequence of deconvolutional layers, a robust resizing strategy, and a multi-task loss function. The experimental results on the public datasets show that our AffordanceNet outperforms recent state-of-the-art methods by a fair margin, while its end-to-end architecture allows inference at a speed of 150ms per image. This makes our AffordanceNet well suited to real-time robotic applications. Furthermore, we demonstrate the effectiveness of AffordanceNet in different
testing environments and in real robotic applications. The source code is available at https://github.com/nqanh/affordance-net.
A common approach for moving object segmentation in a scene is to perform background subtraction. Several methods have been proposed in this domain. However, they lack the ability to handle various difficult scenarios such as illumination changes, background or camera motion, camouflage effects, shadows, etc. To address these issues, we propose a robust and flexible encoder-decoder type neural network based approach. We adapt a pretrained convolutional network, i.e. VGG-16 Net, under a triplet framework in the encoder part to embed an image at multiple scales into the feature space and use a transposed convolutional network in the decoder part to learn a mapping from feature space to image space. We train this network end-to-end using only a few training samples. Our network takes an RGB image at three different scales and produces a foreground segmentation probability mask for the corresponding image. In order to evaluate our model, we entered the Change Detection 2014 Challenge (changedetection.net) and our method outperformed all the existing state-of-the-art methods with an average F-Measure of 0.9770. Our source code will be made publicly available at https://github.com/lim-anggun/FgSegNet.
Attached files: fgSegNet_triplet.pdf
The text data present in overlaid bands convey brief descriptions of news events in broadcast videos. The process of text extraction becomes challenging as overlay text is presented in widely varying formats and often with animation effects. We note that existing edge density based methods are well suited for our application on account of their simplicity and speed of operation. However, these methods are sensitive to thresholds and have high false positive rates. In this paper, we present a contrast enhancement based preprocessing stage for overlay text detection and a parameter free edge density based scheme for efficient text band detection. The second contribution of this paper is a novel approach for multiple text region tracking with a formal identification of all possible detection failure cases. The tracking stage enables us to establish the temporal presence of text bands and their linking over time. The third contribution is the adoption of Tesseract OCR for the specific task of overlay text recognition using web news articles. The proposed approach is tested and found superior on news videos acquired from three Indian English television news channels along with benchmark
Attached files: Overlay text ectraction.pdf
The topic of multi-person pose estimation has been largely improved recently, especially with the development of convolutional neural networks. However, there still exist a lot of challenging cases, such as occluded keypoints, invisible keypoints and complex backgrounds, which cannot be well addressed. In this paper, we present a novel network structure called Cascaded Pyramid Network (CPN) which
aims to relieve the problem of these "hard" keypoints. More specifically, our algorithm includes two stages: GlobalNet and RefineNet. GlobalNet is a feature pyramid network which can successfully localize the "simple" keypoints like eyes and hands but may fail to precisely recognize the occluded or invisible keypoints. Our RefineNet explicitly handles the "hard" keypoints by integrating all levels of feature representations from the GlobalNet together with an online hard keypoint mining loss. In general, to address the multi-person pose estimation problem, a top-down pipeline is adopted to first generate a set of human bounding boxes based on a detector, followed by our CPN for keypoint localization in each human bounding box. Based on the proposed algorithm, we achieve state-of-the-art results on the COCO keypoint benchmark, with average precision at 73.0 on the COCO test-dev dataset and 72.1 on the COCO test-challenge dataset, which is a 19%
relative improvement compared with 60.5 from the COCO 2016 keypoint challenge. Code and the detection results are publicly available for further research.
Attached files: Cascaded Pyramid Network for Multi-Person Pose Estimation.pdf
Feature pyramids are a basic component in recognition systems for detecting objects at different scales. But recent deep learning object detectors have avoided pyramid representations, in part because they are compute and memory intensive. In this paper, we exploit the inherent multi-scale,
pyramidal hierarchy of deep convolutional networks to construct feature pyramids with marginal extra cost. A top-down architecture with lateral connections is developed for building high-level semantic feature maps at all scales. This architecture, called a Feature Pyramid Network (FPN),
shows significant improvement as a generic feature extractor in several applications. Using FPN in a basic Faster R-CNN system, our method achieves state-of-the-art single model results on the COCO detection benchmark without bells and whistles, surpassing all existing single-model entries including those from the COCO 2016 challenge winners. In addition, our method can run at 5 FPS on a GPU and thus is a practical and accurate solution to multi-scale object detection. Code will be made publicly available.
Attached files: Lin_Feature_Pyramid_Networks_CVPR_2017_paper.pdf
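The top-down pathway with lateral connections can be sketched on bare 2-D arrays: each coarser map is upsampled 2x and added to the next finer map. The 1x1 lateral and 3x3 smoothing convolutions of the real FPN are omitted, so this is a structural sketch only:

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (H, W) feature map."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def fpn_topdown(c_maps):
    """Top-down pathway with lateral additions, coarsest map last in c_maps.
    Returns pyramid maps ordered finest-to-coarsest (like P2..P5)."""
    p = [c_maps[-1]]                      # start from the coarsest level
    for c in reversed(c_maps[:-1]):
        p.append(upsample2x(p[-1]) + c)   # upsample, then add the lateral map
    return list(reversed(p))

# Three toy levels: 4x4, 2x2 and 1x1 maps of ones.
p2, p3, p4 = fpn_topdown([np.ones((4, 4)), np.ones((2, 2)), np.ones((1, 1))])
```

Each finer output accumulates the semantics propagated down from coarser levels, which is the point of the architecture.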
This study proposes an automatic reading approach for a pointer gauge based on computer vision. Moreover, the study aims to highlight the defects of the current automatic recognition method of the pointer gauge and introduces a method that uses a coarse-to-fine scheme and has superior performance in the accuracy and stability of its reading identification. First, it uses the region growing method to locate the dial region and its center. Second, it uses an improved central projection method to determine the circular scale region under the polar coordinate system and detect the scale marks. Then, the border detection is implemented in the dial image, and the Hough transform method is used to obtain the pointer direction by means of pointer contour fitting. Finally, the reading of the gauge is obtained by comparing the location of the pointer with the scale marks. The experimental results demonstrate the effectiveness of the proposed approach. This approach is applicable for reading gauges whose scale marks are either evenly or unevenly distributed.
Attached files: Machine Vision Based Automatic Detection Method of Indicating Values of a Pointer Gauge.pdf
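The final step, comparing the pointer location with the detected scale marks, amounts to piecewise-linear interpolation between neighboring marks, which also covers unevenly distributed scales. A sketch under the assumption that the pointer and mark positions have already been converted to angles:

```python
def gauge_reading(pointer_angle, scale_angles, scale_values):
    """Interpolate a gauge reading from the pointer angle (degrees) and the
    detected scale marks (angles sorted ascending, with their values).
    Each interval is interpolated separately, so uneven scales work too."""
    marks = list(zip(scale_angles, scale_values))
    for (a0, v0), (a1, v1) in zip(marks, marks[1:]):
        if a0 <= pointer_angle <= a1:
            t = (pointer_angle - a0) / (a1 - a0)  # position within the interval
            return v0 + t * (v1 - v0)
    raise ValueError("pointer outside the scale range")
```

For an evenly divided scale this reduces to a single linear map; for uneven marks each pair of neighbors defines its own local map.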
We introduce Spatio-Temporal Vector of Locally Max Pooled Features (ST-VLMPF), a super vector-based encoding method specifically designed for local deep features encoding.
The proposed method addresses an important problem of video understanding: how to build a video representation that incorporates the CNN features over the entire video. Feature assignment is carried out at two levels, by using the similarity and spatio-temporal information. For each assignment we build a specific encoding, focused on the nature of
deep features, with the goal to capture the highest feature responses from the highest neuron activation of the network. Our ST-VLMPF clearly provides a more reliable video representation than some of the most widely used and powerful encoding approaches (Improved Fisher Vectors and Vector of Locally Aggregated Descriptors), while maintaining a low computational complexity. We conduct experiments on three action recognition datasets: HMDB51, UCF50 and UCF101. Our pipeline obtains state-of-the-art results.
Accurately tracking multiple vehicles plays an important role in intelligent transportation, especially for intelligent vehicles. Due to complicated traffic environments it is difficult to track multiple vehicles accurately and robustly, especially when there are occlusions among vehicles. To alleviate these problems, a new approach is proposed to track multiple vehicles with the combination of robust detection and two classifiers. An improved ViBe algorithm is proposed for robust and accurate detection of multiple vehicles. It uses the gray-scale spatial information to build a dictionary of pixel life lengths so that ghost shadows and objects' residual shadows are quickly blended into the samples of the background. The improved algorithm uses a good post-processing method to restrain dynamic noise. In this paper, we also design a method using two classifiers to further address the problem of failing to track vehicles under occlusions and interference. It classifies tracking rectangles with confidence values between two thresholds by combining local binary patterns with a support vector machine (SVM) classifier and then using a convolutional neural network (CNN) classifier a second time to remove the interference areas between vehicles and other moving objects. The two-classifier method has both the time efficiency advantage of the SVM and the high accuracy advantage of the CNN. Compared with several existing methods, the qualitative and quantitative analysis of our experimental results showed that the proposed method not only effectively removed the ghost shadows and improved the detection accuracy and real-time performance, but was also robust in dealing with the occlusion of multiple vehicles in various traffic scenes.
Attached files: A New Approach to Track Multiple Vehicles With.pdf 20180526-report-Yang Yu.pptx
This paper presents a real-time face detector, named Single Shot Scale-invariant Face Detector (S3FD), which performs superiorly on various scales of faces with a single deep neural network, especially for small faces. Specifically, we try to solve the common problem that anchor-based detectors deteriorate dramatically as the objects become smaller. We make contributions in the following three aspects: 1) proposing a scale-equitable face detection framework to handle different scales of faces well. We tile anchors on a wide range of layers to ensure that all scales of faces have enough features for detection. Besides, we design anchor scales based on the effective receptive field and a proposed equal proportion interval principle; 2) improving the recall rate of small faces by a scale compensation anchor matching strategy; 3) reducing the false positive rate of small faces via a max-out background label. As a consequence, our method achieves state-of-the-art detection performance on all the common face detection benchmarks, including the AFW, PASCAL face, FDDB and WIDER FACE datasets, and can run at 36 FPS on an Nvidia Titan X (Pascal) for VGA-resolution images.
Attached files: 1708.05237.pdf
We present a conceptually simple, flexible, and general framework for object instance segmentation. Our approach efficiently detects objects in an image while simultaneously
generating a high-quality segmentation mask for each instance. The method, called Mask R-CNN, extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small
overhead to Faster R-CNN, running at 5 fps. Moreover, Mask R-CNN is easy to generalize to other tasks, e.g., allowing us to estimate human poses in the same framework. We show top results in all three tracks of the COCO suite of challenges, including instance segmentation, bounding-box
object detection, and person keypoint detection. Without bells and whistles, Mask R-CNN outperforms all existing, single-model entries on every task, including the
COCO 2016 challenge winners. We hope our simple and effective approach will serve as a solid baseline and help ease future research in instance-level recognition. Code has been made available at: https://github.com/facebookresearch/Detectron.
Attached files: MaskRCNN.pdf
We present an improved three-step pipeline for the stereo matching problem and introduce multiple novelties at each stage. We propose a new highway network architecture for computing the matching cost at each possible disparity, based on multilevel weighted residual shortcuts, trained with a hybrid loss that supports multilevel comparison of image patches. A novel post-processing step is then introduced, which employs a second deep convolutional neural network for pooling global information from multiple disparities. This network outputs both the image disparity map, which replaces the conventional "winner takes all" strategy, and a confidence in the prediction. The confidence score is achieved by training the network with a new technique that we call the reflective loss. Lastly, the learned confidence is employed in order to better detect outliers in the refinement step. The proposed pipeline achieves state-of-the-art accuracy on the largest and most competitive stereo benchmarks, and the learned confidence is shown to outperform all existing alternatives.
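The core of the reflective-loss idea can be sketched as a self-generated training label: the confidence head is supervised by whether the network's own disparity prediction turned out to be correct. The function names and the 3-pixel threshold below are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def reflective_target(pred_disp, gt_disp, thresh=3.0):
    """Hypothetical sketch of a reflective-style confidence label: 1 where
    the predicted disparity lands within `thresh` pixels of ground truth,
    0 elsewhere. The confidence head would then be trained toward this
    label, so its output reflects the network's own success."""
    return (np.abs(pred_disp - gt_disp) < thresh).astype(np.float32)
```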
We introduce the dense captioning task, which requires a computer vision system to both localize and describe salient regions in images in natural language. The dense captioning task generalizes object detection when the descriptions consist of a single word, and Image Captioning when one predicted region covers the full image. To address the localization and description task jointly, we propose a Fully Convolutional Localization Network (FCLN) architecture that processes an image with a single, efficient forward pass, requires no external region proposals, and can be trained end-to-end with a single round of optimization. The architecture is composed of a Convolutional Network, a novel dense localization layer, and a Recurrent Neural Network language model that generates the label sequences. We evaluate our network on the Visual Genome dataset, which comprises 94,000 images and 4,100,000 region-grounded captions. We observe both speed and accuracy improvements over baselines based on current state-of-the-art approaches in both generation and retrieval settings.
Attached files: DenseCap-Fully Convolutional Localization Networks for Dense Captioning.pdf
Articulated human pose estimation is a fundamental yet challenging task in computer vision. The difficulty is particularly pronounced in scale variations of human body parts when the camera view changes or severe foreshortening happens. Although pyramid methods are widely used to handle scale changes at inference time, learning feature pyramids in deep convolutional neural networks (DCNNs) is still not well explored. In this work, we design a Pyramid Residual Module (PRM) to enhance the scale invariance of DCNNs. Given input features, the PRM learns convolutional filters on various scales of the input features, which are obtained with different subsampling ratios in a multi-branch network. Moreover, we observe that it is inappropriate to adopt existing methods to initialize the weights of multi-branch networks, which have recently achieved superior performance over plain networks in many tasks. Therefore, we provide a theoretical derivation to extend the current weight initialization scheme to multi-branch network structures. We investigate our method on two standard benchmarks for human pose estimation. Our approach obtains state-of-the-art results on both benchmarks. Code is available at https://github.com/bearpaw/PyraNet
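The multi-branch, multi-ratio idea can be sketched on a 1-D signal: the same input is processed at several subsampling ratios, and the branch outputs are upsampled back and merged, so filters effectively see multiple scales. Real PRMs use learned convolutions per branch; the identity "filter" and averaging merge below are placeholders to keep the sketch self-contained.

```python
import numpy as np

def prm_forward(x, ratios=(1, 2, 4)):
    """Toy sketch of the Pyramid Residual Module pattern on a 1-D signal:
    one branch per subsampling ratio, nearest-neighbour upsampling back to
    the input length, then an averaged merge. The per-branch learned
    convolution of a real PRM is replaced by the identity here."""
    out = np.zeros_like(x, dtype=float)
    for r in ratios:
        sub = x[::r]                        # subsample by ratio r
        up = np.repeat(sub, r)[:len(x)]     # upsample back to input length
        out += up / len(ratios)             # average the branch outputs
    return out
```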
Attached files: Learning Feature Pyramids for Human Pose Estimation.pdf
We introduce the notion of semantic background subtraction, a novel framework for motion detection in video sequences. The key innovation is to leverage object-level semantics to address the variety of challenging scenarios for background subtraction. Our framework combines the information of a semantic segmentation algorithm, expressed by a probability for each pixel, with the output of any background subtraction algorithm to reduce false positive detections produced by illumination changes, dynamic backgrounds, strong shadows, and ghosts. In addition, it maintains a fully semantic background model to improve the detection of camouflaged foreground objects. Experiments conducted on the CDNet dataset show that we significantly improve almost all background subtraction algorithms of the CDNet leaderboard, and reduce the mean overall error rate of all 34 algorithms (resp. of the best 5 algorithms) by roughly 50% (resp. 20%). Note that a C++ implementation of the framework is available at http://www.telecom.ulg.ac.be/semantic.
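The combination described above reduces to a simple per-pixel decision rule. The sketch below is an illustration of that rule under assumed names and threshold values (`tau_bg`, `tau_fg` are not from the abstract): low semantic probability vetoes the detector's foreground, a large rise over the semantic background model forces foreground, and otherwise the original decision stands.

```python
import numpy as np

def semantic_bgs(bgs_mask, p_semantic, p_semantic_bg, tau_bg=0.1, tau_fg=0.3):
    """Illustrative semantic/BGS combination rule, per pixel:
    - p_semantic <= tau_bg: override to background (suppresses false
      positives from shadows, ghosts, illumination changes);
    - p_semantic rises by >= tau_fg over the semantic background model:
      force foreground (recovers camouflaged objects);
    - otherwise: keep the original background-subtraction decision."""
    return np.where(p_semantic <= tau_bg, 0,
                    np.where(p_semantic - p_semantic_bg >= tau_fg, 1,
                             bgs_mask))
```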
Attached files: Braham2017Semantic.pdf
Accurate and fast detection of moving targets from a moving camera is an important yet challenging problem, especially when computational resources are limited. In this paper, we propose an effective, efficient, and robust method to accurately detect and segment multiple independently moving foreground targets from a video sequence taken by a monocular moving camera [e.g., onboard an unmanned aerial vehicle (UAV)]. Our proposed method advances the existing methods in a number of ways, where: 1) camera motion is estimated through tracking background keypoints using pyramidal Lucas-Kanade at every detection interval, for efficiency; 2) foreground segmentation is applied by integrating a local motion history function with spatio-temporal differencing over a sliding window for detecting multiple moving targets, while the perspective homography is used at image registration for effectiveness; and 3) the detection interval is adjusted dynamically based on a rule-of-thumb technique and considering camera setup parameters for robustness. The proposed method has been tested on a variety of scenarios using a UAV camera, as well as publicly available datasets. Based on the reported results and through comparison with the existing methods, the accuracy of the proposed method in detecting multiple moving targets, as well as its capability for real-time implementation, has been successfully demonstrated. Our method is also robustly applicable to ground-level cameras for ITS applications, as confirmed by the experimental results. More specifically, the proposed method shows promising performance compared with the literature in terms of quantitative metrics, while the run-time measures are significantly improved for real-time implementation.
Attached files: Effective and Efficient Detection of Moving.pdf 20180310-report-Yang Yu.pptx
We propose Efficient Neural Architecture Search (ENAS), a fast and inexpensive approach for automatic model design. In ENAS, a controller learns to discover neural network architectures by searching for an optimal subgraph within a large computational graph. The controller is trained with policy gradient to select a subgraph that maximizes the expected reward on the validation set. Meanwhile, the model corresponding to the selected subgraph is trained to minimize a canonical cross-entropy loss. Thanks to parameter sharing between child models, ENAS is fast: it delivers strong empirical performance using far fewer GPU-hours than all existing automatic model design approaches, and is notably 1000x less expensive than standard Neural Architecture Search. On the Penn Treebank dataset, ENAS discovers a novel architecture that achieves a test perplexity of 55.8, establishing a new state-of-the-art among all methods without post-training processing. On the CIFAR-10 dataset, ENAS designs novel architectures that achieve a test error of 2.89%, which is on par with NASNet (Zoph et al., 2018), whose test error is 2.65%.
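The parameter-sharing mechanism behind ENAS's speedup can be sketched as a single weight bank that every sampled child architecture indexes into. Node count, operation set, and shapes below are illustrative assumptions; a real implementation would also train these shared weights and the controller.

```python
import numpy as np

rng = np.random.default_rng(0)

# One shared weight matrix per (node, candidate-op) pair. Every sampled
# child architecture looks weights up here instead of creating its own,
# which is what makes evaluating many candidates cheap.
OPS = ["tanh", "relu", "identity"]
shared_w = {(n, op): rng.standard_normal((4, 4)) * 0.1
            for n in range(3) for op in OPS}

def run_child(arch, x):
    """Evaluate one sampled subgraph: `arch` names an op per node, and the
    matching shared weights are reused (never copied or re-initialized)."""
    acts = {"tanh": np.tanh,
            "relu": lambda v: np.maximum(v, 0.0),
            "identity": lambda v: v}
    for n, op in enumerate(arch):
        x = acts[op](x @ shared_w[(n, op)])
    return x

# Two different children drawing from the same weight bank.
child_a = run_child(["tanh", "relu", "identity"], np.ones(4))
child_b = run_child(["relu", "relu", "tanh"], np.ones(4))
```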
Attached files: enas.pdf
This paper presents an automatic segmentation system for characters in text color images cropped from natural images or videos, based on a new neural architecture ensuring fast processing and robustness against noise, variations in illumination, complex backgrounds, and low resolution. An off-line training phase on a set of synthetic text color images, where the exact character positions are known, allows adjusting the neural parameters and thus building an optimal nonlinear filter which extracts the best features in order to robustly detect the border positions between characters. The proposed method is tested on a set of synthetic text images to precisely evaluate its performance according to noise, and on a set of complex text images collected from video frames and web pages to evaluate its performance on real images. The results are encouraging, with a good segmentation rate of 89.12% and a recognition rate of 81.94% on a set of difficult text images collected from video frames and from web pages.
Attached files: An_Automatic_Method_for_Video_Character_Segmentati.pdf
The ability to identify and temporally segment fine-grained human actions throughout a video is crucial for robotics, surveillance, education, and beyond. Typical approaches decouple this problem by first extracting local spatiotemporal features from video frames and then feeding them into a temporal classifier that captures high-level temporal patterns. We describe a class of temporal models, which we call Temporal Convolutional Networks (TCNs), that use a hierarchy of temporal convolutions to perform fine-grained action segmentation or detection. Our Encoder-Decoder TCN uses pooling and upsampling to efficiently capture long-range temporal patterns, whereas our Dilated TCN uses dilated convolutions. We show that TCNs are capable of capturing action compositions, segment durations, and long-range dependencies, and are over a magnitude faster to train than competing LSTM-based Recurrent Neural Networks. We apply these models to three challenging fine-grained datasets and show large improvements over the state of the art.
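The building block of the Dilated TCN can be sketched as a causal dilated 1-D convolution: the output at time t mixes inputs at t, t-d, t-2d, and so on, so stacking layers with growing dilations covers long-range context cheaply. Shapes and zero-padding below are illustrative; real models use learned multi-channel filters.

```python
import numpy as np

def dilated_conv1d(x, w, dilation):
    """Minimal causal dilated temporal convolution over a 1-D signal:
    y[t] = sum_j w[j] * x[t - j*dilation], with zero left-padding so no
    future frames are used. Stacking these with dilations 1, 2, 4, ...
    grows the temporal receptive field exponentially."""
    k = len(w)
    pad = dilation * (k - 1)
    xp = np.concatenate([np.zeros(pad), x])   # left-pad: keep it causal
    return np.array([sum(w[j] * xp[t + pad - j * dilation] for j in range(k))
                     for t in range(len(x))])
```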
Attached files: Feb saturday seminar final.pptx cand1_Lea_Temporal_Convolutional_Networks_CVPR_2017_paper.pdf
We present an accurate stereo matching method using local expansion moves based on graph cuts. This new move-making scheme is used to efficiently infer per-pixel 3D plane labels on a pairwise Markov random field (MRF) that effectively combines recently proposed slanted patch matching and curvature regularization terms. The local expansion moves are presented as many α-expansions defined for small grid regions. The local expansion moves extend traditional expansion moves in two ways: localization and spatial propagation. By localization, we use different candidate α-labels according to the locations of local α-expansions. By spatial propagation, we design our local α-expansions to propagate currently assigned labels to nearby regions. With this localization and spatial propagation, our method can efficiently infer MRF models with a continuous label space using randomized search. Our method has several advantages over previous approaches that are based on fusion moves or belief propagation: it produces submodular moves deriving a subproblem optimality; it helps find good, smooth, piecewise linear disparity maps; it is suitable for parallelization; and it can use cost-volume filtering techniques for accelerating the matching cost computations. Even using a simple pairwise MRF, our method is shown to achieve the best performance on the Middlebury stereo benchmarks V2 and V3.
Attached files: TPAMI-Contiuous 3D Label Stereo Matching using Local Expansion.pdf
Attached files: continuous 3D Label Stereo Matching using Local Expansion moves.pdf
Human actions captured in video sequences are three-dimensional signals characterizing visual appearance and motion dynamics. To learn action patterns, existing methods adopt Convolutional and/or Recurrent Neural Networks (CNNs and RNNs). CNN-based methods are effective in learning spatial appearances, but are limited in modeling long-term motion dynamics. RNNs, especially Long Short-Term Memory (LSTM), are able to learn temporal motion dynamics. However, naively applying RNNs to video sequences in a convolutional manner implicitly assumes that motions in videos are stationary across different spatial locations. This assumption is valid for short-term motions but invalid when the duration of the motion is long. In this work, we propose Lattice-LSTM (L2STM), which extends LSTM by learning independent hidden state transitions of memory cells for individual spatial locations. This method effectively enhances the ability to model dynamics across time and addresses the non-stationary issue of long-term motion dynamics without significantly increasing the model complexity. Additionally, we introduce a novel multi-modal training procedure for training our network. Unlike traditional two-stream architectures which use RGB and optical flow information as input, our two-stream model leverages both modalities to jointly train both input gates and both forget gates in the network, rather than treating the two streams as separate entities with no information about the other. We apply this end-to-end system to benchmark datasets (UCF-101 and HMDB-51) of human action recognition. Experiments show that on both datasets, our proposed method outperforms all existing ones that are based on LSTM and/or CNNs of similar model complexity.
Attached files: Lattice Long Short-Term Memory for Human Action Recognition.pdf
Stereo matching is a challenging problem with respect to weak texture, discontinuities, illumination differences and occlusions. Therefore, a deep learning framework is presented in this paper, which focuses on the first and last stages of typical stereo methods: the matching cost computation and the disparity refinement. For matching cost computation, two patch-based network architectures are exploited to allow the trade-off between speed and accuracy, both of which leverage a multi-size and multi-layer pooling unit with no strides to learn cross-scale feature representations. For disparity refinement, unlike traditional handcrafted refinement algorithms, we incorporate the initial optimal and sub-optimal disparity maps before outlier detection. Furthermore, diverse base learners are encouraged to focus on specific replacement tasks, corresponding to the smooth regions and details. Experiments on different datasets demonstrate the effectiveness of our approach, which is able to obtain sub-pixel accuracy and restore occlusions to a great extent. Specifically, our accurate framework attains near-peak accuracy in both non-occluded and occluded regions, and our fast framework achieves competitive performance against the fast algorithms on the Middlebury benchmark.
A capsule is a group of neurons whose outputs represent different properties of the same entity. We describe a version of capsules in which each capsule has a logistic unit to represent the presence of an entity and a 4x4 pose matrix which could learn to represent the relationship between that entity and the viewer. A capsule in one layer votes for the pose matrix of many different capsules in the layer above by multiplying its own pose matrix by viewpoint-invariant transformation matrices that could learn to represent part-whole relationships. Each of these votes is weighted by an assignment coefficient. These coefficients are iteratively updated using the EM algorithm such that the output of each capsule is routed to a capsule in the layer above that receives a cluster of similar votes. The whole system is trained discriminatively by unrolling 3 iterations of EM between each pair of adjacent layers. On the smallNORB benchmark, capsules reduce the number of test errors by 45% compared to the state-of-the-art. Capsules also show far more resistance to white-box adversarial attacks than our baseline convolutional neural network.
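The voting step above can be sketched directly: each lower capsule's vote is its 4x4 pose matrix multiplied by a learned transformation matrix, and a routed higher capsule aggregates votes weighted by assignment coefficients. The sketch below shows the votes plus one simplified M-step with uniform assignments; full EM routing also updates variances, activations, and re-estimates the assignments in the E-step, and all shapes here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: 5 lower-layer capsules voting for 1 higher capsule.
poses = rng.standard_normal((5, 4, 4))        # 4x4 pose matrix per capsule
transforms = rng.standard_normal((5, 4, 4))   # learned, viewpoint-invariant

# Each lower capsule votes by multiplying its pose by its transform.
votes = poses @ transforms                     # (5, 4, 4)

# One simplified M-step of EM routing: the higher capsule's pose is the
# assignment-weighted mean of the votes (uniform assignments to start).
assign = np.full(5, 1 / 5)
higher_pose = np.einsum("i,ijk->jk", assign, votes)
```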
Attached files: MATRIX CAPSULES WITH EM ROUTING.pdf