Seminars in 2017
Object detection is a crucial task for autonomous driving. In addition to requiring high accuracy to ensure safety, object detection for autonomous driving also requires real-time inference speed to guarantee prompt vehicle control, as well as small model size and energy efficiency to enable embedded system deployment. In this work, we propose SqueezeDet, a fully convolutional neural network for object detection that aims to simultaneously satisfy all of the above constraints. In our network we use convolutional layers not only to extract feature maps, but also as the output layer to compute bounding boxes and class probabilities. The detection pipeline of our model only contains a single forward pass of a neural network, thus it is extremely fast. Our model is fully-convolutional, which leads to small model size and better energy efficiency. Finally, our experiments show that our model is very accurate, achieving state-of-the-art accuracy on the KITTI benchmark.
Attached files: SqueezeDet: Unified Small Low Power Fully Convolutional Neural Networks for Real-Time Object Detection for Autonomous Driving.pdf
This work introduces a novel convolutional network architecture for the task of human pose estimation. Features are processed across all scales and consolidated to best capture the various spatial relationships associated with the body. We show how repeated bottom-up, top-down processing used in conjunction with intermediate supervision is critical to improving the performance of the network. We refer to the architecture as a "stacked hourglass" network based on the successive steps of pooling and upsampling that are done to produce a final set of predictions. State-of-the-art results are achieved on the FLIC and MPII benchmarks, outcompeting all recent methods.
Attached files: 20171216.pdf, Stacked Hourglass Networks for Human Pose Estimation.pdf
Scene parsing is challenging because of the unrestricted open vocabulary and diverse scenes. In this paper, we exploit the capability of global context information by different-region-based context aggregation through our pyramid pooling module, together with the proposed pyramid scene parsing network (PSPNet). Our global prior representation is effective in producing good-quality results on the scene parsing task, while PSPNet provides a superior framework for pixel-level prediction. The proposed approach achieves state-of-the-art performance on various datasets. It ranked first in the ImageNet scene parsing challenge 2016, the PASCAL VOC 2012 benchmark, and the Cityscapes benchmark. A single PSPNet yields the new record of 85.4% mIoU accuracy on PASCAL VOC 2012 and 80.2% accuracy on Cityscapes.
Attached files: PSPNet_cvpr17.pdf
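The pyramid pooling module at the core of PSPNet can be illustrated with a small sketch (an illustrative reimplementation, not the authors' code): the feature map is average-pooled into several bin grids, and in the full network each pooled map is upsampled back to the input resolution and concatenated with the original features before the final prediction layer. The helper below shows the adaptive pooling step for a single channel.

```python
def adaptive_avg_pool(fm, bins):
    """Average-pool an H x W map (list of lists) into a bins x bins grid."""
    h, w = len(fm), len(fm[0])
    out = []
    for bi in range(bins):
        r0, r1 = bi * h // bins, (bi + 1) * h // bins
        row = []
        for bj in range(bins):
            c0, c1 = bj * w // bins, (bj + 1) * w // bins
            vals = [fm[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append(sum(vals) / len(vals))
        out.append(row)
    return out

# PSPNet pools at several pyramid levels (1x1, 2x2, 3x3 and 6x6 bins),
# so the 1x1 level captures the global prior and finer levels capture
# sub-region context.
```

For a 4x4 map, `adaptive_avg_pool(fm, 2)` returns the four quadrant averages, and `adaptive_avg_pool(fm, 1)` returns the global mean.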
This paper describes a lane-level localization algorithm based on a map-matching method for application to automated driving in urban environments. Lane-level localization implies localizing the vehicle with centimeter-level accuracy. In order to achieve a satisfactory level of position accuracy with a low-cost GPS, a sensor fusion approach is essential for lane-level localization. The proposed sensor fusion approach for the lane-level localization of a vehicle uses an around view monitoring (AVM) module and vehicle sensors. The proposed algorithm consists of three parts: lane detection, position correction, and a localization filter. In order to detect lanes, a commercialized AVM module is used. Since this module can acquire an image of the area around the vehicle, it is possible to obtain accurate position information for the lanes. With this information, the vehicle position can be corrected by the iterative closest point (ICP) algorithm. This algorithm estimates the rigid transformation between the lane map and the lanes obtained by AVM in real time. The vehicle position corrected by this transformation is fused with the information from vehicle sensors based on an extended Kalman filter. For higher accuracy, the covariance of the ICP is estimated using Haralick's method.
Attached files: 20171202-report-Yang Yu.pptx, Lane-Level Localization Using an AVM Camera for an Automated Driving Vehicle in Urban Environments.pdf
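The ICP correction step described above repeatedly solves for the rigid transformation that best aligns matched lane points. A minimal 2D sketch of that least-squares alignment step is given below; it assumes correspondences are already established (the nearest-neighbor matching, iteration, and Haralick covariance estimation of the actual algorithm are omitted).

```python
import math

def rigid_transform_2d(P, Q):
    """One least-squares alignment step of 2D ICP for matched pairs P[i] <-> Q[i].
    Returns (theta, (tx, ty)) such that R(theta) * p + t best matches q."""
    n = len(P)
    px = sum(p[0] for p in P) / n; py = sum(p[1] for p in P) / n
    qx = sum(q[0] for q in Q) / n; qy = sum(q[1] for q in Q) / n
    s_cos = s_sin = 0.0
    for (ax, ay), (bx, by) in zip(P, Q):
        # center both point sets, then accumulate the closed-form rotation terms
        ax -= px; ay -= py; bx -= qx; by -= qy
        s_cos += ax * bx + ay * by
        s_sin += ax * by - ay * bx
    theta = math.atan2(s_sin, s_cos)
    c, s = math.cos(theta), math.sin(theta)
    # translation maps the rotated source centroid onto the target centroid
    tx = qx - (c * px - s * py)
    ty = qy - (s * px + c * py)
    return theta, (tx, ty)
```

Applied to a unit square rotated by 90 degrees and shifted by (1, 2), the function recovers exactly that rotation and translation.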
In this paper, we address the problem of person re-identification, which refers to associating the persons captured from different cameras. We propose a simple yet effective human part-aligned representation for handling the body part misalignment problem. Our approach decomposes the human body into regions (parts) which are discriminative for person matching, accordingly computes the representations over the regions, and aggregates the similarities computed between the corresponding regions of a pair of probe and gallery images as the overall matching score. Our formulation, inspired by attention models, is a deep neural network modeling the three steps together, which is learnt through minimizing the triplet loss function without requiring body part labeling information. Unlike most existing deep learning algorithms that learn a global or spatial partition-based local representation, our approach performs human body partition, and thus is more robust to pose changes and various human spatial distributions in the person bounding box.
Our approach shows state-of-the-art results on standard datasets: Market-1501, CUHK03, CUHK01, and VIPeR.
Attached files: Zhao_Deeply-Learned_Part-Aligned_Representations_ICCV_2017_paper.pdf
We consider the automated recognition of human actions in surveillance videos. Most current methods build classifiers based on complex handcrafted features computed from the raw inputs. Convolutional neural networks (CNNs) are a type of deep model that can act directly on the raw inputs. However, such models are currently limited to handling 2D inputs. In this paper, we develop a novel 3D CNN model for action recognition. This model extracts features from both the spatial and the temporal dimensions by performing 3D convolutions, thereby capturing the motion information encoded in multiple adjacent frames. The developed model generates multiple channels of information from the input frames, and the final feature representation combines information from all channels. To further boost the performance, we propose regularizing the outputs with high-level features and combining the predictions of a variety of different models. We apply the developed models to recognize human actions in the real-world environment of airport surveillance videos, and they achieve superior performance in comparison to baseline methods.
Attached files: 3d cnn.pdf
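The core operation in this abstract, a 3D convolution over a stack of adjacent frames, can be sketched naively (a single-channel, valid-padding illustration; the real model uses many learned kernels and channels):

```python
def conv3d(vol, kern):
    """Valid 3D convolution of a T x H x W volume with a t x h x w kernel.
    Both arguments are nested lists; returns the output volume."""
    T, H, W = len(vol), len(vol[0]), len(vol[0][0])
    t, h, w = len(kern), len(kern[0]), len(kern[0][0])
    out = []
    for i in range(T - t + 1):          # slide over time (adjacent frames)
        plane = []
        for j in range(H - h + 1):      # slide over rows
            row = []
            for k in range(W - w + 1):  # slide over columns
                row.append(sum(vol[i + a][j + b][k + c] * kern[a][b][c]
                               for a in range(t)
                               for b in range(h)
                               for c in range(w)))
            plane.append(row)
        out.append(plane)
    return out
```

Because the kernel spans the temporal axis, each output value mixes information from several consecutive frames, which is how motion is captured.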
In this paper, we present a new binarization approach to extract text pixels from complex backgrounds in video frames. Binarization is a crucial step for video text recognition, as it can greatly increase the recognition accuracy of OCR software. The proposed approach consists of four phases. First, the text polarity is determined, i.e., light text on a dark background or dark text on a light background. Then the pixels in the given image are clustered into K clusters using the K-means algorithm in the RGB color space, and the text cluster is selected based on the text polarity. Next, an MRF model is exploited to obtain the binarization result. Finally, the result is further refined by a Log-Gabor filter. Experimental results on a large dataset show that significant gains are obtained in both pixel-level segmentation performance and OCR accuracy.
Attached files: A novel approach for binarization of overlay text.pdf
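The clustering-and-selection step of the pipeline can be sketched as follows. This is an illustrative reimplementation under stated assumptions: a tiny Lloyd-style K-means on RGB triples with a deterministic luminance-spread initialization (the paper does not specify its initialization), and cluster selection by brightness according to the detected polarity.

```python
def luminance(p):
    """Mean of the R, G, B components as a simple brightness proxy."""
    return sum(p) / 3.0

def kmeans(pixels, k, iters=20):
    """Lloyd's K-means on RGB tuples; assumes k >= 2.
    Initial centers are spread over the luminance range (deterministic)."""
    pts = sorted(pixels, key=luminance)
    centers = [pts[i * (len(pts) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in pixels:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        # recompute each center as the mean of its cluster (keep old if empty)
        centers = [tuple(sum(v) / len(c) for v in zip(*c)) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

def text_cluster(centers, light_text):
    """Pick the brightest cluster for light-on-dark text, darkest otherwise."""
    key = lambda i: luminance(centers[i])
    return (max if light_text else min)(range(len(centers)), key=key)
```

With two well-separated groups of dark and bright pixels, K-means recovers the two means and, for light-text polarity, the bright cluster is selected as the text cluster.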
Point clouds are an important type of geometric data structure. Due to their irregular format, most researchers transform such data into regular 3D voxel grids or collections of images. This, however, renders the data unnecessarily voluminous and causes issues. In this paper, we design a novel type of neural network that directly consumes point clouds and respects the permutation invariance of points in the input. Our network, named PointNet, provides a unified architecture for applications ranging from object classification and part segmentation to scene semantic parsing. Though simple, PointNet is highly efficient and effective. Empirically, it shows strong performance on par with or even better than the state of the art. Theoretically, we provide analysis toward understanding what the network has learned and why the network is robust to input perturbation and corruption.
Attached files: PointNet Deep Learning on Point Sets for 3D Classification and Segmentation.pdf
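The permutation invariance PointNet relies on comes from applying a shared per-point function followed by a symmetric aggregation (channel-wise max pooling). A toy sketch (illustrative only; in the real network the per-point function is a learned MLP):

```python
def global_feature(points, per_point_fn):
    """Apply a shared function to every point, then max-pool per channel.
    The max is symmetric, so the result is independent of point order."""
    feats = [per_point_fn(p) for p in points]
    return [max(f[i] for f in feats) for i in range(len(feats[0]))]
```

Shuffling the input points leaves the global feature unchanged, which is exactly the property that lets the network consume unordered point sets.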
Stereo matching is the key problem in many stereo-vision-based 3D applications. One of the factors that makes local stereo matching time-consuming is that every pixel has the same pre-set disparity range, which must be larger than all possible disparities. An improper pre-set disparity range may lead to redundant computation for some pixels and inadequate computation for others. In this paper, we propose an accurate and fast local stereo matching method, which employs fine disparity estimation and adaptive cost aggregation. The main contributions of our work include two parts. First, we use phase-based correlation to estimate an initial disparity range for block center pixels. Second, we estimate a more limited disparity range for every in-block pixel according to its support weight with respect to the block center pixel. Our disparity estimation and block matching techniques not only reduce the disparity search range for every pixel, but also eliminate some pseudo match pairs. Four standard Middlebury stereo image pairs are tested to evaluate the performance of the proposed algorithm. Experimental results show that the proposed algorithm can reduce the matching time by 37.4% on average with relatively higher accuracy.
Attached files: Block Based Dense Stereo Matching Using adaptive cost aggregation and limited disparity estimation.pdf, Original ASW paper.pdf
It is a challenging task to recognize smoke from images due to large variance of smoke color, texture, and shapes. There are smoke detection methods that have been proposed, but most of them are based on hand-crafted features. To improve the performance of smoke detection, we propose a novel deep normalization and convolutional neural network (DNCNN) with 14 layers to implement automatic feature extraction and classification. In DNCNN, traditional convolutional layers are replaced with normalization and convolutional layers to accelerate the training process and boost the performance of smoke detection. To reduce overfitting caused by imbalanced and insufficient training samples, we generate more training samples from original training data sets by using a variety of data enhancement techniques. Experimental results show that our method achieved very low false alarm rates below 0.60% with detection rates above 96.37% on our smoke data sets.
Attached files: paper.pdf
We propose a novel deep learning architecture for regressing disparity from a rectified pair of stereo images. We leverage knowledge of the problem's geometry to form a cost volume using deep feature representations. We learn to incorporate contextual information using 3D convolutions over this volume. Disparity values are regressed from the cost volume using a proposed differentiable soft argmin operation, which allows us to train our method end-to-end to sub-pixel accuracy without any additional post-processing or regularization. We evaluate our method on the Scene Flow and KITTI datasets; on KITTI we set a new state-of-the-art benchmark, while being significantly faster than competing approaches.
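The differentiable soft argmin mentioned above replaces a hard argmin over the cost volume with an expectation under a softmax of negated costs, which is what makes sub-pixel, end-to-end regression possible. A minimal sketch for one pixel's cost slice:

```python
import math

def soft_argmin(costs):
    """costs[d] is the matching cost at integer disparity d.
    Softmax over negated costs turns low cost into high weight; the
    expected disparity under those weights is smooth and differentiable."""
    m = min(costs)  # shift for numerical stability
    w = [math.exp(-(c - m)) for c in costs]
    s = sum(w)
    return sum(d * wd for d, wd in enumerate(w)) / s
```

A symmetric cost curve centered on disparity 2 yields exactly 2.0, while an asymmetric curve produces a sub-pixel estimate pulled toward the lower-cost side.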
Generative models of 3D human motion are often restricted to a small number of activities and can therefore not generalize well to novel movements or applications. In this work we propose a deep learning framework for human motion capture data that learns a generic representation from a large corpus of motion capture data and generalizes well to new, unseen motions. Using an encoding-decoding network that learns to predict future 3D poses from the most recent past, we extract a feature representation of human motion. Most work on deep learning for sequence prediction focuses on video and speech. Since skeletal data has a different structure, we present and evaluate different network architectures that make different assumptions about time dependencies and limb correlations. To quantify the learned features, we use the output of different layers for action classification and visualize the receptive fields of the network units. Our method outperforms the recent state of the art in skeletal motion prediction even though those methods use action-specific training data. Our results show that deep feedforward networks, trained on a generic mocap database, can successfully be used for feature extraction from human motion data, and that this representation can be used as a foundation for classification and prediction.
Attached files: Deep representation learning for human motion prediction and classification.pdf
This paper aims at high-accuracy 3D object detection in autonomous driving scenarios. We propose Multi-View 3D networks (MV3D), a sensory-fusion framework that takes both LIDAR point clouds and RGB images as input and predicts oriented 3D bounding boxes. We encode the sparse 3D point cloud with a compact multi-view representation. The network is composed of two subnetworks: one for 3D object proposal generation and another for multi-view feature fusion. The proposal network generates 3D candidate boxes efficiently from the bird's-eye view representation of the 3D point cloud. We design a deep fusion scheme to combine region-wise features from multiple views and enable interactions between intermediate layers of different paths. Experiments on the challenging KITTI benchmark show that our approach outperforms the state-of-the-art by around 25% and 30% AP on the tasks of 3D localization and 3D detection. In addition, for 2D detection, our approach obtains 14.9% higher AP than the state-of-the-art on the hard data among the LIDAR-based methods.
Attached files: multi-view 3d object detection network for autonomous driving.pdf
Recent work has shown that convolutional networks can be substantially deeper, more accurate, and efficient to train if they contain shorter connections between layers close to the input and those close to the output. In this paper, we embrace this observation and introduce the Dense Convolutional Network (DenseNet), which connects each layer to every other layer in a feed-forward fashion. Whereas traditional convolutional networks with L layers have L connections (one between each layer and its subsequent layer), our network has L(L+1)/2 direct connections. For each layer, the feature maps of all preceding layers are used as inputs, and its own feature maps are used as inputs into all subsequent layers. DenseNets have several compelling advantages: they alleviate the vanishing-gradient problem, strengthen feature propagation, encourage feature reuse, and substantially reduce the number of parameters. We evaluate our proposed architecture on four highly competitive object recognition benchmark tasks (CIFAR-10, CIFAR-100, SVHN, and ImageNet). DenseNets obtain significant improvements over the state-of-the-art on most of them, whilst requiring less computation to achieve high performance. Code and pre-trained models are available at https://github.com/liuzhuang13/DenseNet.
Attached files: Huang_Densely_Connected_Convolutional_CVPR_2017_paper.pdf
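The dense connectivity pattern can be sketched in a few lines (a schematic, not a trainable network): each layer consumes the concatenation of the input and every preceding layer's output, which is where the L(L+1)/2 count of direct connections comes from.

```python
def dense_block(x, layers):
    """Schematic DenseNet block: each callable in `layers` receives the
    concatenation of all preceding feature chunks and returns a new chunk."""
    features = [list(x)]
    for layer in layers:
        concat = [v for chunk in features for v in chunk]
        features.append(layer(concat))
    # the block's output is the concatenation of everything produced so far
    return [v for chunk in features for v in chunk]

def num_direct_connections(num_layers):
    """L layers with all-to-subsequent connectivity: L * (L + 1) / 2 links."""
    return num_layers * (num_layers + 1) // 2
```

With a toy "layer" that just reports how many values it sees, the growing input size of each successive layer is visible directly in the output.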
Deep learning has received significant attention recently as a promising solution to many problems in the area of artificial intelligence. Among several deep learning architectures, convolutional neural networks (CNNs) demonstrate superior performance when compared to other machine learning methods in the applications of object detection and recognition. We use a CNN for image enhancement and the detection of driving lanes on motorways. In general, the process of lane detection consists of edge extraction and line detection. A CNN can be used to enhance the input images before lane detection by excluding noise and obstacles that are irrelevant to the edge detection result. However, training conventional CNNs requires considerable computation and a big dataset. Therefore, we suggest a new learning algorithm for CNNs using an extreme learning machine (ELM). The ELM is a fast learning method used to calculate network weights between output and hidden layers in a single iteration and thus can dramatically reduce learning time while producing accurate results with minimal training data. A conventional ELM can be applied to networks with a single hidden layer; as such, we propose a stacked ELM architecture in the CNN framework. Further, we modify the backpropagation algorithm to find the targets of hidden layers and effectively learn network weights while maintaining performance. Experimental results confirm that the proposed method is effective in reducing learning time and improving performance.
Attached files: 1-s2.0-S0893608016301885-main.pdf
In this paper, we present a complete change detection system named multimode background subtraction. The universal nature of the system allows it to robustly handle a multitude of challenges associated with video change detection, such as illumination changes, dynamic background, camera jitter, and moving cameras. The system comprises multiple innovative mechanisms in background modeling, model update, pixel classification, and the use of multiple color spaces. The system first creates multiple background models of the scene, followed by an initial foreground/background probability estimation for each pixel. Next, the image pixels are merged together to form mega-pixels, which are used to spatially denoise the initial probability estimates to generate binary masks for both the RGB and YCbCr color spaces. The masks generated after processing these input images are then combined to separate foreground pixels from the background. Comprehensive evaluation of the proposed approach on publicly available test sequences from the CDnet and ESI data sets shows the superiority of our system over other state-of-the-art algorithms.
Person re-identification (ReID) is an important task in wide-area video surveillance that focuses on identifying people across different cameras. Recently, deep learning networks with a triplet loss have become a common framework for person ReID. However, the triplet loss mainly focuses on obtaining correct orders on the training set. It still suffers from weak generalization from the training set to the testing set, resulting in inferior performance. In this paper, we design a quadruplet loss, which leads to model output with a larger inter-class variation and a smaller intra-class variation compared to the triplet loss. As a result, our model has better generalization ability and can achieve higher performance on the testing set. In particular, a quadruplet deep network using margin-based online hard negative mining is proposed based on the quadruplet loss for person ReID. In extensive experiments, the proposed network outperforms most of the state-of-the-art algorithms on representative datasets, which clearly demonstrates the effectiveness of our approach.
Attached files: Chen_Beyond_Triplet_Loss_CVPR_2017_paper.pdf
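The quadruplet loss can be sketched as the usual triplet term plus an extra term comparing the anchor-positive distance against the distance between two negatives of different identities. This is a minimal illustration on raw embedding vectors; the margin values m1 and m2 here are arbitrary examples, not the paper's settings.

```python
def sq_dist(x, y):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def quadruplet_loss(anchor, positive, neg1, neg2, m1=1.0, m2=0.5):
    # Standard triplet term: anchor-negative must exceed anchor-positive by m1.
    triplet = max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, neg1) + m1)
    # Extra quadruplet term: anchor-positive should also be smaller than the
    # distance between two negatives of different identities (weaker margin m2),
    # which pushes intra-class distances below all inter-class distances.
    extra = max(0.0, sq_dist(anchor, positive) - sq_dist(neg1, neg2) + m2)
    return triplet + extra
```

When both margins are satisfied the loss is zero; when the negatives sit too close to each other or to the anchor, the corresponding hinge term activates.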
Abandoned object detection is one of the most challenging tasks in intelligent video surveillance systems. In this paper we present a new method for detecting abandoned objects (AO) using edges instead of pixel intensities. Our main focus is on reducing false alarms while keeping a high true-positive detection rate. Based on edge information, the proposed method considerably reduces error rates compared to approaches based on pixel intensities. First, static edges are detected by applying a temporal accumulation step to the foreground edge mask produced by an edge-based background subtraction model. Then, edge clustering is applied to the resulting stable-edge mask, using the edges' position and stability over time to delineate the object bounding box. Finally, an efficient classification approach, relying on scores based on edge position, orientation, and staticness, is applied to the AO candidates. The proposed approach has been validated on several challenging benchmarks and compared to other works in the literature. The results show that our method greatly reduces false alarm rates while keeping good detection accuracy for true positives.
This paper presents a new fall detection method for elderly people in a room environment, based on shape analysis of 3D depth images captured by a Kinect sensor. Depth images are preprocessed by a median filter for both background and target. The silhouette of the moving individual in the depth images is obtained by subtracting background frames. The depth images are converted to a disparity map, from which horizontal and vertical projection histogram statistics are computed. The initial floor plane information is obtained from the V-disparity map, and the floor plane equation is estimated by the least-squares method. Shape information of the human subject in the depth images is analyzed with a set of moment functions. Ellipse coefficients are calculated to determine the orientation of the individual. The centroid of the human body is calculated, as is the angle between the human body and the floor plane. When both the distance from the centroid of the human body to the floor plane and the angle between the human body and the floor plane fall below certain thresholds, a fall incident is detected. Experiments with different falling directions were performed. Experimental results show that the proposed method can detect fall incidents effectively.
Attached files: 3d fall dete.pdf
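The final decision rule described above can be sketched directly: a fall is flagged when the body centroid is close to the estimated floor plane and the body's major axis is nearly parallel to it. The threshold values below (0.4 m, 30 degrees) are illustrative placeholders, not the paper's calibrated settings.

```python
import math

def point_plane_distance(p, plane):
    """Distance from 3D point p to plane (a, b, c, d) with ax + by + cz + d = 0."""
    a, b, c, d = plane
    return abs(a * p[0] + b * p[1] + c * p[2] + d) / math.sqrt(a * a + b * b + c * c)

def angle_to_floor_deg(body_axis, plane):
    """Angle between the body's major axis and the floor plane:
    90 degrees minus the angle between the axis and the plane normal."""
    nx, ny, nz = plane[0], plane[1], plane[2]
    dot = body_axis[0] * nx + body_axis[1] * ny + body_axis[2] * nz
    na = math.sqrt(sum(v * v for v in body_axis))
    nn = math.sqrt(nx * nx + ny * ny + nz * nz)
    cos_t = max(-1.0, min(1.0, abs(dot) / (na * nn)))
    return 90.0 - math.degrees(math.acos(cos_t))

def is_fall(centroid, body_axis, plane, dist_thresh=0.4, angle_thresh=30.0):
    """Flag a fall when the centroid is near the floor AND the body lies flat."""
    return (point_plane_distance(centroid, plane) < dist_thresh
            and angle_to_floor_deg(body_axis, plane) < angle_thresh)
```

For a floor plane z = 0, a standing person (centroid 0.9 m up, vertical axis) is not flagged, while a person lying on the floor (centroid 0.2 m up, horizontal axis) is.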
The authors have conducted studies on recognizing Arabic news captions to develop a system for video retrieval, to index and edit Arabic broadcast programs daily received and stored in a large database. This paper describes a dedicated OCR for recognizing low-resolution news captions in video images. A news caption recognition system consisting of text line extraction, word segmentation, and segmentation-recognition of words is developed, and its performance was experimentally evaluated using datasets of frame images extracted from AlJazeera broadcasting programs. Character recognition of moving news captions is difficult due to combing noise caused by the interlacing of scan lines. A technique to detect and eliminate the combing noise so as to correctly recognize moving news captions is proposed. This paper also proposes a technique based on inter-frame text difference to detect transition frames of still news captions. The technique to detect transition frames is necessary for efficient video retrieval and playback. The proposed technique is experimentally tested, shown to be robust to quick motion of the background, and able to detect transition frames correctly with an F-measure higher than 90%. Compared with the ABBYY FineReader 11® commercial OCR, the dedicated OCR improves the recall of Arabic characters in AlJazeera broadcasting news from 70.74% to 95.85% for non-interlaced moving news captions and from 23.82% to 96.29% for interlaced moving news captions.
Attached files: Recognition and transition.pdf
Forest fire is a serious hazard in many places around the world. For such threats, video-based smoke detection is particularly important for early warning, because smoke arises in any forest fire and can be seen from a long distance. This paper presents a novel and robust approach for smoke detection that employs Deep Belief Networks. The proposed method is divided into three phases. In the preprocessing phase, regions of high motion are extracted by a background subtraction method. In the next phase, smoke pixel intensities are extracted for foreground regions from the RGB and the luminance/chroma-blue/chroma-red (YCbCr) color spaces. Subsequently, a second, texture-based feature is computed for detecting smoke regions: Local Extrema Co-occurrence Patterns, an improved version of local binary patterns, are extracted from different foreground regions, capturing not only the texture of smoke but also its intensity and color using the hue-saturation-value (HSV) color space. Finally, a Deep Belief Network is employed for classification. The proposed method proves its accuracy and robustness when tested on a variety of scenarios, whether wildfire-smoke videos, hill-base smoke videos, or indoor and outdoor smoke videos.
Attached files: seminar_2017-07-08.pdf
Markov random fields are widely used to model many computer vision problems that can be cast in an energy minimization framework composed of unary and pairwise potentials. While computationally tractable discrete optimizers such as Graph Cuts and belief propagation (BP) exist for multi-label discrete problems, they still face prohibitively high computational challenges when the labels reside in a huge or very densely sampled space. Integrating key ideas from PatchMatch of effective particle propagation and resampling, PatchMatch belief propagation (PMBP) has been demonstrated to have good performance in addressing continuous labeling problems and runs orders of magnitude faster than Particle BP (PBP). However, the quality of the PMBP solution is tightly coupled with the local window size, over which the raw data cost is aggregated to mitigate ambiguity in the data constraint. This dependency heavily influences the overall complexity, which increases linearly with the window size. This paper proposes a novel algorithm called sped-up PMBP (SPM-BP) to tackle this critical computational bottleneck; it speeds up PMBP by 50-100 times. The crux of SPM-BP is unifying efficient filter-based cost aggregation and message passing with PatchMatch-based particle generation in a highly effective way. Though simple in its formulation, SPM-BP achieves superior performance for sub-pixel accurate stereo and optical flow on benchmark datasets when compared with more complex and task-specific approaches.
Matching cost aggregation is one of the oldest and still most popular methods for stereo correspondence. While effective and efficient, cost aggregation methods typically aggregate the matching cost by summing/averaging over a user-specified local support region. This is obviously only locally optimal, and the computational complexity of the full-kernel implementation usually depends on the region size. In this paper, the cost aggregation problem is reexamined and a non-local solution is proposed. The matching cost values are aggregated adaptively, based on pixel similarity, on a tree structure derived from the stereo image pair, so as to preserve depth edges. The nodes of this tree are all the image pixels, and the edges are all the edges between nearest neighboring pixels. The similarity between any two pixels is decided by their shortest distance on the tree. The proposed method is non-local, as every node receives support from all other nodes on the tree. As can be expected, the proposed non-local solution outperforms all local cost aggregation methods on the standard (Middlebury) benchmark. Besides, it has a great advantage in its extremely low computational complexity: only a total of 2 addition/subtraction operations and 3 multiplication operations are required for each pixel at each disparity level. This is very close to the complexity of unnormalized box filtering using an integral image, which requires 6 addition/subtraction operations. The unnormalized box filter is the fastest local cost aggregation method but blurs across depth edges. The proposed method was tested on a MacBook Air laptop computer with a 1.8 GHz Intel Core i7 CPU and 4 GB of memory. The average runtime on the Middlebury data sets is about 90 milliseconds, only about 1.25× slower than the unnormalized box filter. A non-local disparity refinement method is also proposed based on the non-local cost aggregation.
Attached files: A Non-Local Aggregation Method Stereo Matching.pdf
We present an approach to efficiently detect the 2D pose of multiple people in an image. The approach uses a non-parametric representation, which we refer to as Part Affinity Fields (PAFs), to learn to associate body parts with individuals in the image. The architecture encodes global context, allowing a greedy bottom-up parsing step that maintains high accuracy while achieving realtime performance, irrespective of the number of people in the image. The architecture is designed to jointly learn part locations and their association via two branches of the same sequential prediction process. Our method placed first in the inaugural COCO 2016 keypoints challenge, and significantly exceeds the previous state-of-the-art result on the MPII Multi-Person benchmark, both in performance and efficiency.
Attached files: CVPR_Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields.pdf
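The association step behind PAFs can be sketched as a line integral: a candidate limb connecting two detected part locations is scored by how well the affinity field aligns with the limb direction along the segment. A minimal illustration (the field is passed in as a function here; in the real method it is a predicted 2-channel map sampled with interpolation):

```python
import math

def paf_score(paf, p1, p2, samples=10):
    """Score a candidate limb p1 -> p2 by averaging the dot product between
    the affinity field paf(x, y) -> (vx, vy) and the limb's unit direction
    at evenly spaced sample points (a discretized line integral)."""
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    length = math.hypot(dx, dy)
    ux, uy = dx / length, dy / length
    total = 0.0
    for i in range(samples):
        t = i / (samples - 1)
        vx, vy = paf(p1[0] + t * dx, p1[1] + t * dy)
        total += vx * ux + vy * uy
    return total / samples
```

A field perfectly aligned with the segment scores 1, and a perpendicular field scores 0, which is what lets greedy bottom-up parsing rank competing part pairings.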
With the increasing number of machine learning methods used for segmenting images and analyzing videos, there has been a growing need for large datasets with pixel-accurate ground truth. In this letter, we propose a highly accurate semi-automatic method for segmenting foreground moving objects pictured in surveillance videos. Given a limited number of user interventions, the goal of the method is to provide results sufficiently accurate to be used as ground truth. We show that by manually outlining a small number of moving objects, we can get our model to learn the appearance of the background and the foreground moving objects. Since the background and foreground moving objects are highly redundant from one image to another (the videos come from surveillance cameras), the model does not need a large number of examples to accurately fit the data. Our end-to-end model is based on a multi-resolution convolutional neural network (CNN) with a cascaded architecture. Tests performed on the largest publicly available video dataset with pixel-accurate ground truth (changedetection.net) reveal that, on videos from 11 categories, our approach has an average F-measure of 0.95, which is within the error margin of a human being. With our model, the amount of manual work for ground-truthing a video is reduced by a factor of up to 40. Code is made publicly available at: https://github.com/zhimingluo/MovingObjectSegmentation
Attached files: Interactive deep learning method for segmenting moving object.pdf
In this work, a deep learning approach has been developed to carry out road detection using only LIDAR data. Starting from an unstructured point cloud, top-view images encoding several basic statistics, such as mean elevation and density, are generated. By considering a top-view representation, road detection is reduced to a single-scale problem that can be addressed with a simple and fast fully convolutional neural network (FCN). The FCN is specifically designed for the task of pixel-wise semantic segmentation by combining a large receptive field with high-resolution feature maps. The proposed system achieves excellent performance and is among the top-performing algorithms on the KITTI road benchmark. Its fast inference makes it particularly suitable for real-time applications.
Attached files: Fast LIDAR-based Road Detection Using Fully Convolutional Neural Networks.pdf
A robust vanishing point estimation method is proposed that uses a probabilistic voting procedure based on intersection points of line segments extracted from an input image. The proposed voting function is defined with line segment strength, which represents the relevance of the extracted line segments. Next, candidate line segments for lanes are selected by considering geometric constraints. Finally, the host lane is detected by using the proposed score function, which is designed to remove outliers in the candidate line segments. Also, the detected host lane is refined by using inter-frame similarity, which considers location consistency of the detected host lane and the estimated vanishing point in consecutive frames. Furthermore, in order to reduce computational costs in the vanishing point estimation process, a method using a lookup table is proposed.
Attached files: A Robust Lane Detection Method Based on.pdf
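A minimal sketch of such an intersection-based voting procedure, assuming segments are given as (x1, y1, x2, y2) tuples and using the product of segment lengths as a stand-in for the paper's line-segment strength term:

```python
import numpy as np
from itertools import combinations

def seg_line(seg):
    # Homogeneous line through a segment's endpoints (cross product).
    x1, y1, x2, y2 = seg
    return np.cross([x1, y1, 1.0], [x2, y2, 1.0])

def estimate_vanishing_point(segments, img_w, img_h, bins=32):
    """Accumulate weighted votes at intersections of segment pairs and
    return the centre of the winning accumulator cell."""
    acc = np.zeros((bins, bins))
    for s1, s2 in combinations(segments, 2):
        p = np.cross(seg_line(s1), seg_line(s2))
        if abs(p[2]) < 1e-9:      # parallel lines: no finite intersection
            continue
        x, y = p[0] / p[2], p[1] / p[2]
        if not (0 <= x < img_w and 0 <= y < img_h):
            continue
        # Assumed strength: longer segments vote with more weight.
        w = np.hypot(s1[2] - s1[0], s1[3] - s1[1]) * \
            np.hypot(s2[2] - s2[0], s2[3] - s2[1])
        acc[int(y * bins / img_h), int(x * bins / img_w)] += w
    iy, ix = np.unravel_index(np.argmax(acc), acc.shape)
    return ((ix + 0.5) * img_w / bins, (iy + 0.5) * img_h / bins)
```

The lookup table mentioned in the abstract would replace the repeated intersection computations here with precomputed cell indices.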
When considering person re-identification (re-ID) as a retrieval process, re-ranking is a critical step to improve its accuracy. Yet in the re-ID community, limited effort has been devoted to re-ranking, especially those fully automatic, unsupervised solutions. In this paper, we propose a k-reciprocal encoding method to re-rank the re-ID results. Our hypothesis is that if a gallery image is similar to the probe in the k-reciprocal nearest neighbors, it is more likely to be a true match. Specifically, given an image, a k-reciprocal feature is calculated by encoding its k-reciprocal nearest neighbors into a single vector, which is used for re-ranking under the Jaccard distance. The final distance is computed as the combination of the original distance and the Jaccard distance. Our re-ranking method does not require any human interaction or any labeled data, so it is applicable to large-scale datasets. Experiments on the large-scale Market-1501, CUHK03, MARS, and PRW datasets confirm the effectiveness of our method.
Attached files: re-ranking with reciprocal encoding.pdf
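The k-reciprocal re-ranking step can be sketched as follows. The `k` value and blending weight are assumed defaults, and the paper's refinements (local query expansion, soft weighting) are omitted:

```python
import numpy as np

def k_reciprocal_rerank(dist, k=3, lam=0.3):
    """Re-rank a symmetric n x n distance matrix: encode each sample's
    k-reciprocal neighbour set, compute Jaccard distances between the
    sets, and blend with the original distance."""
    n = dist.shape[0]
    # k nearest neighbours of each sample (rank 0 is the sample itself).
    knn = np.argsort(dist, axis=1)[:, 1:k + 1]
    member = np.zeros((n, n), dtype=bool)
    for i in range(n):
        member[i, knn[i]] = True
    # i and j are k-reciprocal neighbours if each is in the other's kNN.
    recip = member & member.T
    jaccard = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            inter = np.count_nonzero(recip[i] & recip[j])
            union = np.count_nonzero(recip[i] | recip[j])
            jaccard[i, j] = 1.0 - inter / union if union else 1.0
    # Final distance: combination of original and Jaccard distances.
    return lam * dist + (1 - lam) * jaccard
```

Because everything is derived from the distance matrix alone, the procedure needs no labels, matching the abstract's claim of being fully unsupervised.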
There is a huge proliferation of surveillance systems that require strategies for detecting different kinds of stationary foreground objects (e.g., unattended packages or illegally parked vehicles). As these strategies must be able to detect foreground objects remaining static in crowd scenarios, regardless of how long they have not been moving, several algorithms for detecting different kinds of such foreground objects have been developed over the last decades. This paper presents an efficient and high-quality strategy to detect stationary foreground objects, which is able to detect not only completely static objects but also partially static ones. Three parallel nonparametric detectors with different absorption rates are used to detect currently moving foreground objects, short-term stationary foreground objects, and long-term stationary foreground objects. The results of the detectors are fed into a novel finite state machine that classifies the pixels among background, moving foreground objects, stationary foreground objects, occluded stationary foreground objects, and uncovered background. Results show that the proposed detection strategy not only achieves high quality in several challenging situations but also improves upon previous strategies.
Attached files: 20170513-Saturday Seminar-Wahyono.pdf
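A reduced version of such a finite state machine, with illustrative transition rules rather than the paper's exact table (the full machine also models occluded stationary foreground and uncovered background):

```python
# States of the simplified per-pixel machine.
BACKGROUND, MOVING, STATIONARY = "background", "moving", "stationary"

def step(state, fast_fg, mid_fg, slow_fg):
    """One transition driven by three detectors with different absorption
    rates: `fast_fg` flags currently moving objects, `mid_fg` short-term
    and `slow_fg` long-term stationary foreground."""
    if fast_fg:                    # still moving in the fast model
        return MOVING
    if mid_fg and slow_fg:         # absorbed by the fast model only: static
        return STATIONARY
    if state == STATIONARY and slow_fg:
        return STATIONARY          # long-term model still sees the object
    return BACKGROUND

# An object enters, stops, and is finally absorbed by every model.
state = BACKGROUND
for obs in [(1, 1, 1), (0, 1, 1), (0, 0, 1), (0, 0, 0)]:
    state = step(state, *obs)
```

The absorption rates themselves live in the three background models; the state machine only interprets their disagreements.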
Convolutional networks are powerful visual models that yield hierarchies of features. We show that convolutional networks by themselves, trained end-to-end, pixels-to-pixels, improve on the previous best result in semantic segmentation. Our key insight is to build "fully convolutional" networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning. We define and detail the space of fully convolutional networks, explain their application to spatially dense prediction tasks, and draw connections to prior models. We adapt contemporary classification networks (AlexNet, the VGG net, and GoogLeNet) into fully convolutional networks and transfer their learned representations by fine-tuning to the segmentation task. We then define a skip architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations. Our fully convolutional networks achieve improved segmentation of PASCAL VOC (30% relative improvement to 67.2% mean IU on 2012), NYUDv2, SIFT Flow, and PASCAL-Context, while inference takes one tenth of a second for a typical image.
Attached files: Fully Convolutional Networks for Semantic Segmentation.pdf
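The skip fusion described above can be sketched with plain arrays; nearest-neighbour upsampling stands in for the learned bilinear deconvolution the actual networks use:

```python
import numpy as np

def upsample(score, factor):
    """Nearest-neighbour upsampling of a 2D class-score map."""
    return score.repeat(factor, axis=0).repeat(factor, axis=1)

def skip_fuse(coarse, fine, factor):
    """One skip step: upsample deep, coarse class scores and add scores
    predicted from a shallow, higher-resolution layer (the FCN-16s-style
    fusion, simplified to a single channel)."""
    assert coarse.shape[0] * factor == fine.shape[0]
    return upsample(coarse, factor) + fine
```

Stacking such fusions is what turns the coarse "what" signal into the detailed "where" signal the abstract refers to.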
Compared with other video semantic clues, such as gestures, motions, etc., video text generally provides highly useful and fairly precise semantic information, the analysis of which can to a great extent facilitate video and scene understanding. It can be observed that video texts show stronger edges. The Nonsubsampled Contourlet Transform (NSCT) is a fully shift-invariant, multi-scale, and multi-direction expansion, which can preserve the edge/silhouette of the text characters well. Therefore, in this paper, a new approach has been proposed to detect video text based on NSCT. First of all, the 8 directional coefficients of NSCT are combined to build the directional edge map (DEM), which can keep the horizontal, vertical and diagonal edge features and suppress other directional edge features. Then the various directional pixels of the DEM are integrated into a whole binary image (BE). Based on the BE, text frame classification is carried out to determine whether the video frames contain text lines. Finally, text detection based on the BE is performed on consecutive frames to discriminate the video text from non-text regions. Experimental evaluations based on our collected TV videos dataset demonstrate that our method significantly outperforms three other video text detection algorithms in both detection speed and accuracy, especially under challenges such as video text with various sizes, languages, colors, fonts, and short or long text lines.
Attached files: art%3A10.1007%2Fs11042-017-4619-8.pdf
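One way the directional combination could look, assuming the 8 NSCT subband magnitude maps have already been computed by an NSCT implementation; which indices correspond to the horizontal/vertical/diagonal directions, the max-merge, and the threshold are all assumptions of this sketch:

```python
import numpy as np

def directional_edge_map(coeffs, keep=(0, 2, 4, 6)):
    """Merge selected directional NSCT subbands into an edge map and
    binarize it. `coeffs` is a list of 8 equally sized magnitude maps;
    `keep` selects the directions to preserve, suppressing the rest."""
    kept = np.max(np.stack([coeffs[i] for i in keep]), axis=0)
    thr = kept.mean() + kept.std()        # simple global threshold (assumed)
    return (kept > thr).astype(np.uint8)  # the binary image 'BE'
```

Downstream, the paper classifies frames as text/non-text from this binary image before running detection on consecutive frames.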
Estimating the disparity and normal direction of one pixel simultaneously instead of only disparity, also known as 3D label methods, can achieve much higher sub-pixel accuracy in the stereo matching problem. However, it is extremely difficult to assign an appropriate 3D label to each pixel from the continuous label space R^3 while maintaining global consistency, because of the infinite parameter space. In this paper, we propose a novel algorithm called PatchMatch-based Superpixel Cut (PMSC) to assign 3D labels of an image more accurately. In order to achieve robust and precise stereo matching between local windows, we develop a bilayer matching cost, where a bottom-up scheme is exploited to design the two layers. The bottom layer is employed to measure the similarity between small square patches locally by exploiting a pre-trained convolutional neural network, and then the top layer is developed to assemble the local matching costs in large irregular windows induced by the tangent planes of object surfaces. To optimize the spatial smoothness of local assignments, we propose a novel strategy to update 3D labels. In the procedure of optimization, both segmentation information and random refinement of PatchMatch are exploited to update the candidate 3D label set for each pixel with high probability of achieving lower loss. Since the pairwise energy of general candidate label sets violates the submodular property of graph cut, we propose a novel multi-layer superpixel structure to group candidate label sets into candidate assignments, which thereby can be efficiently fused by α-expansion graph cut. Extensive experiments demonstrate that our method can achieve higher sub-pixel accuracy on different datasets, and currently ranks 1st on the new challenging Middlebury 3.0 benchmark among all existing methods.
Humans have various complex postures and movements. Considerable attention is given to the problem of recognizing a human fall. However, for practical applications, the recognition rates must be improved beyond those obtained in previous research. In this paper, a new recognition method, based on the analysis of a human fall, is provided. Furthermore, five eigenvectors that describe a fall are defined, i.e., the aspect ratio, effective area ratio, human point margin, body axis angle, and centrifugal rate of the body contour. Then, a support vector machine based on the Gauss radial basis function is trained to obtain a better identification result. The simulation results show that the model, through the combination of the five eigenvectors, has a recognition rate of 94.5%, which is a significant improvement compared to the previous research.
Attached files: seminar paper.pdf
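Two of the five descriptors can be computed directly from a binary silhouette; this is an illustrative sketch, not the paper's exact definitions, and the remaining three (human point margin, body axis angle, centrifugal rate) need tracking context that is omitted here:

```python
import numpy as np

def fall_features(mask):
    """From a binary silhouette, compute the bounding-box aspect ratio
    and the effective area ratio (silhouette area over bounding-box area)."""
    ys, xs = np.nonzero(mask)
    h = ys.max() - ys.min() + 1
    w = xs.max() - xs.min() + 1
    aspect = w / h                   # > 1 suggests a lying posture
    area_ratio = mask.sum() / (w * h)
    return aspect, area_ratio
```

Feature vectors of this kind are then fed to the RBF-kernel SVM for the final fall/no-fall decision.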
We address two difficulties in establishing an accurate system for image matching. First, image matching relies on the descriptor for feature extraction, but the optimal descriptor often varies from image to image, or even patch to patch. Second, conventional matching approaches carry out geometric checking on a small set of correspondence candidates due to the concern of efficiency. It may result in restricted performance in recall. We aim at tackling the two issues by integrating adaptive descriptor selection and progressive candidate enrichment into image matching. We consider that the two integrated components are complementary: The high-quality matching yielded by adaptively selected descriptors helps in exploring more plausible candidates, while the enriched candidate set serves as a better reference for descriptor selection. It motivates us to formulate image matching as a joint optimization problem, in which adaptive descriptor selection and progressive correspondence enrichment are alternately conducted. Our approach is comprehensively evaluated and compared with the state-of-the-art approaches on two benchmarks. The promising results manifest its effectiveness.
Attached files: ??.pdf
This paper addresses the problem of Face Alignment for a single image. We show how an ensemble of regression trees can be used to estimate the face's landmark positions directly from a sparse subset of pixel intensities, achieving super-realtime performance with high quality predictions. We present a general framework based on gradient boosting for learning an ensemble of regression trees that optimizes the sum of square error loss and naturally handles missing or partially labelled data. We show how using appropriate priors exploiting the structure of image data helps with efficient feature selection. Different regularization strategies and their importance in combating overfitting are also investigated. In addition, we analyse the effect of the quantity of training data on the accuracy of the predictions and explore the effect of data augmentation using synthesized data.
Attached files: Kazemi_One_Millisecond_Face_2014_CVPR_paper.pdf
Convolutional network techniques have recently achieved great success in vision-based detection tasks. This paper introduces the recent development of our research on transplanting the fully convolutional network technique to detection tasks on 3D range scan data. Specifically, the scenario is set as the vehicle detection task from the range data of a Velodyne 64E lidar. We propose to present the data in a 2D point map and use a single 2D end-to-end fully convolutional network to predict the objectness confidence and the bounding boxes simultaneously. By carefully designing the bounding box encoding, it is able to predict full 3D bounding boxes even using a 2D convolutional network. Experiments on the KITTI dataset show the state-of-the-art performance of the proposed method.
Background subtraction is usually based on low-level or hand-crafted features such as raw color components, gradients, or local binary patterns. As an improvement, we present a background subtraction algorithm based on spatial features learned with convolutional neural networks (ConvNets). Our algorithm uses a background model reduced to a single background image and a scene-specific training dataset to feed ConvNets that prove able to learn how to subtract the background from an input image patch. Experiments conducted on the 2014 ChangeDetection.net dataset show that our ConvNet-based algorithm at least reproduces the performance of state-of-the-art methods, and that it even outperforms them significantly when scene-specific knowledge is considered.
Attached files: BS with scene specific.pdf
We propose a novel object localization methodology with the purpose of boosting the localization accuracy of state-of-the-art object detection systems. Our model, given a search region, aims at returning the bounding box of an object of interest inside this region. To accomplish its goal, it relies on assigning conditional probabilities to each row and column of this region, where these probabilities provide useful information regarding the location of the boundaries of the object inside the search region and allow the accurate inference of the object bounding box under a simple probabilistic framework. For implementing our localization model, we make use of a convolutional neural network architecture that is properly adapted for this task, called LocNet. We show experimentally that LocNet achieves a very significant improvement on the mAP for high IoU thresholds on the PASCAL VOC2007 test set and that it can be very easily coupled with recent state-of-the-art object detection systems, helping them to boost their performance. Finally, we demonstrate that our detection approach can achieve high detection accuracy even when it is given as input a set of sliding windows, thus proving that it is independent of box proposal methods.
Attached files: Gidaris_LocNet_Improving_Localization_CVPR_2016_paper.pdf
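A simplified version of the final box inference, assuming the network has already produced per-row and per-column probabilities for the search region; the real model maximizes a likelihood, so plain thresholding here is an assumed stand-in:

```python
import numpy as np

def box_from_probs(row_p, col_p, thr=0.5):
    """Infer a bounding box from per-row and per-column probabilities of
    lying inside the object (LocNet's in-out case): rows/columns whose
    probability exceeds `thr` define the box extent."""
    rows = np.flatnonzero(row_p > thr)
    cols = np.flatnonzero(col_p > thr)
    if rows.size == 0 or cols.size == 0:
        return None                     # no confident object in the region
    # (x1, y1, x2, y2) in search-region coordinates
    return cols[0], rows[0], cols[-1], rows[-1]
```

Because the box is read off two 1D probability vectors, the inference cost is trivial compared to the convolutional forward pass.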
Matching pedestrians across multiple camera views, known as human re-identification (re-ID), is a challenging problem in visual surveillance. In existing works concentrating on feature extraction, representations are formed locally and independently of other regions. We present a novel siamese Long Short-Term Memory (LSTM) architecture that can process image regions sequentially and enhance the discriminative capability of local feature representation by leveraging contextual information. The feedback connections and internal gating mechanism of the LSTM cells enable our model to memorize the spatial dependencies and selectively propagate relevant contextual information through the network. We demonstrate improved performance compared to the baseline algorithm with no LSTM units and promising results compared to state-of-the-art methods on the Market-1501, CUHK03 and VIPeR datasets. Visualization of the internal mechanism of the LSTM cells shows that meaningful patterns can be learned by our method.
Attached files: 1607.08381.pdf
Research on video analysis for fire detection has become a hot topic in computer vision. However, conventional algorithms use exclusively rule-based models and feature vectors to classify whether a frame contains fire or not. These features are difficult to define and depend largely on the kind of fire observed. The outcome is a low detection rate and a high false-alarm rate. A different approach to this problem is to use a learning algorithm to extract the useful features instead of relying on an expert to build them. In this paper, we propose a convolutional neural network (CNN) for identifying fire in videos. Convolutional neural networks have been shown to perform very well in the area of object classification, and this network has the ability to perform feature extraction and classification within the same architecture. Tested on real video sequences, the proposed approach achieves better classification performance than several relevant conventional video fire detection methods, indicating that using CNNs to detect fire in videos is very promising.
Attached files: paper.pdf
We present a block-wise approach to detect stationary objects based on spatio-temporal change detection. First, block candidates are extracted by filtering out consecutive blocks containing moving objects. Then, an online clustering approach groups similar blocks at each spatial location over time via statistical variation of pixel ratios. The stability changes are identified by analyzing the relationships between the most repeated clusters at regular sampling instants. Finally, stationary objects are detected as those stability changes that exceed an alarm time and have not been visualized before. Unlike previous approaches making use of Background Subtraction, the proposed approach does not require foreground segmentation and provides robustness to illumination changes, crowds and intermittent object motion. The experiments over a heterogeneous dataset demonstrate the ability of the proposed approach for short- and long-term operation while overcoming challenging issues.
Text displayed in a video is an essential part of the high-level semantic information of the video content. Therefore, video text can be used as a valuable source for automated video indexing in digital video libraries. In this paper, we propose a workflow for video text detection and recognition. In the text detection stage, we have developed a fast localization-verification scheme, in which an edge-based multi-scale text detector first identifies potential text candidates with a high recall rate. Then, detected candidate text lines are refined by using an image entropy-based filter. Finally, Stroke Width Transform (SWT)- and Support Vector Machine (SVM)-based verification procedures are applied to eliminate the false alarms. For text recognition, we have developed a novel skeleton-based binarization method in order to separate text from complex backgrounds and make it processable for standard OCR (Optical Character Recognition) software. Operability and accuracy of the proposed text detection and binarization methods have been evaluated using publicly available test data sets.
Attached files: art%3A10.1007%2Fs11042-012-1250-6.pdf
This article tackles the problem of estimating non-rigid human 3D shape and motion from image sequences taken by uncalibrated cameras. Similar to other state-of-the-art solutions, we factorize 2D observations into camera parameters, base poses and mixing coefficients. Existing methods require sufficient camera motion during the sequence to achieve a correct 3D reconstruction. To obtain convincing 3D reconstructions from arbitrary camera motion, our method is based on a priori trained base poses. We show that strong periodic assumptions on the coefficients can be used to define an efficient and accurate algorithm for estimating periodic motion such as walking patterns. For the extension to non-periodic motion, we propose a novel regularization term based on temporal bone length constancy. In contrast to other works, the proposed method does not use a predefined skeleton or anthropometric constraints and can handle arbitrary camera motion. We achieve convincing 3D reconstructions, even under the influence of noise and occlusions. Multiple experiments based on a 3D error metric demonstrate the stability of the proposed method. Compared to other state-of-the-art methods our algorithm shows a
Attached files: pami_final.pdf
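The factorization the abstract refers to can be written out explicitly; the symbols below follow common non-rigid structure-from-motion notation and are an interpretation, not taken verbatim from the paper:

```latex
% 2D observations of frame f factor into camera, base poses, coefficients
W_f \;=\; R_f \sum_{k=1}^{K} c_{fk}\, B_k \;+\; t_f \mathbf{1}^{\top}
```

Here W_f stacks the 2D image points of frame f, R_f encodes the (uncalibrated) camera projection, B_k are the a priori trained base poses, c_{fk} are the mixing coefficients and t_f is a translation. For periodic motion such as walking, the coefficients c_{fk} are constrained to vary periodically over f, which is what makes the estimation well-posed even without camera motion.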