Seminars in 2017
    Object detection is a crucial task for autonomous driving. In addition to requiring high accuracy to ensure safety, object detection for autonomous driving also requires real-time inference speed to guarantee prompt vehicle control, as well as small model size and energy efficiency to enable embedded system deployment. In this work, we propose SqueezeDet, a fully convolutional neural network for object detection that aims to simultaneously satisfy all of the above constraints. In our network we use convolutional layers not only to extract feature maps, but also as the output layer to compute bounding boxes and class probabilities. The detection pipeline of our model only contains a single forward pass of a neural network, thus it is extremely fast. Our model is fully-convolutional, which leads to small model size and better energy efficiency. Finally, our experiments show that our model is very accurate, achieving state-of-the-art accuracy on the KITTI benchmark.
    Attached files: SqueezeDet: Unified Small Low Power Fully Convolutional Neural Networks for Real-Time Object Detection for Autonomous Driving.pdf
    This work introduces a novel convolutional network architecture for the task of human pose estimation. Features are processed across all scales and consolidated to best capture the various spatial relationships associated with the body. We show how repeated bottom-up, top-down processing used in conjunction with intermediate supervision is critical to improving the performance of the network. We refer to the architecture as a "stacked hourglass" network based on the successive steps of pooling and upsampling that are done to produce a final set of predictions. State-of-the-art results are achieved on the FLIC and MPII benchmarks, outcompeting all recent methods.
    Attached files: 20171216.pdf Stacked Hourglass Networks for Human Pose Estimation.pdf
    Scene parsing is challenging for unrestricted open vocabulary and diverse scenes. In this paper, we exploit the capability of global context information by different-region-based context aggregation through our pyramid pooling module together with the proposed pyramid scene parsing network (PSPNet). Our global prior representation is effective to produce good quality results on the scene parsing task, while PSPNet provides a superior framework for pixel-level prediction. The proposed approach achieves state-of-the-art performance on various datasets. It came first in the ImageNet scene parsing challenge 2016, the PASCAL VOC 2012 benchmark, and the Cityscapes benchmark. A single PSPNet yields a new record of 85.4% mIoU accuracy on PASCAL VOC 2012 and 80.2% accuracy on Cityscapes.
    Attached files: PSPNet_cvpr17.pdf
    This paper describes a lane-level localization algorithm based on a map-matching method for application to automated driving in urban environments. Lane-level localization implies localizing the vehicle with centimeter-level accuracy. In order to achieve a satisfactory level of position accuracy with a low-cost GPS, a sensor fusion approach is essential for lane-level localization. The proposed sensor fusion approach for the lane-level localization of a vehicle uses an around view monitoring (AVM) module and vehicle sensors. The proposed algorithm consists of three parts: lane detection, position correction, and a localization filter. In order to detect lanes, a commercialized AVM module is used. Since this module can acquire an image around the vehicle, it is possible to obtain accurate position information of the lanes. With this information, the vehicle position can be corrected by the iterative closest point (ICP) algorithm. This algorithm estimates the rigid transformation between the lane map and the lanes obtained by AVM in real time. The vehicle position corrected by this transformation is fused with the information of vehicle sensors based on an extended Kalman filter. For higher accuracy, the covariance of the ICP is estimated using Haralick's method.
    Attached files: 20171202-report-Yang Yu.pptx Lane-Level Localization Using an AVM Camera for an Automated Driving Vehicle in Urban Environments.pdf
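The core of the ICP correction step described above is estimating a rigid transformation between two matched point sets. Below is a minimal sketch of that inner alignment step only (one SVD-based, Kabsch-style solve for 2D lane points, assuming correspondences are already known; full ICP iterates matching and alignment, which is not shown):

```python
import numpy as np

def rigid_transform_2d(src, dst):
    """Estimate rotation R and translation t so that R @ src_i + t ~= dst_i.

    src, dst: (N, 2) arrays of matched points (correspondences known).
    Uses the SVD-based closed-form (Kabsch) solution.
    """
    c_src, c_dst = src.mean(axis=0), dst.mean(axis=0)
    H = (src - c_src).T @ (dst - c_dst)          # cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))       # guard against a reflection
    R = Vt.T @ np.diag([1.0, d]) @ U.T
    t = c_dst - R @ c_src
    return R, t

# Toy check: rotate lane points by 10 degrees, shift them, recover the motion.
theta = np.deg2rad(10.0)
R_true = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])
t_true = np.array([0.5, -1.2])
pts = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.1], [3.0, 0.3]])
moved = pts @ R_true.T + t_true
R_est, t_est = rigid_transform_2d(pts, moved)
```

In the paper's pipeline this transformation would be re-estimated every frame between the lane map and the AVM-detected lanes, with its covariance feeding the EKF update.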
    In this paper, we address the problem of person re-identification, which refers to associating the persons captured from different cameras. We propose a simple yet effective human part-aligned representation for handling the body part misalignment problem. Our approach decomposes the human body into regions (parts) which are discriminative for person matching, accordingly computes the representations over the regions, and aggregates the similarities computed between the corresponding regions of a pair of probe and gallery images as the overall matching score. Our formulation, inspired by attention models, is a deep neural network modeling the three steps together, which is learnt through minimizing the triplet loss function without requiring body part labeling information. Unlike most existing deep learning algorithms that learn a global or spatial partition-based local representation, our approach performs human body partition, and thus is more robust to pose changes and various human spatial distributions in the person bounding box. Our approach shows state-of-the-art results over standard datasets, Market-1501, CUHK03, CUHK01 and VIPeR.
    Attached files: Zhao_Deeply-Learned_Part-Aligned_Representations_ICCV_2017_paper.pdf
    We consider the automated recognition of human actions in surveillance videos. Most current methods build classifiers based on complex handcrafted features computed from the raw inputs. Convolutional neural networks (CNNs) are a type of deep model that can act directly on the raw inputs. However, such models are currently limited to handling 2D inputs. In this paper, we develop a novel 3D CNN model for action recognition. This model extracts features from both the spatial and the temporal dimensions by performing 3D convolutions, thereby capturing the motion information encoded in multiple adjacent frames. The developed model generates multiple channels of information from the input frames, and the final feature representation combines information from all channels. To further boost the performance, we propose regularizing the outputs with high-level features and combining the predictions of a variety of different models. We apply the developed models to recognize human actions in the real-world environment of airport surveillance videos, and they achieve superior performance in comparison to baseline methods.
    Attached files: 3d cnn.pdf
    In this paper, we present a new binarization approach to extract text pixels from complex backgrounds in video frames. Binarization is a crucial step for video text recognition, as it can greatly increase the recognition accuracy of OCR software. The proposed approach consists of four phases. First, the text polarity is determined, i.e., light text on a dark background or dark text on a light background. Then the pixels in the given image are clustered into K clusters using the K-means algorithm in the RGB color space, and the text cluster is selected based on the text polarity. Further, an MRF model is exploited to obtain the binarization result. Finally, the result is refined by a Log-Gabor filter. Experimental results on a large dataset show that significant gains are obtained in pixel-level segmentation performance as well as OCR accuracy.
    Attached files: A novel approach for binarization of overlay text.pdf
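The clustering phase described above can be sketched with plain K-means over RGB pixels; this toy version (NumPy only; K=2, the synthetic frame, and the brightness-based cluster pick are simplifications of the paper's full polarity handling) clusters pixels and keeps the cluster whose center matches a light-text polarity:

```python
import numpy as np

def kmeans_rgb(pixels, k=2, iters=20, seed=0):
    """Tiny K-means over (N, 3) RGB pixels; returns labels and centers."""
    rng = np.random.default_rng(seed)
    centers = pixels[rng.choice(len(pixels), k, replace=False)].astype(float)
    for _ in range(iters):
        # Assign each pixel to its nearest center in RGB space.
        d = np.linalg.norm(pixels[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute each non-empty cluster's center.
        for c in range(k):
            if np.any(labels == c):
                centers[c] = pixels[labels == c].mean(axis=0)
    return labels, centers

# Synthetic frame: 200 dark background pixels and 50 bright "text" pixels.
dark = np.full((200, 3), 30) + np.arange(200)[:, None] % 5
light = np.full((50, 3), 220) + np.arange(50)[:, None] % 5
pixels = np.vstack([dark, light])
labels, centers = kmeans_rgb(pixels, k=2)
# Light-text polarity: the text cluster is the one with the brighter center.
text_cluster = centers.mean(axis=1).argmax()
text_mask = labels == text_cluster
```

The resulting binary mask is what the paper's later MRF and Log-Gabor stages would refine.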
    Point cloud is an important type of geometric data structure. Due to its irregular format, most researchers transform such data to regular 3D voxel grids or collections of images. This, however, renders data unnecessarily voluminous and causes issues. In this paper, we design a novel type of neural network that directly consumes point clouds, which well respects the permutation invariance of points in the input. Our network, named PointNet, provides a unified architecture for applications ranging from object classification, part segmentation, to scene semantic parsing. Though simple, PointNet is highly efficient and effective. Empirically, it shows strong performance on par or even better than state of the art. Theoretically, we provide analysis towards understanding of what the network has learnt and why the network is robust with respect to input perturbation and corruption.
    Attached files: PointNet Deep Learning on Point Sets for 3D Classification and Segmentation.pdf
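The permutation invariance PointNet relies on comes from applying a shared per-point function and then a symmetric pooling operator (max) over the point axis. A minimal NumPy sketch, with random weights standing in for the learned shared MLP:

```python
import numpy as np

rng = np.random.default_rng(42)
W1, W2 = rng.normal(size=(3, 16)), rng.normal(size=(16, 8))

def pointnet_global_feature(points):
    """Shared per-point MLP followed by max pooling over points.

    points: (N, 3) cloud. Because max over the point axis is a symmetric
    function, the output is invariant to the ordering of the points.
    """
    h = np.maximum(points @ W1, 0.0)   # shared layer 1 (ReLU)
    h = np.maximum(h @ W2, 0.0)        # shared layer 2 (ReLU)
    return h.max(axis=0)               # symmetric pooling -> (8,) feature

cloud = rng.normal(size=(100, 3))
shuffled = cloud[rng.permutation(100)]
f1 = pointnet_global_feature(cloud)
f2 = pointnet_global_feature(shuffled)
```

Shuffling the input points leaves the global feature unchanged, which is exactly the property the abstract's "respects the permutation invariance of points" refers to.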
    Stereo matching is the key problem in many stereo-vision-based 3D applications. One factor that makes local stereo matching time-consuming is that every pixel shares the same pre-set disparity range, which must be larger than all possible disparities. An improper pre-set disparity range may lead to redundant computation for some pixels and inadequate computation for others. In this paper, we propose an accurate and fast local stereo matching method, which employs fine disparity estimation and adaptive cost aggregation. The main contributions of our work are twofold. Firstly, we use phase-based correlation to estimate an initial disparity range for block center pixels. Secondly, we estimate a more limited disparity range for every in-block pixel according to its support weight with respect to the block center pixel. Our disparity estimation and block matching techniques not only reduce the disparity search range for every pixel, but also eliminate some pseudo match pairs. Four standard Middlebury stereo image pairs are tested to evaluate the performance of the proposed algorithm. Experimental results show that the proposed algorithm reduces the matching time by 37.4% on average with relatively higher accuracy compared to the traditional method.
    Attached files: Block Based Dense Stereo Matching Using adaptive cost aggregation and limited disparity estimation.pdf Original ASW paper.pdf
    It is a challenging task to recognize smoke from images due to large variance of smoke color, texture, and shapes. There are smoke detection methods that have been proposed, but most of them are based on hand-crafted features. To improve the performance of smoke detection, we propose a novel deep normalization and convolutional neural network (DNCNN) with 14 layers to implement automatic feature extraction and classification. In DNCNN, traditional convolutional layers are replaced with normalization and convolutional layers to accelerate the training process and boost the performance of smoke detection. To reduce overfitting caused by imbalanced and insufficient training samples, we generate more training samples from original training data sets by using a variety of data enhancement techniques. Experimental results show that our method achieved very low false alarm rates below 0.60% with detection rates above 96.37% on our smoke data sets.
    Attached files: paper.pdf
    We propose a novel deep learning architecture for regressing disparity from a rectified pair of stereo images. We leverage knowledge of the problem's geometry to form a cost volume using deep feature representations. We learn to incorporate contextual information using 3-D convolutions over this volume. Disparity values are regressed from the cost volume using a proposed differentiable soft argmin operation, which allows us to train our method end-to-end to sub-pixel accuracy without any additional post-processing or regularization. We evaluate our method on the Scene Flow and KITTI datasets, and on KITTI we set a new state-of-the-art benchmark, while being significantly faster than competing approaches.
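The soft argmin mentioned above replaces a hard argmin over the cost volume with a softmax-weighted expectation over candidate disparities, d_hat = sum_d d * softmax(-c_d), which is what makes the regression differentiable and sub-pixel. A one-dimensional sketch for a single pixel's cost curve (the quadratic cost is an illustrative stand-in for a learned cost volume slice):

```python
import numpy as np

def soft_argmin(costs):
    """Differentiable disparity estimate from a per-pixel cost curve.

    costs: (D,) matching cost per candidate disparity. Returns the
    softmax(-cost)-weighted expected disparity, which is sub-pixel and
    differentiable, unlike a hard argmin.
    """
    logits = -costs
    logits = logits - logits.max()              # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return float(np.sum(np.arange(len(costs)) * probs))

# A cost curve whose true minimum lies between integer disparities 4 and 5.
disp = np.arange(10, dtype=float)
costs = (disp - 4.3) ** 2 * 5.0
d_hat = soft_argmin(costs)
```

A hard argmin on this curve would return exactly 4; the soft version lands slightly above 4, reflecting the sub-pixel location of the minimum.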
    Generative models of 3D human motion are often restricted to a small number of activities and can therefore not generalize well to novel movements or applications. In this work we propose a deep learning framework for human motion capture data that learns a generic representation from a large corpus of motion capture data and generalizes well to new, unseen, motions. Using an encoding-decoding network that learns to predict future 3D poses from the most recent past, we extract a feature representation of human motion. Most work on deep learning for sequence prediction focuses on video and speech. Since skeletal data has a different structure, we present and evaluate different network architectures that make different assumptions about time dependencies and limb correlations. To quantify the learned features, we use the output of different layers for action classification and visualize the receptive fields of the network units. Our method outperforms the recent state of the art in skeletal motion prediction even though those methods use action-specific training data. Our results show that deep feedforward networks, trained from a generic mocap database, can successfully be used for feature extraction from human motion data and that this representation can be used as a foundation for classification and prediction.
    Attached files: Deep representation learning for human motion prediction and classification.pdf
    This paper aims at high-accuracy 3D object detection in the autonomous driving scenario. We propose Multi-View 3D networks (MV3D), a sensory-fusion framework that takes both LIDAR point clouds and RGB images as input and predicts oriented 3D bounding boxes. We encode the sparse 3D point cloud with a compact multi-view representation. The network is composed of two subnetworks: one for 3D object proposal generation and another for multi-view feature fusion. The proposal network generates 3D candidate boxes efficiently from the bird's eye view representation of the 3D point cloud. We design a deep fusion scheme to combine region-wise features from multiple views and enable interactions between intermediate layers of different paths. Experiments on the challenging KITTI benchmark show that our approach outperforms the state-of-the-art by around 25% and 30% AP on the tasks of 3D localization and 3D detection. In addition, for 2D detection, our approach obtains 14.9% higher AP than the state-of-the-art on the hard data among the LIDAR-based methods.
    Attached files: multi-view 3d object detection network for autonomous driving.pdf
    Recent work has shown that convolutional networks can be substantially deeper, more accurate, and more efficient to train if they contain shorter connections between layers close to the input and those close to the output. In this paper, we embrace this observation and introduce the Dense Convolutional Network (DenseNet), which connects each layer to every other layer in a feed-forward fashion. Whereas traditional convolutional networks with L layers have L connections (one between each layer and its subsequent layer), our network has L(L+1)/2 direct connections. For each layer, the feature maps of all preceding layers are used as inputs, and its own feature maps are used as inputs into all subsequent layers. DenseNets have several compelling advantages: they alleviate the vanishing-gradient problem, strengthen feature propagation, encourage feature reuse, and substantially reduce the number of parameters. We evaluate our proposed architecture on four highly competitive object recognition benchmark tasks (CIFAR-10, CIFAR-100, SVHN, and ImageNet). DenseNets obtain significant improvements over the state-of-the-art on most of them, whilst requiring less computation to achieve high performance. Code and pre-trained models are available at https://github.com/liuzhuang13/DenseNet.
    Attached files: Huang_Densely_Connected_Convolutional_CVPR_2017_paper.pdf
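The L(L+1)/2 connection count and the channel growth inside a dense block follow directly from "each layer sees all preceding feature maps". A small sketch under the standard fixed-growth-rate assumption (the specific k0 and growth-rate values below are illustrative, not taken from the paper):

```python
def dense_block_stats(num_layers, k0, growth_rate):
    """Connection count and per-layer input channels in one dense block.

    Each layer receives the concatenation of the block input (k0 channels)
    and all preceding layers' outputs (growth_rate channels each), so
    layer i sees k0 + i * growth_rate input channels, and the block has
    num_layers * (num_layers + 1) / 2 direct connections.
    """
    connections = num_layers * (num_layers + 1) // 2
    input_channels = [k0 + i * growth_rate for i in range(num_layers)]
    return connections, input_channels

conns, chans = dense_block_stats(num_layers=4, k0=64, growth_rate=32)
# 4 layers -> 4*5/2 = 10 direct connections; inputs grow 64, 96, 128, 160.
```

The linear channel growth is why a small growth rate keeps DenseNets parameter-efficient despite the dense connectivity.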
    Deep learning has received significant attention recently as a promising solution to many problems in the area of artificial intelligence. Among several deep learning architectures, convolutional neural networks (CNNs) demonstrate superior performance when compared to other machine learning methods in the applications of object detection and recognition. We use a CNN for image enhancement and the detection of driving lanes on motorways. In general, the process of lane detection consists of edge extraction and line detection. A CNN can be used to enhance the input images before lane detection by excluding noise and obstacles that are irrelevant to the edge detection result. However, training conventional CNNs requires considerable computation and a big dataset. Therefore, we suggest a new learning algorithm for CNNs using an extreme learning machine (ELM). The ELM is a fast learning method used to calculate network weights between output and hidden layers in a single iteration and thus, can dramatically reduce learning time while producing accurate results with minimal training data. A conventional ELM can be applied to networks with a single hidden layer; as such, we propose a stacked ELM architecture in the CNN framework. Further, we modify the backpropagation algorithm to find the targets of hidden layers and effectively learn network weights while maintaining performance. Experimental results confirm that the proposed method is effective in reducing learning time and improving performance.
    Attached files: 1-s2.0-S0893608016301885-main.pdf
    In this paper, we present a complete change detection system named multimode background subtraction. The universal nature of the system allows it to robustly handle a multitude of challenges associated with video change detection, such as illumination changes, dynamic background, camera jitter, and moving camera. The system comprises multiple innovative mechanisms in background modeling, model update, pixel classification, and the use of multiple color spaces. The system first creates multiple background models of the scene followed by an initial foreground/background probability estimation for each pixel. Next, the image pixels are merged together to form mega-pixels, which are used to spatially denoise the initial probability estimates to generate binary masks for both RGB and YCbCr color spaces. The masks generated after processing these input images are then combined to separate foreground pixels from the background. Comprehensive evaluation of the proposed approach on publicly available test sequences from the CDnet and the ESI data sets shows superiority in the performance of our system over other state-of-the-art algorithms.
    Person re-identification (ReID) is an important task in wide-area video surveillance which focuses on identifying people across different cameras. Recently, deep learning networks with a triplet loss have become a common framework for person ReID. However, the triplet loss pays main attention to obtaining correct orders on the training set. It still suffers from a weaker generalization capability from the training set to the testing set, thus resulting in inferior performance. In this paper, we design a quadruplet loss, which can lead to model output with a larger inter-class variation and a smaller intra-class variation compared to the triplet loss. As a result, our model has a better generalization ability and can achieve higher performance on the testing set. In particular, a quadruplet deep network using margin-based online hard negative mining is proposed based on the quadruplet loss for person ReID. In extensive experiments, the proposed network outperforms most of the state-of-the-art algorithms on representative datasets, which clearly demonstrates the effectiveness of our proposed method.
    Attached files: Chen_Beyond_Triplet_Loss_CVPR_2017_paper.pdf
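The quadruplet loss extends the triplet loss with a second hinge term that pushes the positive-pair distance below the distance between two *other* identities, enlarging inter-class variation. A minimal sketch with squared Euclidean distances; the margin values and toy embeddings are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def quadruplet_loss(anchor, positive, negative1, negative2,
                    margin1=1.0, margin2=0.5):
    """Quadruplet loss over four embedding vectors.

    Term 1 is the usual triplet hinge (anchor/positive vs anchor/negative1).
    Term 2 compares the positive pair against a pair formed by two other
    identities (negative1, negative2), shrinking intra-class distances
    relative to all inter-class distances, not just those at the anchor.
    """
    d = lambda a, b: float(np.sum((a - b) ** 2))
    t1 = max(0.0, d(anchor, positive) - d(anchor, negative1) + margin1)
    t2 = max(0.0, d(anchor, positive) - d(negative1, negative2) + margin2)
    return t1 + t2

a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])     # same identity, close to the anchor
n1 = np.array([3.0, 0.0])    # a different identity
n2 = np.array([0.0, 3.0])    # a third identity
loss = quadruplet_loss(a, p, n1, n2)   # well-separated -> zero loss
```

When the positive drifts away or the two negative identities get too close to each other, one of the hinge terms activates and the loss becomes positive.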
    Abandoned object detection is one of the most challenging tasks in intelligent video surveillance systems. In this paper we present a new method for detecting abandoned objects (AO) using edges instead of pixel intensities. Our main focus is on reducing false alarms while keeping a high true-positive detection rate. Based on edge information, the proposed method reduces error rates considerably compared to approaches based on pixel intensities. At first, static edges are detected by applying a temporal accumulation step using the foreground edge mask resulting from the edge-based background subtraction model. Then, edge clustering is applied on the obtained stable edge mask, using edges' position and stability in time to delineate the object bounding box. Finally, an efficient classification approach, relying on scores based on edges' position, orientation, and staticness, is applied on the AO candidates. The proposed approach has been validated on several challenging benchmarks and is compared to other works in the literature. The results show that our method greatly reduces false alarm rates while keeping good detection accuracy on true positives.
    This paper presents a new method for detecting falls of elderly people in a room environment based on shape analysis of 3D depth images captured by a Kinect sensor. Depth images are preprocessed by a median filter for both background and target. The silhouette of the moving individual in depth images is obtained by subtracting background frames. The depth images are converted to a disparity map, which is obtained from horizontal and vertical projection histogram statistics. The initial floor plane information is obtained from the V-disparity map, and the floor plane equation is estimated by the least-squares method. Shape information of the human subject in depth images is analyzed by a set of moment functions. Ellipse coefficients are calculated to determine the orientation of the individual. The centroid of the human body is calculated, along with the angle between the human body and the floor plane. When both the distance from the centroid of the human body to the floor plane and the angle between the human body and the floor plane are lower than given thresholds, a fall incident is detected. Experiments with different falling directions are performed. Experimental results show that the proposed method can detect fall incidents effectively.
    Attached files: 3d fall dete.pdf
    The authors have conducted studies on recognizing Arabic news captions to develop a video retrieval system for indexing and editing the Arabic broadcast programs received daily and stored in a big database. This paper describes a dedicated OCR for recognizing low-resolution news captions in video images. A news caption recognition system consisting of text line extraction, word segmentation, and segmentation-recognition of words is developed, and its performance was experimentally evaluated using datasets of frame images extracted from AlJazeera broadcasting programs. Character recognition of moving news captions is difficult due to the combing noise yielded by the interlacing of scan lines. A technique to detect and eliminate the combing noise so as to correctly recognize moving news captions is proposed. This paper also proposes a technique based on inter-frame text difference to detect the transition frames of still news captions. The technique to detect transition frames is necessary for efficient video retrieval and playback. The proposed technique is experimentally tested and shown to be robust to quick motion of the background, detecting transition frames correctly with an F-measure higher than 90%. Compared with the commercial ABBYY FineReader 11® OCR, the dedicated OCR improves the recall of Arabic characters in AlJazeera broadcasting news from 70.74% to 95.85% for non-interlaced moving news captions and from 23.82% to 96.29% for interlaced moving news captions.
    Attached files: Recognition and transition.pdf
    Forest fire is a serious hazard in many places around the world. For such threats, video-based smoke detection is particularly important for early warning, because smoke arises in any forest fire and can be seen from a long distance. This paper presents a novel and robust approach for smoke detection that employs deep belief networks. The proposed method is divided into three phases. In the preprocessing phase, regions of high motion are extracted by a background subtraction method. In the next phase, smoke pixel intensities are extracted for the foreground regions from the RGB and YCbCr (luminance, chroma-blue, chroma-red) color spaces. Subsequently, a second, texture-based feature is computed for detecting smoke regions: the Local Extrema Co-occurrence Pattern, an improved version of local binary patterns, is extracted from the different foreground regions, capturing not only the texture of smoke but also its intensity and color using the HSV (hue, saturation, value) color space. Finally, a deep belief network is employed for classification. The proposed method proves its accuracy and robustness when tested on a variety of scenarios, including wildfire smoke, hill-base smoke, and indoor and outdoor smoke videos.
    Attached files: seminar_2017-07-08.pdf
    Markov random fields are widely used to model many computer vision problems that can be cast in an energy minimization framework composed of unary and pairwise potentials. While computationally tractable discrete optimizers such as Graph Cuts and belief propagation (BP) exist for multi-label discrete problems, they still face prohibitively high computational challenges when the labels reside in a huge or very densely sampled space. Integrating key ideas from PatchMatch of effective particle propagation and resampling, PatchMatch belief propagation (PMBP) has been demonstrated to have good performance in addressing continuous labeling problems and runs orders of magnitude faster than Particle BP (PBP). However, the quality of the PMBP solution is tightly coupled with the local window size, over which the raw data cost is aggregated to mitigate ambiguity in the data constraint. This dependency heavily influences the overall complexity, increasing linearly with the window size. This paper proposes a novel algorithm called sped-up PMBP (SPM-BP) to tackle this critical computational bottleneck and speeds up PMBP by 50-100 times. The crux of SPM-BP is on unifying efficient filter-based cost aggregation and message passing with PatchMatch-based particle generation in a highly effective way. Though simple in its formulation, SPM-BP achieves superior performance for sub-pixel accurate stereo and optical-flow on benchmark datasets when compared with more complex and task-specific approaches.
    Matching cost aggregation is one of the oldest and still popular methods for stereo correspondence. While effective and efficient, cost aggregation methods typically aggregate the matching cost by summing/averaging over a user-specified, local support region. This is obviously only locally-optimal, and the computational complexity of the full-kernel implementation usually depends on the region size. In this paper, the cost aggregation problem is reexamined and a non-local solution is proposed. The matching cost values are aggregated adaptively based on pixel similarity on a tree structure derived from the stereo image pair to preserve depth edges. The nodes of this tree are all the image pixels, and the edges are all the edges between the nearest neighboring pixels. The similarity between any two pixels is decided by their shortest distance on the tree. The proposed method is non-local as every node receives supports from all other nodes on the tree. As can be expected, the proposed non-local solution outperforms all local cost aggregation methods on the standard (Middlebury) benchmark. Besides, it has great advantage in extremely low computational complexity: only a total of 2 addition/subtraction operations and 3 multiplication operations are required for each pixel at each disparity level. It is very close to the complexity of unnormalized box filtering using integral image which requires 6 addition/subtraction operations. Unnormalized box filter is the fastest local cost aggregation method but blurs across depth edges. The proposed method was tested on a MacBook Air laptop computer with a 1.8 GHz Intel Core i7 CPU and 4 GB memory. The average runtime on the Middlebury data sets is about 90 milliseconds, and is only about 1.25× slower than unnormalized box filter. A non-local disparity refinement method is also proposed based on the non-local cost aggregation method.
    Attached files: A Non-Local Aggregation Method Stereo Matching.pdf
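The unnormalized box filter baseline the abstract compares against can be implemented with an integral image, so that each output pixel's window sum costs a constant number of additions/subtractions regardless of window size. A sketch of that baseline (not the paper's tree-based aggregation itself; border windows are clamped to the image):

```python
import numpy as np

def box_aggregate(cost, radius):
    """Sum `cost` over a (2r+1)x(2r+1) window using an integral image.

    After the integral image is built, each pixel's window sum needs only
    4 lookups (3 add/subtract operations), independent of the window size.
    """
    h, w = cost.shape
    ii = np.zeros((h + 1, w + 1))
    ii[1:, 1:] = cost.cumsum(axis=0).cumsum(axis=1)   # integral image
    out = np.empty_like(cost, dtype=float)
    for y in range(h):
        y0, y1 = max(0, y - radius), min(h, y + radius + 1)
        for x in range(w):
            x0, x1 = max(0, x - radius), min(w, x + radius + 1)
            # Window sum from 4 corner lookups of the integral image.
            out[y, x] = ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]
    return out

cost = np.arange(25, dtype=float).reshape(5, 5)
agg = box_aggregate(cost, radius=1)
```

This is the fast-but-edge-blurring baseline; the paper's contribution replaces the fixed square window with adaptive support along a minimum spanning tree at comparable cost.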
    We present an approach to efficiently detect the 2D pose of multiple people in an image. The approach uses a non-parametric representation, which we refer to as Part Affinity Fields (PAFs), to learn to associate body parts with individuals in the image. The architecture encodes global context, allowing a greedy bottom-up parsing step that maintains high accuracy while achieving realtime performance, irrespective of the number of people in the image. The architecture is designed to jointly learn part locations and their association via two branches of the same sequential prediction process. Our method placed first in the inaugural COCO 2016 keypoints challenge, and significantly exceeds the previous state-of-the-art result on the MPII Multi-Person benchmark, both in performance and efficiency.
    Attached files: CVPR_Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields.pdf
    With the increasing number of machine learning methods used for segmenting images and analyzing videos, there has been a growing need for large datasets with pixel-accurate ground truth. In this letter, we propose a highly accurate semi-automatic method for segmenting foreground moving objects pictured in surveillance videos. Given a limited number of user interventions, the goal of the method is to provide results sufficiently accurate to be used as ground truth. We show that by manually outlining a small number of moving objects, we can get our model to learn the appearance of the background and the foreground moving objects. Since the background and foreground moving objects are highly redundant from one image to another (the videos come from surveillance cameras), the model does not need a large number of examples to accurately fit the data. Our end-to-end model is based on a multi-resolution convolutional neural network (CNN) with a cascaded architecture. Tests performed on the largest publicly available video dataset with pixel-accurate ground truth (changedetection.net) reveal that on videos from 11 categories, our approach has an average F-measure of 0.95, which is within the error margin of a human being. With our model, the amount of manual work for ground-truthing a video is reduced by a factor of up to 40. Code is made publicly available at: https://github.com/zhimingluo/MovingObjectSegmentation
    Attached files: Interactive deep learning method for segmenting moving object.pdf
    In this work, a deep learning approach has been developed to carry out road detection using only LIDAR data. Starting from an unstructured point cloud, top-view images encoding several basic statistics such as mean elevation and density are generated. By considering a top-view representation, road detection is reduced to a single-scale problem that can be addressed with a simple and fast fully convolutional neural network (FCN). The FCN is specifically designed for the task of pixel-wise semantic segmentation by combining a large receptive field with high-resolution feature maps. The proposed system achieved excellent performance and is among the top-performing algorithms on the KITTI road benchmark. Its fast inference makes it particularly suitable for real-time applications.
    Attached files: Fast LIDAR-based Road Detection Using Fully Convolutional Neural Networks.pdf
    A robust vanishing point estimation method is proposed that uses a probabilistic voting procedure based on intersection points of line segments extracted from an input image. The proposed voting function is defined with line segment strength that represents relevance of the extracted line segments. Next, candidate line segments for lanes are selected by considering geometric constraints. Finally, the host lane is detected by using the proposed score function, which is designed to remove outliers in the candidate line segments. Also, the detected host lane is refined by using inter-frame similarity that considers location consistency of the detected host lane and the estimated vanishing point in consecutive frames. Furthermore, in order to reduce computational costs in the vanishing point estimation process, a method using a lookup table is proposed.
    Attached files: A Robust Lane Detection Method Based on.pdf
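    The voting step described above can be sketched as follows: intersect pairs of extracted line segments and let each intersection cast a vote weighted by segment strength. Here segment length stands in for the paper's line-segment strength, and a weighted mean replaces the full probabilistic voting function; both are simplifying assumptions.

```python
import numpy as np
from itertools import combinations

def intersect(l1, l2):
    """Intersection of the infinite lines through segments l1 and l2, each
    given as ((x1, y1), (x2, y2)); returns None for parallel lines."""
    (x1, y1), (x2, y2) = l1
    (x3, y3), (x4, y4) = l2
    d = (x1 - x2) * (y3 - y4) - (y1 - y2) * (x3 - x4)
    if abs(d) < 1e-9:
        return None
    a = x1 * y2 - y1 * x2
    b = x3 * y4 - y3 * x4
    return ((a * (x3 - x4) - (x1 - x2) * b) / d,
            (a * (y3 - y4) - (y1 - y2) * b) / d)

def vote_vanishing_point(segments):
    """Each pair of segments contributes its intersection point, weighted by
    the product of the two segment lengths (a crude stand-in for the paper's
    line-segment strength); the weighted mean of the votes is returned."""
    votes, weights = [], []
    for s1, s2 in combinations(segments, 2):
        p = intersect(s1, s2)
        if p is None:
            continue
        w = (np.hypot(*np.subtract(s1[1], s1[0]))
             * np.hypot(*np.subtract(s2[1], s2[0])))
        votes.append(p)
        weights.append(w)
    return tuple(np.average(np.array(votes), axis=0, weights=np.array(weights)))

# three "lane" segments whose supporting lines all pass through (0, 0)
segs = [((-4, 4), (-1, 1)), ((4, 4), (1, 1)), ((-2, 1), (-4, 2))]
vp = vote_vanishing_point(segs)
```

    The lookup-table speed-up mentioned in the abstract would precompute these intersections for quantized segment parameters instead of solving for them per frame.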
    When considering person re-identification (re-ID) as a retrieval process, re-ranking is a critical step to improve its accuracy. Yet in the re-ID community, limited effort has been devoted to re-ranking, especially fully automatic, unsupervised solutions. In this paper, we propose a k-reciprocal encoding method to re-rank the re-ID results. Our hypothesis is that if a gallery image is similar to the probe in the k-reciprocal nearest neighbors, it is more likely to be a true match. Specifically, given an image, a k-reciprocal feature is calculated by encoding its k-reciprocal nearest neighbors into a single vector, which is used for re-ranking under the Jaccard distance. The final distance is computed as the combination of the original distance and the Jaccard distance. Our re-ranking method does not require any human interaction or any labeled data, so it is applicable to large-scale datasets. Experiments on the large-scale Market-1501, CUHK03, MARS, and PRW datasets confirm the effectiveness of our method.
    Attached files: re-ranking with reciprocal encoding.pdf
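    The core of the method above can be sketched directly from its definitions: build each image's k-reciprocal neighbor set, measure set overlap with the Jaccard distance, and blend that with the original distance. The full paper additionally uses Gaussian-weighted encoding and local query expansion, which this sketch omits; k and the weight lam are illustrative.

```python
import numpy as np

def k_reciprocal_sets(dist, k):
    """dist is an (n, n) pairwise distance matrix over probe and gallery.
    R(i, k) = { j : j is in the kNN of i AND i is in the kNN of j }."""
    knn = np.argsort(dist, axis=1)[:, :k + 1]  # k+1 since each row includes self
    return [{j for j in knn[i] if i in knn[j]} for i in range(dist.shape[0])]

def jaccard_distance(sets, i, j):
    """1 - |R(i) & R(j)| / |R(i) | R(j)|, computed on the reciprocal sets."""
    union = len(sets[i] | sets[j])
    return 1.0 - len(sets[i] & sets[j]) / union if union else 1.0

def rerank(dist, k=1, lam=0.3):
    """Final distance = lam * original + (1 - lam) * Jaccard, the weighted
    combination described in the abstract (lam and k are illustrative)."""
    sets = k_reciprocal_sets(dist, k)
    n = dist.shape[0]
    jd = np.array([[jaccard_distance(sets, i, j) for j in range(n)]
                   for i in range(n)])
    return lam * dist + (1.0 - lam) * jd

# images 0 and 1 are mutual nearest neighbors; image 2 is an outlier
dist = np.array([[0., 1., 4.],
                 [1., 0., 4.],
                 [4., 4., 0.]])
out = rerank(dist, k=1, lam=0.3)
```

    Because true matches tend to be reciprocal neighbors while false matches are not, the Jaccard term pulls genuine matches closer together in the final ranking.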
    There is a huge proliferation of surveillance systems that require strategies for detecting different kinds of stationary foreground objects (e.g., unattended packages or illegally parked vehicles). As these strategies must be able to detect foreground objects that remain static in crowded scenarios, regardless of how long they have not been moving, several algorithms for detecting such foreground objects have been developed over the last decades. This paper presents an efficient and high-quality strategy to detect stationary foreground objects, which is able to detect not only completely static objects but also partially static ones. Three parallel nonparametric detectors with different absorption rates are used to detect currently moving foreground objects, short-term stationary foreground objects, and long-term stationary foreground objects. The results of the detectors are fed into a novel finite state machine that classifies the pixels among background, moving foreground objects, stationary foreground objects, occluded stationary foreground objects, and uncovered background. Results show that the proposed strategy not only achieves high quality in several challenging situations but also improves upon previous strategies.
    Attached files: 20170513-Saturday Seminar-Wahyono.pdf
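    The role of the three absorption rates can be illustrated with a toy per-pixel model: a fast background model quickly absorbs anything that stops moving, a medium one absorbs it later, and a slow one later still, so the pattern of which models still flag a pixel encodes how long it has been static. The paper uses nonparametric detectors and a richer state machine; the running averages and rule table below are simplified stand-ins with invented rates and threshold.

```python
def update_models(pixel, models, rates=(0.5, 0.05, 0.005)):
    """Three running-average background models with different absorption
    rates (fast, medium, slow). Each model drifts toward the current pixel
    value at its own rate."""
    return [m + r * (pixel - m) for m, r in zip(models, rates)]

def classify(pixel, models, tau=20.0):
    """Coarse pixel label from the three detections. fg[i] is True while
    model i still flags the pixel as foreground; a static object is absorbed
    by the fast model first, then by the medium one."""
    fg = [abs(pixel - m) > tau for m in models]
    if all(fg):
        return "moving foreground"
    if not fg[0] and fg[1] and fg[2]:
        return "short-term stationary"
    if not fg[0] and not fg[1] and fg[2]:
        return "long-term stationary"
    return "background"

models = [100.0, 100.0, 100.0]
label_before = classify(200.0, models)       # object has just appeared
for _ in range(20):                          # object stays put; models absorb it
    models = update_models(200.0, models)
label_after = classify(200.0, models)        # fast model has absorbed it
```

    The finite state machine in the paper refines exactly this kind of evidence over time, adding states for occluded stationary objects and uncovered background.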
    Convolutional networks are powerful visual models that yield hierarchies of features. We show that convolutional networks by themselves, trained end-to-end, pixels-to-pixels, improve on the previous best result in semantic segmentation. Our key insight is to build "fully convolutional" networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning. We define and detail the space of fully convolutional networks, explain their application to spatially dense prediction tasks, and draw connections to prior models. We adapt contemporary classification networks (AlexNet, the VGG net, and GoogLeNet) into fully convolutional networks and transfer their learned representations by fine-tuning to the segmentation task. We then define a skip architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations. Our fully convolutional networks achieve improved segmentation of PASCAL VOC (30% relative improvement to 67.2% mean IU on 2012), NYUDv2, SIFT Flow, and PASCAL-Context, while inference takes one tenth of a second for a typical image.
    Attached files: Fully Convolutional Networks for Semantic Segmentation.pdf
    Compared with other video semantic clues, such as gestures and motions, video text generally provides highly useful and fairly precise semantic information, the analysis of which can to a great extent facilitate video and scene understanding. It can be observed that video text shows stronger edges. The Nonsubsampled Contourlet Transform (NSCT) is a fully shift-invariant, multi-scale, and multi-direction expansion, which can preserve the edges/silhouettes of text characters well. Therefore, in this paper, a new approach is proposed to detect video text based on the NSCT. First, the 8 directional coefficients of the NSCT are combined to build the directional edge map (DEM), which keeps the horizontal, vertical, and diagonal edge features and suppresses other directional edge features. Then the various directional pixels of the DEM are integrated into a whole binary image (BE). Based on the BE, text frame classification is carried out to determine whether the video frames contain text lines. Finally, text detection based on the BE is performed on consecutive frames to discriminate video text from non-text regions. Experimental evaluations on our collected TV video dataset demonstrate that our method significantly outperforms three other video text detection algorithms in both detection speed and accuracy, especially under challenges such as video text with various sizes, languages, colors, fonts, and short or long text lines.
    Attached files: art%3A10.1007%2Fs11042-017-4619-8.pdf
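    The DEM construction described above can be sketched once the 8 directional coefficient maps are available: combine them into a single edge map and binarize it. The real pipeline weights the horizontal, vertical, and diagonal bands differently; the max-and-threshold combination and the threshold value below are simplifying assumptions.

```python
import numpy as np

def directional_edge_map(coeffs, tau=0.2):
    """coeffs: list of 8 directional NSCT coefficient maps, each (H, W).
    Keep the maximum absolute response per pixel as the directional edge
    map (DEM), then threshold it into a binary edge image (BE)."""
    stack = np.abs(np.stack(coeffs))       # (8, H, W)
    dem = stack.max(axis=0)                # strongest directional response
    be = (dem > tau).astype(np.uint8)      # binary edge image
    return dem, be

# toy input: one strong response in one directional band
coeffs = [np.zeros((2, 2)) for _ in range(8)]
coeffs[3][0, 1] = -0.5
dem, be = directional_edge_map(coeffs)
```

    Text frame classification and per-frame text detection then operate on the resulting BE rather than on the raw coefficients.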
    Estimating the disparity and normal direction of one pixel simultaneously instead of only the disparity, also known as 3D label methods, can achieve much higher sub-pixel accuracy in the stereo matching problem. However, it is extremely difficult to assign an appropriate 3D label to each pixel from the continuous label space R^3 while maintaining global consistency because of the infinite parameter space. In this paper, we propose a novel algorithm called PatchMatch-based Superpixel Cut (PMSC) to assign 3D labels of an image more accurately. In order to achieve robust and precise stereo matching between local windows, we develop a bilayer matching cost, where a bottom-up scheme is exploited to design the two layers. The bottom layer is employed to measure the similarity between small square patches locally by exploiting a pre-trained convolutional neural network, and then the top layer is developed to assemble the local matching costs in large irregular windows induced by the tangent planes of object surfaces. To optimize the spatial smoothness of local assignments, we propose a novel strategy to update 3D labels. In the procedure of optimization, both segmentation information and the random refinement of PatchMatch are exploited to update the candidate 3D label set for each pixel with high probability of achieving lower loss. Since the pairwise energy of general candidate label sets violates the submodular property of graph cut, we propose a novel multi-layer superpixel structure to group candidate label sets into candidate assignments, which can thereby be efficiently fused by α-expansion graph cut. Extensive experiments demonstrate that our method achieves higher sub-pixel accuracy on different datasets, and it currently ranks 1st on the new challenging Middlebury 3.0 benchmark among all existing methods.
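    What a "3D label" buys over a plain disparity can be shown in one formula: a label (a, b, c) from R^3 assigns every pixel (u, v) the continuous disparity d = a*u + b*v + c, i.e. a slanted-plane hypothesis rather than a single fronto-parallel integer disparity. This is the standard 3D-label parameterization; the numbers below are illustrative.

```python
import numpy as np

def plane_disparity(label, u, v):
    """Continuous disparity induced at pixel (u, v) by the 3D label
    (a, b, c): d = a*u + b*v + c. A fronto-parallel surface is the special
    case a = b = 0, where every pixel gets the same disparity c."""
    a, b, c = label
    return a * u + b * v + c

# evaluate a slanted plane over a small 4x4 window of pixels
vs, us = np.mgrid[0:4, 0:4]
disp = plane_disparity((0.5, 0.0, 2.0), us, vs)
```

    The difficulty the paper addresses is choosing one such (a, b, c) per pixel from this continuous space while keeping neighboring assignments globally consistent.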
    Humans have various complex postures and movements. Considerable attention has been given to the problem of recognizing a human fall. However, for practical applications, the recognition rates must be improved beyond those obtained in previous research. In this paper, a new recognition method based on the analysis of a human fall is provided. Five eigenvectors that describe a fall are defined, i.e., the aspect ratio, effective area ratio, human point margin, body axis angle, and centrifugal rate of the body contour. Then, a support vector machine based on the Gaussian radial basis function is trained to obtain a better identification result. The simulation results show that the model, through the combination of the five eigenvectors, has a recognition rate of 94.5%, which is a significant improvement compared to previous research.
    Attached files: seminar paper.pdf
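    Two of the five descriptors named above are easy to sketch from a binary silhouette's pixel coordinates: the bounding-box aspect ratio and the body axis angle, the latter taken here as the principal-axis orientation via PCA. The paper's exact definitions may differ; this is an illustrative computation only.

```python
import numpy as np

def fall_features(points):
    """points: (N, 2) array of (x, y) silhouette pixel coordinates.
    Returns the bounding-box aspect ratio (width / height) and the body
    axis angle in degrees, folded into [0, 180)."""
    xs, ys = points[:, 0], points[:, 1]
    w = xs.max() - xs.min()
    h = ys.max() - ys.min()
    aspect_ratio = w / h if h else np.inf
    centered = points - points.mean(axis=0)
    # body axis = eigenvector of the covariance with the largest eigenvalue
    cov = centered.T @ centered
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    axis = eigvecs[:, np.argmax(eigvals)]
    angle = np.degrees(np.arctan2(axis[1], axis[0])) % 180.0
    return aspect_ratio, angle

# a roughly upright silhouette: tall and thin, body axis near 90 degrees
pts = np.array([[0., 0.], [0., 1.], [0., 2.], [1., 1.]])
ar, angle = fall_features(pts)
```

    A standing person yields a small aspect ratio and a near-vertical axis; during a fall both values change sharply, which is why these features, together with the other three, are informative inputs for the Gaussian-RBF SVM described in the abstract.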
    We address two difficulties in establishing an accurate system for image matching. First, image matching relies on the descriptor for feature extraction, but the optimal descriptor often varies from image to image, or even patch to patch. Second, conventional matching approaches carry out geometric checking on only a small set of correspondence candidates due to efficiency concerns, which may restrict recall. We aim at tackling the two issues by integrating adaptive descriptor selection and progressive candidate enrichment into image matching. We consider the two integrated components complementary: the high-quality matching yielded by adaptively selected descriptors helps in exploring more plausible candidates, while the enriched candidate set serves as a better reference for descriptor selection. This motivates us to formulate image matching as a joint optimization problem, in which adaptive descriptor selection and progressive correspondence enrichment are alternately conducted. Our approach is comprehensively evaluated and compared with state-of-the-art approaches on two benchmarks. The promising results demonstrate its effectiveness.
    Attached files: ??.pdf
    This paper addresses the problem of Face Alignment for a single image. We show how an ensemble of regression trees can be used to estimate the face's landmark positions directly from a sparse subset of pixel intensities, achieving super-realtime performance with high-quality predictions. We present a general framework based on gradient boosting for learning an ensemble of regression trees that optimizes the sum-of-squared-error loss and naturally handles missing or partially labelled data. We show how using appropriate priors exploiting the structure of image data helps with efficient feature selection. Different regularization strategies and their importance in combating overfitting are also investigated. In addition, we analyse the effect of the quantity of training data on the accuracy of the predictions and explore the effect of data augmentation using synthesized data.
    Attached files: Kazemi_One_Millisecond_Face_2014_CVPR_paper.pdf
    Convolutional network techniques have recently achieved great success in vision-based detection tasks. This paper introduces the recent development of our research on transplanting the fully convolutional network technique to detection tasks on 3D range scan data. Specifically, the scenario is set as the vehicle detection task from the range data of a Velodyne 64E lidar. We propose to present the data in a 2D point map and use a single 2D end-to-end fully convolutional network to predict the objectness confidence and the bounding boxes simultaneously. By carefully designing the bounding box encoding, it is possible to predict full 3D bounding boxes even using a 2D convolutional network. Experiments on the KITTI dataset show the state-of-the-art performance of the proposed method.
    Background subtraction is usually based on low-level or hand-crafted features such as raw color components, gradients, or local binary patterns. As an improvement, we present a background subtraction algorithm based on spatial features learned with convolutional neural networks (ConvNets). Our algorithm uses a background model reduced to a single background image and a scene-specific training dataset to feed ConvNets that prove able to learn how to subtract the background from an input image patch. Experiments conducted on the 2014 ChangeDetection.net dataset show that our ConvNet-based algorithm at least reproduces the performance of state-of-the-art methods, and that it even outperforms them significantly when scene-specific knowledge is considered.
    Attached files: BS with scene specific.pdf
    We propose a novel object localization methodology with the purpose of boosting the localization accuracy of state-of-the-art object detection systems. Our model, given a search region, aims at returning the bounding box of an object of interest inside this region. To accomplish its goal, it relies on assigning conditional probabilities to each row and column of this region, where these probabilities provide useful information regarding the location of the boundaries of the object inside the search region and allow the accurate inference of the object bounding box under a simple probabilistic framework. For implementing our localization model, we make use of a convolutional neural network architecture that is properly adapted for this task, called LocNet. We show experimentally that LocNet achieves a very significant improvement on the mAP for high IoU thresholds on the PASCAL VOC2007 test set and that it can be very easily coupled with recent state-of-the-art object detection systems, helping them to boost their performance. Finally, we demonstrate that our detection approach can achieve high detection accuracy even when it is given as input a set of sliding windows, thus proving that it is independent of box proposal methods.
    Attached files: Gidaris_LocNet_Improving_Localization_CVPR_2016_paper.pdf
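    The inference step described above can be sketched for the simplest case: given a per-row and per-column probability that each row/column lies inside the object, the box is the extent of the rows and columns whose probability clears a threshold. This mirrors LocNet's "in-out" formulation in spirit; the threshold and the exact decision rule are illustrative assumptions.

```python
import numpy as np

def infer_box(p_rows, p_cols, thresh=0.5):
    """p_rows[i] / p_cols[j]: probability that row i / column j of the
    search region lies inside the object. Returns (x0, y0, x1, y1), the
    extent of rows and columns exceeding the threshold, or None if no
    row or column qualifies."""
    rows = np.where(p_rows > thresh)[0]
    cols = np.where(p_cols > thresh)[0]
    if rows.size == 0 or cols.size == 0:
        return None
    return cols.min(), rows.min(), cols.max(), rows.max()

# a toy 4-row by 5-column search region
p_rows = np.array([0.1, 0.9, 0.8, 0.2])
p_cols = np.array([0.05, 0.7, 0.9, 0.95, 0.1])
box = infer_box(p_rows, p_cols)
```

    Factoring the box into independent row and column distributions is what makes this inference trivial compared to regressing four coordinates jointly.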
    Matching pedestrians across multiple camera views, known as human re-identification (re-ID), is a challenging problem in visual surveillance. In existing works concentrating on feature extraction, representations are formed locally and independently of other regions. We present a novel siamese Long Short-Term Memory (LSTM) architecture that can process image regions sequentially and enhance the discriminative capability of local feature representations by leveraging contextual information. The feedback connections and internal gating mechanism of the LSTM cells enable our model to memorize spatial dependencies and selectively propagate relevant contextual information through the network. We demonstrate improved performance compared to a baseline algorithm with no LSTM units and promising results compared to state-of-the-art methods on the Market-1501, CUHK03, and VIPeR datasets. Visualization of the internal mechanism of the LSTM cells shows that meaningful patterns can be learned by our method.
    Attached files: 1607.08381.pdf
    Research on video analysis for fire detection has become a hot topic in computer vision. However, conventional algorithms rely exclusively on rule-based models and hand-crafted feature vectors to classify whether a frame contains fire. These features are difficult to define and depend largely on the kind of fire observed, which leads to low detection rates and high false-alarm rates. A different approach to this problem is to use a learning algorithm to extract useful features instead of relying on an expert to build them. In this paper, we propose a convolutional neural network (CNN) for identifying fire in videos. Convolutional neural networks have been shown to perform very well in the area of object classification, and this network has the ability to perform feature extraction and classification within the same architecture. Tested on real video sequences, the proposed approach achieves better classification performance than some relevant conventional video fire detection methods, indicating that using CNNs to detect fire in videos is very promising.
    Attached files: paper.pdf
    We present a block-wise approach to detect stationary objects based on spatio-temporal change detection. First, block candidates are extracted by filtering out consecutive blocks containing moving objects. Then, an online clustering approach groups similar blocks at each spatial location over time via the statistical variation of pixel ratios. Stability changes are identified by analyzing the relationships between the most repeated clusters at regular sampling instants. Finally, stationary objects are detected as those stability changes that exceed an alarm time and have not been visualized before. Unlike previous approaches making use of Background Subtraction, the proposed approach does not require foreground segmentation and provides robustness to illumination changes, crowds, and intermittent object motion. The experiments over a heterogeneous dataset demonstrate the ability of the proposed approach for short- and long-term operation while overcoming challenging issues.
    Text displayed in a video is an essential part of the high-level semantic information of the video content. Therefore, video text can be used as a valuable source for automated video indexing in digital video libraries. In this paper, we propose a workflow for video text detection and recognition. In the text detection stage, we have developed a fast localization-verification scheme, in which an edge-based multi-scale text detector first identifies potential text candidates with a high recall rate. Then, detected candidate text lines are refined by using an image entropy-based filter. Finally, Stroke Width Transform (SWT)- and Support Vector Machine (SVM)-based verification procedures are applied to eliminate the false alarms. For text recognition, we have developed a novel skeleton-based binarization method in order to separate text from complex backgrounds and make it processable by standard OCR (Optical Character Recognition) software. The operability and accuracy of the proposed text detection and binarization methods have been evaluated using publicly available test datasets.
    Attached files: art%3A10.1007%2Fs11042-012-1250-6.pdf
    This article tackles the problem of estimating non-rigid human 3D shape and motion from image sequences taken by uncalibrated cameras. Similar to other state-of-the-art solutions, we factorize 2D observations into camera parameters, base poses, and mixing coefficients. Existing methods require sufficient camera motion during the sequence to achieve a correct 3D reconstruction. To obtain convincing 3D reconstructions from arbitrary camera motion, our method relies on base poses trained a priori. We show that strong periodicity assumptions on the coefficients can be used to define an efficient and accurate algorithm for estimating periodic motion such as walking patterns. For the extension to non-periodic motion, we propose a novel regularization term based on temporal bone-length constancy. In contrast to other works, the proposed method does not use a predefined skeleton or anthropometric constraints and can handle arbitrary camera motion. We achieve convincing 3D reconstructions, even under the influence of noise and occlusions. Multiple experiments based on a 3D error metric demonstrate the stability of the proposed method. Compared to other state-of-the-art methods, our algorithm shows a significant improvement.
    Attached files: pami_final.pdf