Seminars in 2016
    Falls are one of the major causes leading to injury of elderly people. Using wearable devices for fall detection has a high cost and may cause inconvenience to the daily lives of the elderly. In this paper, we present an automated fall detection approach that requires only a low-cost depth camera. Our approach combines two computer vision techniques, shape-based fall characterization and a learning-based classifier, to distinguish falls from other daily actions. Given a fall video clip, we extract curvature scale space (CSS) features of human silhouettes at each frame and represent the action by a bag of CSS words (BoCSS). Then, we utilize the extreme learning machine (ELM) classifier to identify the BoCSS representation of a fall from those of other actions. In order to eliminate the sensitivity of ELM to its hyperparameters, we present a variable-length particle swarm optimization algorithm to optimize the number of hidden neurons, corresponding input weights, and biases of ELM. Using a low-cost Kinect depth camera, we build an action dataset that consists of six types of actions (falling, bending, sitting, squatting, walking, and lying) from ten subjects. Experimenting with the dataset shows that our approach can achieve up to 91.15% sensitivity, 77.14% specificity, and 86.83% accuracy. On a public dataset, our approach performs comparably to state-of-the-art fall detection methods that need multiple cameras.
    Attached files: se3lect 1.....pdf
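As a reading aid for the fall-detection abstract above, the three reported figures (sensitivity, specificity, accuracy) follow from the confusion counts of a binary fall/non-fall classifier. The sketch below shows the standard definitions; the counts in the example are hypothetical and are not taken from the paper.

```python
def detection_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Return sensitivity, specificity, and accuracy from confusion counts.

    tp: falls detected as falls        fn: falls that were missed
    tn: other actions rejected         fp: other actions flagged as falls
    """
    sensitivity = tp / (tp + fn)  # fraction of real falls that were detected
    specificity = tn / (tn + fp)  # fraction of non-fall actions correctly rejected
    accuracy = (tp + tn) / (tp + fn + tn + fp)  # fraction of all clips labelled correctly
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
    }


# Hypothetical counts, chosen only to illustrate the arithmetic.
metrics = detection_metrics(tp=90, fn=10, tn=70, fp=30)
print(metrics)
```

A detector tuned for high sensitivity, as in the abstract, accepts more false alarms (lower specificity) in exchange for missing fewer falls.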
    Feature trajectories have been shown to be efficient for representing videos. Typically, they are extracted using the KLT tracker or by matching SIFT descriptors between frames. However, the quality as well as the quantity of these trajectories is often not sufficient. Inspired by the recent success of dense sampling in image classification, we propose an approach to describe videos by dense trajectories. We sample dense points from each frame and track them based on displacement information from a dense optical flow field. Given a state-of-the-art optical flow algorithm, our trajectories are robust to fast irregular motions as well as shot boundaries. Additionally, dense trajectories cover the motion information in videos well. We also investigate how to design descriptors to encode the trajectory information. We introduce a novel descriptor based on motion boundary histograms, which is robust to camera motion. This descriptor consistently outperforms other state-of-the-art descriptors, in particular in uncontrolled realistic videos. We evaluate our video description in the context of action classification with a bag-of-features approach. Experimental results show a significant improvement over the state of the art on four datasets of varying difficulty, i.e., KTH, YouTube, Hollywood2 and UCF sports.
    Attached files: wang_cvpr11.pdf
    We investigate the problem of estimating the 3D shape of an object, given a set of 2D landmarks in a single image. To alleviate the reconstruction ambiguity, a widely used approach is to confine the unknown 3D shape within a shape space built upon existing shapes. While this approach has proven to be successful in various applications, a challenging issue remains: the joint estimation of shape parameters and camera-pose parameters requires solving a non-convex optimization problem. The existing methods often adopt an alternating minimization scheme to locally update the parameters, and consequently the solution is sensitive to initialization. In this paper, we propose a convex formulation to address this problem and develop an efficient algorithm to solve the proposed convex program. We demonstrate the exact recovery property of the proposed method, its merits compared to alternative methods, and the applicability in human pose and car shape estimation.
    Attached files: 3D Shape Estimation from 2D Landmarks.pdf
    Background subtraction is a widely used technique for detecting moving objects in image sequences. Very often, background subtraction approaches assume the availability of one or more clear frames (i.e., frames without foreground objects) at the beginning of the input sequence. However, this assumption is not always true, especially when dealing with dynamic backgrounds or crowded scenes. In this paper, we present the results of a multi-modal background modeling method that is able to generate a reliable initial background model even if no clear frames are available. The proposed algorithm runs in real time on HD images. Quantitative experiments have been conducted taking into account six different quality metrics on a set of 14 publicly available image sequences. The obtained results demonstrate high accuracy in generating the background model in comparison with several other methods.
    Attached files: parallel multi-model background modeling.pdf
    Real-time automated vehicle make and model recognition (VMMR) based on a bag of speeded-up robust features (BoSURF). SURF features are extracted from front- or rear-facing vehicle images, and the dominant characteristic features (codewords) are retained in a dictionary. Two dictionary schemes are considered: a single dictionary and a modular dictionary. The SURF features are then converted to BoSURF histograms, which are classified using either a single multiclass SVM or an ensemble of multiclass SVMs based on attribute bagging.
    Attached files: Real-Time Vehicle Make and Model Recognition.pdf 20161126-report-Yang Yu.pptx
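The step from local features to a bag-of-words histogram, mentioned in the VMMR entry above (and in the BoCSS and bag-of-features entries earlier), can be sketched as follows. This is a minimal illustration under stated assumptions: the 2-D "descriptors" and the three-entry codebook are toy values (real SURF descriptors are 64- or 128-dimensional), and nearest-codeword assignment with Euclidean distance is assumed rather than taken from the paper.

```python
import math


def nearest_codeword(descriptor, codebook):
    """Index of the codeword closest to the descriptor (Euclidean distance)."""
    return min(range(len(codebook)), key=lambda i: math.dist(descriptor, codebook[i]))


def bag_of_words_histogram(descriptors, codebook):
    """Normalized histogram of codeword occurrences for one image."""
    counts = [0] * len(codebook)
    for descriptor in descriptors:
        counts[nearest_codeword(descriptor, codebook)] += 1
    total = sum(counts)
    # An image with no descriptors maps to an all-zero histogram.
    return [c / total for c in counts] if total else [0.0] * len(codebook)


# Toy codebook (dictionary) and toy descriptors from one hypothetical image.
codebook = [(0.0, 0.0), (1.0, 1.0), (4.0, 0.0)]
descriptors = [(0.1, -0.1), (0.9, 1.2), (1.1, 0.8), (3.7, 0.2)]

histogram = bag_of_words_histogram(descriptors, codebook)
print(histogram)
```

The resulting fixed-length histogram is what a classifier such as a multiclass SVM would receive, regardless of how many local features the image produced.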
    Today, there are two major paradigms for vision-based autonomous driving systems: mediated perception approaches that parse an entire scene to make a driving decision, and behavior reflex approaches that directly map an input image to a driving action by a regressor. In this paper, we propose a third paradigm: a direct perception approach to estimate the affordance for driving. We propose to map an input image to a small number of key perception indicators that directly relate to the affordance of a road/traffic state for driving. Our representation provides a set of compact yet complete descriptions of the scene to enable a simple controller to drive autonomously. Falling in between the two extremes of mediated perception and behavior reflex, we argue that our direct perception representation provides the right level of abstraction. To demonstrate this, we train a deep Convolutional Neural Network using recordings from 12 hours of human driving in a video game and show that our model can work well to drive a car in a very diverse set of virtual environments. We also train a model for car distance estimation on the KITTI dataset. Results show that our direct perception approach can generalize well to real driving images. Source code and data are available on our project website.
    Attached files: DeepDriving-Learning Affordance for Direct Perception in Autonomous Driving.pdf
    Person re-identification is the problem of recognizing people across images or videos from non-overlapping views. Although there has been much progress in person re-identification for the last decade, it still remains a challenging task because of severe appearance changes of a person due to diverse camera viewpoints and person poses. In this paper, we propose a novel framework for person re-identification by analyzing camera viewpoints and person poses, so-called Pose-aware Multi-shot Matching (PaMM), which robustly estimates target poses and efficiently conducts multi-shot matching based on the target pose information. Experimental results using public person re-identification datasets show that the proposed methods are promising for person re-identification under diverse viewpoints and pose variances.
    Attached files: Cho_Improving_Person_Re-Identification_CVPR_2016_paper.pdf
    We propose an integer programming method for estimating the instantaneous count of pedestrians crossing a line of interest (LOI) in a video sequence. Through a line sampling process, the video is first converted into a temporal slice image. Next, the number of people is estimated in a set of overlapping sliding windows on the temporal slice image using a regression function that maps from local features to a count. Given that the count in a sliding window is the sum of the instantaneous counts in the corresponding time interval, an integer programming method is proposed to recover the number of pedestrians crossing the LOI in each frame. Integrating over a specific time interval yields the cumulative count of pedestrians crossing the line. Compared with current methods for line counting, our proposed approach achieves state-of-the-art performance on several challenging crowd video data sets.
    Attached files: 20161105-Saturday Seminar-Wahyono.pdf
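The relation used in the line-counting abstract above, that each sliding-window count is the sum of the instantaneous counts it covers, can be illustrated with a toy integer recovery. This is only a sketch under stated assumptions: the window counts here are noise-free (in the paper they come from a regression function), windows are allowed to overhang the sequence ends, and the integer program is solved by exhaustive search, which is feasible only for a handful of frames. All function names are hypothetical.

```python
from itertools import product


def window_counts(instant, width):
    """Count in each sliding window = sum of the instantaneous counts it covers.

    Windows may overhang the sequence ends (zero padding), so there are
    len(instant) + width - 1 windows in total.
    """
    n = len(instant)
    padded = [0] * (width - 1) + list(instant) + [0] * (width - 1)
    return [sum(padded[k:k + width]) for k in range(n + width - 1)]


def recover_counts(observed, n_frames, width, max_per_frame=3):
    """Brute-force integer program: find non-negative integer per-frame counts
    whose window sums are closest (L1 error) to the observed window counts."""
    best, best_err = None, None
    for candidate in product(range(max_per_frame + 1), repeat=n_frames):
        predicted = window_counts(candidate, width)
        err = sum(abs(o - p) for o, p in zip(observed, predicted))
        if best_err is None or err < best_err:
            best, best_err = list(candidate), err
    return best


true_counts = [0, 2, 1, 0, 1]          # pedestrians crossing the line per frame
observed = window_counts(true_counts, width=2)
recovered = recover_counts(observed, n_frames=5, width=2)
cumulative = sum(recovered)            # cumulative count over the interval
print(observed, recovered, cumulative)
```

A practical implementation would replace the exhaustive search with an integer programming solver, since the search space grows exponentially with the number of frames.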
    This paper deals with Korean-English bilingual videotext recognition for news headline generation. Because videotext contains semantic content information, it can be effectively used for understanding videos. Despite its usefulness, it is a challenging task to apply text recognition technologies to practical video applications because of the computational complexity and recognition accuracy. In this paper, we propose a novel Korean-English bilingual videotext recognition method to overcome the computational complexity as well as achieve comparable recognition accuracy. To recognize both Korean and English characters effectively, the proposed method employs an elaborate split-merge strategy in which the split segments are merged into characters using the recognition scores. Moreover, it avoids unnecessary computation using geometric features such as squareness and internal gap, and thus its computational overhead is remarkably reduced. Therefore, the proposed method is successfully employed in generating news headlines. The effectiveness and efficiency of the proposed method are verified by extensive experiments on a challenging database containing 51,290 text images (176,884 characters).
    Attached files: Korean english bilingual.pdf
    We present a new, massively parallel method for high-quality multiview matching. Our work builds on the Patchmatch idea: starting from randomly generated 3D planes in scene space, the best-fitting planes are iteratively propagated and refined to obtain a 3D depth and normal field per view, such that a robust photo-consistency measure over all images is maximized. Our main novelties are on the one hand to formulate Patchmatch in scene space, which makes it possible to aggregate image similarity across multiple views and obtain more accurate depth maps. And on the other hand a modified, diffusion-like propagation scheme that can be massively parallelized and delivers dense multiview correspondence over ten 1.9-Megapixel images in 3 seconds, on a consumer-grade GPU. Our method uses a slanted support window and thus has no fronto-parallel bias; it is completely local and parallel, such that computation time scales linearly with image size, and inversely proportional to the number of parallel threads. Furthermore, it has low memory footprint (four values per pixel, independent of the depth range). It therefore scales exceptionally well and can handle multiple large images at high depth resolution. Experiments on the DTU and Middlebury multiview datasets as well as oblique aerial images show that our method achieves very competitive results with high accuracy and completeness, across a range of different scenarios.
    Although there has long been interest in foreground-background segmentation based on change detection for video surveillance applications, the issue of inconsistent performance across different scenarios remains a serious concern. To address this, we propose a new type of word-based approach that regulates its own internal parameters using feedback mechanisms to withstand difficult conditions while keeping sensitivity intact in regular situations. Coined "PAWCS", this method's key advantages lie in its highly persistent and robust dictionary model based on color and local binary features as well as its ability to automatically adjust pixel-level segmentation behavior. Experiments using the 2012 ChangeDetection.net dataset show that it outranks numerous recently proposed solutions in terms of overall performance as well as in each category. A complete C++ implementation based on OpenCV is available online.
    Attached files: 07045991 (1).pdf
    3D reconstruction is one of the most popular research areas in computer vision and computer graphics, and it is widely used in fields such as video games and animation. It obtains a 3D model from 2D images. Using this technology, we can reproduce a scene, observe the model stereoscopically from any viewpoint, and perceive the world well. In this paper, we use technologies such as point cloud building and surface reconstruction to obtain the visual hull. To make the visual hull look more vivid and natural, adding texture is necessary. This research shows that the proposed solution has advantages such as feasibility and ease of reconstruction.
    Attached files: 1_XXXVIII-part3B.pdf
    This paper addresses the problem of detecting people in two dimensional range scans. Previous approaches have mostly used pre-defined features for the detection and tracking of people. We propose an approach that utilizes a supervised learning technique to create a classifier that facilitates the detection of people. In particular, our approach applies AdaBoost to train a strong classifier from simple features of groups of neighboring beams corresponding to legs in range data. Experimental results carried out with laser range data illustrate the robustness of our approach even in cluttered office environments.
    Attached files: arrasICRA07.pdf
    Background modeling and subtraction is a classical topic in computer vision. Gaussian mixture modeling (GMM) is a popular choice for its capability of adapting to background variations. Many improvements have been made to enhance the robustness by considering spatial consistency and temporal correlation. In this paper, we propose a sharable GMM based background subtraction approach. Firstly, a sharable mechanism is presented to model the many-to-one relationship between pixels and models. Each pixel dynamically searches for the best matched model in its neighborhood. This kind of space sharing is robust to camera jitter, dynamic background, etc. Secondly, the sharable models are built for both background and foreground. The noise caused by small local movements can be effectively eliminated through the background sharable models, while the integrity of moving objects is enhanced by the foreground sharable models, especially for small objects. Finally, each sharable model is updated by randomly selecting a pixel that matches this model, and a flexible mechanism is added for switching between background and foreground models. Experiments on the ChangeDetection benchmark dataset demonstrate the effectiveness of our approach.
    Attached files: 07177419.pdf
    This paper aims to deal with real-time traffic sign recognition, i.e., localizing what type of traffic sign appears in which area of an input image at a fast processing time. To achieve this goal, we first propose an extremely fast detection module, which is 20 times faster than the existing best detection module. Our detection module is based on traffic sign proposal extraction and classification built upon a color probability model and a color HOG. Then, we use a convolutional neural network to further classify the detected signs into their subclasses within each superclass.
    Attached files: Towards Real-Time Traffic Sign Detection.pdf
    Hierarchical neural networks have been shown to be effective in learning representative image features and recognizing object classes. However, most existing networks combine the low/middle level cues for classification without accounting for any spatial structures. For applications such as understanding a scene, how the visual cues are spatially distributed in an image becomes essential for successful analysis. This paper extends the framework of deep neural networks by accounting for the structural cues in the visual signals. In particular, two kinds of neural networks have been proposed. First, we develop a multitask deep convolutional network, which simultaneously detects the presence of the target and the geometric attributes (location and orientation) of the target with respect to the region of interest. Second, a recurrent neuron layer is adopted for structured visual detection. The recurrent neurons can deal with the spatial distribution of visible cues belonging to an object whose shape or structure is difficult to explicitly define. Both the networks are demonstrated by the practical task of detecting lane boundaries in traffic scenes. The multitask convolutional neural network provides auxiliary geometric information to help the subsequent modeling of the given lane structures. The recurrent neural network automatically detects lane boundaries, including those areas containing no marks, without any explicit prior knowledge or secondary modeling.
    Attached files: Deep Neural Network for Structural Prediction and lane detection in traffic scene.pdf
    Observing that text in virtually any script is formed of strokes, we propose a novel easy-to-implement stroke detector based on an efficient pixel intensity comparison to surrounding pixels. Stroke-specific keypoints are efficiently detected and text fragments are subsequently extracted by local thresholding guided by keypoint properties. Classification based on effectively calculated features eliminates non-text regions. The stroke-specific keypoints produce half as many region segmentations yet still detect 25% more characters than the commonly exploited MSER detector, and the process is 4 times faster. After a novel efficient classification step, the number of regions is reduced to one seventh of that of the standard method, and the process is still almost 3 times faster. All stages of the proposed pipeline are scale- and rotation-invariant and support a wide variety of scripts (Latin, Hebrew, Chinese, etc.) and fonts. When the proposed detector is plugged into a scene text localization and recognition pipeline, a state-of-the-art text localization accuracy is maintained whilst the processing time is significantly reduced.
    Attached files: 20160820-Saturday Seminar-Wahyono.pdf
    In this paper we propose a novel recurrent neural network architecture for video-based person re-identification. Given the video sequence of a person, features are extracted from each frame using a convolutional neural network that incorporates a recurrent final layer, which allows information to flow between time-steps. The features from all time-steps are then combined using temporal pooling to give an overall appearance feature for the complete sequence. The convolutional network, recurrent layer, and temporal pooling layer are jointly trained to act as a feature extractor for video-based re-identification using a Siamese network architecture. Our approach makes use of colour and optical flow information in order to capture appearance and motion information which is useful for video re-identification. Experiments are conducted on the iLIDS-VID and PRID-2011 datasets to show that this approach outperforms existing methods of video-based re-identification.
    Attached files: McLaughlin_Recurrent_Convolutional_Network_CVPR_2016_paper.pdf
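The temporal pooling step described in the re-identification abstract above combines per-time-step features into one sequence-level feature. The sketch below assumes mean pooling and a Euclidean distance for comparing two sequences; both choices, the function names, and the tiny 3-dimensional features are illustrative assumptions, not details confirmed by the abstract.

```python
import math


def temporal_mean_pool(frame_features):
    """Average per-time-step feature vectors into one sequence-level vector."""
    if not frame_features:
        raise ValueError("need at least one time-step")
    n_steps = len(frame_features)
    # zip(*...) groups the values of each feature dimension across time-steps.
    return [sum(values) / n_steps for values in zip(*frame_features)]


def sequence_distance(seq_a, seq_b):
    """Euclidean distance between the pooled features of two sequences."""
    return math.dist(temporal_mean_pool(seq_a), temporal_mean_pool(seq_b))


# Hypothetical per-frame features for two short video sequences.
sequence_a = [[1.0, 0.0, 2.0], [3.0, 2.0, 2.0], [2.0, 4.0, 2.0]]
sequence_b = [[2.0, 2.0, 2.0], [2.0, 2.0, 6.0]]

pooled_a = temporal_mean_pool(sequence_a)
distance = sequence_distance(sequence_a, sequence_b)
print(pooled_a, distance)
```

Pooling makes the sequence feature independent of the number of frames, which is why two sequences of different lengths can be compared directly.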
    Maintaining a normal burning temperature is essential to ensuring the quality of nonferrous metals and cement clinker in a rotary kiln. Recognition of the temperature condition is an important component of a temperature control system. Because of the interference of smoke and dust in the kiln, the temperature of the burning zone is difficult to measure accurately using traditional methods. Focusing on blurry images from which only the flame region can be segmented, an image recognition system for the detection of the temperature condition in a rotary kiln is presented. First, the flame region is segmented employing a region-growing method with a dynamic seed point. Seven features, comprising three luminous features and four dynamic features, are then extracted from the flame region. Dynamic features constructed from luminous feature sequences are proposed to overcome the problem of mis-recognition when the temperature of the flame region changes rapidly. Finally, classifiers are trained to recognize the temperature state of the burning zone using its features. Experimental results using real datasets demonstrate that the proposed image-based systems for recognizing the temperature condition are effective and robust.
    Attached files: reference_seminar_fire_temperature.pdf
    We present a set of experiments with a video OCR system (VOCR) tailored for video information retrieval and establish its importance in multimedia search in general and for some specific queries in particular. The system, inspired by an existing work on text detection and recognition in images, has been developed using techniques involving detailed analysis of video frames producing candidate text regions. The text regions are then binarized and sent to a commercial OCR engine, resulting in ASCII text, which is finally used to create search indexes. The system is evaluated using the TRECVID data. We compare the system's performance from an information retrieval perspective with another VOCR developed using multi-frame integration and empirically demonstrate that deep analysis on individual video frames results in better video retrieval. We also evaluate the effect of various textual sources on multimedia retrieval by combining the VOCR outputs with automatic speech recognition (ASR) transcripts. For general search queries, the VOCR system coupled with ASR sources outperforms the other system by a very large margin. For search queries that involve named entities, especially people's names, the VOCR system even outperforms speech transcripts, demonstrating that source selection for particular query types is essential.
    Attached files: VOCR_draft.pdf
    This paper approximates the 3D geometry of a scene by a small number of 3D planes. The method is especially suited to man-made scenes, and only requires two calibrated wide-baseline views as inputs. It relies on the computation of a dense but noisy 3D point cloud, as for example obtained by matching DAISY descriptors [35] between the views. It then segments one of the two reference images, and adopts a multi-model fitting process to assign a 3D plane to each region, when the region is not detected as occluded. A pool of 3D plane hypotheses is first derived from the 3D point cloud, to include planes that reasonably approximate the part of the 3D point cloud observed from each reference view between randomly selected triplets of 3D points. The hypothesis-to-region assignment problem is then formulated as an energy-minimization problem, which simultaneously optimizes an original data-fidelity term, the assignment smoothness over neighboring regions, and the number of assigned planar proxies. The synthesis of intermediate viewpoints demonstrates the effectiveness of our 3D reconstruction, and thereby the relevance of our proposed data-fidelity metric.
    Attached files: CVPR2016_Piecewise-planar 3D approximateion from wide-baseline stereo.pdf
    In this paper, we present an in-vehicle computing system capable of localizing lane markings and communicating them to drivers. To the best of our knowledge, this is the first system that combines the Maximally Stable Extremal Region (MSER) technique with the Hough transform to detect and recognize lane markings (i.e., lines and pictograms). Our system begins by localizing the region of interest using the MSER technique. A three-stage refinement algorithm is then introduced to enhance the results of MSER and to filter out undesirable information such as trees and vehicles. To meet the requirements of real-time systems, the Progressive Probabilistic Hough Transform (PPHT) is used in the detection stage to detect line markings. Next, the color and the form of line markings are recognized; this is based on the results of applying MSER to the left and right line markings. The recognition of High-Occupancy Vehicle pictograms is performed using a new algorithm, based on the results of MSER regions. In the tracking stage, a Kalman filter is used to track both ends of each detected line marking. Several experiments are conducted to show the efficiency of our system.
    Attached files: A real-time lane marking localization, tracking and communication system.pdf
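The tracking stage in the lane-marking abstract above uses a Kalman filter on both ends of each detected line. A minimal sketch of that idea follows, assuming a scalar constant-position model applied independently to each image coordinate of an endpoint; the abstract does not state the motion model or noise settings, so the class names and the values of `q` and `r` are hypothetical.

```python
class ScalarKalman:
    """1-D Kalman filter with a constant-position (random walk) model."""

    def __init__(self, x0, p0=1.0, q=0.01, r=1.0):
        self.x = x0  # state estimate (one coordinate of an endpoint)
        self.p = p0  # variance of the estimate
        self.q = q   # process noise variance (assumed value)
        self.r = r   # measurement noise variance (assumed value)

    def update(self, z):
        # Predict: the state is assumed unchanged, its uncertainty grows by q.
        p_pred = self.p + self.q
        # Correct: blend the prediction and the measurement using the gain.
        gain = p_pred / (p_pred + self.r)
        self.x = self.x + gain * (z - self.x)
        self.p = (1.0 - gain) * p_pred
        return self.x


class EndpointTracker:
    """Tracks one line endpoint (x, y) with two independent scalar filters."""

    def __init__(self, x, y):
        self.fx = ScalarKalman(x)
        self.fy = ScalarKalman(y)

    def update(self, x, y):
        return self.fx.update(x), self.fy.update(y)


# Hypothetical detections of one endpoint over five frames (pixels).
tracker = EndpointTracker(90.0, 200.0)
estimates = [tracker.update(100.0, 200.0) for _ in range(5)]
print(estimates)
```

Each update moves the estimate part of the way toward the new detection, so a single noisy detection shifts the tracked endpoint less than it would without filtering.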
    Vehicle lane-level localization is a fundamental technology in autonomous driving. To achieve accurate and consistent performance, a common approach is to use LIDAR technology. However, it is expensive and computationally demanding, and thus not a practical solution in many situations. This paper proposes a stereovision system, which is of low cost, yet also able to achieve high accuracy and consistency. It integrates a new lane line detection algorithm with other lane marking detectors to effectively identify the correct lane line markings. It also fits multiple road models to improve accuracy. An effective stereo 3D reconstruction method is proposed to estimate vehicle localization. The estimation consistency is further guaranteed by a new particle filter framework, which takes vehicle dynamics into account. Experimental results based on image sequences taken under different visual conditions showed that the proposed system can identify the lane line markings with 98.6% accuracy. The maximum estimation error of the vehicle distance to lane lines is 16 cm in daytime and 26 cm at night, and the maximum estimation error of its moving direction with respect to the road tangent is 0.06 rad in daytime and 0.12 rad at night. Due to its high accuracy and consistency, the proposed system can be implemented in autonomous driving vehicles as a practical solution to vehicle lane-level localization.
    This paper presents a stereo matching approach for a novel multi-perspective panoramic stereo vision system, making use of asynchronous and non-simultaneous stereo imaging towards real-time 3D 360° vision. The method is designed for events representing the scene's visual contrast as a sparse visual code allowing the stereo reconstruction of high resolution panoramic views. We propose a novel cost measure for the stereo matching, which makes use of a similarity measure based on event distributions. Thus, the robustness to variations in event occurrences was increased. An evaluation of the proposed stereo method is presented using distance estimation of panoramic stereo views and ground truth data. Furthermore, our approach is compared to standard stereo methods applied on event-data. Results show that we obtain 3D reconstructions of 1024 × 3600 round views and outperform depth reconstruction accuracy of state-of-the-art methods on event data.
    Attached files: Schraml_Event-Driven_Stereo_Matching_2015_CVPR_paper.pdf
    Recurrence of small image patches across different scales of a natural image has been previously used for solving ill-posed problems (e.g., superresolution from a single image). In this paper we show how this multi-scale property can also be used for "blind-deblurring", namely, removal of an unknown blur from a blurry image. While patches repeat "as is" across scales in a sharp natural image, this cross-scale recurrence significantly diminishes in blurry images. We exploit these deviations from ideal patch recurrence as a cue for recovering the underlying (unknown) blur kernel. More specifically, we look for the blur kernel k, such that if its effect is "undone" (if the blurry image is deconvolved with k), the patch similarity across scales of the image will be maximized. We report extensive experimental evaluations, which indicate that our approach compares favorably to state-of-the-art blind deblurring methods, and in particular, is more robust than them.
    Attached files: BlindDeblurring_ECCV2014.pdf
    In this paper, we propose a Multiple Background Model based Background Subtraction (MB2S) algorithm that is robust against sudden illumination changes in indoor environments. It uses multiple background models of expected illumination changes, followed by both pixel-based and frame-based background subtraction in both the RGB and YCbCr color spaces. The masks generated after processing these input images are then combined in a framework to classify background and foreground pixels. Evaluation of the proposed approach on publicly available test sequences shows higher precision and recall than other state-of-the-art algorithms.
    This paper makes three novel contributions: 1) We use the joint activities of four Gabor filters and a confidence measure to speed up the process of texture orientation estimation. 2) The chance of misidentification and the computational complexity of the algorithm are reduced by using a particle filter, which limits the vanishing point search range and reduces the number of pixels to be voted. The algorithm combines the peakedness measure of the vote accumulator space with the displacements of the moving average of observations to regulate the distribution of vanishing point candidates. 3) Attributed to the design of a noise-insensitive observation model,
    Attached files: 20160528-report-Yang Yu.pptx Fast and Robust Vanishing Point Detection for Unstructured Road Following.pdf
    This paper proposes a new abandoned object detection system for road traffic surveillance, based on three-dimensional image information, to prevent traffic accidents. A novel Binocular Information Reconstruction and Recognition (BIRR) algorithm is presented to implement the idea. As initial detection, suspected abandoned objects are detected by the proposed static foreground region segmentation algorithm based on surveillance video from a monocular camera. After detection of suspected abandoned objects, three-dimensional (3D) information of the suspected abandoned object is reconstructed by the proposed 3D object information reconstruction theory using images from a binocular camera. To determine whether the detected object is hazardous to normal road traffic, the road plane equation and the height of the suspected abandoned object are calculated based on the three-dimensional information. Experimental results show that this system achieves fast detection of abandoned objects and can be used for road traffic monitoring and public area surveillance.
    Attached files: sensors-15-06885.pdf
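The last step of the abandoned-object abstract above, computing an object's height from the road plane equation, reduces to a point-to-plane distance once a 3D point on the object has been reconstructed. The sketch below shows that geometry only; the plane coefficients, the sample point, and the `min_height` threshold are hypothetical and are not taken from the paper.

```python
import math


def height_above_plane(point, plane):
    """Perpendicular distance from a 3D point to the plane a*x + b*y + c*z + d = 0."""
    a, b, c, d = plane
    x, y, z = point
    norm = math.sqrt(a * a + b * b + c * c)
    if norm == 0.0:
        raise ValueError("plane normal must be non-zero")
    return abs(a * x + b * y + c * z + d) / norm


def is_hazard(point, plane, min_height=0.15):
    """Flag an object whose top point rises at least min_height above the road.

    min_height (metres) is an assumed threshold for illustration only.
    """
    return height_above_plane(point, plane) >= min_height


# Hypothetical road plane y = 0 and the top point of a suspected object.
road_plane = (0.0, 1.0, 0.0, 0.0)
object_top = (2.0, 0.3, 10.0)

height = height_above_plane(object_top, road_plane)
print(height, is_hazard(object_top, road_plane))
```

Using the perpendicular distance rather than a raw coordinate keeps the measure valid when the camera is tilted relative to the road surface.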
    This paper revisits the classical multiple hypotheses tracking (MHT) algorithm in a tracking-by-detection framework. The success of MHT largely depends on the ability to maintain a small list of potential hypotheses, which can be facilitated with the accurate object detectors that are currently available. We demonstrate that a classical MHT implementation from the 90's can come surprisingly close to the performance of state-of-the-art methods on standard benchmark datasets. In order to further utilize the strength of MHT in exploiting higher-order information, we introduce a method for training online appearance models for each track hypothesis. We show that appearance models can be learned efficiently via a regularized least squares framework, requiring only a few extra operations for each hypothesis branch. We obtain state-of-the-art results on popular tracking-by-detection datasets such as PETS and the recent MOT challenge.
    Attached files: MHTR_ICCV2015 - MultipleHypothesisTrackingRevisited.pdf
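    As background for the regularized least squares framework named in the abstract above, here is a minimal batch sketch of the standard closed form w = (X^T X + lambda I)^(-1) X^T y. It is not the paper's online per-hypothesis update; the function names and the toy data are hypothetical.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def ridge_fit(features, targets, lam):
    """Regularized least squares: minimize ||Xw - y||^2 + lam * ||w||^2."""
    dim = len(features[0])
    # Gram matrix X^T X with lam added on the diagonal.
    gram = [
        [sum(f[i] * f[j] for f in features) + (lam if i == j else 0.0)
         for j in range(dim)]
        for i in range(dim)
    ]
    # Right-hand side X^T y.
    rhs = [sum(f[i] * t for f, t in zip(features, targets)) for i in range(dim)]
    return solve_linear_system(gram, rhs)


# Hypothetical toy data: two feature vectors labelled +1 (target) and -1 (other).
weights = ridge_fit([[1.0, 0.0], [0.0, 1.0]], [1.0, -1.0], lam=1.0)
```

    The regularization weight `lam` shrinks the learned weights toward zero, which keeps the system solvable even when only a few samples are available.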
    Automatic fire detection has become more and more appealing because of the increasing use of video capabilities in surveillance systems used for early detection of fire. However, its high computational complexity limits its use in real-time applications. To meet the real-time processing requirements of today's fire detection techniques, this study proposes a single instruction, multiple data many-core model. To design an efficient many-core model for image processing applications such as fire detection, a key design parameter is the image data-per-processing-element (IDPE) variation of the many-core system, which is the amount of image data directly mapped to each processing element (PE). This study quantitatively evaluates the impact of the IDPE variation on system performance and energy efficiency for a multi-stage fire detection approach that consists of movement-containing region detection, color segmentation, fire feature extraction, and a decision on whether the processed video frame contains fire. In this study, we use six IDPE ratios to determine an optimal many-core model that provides the most efficient operation for fire detection using architectural and workload simulation. Experimental results indicate that the most efficient many-core model is achieved at an IDPE value of 64 in terms of the worst-case execution time and energy efficiency. In addition, this study compares the performance of the most efficient many-core configuration with that of a commercial graphics processing unit (Nvidia GeForce GTX 480) to show the improved performance of the proposed many-core model for the fire detection algorithm. This many-core configuration outperforms the commercial graphics processing unit in the worst-case execution time and energy efficiency.
    Attached files: multicore_fire.pdf
    In this paper we explore interactions between the appearance of an outdoor scene and the ambient temperature. By studying statistical correlations between image sequences from outdoor cameras and temperature measurements we identify two interesting interactions. First, semantically meaningful regions such as foliage and reflective oriented surfaces are often highly indicative of the temperature. Second, small camera motions are correlated with the temperature in some scenes. We propose simple scene specific temperature prediction algorithms which can be used to turn a camera into a crude temperature sensor. We find that for this task, simple features such as local pixel intensities outperform sophisticated, global features such as from a semantically-trained convolutional neural network.
    Attached files: Glasner_Hot_or_Not_ICCV_2015_paper.pdf
    Multi-View-Stereo (MVS) methods aim for the highest detail possible; however, such detail is often not required. In this work, we propose a novel surface reconstruction method based on image edges, superpixels and second-order smoothness constraints, producing meshes comparable to classic MVS surfaces in quality but orders of magnitude faster. Our method performs per-view dense depth optimization directly over sparse 3D Ground Control Points (GCPs), hence removing the need for view pairing, image rectification, and stereo depth estimation, and allowing for full per-image parallelization. We use Structure-from-Motion (SfM) points as GCPs, but the method is not specific to these, e.g. LiDAR or RGB-D can also be used. The resulting meshes are compact and inherently edge-aligned with image gradients, enabling good-quality lightweight per-face flat renderings. Our experiments on a variety of 3D datasets demonstrate the superiority in speed and the competitive surface quality.
    Attached files: ICCV2015_Superpixel Meshes for Fast Edge-Preserving Surface Reconstruction.pdf
    Localization of the vehicle with respect to road lanes plays a critical role in the advances of making the vehicle fully autonomous. Vision-based road lane line detection provides a feasible and low-cost solution, as the vehicle pose can be derived from the detection. While good progress has been made, road lane line detection has remained an open problem, given challenging road appearances with shadows, varying lighting conditions, worn-out lane lines, etc. In this paper, we propose a vision-based approach that is more robust to these challenges. The approach incorporates four key steps. Lane line pixels are first pooled with a ridge detector. An effective noise filtering mechanism then removes noise pixels to a large extent. A modified version of sequential RANdom SAmple Consensus (RANSAC) is then adopted in a model fitting procedure to ensure each lane line in the image is captured correctly. Finally, if lane lines on both sides of the road exist, a parallelism reinforcement technique is imposed to improve the model accuracy. The results obtained show that the proposed approach is able to detect the lane lines accurately and at a high success rate compared to current approaches. The model derived from the lane line detection is capable of generating precise and consistent vehicle localization information with respect to road lane lines, including road geometry, vehicle position and orientation.
    Attached files: Vision-based approach towards lane line detection and vehicle localization.pdf
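    The model fitting step in the abstract above builds on RANSAC. The sketch below shows plain RANSAC fitting a single straight line to 2D points; it is not the modified sequential variant the paper describes, and the function names, threshold, iteration count and sample points are all hypothetical.

```python
import math
import random


def line_through(p, q):
    """Return (a, b, c) with a*x + b*y + c = 0 and a*a + b*b = 1, or None."""
    a = q[1] - p[1]
    b = p[0] - q[0]
    norm = math.hypot(a, b)
    if norm == 0:
        return None  # the two sample points coincide
    a, b = a / norm, b / norm
    c = -(a * p[0] + b * p[1])
    return a, b, c


def point_line_distance(line, point):
    """Perpendicular distance from a point to a normalized line."""
    a, b, c = line
    return abs(a * point[0] + b * point[1] + c)


def ransac_line(points, threshold, iterations, seed=0):
    """Keep the two-point line hypothesis that gathers the most inliers."""
    rng = random.Random(seed)
    best_line, best_inliers = None, []
    for _ in range(iterations):
        p, q = rng.sample(points, 2)
        line = line_through(p, q)
        if line is None:
            continue
        inliers = [pt for pt in points
                   if point_line_distance(line, pt) <= threshold]
        if len(inliers) > len(best_inliers):
            best_line, best_inliers = line, inliers
    return best_line, best_inliers


# Hypothetical lane pixels on y = 2x + 1 plus three noise pixels.
lane_pixels = [(x, 2 * x + 1) for x in range(10)]
noise_pixels = [(1, 10), (5, -4), (8, 30)]
line, inliers = ransac_line(lane_pixels + noise_pixels,
                            threshold=0.1, iterations=200)
```

    Representing the line as a*x + b*y + c = 0 rather than slope and intercept keeps near-vertical lane lines well defined.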
    While numerous algorithms have been proposed for object tracking with demonstrated success, it remains a challenging problem for a tracker to handle large changes in scale, motion, and shape deformation with occlusion. One of the main reasons is the lack of an effective image representation to account for appearance variation. Most trackers use high-level appearance structure or low-level cues for representing and matching target objects. In this paper, we propose a tracking method from the perspective of mid-level vision with structural information captured in superpixels. We present a discriminative appearance model based on superpixels, thereby enabling a tracker to distinguish the target from the background with mid-level cues. The tracking task is then formulated by computing a target-background confidence map and obtaining the best candidate by maximum a posteriori estimation. Experimental results demonstrate that our tracker is able to handle heavy occlusion and recover from drifts. In conjunction with online update, the proposed algorithm is shown to perform favorably against existing methods for object tracking.
    Attached files: SLIC_Superpixels.pdf iccv11_superpixel tracking.pdf superpixel_iccv11a_supplementary.pdf
    Methods for super-resolution can be broadly classified into two families of methods: (i) The classical multi-image super-resolution (combining images obtained at subpixel misalignments), and (ii) Example-Based super-resolution (learning correspondence between low and high resolution image patches from a database). In this paper we propose a unified framework for combining these two families of methods. We further show how this combined approach can be applied to obtain super resolution from as little as a single image (with no database or prior examples). Our approach is based on the observation that patches in a natural image tend to redundantly recur many times inside the image, both within the same scale, as well as across different scales. Recurrence of patches within the same image scale (at subpixel misalignments) gives rise to the classical super-resolution, whereas recurrence of patches across different scales of the same image gives rise to example-based super-resolution. Our approach attempts to recover at each pixel its best possible resolution increase based on its patch redundancy within and across scales.
    Attached files: single_image_SR.pdf
    Most of the recently published background subtraction methods can still be classified as pixel-based, as most of their analysis is still only done using pixel-by-pixel comparisons. Few others might be regarded as spatial-based (or even spatiotemporal-based) methods, as they take into account the neighborhood of each analyzed pixel. Although the latter types can be viewed as improvements in many cases, most of the methods that have been proposed so far suffer in complexity, processing speed, and/or versatility when compared to their simpler pixel-based counterparts. In this paper, we present an adaptive background subtraction method, derived from the low-cost and highly efficient ViBe method, which uses a spatiotemporal binary similarity descriptor instead of simply relying on pixel intensities as its core component. We then test this method on multiple video sequences and show that by only replacing the core component of a pixel-based method it is possible to dramatically improve its overall performance while keeping memory usage, complexity and speed at acceptable levels for online applications.
    Advanced driver assistance systems: the accurate detection and classification of moving objects. Define a composite object representation to include class information in the core object's description. Propose a complete perception fusion architecture based on the evidential framework to solve the detection and tracking of moving objects problem by integrating the composite representation and uncertainty management. Integrate our fusion approach in a real-time application inside a vehicle demonstrator from the interactIVe IP European project.
    Attached files: 1.pdf
    This paper addresses the problem of single-target tracker performance evaluation. We consider the performance measures, the dataset and the evaluation system to be the most important components of tracker evaluation and propose requirements for each of them. The requirements are the basis of a new evaluation methodology that aims at a simple and easily interpretable tracker comparison. The ranking-based methodology addresses tracker equivalence in terms of statistical significance and practical differences. A fully-annotated dataset with per-frame annotations with several visual attributes is introduced. The diversity of its visual properties is maximized in a novel way by clustering a large number of videos according to their visual attributes. This makes it the most sophistically constructed and annotated dataset to date. A multi-platform evaluation system allowing easy integration of third-party trackers is presented as well. The proposed evaluation methodology was tested on the VOT2014 challenge on the new dataset and 38 trackers, making it the largest benchmark to date. Most of the tested trackers are indeed state-of-the-art since they outperform the standard baselines, resulting in a highly-challenging benchmark. An exhaustive analysis of the dataset from the perspective of tracking difficulty is carried out. To facilitate tracker comparison a new performance visualization technique is proposed.
    Attached files: tracker evaluation.pdf
    In this paper, we propose a method that is able to detect fires by analyzing videos acquired by surveillance cameras. Two main novelties have been introduced. First, complementary information, based on color, shape variation, and motion analysis, is combined by a multiexpert system. The main advantage deriving from this approach lies in the fact that the overall performance of the system significantly increases with a relatively small effort made by the designer. Second, a novel descriptor based on a bag-of-words approach has been proposed for representing motion. The proposed method has been tested on a very large dataset of fire videos acquired both in real environments and from the web. The obtained results confirm a consistent reduction in the number of false positives, without paying in terms of accuracy or renouncing the possibility to run the system on embedded platforms.
    Attached files: 2015_fire_detection.pdf
    Detecting preceding vehicles in nighttime traffic scenes is an important part of the advanced driver assistance system (ADAS). This paper proposes a region tracking-based vehicle detection algorithm using image processing techniques. First, the brightness of the taillights during nighttime is used as the typical feature, and we use an existing global detection algorithm to detect and pair the taillights. Once a vehicle is detected, a time series analysis model is introduced to predict the vehicle position and the possible region (PR) of the vehicle in the next frame. The vehicle is then detected only within the PR. This reduces the detection time and avoids false pairing between bright spots inside the PR and bright spots outside the PR. Additionally, we present a threshold-updating method to make the thresholds adaptive. Finally, experimental studies are provided to demonstrate the application and substantiate the superiority of the proposed algorithm. The results show that the proposed algorithm can simultaneously reduce both the false negative detection rate and the false positive detection rate.
    Attached files: 20160206-Saturday Seminar-Wahyono.pdf
    Character recognition in video is a challenging task because the low resolution and complex background of video cause disconnections, loss of information, loss of shapes of the characters, etc. In this paper, we introduce a novel ring radius transform (RRT) and the concept of medial pixels on characters with broken contours in the edge domain for reconstruction. For each pixel, the RRT assigns a value which is the distance to the nearest edge pixel. The medial pixels are those which have the maximum radius values in their neighborhood. We demonstrate the application of these concepts in the problem of character reconstruction to improve the character recognition rate in video images. With the ring radius transform and medial pixels, our approach exploits the symmetry information between the inner and outer contours of a broken character to reconstruct the gaps. Experimental results and comparison with two existing methods show that the proposed method outperforms the existing methods in terms of measures such as relative error and character recognition rate.
    Attached files: A novel ring radius transform.pdf
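    The two definitions in the abstract above (radius value as distance to the nearest edge pixel, medial pixels as local maxima of that value) can be sketched directly. This brute-force version assumes Euclidean distance and a 3x3 neighborhood, neither of which the abstract specifies; the function names and the toy edge map are hypothetical.

```python
import math


def ring_radius_transform(edge_map):
    """For each pixel, return the distance to the nearest edge pixel."""
    rows, cols = len(edge_map), len(edge_map[0])
    edges = [(r, c) for r in range(rows) for c in range(cols) if edge_map[r][c]]
    if not edges:
        raise ValueError("edge map contains no edge pixels")
    return [
        [min(math.hypot(r - er, c - ec) for er, ec in edges)
         for c in range(cols)]
        for r in range(rows)
    ]


def medial_pixels(radius):
    """Non-edge pixels whose radius is the maximum of their 3x3 neighborhood."""
    rows, cols = len(radius), len(radius[0])
    result = []
    for r in range(rows):
        for c in range(cols):
            value = radius[r][c]
            if value == 0:
                continue  # edge pixels themselves are never medial
            neighborhood = [
                radius[nr][nc]
                for nr in range(max(0, r - 1), min(rows, r + 2))
                for nc in range(max(0, c - 1), min(cols, c + 2))
            ]
            if value >= max(neighborhood):
                result.append((r, c))
    return result


# Hypothetical stroke: two parallel contours in the first and last columns.
edge_map = [[1, 0, 0, 0, 1] for _ in range(5)]
radius = ring_radius_transform(edge_map)
medial = medial_pixels(radius)
```

    For this toy stroke the medial pixels form the center column between the two contours, which is the symmetry cue the abstract refers to when filling contour gaps.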
    This paper presents a method to predict social saliency, the likelihood of joint attention, given an input image or video by leveraging the social interaction data captured by first person cameras. Inspired by electric dipole moments, we introduce a social formation feature that encodes the geometric relationship between joint attention and its social formation. We learn this feature from the first person social interaction data where we can precisely measure the locations of joint attention and its associated members in 3D. An ensemble classifier is trained to learn the geometric relationship. Using the trained classifier, we predict social saliency in real-world scenes with multiple social groups including scenes from team sports captured in a third person view. Our representation does not require directional measurements such as gaze directions. A geometric analysis of social interactions in terms of the F-formation theory is also presented.
    Attached files: Park_Social_Saliency_Prediction_2015_CVPR_paper.pdf
    This paper describes a multiple-view-based approach for building modeling via a novel multi-box grammar, which represents an occlusion relationship among the projections of a set of buildings sharing a common Manhattan World coordinate system. We formulate the building modeling problem as an energy minimization to combine the constraints from the multi-box grammar with (1) the semantic labeling information from appearance models, (2) the directional information w.r.t. the vanishing points in each single view, and (3) the planar homography correspondence among multiple views. We further propose a two-step coarse-to-fine approach to achieve the optimal solution. First, we employ super-pixels and a simplified version of the grammar to reduce the search space and obtain an initial layout that accelerates convergence. At the second stage, the scene model is refined to achieve pixel-level accuracy by minimizing the energy using Random Walk. Experiments on street view images demonstrate the capability of our method in reconstructing multiple buildings at different distances, and also the robustness in handling occlusion.
    Updating road markings is one of the routine tasks of transportation agencies. Compared with traditional road inventory mapping techniques, vehicle-borne mobile light detection and ranging (LiDAR) systems can undertake the job safely and efficiently. However, current hurdles include software and computing challenges when handling huge volumes of highly dense and irregularly distributed 3-D mobile LiDAR point clouds. This paper presents the development and implementation aspects of an automated object extraction strategy for rapid and accurate road marking inventory. The proposed road marking extraction method is based on 2-D georeferenced feature (GRF) images, which are interpolated from 3-D road surface points through a modified inverse distance weighted (IDW) interpolation. Weighted neighboring difference histogram (WNDH)-based dynamic thresholding and multiscale tensor voting (MSTV) are proposed to segment and extract road markings from the noise-corrupted GRF images. The results obtained using 3-D point clouds acquired by a RIEGL VMX-450 mobile LiDAR system in a subtropical urban environment are encouraging.
    Attached files: Using Mobile LiDAR Data for Rapidly Updating Road Markings.pdf
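    For reference, standard inverse distance weighted (IDW) interpolation, the technique the paper above modifies, can be sketched as follows. This is the textbook form, not the paper's modified version; the function name, the power parameter default and the sample points are hypothetical.

```python
import math


def idw_interpolate(samples, query, power=2.0):
    """Estimate a value at `query` from (x, y, value) samples.

    Each sample is weighted by 1 / distance**power, so nearby samples
    dominate the estimate.
    """
    if not samples:
        raise ValueError("at least one sample is required")
    qx, qy = query
    weighted_sum = 0.0
    weight_total = 0.0
    for x, y, value in samples:
        distance = math.hypot(x - qx, y - qy)
        if distance == 0:
            return value  # query coincides with a sample point
        weight = 1.0 / distance ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total


# Hypothetical intensity samples at two road surface points.
points = [(0.0, 0.0, 0.0), (2.0, 0.0, 10.0)]
midpoint_value = idw_interpolate(points, (1.0, 0.0))
near_first_value = idw_interpolate(points, (0.5, 0.0))
```

    Rasterizing a point cloud into an image would call such a function once per output pixel, usually restricted to samples within a search radius to keep the cost manageable.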