Your Car Can See The World, Too: Traffic Scene Understanding Based On Stereo Vision

Technology has become ubiquitous in our vehicles. Advanced Driver Assistance Systems (ADAS) are increasingly common in new cars, performing tasks such as warning of lane departures or braking before an impending rear-end collision. The goal is clear: reducing the accident rate by helping the human driver, whose errors are the leading cause of traffic accidents. At the end of that road awaits the autonomous vehicle, which today is closer than ever to being a reality rather than a mere sci-fi gimmick.

Although we are not always aware of how complex driving is, giving an automatic system the skills required to navigate a car safely is a monumental task. Correct operation of the driving controls is certainly vital; however, it is even more critical that all modules have a detailed model of the environment to ensure that decisions are safe for all road users. Generating this model is the responsibility of the vehicle's perception stack.

As with human drivers, automated driving systems must be aware of all potential hazards in the surroundings of the vehicle, particularly those posed by dynamic obstacles such as other cars and pedestrians. For that reason, reliable identification and classification of agents in the vicinity of the vehicle is a prerequisite for autonomous driving, and one of the main tasks of automotive perception systems.

Perception algorithms rely on data from sensors to perform their function. The vehicle sensor setup determines, therefore, the limits of the perception modules in terms of functionality and accuracy. Among the different sensors employed in these applications, cameras offer a compelling set of features: they can provide valuable appearance information while being highly cost-effective and enabling close-to-market setups with minor impact on the vehicle styling.

Fortunately, recent advances in deep learning have paved the way to an in-depth understanding of the traffic scene from onboard cameras, with levels of accuracy never seen before. Current object detection paradigms based on convolutional neural networks, such as Faster R-CNN [1], used in our work, can be leveraged to perform online multi-class detection, providing a solid foundation for identifying the different road users in the scene.

Additionally, we have shown in our work that the appearance of an object can provide valuable hints to estimate its orientation, which is particularly beneficial for predicting the trajectory of dynamic obstacles, such as the ones found in traffic environments. Interestingly enough, we have confirmed that the same neural network can be trained to perform detection and orientation estimation simultaneously, without experiencing a significant loss of performance in either task.

It is undeniable that deep learning techniques can be computationally expensive. When it comes to onboard systems, power and size limitations are essential factors to take into account in the design. We show in our work that different hyperparameters, such as the image scale and the backbone of the network, can be tuned to reach a suitable trade-off between accuracy and computational cost, according to the requirements of the application.

Nevertheless, identifying objects in the images provided by an onboard camera is only the first part of the problem. Safe operation of the vehicle in crowded environments requires an accurate estimate of the 3D location of objects in the space around the car. As stated above, some quantities, such as the orientation, can be estimated from the appearance of objects in camera frames. Still, when precise localization is necessary, additional sources of data capable of providing spatial information are required.

Lidar rangefinders, which provide accurate distance measurements of the surroundings of the vehicle, have become a popular choice for that purpose in the last decades. Still, they are not the only alternative available. We aimed to prove that stereo vision systems can also provide geometrical information to localize the objects in the scene with reasonable accuracy.

Stereo devices are made of two cameras that capture pictures of the same scene from different perspectives. As with the human eyes, this enables retrieving 3D information, based on the distance between the projections of the same point onto each image (the disparity). The procedure to recover the 3D information requires some processing and introduces an error which cannot be avoided; on the other hand, stereo vision eliminates the need for expensive, bulky lidar devices, which is particularly desirable in onboard setups. Fig. 1 shows our stereo vision system, a special configuration with three cameras, so that we can choose which pair of “eyes” to use depending on the desired detection range.

Fig. 1. Trinocular stereo camera mounted at the top of the windscreen. Image courtesy of Intelligent Systems Lab.
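
As an illustration of the triangulation described above, here is a minimal sketch of the textbook relation for a rectified stereo pair, Z = f * B / d, where d is the disparity in pixels, f the focal length in pixels and B the baseline (the distance between the two cameras). This is not the code used in our system; the function names and every numeric value below (focal length, baselines, principal point) are assumptions chosen only for the example. The first-order error term hints at why having more than one baseline to choose from is useful: for the same disparity error, a wider baseline gives a smaller depth error at long range.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth in metres along the optical axis of a rectified pair: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive to obtain a finite depth")
    return focal_px * baseline_m / disparity_px


def depth_error(depth_m, focal_px, baseline_m, disparity_error_px=1.0):
    """First-order depth uncertainty: dZ ~= Z**2 / (f * B) * dd."""
    return depth_m ** 2 / (focal_px * baseline_m) * disparity_error_px


def reproject(u, v, disparity_px, focal_px, baseline_m, cx, cy):
    """3D point in the camera frame (x right, y down, z forward) for pixel (u, v)."""
    z = depth_from_disparity(disparity_px, focal_px, baseline_m)
    x = (u - cx) * z / focal_px
    y = (v - cy) * z / focal_px
    return x, y, z


# Hypothetical calibration values, for illustration only.
FOCAL_PX = 700.0
NARROW_BASELINE_M = 0.12
WIDE_BASELINE_M = 0.50

point = reproject(420.0, 240.0, 14.0, FOCAL_PX, NARROW_BASELINE_M, 320.0, 240.0)
narrow_error = depth_error(30.0, FOCAL_PX, NARROW_BASELINE_M)
wide_error = depth_error(30.0, FOCAL_PX, WIDE_BASELINE_M)
print(point, narrow_error, wide_error)
```

Under this simple model the depth error grows with the square of the distance, so the choice of camera pair is a trade-off between range and the size of the region visible to both cameras.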

Distances provided by the stereo system are relative to the device itself. Nonetheless, if the ground is assumed flat, which is a reasonable premise in traffic environments, measurements can be referenced to the vehicle instead, providing an environment model which is immediately useful for navigation. This transformation involves a calibration procedure based on the detection of the ground plane.
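
The change of reference frame described above can be sketched as follows. The example assumes that the ground plane has already been estimated in camera coordinates as `normal . p + offset = 0` (how the plane is detected is outside the scope of this sketch), that the camera frame is x right, y down, z forward, and that the target frame is x forward, y left, z up with its origin on the ground directly below the camera. These conventions and all function names are assumptions made for illustration, not a description of our calibration procedure.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return tuple(c / norm for c in v)


def ground_aligned_frame(normal, offset):
    """Build (forward, left, up, camera_height) from the plane normal . p + offset = 0."""
    norm = math.sqrt(dot(normal, normal))
    up = tuple(c / norm for c in normal)
    height = offset / norm  # signed distance from the camera origin to the plane
    if height < 0:  # flip the normal so that it points from the ground to the camera
        up = tuple(-c for c in up)
        height = -height
    optical_axis = (0.0, 0.0, 1.0)
    # Forward is the optical axis projected onto the ground plane.
    # Assumes the camera is not looking straight down (projection would vanish).
    k = dot(optical_axis, up)
    forward = normalize(tuple(a - k * u for a, u in zip(optical_axis, up)))
    left = cross(up, forward)
    return forward, left, up, height


def camera_to_ground(point, frame):
    """Express a camera-frame point as (forward, left, height above ground)."""
    forward, left, up, height = frame
    return (dot(point, forward), dot(point, left), dot(point, up) + height)


# Hypothetical level camera mounted 1.5 m above a flat road.
frame = ground_aligned_frame((0.0, -1.0, 0.0), 1.5)
print(camera_to_ground((2.0, 1.5, 10.0), frame))
```

A fixed translation from the point below the camera to the vehicle's own reference point would complete the transformation; it is omitted here because it depends on where the device is mounted.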

Finally, detections in the image, endowed with an estimation of their orientation, can be associated with the corresponding 3D information from the stereo system to estimate the current location of all the objects in the field of view of the cameras.

Using these building blocks, our work proposes an integrated framework where detection and 3D reconstruction are carried out in parallel and fused at a late stage to provide a list containing all the detected objects, including their predicted category (e.g., car, pedestrian) and their location with respect to the vehicle.
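
To make the late fusion step more concrete, the sketch below pairs each image detection with the stereo points that fall inside its bounding box and takes the per-axis median as the object location. The association rule (median of the points inside the box), the minimum number of points and all class and field names are assumptions made for this illustration; the article gives the details of the actual method.

```python
from dataclasses import dataclass
from statistics import median
from typing import List, Optional, Tuple


@dataclass
class Detection:
    label: str                               # predicted category, e.g. "car"
    box: Tuple[float, float, float, float]   # (u_min, v_min, u_max, v_max) in pixels
    yaw_rad: float                           # orientation estimated from appearance


@dataclass
class StereoPoint:
    u: float                                 # pixel column of the reconstructed point
    v: float                                 # pixel row of the reconstructed point
    xyz: Tuple[float, float, float]          # 3D coordinates in the vehicle frame


@dataclass
class LocatedObject:
    label: str
    yaw_rad: float
    position: Tuple[float, float, float]


def locate(detection: Detection, points: List[StereoPoint],
           min_points: int = 3) -> Optional[LocatedObject]:
    """Attach a 3D position to a detection, or return None if support is too sparse."""
    u0, v0, u1, v1 = detection.box
    inside = [p.xyz for p in points if u0 <= p.u <= u1 and v0 <= p.v <= v1]
    if len(inside) < min_points:
        return None
    # The per-axis median is robust to a few background points inside the box.
    position = tuple(median(axis) for axis in zip(*inside))
    return LocatedObject(detection.label, detection.yaw_rad, position)


def fuse(detections: List[Detection],
         points: List[StereoPoint]) -> List[LocatedObject]:
    """Return the detections that could be localized, with category and position."""
    located = (locate(d, points) for d in detections)
    return [obj for obj in located if obj is not None]
```

Because the two branches only meet in `fuse`, the detector and the stereo reconstruction can run independently until their outputs are available, which is the point of fusing at a late stage.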

To test the performance of the system, we performed several experiments using real-world data from the reference benchmark for onboard perception, the KITTI dataset [2]. First, we showed that the detection module provides reliable results on which to base the rest of the pipeline; in addition, a reasonably accurate orientation estimate could be obtained for a high percentage of the detected obstacles. Second, combining appearance and geometrical information allowed localizing the objects in the environment with a median error of 0.5 m within a wide range of distances.

Results suggest that cameras and, especially, stereo vision systems have strong potential for enabling scene understanding from onboard perception systems. Using a single modality, as in the proposed framework, simplifies the required sensor setup and avoids burdens such as inter-sensor calibration. Although often overlooked, vision-based sensors can be expected to play a significant role in upcoming autonomous cars.

These findings are described in the article entitled “Traffic scene awareness for intelligent vehicles using ConvNets and stereo vision,” recently published in the journal Robotics and Autonomous Systems.


  1. S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp. 1137–1149, 2017.
  2. A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 3354–3361.