One of the major applications of machine learning in autonomous driving is semantic segmentation, or scene parsing, of urban driving scenes. Until a few years ago, semantic segmentation was one of the most challenging problems in computer vision. Nowadays deep learning has revolutionized the field, enabling accurate environment recognition that approaches, and on some benchmarks exceeds, human performance. This degree of accuracy comes with challenges: the computational constraints of embedded systems, the need for large datasets, and learning issues such as class imbalance, unobserved objects, and corner cases must be resolved to enable automotive applications.
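A common mitigation for the class imbalance mentioned above is to weight the loss per class (e.g., weighted cross-entropy). The following is a minimal sketch of inverse-frequency weighting; the function name and the normalization (mean weight of 1) are illustrative choices, not a fixed standard:

```python
import numpy as np

def inverse_frequency_weights(pixel_counts, eps=1e-8):
    """Per-class loss weights from per-class pixel totals.

    Rare classes (few pixels) receive larger weights. `pixel_counts`
    would be computed over the whole training set; the toy values
    below are hypothetical.
    """
    counts = np.asarray(pixel_counts, dtype=np.float64)
    freq = counts / counts.sum()          # relative pixel frequency
    weights = 1.0 / (freq + eps)          # rare class -> large weight
    return weights / weights.sum() * len(counts)  # normalize: mean weight == 1

# toy 3-class distribution: road dominates, rider is rare
w = inverse_frequency_weights([900_000, 90_000, 10_000])
```

Such weights are typically passed to the loss function of the training framework (e.g., the `weight` argument of a cross-entropy loss).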

Regarding datasets, autonomous driving researchers are lucky:
by now, several decent publicly available datasets exist that exhibit a variety of scenes, annotation types, and geographical distributions. The most frequently used semantic segmentation datasets are
KITTI, Cityscapes, Mapillary Vistas, ApolloScape, and the recently released Berkeley DeepDrive BDD100K.

Dataset Comparison

| Dataset | Labeled Images for Training | Classes | Multiple Cities | Environment |
|---|---|---|---|---|
| KITTI | 200 | 34 | No | Daylight |
| Cityscapes | 3478 | 34 | Yes | Daylight |
| Mapillary Vistas | 20k | 66 | Yes | Daylight, rain, snow, fog, haze, dawn, dusk and night |
| ApolloScape | 147k | 36 | No | Daylight, snow, rain, fog |
| BDD100K | 8000 | 19 | Yes | Daylight, rain, snow, fog, haze, dawn, dusk and night |



The KITTI semantic segmentation dataset consists of 200 semantically annotated training images and 200 test images. KITTI as a whole is not only a semantic segmentation benchmark; it also includes datasets for 2D and 3D object detection, object tracking, road/lane detection, scene flow, depth evaluation, optical flow, and semantic instance-level segmentation. KITTI was captured around Karlsruhe in Germany, in rural areas and on highways.

Class distribution of KITTI Dataset

Source: Hassan Abu Alhaija, Siva Karthik Mustikovela, Lars Mescheder, Andreas Geiger: “Augmented Reality Meets Computer Vision : Efficient Data Generation for Urban Driving Scenes”, 2017; [ arXiv:1708.01566]


Cityscapes is widely used for semantic understanding of urban scenes. The dataset was recorded over the span of several months, covering spring, summer, and fall, in 50 cities of Germany and neighboring countries. Images were recorded with an automotive-grade 22 cm baseline stereo camera. The dataset consists of 5k finely annotated and 20k weakly annotated images. It contains significantly more object instances (e.g., humans and vehicles) than KITTI. The figure below shows the class distribution of the annotated images over 19 different classes.
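Class distributions like the ones shown in these figures can be computed directly from the annotation masks. A minimal NumPy sketch, assuming the masks already use contiguous train IDs 0..18 (as in the Cityscapes `labelTrainIds` convention, where void pixels carry the ignore ID 255):

```python
import numpy as np

def class_pixel_distribution(label_masks, num_classes=19):
    """Fraction of pixels per class over a list of 2D integer label masks.

    Pixels with IDs >= num_classes (e.g. the ignore label 255) are dropped.
    """
    totals = np.zeros(num_classes, dtype=np.int64)
    for mask in label_masks:
        ids = np.asarray(mask).ravel()
        ids = ids[ids < num_classes]                       # drop ignore/void labels
        totals += np.bincount(ids, minlength=num_classes)  # per-class pixel counts
    return totals / totals.sum()

# toy 2x2 masks: class 0 appears 5 times, class 1 twice, 255 is ignored
dist = class_pixel_distribution(
    [np.array([[0, 0], [1, 255]]), np.array([[0, 0], [0, 1]])]
)
```

In practice the masks would be loaded from the dataset's label PNGs (e.g., with PIL) before being passed in.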

Class distribution of Cityscapes dataset

Source: Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth: “The Cityscapes Dataset for Semantic Urban Scene Understanding”, 2016; [ arXiv:1604.01685].

Mapillary Vistas

This dataset is five times larger than the fine annotations of the Cityscapes dataset. All of the images are extracted from Mapillary's crowdsourced image database, covering North and South America, Europe, Africa, and Asia. It features different capturing viewpoints, such as road, sidewalk, and off-road. Since it consists of images from different imaging devices (mobile phones, tablets, action cameras), it exhibits different types of camera noise. It has 25k high-resolution images annotated with 66 classes.
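Because the datasets use different label sets (66 Mapillary classes vs. 19 Cityscapes train classes), combining them usually requires remapping labels through a lookup table. A minimal sketch; the mapping dictionary below is a hypothetical toy example, not an official correspondence:

```python
import numpy as np

def remap_labels(mask, id_map, ignore_id=255):
    """Remap an integer label mask through a source-id -> target-id dict.

    Any source id missing from `id_map` is mapped to `ignore_id`.
    """
    mask = np.asarray(mask)
    lut = np.full(int(mask.max()) + 1, ignore_id, dtype=np.int64)
    for src, dst in id_map.items():
        if src < lut.size:
            lut[src] = dst
    return lut[mask]          # vectorized lookup over the whole mask

# toy mapping: source ids 3 and 7 both collapse into target class 1
out = remap_labels(np.array([[3, 7], [0, 3]]), {3: 1, 7: 1, 0: 0})
```

The same pattern works in either direction; unmapped classes simply end up in the ignore label and are excluded from the loss.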


Class distribution of Mapillary dataset


Source: Neuhold, Gerhard, et al. “The Mapillary Vistas Dataset for Semantic Understanding of Street Scenes.” Proceedings of the International Conference on Computer Vision (ICCV), Venice, Italy, 2017.


ApolloScape

This dataset contains 147k images with corresponding pixel-level annotations. It also includes pose information and depth maps for the static background. All images were captured with a Riegl VMX-1HA acquisition system, whose VMX-CS6 camera system provides a resolution of 3384 × 2710. The class specification is similar to that of the Cityscapes dataset, but due to the popularity of the tricycle in East Asian countries, a new tricycle class was added that covers all kinds of three-wheeled vehicles.

Class distribution of ApolloScape dataset

Source: Xinyu Huang, Xinjing Cheng, Qichuan Geng, Binbin Cao, Dingfu Zhou, Peng Wang, Yuanqing Lin: “The ApolloScape Dataset for Autonomous Driving”, 2018; [ arXiv:1803.06184].

BDD100K

This dataset is the largest publicly available self-driving dataset; in terms of total image frames, it is roughly 800 times larger than the ApolloScape dataset. For semantic segmentation, it has training classes similar to those of the Cityscapes dataset. The data was mainly captured in different areas of the US, so its infrastructure and highway traffic signs differ from those in the Cityscapes dataset. BDD100K also includes object detection, lane detection, drivable area, and semantic instance segmentation datasets.
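Since BDD100K's semantic segmentation classes line up closely with the Cityscapes training classes, a model trained on one can be evaluated on the other. Results are typically reported as mean intersection-over-union (mIoU), which can be computed from a confusion matrix; a minimal sketch, with the ignore index of 255 following the Cityscapes convention:

```python
import numpy as np

def mean_iou(pred, target, num_classes=19, ignore_index=255):
    """Mean intersection-over-union from flat prediction/target label arrays."""
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()
    keep = target != ignore_index
    pred, target = pred[keep], target[keep]
    # confusion matrix: rows = target class, cols = predicted class
    cm = np.bincount(target * num_classes + pred,
                     minlength=num_classes ** 2).reshape(num_classes, num_classes)
    inter = np.diag(cm).astype(np.float64)
    union = cm.sum(axis=0) + cm.sum(axis=1) - inter
    iou = inter[union > 0] / union[union > 0]  # skip classes absent from both
    return iou.mean()

# perfect prediction -> mIoU of 1.0
m = mean_iou([0, 1, 2], [0, 1, 2], num_classes=3)
```

In a real evaluation, `pred` and `target` would be the flattened network output and ground-truth masks accumulated over the whole validation set.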

Class distribution of BDD100K dataset

Source: Fisher Yu, Wenqi Xian, Yingying Chen, Fangchen Liu, Mike Liao, Vashisht Madhavan: “BDD100K: A Diverse Driving Video Database with Scalable Annotation Tooling”, 2018; [ arXiv:1805.04687].

