One of the major applications of machine learning in autonomous driving is semantic segmentation, or scene parsing, of urban driving scenes. Until a few years ago, semantic segmentation was one of the most challenging problems in computer vision. Deep learning has since revolutionized the field, enabling environment recognition at accuracies that on some benchmarks exceed human performance. This degree of accuracy comes with challenges: computational constraints on embedded systems, the need for large datasets, and learning issues such as class imbalance, unobserved objects, and corner cases must be resolved to enable automotive applications.
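One common mitigation for the class-imbalance issue mentioned above is to reweight the loss per class. Below is a minimal sketch using median-frequency balancing in PyTorch; the pixel counts and the 19-class setup are illustrative assumptions, not values from any particular dataset.

```python
import torch
import torch.nn as nn

def median_frequency_weights(pixel_counts: torch.Tensor) -> torch.Tensor:
    """Median-frequency balancing: weight_c = median(freq) / freq_c.

    `pixel_counts` holds the number of labeled pixels per class,
    e.g. accumulated over the training split.
    """
    freq = pixel_counts.float() / pixel_counts.sum()
    return freq.median() / freq

# Hypothetical counts for a 19-class setup (Cityscapes-style train IDs).
counts = torch.randint(low=10_000, high=10_000_000, size=(19,))
weights = median_frequency_weights(counts)

# Rare classes (e.g., rider, motorcycle) get larger weights, so the
# loss no longer lets road/building pixels dominate the gradient.
criterion = nn.CrossEntropyLoss(weight=weights, ignore_index=255)
```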
Regarding datasets, autonomous driving researchers are fortunate:
by now, several decent publicly available datasets exist that exhibit a variety of scenes, annotations, and geographical distributions. The most frequently used semantic segmentation datasets are KITTI, Cityscapes, Mapillary Vistas, ApolloScape, and the recently released Berkeley DeepDrive BDD100K.
| Dataset | Labeled Images for Training | Classes | Multiple Cities | Environment |
| --- | --- | --- | --- | --- |
| Mapillary Vistas | 20k | 66 | Yes | Daylight, rain, snow, fog, haze, dawn, dusk and night |
| ApolloScape | 147k | 36 | No | Daylight, snow, rain, fog |
| BDD100K | 8,000 | 19 | Yes | Daylight, rain, snow, fog, haze, dawn, dusk and night |
The KITTI semantic segmentation dataset consists of 200 semantically annotated training images and 200 test images. KITTI is not only a semantic segmentation dataset; it also includes benchmarks for 2D and 3D object detection, object tracking, road/lane detection, scene flow, depth evaluation, optical flow, and semantic instance-level segmentation. It was captured around Karlsruhe, Germany, in rural areas and on highways.
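To give a feel for working with the benchmark, here is a minimal sketch that pairs KITTI training images with their semantic masks. The directory names follow the layout of the official data_semantics archive as I understand it; treat them as an assumption.

```python
from pathlib import Path
from PIL import Image

# Assumed layout of the KITTI semantics archive (data_semantics.zip):
#   training/image_2/   RGB frames
#   training/semantic/  per-pixel class-ID masks (same filenames)
root = Path("data_semantics/training")

def kitti_pairs(root: Path):
    """Yield (image, mask) pairs for the 200 annotated training frames."""
    for img_path in sorted((root / "image_2").glob("*.png")):
        mask_path = root / "semantic" / img_path.name
        yield Image.open(img_path), Image.open(mask_path)

image, mask = next(kitti_pairs(root))
print(image.size, mask.size)  # masks share the image resolution
```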
Cityscapes is widely used for semantic understanding of urban scenes. The dataset was recorded over several months, covering spring, summer, and fall in 50 cities in Germany and neighboring countries. Images were recorded with an automotive-grade stereo camera with a 22 cm baseline. The dataset consists of 5k finely annotated and 20k weakly (coarsely) annotated images, and it contains significantly more object instances (e.g., humans and vehicles) than KITTI. The annotated images cover 19 different evaluation classes.
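Cityscapes is directly supported by torchvision, which makes it a convenient starting point. The sketch below loads the finely annotated training split and tallies a rough per-class pixel distribution over a few samples; the root path is a placeholder, and note that the masks store the full Cityscapes label IDs, so mapping them down to the 19 evaluation classes is a separate step.

```python
import numpy as np
from torchvision.datasets import Cityscapes

# torchvision's wrapper expects the official leftImg8bit/ and gtFine/
# folders under `root` (path is a placeholder for a local download).
dataset = Cityscapes(
    root="cityscapes",
    split="train",
    mode="fine",             # the 5k finely annotated images
    target_type="semantic",  # per-pixel label-ID masks
)

# Accumulate the per-class pixel distribution over a few samples.
counts = np.zeros(34, dtype=np.int64)
for i in range(10):
    _, mask = dataset[i]
    counts += np.bincount(np.asarray(mask).ravel(), minlength=34)[:34]
print(counts)
```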
Mapillary Vistas is five times larger than the fine annotations of the Cityscapes dataset. All of its images are drawn from www.mapillary.com's crowdsourced image database, covering North and South America, Europe, Africa, and Asia. It offers varied capture viewpoints (road, sidewalk, and off-road), and because the images come from different imaging devices (mobile phones, tablets, action cameras), it exhibits different types of camera noise. It contains 25k high-resolution images annotated with 66 classes.
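The Vistas download ships a config.json that defines the 66 classes, which is the natural entry point for tooling. The following sketch assumes the archive layout of Vistas v1.x (training/images, training/labels, config.json at the root); the example filename is hypothetical.

```python
import json
from pathlib import Path
from PIL import Image

root = Path("mapillary-vistas")  # assumed extraction directory

# config.json describes all 66 classes: machine name, readable name,
# color, and whether instance annotations exist for the class.
with open(root / "config.json") as f:
    labels = json.load(f)["labels"]
print(len(labels), labels[0]["readable"])

# Label masks are paletted PNGs; the raw pixel values are class indices.
mask = Image.open(root / "training" / "labels" /
                  "example_image_key.png")  # hypothetical filename
ids = set(mask.getdata())
print([labels[i]["readable"] for i in sorted(ids)])
```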
ApolloScape contains 147k images with corresponding pixel-level annotations. It also includes pose information and depth maps for the static background. All images were captured with a Riegl VMX-1HA acquisition system, whose VMX-CS6 camera system records at a resolution of 3384 x 2710. The class specification is similar to the Cityscapes dataset, but owing to the popularity of the tricycle in East Asian countries, a new tricycle class covering all kinds of three-wheeled vehicles was added.
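When mixing ApolloScape with Cityscapes-trained models, the differing class specifications (including the extra tricycle class) have to be reconciled. The sketch below shows one generic way to remap label IDs via a lookup table; the specific ID pairs are illustrative assumptions, not the official ApolloScape mapping.

```python
import numpy as np

# Hypothetical mapping from dataset-specific class IDs into a shared
# training-ID space; the real ApolloScape IDs differ, so these pairs
# are illustrative only.
APOLLO_TO_TRAIN = {
    0: 255,   # void/unlabeled -> ignore
    33: 13,   # car            -> Cityscapes-style "car" train ID
    38: 17,   # motorcycle
    40: 255,  # tricycle has no Cityscapes counterpart; one option is
              # to ignore it, another is to fold it into a near class
}

def remap(mask: np.ndarray, mapping: dict, ignore: int = 255) -> np.ndarray:
    """Remap a uint8 label mask through a per-ID lookup table."""
    lut = np.full(256, ignore, dtype=np.uint8)
    for src, dst in mapping.items():
        lut[src] = dst
    return lut[mask]
```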
BDD100K is the largest publicly available self-driving dataset, 800 times larger than the ApolloScape dataset. For the semantic segmentation use case, its training classes are similar to those of the Cityscapes dataset. BDD100K was mainly captured in different areas of the US, so its infrastructure and highway traffic signs differ from those in the Cityscapes dataset. BDD100K also includes object detection, lane detection, drivable area, and semantic instance segmentation datasets.
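Since BDD100K's segmentation labels follow Cityscapes-style training IDs, cross-dataset evaluation is straightforward. The sketch below pairs images with masks under the layout I believe the segmentation subset uses; the folder structure and the _train_id.png suffix are assumptions.

```python
from pathlib import Path
from PIL import Image

# Assumed layout of the BDD100K segmentation subset:
#   bdd100k/seg/images/train/*.jpg
#   bdd100k/seg/labels/train/*_train_id.png  (Cityscapes-style train IDs)
root = Path("bdd100k/seg")

def bdd_pairs(split: str = "train"):
    for img_path in sorted((root / "images" / split).glob("*.jpg")):
        mask_path = (root / "labels" / split /
                     f"{img_path.stem}_train_id.png")
        yield Image.open(img_path), Image.open(mask_path)

image, mask = next(bdd_pairs())
# Because the train IDs match Cityscapes, a model trained on one
# dataset can be evaluated on the other without relabeling.
```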