Previously, we discussed how Machine Learning is fundamentally different from conventional software development. Machine Learning is data-driven: performance doesn’t scale with development effort, but with the amount of data used for training. Retraining with new data can even be automated if the data is updated regularly.

Surprisingly, training data management is not a big focus in the Machine Learning community and – even more unfortunate – it often does not receive major attention in internal development processes. This article aims to explain why data management is so critical for Machine Learning – especially for ML-powered autonomous driving.

Getting data is the main effort in Machine Learning

To this day, there are few Machine Learning projects without the “surprise” at some point that data is missing, corrupted, expensive, hard to obtain, or simply arriving far later than expected. This recurring pattern contradicts the frequently given promise that Machine Learning is substantially faster than conventional software development because one just has to provide more data.

Consequently, project deadlines are missed, student theses come to nothing, and developers idle or shift focus until the situation improves. Here are some reasons why data is so vital for Machine Learning:

Biggest workload

Data collection, cleansing, and management typically account for more than 90% of the total development effort in Machine Learning projects. The chart below provides an overview of the related activities.

Biggest impact on performance

The quality of training data is the dominant factor for the performance of trained models: it doesn’t require a genius to design a model for great data, but not even a genius can design a model for corrupted data. The principle that poor input inevitably leads to poor output is also known as “garbage in, garbage out”.

Data points enable use cases

The content of a training dataset defines which use cases a model trained on it can fulfill. For an autonomous car, use cases are derived from likely or critical scenarios in the operational design domain (i.e. the deployment setting, e.g. highway). For these use cases, data points with the relevant features (e.g. crossings) must be identified and incorporated into the training dataset to achieve reasonable performance and to enable per-use-case performance evaluation.

Dataset biases may render trained models useless

Biases are “dataset bugs”, i.e. imbalances in the distribution of training datasets towards undesired patterns and features. These imbalances are often hard to spot, and can – in the worst case – render an entire training dataset (and the models it trains) useless[1]. Most practical applications are too complex to avoid biases entirely, but they must be identified and compensated for as well as possible.

Collecting data for self-driving cars is a particularly big effort

Apart from the general pitfalls of Machine Learning, there is a variety of challenges specific to the autonomous driving context that make data collection and dataset building particularly complex. Here are some of them:

Sensor data explosion

In a vehicle equipped with a full sensor set for 360° perception, cameras, lidars, radars, ultrasonic sensors, and vehicle bus data produce gigabytes of data per second at full sampling rate. Once camera resolutions beyond 2 MP are considered, 10 Gbit/s Ethernet is quickly maxed out. The main drivers of data volume are the camera and lidar sensors.
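A back-of-the-envelope calculation illustrates the point. The sensor counts, resolutions, and rates below are illustrative assumptions, not a reference configuration:

```python
# Rough data-rate estimate for a hypothetical 360° sensor set.
# All counts and rates are illustrative assumptions.

def camera_rate_gbit_s(n_cameras, megapixels, bytes_per_pixel, fps):
    """Uncompressed camera stream rate in Gbit/s."""
    return n_cameras * megapixels * 1e6 * bytes_per_pixel * fps * 8 / 1e9

def lidar_rate_gbit_s(n_lidars, points_per_s, bytes_per_point):
    """Raw lidar point-cloud rate in Gbit/s."""
    return n_lidars * points_per_s * bytes_per_point * 8 / 1e9

cam = camera_rate_gbit_s(n_cameras=5, megapixels=2.0, bytes_per_pixel=2, fps=30)
lid = lidar_rate_gbit_s(n_lidars=1, points_per_s=2_400_000, bytes_per_point=16)

print(f"cameras: {cam:.1f} Gbit/s, lidar: {lid:.2f} Gbit/s")
# → cameras: 4.8 Gbit/s, lidar: 0.31 Gbit/s
```

Even in this modest configuration, five uncompressed 2 MP camera streams alone consume nearly half of a 10 Gbit/s link.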

Sensor immaturity

Autonomous driving is in a phase of rapid development in which a sprawling landscape of startups and established providers comes up with promising new sensors on a weekly basis. The downside is that these sensors are prototypes, not production grade, and consequently massive effort is required for sensor integration and maintenance on both the software and hardware side. Hardware bugs, firmware updates, and postponed sensor deliveries must be expected and will take their toll on development velocity.


Sensor calibration

Calibrating one camera, or a stereo camera configuration, is certainly a problem solvable with reasonable effort, but calibrating five cameras together with a lidar, four radars, and eight ultrasonic sensors is a major undertaking. Ensuring for every recording that all sensors are in sync is tedious even if the respective software routines and calibration targets are perfectly in place. To minimize this effort and to ensure that calibration doesn’t break during rides, online calibration is key.

Figure: A simple camera and lidar setup, ultrasonic and radar sensors omitted for the sake of simplicity.

Chasing edge cases

A training dataset for Machine Learning should represent its intended deployment environment. This does not mean that the training dataset should have the same feature distribution as the deployment environment, since (for a finite dataset) this would mean the dataset consists of redundant standard situations only. On the contrary, datasets should be diverse and biased towards edge cases rather than towards frequently occurring “standard situations” (e.g. empty streets). In an advanced implementation, active learning is used to select the data with the largest impact on the model’s performance.
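A minimal sketch of one common active learning strategy, uncertainty sampling: from a pool of unlabeled samples, select those the current model is least certain about. The sample ids and class probabilities are hard-coded stand-ins for real model predictions:

```python
# Uncertainty-based active learning sketch: rank unlabeled samples by
# predictive entropy and pick the most uncertain ones for labeling.
import math

def entropy(probs):
    """Predictive entropy of a class-probability vector (higher = more uncertain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_labeling(pool, k):
    """Return the ids of the k most uncertain samples."""
    ranked = sorted(pool.items(), key=lambda kv: entropy(kv[1]), reverse=True)
    return [sample_id for sample_id, _ in ranked[:k]]

pool = {
    "frame_001": [0.98, 0.01, 0.01],  # confident: likely a redundant standard scene
    "frame_002": [0.40, 0.35, 0.25],  # uncertain: a candidate edge case
    "frame_003": [0.55, 0.40, 0.05],
}
print(select_for_labeling(pool, k=2))  # → ['frame_002', 'frame_003']
```

Production systems use richer acquisition functions (e.g. expected model change or ensemble disagreement), but the principle is the same: spend labeling budget where the model struggles.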

Balancing a training dataset towards edge cases requires the ability to identify such cases, and the effort to collect them in the field. This is challenging because real driving environments are highly complex and the set of potential edge cases is endless, with some cases so rare that it takes millions of kilometers of driving to observe them.

Traffic events follow a long-tail distribution, and since it is not possible to collect all possible edge cases, it must be decided which edge cases to collect and when the dataset is sufficient.

The chart above illustrates the effort it takes to collect rare cases, using pedestrians as an example. While pedestrians in general occur frequently (in urban areas), situations in which pedestrians are walking on the street are comparably rare and require more extensive recording to be captured. Events like children playing on the street, or pedestrians lying on the street (e.g. after accidents), are major challenges for self-driving cars but occur so rarely that they are hardly ever captured.

Decline of data collection efficiency over distance

Biasing a dataset towards edge cases enforces diversity and leads to an efficient representation of the traffic environment for models to learn from. Rejecting data points that would be redundant for the dataset causes data collection efficiency (data points per recorded kilometer) to decline rapidly over the course of capturing – at some point the “standard situations” are sufficiently represented and new edge cases become increasingly rare.
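The redundancy rejection described above can be sketched as follows. A new sample is kept only if it is sufficiently far from everything already collected; the 2-D feature vectors and distance threshold are toy assumptions – a real system would operate on learned embeddings:

```python
# Redundancy rejection sketch: keep a sample only if its feature vector is
# far enough from every sample already in the dataset.

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def collect(stream, min_dist=1.0):
    dataset = []
    for features in stream:
        if all(dist(features, kept) >= min_dist for kept in dataset):
            dataset.append(features)  # novel enough: keep it
    return dataset

# Mostly "standard situations" clustered near (0, 0), one outlier (edge case).
stream = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (5.0, 5.0), (0.1, 0.1)]
kept = collect(stream)
print(len(kept), "of", len(stream), "samples kept")  # → 2 of 5 samples kept
```

As the dataset grows, an ever larger share of incoming samples falls inside already-covered regions and is rejected – which is exactly the efficiency decline described above.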

For the development of self-driving car technology, this means that with a limited amount of resources it is possible to collect enough data for a proof-of-concept, but a product-grade solution will require orders of magnitude more resources. For a startup that just shoots for its next well-constrained demo, it is usually sufficient to just send some engineers out for measurement rides. Good domain coverage including some of the less frequent events will require designated drivers and eventually even a recording fleet. And some cases are so rare or expensive that simulating them is the only efficient means of capturing them.

So why is data not a major focus?

After going through all the above, the teaser sentence for this article “Surprisingly, (training) dataset management […] is often not receiving major attention in internal development processes.” might make one wonder: If it is such a big and important effort to collect data, why wouldn’t data be a top priority?

New algorithms (GANs, Capsule networks, etc.) and new applications (semantic segmentation, fake celebrity face generation, etc.) are evidently catchier than data collection and dataset management, see Google Trends. Handling the data is certainly more engineering work than groundbreaking research and won’t show up at major conferences.

It is the fate of enabler technologies to receive less attention than their functional counterparts. Unfortunately, this imbalance manifests itself in open source projects, where an abundance of Deep Learning frameworks contrasts with a desert of dataset management solutions. Large companies usually have proprietary solutions in place, but for researchers and small AI companies this is a real inhibitor that prevents them from scaling their technology beyond a certain level.

Striving for proper dataset management

The inspiration for this article came from recurring encounters with unordered datasets tamed by a bunch of scripts, with little information on their domain coverage. Being aware of the importance of high-quality data and the pain its collection and refinement cause, adequate dataset management becomes imperative. In contrast to other stages of the training data pipeline (sensor interfacing, data collection, etc.), dataset management is less tied to individual (sensor, …) configurations and holds more potential for a general solution.

This potential, and the ugliness of some provisional solutions, make dataset management a primary candidate for improvement. But what should proper dataset management look like? Here are some key features to look for:


Permanent storage

All data used for training Machine Learning models must be stored permanently to allow the reproduction of training results. Removing data from datasets should only be possible via soft deletion. Hard deletes should only be allowed where necessary, e.g. if sensors were broken and large amounts of storage are blocked. Independent of whether a soft or hard delete is performed, all derived data, including trained models, should be deleted in the same fashion to avoid inconsistency.
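A minimal sketch of soft deletion with cascading to derived artifacts. The record names and the in-memory registry are assumptions for illustration; a real system would back this with a database:

```python
# Soft-delete sketch: marking a record deleted cascades to everything
# derived from it (datasets, trained models), keeping lineage consistent.
from dataclasses import dataclass, field

@dataclass
class Record:
    record_id: str
    derived_from: list = field(default_factory=list)  # parent record ids
    deleted: bool = False

class Registry:
    def __init__(self):
        self.records = {}

    def add(self, record_id, derived_from=()):
        self.records[record_id] = Record(record_id, list(derived_from))

    def soft_delete(self, record_id):
        """Mark a record deleted and cascade to all records derived from it."""
        self.records[record_id].deleted = True
        for rec in self.records.values():
            if record_id in rec.derived_from and not rec.deleted:
                self.soft_delete(rec.record_id)

reg = Registry()
reg.add("recording_042")
reg.add("dataset_v3", derived_from=["recording_042"])
reg.add("model_v7", derived_from=["dataset_v3"])
reg.soft_delete("recording_042")
print([r.record_id for r in reg.records.values() if r.deleted])
# → ['recording_042', 'dataset_v3', 'model_v7']
```

Because deletion only sets a flag, the raw data stays available for reproducing past training results, yet every consumer can filter out deleted records consistently.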


Traceability

The whole Machine Learning process should be traceable in such a way that every trained model references the environment used for training (git commit hashes, library versions, augmentation steps, random seeds, etc.) and the dataset it was trained with. The training dataset in turn should reference the measurement setup (vehicle, sensor configuration, hardware revisions, sensor recording software) as well as the requirements associated with the dataset and its individual elements.
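A sketch of a training-run manifest capturing the references listed above. The field names and values are illustrative assumptions; in practice the commit hash and library versions would be read from the environment rather than hard-coded:

```python
# Training-run manifest sketch: record everything needed to trace a trained
# model back to its code version, environment, and dataset.
import json
import random

def build_manifest(git_commit, dataset_id, seed, library_versions, augmentations):
    return {
        "git_commit": git_commit,
        "dataset_id": dataset_id,
        "random_seed": seed,
        "library_versions": library_versions,
        "augmentation_steps": augmentations,
    }

manifest = build_manifest(
    git_commit="3f2a9c1",                 # e.g. obtained via `git rev-parse HEAD`
    dataset_id="highway_v3",              # links the model to its training dataset
    seed=42,
    library_versions={"torch": "2.1.0"},  # would be queried from the environment
    augmentations=["horizontal_flip", "color_jitter"],
)
random.seed(manifest["random_seed"])      # reproducible training starts here
print(json.dumps(manifest, indent=2))     # stored next to the model artifact
```

Stored alongside every model artifact, such a manifest makes any training result reproducible and lets the dataset entry, in turn, reference its measurement setup.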



Rich meta data

To understand the performance of a Machine Learning model, it is fundamental to have meta data available that describes the situation in which every data point was acquired in as much detail as possible. These attributes make it possible to match the dataset to the intended use cases, to assess the performance of trained models accordingly (e.g. low performance in rainy weather AND on narrow streets AND at crossings), and to improve the dataset subsequently. These details can be obtained through annotation and by association with external data sources such as maps, weather services, etc.
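Attribute-based dataset slicing can be sketched as below: given per-sample meta data, select all data points matching a use case such as rainy weather AND narrow street AND crossing. The attribute names and sample records are illustrative:

```python
# Metadata-slicing sketch: filter a dataset down to the data points that
# match a given use-case description.

samples = [
    {"id": "s1", "weather": "rain",  "street": "narrow", "crossing": True},
    {"id": "s2", "weather": "sunny", "street": "narrow", "crossing": True},
    {"id": "s3", "weather": "rain",  "street": "wide",   "crossing": False},
    {"id": "s4", "weather": "rain",  "street": "narrow", "crossing": True},
]

def match(sample, **conditions):
    """True if the sample's meta data satisfies all given conditions."""
    return all(sample.get(key) == value for key, value in conditions.items())

use_case = dict(weather="rain", street="narrow", crossing=True)
subset = [s["id"] for s in samples if match(s, **use_case)]
print(subset)  # → ['s1', 's4']
```

The same filter that builds a use-case-specific evaluation set also reveals coverage gaps: an empty or tiny subset signals that more data for that use case must be collected.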

Conclusion and up next

Ideally, this article has cast some light on dataset building and motivates increased effort in this domain. More public discussion and open source activity would contribute substantially to industrial Machine Learning applications, particularly in autonomous driving. For those who have not been much involved in data management before, this article may serve as an introduction and high-level overview.

In the coming articles we’ll look at the data pipeline for autonomous driving and – at last! – Machine Learning applications.

This article was originally inspired by the cartoon below.

[1] Dataset bias leading to catastrophic failure when deploying Machine Learning models is often illustrated by the “tank” example, in which a neural network that was supposed to decide whether images contained tanks performed well until it was found that it had only learned the features associated with the time when the images were collected, and nothing about tanks. This story is likely an urban legend.

