Info: please check our previous work on domain adaptation between virtual and real environments here.




@inproceedings{RosCVPR16,
  author    = {German Ros and Laura Sellart and Joanna Materzynska and David Vazquez and Antonio Lopez},
  title     = {{The SYNTHIA Dataset}: A Large Collection of Synthetic Images for Semantic Segmentation of Urban Scenes},
  booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2016}
}



Vision-based semantic segmentation in urban scenarios is a key functionality for autonomous driving. Recent revolutionary results of deep convolutional neural networks (DCNNs) foreshadow the advent of reliable classifiers to perform such visual tasks. However, DCNNs require learning many parameters from raw images; thus, a sufficient amount of diverse images with class annotations is needed. These annotations are obtained via cumbersome human labour, which is particularly challenging for semantic segmentation since pixel-level annotations are required. In this paper, we propose to use a virtual world to automatically generate realistic synthetic images with pixel-level annotations. Then, we address the question of how useful such data can be for semantic segmentation, in particular when using a DCNN paradigm. In order to answer this question we have generated a synthetic collection of diverse urban images, named SYNTHIA, with automatically generated class annotations. We use SYNTHIA in combination with publicly available real-world urban images with manually provided annotations. Then, we conduct experiments with DCNNs that show how the inclusion of SYNTHIA in the training stage significantly improves performance on the semantic segmentation task.

1. Driving Semantic Segmentation Datasets


Driving-scene datasets for semantic segmentation. We report the number of training images (T), validation images (V), and the total (A).

2. Reference Deconvolutional Network architecture (Tiny-Net) 


Tiny-Net, a compact deconvolutional network for semantic segmentation. We use it as one of the architectures in our experiments.
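The exact layer configuration of Tiny-Net is given in the figure above, not in this text; the following is a minimal sketch of the general shape behaviour of such a symmetric encoder-decoder (deconvolutional) network, where each encoder stage halves the spatial resolution and each decoder stage doubles it back so that the output label map matches the input image. The stage count and input size below are illustrative, not the actual Tiny-Net configuration.

```python
# Trace feature-map spatial sizes through a symmetric encoder-decoder:
# each encoder stage applies 2x downsampling (e.g. 2x2 max-pooling),
# each decoder stage applies 2x upsampling (e.g. unpooling/deconvolution).
# Stage count and input resolution are illustrative assumptions.

def encoder_decoder_shapes(height, width, num_stages):
    """Return the list of (h, w) feature-map sizes, input first, output last."""
    shapes = [(height, width)]
    h, w = height, width
    for _ in range(num_stages):      # encoder: halve spatial size per stage
        h, w = h // 2, w // 2
        shapes.append((h, w))
    for _ in range(num_stages):      # decoder: double spatial size per stage
        h, w = h * 2, w * 2
        shapes.append((h, w))
    return shapes

# A low-resolution 160x240 input through a 3-stage encoder-decoder:
shapes = encoder_decoder_shapes(160, 240, 3)
print(shapes[0], shapes[3], shapes[-1])
```

Note that the output only matches the input exactly when the input dimensions are divisible by 2 raised to the number of stages; in practice inputs are padded or cropped accordingly.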

3. Semantic Segmentation Results

  • Quantitative Results

First, we run a direct evaluation: training our models on SYNTHIA and testing them on independent real domains (KITTI, CamVid, etc.). Note that in our experiments we use low-resolution images to speed up the training process, although SYNTHIA images are generated at very high resolution. Despite this, the models trained on SYNTHIA perform well on real scenes.
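Segmentation performance in this setting is commonly summarised as per-class intersection-over-union (IoU); the sketch below shows the standard computation over flat per-pixel label lists. This is a generic illustration of the metric, not a quote of the paper's exact evaluation protocol.

```python
def per_class_iou(pred, gt, num_classes):
    """Per-class intersection-over-union for flat per-pixel label lists.

    pred, gt: sequences of integer class labels (one entry per pixel).
    Returns a dict {class_id: IoU}, skipping classes absent from both.
    """
    ious = {}
    for c in range(num_classes):
        inter = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        union = sum(1 for p, g in zip(pred, gt) if p == c or g == c)
        if union > 0:
            ious[c] = inter / union
    return ious

# Toy 6-pixel example with 3 classes:
pred = [0, 0, 1, 1, 2, 2]
gt   = [0, 1, 1, 1, 2, 0]
print(per_class_iou(pred, gt, 3))
```

Mean IoU is then just the average of the per-class values over the classes present in the ground truth.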

Results of training a T-Net and an FCN on SYNTHIA-Rand and evaluating them on state-of-the-art datasets of driving scenes.

Then we evaluate the gain of adding a small collection of images from the target domain (real images in our case) to the training stage, along with SYNTHIA data. Here, using domain adaptation techniques, such as the Balanced Gradient Contribution approach, leads to a considerable boost in accuracy, as shown below.
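The idea behind balancing gradient contributions can be sketched with a toy model: each update draws one batch from the large virtual domain and one from the small real domain, and mixes their gradients with a fixed weight so the scarce real data is not drowned out. The scalar least-squares model and the weighting factor `lambda_real` below are illustrative assumptions, not the paper's exact BGC formulation.

```python
# Toy sketch of a balanced-gradient update for a scalar model y = w * x.
# The virtual-domain batch is consistent with w* = 2; the real-domain
# batch is slightly shifted, mimicking the virtual-to-real domain gap.

def grad_mse(w, batch):
    """Gradient of mean squared error 0.5 * (w*x - y)^2 with respect to w."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)

def bgc_step(w, virtual_batch, real_batch, lr=0.1, lambda_real=0.5):
    """One update mixing virtual and real gradients with weight lambda_real."""
    g = (1 - lambda_real) * grad_mse(w, virtual_batch) \
        + lambda_real * grad_mse(w, real_batch)
    return w - lr * g

w = 0.0
virtual_batch = [(1.0, 2.0), (2.0, 4.0)]   # large, clean synthetic data
real_batch = [(1.0, 2.2)]                   # small, shifted target domain
for _ in range(200):
    w = bgc_step(w, virtual_batch, real_batch)
print(round(w, 2))
```

With equal weights the solution settles between the two domain optima rather than collapsing onto the abundant virtual data, which is the intended balancing effect.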

Comparison of training a T-Net and an FCN on real images only against the effect of extending the training sets with SYNTHIA-Rand.

4. Qualitative Results

Here we show some qualitative results for different models (T-Net and FCN) under several conditions: (i) trained with real data only (R); (ii) trained with virtual data only (V); and (iii) trained on a combination of real and virtual data. We can see how the blend of real and virtual data helps to produce a model with higher accuracy.