U-Net: Convolutional Networks for Biomedical
Image Segmentation
Olaf Ronneberger, Philipp Fischer, and Thomas Brox
Computer Science Department and BIOSS Centre for Biological Signalling Studies,
University of Freiburg, Germany
ronneber@informatik.uni-freiburg.de,
WWW home page: http://lmb.informatik.uni-freiburg.de/
Abstract. There is large consent that successful training of deep net-
works requires many thousand annotated training samples. In this pa-
per, we present a network and training strategy that relies on the strong
use of data augmentation to use the available annotated samples more
efficiently. The architecture consists of a contracting path to capture
context and a symmetric expanding path that enables precise localiza-
tion. We show that such a network can be trained end-to-end from very
few images and outperforms the prior best method (a sliding-window
convolutional network) on the ISBI challenge for segmentation of neu-
ronal structures in electron microscopic stacks. Using the same net-
work trained on transmitted light microscopy images (phase contrast
and DIC) we won the ISBI cell tracking challenge 2015 in these cate-
gories by a large margin. Moreover, the network is fast. Segmentation
of a 512x512 image takes less than a second on a recent GPU. The full
implementation (based on Caffe) and the trained networks are available
at http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net.
1 Introduction
In the last two years, deep convolutional networks have outperformed the state of
the art in many visual recognition tasks, e.g. [7,3]. While convolutional networks
have already existed for a long time [8], their success was limited due to the
size of the available training sets and the size of the considered networks. The
breakthrough by Krizhevsky et al. [7] was due to supervised training of a large
network with 8 layers and millions of parameters on the ImageNet dataset with
1 million training images. Since then, even larger and deeper networks have been
trained [12].
The typical use of convolutional networks is on classification tasks, where
the output to an image is a single class label. However, in many visual tasks,
especially in biomedical image processing, the desired output should include
localization, i.e., a class label is supposed to be assigned to each pixel. More-
over, thousands of training images are usually beyond reach in biomedical tasks.
Hence, Ciresan et al. [1] trained a network in a sliding-window setup to predict
the class label of each pixel by providing a local region (patch) around that pixel
arXiv:1505.04597v1 [cs.CV] 18 May 2015