VOC Dataset Examples

You can view the PASCAL VOC 2012 leaderboard here.


Semantic segmentation APIs in PyTorch are not well standardised across repositories, so it may take a lot of glue code to get a model working with this evaluation procedure (which is based on torchvision).

For easier VOC integration with sotabench, it is recommended to use the more general sotabencheval API.

Getting Started

You'll need the following in the root of your repository:

  • file - contains benchmarking logic; the server will run this on each commit
  • requirements.txt file - Python dependencies to be installed before running
  • (optional) - any advanced dependencies or setup, e.g. compilation

Once you connect your repository, the platform will run your file whenever you commit to master.

We now show how to write the file to evaluate a PyTorch semantic segmentation model with the torchbench library, so that your results can be recorded and reported for the community.

The VOC Evaluation Class

You can import the evaluation class from the following module:

from torchbench.semantic_segmentation import PASCALVOC

The PASCALVOC class contains several components used in the evaluation, such as the dataset:

PASCALVOC.dataset
# torchvision.datasets.voc.VOCSegmentation

And some default arguments used for evaluation (which can be overridden):

PASCALVOC.normalize
# <torchbench.semantic_segmentation.transforms.Normalize at 0x7f9d645d2160>

PASCALVOC.transforms
# <torchbench.semantic_segmentation.transforms.Compose at 0x7f9d645d2278>

PASCALVOC.send_data_to_device
# <function torchbench.utils.default_data_to_device>

PASCALVOC.collate_fn
# <function torchbench.semantic_segmentation.utils.default_seg_collate_fn>

PASCALVOC.model_output_transform
# <function torchbench.semantic_segmentation.utils.default_seg_output_transform>

We will explain these different options shortly and how you can manipulate them to get the evaluation logic to play nicely with your model.

An evaluation call - which performs evaluation, and if on the server, saves the results - looks like the following through the benchmark() method:

from torchvision.models.segmentation import fcn_resnet101
model = fcn_resnet101(num_classes=21, pretrained=True)

PASCALVOC.benchmark(
    model=model,
    paper_model_name='FCN ResNet-101',
)

These are the key arguments. model is usually an nn.Module object but, more generally, can be any object with a forward method that takes in input data and outputs predictions. paper_model_name refers to the name of the model and paper_arxiv_id (optionally) refers to the paper from which the model originated. If these two arguments match a recorded paper result, then sotabench will match your model with the paper and compare your code's results with the results reported in the paper.

A full example

Below is an example for the torchvision repository, benchmarking an FCN ResNet-101 model:

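from torchbench.semantic_segmentation import PASCALVOC
from torchbench.semantic_segmentation.transforms import (
    Normalize,
    Resize,
    ToTensor,
    Compose,
)
# cat_list is assumed to live next to default_seg_collate_fn in
# torchbench.semantic_segmentation.utils
from torchbench.semantic_segmentation.utils import cat_list
from torchvision.models.segmentation import fcn_resnet101
import torchvision.transforms as transforms
import PIL

def model_output_function(output, target):
    # Take the highest-scoring class per pixel and flatten both tensors to 1D
    return output['out'].argmax(1).flatten(), target.flatten()

def seg_collate_fn(batch):
    # Pad images with 0 and targets with 255 so they share one batch shape
    images, targets = list(zip(*batch))
    batched_imgs = cat_list(images, fill_value=0)
    batched_targets = cat_list(targets, fill_value=255)
    return batched_imgs, batched_targets

model = fcn_resnet101(num_classes=21, pretrained=True)

normalize = Normalize(
    mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]
)
my_transforms = Compose([Resize((520, 480)), ToTensor(), normalize])

PASCALVOC.benchmark(
    model=model,
    transforms=my_transforms,
    model_output_transform=model_output_function,
    collate_fn=seg_collate_fn,
    paper_model_name='FCN ResNet-101',
)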

PASCALVOC.benchmark() Arguments

The source code for the PASCALVOC evaluation method can be found here. We now explain each argument.


model: a PyTorch module (e.g. an nn.Module object) that takes in VOC data and outputs predictions.

For example, from the torchvision repository:

from torchvision.models.segmentation import fcn_resnet101
model = fcn_resnet101(num_classes=21, pretrained=True)


model_description (str, optional): Optional model description.

For example:

model_description = 'Using ported TensorFlow weights'


input_transform: the composed transforms applied to the input data (the images), e.g. resizing (transforms.Resize), center cropping, conversion to tensors and normalization.

For example:

import PIL.Image
import torchvision.transforms as transforms
input_transform = transforms.Compose([
    transforms.Resize(512, PIL.Image.BICUBIC),
    transforms.ToTensor(),
])


target_transform: the composed transforms applied to the target data (the labels).


transforms: the composed transforms applied jointly to the input data (the images) and the target data (the labels), for example resizing the pair together.

The example below illustrates this; note that __call__ takes two arguments and returns two values (ordinary torchvision transforms return one result).

from torchvision.transforms import functional as F

class Compose(object):
    def __init__(self, transforms):
        self.transforms = transforms

    def __call__(self, image, target):
        for t in self.transforms:
            image, target = t(image, target)
        return image, target

class ToTensor(object):
    def __call__(self, image, target):
        image = F.to_tensor(image)
        return image, target

class ImageResize(object):
    def __init__(self, resize_shape):
        self.resize_shape = resize_shape

    def __call__(self, image, target):
        image = F.resize(image, self.resize_shape)
        return image, target

transforms = Compose([ImageResize((512, 512)), ToTensor()])

Note that the default transforms are:

from torchbench.semantic_segmentation.transforms import (Normalize, Resize, ToTensor, Compose)
normalize = Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
transforms = Compose([Resize((520, 480)), ToTensor(), normalize])
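
The reason these transforms receive the image and the target together is that geometric operations must be applied identically to both, otherwise pixels and labels no longer line up. As a dependency-free sketch (plain nested lists stand in for images, and PairedCompose and PairedRandomHorizontalFlip are hypothetical names that are not part of torchbench), a joint random flip might look like this:

```python
import random


class PairedCompose:
    """Chains transforms that each take and return an (image, target) pair."""

    def __init__(self, transforms):
        self.transforms = transforms

    def __call__(self, image, target):
        for t in self.transforms:
            image, target = t(image, target)
        return image, target


class PairedRandomHorizontalFlip:
    """Flips image and target together so that pixels and labels stay aligned."""

    def __init__(self, flip_prob, rng=None):
        self.flip_prob = flip_prob
        self.rng = rng or random.Random()

    def __call__(self, image, target):
        # One random draw decides the fate of both image and target
        if self.rng.random() < self.flip_prob:
            image = [row[::-1] for row in image]
            target = [row[::-1] for row in target]
        return image, target


# flip_prob=1.0 always flips, which makes the example deterministic
pipeline = PairedCompose([PairedRandomHorizontalFlip(flip_prob=1.0)])
image = [[1, 2, 3], [4, 5, 6]]
target = [[0, 0, 7], [0, 7, 7]]
flipped_image, flipped_target = pipeline(image, target)
```

With real data the same pattern would wrap calls from torchvision.transforms.functional, as in the ImageResize class above.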


model_output_transform (callable, optional): An optional function that takes in the model output (after the forward pass) together with the target, and transforms them. The result is then passed into the evaluation function.

This is useful if you have to do further processing steps after inference to get the predictions in the right format for evaluation.

The model evaluation for each batch is as follows:

with torch.no_grad():
    for i, (input, target) in enumerate(iterator):
        input, target = send_data_to_device(input, target, device=device)
        output = model(input)
        output, target = model_output_transform(output, target)
        confmat.update(target, output)

The default model_output_transform is:

def default_seg_output_transform(output, target):
    return output["out"].argmax(1).flatten(), target.flatten()

We can see that the output and target are flattened to 1D tensors and that, for the output, we take the class with the highest predicted score (the argmax over the class dimension). Each element in each tensor represents a pixel and contains a class, e.g. class 6; we compare the model predictions against the ground truth labels pixel by pixel to calculate the accuracy.
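
The confusion matrix that accumulates these comparisons is not shown in this document. A minimal pure-Python sketch of how pixel accuracy and mean IoU can be derived from flattened predictions and targets is given below. The function names are hypothetical, and skipping label 255 is an assumption based on the padding value used for targets in the default collate function.

```python
def confusion_matrix(targets, preds, num_classes, ignore_index=255):
    """Counts pixels per (true class, predicted class) pair."""
    mat = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(targets, preds):
        if t == ignore_index:  # assumed: padded pixels are not scored
            continue
        mat[t][p] += 1
    return mat


def pixel_accuracy(mat):
    """Fraction of scored pixels whose predicted class is correct."""
    correct = sum(mat[i][i] for i in range(len(mat)))
    total = sum(sum(row) for row in mat)
    return correct / total


def mean_iou(mat):
    """Average over classes of TP / (TP + FP + FN)."""
    n = len(mat)
    ious = []
    for c in range(n):
        tp = mat[c][c]
        fn = sum(mat[c]) - tp
        fp = sum(mat[r][c] for r in range(n)) - tp
        denom = tp + fp + fn
        if denom:  # skip classes that never occur
            ious.append(tp / denom)
    return sum(ious) / len(ious)


# Six flattened pixels; the last one is padding and is ignored
targets = [0, 0, 1, 1, 2, 255]
preds = [0, 1, 1, 1, 2, 0]
mat = confusion_matrix(targets, preds, num_classes=3)
accuracy = pixel_accuracy(mat)
miou = mean_iou(mat)
```

Here four of the five scored pixels are correct, and the per-class IoU values are 1/2, 2/3 and 1.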


collate_fn: how the dataset is collated; an optional callable passed into the DataLoader.

As an example, the default collate function is:

def default_seg_collate_fn(batch):
    images, targets = list(zip(*batch))
    batched_imgs = cat_list(images, fill_value=0)
    batched_targets = cat_list(targets, fill_value=255)
    return batched_imgs, batched_targets
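
The implementation of cat_list is not shown here. Assuming it pads every item in the batch to the largest height and width using the given fill value, a pure-Python analogue on nested lists (pad_to_common_shape and toy_collate_fn are hypothetical names) behaves as follows:

```python
def pad_to_common_shape(items, fill_value):
    """Pads 2D nested lists on the right and bottom to a shared shape."""
    max_h = max(len(item) for item in items)
    max_w = max(len(row) for item in items for row in item)
    padded = []
    for item in items:
        # Extend each row to the widest row, then add full rows of fill
        rows = [list(row) + [fill_value] * (max_w - len(row)) for row in item]
        rows += [[fill_value] * max_w for _ in range(max_h - len(item))]
        padded.append(rows)
    return padded


def toy_collate_fn(batch):
    """Mirrors the default: images padded with 0, targets with 255."""
    images, targets = zip(*batch)
    return pad_to_common_shape(images, 0), pad_to_common_shape(targets, 255)


batch = [
    ([[1, 1], [1, 1]], [[0, 1], [1, 0]]),  # 2x2 image and target
    ([[2]], [[3]]),                        # 1x1 image and target
]
imgs, tgts = toy_collate_fn(batch)
```

The different fill values matter: 0 is a neutral pixel value for images, while 255 marks target pixels that do not correspond to any real label.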


send_data_to_device: an optional function specifying how the data is sent to a device.

As an example, the PASCAL VOC default is:

def default_data_to_device(input, target=None, device: str = "cuda", non_blocking: bool = True):
    """Sends data output from a PyTorch Dataloader to the device."""

    input = input.to(device=device, non_blocking=non_blocking)

    if target is not None:
        target = target.to(device=device, non_blocking=non_blocking)

    return input, target
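
If your dataset yields nested structures rather than single tensors, the default function will not work directly. A sketch of a replacement that keeps the same signature is shown below. nested_data_to_device is a hypothetical name, and FakeTensor exists only so that the example runs without PyTorch; any object with a .to method, such as a torch.Tensor, would be handled the same way.

```python
def nested_data_to_device(input, target=None, device="cuda", non_blocking=True):
    """Moves tensors inside lists, tuples and dicts to the device."""

    def move(obj):
        if hasattr(obj, "to"):
            return obj.to(device=device, non_blocking=non_blocking)
        if isinstance(obj, dict):
            return {key: move(value) for key, value in obj.items()}
        if isinstance(obj, (list, tuple)):
            return type(obj)(move(item) for item in obj)
        return obj  # ints, strings and None are returned unchanged

    return move(input), move(target)


class FakeTensor:
    """Stand-in for torch.Tensor so the sketch runs without PyTorch."""

    def __init__(self, device="cpu"):
        self.device = device

    def to(self, device, non_blocking=False):
        return FakeTensor(device)


images = [FakeTensor(), FakeTensor()]
labels = {"mask": FakeTensor(), "image_id": 7}
moved_images, moved_labels = nested_data_to_device(images, labels, device="cuda:0")
```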


data_root (str): The location of the VOC dataset - change this parameter when evaluating locally if your VOC data is located in a different folder (or alternatively if you want to download to an alternative location).

Note that this parameter will be overridden when the evaluation is performed on the server, so it is solely for your local use.


num_workers (int): The number of workers to use for the DataLoader.


batch_size (int): The batch size to use for evaluation; if you get memory errors, reduce this (halving it each time) until your model fits onto the GPU.
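
For local runs, the halving strategy can be automated. The sketch below assumes that an out-of-memory failure surfaces as a RuntimeError whose message contains "out of memory", as CUDA memory errors raised by PyTorch typically do. find_fitting_batch_size and run_evaluation are hypothetical names, and fake_evaluation merely simulates a GPU that fits at most 8 images.

```python
def find_fitting_batch_size(run_evaluation, start_batch_size):
    """Halves the batch size until run_evaluation stops running out of memory."""
    batch_size = start_batch_size
    while batch_size >= 1:
        try:
            return batch_size, run_evaluation(batch_size)
        except RuntimeError as err:
            if "out of memory" not in str(err):
                raise  # unrelated errors should not be swallowed
            batch_size //= 2
    raise RuntimeError("evaluation does not fit in memory even with batch_size=1")


def fake_evaluation(batch_size):
    """Simulated evaluation; a real one would call the benchmark locally."""
    if batch_size > 8:
        raise RuntimeError("CUDA out of memory (simulated)")
    return "ok"


fitting_batch_size, result = find_fitting_batch_size(fake_evaluation, 32)
```

Starting from 32, the simulated run fails at 32 and 16 and succeeds at 8. In practice run_evaluation would wrap a call to PASCALVOC.benchmark with the given batch_size.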


paper_model_name (str, optional): The name of the model from the paper, if you want to link your build to a machine learning paper. See the VOC benchmark page for model names, e.g. on the paper leaderboard tab.


paper_arxiv_id (str, optional): Optional linking to arXiv if you want to link to papers on the leaderboard; put in the corresponding paper's arXiv ID, e.g. '1611.05431'.


paper_pwc_id (str, optional): Optional linking to Papers With Code; put in the corresponding Papers With Code URL slug, e.g. 'u-gat-it-unsupervised-generative-attentional'.


paper_results (dict, optional): If the paper you are reproducing does not have model results on the sotabench leaderboard, you can specify the paper results yourself through this argument, where keys are metric names and values are metric values, e.g.:

{'Accuracy': 0.745, 'Mean IOU': 0.592}

Ensure that the metric names match those on the sotabench leaderboard - for VOC it should be 'Accuracy', 'Mean IOU'.
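
Since a misspelled metric name is easy to miss, a small local check can help before committing. This is only a sketch: check_paper_results and EXPECTED_VOC_METRICS are hypothetical names, and the check is not something torchbench is known to perform for you.

```python
# Metric names taken from the sentence above
EXPECTED_VOC_METRICS = {"Accuracy", "Mean IOU"}


def check_paper_results(paper_results):
    """Raises ValueError if a key is not a VOC leaderboard metric name."""
    unknown = sorted(set(paper_results) - EXPECTED_VOC_METRICS)
    if unknown:
        raise ValueError(f"Unexpected metric names: {unknown}")
    return paper_results


paper_results = check_paper_results({"Accuracy": 0.745, "Mean IOU": 0.592})
```

The returned dictionary can then be passed as the paper_results argument.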


pytorch_hub_url (str, optional): Optional linking to a PyTorch Hub URL if your model is listed there, e.g. 'nvidia_deeplearningexamples_waveglow'.

Need More Help?

Head on over to the Computer Vision section of the sotabench forums if you have any questions or difficulties.