ECCV 2022 BDD100K Challenges


We are hosting multi-object tracking (MOT) and segmentation (MOTS) challenges based on BDD100K, the largest open driving video dataset, as part of the ECCV 2022 Self-supervised Learning for Next-Generation Industry-level Autonomous Driving (SSLAD) Workshop.

Participation

Please first test your results on our eval.ai challenge pages to obtain your scores. Only teams that outperform our baselines are considered in the challenge rankings. This page provides more details on our challenges.

Overview

This is a large-scale tracking challenge under the most diverse driving conditions. Understanding the temporal association and shape of objects within videos is one of the fundamental yet challenging tasks for autonomous driving. The BDD100K MOT and MOTS datasets provide diverse driving scenarios with high-quality instance segmentation masks under complicated occlusions and reappearing patterns, serving as a great testbed for the reliability of tracking and segmentation algorithms in real scenes. The BDD100K dataset also includes 100K raw video sequences, which can be readily used for self-supervised learning. We hope that utilizing large-scale unlabeled video data in self-driving can further boost the performance of MOT & MOTS. In this challenge, we provide two tracks: (1) Main track - standard MOT and MOTS, and (2) Teaser track - self-supervised MOT and MOTS. We encourage participants from both academia and industry.

Challenge Tracks

We introduce two challenge tracks for our BDD100K challenges: standard multi-object tracking and self-supervised tracking. For both tracks, you can use the full 100K raw video sequences, which are mostly unlabeled.

Main Track: Multi-Object Tracking

  • Multiple Object Tracking (MOT): Given a video sequence of camera images, predict 2D bounding boxes for each object and their association across frames.

  • Multiple Object Tracking and Segmentation (MOTS): In addition to MOT, also predict segmentation masks for each object.

Teaser Track: Self-Supervised Tracking

In this track, we investigate training object trackers without relying on tracking annotations, which can be costly to obtain. Object bounding boxes and masks are still available, but associations between objects are not.

  • Self-Supervised Multiple Object Tracking (MOT)

  • Self-Supervised Multiple Object Tracking and Segmentation (MOTS)

Prizes

All participants will receive certificates with their ranking, if desired. (Additional prizes coming soon.)

Timeline

The challenge starts on August 1st, 2022 and will end at 5 PM GMT on October 10th, 2022. You can use this tool to convert to your local time.

Data

BDD100K was collected across diverse scenarios, covering New York, the San Francisco Bay Area, and other regions in the US. It contains scenes in a wide variety of locations (highways, city streets, residential areas, etc.), weather conditions (including rain and snow), and times of day. The BDD100K MOT set contains 2,000 fully annotated 40-second sequences at 5 FPS under different weather conditions, times of day, and scene types. We use 1,400/200/400 videos for train/val/test, containing a total of 160K instances and 4M objects. The MOTS set uses a subset of the MOT videos, with 154/32/37 videos for train/val/test, containing 25K instances and 480K object masks. For all challenges, the full 100K raw video sequences at 30 FPS are also available for training.
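Since the annotated MOT stream runs at 5 FPS while the raw videos run at 30 FPS, annotated frames correspond to every 6th raw frame. The sketch below illustrates this alignment; the stride arithmetic follows from the frame rates stated above, but the authoritative mapping should always be taken from the released labels.

```python
# Hypothetical sketch: aligning 30 FPS raw frames with the 5 FPS annotated
# MOT frames by taking every 6th frame (30 / 5 = 6).

RAW_FPS = 30
ANNOTATED_FPS = 5
STRIDE = RAW_FPS // ANNOTATED_FPS  # 6

def annotated_frame_indices(num_raw_frames: int) -> list:
    """Indices of raw frames that line up with the 5 FPS annotated stream."""
    return list(range(0, num_raw_frames, STRIDE))

# A 40-second sequence at 30 FPS has 1200 raw frames -> 200 annotated frames,
# matching the 40-second / 5 FPS sequences described above.
indices = annotated_frame_indices(40 * RAW_FPS)
```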

Baselines

We provide two sets of baselines, one for each track, which serve as examples of how to utilize the BDD100K data.

  • Main Track:
    • We use QDTrack [1] as the MOT baseline and PCAN [2] as the MOTS baseline.
  • Teaser Track:
    • We train the QDTrack baseline model but without tracking annotations. To do so, we only use single-frame annotations and simulate tracking by matching each object to itself. For MOTS, we add a mask prediction head.

You can also find the baselines in the BDD100K Model Zoo.
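The self-matching idea behind the teaser baseline can be sketched as follows. This is a simplified illustration of the concept only, not the released training code: the label dictionaries loosely follow the Scalabel schema, and `make_self_pairs` is a hypothetical helper.

```python
# Simplified illustration of the teaser-track baseline idea: with no tracking
# annotations available, build positive training pairs by matching each
# single-frame object to (a copy of) itself under a throwaway track id.

import itertools

_next_id = itertools.count()

def make_self_pairs(frame_labels: list) -> list:
    """Turn single-frame boxes into (anchor, positive) pairs for
    association training. In practice the positive view would come from an
    augmented version of the same image; here we simply copy the label."""
    pairs = []
    for label in frame_labels:
        pseudo = dict(label, id=str(next(_next_id)))  # pseudo track id
        pairs.append((pseudo, dict(pseudo)))
    return pairs

pairs = make_self_pairs(
    [{"category": "car", "box2d": {"x1": 0, "y1": 0, "x2": 10, "y2": 10}}]
)
```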

Submission

For submission, please follow the format below for each challenge.

MOT Format

To evaluate your algorithms on the BDD100K MOT benchmark, the submission must be in the standard Scalabel format, as one of the following:

  • A zip file of a folder that contains JSON files of each video.
  • A zip file of a JSON file of the entire evaluation set.

The JSON file for each video should contain a list of per-frame result dictionaries with the following structure:

    - videoName: str, name of current sequence
    - name: str, name of current frame
    - frameIndex: int, index of current frame within sequence
    - labels []:
        - id: str, unique instance id of prediction in current sequence
        - category: str, name of the predicted category
        - box2d []:
            - x1: float
            - y1: float
            - x2: float
            - y2: float

You can find an example result file here.
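A minimal sketch of assembling one per-frame result in this format is shown below. The video and frame names are placeholders, not real BDD100K identifiers.

```python
import json

# Minimal sketch of one per-frame MOT result in the Scalabel format described
# above. Names and coordinates are placeholder values.

frame = {
    "videoName": "example-sequence",          # hypothetical sequence name
    "name": "example-sequence-0000001.jpg",   # hypothetical frame name
    "frameIndex": 0,
    "labels": [
        {
            "id": "0",          # unique instance id within the sequence
            "category": "car",
            "box2d": {"x1": 100.0, "y1": 200.0, "x2": 180.0, "y2": 260.0},
        }
    ],
}

# One JSON file per video contains a list of such frame dictionaries.
video_results = [frame]
json_str = json.dumps(video_results)
```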

MOTS Format

To evaluate your algorithms on the BDD100K MOTS benchmark, the submission must be in the standard Scalabel format, as one of the following:

  • A zip file of a folder that contains JSON files of each video.
  • A zip file of a JSON file of the entire evaluation set.

The JSON file for each video should contain a list of per-frame result dictionaries with the following structure:

    - videoName: str, name of current sequence
    - name: str, name of current frame
    - frameIndex: int, index of current frame within sequence
    - labels []:
        - id: str, unique instance id of prediction in current sequence
        - category: str, name of the predicted category
        - rle:
            - counts: str
            - size: (height, width)

You can find an example result file here.
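The MOTS structure differs from MOT only in replacing `box2d` with an `rle` mask, as sketched below. The `counts` string is a placeholder; a real submission would use a COCO-style run-length encoding of the instance mask (e.g. produced by `pycocotools.mask.encode`), and the names here are not real BDD100K identifiers.

```python
import json

# Minimal sketch of one per-frame MOTS result. "counts" is a placeholder for
# a real COCO-style RLE string; "size" is [height, width] of the image.

frame = {
    "videoName": "example-sequence",          # hypothetical sequence name
    "name": "example-sequence-0000001.jpg",   # hypothetical frame name
    "frameIndex": 0,
    "labels": [
        {
            "id": "0",
            "category": "pedestrian",
            "rle": {"counts": "<rle-string>", "size": [720, 1280]},
        }
    ],
}

json_str = json.dumps([frame])
```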

Evaluation Server

You can submit your predictions to our challenge evaluation servers hosted on EvalAI:

Note that these are separate servers used specifically for the challenges. Submissions to the public MOT and MOTS servers will not be used.

Submission Policy

You can make 3 successful submissions per month (at most 1 per day) to the test set and unlimited to the validation set. You can modify the visibility of your submission to be public or private. Before the final deadline, please make your final submission public so it is visible on the public leaderboard.

Evaluation

We provide more details here regarding evaluation.

Super-category

In addition to evaluating all 8 classes, we also evaluate results for the 3 super-categories specified below. The super-category results are provided for reference only.

    "HUMAN":   ["pedestrian", "rider"],
    "VEHICLE": ["car", "bus", "truck", "train"],
    "BIKE":    ["motorcycle", "bicycle"]
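As an illustration of how a per-super-category score follows from per-class scores, the sketch below averages made-up per-class numbers using the mapping above; the averaging scheme is an assumption for illustration, and the official evaluation toolkit is authoritative.

```python
# Sketch: averaging per-class scores into the three super-categories listed
# above. The per-class scores are made-up numbers for illustration only.

SUPER_CATEGORIES = {
    "HUMAN": ["pedestrian", "rider"],
    "VEHICLE": ["car", "bus", "truck", "train"],
    "BIKE": ["motorcycle", "bicycle"],
}

def super_category_means(per_class: dict) -> dict:
    """Mean of the per-class scores within each super-category."""
    return {
        name: sum(per_class[c] for c in classes) / len(classes)
        for name, classes in SUPER_CATEGORIES.items()
    }

scores = {"pedestrian": 40.0, "rider": 30.0, "car": 60.0, "bus": 50.0,
          "truck": 45.0, "train": 5.0, "motorcycle": 25.0, "bicycle": 35.0}
means = super_category_means(scores)
```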

Ignore Regions

After the bounding-box matching step in evaluation, we ignore all detected false-positive boxes that have more than 50% overlap with a crowd region (a ground-truth box with the “Crowd” attribute).

For simplicity, we also ignore object regions annotated as one of the 3 distractor classes (“other person”, “trailer”, and “other vehicle”) using the same strategy as for crowd regions.
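The filtering rule above can be sketched as follows. We assume “overlap” means the fraction of the detection box covered by the crowd region (intersection area divided by detection area); the official evaluation toolkit is authoritative on the exact definition.

```python
# Sketch of the ignore-region filtering described above, under the assumption
# that "overlap" = intersection area / detection-box area.

def covered_fraction(det, crowd):
    """Fraction of the detection box (x1, y1, x2, y2) covered by the crowd box."""
    dx1, dy1, dx2, dy2 = det
    cx1, cy1, cx2, cy2 = crowd
    iw = max(0.0, min(dx2, cx2) - max(dx1, cx1))
    ih = max(0.0, min(dy2, cy2) - max(dy1, cy1))
    det_area = max(0.0, dx2 - dx1) * max(0.0, dy2 - dy1)
    return (iw * ih) / det_area if det_area > 0 else 0.0

def keep_false_positive(det, crowd_regions, threshold=0.5):
    """A false-positive box is ignored if any crowd region covers >50% of it."""
    return all(covered_fraction(det, c) <= threshold for c in crowd_regions)

keep = keep_false_positive((0, 0, 10, 10), [(0, 0, 10, 6)])  # 60% covered -> ignored
```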

Pre-training

Pre-training your network on ImageNet is fair game, but if other datasets are used, please note this in the submission description. We will only rank methods that use no external datasets other than ImageNet.

Metrics

We employ mean Higher Order Tracking Accuracy (mHOTA, the mean of per-category HOTA over the 8 categories) as our primary evaluation metric for ranking. We also report mean Multiple Object Tracking Accuracy (mMOTA) and mean ID F1 score (mIDF1), which were previously used as the main metrics. All metrics are detailed below. Note that, unless stated otherwise, overall performance is measured over all objects without considering the category. For MOTS, we use the same set of metrics as for MOT; the only difference lies in the computation of the distance matrices, which use box IoU for MOT and mask IoU for MOTS.

  • mHOTA (%): mean Higher Order Tracking Accuracy [3] across all 8 categories.

  • mMOTA (%): mean Multiple Object Tracking Accuracy [4] across all 8 categories.

  • mIDF1 (%): mean ID F1 score [5] across all 8 categories.

  • mMOTP (%): mean Multiple Object Tracking Precision [4] across all 8 categories.

  • HOTA (%): Higher Order Tracking Accuracy [3]. It balances the evaluation of detection and association into a single unified metric.

  • MOTA (%): Multiple Object Tracking Accuracy [4]. It measures the errors from false positives, false negatives and identity switches.

  • IDF1 (%): ID F1 score [5]. The ratio of correctly identified detections over the average number of ground-truths and detections.

  • MOTP (%): Multiple Object Tracking Precision [4]. It measures the misalignments between ground-truths and detections.
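For concreteness, the two classic metrics can be computed from aggregate counts as sketched below. The formulas follow the standard CLEAR MOT [4] and IDF1 [5] definitions; the input counts are illustrative numbers, and the official toolkit remains the reference implementation.

```python
# Sketch of the CLEAR MOT accuracy formula [4] and the IDF1 ratio [5],
# computed from aggregate error counts (illustrative inputs).

def mota(num_fp, num_fn, num_idsw, num_gt):
    """MOTA = 1 - (FP + FN + IDSW) / GT, expressed as a percentage."""
    return 100.0 * (1.0 - (num_fp + num_fn + num_idsw) / num_gt)

def idf1(num_idtp, num_gt, num_det):
    """IDF1 = 2 * IDTP / (GT + detections), expressed as a percentage."""
    return 100.0 * 2.0 * num_idtp / (num_gt + num_det)

m = mota(num_fp=50, num_fn=100, num_idsw=10, num_gt=1000)   # ~84.0
f = idf1(num_idtp=800, num_gt=1000, num_det=900)            # ~84.2
```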

Questions

If you have any questions, please go to the BDD100K discussions board.

Organizers

Siyuan Li
Thomas E. Huang
Tobias Fischer
Fisher Yu

Citations

[1] Pang, J., Qiu, L., Li, X., Chen, H., Li, Q., Darrell, T., Yu, F.: Quasi-dense similarity learning for multiple object tracking. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 164–173 (2021)

[2] Ke, L., Li, X., Danelljan, M., Tai, Y.W., Tang, C.K., Yu, F.: Prototypical cross-attention networks for multiple object tracking and segmentation. Advances in Neural Information Processing Systems 34 (2021)

[3] Luiten, J., Osep, A., Dendorfer, P., Torr, P., Geiger, A., Leal-Taixé, L., Leibe, B.: HOTA: A higher order metric for evaluating multi-object tracking. International Journal of Computer Vision 129(2), 548–578 (2021)

[4] Bernardin, K., Stiefelhagen, R.: Evaluating multiple object tracking performance: the CLEAR MOT metrics. EURASIP Journal on Image and Video Processing 2008, 1–10 (2008)

[5] Ristani, E., et al.: Performance measures and a data set for multi-target, multi-camera tracking. In: European Conference on Computer Vision. Springer (2016)