(Full PyTorch Code Reference: https://github.com/autonomousvision/transfuser)
Image-only models and LiDAR-only models (existing models that use only one type of input) → poor performance in adversarial scenarios (driving environments with many interacting factors)
Problem of the image-only model: it drives without accounting for cars coming from the left → crash
Problem of the LiDAR-only model: it drives without detecting the traffic light ahead (LiDAR captures no color or texture) → signal violation
→ Solution: use the two types of data together:
Multi-View 3D Object Detection Network for Autonomous Driving (MV3D, CVPR 2017)
But a problem remains: each modality's feature map is extracted independently, so the model cannot attend to the full scene context during feature extraction (a limitation of the model structure, not of the data) - ex. difficulty in complex situations such as downtown driving, where the ego-vehicle must reason about the relationship between traffic and the traffic light
TransFuser: a model that attends to the whole scene by using Transformers while extracting features from single-view image and LiDAR input data (minimal sketch below)
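A minimal PyTorch sketch of this fusion idea, assuming 8x8 intermediate feature maps with 64 channels from each branch (the shapes, layer counts, and the `FusionBlock` name are illustrative, not the repository's implementation; the actual model also adds positional and velocity embeddings and fuses at several resolutions):

```python
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    """Sketch: fuse image and LiDAR-BEV feature maps with self-attention."""

    def __init__(self, channels=64, heads=4, layers=1):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=channels, nhead=heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, img_feat, lidar_feat):
        # img_feat, lidar_feat: (B, C, H, W) intermediate feature maps
        B, C, H, W = img_feat.shape
        # Flatten each map into a sequence of H*W tokens of dimension C
        img_tokens = img_feat.flatten(2).transpose(1, 2)      # (B, H*W, C)
        lidar_tokens = lidar_feat.flatten(2).transpose(1, 2)  # (B, H*W, C)
        # Concatenate so self-attention spans BOTH modalities at once:
        # every image token can attend to every LiDAR token and vice versa
        tokens = torch.cat([img_tokens, lidar_tokens], dim=1)  # (B, 2*H*W, C)
        fused = self.transformer(tokens)
        # Split back per modality and restore the spatial layout
        img_out, lidar_out = fused.split(H * W, dim=1)
        img_out = img_out.transpose(1, 2).reshape(B, C, H, W)
        lidar_out = lidar_out.transpose(1, 2).reshape(B, C, H, W)
        return img_out, lidar_out

block = FusionBlock()
img_f, lidar_f = block(torch.randn(2, 64, 8, 8), torch.randn(2, 64, 8, 8))
```

The key contrast with MV3D-style fusion is the `torch.cat` before attention: instead of each branch summarizing its own input first, both token sets interact during feature extraction.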
Task
point-to-point navigation (completing the route along waypoints to the goal location without accidents)
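For context, TransFuser realizes this task by predicting a short sequence of future waypoints in the ego-vehicle frame, trained with an L1 loss against expert waypoints; at test time, PID controllers track the predicted waypoints. A hedged sketch of that loss - the shapes and the `waypoint_l1_loss` name are illustrative, not the repository's exact code:

```python
import torch
import torch.nn.functional as F

def waypoint_l1_loss(pred_wp, gt_wp):
    # pred_wp, gt_wp: (B, T, 2) future (x, y) waypoints in the
    # ego-vehicle coordinate frame (illustrative shapes)
    return F.l1_loss(pred_wp, gt_wp)

pred = torch.randn(4, 4, 2)  # batch of 4 samples, 4 future waypoints each
gt = torch.randn(4, 4, 2)
print(waypoint_l1_loss(pred, gt))
```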