BEVFormer

Framework

BEVFormer

BEVFormer_head

BEVTransformer

Encoder

Input
- bev_query: bevh * bevw x embed_dim
- bev_pos: positional encoding generated by applying SinePositionalEncoding to bev_mask (bs x bevh x bevw)
- key, value: multi_level_camera_feature (bs x num_cam x c x h x w)
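- ref_3d: rasterize the LiDAR 3D space into a grid, project the center of each cell into every image using the extrinsic parameters, and check whether the projected point lies in front of the image plane and inside the image bounds. This gives the mapping from each 3D cell to each camera, ref_points_cam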
History Input
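- prev_bev: the BEV features from the previous (history) frame
- shift: the computed offset used to align prev_bev with the current frame
- feature_combination: fuse the multi-level features: add level_embed to each feat, then flatten and concatenate along h x w, giving num_cam x (h1w1 + h2w2) x bs x c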
Output
- bev_embed: bevh * bevw x embed_dim
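The feature_combination step from the input list can be sketched in plain Python to make the shapes explicit. This is a minimal illustration, not the repository's code: the helper names `flatten_level`, `combine_levels` and `constant_feat` are hypothetical, tensors are nested lists, and anything beyond level_embed is left out.

```python
def flatten_level(feat, level_embed):
    """feat: nested list indexed [bs][num_cam][c][h][w]; level_embed: c floats.

    Returns (tokens, (h, w)) where tokens is indexed [num_cam][h*w][bs][c]
    and every channel has the level embedding added.
    """
    bs, num_cam = len(feat), len(feat[0])
    c, h, w = len(feat[0][0]), len(feat[0][0][0]), len(feat[0][0][0][0])
    out = []
    for cam in range(num_cam):
        tokens = []
        for y in range(h):          # row-major flattening of the h x w map
            for x in range(w):
                tokens.append([
                    [feat[b][cam][ch][y][x] + level_embed[ch] for ch in range(c)]
                    for b in range(bs)
                ])
        out.append(tokens)
    return out, (h, w)


def combine_levels(mlvl_feats, level_embeds):
    """Concatenate all levels along the flattened h*w axis, per camera."""
    num_cam = len(mlvl_feats[0][0])
    combined = [[] for _ in range(num_cam)]
    spatial_shapes = []
    for feat, embed in zip(mlvl_feats, level_embeds):
        flat, hw = flatten_level(feat, embed)
        spatial_shapes.append(hw)
        for cam in range(num_cam):
            combined[cam].extend(flat[cam])
    return combined, spatial_shapes


def constant_feat(bs, num_cam, c, h, w, value=0.0):
    """Hypothetical helper: a feature map filled with one value."""
    return [[[[[value] * w for _ in range(h)] for _ in range(c)]
             for _ in range(num_cam)] for _ in range(bs)]


# Two levels with (h, w) = (2, 3) and (1, 2); bs=1, num_cam=2, c=3.
feats = [constant_feat(1, 2, 3, 2, 3), constant_feat(1, 2, 3, 1, 2)]
embeds = [[0.1, 0.1, 0.1], [0.2, 0.2, 0.2]]
combined, spatial_shapes = combine_levels(feats, embeds)
```

Keeping `spatial_shapes` alongside the flattened tokens matters because a multi-scale attention module needs it to locate each level inside the concatenated h1w1 + h2w2 axis.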

Decoder

Input
- query: object_query
- value: bev_embed
Output
- query 3D information

Encoder in detail

The encoder converts image features into BEV features. BEVFormerEncoder (bev_query, key=feat, value=feat) runs, in order: [self_attention, norm, cross_attention, norm, ffn, norm]. The sections below focus on self_attention and cross_attention.
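The fixed execution order of one encoder layer can be sketched as a small dispatcher. This is only an illustration of the ordering: `run_encoder_layer`, `tag` and `stub_modules` are hypothetical names, and the stubs record which module ran instead of doing any attention math.

```python
OPERATION_ORDER = ("self_attn", "norm", "cross_attn", "norm", "ffn", "norm")


def run_encoder_layer(bev_query, modules, order=OPERATION_ORDER):
    """Apply sub-modules in the given order.

    `modules` maps an operation name to a list of callables; each occurrence
    of a name in `order` consumes the next callable, so the three norms are
    separate instances.
    """
    used = {name: 0 for name in modules}
    x = bev_query
    for name in order:
        module = modules[name][used[name]]
        used[name] += 1
        x = module(x)
    return x


def tag(label):
    """Stub module: appends its label so the call order can be inspected."""
    return lambda x: x + [label]


stub_modules = {
    "self_attn": [tag("TemporalSelfAttention")],
    "cross_attn": [tag("SpatialCrossAttention")],
    "ffn": [tag("FFN")],
    "norm": [tag("norm0"), tag("norm1"), tag("norm2")],
}
trace = run_encoder_layer([], stub_modules)
```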

TemporalSelfAttention

ref_2d: the reference-point mapping on the 2D BEV plane
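One way to picture ref_2d is as the normalized center of every BEV cell, one reference point per BEV query. The sketch below assumes row-major ordering and (x, y) coordinates in [0, 1]; the function name `bev_reference_points_2d` is hypothetical.

```python
def bev_reference_points_2d(bev_h, bev_w):
    """Return bev_h * bev_w normalized (x, y) cell centers in row-major order."""
    points = []
    for row in range(bev_h):
        for col in range(bev_w):
            # Cell centers sit at half-integer positions, scaled to [0, 1].
            points.append(((col + 0.5) / bev_w, (row + 0.5) / bev_h))
    return points


ref_2d = bev_reference_points_2d(2, 2)
```

For a 2 x 2 grid this yields four points, starting at (0.25, 0.25) and ending at (0.75, 0.75). Temporal self-attention can then sample around each point in both prev_bev and the current BEV queries.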

SpatialCrossAttention

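A single projected point gives too few observations of the target, so sampling_offset is added to sample neighboring points around it. MSDeformableAttention3D computes attention only at the feature locations that are associated with the query.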
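The projection check and the offset sampling can be sketched together. This is a simplified illustration under stated assumptions: each camera is described by a single 4x4 `lidar2img` matrix, the offsets are hand-picked rather than predicted by the network, and all function names and the two toy camera matrices are hypothetical.

```python
def project_to_camera(point_xyz, lidar2img, img_w, img_h, eps=1e-5):
    """Project a LiDAR-frame point; return normalized (u, v) or None if not visible."""
    x, y, z = point_xyz
    homo = (x, y, z, 1.0)
    cam = [sum(row[i] * homo[i] for i in range(4)) for row in lidar2img]
    depth = cam[2]
    if depth <= eps:                      # behind (or on) the image plane
        return None
    u = cam[0] / depth / img_w
    v = cam[1] / depth / img_h
    if not (0.0 < u < 1.0 and 0.0 < v < 1.0):   # outside the image bounds
        return None
    return (u, v)


def hit_cameras(point_xyz, lidar2imgs, img_w, img_h):
    """Map camera index -> projected (u, v) for cameras that see the point."""
    hits = {}
    for cam_idx, mat in enumerate(lidar2imgs):
        uv = project_to_camera(point_xyz, mat, img_w, img_h)
        if uv is not None:
            hits[cam_idx] = uv
    return hits


def sampling_locations(ref_uv, offsets_px, feat_w, feat_h):
    """Add pixel offsets (normalized by feature-map size) to a reference point."""
    u, v = ref_uv
    return [(u + dx / feat_w, v + dy / feat_h) for dx, dy in offsets_px]


# Toy cameras: FRONT looks along +z, BACK along -z (focal 100, center 50).
FRONT = [[100.0, 0.0, 50.0, 0.0], [0.0, 100.0, 50.0, 0.0],
         [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
BACK = [[-100.0, 0.0, -50.0, 0.0], [0.0, 100.0, -50.0, 0.0],
        [0.0, 0.0, -1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]

hits = hit_cameras((0.0, 0.0, 10.0), [FRONT, BACK], img_w=100, img_h=100)
locations = sampling_locations(hits[0], [(1, 0), (0, -1)], feat_w=10, feat_h=10)
```

Only the front camera sees the point, so attention for this query would be computed against that camera's features alone, at the reference point plus its offset neighbors.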

Misc

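Data: do aligning the 3D point cloud with the camera timestamps and compensating for object velocity affect the detection results? How does detection perform while the ego vehicle is moving?
Model: effect of input image resolution on the results; effect of BEV grid size on small objects and on overall detection, 50x50 vs 100x100.
Training: how much class-balanced sampling improves results; effect of data volume.
Debugging: attention visualization; feature map response visualization.
Compute: effect of the number of decoder layers and the BEV grid size on resource usage; profile each module's computation cost, inference time, parameter count, and GPU memory usage.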

References

https://github.com/fundamentalvision/BEVFormer
https://arxiv.org/abs/2010.04159
https://mmcv.readthedocs.io/en/latest/_modules/mmcv/ops/multi_scale_deform_attn.html