Lift, Splat, Shoot

Background

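This paper is one of the earlier works on 3D or BEV (bird's-eye-view) perception from surround-view 2D images. Its core idea is to Lift each 2D image into frustum features and then Splat those features onto a BEV grid. Trained end to end, the network learns a unified representation of the scene from multiple surround-view images, and it is reasonably robust to calibration error.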

Core Work

LSS

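The core of the paper is the Lift-Splat operation, which converts multi-view images into BEV features. In the Lift stage, the network learns for every pixel a classification score over a set of discrete depth bins. This resembles an attention mechanism: each depth bin gets a confidence, and that confidence serves as the weight for feature fusion in the Splat stage. The Splat stage uses the camera intrinsics and extrinsics together with the depth values to fuse the features into BEV space.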

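The Shoot part learns motion planning. The paper frames end-to-end motion planning as classification over a fixed set of template trajectories: the logit of each template is defined as the sum of the values in the BEV cost map output by the model along that template, and the model is trained to maximize the likelihood of the expert trajectory.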
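The template classification described above can be sketched in a few lines of NumPy. This is only an illustration: the names `template_logits`, `expert_nll`, the toy cost map, and the two templates are hypothetical, and the sign convention is an assumption (here a larger sum means a more likely template; if the map stores costs rather than scores, negate the sum).

```python
import numpy as np

def template_logits(cost_map, templates):
    """Logit of each template = sum of map values at the cells it visits.

    cost_map:  (H, W) array, a hypothetical BEV map output by the model.
    templates: (K, T, 2) integer array of (row, col) cells for K templates.
    """
    rows, cols = templates[..., 0], templates[..., 1]
    return cost_map[rows, cols].sum(axis=1)  # shape (K,)

def expert_nll(logits, expert_idx):
    """Negative log-likelihood of the expert template under a softmax."""
    shifted = logits - logits.max()  # subtract the max for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[expert_idx]

# Toy example (assumed values): column 1 of the map scores highly.
cost_map = np.zeros((4, 4))
cost_map[:, 1] = 1.0
templates = np.array([
    [[0, 0], [1, 0], [2, 0]],  # template 0 runs along column 0
    [[0, 1], [1, 1], [2, 1]],  # template 1 runs along column 1
])
logits = template_logits(cost_map, templates)
loss = expert_nll(logits, expert_idx=1)
```

Minimizing `loss` over a dataset is the same as maximizing the likelihood of the expert templates.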

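Equivalence. Because the training data is limited and the model is expressive enough to overfit, the paper applies several image augmentations. Each augmentation can be written as a matrix acting on pixel coordinates, so the whole process is essentially equivalent to a correction of the camera intrinsics.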

import numpy as np
import torch
from PIL import Image


def img_transform(img, post_rot, post_tran,
                  resize, resize_dims, crop,
                  flip, rotate):
    """
    post_rot, post_tran: homography transformation
    (2x2 matrix and 2-vector accumulating the pixel-coordinate map).
    get_rot is a 2D rotation-matrix helper defined elsewhere in the source code.
    """
    # adjust image
    img = img.resize(resize_dims)
    img = img.crop(crop)
    if flip:
        img = img.transpose(method=Image.FLIP_LEFT_RIGHT)
    img = img.rotate(rotate)

    # homography: mirror each image operation on (post_rot, post_tran)
    post_rot *= resize
    post_tran -= torch.Tensor(crop[:2])
    if flip:
        # horizontal flip across the width of the cropped image
        A = torch.Tensor([[-1, 0], [0, 1]])
        b = torch.Tensor([crop[2] - crop[0], 0])
        post_rot = A.matmul(post_rot)
        post_tran = A.matmul(post_tran) + b
    # rotation about the center of the cropped image
    A = get_rot(rotate/180*np.pi)
    b = torch.Tensor([crop[2] - crop[0], crop[3] - crop[1]]) / 2
    b = A.matmul(-b) + b
    post_rot = A.matmul(post_rot)
    post_tran = A.matmul(post_tran) + b

    return img, post_rot, post_tran
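To make the equivalence with an intrinsics correction concrete, here is a small numerical check. It is a sketch under assumed values: the intrinsics `K`, the augmentation (a 0.5x resize followed by a crop offset of (10, 20)), and the test point are all hypothetical. Folding the 2D affine map into a 3x3 matrix `A` and using `A @ K` as the camera matrix gives the same pixel as projecting with `K` and then augmenting.

```python
import numpy as np

def project(K, point_cam):
    """Pinhole projection of a 3D camera-frame point to pixel coordinates."""
    uvw = K @ point_cam
    return uvw[:2] / uvw[2]

# Hypothetical intrinsics and augmentation parameters.
K = np.array([[1000.0, 0.0, 800.0],
              [0.0, 1000.0, 450.0],
              [0.0, 0.0, 1.0]])
post_rot = 0.5 * np.eye(2)            # resize by 0.5
post_tran = np.array([-10.0, -20.0])  # crop starting at pixel (10, 20)

# Embed the 2D affine map in a 3x3 homogeneous matrix and fold it into K.
A = np.eye(3)
A[:2, :2] = post_rot
A[:2, 2] = post_tran
K_aug = A @ K

point_cam = np.array([2.0, -1.0, 10.0])
pixel_then_augment = post_rot @ project(K, point_cam) + post_tran
pixel_with_K_aug = project(K_aug, point_cam)
```

Both routes give the pixel (490, 155) for this point, which is why the augmentation can be undone later by inverting `post_rot` and `post_tran` before applying the original intrinsics.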

Implementation Details

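Image features are extracted with EfficientNet. The resulting feature has D+C channels, where D is the number of depth bins and C is the final feature dimension. After a convolution, a softmax over the D depth channels gives the per-pixel depth distribution. The C-dimensional feature is then combined with the depth distribution by an outer product, which expands it to D x C values per pixel.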

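def get_depth_feat(self, x):
    # backbone features from EfficientNet
    x = self.get_eff_depth(x)
    # Depth: depthnet outputs D depth logits followed by C context channels
    x = self.depthnet(x)

    # softmax over the first D channels gives the depth distribution
    depth = self.get_depth_dist(x[:, :self.D])
    # outer product: (B, 1, D, H, W) * (B, C, 1, H, W) -> (B, C, D, H, W)
    new_x = depth.unsqueeze(1) * x[:, self.D:(self.D + self.C)].unsqueeze(2)

    return depth, new_x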
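The outer product in the Lift step can be reproduced outside the network with plain NumPy. The sketch below is not the authors' code: `lift` is a hypothetical helper, the shapes are tiny made-up values, and random arrays stand in for the network outputs. It shows that each pixel's C-dimensional feature is spread over D depth bins with softmax weights, so summing over depth recovers the original feature.

```python
import numpy as np

def lift(depth_logits, context):
    """Outer product of a per-pixel depth distribution and context features.

    depth_logits: (D, H, W) raw depth scores; context: (C, H, W) features.
    Returns an array of shape (C, D, H, W).
    """
    # softmax over the depth axis, shifted by the max for stability
    shifted = depth_logits - depth_logits.max(axis=0, keepdims=True)
    depth = np.exp(shifted)
    depth /= depth.sum(axis=0, keepdims=True)
    # broadcast (1, D, H, W) * (C, 1, H, W) -> (C, D, H, W)
    return depth[None, :, :, :] * context[:, None, :, :]

# Assumed toy sizes and random stand-ins for network outputs.
D, C, H, W = 4, 3, 2, 5
rng = np.random.default_rng(0)
depth_logits = rng.normal(size=(D, H, W))
context = rng.normal(size=(C, H, W))
frustum_feat = lift(depth_logits, context)
```

Because the depth weights sum to one at every pixel, a confident depth prediction places nearly all of the feature in one bin, while an uncertain one smears it along the ray.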

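How image features are mapped to BEV features: once the camera intrinsics, the extrinsics, and the image post-processing (augmentation) are fixed, every pixel in image space maps to a fixed set of n points in camera space (a one-to-n relationship), where n is the number of depth bins per pixel. The extrinsics then map each of those points uniquely into the ego-vehicle frame. Concretely, the frustum is first mapped back through the inverse of the augmentation homography to its un-augmented position, then the intrinsics convert frustum coordinates into camera coordinates, and finally the camera extrinsics convert them into the common ego-vehicle coordinate system.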

def get_geometry(self, rots, trans, intrins, post_rots, post_trans):
    """Determine the (x,y,z) locations (in the ego frame)
    of the points in the point cloud.
    Returns B x N x D x H/downsample x W/downsample x 3
    """
    B, N, _ = trans.shape

    # undo post-transformation
    # B x N x D x H x W x 3
    points = self.frustum - post_trans.view(B, N, 1, 1, 1, 3)
    points = torch.inverse(post_rots).view(B, N, 1, 1, 1, 3, 3).matmul(points.unsqueeze(-1))

    # cam_to_ego
    points = torch.cat((points[:, :, :, :, :, :2] * points[:, :, :, :, :, 2:3],
                        points[:, :, :, :, :, 2:3]
                        ), 5)
    combine = rots.matmul(torch.inverse(intrins))
    points = combine.view(B, N, 1, 1, 1, 3, 3).matmul(points).squeeze(-1)
    points += trans.view(B, N, 1, 1, 1, 3)

    return points
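For a single frustum point, the same three steps can be traced by hand. The following NumPy sketch mirrors the order of operations in the function above but is not taken from the repository: `pixel_to_ego` is a hypothetical helper and all numeric values (intrinsics, extrinsics, augmentation) are assumptions chosen so the result is easy to verify.

```python
import numpy as np

def pixel_to_ego(u, v, d, intrin, rot, tran, post_rot, post_tran):
    """Map one augmented pixel (u, v) at depth d to ego-frame (x, y, z)."""
    # 1. undo the augmentation (post_rot is 3x3, post_tran is a 3-vector)
    p = np.linalg.inv(post_rot) @ (np.array([u, v, d]) - post_tran)
    # 2. (u, v, d) -> (u*d, v*d, d), ready for the inverse intrinsics
    p = np.array([p[0] * p[2], p[1] * p[2], p[2]])
    # 3. camera frame via K^-1, then ego frame via rotation and translation
    return rot @ np.linalg.inv(intrin) @ p + tran

# Hypothetical calibration and augmentation.
intrin = np.array([[1000.0, 0.0, 800.0],
                   [0.0, 1000.0, 450.0],
                   [0.0, 0.0, 1.0]])
rot = np.eye(3)                      # camera axes aligned with ego axes
tran = np.array([1.0, 0.0, 1.5])     # camera position in the ego frame
post_rot = np.diag([0.5, 0.5, 1.0])  # 0.5x resize
post_tran = np.array([-10.0, -20.0, 0.0])

point_ego = pixel_to_ego(490.0, 155.0, 10.0, intrin, rot, tran, post_rot, post_tran)
```

Here the augmented pixel (490, 155) at depth 10 maps back to the original pixel (1000, 350), then to the camera-frame point (2, -1, 10), and finally to (3, -1, 11.5) in the ego frame.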

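With the mapping from image features to 3D points in hand, the image features belonging to each BEV cell can be looked up from its coordinates. A single 3D voxel may receive features from several images. Sorting the features by their voxel coordinates places the features of the same voxel next to each other, and a cumulative sum, in the manner of an integral image, then yields the fused feature of each voxel.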
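The sort-then-cumulative-sum pooling described above can be sketched as follows. This is a minimal NumPy illustration, not the repository's implementation: `splat_cumsum` is a hypothetical name, and it assumes each point's voxel coordinate has already been flattened to a single integer id.

```python
import numpy as np

def splat_cumsum(feats, voxel_ids):
    """Sum features that fall into the same voxel using sort + cumsum.

    feats: (N, C) point features; voxel_ids: (N,) integer voxel id per point.
    Returns (unique_ids, pooled) with pooled of shape (M, C).
    """
    # sort so that points of the same voxel are contiguous
    order = np.argsort(voxel_ids, kind="stable")
    feats, voxel_ids = feats[order], voxel_ids[order]
    csum = np.cumsum(feats, axis=0)
    # keep only the last point of every run of identical voxel ids
    keep = np.ones(len(voxel_ids), dtype=bool)
    keep[:-1] = voxel_ids[1:] != voxel_ids[:-1]
    csum = csum[keep]
    # differences of consecutive boundary values give the per-voxel sums
    pooled = np.concatenate([csum[:1], csum[1:] - csum[:-1]], axis=0)
    return voxel_ids[keep], pooled

feats = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
voxel_ids = np.array([2, 0, 2, 1])
ids, pooled = splat_cumsum(feats, voxel_ids)
```

In this example points 0 and 2 share voxel 2, so their features are added, while voxels 0 and 1 each keep a single point's feature.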

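The open-source code contains many implementation details worth studying. One is how to verify the image augmentation: project the point cloud onto the image using the transformation matrices. Another is how the two implementations of the cumulative sum affect gradient computation: the first relies on PyTorch automatic differentiation, which can produce a large computation graph, while the second defines its own optimized gradient computation.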
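Why a hand-written gradient can be cheap is easy to see for sum pooling: the gradient of each input point is simply the gradient of the voxel it was pooled into, so the backward pass is a single gather. The NumPy sketch below illustrates that idea only; `pool_forward` and `pool_backward` are hypothetical names, and in PyTorch the same pair would typically be wrapped in a custom `torch.autograd.Function`, which is an assumption about structure rather than a copy of the repository's code.

```python
import numpy as np

def pool_forward(feats, voxel_ids):
    """Sum-pool feats (N, C) by voxel id; also return each point's output row."""
    unique_ids, inverse = np.unique(voxel_ids, return_inverse=True)
    pooled = np.zeros((len(unique_ids), feats.shape[1]))
    np.add.at(pooled, inverse, feats)  # unbuffered add handles repeated rows
    return pooled, inverse

def pool_backward(grad_pooled, inverse):
    """Each input point receives the gradient of the voxel it fell into."""
    return grad_pooled[inverse]

feats = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
voxel_ids = np.array([2, 0, 2, 1])
pooled, inverse = pool_forward(feats, voxel_ids)

# Assumed upstream gradient, one row per pooled voxel (ids 0, 1, 2).
grad_pooled = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
grad_feats = pool_backward(grad_pooled, inverse)
```

No intermediate cumulative-sum tensors need to be kept for the backward pass in this formulation, only the index that maps each point to its voxel.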

This paper opened up a new approach to transforming surround-view 2D images into BEV space, and many later works extend it, such as BEVDet and BEVFusion.

Reference Links

Project: https://nv-tlabs.github.io/lift-splat-shoot/
Code: https://github.com/nv-tlabs/lift-splat-shoot