Paper Title

Building Spatio-temporal Transformers for Egocentric 3D Pose Estimation

Paper Authors

Jinman Park, Kimathi Kaai, Saad Hossain, Norikatsu Sumi, Sirisha Rambhatla, Paul Fieguth

Paper Abstract

Egocentric 3D human pose estimation (HPE) from images is challenging due to severe self-occlusions and strong distortion introduced by the fish-eye view from the head mounted camera. Although existing works use intermediate heatmap-based representations to counter distortion with some success, addressing self-occlusion remains an open problem. In this work, we leverage information from past frames to guide our self-attention-based 3D HPE estimation procedure -- Ego-STAN. Specifically, we build a spatio-temporal Transformer model that attends to semantically rich convolutional neural network-based feature maps. We also propose feature map tokens: a new set of learnable parameters to attend to these feature maps. Finally, we demonstrate Ego-STAN's superior performance on the xR-EgoPose dataset where it achieves a 30.6% improvement on the overall mean per-joint position error, while leading to a 22% drop in parameters compared to the state-of-the-art.
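
The abstract's core architectural idea is a spatio-temporal Transformer that attends to CNN feature maps gathered from past frames, augmented with a set of learnable "feature map tokens." Below is a minimal PyTorch sketch of that idea; the token count, embedding dimension, layer sizes, and regression head are all illustrative assumptions, not the authors' actual Ego-STAN implementation.

```python
# Minimal sketch, assuming: feature maps from a CNN backbone over T past
# frames, learnable feature map tokens prepended to the flattened
# spatio-temporal sequence, and a linear pose-regression head. All
# hyperparameters below are hypothetical.
import torch
import torch.nn as nn

class FeatureMapTokenTransformer(nn.Module):
    def __init__(self, feat_dim=256, num_tokens=16, num_joints=16,
                 depth=4, heads=8):
        super().__init__()
        # Learnable feature map tokens (count and size are assumptions).
        self.fm_tokens = nn.Parameter(torch.zeros(1, num_tokens, feat_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # Regress 3D joint coordinates from the token outputs (assumed head).
        self.head = nn.Linear(num_tokens * feat_dim, num_joints * 3)

    def forward(self, feat_maps):
        # feat_maps: (B, T, C, H, W) CNN feature maps from T past frames.
        B, T, C, H, W = feat_maps.shape
        seq = feat_maps.flatten(3).permute(0, 1, 3, 2)  # (B, T, H*W, C)
        seq = seq.reshape(B, T * H * W, C)              # spatio-temporal sequence
        tokens = self.fm_tokens.expand(B, -1, -1)
        # Self-attention lets the learnable tokens attend to the feature maps.
        out = self.encoder(torch.cat([tokens, seq], dim=1))
        pose = self.head(out[:, :tokens.shape[1]].reshape(B, -1))
        return pose.view(B, -1, 3)                      # (B, num_joints, 3)
```

In this reading, the feature map tokens play a role analogous to a ViT class token: rather than regressing the pose from every spatio-temporal position, a small fixed set of learnable queries aggregates evidence across frames, which keeps the regression head compact.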
