Listening Human Behavior: 3D Human Pose Estimation with Acoustic Signals
IEEE/CVF CVPR 2023
Yuto Shibata†, Yutaka Kawashima, Mariko Isogawa†‡, Go Irie, Akisato Kimura, and Yoshimitsu Aoki
†corresponding author, ‡project lead
Overview
Given only acoustic signals without any high-level information, such as voices or sounds of scenes/actions, how much can we infer about the behavior of humans? Unlike existing methods, which suffer from privacy issues because they use signals that include human speech or the sounds of specific actions, we explore how low-level acoustic signals can provide enough clues to estimate 3D human poses by active acoustic sensing with a single pair of microphones and loudspeakers. This is a challenging task because sound is much more diffractive than other signals and therefore obscures the shapes of objects in a scene.
Accordingly, we introduce a framework that encodes multichannel audio features into 3D human poses. To capture the subtle sound changes that reveal detailed pose information, we explicitly extract phase features from the acoustic signals together with typical spectrum features and feed them into our human pose estimation network. We also show that reflected or diffracted sounds are easily influenced by differences in subjects' physiques, e.g., height and muscularity, which degrades prediction accuracy. We reduce these gaps by using a subject discriminator to improve accuracy. Our experiments suggest that, using only low-dimensional acoustic information, our method outperforms baseline methods.
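To make the idea of pairing spectrum features with phase features concrete, the following is a minimal NumPy sketch, not the paper's actual feature pipeline. The function names, the four-channel input, the FFT size and hop length, and the choice to encode phase as the cosine and sine of each channel's phase relative to channel 0 are all assumptions made for illustration.

```python
import numpy as np


def stft(x, n_fft=512, hop=128):
    """Short-time Fourier transform of a 1-D signal with a Hann window."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=1)  # (n_frames, n_fft // 2 + 1), complex


def spectrum_and_phase_features(audio, n_fft=512, hop=128, eps=1e-8):
    """Stack log-magnitude and relative-phase features.

    audio: array of shape (channels, samples).
    Returns an array of shape (3 * channels - 2, frames, bins).
    """
    spec = np.stack([stft(ch, n_fft, hop) for ch in audio])  # (C, T, F)
    # Spectrum features: log-magnitude per channel.
    log_mag = np.log(np.abs(spec) + eps)
    # Phase features (illustrative choice): phase of channels 1..C-1
    # relative to channel 0, encoded as cos/sin to avoid the wrap at +/- pi.
    phase_diff = np.angle(spec[1:] * np.conj(spec[:1]))
    return np.concatenate(
        [log_mag, np.cos(phase_diff), np.sin(phase_diff)], axis=0
    )


# Hypothetical input: 4 channels, 4800 samples of noise.
rng = np.random.default_rng(0)
audio = rng.standard_normal((4, 4800))
features = spectrum_and_phase_features(audio)
print(features.shape)  # (10, 34, 257)
```

The resulting stack of time-frequency maps is the kind of multichannel input a pose estimation network could consume. Encoding phase as cosine and sine pairs is one common way to keep the features continuous, and other phase representations would fit the same slot.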
Paper
Video
Dataset
- Acoustic Signal Data (zip file, 2.8GB)
- Mocap Data (zip file, 170MB)
- Meta files for Trimming the Acoustic Signal Data (zip file, 5KB)
Due to a collaborative research agreement with our research partner, the dataset captured in the anechoic chamber cannot be made publicly available.
Code
Citation
@InProceedings{Shibata_CVPR2023,
author = {Shibata, Yuto and Kawashima, Yutaka and Isogawa, Mariko and Irie, Go and Kimura, Akisato and Aoki, Yoshimitsu},
title = {Listening Human Behavior: 3D Human Pose Estimation with Acoustic Signals},
booktitle = {The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2023},
pages = {13323--13332}
}