

3.2 Training Datasets for 3D Body Pose Estimation

3.2.1 MPI-INF-3DHP: Single-Person 3D Pose Dataset

Capture Configuration: The dataset is captured in a multi-camera studio with ground truth from a commercial marker-less motion capture system (The Captury 2016). No special suits or markers are needed, allowing the capture of motions in everyday apparel, including relatively loose clothing.

In contrast to existing datasets, the dataset is captured with a green-screen background to allow automatic segmentation and augmentation. For detailed considerations concerning the design and setup of a multi-view markerless body shape and appearance capture studio, refer to Starck et al. 2009.

The 14 cameras of the dataset cover a wide range of viewpoints. Five cameras are mounted at chest height with a roughly 15° elevation variation, similar to the camera orientation jitter in other datasets (Wenzheng Chen et al. 2016). Another five cameras are mounted higher and angled down 45°, three more have a top-down view, and one camera is at knee height angled up. Figure 3.3 shows the distribution of camera viewpoints in the capture setup, as well as representative images from a subset of the cameras. The cameras record 4 MP 1:1 aspect ratio images at up to 50 FPS.

Actors and Activities: 8 actors (4 male, 4 female) are recorded as part of the training set, performing 8 activity sets each, ranging from walking and sitting to complex exercise poses and dynamic actions. The activity types and prompts are designed to span more diverse pose classes than existing pose corpora such as Human3.6M. Each activity set spans roughly one minute. Each actor features 2 sets of clothing, split across the activity sets. One clothing set is casual everyday apparel, and the other is plain-colored to allow augmentation, as shown in Figure 3.2. See Table 3.1 for a detailed breakdown. Figure 3.5 shows example poses from various activity sets. Refer to Appendix A for the prompts used to guide the actors through the activities.

CHAPTER 3. CAPTURING ANNOTATED 3D BODY POSE DATA 40

Figure 3.2: The MPI-INF-3DHP training set is comprised of 8 actors. Here, each actor is visualized in both sets of clothing in which the actor was recorded. One set is normal street wear, while the other set is purposefully chosen to have uniformly colored upper and lower body clothing, such that the two can be independently chroma-keyed for augmentation.

Overall, from all 14 cameras, roughly 1.5M frames are captured, 500k of which are from the five chest-high cameras. In addition to the true 3D and 2D pose annotations, 3D poses of a height-normalized skeleton, compatible with the 'universal' skeleton of Human3.6M, are also made available.

The normalization scales the skeleton such that the knee-to-neck height (the sum of the lengths of the thigh and the spine) is 930 mm.
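As a sketch, this normalization can be expressed as a uniform scaling of the skeleton about the root joint; the joint names and layout below are illustrative and do not reflect the dataset's actual joint ordering:

```python
import numpy as np

def normalize_skeleton(joints, target_mm=930.0):
    """Scale a 3D skeleton so that the knee-to-neck height
    (thigh length + spine length) equals target_mm.

    joints: dict of joint name -> 3D position in mm.
    Joint names here are illustrative, not the dataset's actual ordering.
    """
    thigh = np.linalg.norm(joints["hip"] - joints["knee"])
    spine = np.linalg.norm(joints["neck"] - joints["pelvis"])
    scale = target_mm / (thigh + spine)
    root = joints["pelvis"]
    # Scale every joint about the pelvis so the pose shape is preserved.
    return {name: root + scale * (pos - root) for name, pos in joints.items()}
```

Because the scaling is uniform about the root, all bone lengths scale by the same factor, so after normalization the thigh and spine lengths sum to exactly the target.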

Dataset Augmentation: Although the captured dataset out of the box has more clothing variation than other single-person datasets such as Human3.6M (Ionescu et al. 2014b), the appearance variation is still not comparable to in-the-wild images. The augmentation approach proposed in this work is inspired by prior work that uses images to augment the background of recorded footage (Wenzheng Chen et al. 2016; Ionescu et al. 2014b; Rhodin et al. 2016a), and that recolors plain-colored shirts (Rhodin et al. 2016a) while keeping the shading details, using intrinsic image decomposition to separate reflectance and shading (Meka et al. 2016). Chroma-key masks are extracted for the background, for a chair in the scene, as well as for the upper- and lower-body segments of the plain-colored clothing sets. This provides an increased scope for foreground and background augmentation, in contrast to the marker-less recordings of Joo et al. 2015. The chroma-key masks are obtained with the Nuke (Nuke 2015) visual effects software. See Figure 3.4 for example masks as well as augmentation results. The background is simply replaced with images representing indoor and outdoor scenes. For clothing and chair augmentation, the luminance of the masked regions is used as a proxy for shading, and is blended with images of cloth patterns and textures.
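The shading-preserving recoloring can be sketched as follows, assuming per-region boolean masks are already available; the Rec. 601 luma and the simple multiplicative blend are illustrative stand-ins for the actual compositing pipeline:

```python
import numpy as np

def augment_region(image, mask, texture):
    """Blend a cloth texture into a masked region, using the region's
    luminance as a proxy for shading. A sketch of the approach described
    above; the exact blending used for the dataset may differ.

    image:   HxWx3 float array in [0, 1]
    mask:    HxW boolean array selecting e.g. the upper-body clothing
    texture: HxWx3 float array, already resized to the image
    """
    # Rec. 601 luma as a cheap stand-in for the shading layer.
    luminance = (0.299 * image[..., 0]
                 + 0.587 * image[..., 1]
                 + 0.114 * image[..., 2])
    # Modulate the flat texture by the per-pixel shading estimate.
    shaded = texture * luminance[..., None]
    out = image.copy()
    out[mask] = shaded[mask]
    return out
```

The same routine applies unchanged to the chair mask; the background mask skips the luminance modulation and pastes the replacement image directly.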


Figure 3.3: Visualization of the camera viewpoints available for the proposed MPI-INF-3DHP single-person dataset. Also shown are images from a subset of the viewpoints, with the orientations of the visible cameras overlaid. The dataset is captured with a green-screen background such that it can be chroma-keyed and augmented with various images. The chair is covered with a red cloth such that it can be independently chroma-keyed and augmented.

Table 3.1: The MPI-INF-3DHP training dataset is comprised of 8 actors recorded from 14 camera viewpoints, performing 8 activity sets each. The activity sets are each roughly 1 minute long, and grouped into 2 sequences of roughly 4 minutes each. The actors wear casual everyday apparel (Street) and plain-colored clothes (Plain) to allow clothing appearance augmentation. Overall, 1.5M frames from a diverse range of viewpoints are available, capturing a diverse range of poses and activities. Through the extensive avenues of background and clothing appearance augmentation made available, the number of effective frames available for training can be increased combinatorially. All cameras record at a 2048×2048 pixel resolution.

Actor ID   Gender   # Frames (Seq1 / Seq2)   Clothing (Seq1 / Seq2)   FPS (Seq1 / Seq2)   Total Frames
S1         F         6416 / 12430            Street / Plain           25 / 50             260k
S2         M         6502 /  6081            Street / Plain           25 / 25             175k
S3         M        12489 / 12283            Street / Plain           50 / 50             345k
S4         F         6171 /  6675            Street / Plain           25 / 25             175k
S5         F        12820 / 12312            Street / Plain           50 / 50             350k
S6         F         6188 /  6145            Street / Plain           25 / 25             170k
S7         M         6239 /  6320            Plain / Street           25 / 25             170k
S8         M         6468 /  6054            Plain / Street           25 / 25             170k
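A back-of-envelope calculation illustrates the combinatorial growth in effective frames that independent per-region augmentation allows. The augmentation-source counts below are made-up illustrative numbers, not properties of the dataset, and clothing augmentation in practice only applies to the plain-clothing sequences:

```python
# Back-of-envelope estimate of effective training frames after augmentation.
# The per-region source counts are illustrative assumptions, not dataset facts.
captured_frames = 1_500_000
n_backgrounds = 10       # candidate background images
n_chair_textures = 5     # candidate chair cloth textures
n_upper_textures = 8     # candidate upper-body cloth textures
n_lower_textures = 8     # candidate lower-body cloth textures

# Each region can be augmented independently, so the variants multiply.
variants_per_frame = (n_backgrounds * n_chair_textures
                      * n_upper_textures * n_lower_textures)
effective_frames = captured_frames * variants_per_frame
```

Even with these modest source counts, each captured frame yields thousands of appearance variants, which is what the table caption means by a combinatorial increase.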


Figure 3.4: Avenues of appearance augmentation in the MPI-INF-3DHP dataset. Actors are captured using a markerless multi-camera setup in a green-screen studio (left), and segmentation masks are computed for different regions (center left). The captured footage can be augmented by compositing different textures onto the background, chair, upper-body, and lower-body areas, independently (center right and right).

Figure 3.5: Representative frames from the MPI-INF-3DHP training set, showing different subjects in different clothing sets and poses from different activity sets, as well as the scope of appearance augmentation made possible by the dataset.

The efficacy of this increased scope of augmentation in enabling in-the-wild performance of learning-based 3D pose estimation approaches is demonstrated throughout the thesis, starting with Chapter 4.

Additional Data Captured: In addition to the 14 synchronized camera viewpoints, another synchronized chest-high camera records with a fish-eye lens. An RGB-D camera (ASUS 2011), also placed at chest height and pointed forward, records RGB (1280×1024 px) and depth (640×480 px) images at 30 FPS. The RGB-D camera, however, is not frame-synchronized with the remaining cameras.

3D body surface scans of each subject are also captured with a laser scanner. See Appendix A for details.


Figure 3.6: The process of creating the multi-person composited 3D human pose dataset MuCo-3DHP from per-camera image samples from the single-person MPI-INF-3DHP dataset. The images are composited in a depth-aware manner using the 3D pose annotations made available in MPI-INF-3DHP. Appearance diversity can be greatly amplified by augmenting the background as well as the clothing appearance.
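Depth-aware compositing of the kind used for MuCo-3DHP can be sketched as painting segmented person crops onto a background back-to-front, ordered by a per-person depth derived from the 3D pose annotations; the function and its signature are illustrative, not the actual MuCo-3DHP implementation:

```python
import numpy as np

def composite_depth_aware(background, layers):
    """Composite segmented person crops onto a background back-to-front,
    ordered by each person's depth from the 3D pose annotation.
    A sketch of depth-aware compositing; names are illustrative.

    background: HxWx3 float image
    layers: list of (image HxWx3, mask HxW bool, depth float) tuples,
            where depth is e.g. the camera-space pelvis depth
    """
    out = background.copy()
    # Paint the farthest person first so nearer people correctly occlude them.
    for image, mask, depth in sorted(layers, key=lambda l: -l[2]):
        out[mask] = image[mask]
    return out
```

Sorting by annotated depth rather than image order is what makes the resulting inter-person occlusions geometrically consistent with the composited 3D poses.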