
2.1.4 Random Forests for Per-Pixel Classification

Several methods reported in this thesis rely on per-pixel classification of the input image. For this segmentation problem, we use per-pixel classification forests, which have been shown to produce state-of-the-art results in human pose estimation and other segmentation problems [122, 64, 129]. We provide a brief overview and refer the reader to [30] for further details.

Figure 2.5 An ensemble of random decision trees forms a random forest.

Figure 2.5 illustrates a sample random decision forest (or random forest). A random forest consists of many binary decision trees, each of which is trained on a random subset of the input data (hence the name random decision trees).

Having an ensemble of decision trees helps improve generalization to unseen examples. At test time, input data points are passed from the root node to a leaf node of a tree. At each split node, a decision is made about which child the data point must pass through; the decisions made at the split nodes are therefore what is optimized at train time. The binary decision made at a split node is computed from a feature response, and a weak learner is employed to prevent overfitting. Arbitrary information about the data points can be stored at a leaf node; typically, an empirical distribution over all the data points that reach the leaf is stored.
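To make the traversal concrete, the following is a minimal sketch in Python; the node layout and all names are hypothetical assumptions for illustration, not the implementation used in this thesis.

```python
# Minimal sketch of test-time traversal through one decision tree.
# Node layout and names are illustrative assumptions, not the thesis code.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Node:
    # A leaf stores an empirical class distribution; a split node stores
    # a weak learner (a feature response plus a threshold) and two children.
    distribution: Optional[list] = None   # set only at leaves
    feature: Optional[Callable] = None    # f(I, x) -> float
    threshold: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def tree_predict(node: Node, image, pixel):
    """Route a data point from the root to a leaf; return the leaf's distribution."""
    while node.distribution is None:      # still at a split node
        if node.feature(image, pixel) < node.threshold:
            node = node.left
        else:
            node = node.right
    return node.distribution
```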

Figure 2.6 A depth image of the hand (left) is segmented into 12 hand parts with a depth classification forest.

In per-pixel classification forests, the goal is to train a forest that assigns each input pixel a class label (e.g., a part of the human body). At train time, the decisions at the split nodes are optimized over thousands of training examples. For the task of depth-based classification we use the feature response function

$$f(I, \mathbf{x}) = d_I\left(\mathbf{x} + \frac{\mathbf{u}}{d_I(\mathbf{x})}\right) - d_I\left(\mathbf{x} + \frac{\mathbf{v}}{d_I(\mathbf{x})}\right),$$

where $I$ is the input depth image, $\mathbf{x}$ is the pixel location, $\mathbf{u}$ and $\mathbf{v}$ are randomly chosen offsets from the current pixel location, and $d_I(\cdot)$ denotes the depth at a given location in the image.
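As an illustration, this feature response can be evaluated as follows. This is a sketch under stated assumptions: the depth map is a dense 2D array, pixel locations and offsets are (row, column) vectors, and probes that fall outside the image read a large background depth; none of these choices are prescribed by the thesis.

```python
import numpy as np

def feature_response(depth, x, u, v, background=1e6):
    """Depth-difference feature f(I, x). The offsets u and v are scaled by
    the inverse depth at x, which makes the response depth-invariant."""
    def d(p):
        r, c = int(round(p[0])), int(round(p[1]))
        if 0 <= r < depth.shape[0] and 0 <= c < depth.shape[1]:
            return depth[r, c]
        return background  # off-image probes read as far background (assumption)

    d_x = d(x)
    return d(x + u / d_x) - d(x + v / d_x)

# Hypothetical usage: one feature response at pixel (120, 200).
depth = np.full((240, 320), 800.0)  # synthetic flat depth map (e.g., millimeters)
f = feature_response(depth, np.array([120.0, 200.0]),
                     np.array([4000.0, -1500.0]), np.array([-900.0, 2600.0]))
```

Because the offsets are divided by $d_I(\mathbf{x})$, the same physical offset on the hand maps to fewer pixels as the hand moves farther from the camera, which is what makes the feature depth-invariant.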

At test time, for each input pixel, every tree in the forest makes a prediction about which part the pixel likely belongs to (see Figure 2.6). The outputs of all trees in the forest are aggregated into a final prediction of the pixel's class as

$$p(c \mid I, \mathbf{x}) = \frac{1}{T} \sum_{t=1}^{T} p_t(c \mid I, \mathbf{x}),$$

where $p$ is the predicted class distribution for the pixel $\mathbf{x}$ and $T$ is the number of random trees, each of which makes a prediction $p_t$.
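The aggregation itself is just an average of the per-tree posteriors. A minimal sketch is shown below (the array shapes are an assumption, and it reuses the hypothetical tree_predict from the earlier sketch):

```python
import numpy as np

def forest_posterior(trees, image, pixel):
    """p(c | I, x) = (1/T) * sum_t p_t(c | I, x): average the class
    distributions returned by the T trees of the forest."""
    per_tree = np.array([tree_predict(t, image, pixel) for t in trees])
    return per_tree.mean(axis=0)  # shape: (num_classes,)
```

The final class label for the pixel can then be taken as the argmax of this averaged distribution.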

Chapter 3

Interactive Multi-Camera Hand Tracking

Figure 3.1 Our approach combines two methods: (1) generative pose estimation on multiple RGB images using local optimization (bottom row and top left); (2) part-based pose retrieval on five finger databases indexed using detected fingertips on a single depth image (top right).

Tracking hands in action has several applications in human–computer interaction, teleoperation, sign language recognition, and virtual character control, among others. An ideal hand tracker for these applications would be a markerless method that tracks hand motion in real time, using a single camera, under changing lighting and scene clutter. As a first step towards solving this hard problem, in this chapter we address the relatively less difficult problem of markerless, interactive (i.e., tracking at near-real-time framerates), multi-camera hand tracking. Parts of this chapter appeared in a previous publication [131]. In subsequent chapters, we show how to solve hand tracking under progressively harder scenarios such as faster runtime, fewer cameras, and more complex scenes.

Figure 3.2 Overview of our interactive multi-camera hand tracking approach. SoG stands for Sum of Gaussians.

3.1 Introduction

Interactive markerless tracking of articulated hand motion is an important problem with a wide range of applications. Marker- or glove-based solutions exist for tracking the articulations of the hand [153], but they constrain natural hand movement and require extra user effort. Recently, many commercial sensors have been developed that detect 3D fingertip locations without using markers, but these sensors do not recover a semantically meaningful skeleton model of the hand. In this chapter, we describe a novel markerless hand motion tracking method that captures a broad range of articulations in the form of a kinematic skeleton at near-realtime frame rates.

Hand tracking is inherently hard because of the large number of degrees of freedom (DoF) [59], fast motions, self-occlusions, and the homogeneous color distribution of skin.

Most previous realtime markerless approaches (see Section 3.2) capture slow and simple articulated hand motion, since reconstruction of a broader range of complex motions requires offline computation. Our algorithm follows a hybrid approach that combines a generative pose estimator with a discriminative one (Figure 3.1). The input to our method consists of RGB images from five calibrated cameras, depth data from a monocular time-of-flight (ToF) sensor, and a user-specific hand model (Section 3.3). The output of our method is the global pose and joint angles of the hand, represented using 26 parameters.

Our approach is inspired by the robustness and accuracy of recent hybrid methods for realtime full-body tracking [7]. However, using the same strategy for hand tracking is challenging because of the absence of sufficiently discriminating image features, self-occlusions caused by the fingers, and the large number of possible hand poses.

Figure 3.2 gives an overview of our algorithm. We use multiple co-located RGB cameras and a depth sensor as input to our method. Similar to previous work in full-body motion tracking [7, 156, 167], we instantiate two pose estimators in parallel. First, the generative
