
1.4 Contributions and Structure

This thesis contributes to both computer vision-based tracking and gesture-based human–computer interaction research. We list the contributions in detail by dividing them into two categories: (1) tracking hands in action, and (2) gesture-based computer input. Please see Section 1.5 for a full list of publications where some of these contributions were originally reported.

1.4.1 Part I: Tracking Hands in Action

In Part I, we contribute to computer vision research by presenting new non-contact, markerless algorithms for tracking hands in action. In Chapter 2 we define the problem and introduce basic terminology and concepts that are essential to understanding our contributions.

Chapters 3–6 present four different tracking algorithms, each suited to a particular scenario. The supported tracking scenarios can be identified based on three criteria:

No. of Cameras: Multiple cameras (Chapters 3, 4) or single camera (Chapters 5, 6)

Run-time: Interactive (Chapter 3) or real-time (Chapter 5)

Scene Complexity: Hands-only (Chapter 5) vs. hands and objects (Chapter 6)

Together, these methods support a range of tracking scenarios not previously supported by other methods: (1) we can track hands in static desktop-based settings more accurately and robustly than previous approaches, (2) we can track hands in real time from a single depth camera, thereby enabling moving, egocentric setups, and (3) we can, to our knowledge for the first time, jointly track hands interacting with objects in real time from a single depth sensor.

In Chapter 3, we focus on multi-camera, hands-only tracking at interactive frame rates.

We first discuss a traditional pose optimization framework that uses special representations for generative tracking. We show that using this approach alone for tracking hands results in catastrophic failure. We therefore propose a hybrid approach that combines generative tracking with a novel part-based, discriminative pose retrieval strategy. We further improve the accuracy of this method in Chapter 4 by presenting a new shape representation called the 3D Sum of Anisotropic Gaussians (SAG). To evaluate these contributions, we introduce an extensive, annotated benchmark dataset consisting of challenging hand motion sequences. Results from validation on this dataset show that our new shape representation, together with the hybrid approach, is superior to previous work and allows robust and accurate real-time hand tracking.
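As a schematic illustration of the idea (the notation here is illustrative, not the exact formulation of Chapter 4), the SAG representation approximates the hand surface by a mixture of anisotropic Gaussian blobs, and a candidate pose $\boldsymbol{\theta}$ is scored by how well the posed model representation overlaps the representation extracted from the images:

\[
C(\mathbf{x}) \,=\, \sum_{i} a_i \exp\!\Big(\!-\tfrac{1}{2}\,(\mathbf{x}-\boldsymbol{\mu}_i)^{\top}\boldsymbol{\Sigma}_i^{-1}(\mathbf{x}-\boldsymbol{\mu}_i)\Big),
\qquad
\hat{\boldsymbol{\theta}} \,=\, \arg\max_{\boldsymbol{\theta}}\; \mathrm{sim}\big(C_{\mathrm{model}}(\,\cdot\,;\boldsymbol{\theta}),\, C_{\mathrm{image}}\big),
\]

where $\boldsymbol{\mu}_i$, $\boldsymbol{\Sigma}_i$, and $a_i$ denote the mean, covariance, and weight of the $i$-th Gaussian, and $\mathrm{sim}(\cdot,\cdot)$ is an overlap measure between the two representations.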

In Chapter 5, we shift our attention to tracking hands using a single depth camera. We contribute a novel shape representation for depth data that allows efficient, accurate, and robust tracking of a hand at real-time frame rates. This representation is compact, mathematically smooth, and allows us to formulate pose estimation as a 2.5D generative optimization problem in depth. While pose tracking on this representation can run in excess of 120 frames per second (FPS) using gradient-based local optimization, it often converges to an incorrect local pose optimum. For added robustness, we incorporate evidence from trained randomized decision forests that label depth pixels as belonging to predefined parts of the hand. These part labels inject discriminative detection evidence into the generative pose estimation, which enables the tracker to recover from erroneous local pose optima and prevents the temporal jitter common to detection-only approaches. The robustness of this approach allows us to track the full articulated 3D pose of the hand in challenging configurations, such as pinching poses and poses with self-occlusions. Because it uses only a single depth camera, our approach is one of the first methods that can track from moving head-mounted cameras and other, similar egocentric viewpoints.
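To give a flavor of how the discriminative evidence enters the tracker (again, the notation is schematic rather than the exact energy of Chapter 5), the pose is obtained by minimizing a combined objective in which the generative 2.5D depth-alignment term is augmented by a term that rewards agreement with the per-pixel part labels predicted by the decision forests:

\[
\hat{\boldsymbol{\theta}} \,=\, \arg\min_{\boldsymbol{\theta}}\; E_{\mathrm{align}}(\boldsymbol{\theta};\, D) \,+\, \lambda\, E_{\mathrm{part}}(\boldsymbol{\theta};\, L),
\]

where $D$ is the input depth image, $L$ the part-label image produced by the forests, and $\lambda$ a weight balancing the two terms.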

Finally, in Chapter 6 we present a first-of-its-kind method that addresses the harder problem of jointly tracking hands and objects using a single RGB-D camera at real-time frame rates. Jointly tracking hands and objects poses new challenges due to the difficulty of segmenting hands from objects and the additional occlusions caused by objects. We propose a multi-layered random forest architecture to address the segmentation problem and incorporate additional energy terms specific to the hand grasping objects. Once again, extensive evaluation and comparisons show that our method achieves high accuracy while running at 30 FPS. To our knowledge, this is the first method to support real-time joint tracking of hands and objects.

1.4.2 Part II: Gesture-based Computer Input

In Part II, we contribute to HCI research by presenting new forms of gesture-based computer input enabled by markerless hand and finger tracking. In Chapter 7, we present our first approach to continuous gesture-based computer input. We show how gestures elicited from users (i.e., through elicitation studies) can be used to create interaction techniques suitable for 3D navigation tasks using purely freehand gestures. User studies indicated that our interaction techniques were comparable to existing techniques supported by devices such as the mouse. Elicitation studies, however, have limitations, which we discuss.

Informed by the lessons learned in creating continuous freehand gestures, we present an approach for computational gesture design in Chapter 8. Computational gesture design refers to the process of automatically designing gestures for an interaction task to suit designer-specified criteria. We present one of the first approaches to computational gesture design that is informed by the characteristics of hand trackers such as those presented in Part I. We base our computational approach on data about the dexterity of the hand, including the speed and accuracy of finger movements, comfortable motion ranges of the fingers, and the individuation of fingers. Our investigation was informed by an extensive user study that measured these components of dexterity in the context of markerless hand tracking. We present design recommendations based on the data we collected and show how the data on dexterity can be used to inform the computational design of mid-air gestures. In particular, we focus on mid-air text entry and show that an approach similar to fingerspelling can lead to predicted text entry rates of over 50 words per minute (WPM). We formulate mid-air text entry as a combinatorial optimization problem and show that our data can drive the optimization of gestures based on criteria chosen by the designer. We finally present a validation of the approach with users. Although we applied our approach to a discrete input task (i.e., text entry), our dexterity model is broadly applicable to continuous input tasks such as 3D navigation or pointing.
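As a simplified sketch of this formulation (the variables and criteria below are illustrative; Chapter 8 defines the actual objective), the designer's criteria are encoded as a cost for assigning each symbol to a candidate gesture, and the optimizer searches for the assignment with the lowest expected cost, for example trading off predicted articulation time against comfort:

\[
\hat{m} \,=\, \arg\min_{m:\, \mathcal{S} \to \mathcal{G}}\; \sum_{s \in \mathcal{S}} p(s)\,\Big[\alpha\, T\big(m(s)\big) \,+\, (1-\alpha)\,\big(1 - C(m(s))\big)\Big],
\]

where $\mathcal{S}$ is the symbol set, $\mathcal{G}$ the set of candidate gestures, $p(s)$ the frequency of symbol $s$, $T$ the articulation time predicted from the dexterity data, $C$ a normalized comfort score, $\alpha$ a designer-chosen trade-off, and $m$ is constrained to assign distinct gestures to distinct symbols.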
