Dextrous VR in Professional Settings: the Importance of Stereoscopic Display and Hand-Image Collocation

John Waterworth

Department of Informatics Umeå University S-901 87 UMEÅ, Sweden.

+46 90 786 6731

Abstract. Virtual reality (VR) is becoming increasingly important in a variety of professional settings. Our particular interest is in applications requiring the examination and manipulation of detailed, volumetric medical images by surgeons and other medical staff. It is important to determine how best to maximise accuracy and speed of interaction without unrealistic technical or financial requirements. In this study, we compared performance on a trial task in a virtual environment, with and without stereoscopic display and with and without hand-image collocation. These are the most immediately tractable approaches to enhancing dexterity. Although both factors affected speed and accuracy of task completion, adding stereoscopy to desktop VR gave significantly greater benefits than adding hand-image collocation. Surprisingly, there was no additional benefit from combining the two. The work contributes to a better understanding of the factors that are important to the successful proliferation of dextrous VR in professional work settings.

1 Introduction

VR varies in the degree of physical (and, hence, perceptual) immersion enjoyed by (or imposed on) the user, and the trend for professional applications is towards less immersive solutions. Our particular concern to date has been with surgeons and radiologists, who prefer to use VR in their everyday work settings, without a head-mounted display or specialised clothing. We believe this to be a general characteristic of most professional uses of VR. Another feature of such professionals is that they require great accuracy in the display of information and in the way they interact with that information; they require what has been called dextrous VR [10]. Unfortunately, typical desktop VR reduces or eliminates several of the sensori-motor cues that contribute to such dextrous behaviour. Most obvious amongst these are binocular stereopsis, head motion parallax, haptic feedback, and hand-image collocation.

Binocular stereopsis depends on the fact that people have two eyes displaced horizontally on their heads. In normal visual perception, the disparity between the two views contributes to a sense of depth. Head motion parallax refers to the fact that when we move our heads from side to side, close objects appear to move more than distant objects. Object parallax refers to the fact that the same principle works without head movement if we move objects at different distances relative to each other. Haptic feedback refers to the cues we get "by feel", mostly through the inter-related proprioceptive and mechanico-receptive systems [2]. Hand-image collocation refers to the fact that, in the physical world, we see and feel our hands in the same place during dextrous work. It is reasonable to suppose that we can also perform more dextrously in VR if the real positions in which we feel our hands to be during interactions correspond to the virtual image presented to our eyes; in other words, matching real and virtual space should have advantages for dextrous interaction.
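The geometry behind the first two cues can be made concrete with a small calculation. The sketch below is illustrative only: the interocular distance of 0.065 m, the viewing distances, the head shift and the function names are assumptions chosen for the example, not values taken from this study. It computes the vergence angle at a point straight ahead, the relative disparity between a near and a far point, and the angular shift produced by a sideways head movement.

```python
import math


def vergence_angle(distance_m, interocular_m=0.065):
    """Angle (radians) between the two lines of sight to a point straight ahead.

    interocular_m is an assumed eye separation, not a measured value.
    """
    return 2.0 * math.atan((interocular_m / 2.0) / distance_m)


def relative_disparity(near_m, far_m, interocular_m=0.065):
    """Difference in vergence angle (radians) between a near and a far point."""
    return vergence_angle(near_m, interocular_m) - vergence_angle(far_m, interocular_m)


def parallax_shift(distance_m, head_shift_m):
    """Angular shift (radians) of a point that was straight ahead,
    after the head moves sideways by head_shift_m."""
    return math.atan(head_shift_m / distance_m)


if __name__ == "__main__":
    # Hypothetical working distances within arm's reach.
    disparity = relative_disparity(0.50, 0.55)
    print(f"relative disparity: {math.degrees(disparity):.3f} degrees")

    # The same 10 cm head movement shifts a close object more than a distant one.
    for distance in (0.5, 1.0, 2.0):
        shift = math.degrees(parallax_shift(distance, 0.10))
        print(f"object at {distance} m shifts by {shift:.2f} degrees")
```

Both quantities shrink as viewing distance grows, which is one way to see why these cues matter most for work within arm's reach.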

The experiment described below was designed to investigate the importance of two of these cues, binocular stereopsis and hand-image collocation, for dextrous interaction in VR. We chose them because they are rather more tractable, and more economical, to introduce than haptic feedback or head tracking. We also used a particular set-up, involving viewing the 3D scenes through a mirror, which we believe reduces the need for head-tracking in dextrous work by minimising disparities between cues from the stereoscopic display and those from natural accommodation of the eyes, while also allowing hand-image collocation [12]. We were interested in the contribution of each technique in itself, and also of possible additional advantages from combining the two cues in a relatively natural interaction paradigm. We also wished to investigate any differences in learning how to interact in the environment as a function of the two techniques, taken individually and in combination. The study was intended to establish whether binocular stereopsis, or hand-image collocation, or both, or neither, is necessary for dextrous interaction, and to assess their relative importance.

It is well established that improvements in accuracy of performance arise with the addition of head tracking to an interface [3, 4, 13], as long as the resultant lag is acceptably low. Lag is known to impact unfavourably on human performance with interactive systems [9] and the presence of perceptible lag is a serious impediment to accurate work in VR. Unfortunately, head tracking also tends to increase lag, so that benefiting from head tracking requires computational steps such as optimal linear filtering [6] and the use of relatively simple scenes. However, head tracking can be very effective in imparting a sense of depth, through the parallax cues it provides.

Comparing stereopsis with head tracking, Ware et al. [13] suggest, on the basis of subjective reports of viewing graphically simple objects, that "stereopsis may add only marginally to the perception of three dimensionality of objects" (p.42). But most studies of binocular stereopsis without head tracking have demonstrated its contribution to both speed and accuracy of performance [8, 14].

The contribution of binocular stereopsis and head tracking to performance in desktop VR was investigated recently [1]. It was found that while stereopsis improved speed it did not significantly improve accuracy of interaction. On the other hand, head tracking improved accuracy but did not significantly affect speed of task completion. It was also found that performance time was actually worse with stereopsis plus head tracking than with stereopsis alone (see also [11]), which suggests that lag may have become an important factor when both cues were provided or, more likely, that head movement times (to obtain the head motion parallax cues) added significantly to overall task performance time.

Two recent studies [13, 1] conclude that head tracking is more beneficial than stereopsis, which is reasonable since accuracy must remain the prime consideration in assessing dextrous interaction. One possible reason for this finding may be simply that stereoscopic cues are less important when other cues are available. But it may also be that the benefits of head tracking are emphasised when the stereoscopic cues provided are in conflict with other cues, such as accommodation, as was the case in the two studies cited above.

A proportion of the population lacks the ability to fuse the binocular images from the two eyes, yet most of these people can judge depth from other cues. Obviously, binocular stereo displays are unlikely to help such people much, but other cues at the interface might allow them to interact in a dextrous way. From the results described above with simple scenes [1], head tracking may be a suitable additional cue, although these results were not unequivocal. Another problem with head tracking is that, for complex scenes, a noticeable lag in updating the display will be produced after every head movement [10]. To work accurately with complex data (as in our target medical data settings) where any lag produced by head tracking is magnified, users must develop a style of interaction which involves holding the head still for a period after each head movement. While users can still work accurately in this way, it will obviously increase task performance time and will tend to lead to fatigue in prolonged use. Without this strategy, inaccuracy will combine with fatigue and even nausea [7], although the latter effect is less likely with non-immersive head-tracked displays.

Another potentially-important candidate cue to depth is hand-image collocation, particularly for dextrous interaction. Most existing applications place the virtual image behind the computer screen or, more commonly, they use negative parallax to bring the display in front of the screen. The problem with the latter is that conflict is created between the stereoscopy (if it is present) and lens-accommodation as a function of depth, since the eye is always focused on the screen surface which lies behind the stereo image. The hands will also obscure part of the image during interaction. There will also be some conflict between head motion parallax and stereoscopic depth, which could explain the lack of benefit from their combination (as found in [1]). The main problem with positive parallax is that hand-image collocation is lost, since the hands cannot be placed inside the screen, where the objects appear to be, and again accommodation cues will conflict with binocular stereopsis.

A few attempts have been made to avoid these conflicting cues, which arise simply because of the physical intervention of the manually-impenetrable glass screen of the computer display, and yet maintain hand-image collocation. The most obvious of these is to use a mirror to place the image where the hands can reach without obscuring the view. Figure 1 illustrates this idea graphically. Because the virtual screen (as seen in the mirror) lies on a plane that goes through the centre of the work volume (the convenient space accessible to the user's hands, and the focus of attention) there is relatively little conflict between stereoscopic depth cues (including the position of the virtual tool held by the user) and accommodation of the eye. The further the object extends or is moved behind or in front of this plane, the greater the visual cue disparity.

But when working in a natural orientation for close dextrous work, the conflict is minimised. The user reaches into a work space without significant conflicts between hand and visual cues, or between different visual cues. The other main aim of the present study was to assess to what extent minimally-conflicting hand-image collocation benefits dextrous interaction, since this can be implemented relatively easily and cheaply (as compared to stereoscopic vision and head-tracking).

Fig. 1. Using a mirror to provide hand-image collocation without visual conflicts (from [12])
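One common way to quantify the conflict between stereoscopic depth and accommodation is the difference, in dioptres (1/metres), between the distance at which the eye must focus (the screen plane) and the distance at which the stereo image appears. The sketch below uses that measure to compare a virtual screen plane running through the centre of the work volume with a conventional monitor whose image is brought forward by negative parallax. All distances and the function name are hypothetical, chosen only to illustrate the argument; they are not measurements of the apparatus used here.

```python
def cue_conflict_dioptres(screen_plane_m, object_m):
    """Mismatch (dioptres) between the focus distance (screen plane)
    and the stereoscopically specified distance of an object."""
    return abs(1.0 / screen_plane_m - 1.0 / object_m)


if __name__ == "__main__":
    # Assumed work volume: objects appear 0.40 m to 0.50 m from the eyes.
    object_depths = [0.40, 0.45, 0.50]

    # Assumed mirror set-up: virtual screen plane through the centre of the volume.
    mirror_plane = 0.45
    # Assumed conventional monitor: glass at 0.70 m, image pulled forward.
    monitor_plane = 0.70

    for depth in object_depths:
        mirror = cue_conflict_dioptres(mirror_plane, depth)
        monitor = cue_conflict_dioptres(monitor_plane, depth)
        print(f"object at {depth:.2f} m: mirror {mirror:.2f} D, monitor {monitor:.2f} D")
```

Under these assumed distances the mismatch is zero on the virtual screen plane and grows as the object moves in front of or behind it, which mirrors the qualitative description above.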

The importance of hand-image collocation has not been demonstrated in the VR literature, although it is reasonable to assume that in dextrous VR vision and proprioception will enhance each other beyond the contribution of either in isolation.

2 The Experiment

The experiment was designed to assess the influence on dextrous performance of stereo versus mono display, and of the presence versus the absence of hand-image collocation. We expected both to be important, with hand-image collocation perhaps having the greater impact. A previous study [1] had found stereopsis to be less important than head tracking. We also expected that combining the two techniques would have benefits over and above any benefits from using one alone. This prediction was based on the fact that these cues normally work together in precise manual work in the three dimensions of physical space.

To test these predictions, we examined performance with hand-image collocation but without stereoscopy, with stereoscopy but without hand-image collocation, with both techniques, and with neither.

2.1 Apparatus and Test Task

We used equipment provided by the VRLab located at Umeå University in Northern Sweden for this study, consisting of an SGI Onyx-2 with a 21-inch colour monitor, screen resolution 1280 X 1024 pixels (1026 X 768 pixels in stereo conditions).

Hand-image collocation was provided by the Dextroscope™ (formerly known as the Virtual Workbench) developed by Luis Serra and his team at Kent Ridge Digital Labs (KRDL) in Singapore [12]. The Dextroscope™ (Figure 2a) uses the mirror principle outlined above. Conditions without hand-image collocation used a monitor in the conventional position. In other respects, the technical conditions were exactly the same, and comprise the standard set-up used with the Dextroscope™.

A Polhemus FasTrak™ with pen-like receiver (held in the dominant hand) was used by participants for precise work, and a simple button and position sensor (held in the other hand) for moving the virtual object of interest. CrystalEyes™ time-multiplexed LCD shutter glasses were used to provide stereo capability. Participants viewed both stereo and monoscopic displays through the glasses, with zero disparity in the latter case. The refresh rate was 96 Hz (48 Hz per eye) in all conditions.

Fig. 2a and b. The Dextroscope™ (a) and the Dexterity Game (b).

Our experimental task was first described by Poston and Serra [10], and appears to be very similar to the one used by Barfield et al. [1]. The Dexterity Game (See Figure 2b) consists of a virtual version of the familiar "pass the loop over the wire without touching it" game. The player manipulates a tool, which corresponds to the "loop and handle" shown in the figure, with his or her dominant hand. The task is to traverse from one end of the virtual wire to the other while "touching" the wire as little as possible.

The non-dominant hand holds another tool which allows the player to adjust the overall position of the wire and frame. When a virtual touch is detected, a sound is heard and the wire changes colour. There is no haptic feedback when the wire is touched.

With this equipment and task, there is almost no perceptible lag (frame rate was around 30fps during task performance).

2.2 Method

The experiment was a two-way (2 × 2) within-subjects design, where each participant experienced all four combinations of the two levels of both independent variables (stereoscopy and hand-image collocation).

The Dexterity Game was used as the test task, and the dependent variables were speed to complete the game and number of errors recorded during completion.

We recruited 24 volunteers to participate in the study. They came from varied backgrounds and were of both genders, with ages ranging from 18 to 56, mean age 30 years. They had no prior experience with interactive virtual environments.

Each participant completed the Dexterity Game three times in each of the four conditions listed above. We thus recorded 24 × 3 × 4 = 288 data points for each of our dependent variables. Order of presentation was balanced by Latin Square, to avoid confounding order with condition of presentation. This also allowed us to analyse the effect of order separately, and so examine any learning effects over the three trials of each condition. Each participant experienced one condition three times, with the repetitions separated by a rest break of about two minutes. A longer break, of about five minutes, separated the four conditions. The complete session took between one hour 15 minutes and two hours per participant, with each single test (one of three repetitions of the game per condition) taking between 5 and 10 minutes.
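Counterbalancing of this kind can be sketched in a few lines. The text does not say which Latin square was used, so the example below assumes a simple cyclic 4 × 4 square and assigns the 24 participants to its rows in rotation; the condition labels and function names are ours, for illustration only.

```python
# Condition labels are our own shorthand for the four cells of the 2 x 2 design.
CONDITIONS = ["stereo + collocation", "stereo only", "collocation only", "neither"]


def cyclic_latin_square(items):
    """Build an n x n Latin square by rotating the item list one step per row."""
    n = len(items)
    return [[items[(row + col) % n] for col in range(n)] for row in range(n)]


def assign_orders(n_participants, items):
    """Give each participant one row of the square, cycling through the rows."""
    square = cyclic_latin_square(items)
    return [square[p % len(square)] for p in range(n_participants)]


orders = assign_orders(24, CONDITIONS)

# 24 participants x 3 repetitions x 4 conditions, per dependent variable.
TOTAL_DATA_POINTS = 24 * 3 * 4

if __name__ == "__main__":
    for participant, order in enumerate(orders[:4], start=1):
        print(participant, order)
    print("data points per dependent variable:", TOTAL_DATA_POINTS)
```

With 24 participants and four rows, each presentation order is used six times, so every condition appears equally often in every serial position. A cyclic square does not balance which condition immediately precedes which; a design that needed that property would use a different square.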

Participants entered the lab one at a time to avoid any advantage from watching others perform the tests. Instructions explaining the task were read to each participant.

We stressed the importance of making as few errors as possible, rather than of finishing as quickly as possible. Participants were instructed to perform as quickly as they could without making errors, and not to give up if they found some parts of the tests difficult.

After these instructions, each participant was seated at the Dextroscope™ or at a desk of similar height, facing the computer monitor. Before each test run, participants were allowed 30 seconds to become somewhat accustomed to the current condition.

The same Polhemus FasTrak™ device was used for all four conditions, with the sensor placed in one of two marked positions on the desk or the Dextroscope™, depending on the condition. Although many participants expressed an interest during the trials in knowing how others had performed, we declined to comment until a participant had completed all trials. Timing and error data were collected automatically.

3 Results

As expected, the slowest condition was without either stereoscopic vision or hand-image collocation, and the fastest condition was where both were provided, as shown in Figure 3a. More interesting are the results for the other two conditions, where it appears that stereopsis alone resulted in faster times than hand-image collocation alone.

Statistical analysis (two-way ANOVA) revealed that stereopsis had a very highly significant effect on completion time (F[1, 23] = 61.54, p < 0.0001) and that hand-image collocation had a highly significant effect (F[1, 23] = 24.67, p < 0.001). However, the interaction of the two was not significant (F[1, 23] = 0.56, p > 0.1), contrary to our expectation.
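For readers who wish to reproduce this kind of analysis, the sketch below shows one standard way to obtain the three F ratios of a 2 × 2 within-subjects design. It relies on the fact that each effect has one degree of freedom, so its F ratio equals the squared one-sample t statistic of a per-participant contrast score, with (1, n - 1) degrees of freedom; with 24 participants this gives the (1, 23) reported above. The software actually used for the analysis is not stated, the function names are ours, and the data in the example are invented purely to exercise the code.

```python
from statistics import mean, variance


def contrast_f(scores):
    """F ratio with (1, n - 1) df testing whether contrast scores have zero mean.

    Equivalent to the squared one-sample t statistic: n * mean^2 / sample variance.
    """
    n = len(scores)
    return n * mean(scores) ** 2 / variance(scores)


def within_2x2_anova(rows):
    """rows holds one tuple per participant, in the order
    (stereo + collocation, stereo only, collocation only, neither)."""
    stereo = [(both + st) - (co + neither) for both, st, co, neither in rows]
    collocation = [(both + co) - (st + neither) for both, st, co, neither in rows]
    interaction = [(both - st) - (co - neither) for both, st, co, neither in rows]
    return {
        "stereo": contrast_f(stereo),
        "collocation": contrast_f(collocation),
        "interaction": contrast_f(interaction),
        "df": (1, len(rows) - 1),
    }


if __name__ == "__main__":
    # Invented completion times (seconds) for five hypothetical participants.
    example = [
        (60.0, 75.0, 95.0, 130.0),
        (55.0, 80.0, 90.0, 120.0),
        (70.0, 85.0, 110.0, 150.0),
        (65.0, 70.0, 100.0, 125.0),
        (58.0, 79.0, 98.0, 140.0),
    ]
    print(within_2x2_anova(example))
```

The p value for each F ratio would then be read from the F distribution with the returned degrees of freedom, using a statistics table or library.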

The results for accuracy, which are arguably more important than speed in dextrous VR, followed a similar pattern, although examination of Figure 3b suggests that hand-image collocation is somewhat more beneficial to accuracy than to speed. But once again, stereo vision appears to be more important than hand-image collocation.

Statistical analysis revealed, as expected, that accuracy was significantly greater with stereopsis than without (F[1, 23] = 165.18, p < 0.0001) and with hand-image collocation than without (F[1, 23] = 65.19, p < 0.0001). But once again contrary to our expectations, the interaction effect was not significant (F[1, 23] = 1.75, p > 0.1).

[Figure 3: bar charts of total time in seconds (panel a, axis 0 to 3500) and total errors (panel b, axis 0 to 2000) for the four conditions: Stereo/Hand-Eye-Coordination, Stereo/No Hand-Eye-Coordination, Non-Stereo/Hand-Eye-Coordination and Non-Stereo/No Hand-Eye-Coordination.]

Fig. 3a and b. Total time for each condition (a) and Total errors for each condition (b)

A Scheffé F-test showed that every condition was different from every other (see Table 1), confirming that stereo without hand-image collocation was significantly better than hand-image collocation without stereo (p < 0.01).

Table 1. Scheffé F-test between conditions [* = p < 0.01, ** = p < 0.001, *** = p < 0.0001]. HEC denotes the hand-image collocation conditions, labelled Hand-Eye-Coordination in the figures.

                  HEC/No Stereo   No HEC/Stereo   No HEC/No Stereo
HEC/Stereo        22.161**        7.599*          72.984***
HEC/No Stereo                     3.806*          14.711**
No HEC/Stereo                                     33.482**

Next, we looked for differences in learning between the experimental factors, by comparing the three different trials that each person completed for every condition. ANOVA revealed a significant difference in error rates across the three trials for all conditions (F[2, 46] = 6.35, p < 0.005), but no interaction effect (F[3, 69] = 0.19, p > 0.1). This suggests no difference between the four conditions in the nature of the learning effect, as shown in Figure 4a.

[Figure 4: line charts of errors against trial (1 to 3) for the four conditions: Stereo/Hand-Eye-Coordination, Non-Stereo/Hand-Eye-Coordination, Stereo/No Hand-Eye-Coordination and Non-Stereo/No Hand-Eye-Coordination. The vertical axis runs from 0 to 700 in panel a and from 0 to 120 in panel b.]

Fig. 4a and b. Learning Effects for the Four Conditions in absolute (a) and percentage (b) terms

4 Discussion

As expected, both hand-image collocation and stereoscopic vision improved performance on our test task. The effects were dramatic: when both cues were used performance was more than twice as fast as when neither was used, and errors were reduced by a factor of 9.

We found that of the two techniques, stereopsis reduced both errors and completion time more than did hand-image collocation. This was not as predicted - we expected the opposite, partly on the basis of other recent results [1] where the improvement due to stereoscopy was small.

We also expected that the combination of the two factors would be disproportionately beneficial, because of closer approximation to natural depth perception - in which multiple cues are integrated to provide a robust impression of depth. On the other hand, one might also speculate that when binocular stereopsis is not available, for example, hand-image collocation would have a bigger impact. Neither of these effects was suggested by the results. However, this finding is compatible with the results of [1], who found no interaction effects on accuracy between stereopsis and head tracking.

The lack of any difference in learning effect across the four conditions was also surprising, given the large differences we found in performance measures. However, recasting errors in percentage terms, where the number of errors on the first trial is regarded as 100, reveals a picture more consonant with the other results (see Figure 4b). It appears that the relative improvement in error rates for successive trials is greatest with both stereopsis and hand-image collocation, and smallest with neither, and once again stereo vision alone appears to yield greater benefits than hand-image collocation alone.
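The percentage recasting is a simple normalisation: each condition's error count on a given trial is expressed relative to that condition's first-trial count, which is set to 100. A minimal sketch follows; the function name and the error counts in the example are hypothetical and are not the values recorded in this experiment.

```python
def as_percent_of_first(errors_by_trial):
    """Express each trial's error count as a percentage of the first trial."""
    first = errors_by_trial[0]
    if first == 0:
        raise ValueError("first-trial error count must be non-zero")
    return [100.0 * errors / first for errors in errors_by_trial]


if __name__ == "__main__":
    # Invented error totals for trials 1 to 3 in two hypothetical conditions.
    hypothetical = {
        "condition A": [200, 150, 100],
        "condition B": [600, 570, 540],
    }
    for name, errors in hypothetical.items():
        print(name, as_percent_of_first(errors))
```

In this invented example condition B loses 60 errors over three trials and condition A loses 100, but in percentage terms the contrast is much sharper (90 versus 50), which is the kind of difference that absolute counts can obscure.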

Although we should view this last conjecture with caution, since the interaction effect was not significant, it is reasonable to extrapolate from the figure and suggest that with more trials, the difference in learning between the four conditions would continue to widen, with by far the best performance resulting from the combination of cues. This argues against the idea that participants would learn to rely more heavily on fewer cues. It seems that the more cues provided the better, and this effect becomes greater with practice, not less.

Observing participants attempt the Dexterity Game brought home to us how difficult it is to work dextrously in three dimensions. Even when both cues are present the game is not easy, despite the fact that there may be no deviation of the virtual wire in the third dimension (depth, in the z-plane), if the participant does not adjust the orientation of the virtual wire frame with the non-dominant hand. And although the non-dominant hand can be used to reorient the wire, most participants did not seem to find this beneficial and made little use of this facility. This is somewhat surprising since moving the virtual object would provide parallax cues to depth that might be used to compensate for the absence of head motion parallax cues.

A frequent complaint about the conditions without hand-image collocation was that they were less comfortable than using the Dextroscope™, and resulted in more fatigue over the time period of a single trial (up to about 10 minutes). But the two situations were very similar in terms of the positions of the arms and head adopted during the trials.

Reduced fatigue may be more a mental than a physical side-effect of hand-image collocation, since it is closer to the way we carry out dextrous work in the physical world. In other words, subjective fatigue is a likely result of sensori-motor cue conflicts, and the conflicts were less in the collocated conditions. Although our participants cannot feel the data they are viewing, they do feel the tool they use, and with hand-image collocation they see data and tool in a position that corresponds to where they feel their hands to be.

It is worth noting that results from experiments using VR are plagued by a persistent confounding of factors, since as more realism is added to VR presentations, performance tends to degrade. However, we are fairly confident about the present findings since both hand-image collocation and stereo vision introduced a difference in image fidelity that we judge to be insignificant. The slight shadowing that results from the use of the mirror is barely noticeable (and no participants commented on it), and our scenes were simple enough that even in the stereo conditions movement was perceptually smooth (at around 30 frames per second).

This is not to say that cue conflicts were absent in our study. Although the mirror approach to hand-image collocation minimises the conflict between stereo cues and accommodation of the eye, it does not eliminate it. There will be some conflict whenever the depth of the virtual object of interest does not lie exactly on the plane of the virtual screen. And whilst having no head tracking reduces lag, it obviously introduces a conflict between proprioceptive cues to head position and visual cues.

It seems likely that individuals without stereoscopic vision make relatively heavier use of other cues. However, we cannot confirm this from the present study, since we unfortunately did not collect information about our participants' everyday depth perception. Another weakness of this work was that we did not record data about the dominant hand of our participants. Nor did we record how often the non-dominant hand was used to adjust the virtual wire frame. Although it was informally observed to occur infrequently, it may be that more adjustments were made in some conditions.

5 Conclusions

We can conclude that both hand-image collocation and stereoscopic vision aid dextrous interaction in desktop VR significantly, but that stereopsis is the more significant of the two. Both can be conveniently introduced into VR used for data visualisation in professional settings. However, although perhaps more natural, there was no additional benefit from combining the two cues, over and above that of each in isolation. There was a suggestion that learning benefitted particularly from this combination of cues, but this was not a statistically significant result over the three trials in which each participant experienced this combined condition. More extensive tests are needed to settle this question.

6 Acknowledgements

The experiment was carried out by Kristoffer Larsson and Jonas Norberg, two undergraduate students in Informatics at Umeå University. We thank the participants in this study for their valuable time and considerable efforts. We are also grateful to Luis Serra and staff of KRDL, Singapore and Anders Backman of Umeå VRLab for technical assistance. Helpful comments on earlier drafts were provided by Andreas Lund, David Modjeska, Luis Serra, Eva Lindh Waterworth and several anonymous others.

7 References

1. Barfield, W., Hendrix, C. and Bystrom, K.-E. (1999). Effects of Stereopsis and Head Tracking on Performance Using Desktop Virtual Environment Displays. Presence, 8, 2, 237-240, April 1999.

2. Bullinger, H-J., Bauer, W. and Braun, M. (1997). Virtual Environments. In Salvendy, G. (ed.) Handbook of Human Factors and Ergonomics, 2nd edition. New York: Wiley, 1725-1759.

3. Deering, M. (1992). High Resolution Virtual Reality. Computer Graphics, 26, 195-201.

4. Deering, M. (1996). The HoloSketch VR Sketching System. Communications of the ACM, 39, 5, 54-61, May 1996.

5. Englund, K. and Mellbring, J. (1999). Virtual Reality för Produktutveckling i Industrin. Department of Innovation-technology, Chalmers University, Gothenburg, Sweden.

6. Friedmann, M., Starner, T. and Pentland, A. (1992). Device Synchronization Using an Optimal Linear Filter. 1992 Symposium on Interactive 3D Graphics, Computer Graphics, 57-62.

7. Kalawsky, R. S. (1993). The Science of Virtual Reality and Virtual Environments. Wokingham, England: Addison-Wesley.

8. Kim, W. S., Ellis, S. R., Tyler, M., Hannaford, B. and Stark, L. (1987). A quantitative evaluation of perspective and stereoscopic displays in three-axis manual tracking tasks. IEEE Transactions on Systems, Man and Cybernetics, SMC-17, 61-71.

9. MacKenzie, I. S. and Ware, C. (1993). Lag as a Determinant of Human Performance in Interactive Systems. INTERCHI'93 Conference on Human Factors in Computing Systems (Amsterdam, April 24-29). New York: ACM, 488-493.

10. Poston, T. and Serra, L. (1996). Dextrous Virtual Work. Communications of the ACM, 39, 5, 37-45, May 1996.

11. Rekimoto, J. (1995). A vision-based head tracker for fish tank VR without head gear. Virtual Reality Annual International Symposium '95, 94-100.

12. Serra, L., Poston, T., Ng, H., Chua, B. C. and Waterworth, J. A. (1995). Interaction Techniques for a Virtual Workspace. Paper and video presented at the International Conference on Artificial Reality and Tele-Existence (ICAT)/Conference on Virtual Reality Software and Technology (VRST) '95, Makuhari Messe, Japan, November 1995.

13. Ware, C., Arthur, K. and Booth, K. S. (1993). Fish tank virtual reality. INTERCHI'93 Conference on Human Factors in Computing Systems (Amsterdam, April 24-29). New York: ACM, 37-42.

14. Yeh, Y. and Silverstein, L. D. (1992). Spatial judgements with monoscopic and stereoscopic presentation of perspective displays. Human Factors, 34, 583-600.