
Acquisition and Restitution

19.1. Introduction

The human eye is able to perceive up to 10 orders of magnitude of light intensity (10¹⁰ cd m⁻²), but only 5 simultaneously (see [FER 01] and Chapter 2). This order of magnitude is reduced to 2 when displaying images on standard screens. Images acquired up to now, known as low dynamic range (LDR) images, contain a limited range of light intensities. This restriction is highlighted in scenes involving back lighting, for example. For this reason, the development of images with high color dynamics, or high dynamic range (HDR), is increasingly important.

This type of image has recently been the subject of considerable research effort, focusing on acquisition, storage, display and use. Specific HDR cameras already exist, but are either still at the experimental stage or too costly. Methodologies have been established to compensate for the absence of specific material. An introduction to HDR images and classic acquisition methods is presented in Chapter 2, in which we see that static image capture has been the subject of particular attention. Current sensors allow us to acquire and directly store a wider dynamic range of colors (up to 16 bits for still cameras). HDR video has recently attracted much attention, but video sensors remain limited in terms of color intensity ranges (mainly represented in 12 bits). For now, it is difficult to transmit and store HDR video data in the absence of effective formats. The domain of HDR video also remains relatively little explored: the number of known solutions for acquisition is very limited, as we will see in this chapter. For our purposes, the phrase “3D video” will refer to multiscopic video content (see Chapter 4).

Chapter written by Jennifer BONNARD, Gilles VALETTE, Céline LOSCOS and Jean-Michel NOURRIT.

This chapter is divided into two main sections, concerning acquisition and rendering, respectively. In section 19.2, we provide a classification of acquisition methods based on the domain in question, organized according to the following criteria: the number of views of the scene, simultaneous or time-spread acquisition, and acquisition of a static scene or of a scene that varies over time.

As no display technology currently permits HDR rendering, in section 19.3, we consider the possibilities of adapting existing technologies.

19.2. HDR and 3D acquisition

As we saw in Chapter 2, multiplying viewpoints during acquisition gives us the immediate ability to generate depth perception. Consequently, if we have hardware capable of native HDR data acquisition for a scene and repeat acquisition from several viewpoints, it becomes possible to directly perform 3D HDR capture. This repetition might be obtained by moving or duplicating the hardware; duplication allows simultaneous capture of different points of view and thus enables 3D HDR video capture.

Unfortunately, little HDR-enabled hardware is currently available, and existing hardware is not suited to HDR video capture. Spheron1 has developed panoramic HDR view capture equipment and an HDR video camera, although the latter is still at the prototype stage and involves considerable quantities of data. Weiss2 offers a fully automated device, the Civetta, allowing acquisition of spherical HDR images over 360° with a resolution of 100 megapixels.

1 www.spheron.com.

2 www.weiss-ag.com.

The current impossibility of obtaining native HDR data in the context of multi-viewpoint acquisition means we must use an HDR value estimation method to produce 3D HDR images. Methods for obtaining HDR images using LDR capture materials consist of combining several exposures of the same scene in order to conserve [AGG 04, MER 07] or estimate [DEB 97, MAN 95] the best brightness values for each zone in the image. The simultaneous manipulation of several images requires data to be calibrated both geometrically and colorimetrically. The level of precision used in this calibration phase varies between methods. In certain cases, we must estimate and apply the inverse response curve of the camera (see Chapter 2).

These methods, using multiple exposures, present strong analogies with the use of multiple viewpoints of the same scene when acquiring depth, or the acquisition of several instants of a dynamic scene to produce video. These analogies are shown in Figure 19.1(a), where 3D HDR video methods are organized along three axes: one corresponding to different exposures, a second to different viewpoints and a third to different instants. Note that the origin of these axes is not set at 0, but at 1: one exposure, one viewpoint and one instant. Each of the axes defines a specific type of acquisition: HDR images, 3D images and video. By choosing two axes, we create a plane corresponding to other specific types of acquisition: HDR video, 3D video or 3D HDR images.

Finally, the whole space (three axes) corresponds to 3D HDR video.

Figure 19.1. a) Spatial division of 3D HDR video methods along the axes “number of different exposures”, “number of different viewpoints” and “number of different instants”; the origin of the axes does not correspond to a value of 0, but to 1. b) Plane of methods corresponding to 3D HDR images, according to the number of exposures and viewpoints used. The methods on the vertical axis are purely 2D HDR, and those on the horizontal axis are purely 3D

We will use this division into 1D and 2D subspaces to present the different models described in the literature on the subject. We will consider only one 1D subspace, HDR images, as the others are well known (videos) or covered elsewhere in this book (3D images). The same applies to the 2D subspace covering 3D video. We will, however, describe methods for the other 2D subspaces: 3D HDR imaging and HDR video. We will finish by discussing the possibility of extending some of these methods to the whole space, adding one or two dimensions in order to obtain 3D HDR video.

The methods presented in this chapter are classified in Figure 19.1(b) according to the number of viewpoints and the number of different exposures used during acquisition to construct HDR data. Section 19.2.1 presents methods that aim to acquire images with a single camera. Section 19.2.2 discusses a method that allows the acquisition of HDR video. In section 19.2.3, we will consider methods involving 3D HDR content.

Figure 19.2. General overview of HDR reconstruction methods based on the acquisition of multiple exposure images. Three stages are involved: (1) acquisition of n LDR images I1, . . . , In with different exposures from one or more viewpoints; (2) pixel mapping on these images by aligning images acquired from the same viewpoint, recalibrating data if the content changes, or correspondence mapping if the content is the same but the viewpoint differs; (3) reconstruction of one or more HDR images Ek using recalibrated LDR data. HDR image Ek corresponds to the viewpoint of LDR image Ik

For any space, the HDR reconstruction methods considered in this chapter mostly follow the acquisition pattern illustrated in Figure 19.2, divided into three stages. In the first stage, a series of LDR images is obtained with different exposures. Stage 2 consists of pixel mapping, followed by stage 3, which applies the HDR value reconstruction algorithm. The number n of LDR input images and the number of HDR output images vary depending on the chosen method. Typically, in the 1D subspace, the viewpoint will be the same for all images Ik and a single image, E, will be generated. In the HDR video 2D subspace, there will be as many generated images Ek as there are images in the final video sequence. Images Ik will vary in terms of viewpoint and exposure, and their number will not necessarily be the same as the number of generated images Ek. In the 3D HDR image subspace, the images Ik will represent the same content, but from different exposures and points of view.

Generally, the number of generated HDR images Ek will be the same as the number of input images Ik. Similarly, the mapping process varies based on the input data Ik and the HDR reconstruction objectives: it consists of aligning the images Ik if the viewpoint and content are the same, recalibrating data if the viewpoint is the same or similar but the content differs, and correspondence mapping if the content is the same but the viewpoint differs.

19.2.1. 1D subspace: HDR images

Numerous studies have considered the reconstruction of HDR values based on the acquisition of several images with different exposures from the same viewpoint [DEB 97, MAN 95, MIT 99]. Other approaches are mentioned in [LOS 10] and [REI 10]. Certain photographic cameras have an autobracketing function, which allows users to acquire images with different exposures using an automatic procedure, e.g. underexposed, normally exposed and overexposed views as shown in Figure 19.3. Depending on the camera, up to nine differently exposed images may be acquired using this method (see Chapter 2). In cases where this function is not available, the exposure time may be adjusted manually in order to acquire the required number of images of a scene. Whatever method is chosen, use of a tripod and a timer (or remote control) is recommended in order to stabilize the device and minimize the risk of shifts between images, leading to better results.

Figure 19.3. Images with different exposures acquired using the autobracketing function on a photographic camera: a) underexposed; b) intermediate; c) overexposed

As we have already seen, in the absence of native HDR acquisition methods, we need to use an HDR value estimation method. We will presume that we have access to a series of images taken from the same viewpoint, but with different exposures. These images are perfectly aligned, and a point of the scene is projected at the same pixel coordinates (i, j) in all images. We thus have, for each image, a piece of information concerning the amount of light, recorded by the camera, coming from this point. The estimation of the HDR value for this point consists of combining these pieces of information. A common method used for this operation was developed by Debevec and Malik [DEB 97], and consists of calculating a weighted average E(i, j) (see equation [19.1]) of the luminance values (HDR values) of the three color components for corresponding pixels in each image, with a weighting function w based on the pixel saturation level:

E(i, j) = \frac{\sum_{k=1}^{n} w(I_k(i, j)) \, \frac{f^{-1}(I_k(i, j))}{\Delta t_k}}{\sum_{k=1}^{n} w(I_k(i, j))}   [19.1]

where n is the total number of images, Ik(i, j) is the color value of the pixel with coordinates (i, j) in image Ik acquired with exposure time Δtk, and f−1 is the inverse of the camera response function (see Chapter 2). This function may be ignored if RAW data are used directly, in which case the data may be considered to be linear.
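To make this reconstruction concrete, the following sketch applies the weighted average of equation [19.1] to a stack of aligned, linear (RAW-like) LDR images normalized to [0, 1], so that f−1 reduces to the identity. The tent-shaped weighting function, the normalization range and the function names are assumptions made for the example, not part of any particular published method.

```python
import numpy as np

def weight(values, low=0.05, high=0.95):
    """A simple tent-shaped weighting function w: pixels close to under- or
    overexposure receive a small weight. Published methods differ in the exact formula."""
    return np.clip(np.minimum(values - low, high - values), 0.0, None)

def reconstruct_hdr(ldr_images, exposure_times):
    """Weighted-average estimate of E(i, j) (equation [19.1]) for perfectly aligned
    LDR images with linear values in [0, 1], so the inverse response f^-1 is the identity."""
    num = np.zeros_like(ldr_images[0], dtype=np.float64)
    den = np.zeros_like(ldr_images[0], dtype=np.float64)
    for img, dt in zip(ldr_images, exposure_times):
        w = weight(img)
        num += w * (img / dt)           # w(Ik(i,j)) * f^-1(Ik(i,j)) / Δtk
        den += w                        # sum of weights
    return num / np.maximum(den, 1e-8)  # avoid division by zero in fully saturated zones
```

Here ldr_images would be a list of equally sized float arrays and exposure_times the corresponding Δtk in seconds; both names are hypothetical.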

Different weighting functions w have been proposed to take under- or overexposed pixels into account. A state of the art of these methods is presented by Granados et al. [GRA 10]; each method is differentiated by the type of formula applied. A graphical representation of the performance of these methods is also given, showing that their method and the method put forward by Mitsunaga and Nayar [MIT 99] produce the best results.

In [AGU 12], the method put forward by Granados et al., based on maximum likelihood estimation [GRA 10], was also shown to produce the best results. Aguerrebere et al. [AGU 12] proposed a new weighting function, allowing all pixels, including saturated pixels, to be taken into account; according to the authors, these pixels contain useful information for HDR data estimation.

Even when a tripod is used to guarantee acquisition stability, the fact that acquisitions occur at successive instants introduces sensitivity to the presence of moving objects or persons, which (or who) will be in a different position in each image. Several methods have been developed to detect and take this movement into account [JAC 08, GAL 09, GRA 08, GRO 06, SAN 04, WAR 03]. In the same context, Khan et al. [KHA 06] and Pedone et al. [PED 08] have calculated the probability that a pixel will belong to a static part of the image. Only Orozco et al. [ORO 12] have obtained an HDR value for all pixels, even those affected by movement, using mutual information or the normalized cross-correlation (NCC).

Instead of multiplying successive captures to obtain different exposures, another method consists of acquiring sets of pixels at different exposures in a single operation. Nayar and Mitsunaga [NAY 00] adapted a camera by fixing an optical mask, such as the one shown in Figure 19.4, adjacent to a conventional image detector array. This mask permits the acquisition of four different exposures of the same image, distributed regularly in groups of four pixels.

The final HDR image is then constructed either by aggregation or by interpolation. The first method computes the mean value of each group of four neighboring pixels and assigns it to the center of the group; for an original image of size N × N, the final image will therefore be of size (N − 1) × (N − 1). In the second case, the pixels in the image are divided into two categories: on-grid points (black disks in Figure 19.4), corresponding to pixel centers, and off-grid points (black squares in Figure 19.4), corresponding to the intersection points of four pixels. This approach gives a value for each pixel center, so there is no loss of resolution. For each of the two groups, saturated pixels are distinguished from non-saturated pixels. First, the off-grid points are calculated from the non-saturated on-grid points; then all of the off-grid points are interpolated to obtain the on-grid points.

Figure 19.4. Representation of an optical mask used to acquire four different exposures [NAY 00]: the disks represent on-grid points, and the squares represent off-grid points
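As a rough illustration of the aggregation variant, the sketch below normalizes each pixel of a spatially varying exposure image by its exposure and assigns the mean of every 2 × 2 neighborhood to its center, turning an N × N input into an (N − 1) × (N − 1) output. The exposure pattern, the normalization and the absence of saturation handling are simplifying assumptions; the original method treats saturated pixels more carefully.

```python
import numpy as np

def aggregate_sve(raw, exposure_mask):
    """Aggregation for a spatially varying exposure (SVE) image: each 2x2
    neighborhood contains the four exposures of the mask. Every pixel is first
    normalized by its exposure, then the 2x2 mean is assigned to the center of
    the neighborhood, so an N x N input yields an (N-1) x (N-1) output."""
    radiance = raw / exposure_mask                      # undo the per-pixel exposure
    return 0.25 * (radiance[:-1, :-1] + radiance[:-1, 1:] +
                   radiance[1:, :-1] + radiance[1:, 1:])

# Hypothetical usage: a 4x4 capture with a repeating 2x2 pattern of relative exposures
raw = np.random.rand(4, 4)
exposure_mask = np.tile(np.array([[1.0, 0.5],
                                  [0.25, 0.125]]), (2, 2))
hdr = aggregate_sve(raw, exposure_mask)                 # shape (3, 3)
```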

19.2.2. 2D subspace: HDR videos

Several exposures and several instants are required to obtain HDR video.

According to an idea put forward by Kang et al. [KAN 03], we may use different acquisition instants to obtain different exposures. In this method, the acquisition procedure alternates long and short exposure times from one image to the next. Reconstructed HDR values for a given image at time ti are obtained using data from the images at ti−1 and ti+1. In this context, pixel shifts may be due to a change in camera viewpoint and to changes in the content of a scene from one instant to the next. Kang et al.’s pixel mapping method is based on the use of optical flow to estimate the movement of a pixel from one image to the next, an estimation that is then refined using homography. Once these displacements have been correctly estimated, it becomes possible to combine the values of corresponding pixels to obtain an HDR image. The results may include artifacts in the presence of rapid movement, as acquisition is limited to 15 images per second because of the alternating exposure times, and optical flow is reliable mostly for small displacements. Another limiting factor is the reduced number of exposure times available when reconstructing an image.
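The sketch below gives a much simplified version of this idea: the neighboring frames, captured with the other exposure, are warped onto the current frame using dense optical flow (here OpenCV's Farnebäck estimator rather than the flow-plus-homography scheme of [KAN 03]) and then merged with a weighted average in the spirit of equation [19.1]. The exposure times, saturation threshold, weighting and assumption of linear data in [0, 1] are all illustrative choices.

```python
import cv2
import numpy as np

def warp_to(reference_gray, neighbor, neighbor_gray):
    """Warp a neighboring frame onto the reference frame with dense optical flow.
    reference_gray and neighbor_gray are 8-bit grayscale versions of the frames."""
    flow = cv2.calcOpticalFlowFarneback(reference_gray, neighbor_gray, None,
                                        0.5, 3, 21, 3, 5, 1.2, 0)
    h, w = reference_gray.shape
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + flow[..., 0]).astype(np.float32)
    map_y = (grid_y + flow[..., 1]).astype(np.float32)
    return cv2.remap(neighbor.astype(np.float32), map_x, map_y, cv2.INTER_LINEAR)

def merge_alternating(frame_t, warped_neighbors, exposure_times, saturation=0.95):
    """Merge the current frame with motion-compensated neighbors captured at the
    other exposure (linear values in [0, 1], so f^-1 is taken as the identity)."""
    num = np.zeros_like(frame_t, dtype=np.float64)
    den = np.zeros_like(frame_t, dtype=np.float64)
    for img, dt in zip([frame_t] + warped_neighbors, exposure_times):
        w = np.clip(np.minimum(img, saturation - img), 0.0, None) + 1e-4
        num += w * img / dt
        den += w
    return num / den
```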

HDR video acquisition is also possible by obtaining several exposures at each instant, as with Nayar and Mitsunaga’s optical mask [NAY 00] (see section 19.2.1). Tocci et al. [TOC 11] have developed another type of camera, using three sensors that receive different percentages of the incident light split by a prism. Three images with different exposures are thus obtained in a single capture, with no shift between images. Unlike Debevec and Malik’s method [DEB 97], which uses all pixel values from the different acquired images, here only the pixels with the highest exposure are taken into account; the pixel at the same position in the lower-exposure image is only used when a pixel is saturated. This reduces the quantity of data to manage from the lower-exposure images, which are generally more affected by sensor noise.
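A minimal sketch of this “most-exposed unsaturated pixel first” rule is given below, assuming three co-registered, linear images normalized to [0, 1] and ordered from the most to the least exposed sensor. The light fractions, threshold and fallback are assumptions for the example; the published method is more elaborate, in particular around saturation transitions and noise.

```python
import numpy as np

def merge_multi_sensor(images, light_fractions, saturation=0.98):
    """Keep each pixel from the most exposed sensor in which it is not saturated;
    images are co-registered, linear, in [0, 1] and ordered from most to least exposed.
    light_fractions give the fraction of incident light received by each sensor."""
    out = np.zeros_like(images[0], dtype=np.float64)
    filled = np.zeros(images[0].shape, dtype=bool)
    for img, fraction in zip(images, light_fractions):
        usable = ~filled & (img < saturation)
        out[usable] = img[usable] / fraction    # bring the value back to a common radiance scale
        filled |= usable
    out[~filled] = images[-1][~filled] / light_fractions[-1]  # saturated everywhere: keep least exposed
    return out
```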

19.2.3. 2D subspace: 3D HDR images

All HDR image acquisition techniques may be extended to 3D by multiplying viewpoints. In this way, we obtain multiple exposures for each viewpoint, and thus, after estimation, an HDR image for each viewpoint. These are recombined during restitution to obtain a 3D HDR image. Clearly, while this principle is viable, the number of images to acquire makes it costly, except when using the systems developed by Nayar and Mitsunaga [NAY 00] or Tocci et al. [TOC 11], which only require a single capture for multiple exposures. For standard capture devices, one way of improving this situation would be to vary exposure at the same time as the viewpoint, thus obtaining one exposure per viewpoint. However, this solution raises problems with luminance matching, as a point in the scene will not be projected onto the same pixel in different images. Mapping therefore needs to be carried out before estimating brightness values. In this section, we consider the matching methods used in HDR reconstruction.

19.2.3.1. Stereo matching for HDR reconstruction

Many different methods exist for pixel matching. In this particular context, the input data contain a variety of intensity values. Dark, or saturated, zones have poor or erroneous data that vary across the sequence of considered images. Moreover, if this sequence is captured using several lenses, the data will have a higher degree of variability. We therefore need to establish a procedure for calibrating data to make it consistent (see section 19.2.3.2) and adapt or propose new matching algorithms. In this section, we explore four recent methods for tackling this problem.

Lin and Chang [LIN 09] aimed to match pixels contained in two images acquired from different viewpoints with different exposures, supplied by Middlebury3. To do this, they applied Sun et al.’s algorithm [SUN 03], based on belief propagation, after modifying the images to obtain a shared exposure time. This algorithm establishes a correspondence between pixels using three Markov random fields, corresponding, respectively, to three important problems that must be addressed during the matching phase: disparities, discontinuities and occlusions in the different images. While Lin and Chang [LIN 09] only used one set of stereoscopic data, Sun et al.’s method [SUN 03] has also been tested on multiscopic image sets (5 and 11 viewpoints), where an additional cost function is minimized in order to match pixels with the lowest cost.

Sun et al. [SUN 10] also proposed a solution for matching pixels taken from stereoscopic images acquired with two exposure times (Middlebury images3). As we saw in Chapter 7, different similarity measurements may be taken into account for matching purposes. In this case, the authors chose to use NCC, which is invariant to exposure changes. Different similarity measurements have been compared for mapping pixels taken from images with different exposures [BLE 08, ORO 12], and the NCC method currently produces the best results. Its invariance to changes in brightness under certain conditions was demonstrated by Troccoli et al. [TRO 06], who used it to improve results obtained using Kang and Szeliski’s method [KAN 04]. To do this, two matching operations were carried out, the first with NCC and the second with the sum of squared differences (SSD) in the luminance space to refine initial results. This method used N viewpoints and four different exposures.

3 http://vision.middlebury.edu/stereo/data/.
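To illustrate why such measures suit cross-exposure matching, the short sketch below computes the zero-mean normalized cross-correlation of two patches: because each patch is centered and normalized, the score is unchanged by a gain or offset applied to either patch, which approximates the effect of an exposure change on linear data. This is a generic implementation, not the code used in the cited works.

```python
import numpy as np

def zncc(patch_a, patch_b, eps=1e-8):
    """Zero-mean normalized cross-correlation of two equally sized patches;
    invariant to an affine (gain/offset) change of intensity in either patch."""
    a = patch_a.astype(np.float64).ravel()
    b = patch_b.astype(np.float64).ravel()
    a -= a.mean()
    b -= b.mean()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))
```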

Lu et al. [LU 11] considered 3D HDR reconstruction without directly addressing the production of 3D HDR images. They proposed the use of projection to assist matching, as shown in Figure 19.5, using a dome of 20 cameras to obtain images with two exposures simultaneously: ten cameras acquired images with one exposure time and the ten others with the second exposure time. If we know pixel P2 in image I2, the corresponding point X in the scene to acquire is constrained to lie on the line through the optical center and that pixel. By inverse projection onto the images acquired by the other cameras, it is possible to find the points corresponding to this 3D point in all the images in which it features: points P1 and P3 thus correspond to point P2 in images I1 and I3. The zero-mean normalized cross-correlation (ZNCC) is then used to limit correspondences and improve their coherency.
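The projection step can be sketched as follows, under the simplifying assumption of calibrated pinhole cameras with the reference camera at the origin (P_ref = K [I | 0]); candidate scene points are sampled along the ray of pixel P2 and projected into the other cameras, where a patch score such as the ZNCC above selects the best correspondence. The function names and sampling strategy are illustrative, not those of [LU 11].

```python
import numpy as np

def project(P, X):
    """Project a 3D point X (3-vector) with a 3x4 projection matrix P into pixel coordinates."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

def candidate_correspondences(K_ref, pixel, other_cameras, depths):
    """Sample candidate scene points along the ray through `pixel` of a reference
    camera placed at the origin, and project each candidate into the other cameras."""
    ray = np.linalg.inv(K_ref) @ np.array([pixel[0], pixel[1], 1.0])
    for depth in depths:
        X = depth * ray                                   # candidate 3D point on the ray
        yield X, [project(P, X) for P in other_cameras]   # its projections (e.g. P1, P3)
```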
