
What the eye did not see–a fusion approach to image coding

Authors: Ali Alsam, Hans Jakob Rivertz, and Puneet Sharma.

Full title: What the eye did not see–a fusion approach to image coding.

Published in: ISVC 2012, Advances in Visual Computing, Lecture Notes in Computer Science (LNCS), Springer-Verlag Berlin Heidelberg.

What the eye did not see–a fusion approach to image coding

Ali Alsam, Hans Jakob Rivertz, and Puneet Sharma
Department of Informatics & e-Learning (AITeL)

Sør-Trøndelag University College (HiST), Trondheim, Norway

email: er.puneetsharma@gmail.com

Abstract. The concentration of the cones and ganglion cells is much higher in the fovea than in the rest of the retina. This non-uniform sampling results in a retinal image that is sharp at the fixation point, where a person is looking, and blurred away from it. This difference between the sampling rates at the different spatial locations presents us with the question of whether we can employ this biological characteristic to achieve better image compression. This can be achieved by compressing an image less at the fixation point and more away from it. It is, however, known that the vision system employs more than one fixation to look at a single scene, which presents us with the problem of combining images pertaining to the same scene but exhibiting different spatial contrasts.

This article presents an algorithm to combine such a series of images by using image fusion in the gradient domain. The advantage of the algorithm is that, unlike other algorithms that compress the image in the spatial domain, our algorithm results in no artifacts. The algorithm is based on two steps: in the first, we modify the gradients of an image based on a limited number of fixations, and in the second, we integrate the modified gradient. Results based on measured and predicted fixations verify our approach.

1 Introduction

From the very beginning of photography, cameras were designed and iteratively improved with the aim of mimicking the human visual system. From this perspective, a camera is thought of as a machined eye, a device that is sensitive to illumination. Equally, we normally think of algorithms such as white-balancing [1], adaptation [2] and tone mapping [3] as being similar to the biological processes of the vision system.

A camera is of course not a human visual system. The two are different in a number of ways, some of which are relevant to the work presented in this article. Primarily, while digital camera manufacturers are striving to produce devices with progressively higher resolution, the human brain has evolved to be efficient, i.e. to use less information to reach greater conclusions. Thus, while the camera sensor has a uniform number of pixels per unit area, the human eye has a much higher resolution in the fovea, which is the central part of the retina [4]. It is well known that the fovea is responsible for our central, sharpest vision, while the cone distribution in the rest of the retina results in blurred vision [4].

In the process of exploring a scene, the brain directs the eyes to different spatial locations. At those locations, known as fixations, the eyes pause and gather the visual information [5]. Due to the concentration of photo-receptors at the fovea, we can think of each pause as the time taken to capture an image that is sharp at the fixation point and blurred away from it. Given that the average distribution per unit area and spatial location of the cones in the retina is known, it is possible to model the spatial contrast of the retinal image at each fixation.

For a given scene, the number of fixations and their locations vary. The question of whether fixations are guided by image features has been addressed extensively in vision research, and some conclusions are widely accepted. Specifically, experiments have shown that for a given image, people tend to look at the same regions [6, 7], they tend to look at the central part [8, 7], and that certain image attributes such as luminance and colour contrasts tend to attract fixations [9, 10]. Furthermore, fixations can be measured using eye trackers, and the experimental data shows conclusively that for a general image the human visual system employs more than one fixation [6].

Based on a given digital image and a number of measured or predicted fixations, we can model the foveation effect, i.e. a sharp region at the fixation point and blurring away from it. The result of such a model would be a number of images with different spatial contrasts. As an example, see figure 1, where we have modeled the foveation effect based on three different fixations. Given such an image series, we might wonder how the vision system integrates the different foveation results into a seamless visual experience, and subsequently how we can design signal processing algorithms that offer such functionality.
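To make the foveation effect concrete, the following toy sketch blends progressively blurred copies of an image so that the blur grows with distance from a single fixation point. This is only an illustration of the idea, with names of our own choosing; the foveation actually used in this work is the contrast-threshold model of Geisler and Perry described in Section 2.

```python
# Toy illustration of foveation: sharp at the fixation, increasingly blurred away
# from it. Not the Geisler-Perry model used in the paper; just a visual aid.
import numpy as np
from scipy.ndimage import gaussian_filter

def toy_foveate(image, fixation, max_sigma=8.0):
    """image: 2-D float array; fixation: (row, col); returns a foveated copy."""
    h, w = image.shape
    yy, xx = np.mgrid[0:h, 0:w]
    r = np.hypot(yy - fixation[0], xx - fixation[1])
    r = r / r.max()                                   # normalised eccentricity in [0, 1]
    sigmas = np.linspace(0.0, max_sigma, 6)           # blur levels, sharp to very blurred
    stack = np.stack([image if s == 0 else gaussian_filter(image, s) for s in sigmas])
    level = r * (len(sigmas) - 1)                     # fractional blur level per pixel
    lo = np.floor(level).astype(int)
    hi = np.minimum(lo + 1, len(sigmas) - 1)
    frac = level - lo
    return (1 - frac) * stack[lo, yy, xx] + frac * stack[hi, yy, xx]
```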

In this article, we present an algorithm which integrates a number of differently foveated images in the gradient domain. The algorithm starts by calculating the gradients of the input image. Having done that, a number of fixation locations are used to calculate the corresponding foveated gradients. Here we use the foveation function described by Geisler and Perry [11]. As a second step, the gradients are combined using the fast colour to gray algorithm by Alsam and Drew [12]. The Alsam and Drew algorithm [12] combines the gradients from n channels into a single gradient by arguing that the maximum horizontal and vertical differences over all the channels result in the maximum contrast. Thus the gradient fusion step is guaranteed to result in a gradient where the maximum differences pertaining to the fixation locations are maintained. As a final step, the resultant gradient is integrated using the modified Frankot-Chellappa algorithm [13] proposed by Alsam and Rivertz [14].
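The fusion rule itself is simple to state: at every pixel, keep the horizontal and the vertical difference with the largest magnitude over all the foveated gradient fields. The sketch below is a minimal rendering of that rule with illustrative names (fuse_gradients, px, py are not from the paper); the paper relies on the fast colour-to-grey method of Alsam and Drew [12] for this step.

```python
# Minimal sketch of gradient fusion: per pixel, pick the x- and y-difference with the
# largest magnitude across the n foveated gradient fields.
import numpy as np

def fuse_gradients(px, py):
    """px, py: arrays of shape (n, H, W) with the x- and y-gradients of the n
    foveated images. Returns a single fused gradient field (p, q)."""
    idx_x = np.argmax(np.abs(px), axis=0)       # channel with the largest |dx| per pixel
    idx_y = np.argmax(np.abs(py), axis=0)       # channel with the largest |dy| per pixel
    rows, cols = np.indices(px.shape[1:])
    return px[idx_x, rows, cols], py[idx_y, rows, cols]
```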

The need for a fast algorithm to combine foveated images is best motivated in the image compression domain, where improvements in statistically based image compression, i.e. methods that are based on data analysis, have long slowed down. The use of human vision steered compression is seen by researchers as the most promising path toward further improvements. In this regard, the algorithm presented in this article can be used as part of an image compression pipeline with very promising results. From our initial tests, we have noticed that the algorithm results in reduced storage requirements without the added artifacts associated with frequency-based compression in the wavelet domain.

Like other foveation driven algorithms, our method is dependent on accurate estimation of the fixation points. Thus in our experimental section, we present results based on measured fixation data as well as predictions based on the visual saliency algorithm by Itti et al. [15].

(a) Foveated image 1 (b) Foveated image 2 (c) Foveated image 3

Fig. 1. The foveated images for three fixations; the fixation points are represented as red dots.

2 The filter and the integration

Experiments for measuring the contrast sensitivity of the human eye have been carried out [16, 17]. Based on these experiments, the contrast threshold has been modeled through the function

\[
CT(f, \theta) = CT_0 \exp\!\left( \alpha f \, \frac{\theta + \theta_2}{\theta_2} \right).
\]

Here, $f$ is the spatial frequency measured in cycles per degree and $\theta$ is the retinal eccentricity. $CT_0$ is the minimal contrast threshold, $\theta_2$ is the half-resolution eccentricity constant, and $\alpha$ is the spatial frequency decay constant. The values used in [18] are $\alpha = 0.106$, $\theta_2 = 2.3$, and $CT_0 = 1/64$.
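A direct transcription of this threshold model, with the constants above, is given as a small sketch below (the function and constant names are ours):

```python
# Contrast threshold CT(f, theta) = CT0 * exp(alpha * f * (theta + theta2) / theta2),
# with the constants from [18]: alpha = 0.106, theta2 = 2.3, CT0 = 1/64.
import numpy as np

ALPHA, THETA2, CT0 = 0.106, 2.3, 1.0 / 64.0

def contrast_threshold(f, theta):
    """f: spatial frequency (cycles per degree); theta: eccentricity (degrees)."""
    return CT0 * np.exp(ALPHA * f * (theta + THETA2) / THETA2)
```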

Given a normalized gray scale image $z_0 : \Omega \to [0,1]$, denote its width by $w$, measured in pixels. An observer views the image from a distance $d$, also measured in pixels. The maximal spatial frequency of the image is then

\[
f_d = \frac{w}{4 \arctan\!\frac{w}{2d}}.
\]

If $r$ is the distance, measured in pixels, from a fixation point, then the corresponding eccentricity is $\theta(r) = \arctan\frac{r}{d}$. The gradient $\nabla z_0$ is modified by setting its magnitude to zero if it is less than $CT(f_d, \theta)$ for all of the fixation points.
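The viewing geometry can be sketched in the same way (again with illustrative names, and assuming angles are expressed in degrees so that $f_d$ comes out in cycles per degree):

```python
# Maximal spatial frequency f_d for an image of width w pixels viewed from d pixels,
# and eccentricity theta(r) of a point r pixels away from a fixation.
import numpy as np

def max_frequency(w, d):
    """f_d = w / (4 * arctan(w / (2 d))), with arctan in degrees."""
    return w / (4.0 * np.degrees(np.arctan(w / (2.0 * d))))

def eccentricity(r, d):
    """theta(r) = arctan(r / d), in degrees."""
    return np.degrees(np.arctan(r / d))
```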

We make a new contrast threshold function based on $f = f_d$ and the fixation points $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$:

\[
CT(x, y) = \min\big(CT_1(x, y), CT_2(x, y), \ldots, CT_n(x, y)\big),
\]

where $CT_k(x, y) = CT\!\left(f_d,\; \theta\!\left(\sqrt{(x - x_k)^2 + (y - y_k)^2}\right)\right)$, $k = 1, 2, \ldots, n$. This step is equivalent to the Alsam and Drew method [12].
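Putting the pieces together, a self-contained sketch of the per-pixel threshold map and of the gradient modification that follows could look like this (names are ours; the constants repeat those given above):

```python
# CT(x, y) = min_k CT(f_d, theta(r_k)) over the fixation points, followed by zeroing
# the gradient wherever its magnitude does not exceed this threshold.
import numpy as np

ALPHA, THETA2, CT0 = 0.106, 2.3, 1.0 / 64.0

def threshold_map(shape, fixations, d):
    """shape: (H, W); fixations: list of (x_k, y_k) pixel coordinates; d: viewing distance."""
    H, W = shape
    yy, xx = np.mgrid[0:H, 0:W]
    fd = W / (4.0 * np.degrees(np.arctan(W / (2.0 * d))))             # maximal frequency f_d
    ct = np.full(shape, np.inf)
    for xk, yk in fixations:
        theta = np.degrees(np.arctan(np.hypot(xx - xk, yy - yk) / d))  # eccentricity per pixel
        ct = np.minimum(ct, CT0 * np.exp(ALPHA * fd * (theta + THETA2) / THETA2))
    return ct

def modify_gradient(zx, zy, ct):
    """Keep the gradient only where its magnitude exceeds CT(x, y)."""
    keep = np.hypot(zx, zy) > ct
    return zx * keep, zy * keep
```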

The direction of both the original and the modified gradient is $\hat{u} = \nabla z_0 / |\nabla z_0|$. The length of the new gradient is $|\nabla z| = |\nabla z_0|$ if $CT(x, y) < |\nabla z_0|$, and otherwise $|\nabla z| = 0$. We now reconstruct the contrast by using the integration method of Alsam and Rivertz [14], where we minimize the functional:

\[
W(z) = \lambda \int_\Omega |z - z_0|^2 \, dx \, dy + \int_\Omega |z_x - p|^2 + |z_y - q|^2 \, dx \, dy.
\]

This minimization results in an image whose gradients are as close as possible to (p, q), under the constraint that the luminance is close to the original image.
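For completeness, here is a brief sketch of how the closed form below arises (our own derivation, using the convention that differentiation corresponds to multiplication by $iu$ and $iv$ in the Fourier domain). The Euler-Lagrange equation of $W(z)$ is

\[
\lambda (z - z_0) - \frac{\partial}{\partial x}(z_x - p) - \frac{\partial}{\partial y}(z_y - q) = 0,
\qquad\text{i.e.}\qquad
\lambda z - \Delta z = \lambda z_0 - (p_x + q_y).
\]

Taking Fourier transforms, with $z_x \mapsto iuZ$, $z_y \mapsto ivZ$, $p_x \mapsto iuP$ and $q_y \mapsto ivQ$, gives

\[
(\lambda + u^2 + v^2)\, Z = \lambda Z_0 - i(uP + vQ),
\]

which, solved for $Z$, yields the expression stated next.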

The image $z$ in the Fourier domain can be taken as

\[
Z(u, v) = \frac{\lambda Z_0 - i (u P + v Q)}{\lambda + u^2 + v^2},
\]

where $P$ and $Q$ correspond to the Fourier transforms of $p$ and $q$.
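A minimal numpy sketch of this reconstruction, assuming numpy's default FFT conventions and treating p and q as the fused gradient from the previous step (the function name and the value of $\lambda$ below are illustrative, not the authors' implementation):

```python
# Solve for z in the Fourier domain: Z = (lam*Z0 - i*(u*P + v*Q)) / (lam + u^2 + v^2).
import numpy as np

def reconstruct(z0, p, q, lam=0.1):
    H, W = z0.shape
    u = 2 * np.pi * np.fft.fftfreq(W)[None, :]      # horizontal angular frequencies
    v = 2 * np.pi * np.fft.fftfreq(H)[:, None]      # vertical angular frequencies
    Z0 = np.fft.fft2(z0)
    P, Q = np.fft.fft2(p), np.fft.fft2(q)
    Z = (lam * Z0 - 1j * (u * P + v * Q)) / (lam + u**2 + v**2)
    return np.real(np.fft.ifft2(Z))
```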

3 Results

To test the proposed method, we used images and the corresponding fixation data from the study by Judd et al. [6]. The results for two images and the associated fixations are shown in figures 2 and 3. In the left column the foveated images for three fixations are shown. Here, the fixation points are represented as red dots. In agreement with the predicted results for the application of the contrast function by Wang and Bovik [18], we notice that the regions around the fixation points are sharper than the rest. The images in the right column show the original image, the result obtained by combining the foveated images using the proposed method, and the difference between the result and the original image. We notice that the result image is sharp in the regions corresponding to the three fixation points; we further notice that the image represents a good approximation of the original, with greater differences in the parts that the observer deemed to be less salient. Here we remark that the difference between the original and the result can be optimized by controlling the $\lambda$ parameter defined in the previous section.

In figure 4, the left column contains the foveated images obtained by using the first three salient points from the visual saliency algorithm by Itti et al. [15], and the right column contains the original image, the result obtained by using the proposed method, and the difference between the result and the original image. For this experiment, we notice that the results are very similar to those obtained for the first test image. We underline, however, that the choice of fixation locations and the number of salient regions is clearly related to the results that we obtain: the higher the number of fixations and the more spread they are in the image plane, the closer the result is going to resemble the original.

Finally, in figures 5(a) to 5(f), we show the bitrates obtained by saving the original image and the corresponding result image in JPEG format with different quality values, ranging from 10 to 100, based on six different images. Here we notice that for the same compression quality the new images require less storage space. Given that the foveation function reduces the high frequency elements of the original image, we can argue that this result is not surprising. The advantages of this approach are, however, more subtle than a simple removal of high frequency elements: we have removed high frequencies locally, in regions where the foveation function predicts that the observer could not see with the sharp part of their vision.
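The bitrate comparison itself is easy to reproduce in principle: save both images as JPEG at a range of quality values and record the file sizes. The sketch below is illustrative only (Pillow-based, with hypothetical file names), not the exact evaluation pipeline used for figure 5:

```python
# Save a PIL image as JPEG at several quality settings and return the file sizes.
import io
from PIL import Image

def jpeg_sizes(image, qualities=range(10, 101, 10)):
    """Return {quality: size in bytes} for `image` encoded as JPEG."""
    sizes = {}
    for q in qualities:
        buf = io.BytesIO()
        image.save(buf, format="JPEG", quality=q)
        sizes[q] = buf.tell()
    return sizes

# Example with hypothetical file names:
# original = Image.open("original.png").convert("L")
# fused = Image.open("fused_result.png").convert("L")
# print(jpeg_sizes(original), jpeg_sizes(fused))
```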

4 Conclusion

This article presents an algorithm to combine a series of differently foveated images pertaining to an identical scene. This is achieved by using image fusion in the gradient domain. The advantage of the algorithm is that, unlike other algorithms that compress the image in the spatial domain, our algorithm results in no artifacts. The algorithm is based on two steps: in the first, we modify the gradients of an image based on a limited number of fixations, and in the second, we integrate the modified gradient. Results based on measured and predicted fixations verify our approach. The need for a fast algorithm to combine foveated images is best motivated in the image compression domain, where improvements in statistically based image compression, i.e. methods that are based on data analysis, have long slowed down. The use of human vision steered compression is seen by researchers as the most promising path toward further improvements. In this regard, the algorithm presented in this article can be used as part of an image compression pipeline with very promising results. From our initial tests, we have noticed that the algorithm results in reduced storage requirements without the added artifacts associated with frequency-based compression in the wavelet domain.

References

1. Chikane, V., Fuh, C.S.: Automatic white balance for digital still cameras. Journal of Information Science and Engineering 22 (2006) 497–509

2. Hurley, J.B.: Shedding light on adaptation. Journal of General Physiology 119 (2002) 125–128

3. Qiu, G., Guan, J., Duan, J., Chen, M.: Tone mapping for HDR image using optimization: a new closed form solution. In: ICPR 2006, 18th International Conference on Pattern Recognition. Volume 1. (2006) 996–999

4. Cormack, L.K.: Computational models of early human vision. In: Handbook of Image and Video Processing. Elsevier Academic Press (2005) 325–345

5. Rajashekar, U., van der Linde, I., Bovik, A.C., Cormack, L.K.: GAFFE: A gaze-attentive fixation finding engine. IEEE Transactions on Image Processing 17 (2008) 564–573

6. Judd, T., Ehinger, K., Durand, F., Torralba, A.: Learning to predict where humans look. In: International Conference on Computer Vision (ICCV). (2009)

7. Alsam, A., Sharma, P.: Analysis of eye fixations data. In: Proceedings of the IASTED International Conference, Signal and Image Processing (SIP 2011). (2011) 342–349

8. Tatler, B.W.: The central fixation bias in scene viewing: Selecting an optimal viewing position independently of motor biases and image feature distributions. Journal of Vision 7 (2007) 1–17

9. Itti, L., Koch, C.: Computational modelling of visual attention. Nature Reviews Neuroscience 2 (2001) 194–203

10. Meur, O.L., Callet, P.L., Barba, D., Thoreau, D.: A coherent computational approach to model bottom-up visual attention. IEEE Transactions on Pattern Analysis and Machine Intelligence 28 (2006) 802–817

11. Geisler, W.S., Perry, J.S.: A real-time foveated multiresolution system for low-bandwidth video communication. In: SPIE Proceedings. Volume 3299. (1998) 1–13

12. Alsam, A., Drew, M.S.: Fast colour2grey. In: 16th Color Imaging Conference: Color, Science, Systems and Applications, Society for Imaging Science & Technology (IS&T)/Society for Information Display (SID) joint conference. (2008) 342–346

13. Frankot, R.T., Chellappa, R.: A method for enforcing integrability in shape from shading algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence 10 (1988) 439–451

14. Alsam, A., Rivertz, H.J.: Constrained gradient integration for improved image contrast. In: Proceedings of the IASTED International Conference, Signal and Image Processing (SIP 2011). (2011) 13–18

15. Itti, L., Koch, C., Niebur, E.: A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (1998) 1254–1259

16. Banks, M., Sekuler, A., Anderson, S.: Peripheral spatial vision: limits imposed by optics, photoreceptors, and receptor pooling. J. Opt. Soc. Am. A 8 (1991) 1775–1787

17. Arnow, T.L., Geisler, W.S.: Visual detection following retinal damage: Predictions of an inhomogeneous retino-cortical model. In: Human Vision and Electronic Imaging, Proceedings of SPIE. Volume 2674. (1996)

18. Wang, Z., Bovik, A.C.: Embedded foveation image coding. IEEE Transactions on Image Processing 10 (2001) 1397–1410

(a) Foveated image 1 (b) Foveated image 2 (c) Foveated image 3 (d) Original image (e) Result (f) Difference

Fig. 2. In the left column the foveated images for three fixations are shown. Here, the fixation points are represented as red dots. The images in the right column show the original image, the result obtained by combining the foveated images using the proposed method, and the difference between the result and the original image. We notice that the result image is sharp in the regions corresponding to the three fixation points; we further notice that the image represents a good approximation of the original, with greater differences in the parts that the observer deemed to be less salient. In the difference image, the dark regions indicate the locations where the differences are higher.

(a) Foveated image 1 (b) Foveated image 2 (c) Foveated image 3 (d) Original image (e) Result (f) Difference

Fig. 3. In the left column the foveated images for three fixations are shown. Here, the fixation points are represented as red dots. The images in the right column show the original image, the result obtained by combining the foveated images using the proposed method, and the difference between the result and the original image. We notice that the result image is sharp in the regions corresponding to the three fixation points; we further notice that the image represents a good approximation of the original, with greater differences in the parts that the observer deemed to be less salient. In the difference image, the dark regions indicate the locations where the differences are higher.

(a) Foveated image 1 (b) Foveated image 2 (c) Foveated image 3 (d) Original image (e) Result (f) Difference

Fig. 4. In the left column the foveated images obtained by using the first three salient points from the visual saliency algorithm by Itti et al. [15] are shown. Here, the fixation points are represented as red dots. The images in the right column show the original image, the result obtained by combining the foveated images using the proposed method, and the difference between the result and the original image. We notice that the result image is sharp in the regions corresponding to the three fixation points; we further notice that the image represents a good approximation of the original, with greater differences in the parts that the observer deemed to be less salient. In the difference image, the dark regions indicate the locations where the differences are higher.

(a) image 1 (b) image 2 (c) image 3 (d) image 4 (e) image 5 (f) image 6

Fig. 5. The bitrates for saving the original image and the corresponding result image in JPEG format with different quality values, ranging from 10 to 100, based on six different images. We notice that for the same compression quality the new images require less storage space.

A.8 What the eye did not see–a fusion approach to image coding (extended)

Authors: Ali Alsam, Hans Jakob Rivertz, and Puneet Sharma.

Full title: What the eye did not see–a fusion approach to image coding.

Published in: International Journal on Artificial Intelligence Tools.

International Journal on Artificial Intelligence Tools Vol. 22, No. 6 (2013) 1360014 (13 pages)

© World Scientific Publishing Company
DOI: 10.1142/S0218213013600142

WHAT THE EYE DID NOT SEE – A FUSION APPROACH TO IMAGE CODING

ALI ALSAM, HANS JAKOB RIVERTZ and PUNEET SHARMA
Department of Informatics & e-Learning (AITeL)

Sør-Trøndelag University College (HiST), Trondheim, Norway

er.puneetsharma@gmail.com

Received 15 January 2013; Accepted 14 July 2013; Published 20 December 2013

The concentration of the cones and ganglion cells is much higher in the fovea than in the rest of the retina. This non-uniform sampling results in a retinal image that is sharp at the fixation point, where a person is looking, and blurred away from it. This difference between the sampling rates at the different spatial locations presents us with the question of whether we can employ this biological characteristic to achieve better image compression. This can be achieved by compressing an image less at the fixation point and more away from it. It is, however, known that the vision system employs more than one fixation to look at a single scene, which presents us with the problem of combining images pertaining to the same scene but exhibiting different spatial contrasts.

This article presents an algorithm to combine such a series of images by using image fusion in the gradient domain. The advantage of the algorithm is that, unlike other algorithms that compress the image in the spatial domain, our algorithm results in no artifacts. The algorithm is based on two steps: in the first, we modify the gradients of an image based on a limited number of fixations, and in the second, we integrate the modified gradient.
