**Real-Time Geometry Decompression on Graphics Hardware**

**Echtzeit-Geometriedekompression auf Graphikhardware**

### Submitted to the Technical Faculty of the University of Erlangen–Nürnberg

### for the degree of

### Doktor-Ingenieur

### by

**Quirin Nikolaus Meyer**

### Erlangen — 2012

Approved as a dissertation by the Technical Faculty of the University of Erlangen–Nürnberg

Date of submission: 11 May 2012
Date of doctoral defense: 1 August 2012

Dean: Prof. Dr.-Ing. habil. Marion Merklein
Reviewers: Prof. Dr.-Ing. Marc Stamminger, Prof. Dr. rer. nat. Michael Guthe

©2012 Quirin Nikolaus Meyer. All rights reserved.

Real-Time Computer Graphics focuses on generating images fast enough to create the illusion of continuous motion. It is used in science, engineering, computer games, image processing, and design. Special-purpose graphics hardware, a so-called graphics processing unit (GPU), accelerates the image generation process substantially. Therefore, GPUs have become indispensable tools for Real-Time Computer Graphics.

The purpose of GPUs is to create two-dimensional (2D) images from three-dimensional (3D) geometry. Thereby, the 3D geometry resides in GPU memory. However, the ever increasing demand for more realistic images constantly pushes geometry memory consumption. This makes GPU memory a limiting resource in many Real-Time Computer Graphics applications. An effective way of fitting more geometry into GPU memory is to compress the geometry.

In this thesis, we introduce novel algorithms for compressing and decompressing geometry. We propose methods to compress and decompress 3D positions, 3D unit vectors, and the topology of triangle meshes. Thereby, we obtain compression ratios from 2:1 to 26:1. We focus on exploiting the high degree of parallelism available on GPUs for decompression. This allows our decompression techniques to run in real-time and to impact rendering speed only little. At the same time, our techniques achieve high image quality: images generated from compressed geometry are visually indistinguishable from images generated from non-compressed geometry. Moreover, our methods are easy to combine with existing rendering techniques. Thereby, a wide range of applications may benefit from our results.

Real-Time Computer Graphics is concerned with generating images fast enough to evoke the illusion of continuous motion. It is applied in the natural sciences and engineering, in computer games, in image processing, and in the design disciplines. Special-purpose graphics hardware, referred to as a graphics processing unit (GPU), accelerates the image generation process considerably. For this reason, GPUs have become irreplaceable tools of Real-Time Computer Graphics.

GPUs serve to generate two-dimensional (2D) images from three-dimensional (3D) geometry. The 3D geometry thereby resides in GPU memory. However, the steadily growing demand for ever more realistic images also makes geometry memory consumption grow. GPU memory consequently becomes a scarce resource in many Real-Time Computer Graphics applications. An effective means of fitting more geometry into GPU memory is to store the geometry there in compressed form.

This thesis presents novel geometry compression and decompression algorithms. The proposed methods serve to compress and decompress 3D positions, 3D unit vectors, and the topology of triangle meshes. Compression ratios from 2:1 to 26:1 are achieved. The focus of this work is the exploitation of the high degree of parallelism available on GPUs during decompression. As a result, the decompression methods run in real-time and affect rendering speed only marginally. At the same time, the generated images are of high quality: images generated from compressed geometry are visually indistinguishable from images based on non-compressed geometry. Moreover, the methods proposed here are easy to combine with existing rendering techniques, allowing a wide range of applications to benefit from the results of this work.

I would like to thank my advisor Prof. Dr.-Ing. Marc Stamminger for his inspiration, his patience, his trust, his understanding, his advice, his way of motivating people, and his ability to listen. Moreover, I thank Dr.-Ing. Gerd Sußner of Realtime Technology AG (RTT) for initiating and advising the cooperation between RTT and the Chair of Computer Science 9 (Computer Graphics) at the University of Erlangen-Nuremberg. This cooperation inspired many ideas found in this thesis. Therefore, I would like to express my gratitude towards the RTT staff supporting this cooperation, namely Peter Röhner and Cornelia Denk (now at BMW).

Further, I thank Prof. Dr. Günther Greiner for the fruitful discussions on all kinds of mathematical issues. Dr.-Ing. Jochen Süßmuth kindly provided his XeLaTeX template. I have to thank him even more for his invaluable advice.

During my thesis, I had the opportunity to publish with great researchers who happen to be wonderful persons at the same time. Besides Marc, Günther, Jochen, and Gerd, I thank Charles Loop, Ph.D. (Microsoft Research), Prof. Dr. Rolf Wanka (Chair of Computer Science 12) and Prof. Dr. Carsten Dachsbacher (then at University of Stuttgart), as well as Dr.-Ing. Christian Eisenacher, Ben Keinert, Magdalena Prus, Henry Schäfer, Fabian Schönfeld, and Christian Siegl (all from the Chair of Computer Science 9).

The fantastic and overwhelming ambiance at our chair was created by my co-authors, as well as Elmar, Frank, Franz, Maddi, Marco, Mickie, Michael, Kai, Roberto, Maria, Manfred, and my officemates Matthias and Jan.

For providing models, I thank Stanford University (Bunny, Buddha, Armadillo, Dragon, Asian Dragon, Statuette), The Digital Michelangelo Project (David), Georgia Institute of Technology (Angel, Horse, Hand), Cyberware Inc. (Dinosaur, Igea, Isis, Rocker Arm), AIM@SHAPE (Neptun), Bangor University, UK (Welsh Dragon), and Cornelia Denk of BMW (Car).

I would like to thank my family, especially my brother Simon and my parents Gertie and Meinhart for encouraging and supporting me much longer than this thesis took. But most of all, I have to thank my girlfriend Linda: not only because she did all the proofreading, but for her love, strength, endurance, trust, and support.

— Quirin Meyer, Erlangen, August 2012

**Acknowledgements**

**1 Introduction**
1.1 Challenges and Benefits
1.2 Contributions and Overview

**2 Background**
2.1 Notation
2.2 Real-Time Rendering
2.2.1 Triangle Meshes
2.2.2 Lighting Models
2.2.3 Real-Time Rendering on Graphics Hardware
2.3 Number Formats for Real Numbers
2.3.1 Floating-Point Numbers
2.3.2 Uniformly Quantized Numbers

**3 Adaptive Level-of-Precision**
3.1 Introduction
3.1.1 Contributions
3.1.2 Overview
3.2 Previous Work
3.3 Level-of-Precision
3.3.1 Representing Positions
3.3.2 Bit-Level Adaption
3.3.3 Packed Buffers
3.3.4 Rendering Quality
3.4 Adaptive Precision
3.4.1 LOP Selection
3.4.2 Adaptive Precision Buffer
3.4.3 Crack Removal
3.5 Constrained Adaptive Precision
3.5.1 Shading Error
3.5.2 Depth Error
3.5.3 Coverage Error
3.5.4 Binned Adaptive Precision Buffers
3.6 GPU Implementation
3.6.1 Graphics Pipeline Integration
3.6.2 Bit-Level Adaption
3.6.3 Rendering Algorithm
3.7 Results
3.7.1 Quality
3.7.2 Memory Usage
3.7.3 Rendering Performance
3.7.4 Changing Bit-Levels
3.8 Conclusion and Future Work

**4 Unit Vector Compression**
4.1 Introduction
4.1.1 Motivation
4.1.2 Overview
4.1.3 Contributions
4.2 Discrete Sets of Unit Vectors
4.2.1 Maximum Angular Quantization Error
4.2.2 Properties of Discrete Sets of Unit Vectors
4.3 Lower Bounds for Unit Normal Vectors
4.3.1 Accuracy
4.3.2 Number
4.3.3 Conclusion
4.4 The Accuracy of Floating-Point Unit Vectors
4.4.1 Redundancies
4.4.2 Maximum Angular Quantization Error
4.4.3 Summary
4.5 Previous Work
4.5.1 Parameterization Methods
4.5.2 Subdivision Methods
4.5.3 Look-Up-Table Methods
4.5.4 Conclusion
4.6 Parameterization Methods
4.7 Parameter Quantization and Error Analysis
4.7.1 Domain Quantization
4.7.2 Range Quantization
4.7.3 Error Analysis of Domain Quantization
4.7.4 Error Analysis of Range Quantization
4.7.5 Compactness Factor
4.8 Results
4.8.1 Quantization Errors
4.8.2 Bit-Budget of 48 Bits
4.8.3 Validation of Errors
4.8.4 Comparison with FPUVs
4.8.5 Quality
4.8.6 Timings
4.8.7 GPU Decompression of Per-Vertex Unit Vectors
4.8.8 Surplus Bits
4.9 Conclusion
4.10 Further Applications

**5 Triangle Mesh Topology Compression**
5.1 Introduction
5.1.1 Contributions
5.1.2 Overview
5.2 Related Work
5.3 Generalized Triangle Strips
5.3.1 Creating Belts and Stripification
5.3.2 Vertex Order
5.3.3 Compact Representation of Generalized Triangle Strips
5.4 Decompression of Generalized Triangle Strips
5.4.1 Data-Parallel Scan
5.4.2 Data-Parallel Algorithm
5.4.3 Restart Emulation
5.4.4 Alternative Decompression Approaches
5.5 Incremental Vertices
5.6 Data-Parallel Word-Aligned Code
5.6.1 Simple-9 Codes
5.6.2 Data-Parallel Decompression of Simple-9 Codes
5.6.3 Alternatives to Simple-9
5.7 Decompression Pipeline
5.8 Results and Discussion
5.8.1 Compression Rate
5.8.2 Decompression Speed
5.8.3 Impact of Vertex and Triangle Order
5.8.4 Runtime Memory Consumption
5.8.5 Alternative Stripification Methods
5.8.6 Application Fields
5.8.7 Comparison
5.9 Conclusion and Future Work

**6 Conclusion and Outlook**

**Bibliography**

2.1 Explicit Representation of a Triangle Mesh
2.2 OpenGL Graphics Pipeline
2.3 Distribution of Mini-Floats
3.1 Difference Between LOP- and LOD-Methods
3.2 Difference Between AP and CAP
3.3 Positions Stored as Floats, AP, and CAP
3.4 Packed and Unpacked Buffers
3.5 Adaptive Precision Example
3.6 Object Space Error to Camera Space Error
3.7 Adaptive Precision Summary
3.8 Crack Removal
3.9 Shading Artifacts Caused by Altering Positions
3.10 Shading Error due to Wrong Normal Vectors
3.11 Per-Vertex Minimum Bit-Level
3.12 Cause of Depth Errors
3.13 Removing Depth Artifacts
3.14 Constrained Adaptive Precision
3.15 Binned Adaptive Precision Buffer
3.16 Bit-Levels of a Binned Adaptive Precision Buffer
3.17 Data-Parallel Unpacking and Packing
3.18 Quality Results of the David Model
3.19 Quality Results of the Car Model
3.20 Popping Artifacts
3.21 Rotation Motion Timings
3.22 Dolly Motion Timings
3.23 Combining LOD and LOP
4.1 Non-Uniformly vs. Uniformly Distributed Unit Vectors
4.2 Artifacts Caused by Poorly Quantized Unit Vectors
4.3 Cause of the Artifacts with Poorly Quantized Unit Vectors
4.4 Floating-Point Unit Vectors Distribution
4.5 Rotation of FPUVs
4.6 Naming Conventions of the Floating-Point Grid
4.7 2D FPUVs Angular Quantization Error
4.8 Example of the Vertical Motion
4.9 Example of the Horizontal Motion
4.10 Special Case of the Diagonal Motion
4.11 Unfolding an Octahedron
4.12 Warped Octahedron
4.13 Cut-off Regions of Parallel Projections
4.14 Domain Quantization
4.15 Range Quantization
4.16 Voronoi Regions of Range Quantization
4.17 Shading Results at a Bit-Budget of 17 Bits
4.18 Angular Error at a Bit-Budget of 17 Bits
4.19 Comparing WCP and OP with a Bit-Budget of 16 Bits
4.20 Comparison Between FPUVs and OP Unit Vectors
5.1 Welsh Dragon
5.2 Overview of our Codec
5.3 Belt Traverser
5.4 Strip-Codes
5.5 Generalized Triangle Strip
5.6 M_Scan and Q_Scan
5.7 T_Kernel
5.8 Data-Parallel Unpacking of Incremental Vertices
5.9 Simple-9 Compression Example
5.10 S_Scan
5.11 D_Scatter
5.12 Decompression Pipeline
5.13 Test Meshes
5.14 Head of Michelangelo's David

2.1 Binary Powers
2.2 Floating-Point Number Formats
3.1 Details of the David and the Car Data Set
3.2 AP and CAP Memory Usage
3.3 AP and CAP Rendering Performance
4.1 Maximum Angular Quantization Error of Floats
4.2 Summary of Parameterization Methods
4.3 Parameters and Unit Vectors of Maximum Error
4.4 Maximum Error and Compactness Factors
4.5 Maximum Error Example
4.6 Maximum Error Experiment
4.7 Mixed Precisions Experiment
4.8 Bits To Maintain FPUV-Error
4.9 CPU Compression and Decompression Timings
4.10 Surplus Bits of Various Parameterization Methods
5.1 Simple-9 Selectors
5.2 Mesh Details
5.3 Compression Rates and Decompression Performance
5.4 Compression Rate Details
5.5 Detailed Decompression Timings
5.6 Rendering Times with our Layout and OpenCCL
5.7 Mesh Details of David's Head
5.8 Timings and Memory Consumption of David's Head

**Introduction**

In real-time rendering, image sequences are created quickly enough to evoke the illusion of a continuous motion, and the image generation process can instantaneously respond to external influences. The research domain concerned with this task is Real-Time Computer Graphics. It has become an irreplaceable part of many application fields, for example computer games, image processing, scientific visualization, design, and engineering. Almost inseparably linked with Real-Time Computer Graphics is special-purpose hardware, so-called GPUs: GPUs significantly speed up the image generation process and oftentimes make real-time image generation possible in the first place.

GPUs are almost as ubiquitous as Real-Time Computer Graphics. Wherever there is Real-Time Computer Graphics, there is virtually always a GPU. They are an integral part of a wide range of devices, such as smart-phones, tablets, laptops, workstations, clusters, and supercomputers. A GPU may be very powerful for specific tasks and surpass the performance of traditional central processing units (CPUs) by orders of magnitude.

Despite the impressive power of GPUs, their resources are limited. Depending on the application, the computational speed may be too low, the memory bandwidth too small, or the GPU memory may not suffice to fit the necessary data. The reason for this shortage is that the expectations placed on GPUs rise constantly. With faster and more accurate sensors and simulations, the amount of data that medical and scientific visualization applications need to process skyrockets. Designers and engineers are steadily raising their expectations towards detail and precision. The ever growing demand for realism calls for more finely resolved data. All of this results in a steady growth of GPU memory requirements. Data compression is an appropriate and effective means of counteracting memory space shortage.

In GPU memory, we place all information required for rendering an image, such as geometric data and textures, amongst other, less space-consuming data. Driven by their importance in computer games, compressed textures are supported by modern GPUs. Textures are compressed lossily, meaning the original data is not equal to the data extracted from the compressed data. The ratio of uncompressed over compressed data size ranges from 2:1 to 6:1. Further, random access to compressed textures comes at no extra performance cost over uncompressed textures.

While in computer games the major cause of memory shortage is textures, visualization and engineering applications need to process large amounts of geometric data in the form of triangle meshes. These triangle meshes originate from highly detailed construction, sensor, or simulation data and can easily contain several millions of triangles and vertices. As a result, GPU memory consumed by triangle meshes becomes the limiting factor. Therefore, it is necessary to come up with solutions that reduce the size of geometric data.

**1.1 Challenges and Benefits**

In contrast to on-the-fly texture decompression, built-in GPU triangle-mesh decompression does not exist. In this thesis, we fill this gap and propose several decompression methods targeting different subsets of geometric data.

There are three major challenges:

• **Parallelization:** Although geometry compression and decompression is a well-researched subject, the available algorithms are not suited for GPUs. This is because GPUs achieve their high performance through parallelization of the deployed rendering algorithms. However, geometry decompression algorithms are inherently sequential. As a consequence, they would perform poorly on GPUs and real-time rendering would no longer be possible. Therefore, we need to design new decompression methods that run in parallel on a GPU.

• **Real-Time:** Besides reducing memory consumption, the proposed methods must have only little impact on rendering performance. This rules out methods with poor decompression speed, even if they achieve high compression ratios.

• **Rendering Quality:** For the application scenarios we target, lossy compression is tolerated only as long as it does not degrade image quality. That means that for the attributes associated with each vertex, such as positions or normal vectors, lossy compression is acceptable only up to a certain degree. Most importantly, *topology*, i.e., the way triangles are connected, may not change. This rules out solutions that reduce the number of triangles or vertices in a pre-process.

Parallel geometry decompression allows placing more data in GPU memory. Moreover, it speeds up the chronically slow CPU-to-GPU memory transfers or even makes them entirely dispensable. It further accommodates two ongoing developments in computer design:

• Computational power increases while data bandwidth cannot keep pace. Processors can only perform instructions as long as they have the necessary data at hand. Otherwise, they run idle. Transmitting compressed data effectively increases the bandwidth and helps keep the processor busy.

• Computational power increases because more and more processors are assembled onto a single chip. However, their individual speed increases only little. Sequential algorithms, as used in traditional geometry decompression, are not able to profit from this development.

Therefore, parallel decompression algorithms are needed to account for these two trends.

**1.2 Contributions and Overview**

Triangle meshes have been the workhorse of Real-Time Computer Graphics for several decades. They consist of topological information that connects vertices to form triangles. With each vertex, we associate a 3D position. The majority of triangle meshes additionally store one 3D unit vector with each vertex that represents the surface normal vector at the vertex.

We contribute methods that compress positions, unit normal vectors, and topology independently from each other. The methods can be combined with each other and are simple to integrate into existing rendering algorithms. Therefore, our methods help a wide range of real-time rendering applications to reduce their appetite for memory.

In Chapter 3, we introduce *level-of-precision* (LOP), a novel approach for compactly representing vertex positions in GPU memory. Thereby, we adapt the precision of vertex positions based on a view-dependent criterion. GPU memory for vertex positions is reduced by only storing the currently necessary precision. Our compact representations enable fast random access from within a vertex program. Once the view-point changes, we adapt a model's vertex precision. Thanks to our new data-parallel algorithms that run entirely on the GPU, precision is adapted faster than the model is rendered. Furthermore, we allow locally refining vertex position precision to avoid artifacts that would otherwise occur when using reduced precision. The algorithms have been published in [MSGS11].

In Chapter 4, we analyze unit vector representations that are used to encode per-vertex normal vectors. The most common one uses three floating-point numbers. To our knowledge, we are the first to derive the error of this representation. Based on our error analysis, we compare existing unit vector compression methods and study their ability to match this error using less memory. We find that a parameterization based on projecting unit vectors onto the surface of an octahedron performs best. Unit vectors represented with 52 bits in this representation are more than sufficient to achieve the accuracy of three floating-point numbers using 96 bits. We further show that any other unit vector representation saves at most 1.14 bits over the octahedron projection. Our compact unit vector representation is decompressed on the GPU at no extra cost. Some of the findings have also been published in [MSS∗10].

In Chapter 5, we present a lossless triangle mesh topology compression scheme. It is designed to leverage GPU data parallelism during decompression. We order triangles coherently to form generalized triangle strips. We unpack generalized triangle strips efficiently, using a novel data-parallel algorithm. Additional compression benefits come from a variable bit-length code, for which we propose a data-parallel decompression algorithm. While uncompressed triangles require 96 bits per triangle, we obtain 3.7 to 7.6 bits per triangle. Thanks to the high degree of parallelism, our decompression algorithm is extremely fast and achieves up to 1.7 billion triangles per second. At the time of finishing this thesis, a paper describing the algorithm has been submitted to a journal [MKSS12].

We provide background information required for this thesis in Chapter 2. In Chapters 3 through 5, we thoroughly describe our contributions. Finally, we conclude the thesis with an outlook on future work (cf. Chapter 6).

**Background**

Our goal is to combine geometry decompression with real-time rendering. Both fields are well-established education and research domains that draw upon a vast body of literature. In this chapter, we single out the fundamental topics that are most important for this thesis. We provide a concise background and explain concepts that reoccur constantly throughout the main chapters.

In Section 2.2, we summarize elementary concepts of real-time rendering. This includes triangle mesh representation and its memory consumption, lighting models, the graphics pipeline, and its hardware implementation. Large parts of geometry are defined by real numbers. To efficiently compress real numbers, a solid understanding of the formats approximating real numbers on a computer is needed; we provide it in Section 2.3. But first, we establish some notation.

**2.1 Notation**

We consistently adhere to the following conventions:

• *Scalar numbers* are written in italics, e.g., *x*, *ΔQ*, and so forth.

• *Matrices and vectors* are written in bold face, e.g., **v**. To refer to the *components* of a vector, we use the notation common in C-style programming languages for structure record types (i.e., struct): for example, to access the *x* component of a vector **v**, we use **v**.x. Throughout the thesis, we mostly deal with column vectors, e.g., **v** = (**v**.x, **v**.y, **v**.z)^T.

• The *inner product* of two vectors **a** and **b** is denoted by angle brackets, i.e., ⟨**a**, **b**⟩.

**Symbol** | Ki | Mi | Gi | Ti | Pi
**Factor** | exp₂(10) | exp₂(20) | exp₂(30) | exp₂(40) | exp₂(50)

**Table 2.1: Binary Powers.** The abbreviations shown above represent high powers of two.

• Double vertical bars ‖·‖_p indicate vector and matrix norms. The index *p* specifies the particular norm, e.g., *p* = 2 for the Euclidean vector norm.

• Elements of *finite sets* and *arrays* are indexed using the array notation of C-style programming languages. For a set we write *S* = {*S*[0], *S*[1], *S*[2], ...} and for an array *A* = (*A*[0], *A*[1], *A*[2], ...). Note that indices start with 0. Sets or arrays with scalar elements are written in italics. Matrix and vector element types are written bold-faced.

• *Intervals* are denoted by brackets and parentheses depending on whether they are closed or open. For example, [*a*, *b*] is a closed interval from *a* through *b*, and [*a*, *b*) is an interval that is closed at the *a*-end and open at *b*.

We often use powers of two, which are usually represented by superscripts, i.e., 2^N. This may, however, become cumbersome to read, particularly in the presence of multiple levels of superscripts, as for example 2^(2^N). It gets even more complicated once superscripts themselves contain subscripts, e.g., 2^(N_M). In other disciplines, such as mathematics or physics, the same problem occurs when dealing with the natural exponential function e^x. It is oftentimes replaced by exp(x) := e^x. We adopt this convention and define power-of-two as a function

exp₂ : x ↦ 2^x.

Its inverse function is the base-two logarithm log₂ y.

When dealing with compression and decompression techniques, it is inevitable to provide data size information. The unit for *bytes* is abbreviated using the letter B. We abbreviate large sizes with *binary prefixes*, which are derived from powers of two. For example, 1024 B = exp₂(10) B = 1 KiB, where Ki abbreviates *kilobinary*. See Table 2.1 for a list of binary powers. For powers of 10, we use the SI prefixes, e.g., M for 10^6, G for 10^9, and so forth.

**Figure 2.1: Explicit Representation of a Triangle Mesh.** The explicit representation is an array **T**. Each element **T**[*i*] consists of three indices.

We use *compression ratios* (e.g., 3:1 or 5.5:1) to relate the size of uncompressed data to compressed data. The higher a compression ratio, the smaller the compressed data. We also make use of *compression rates* (e.g., bits per triangle (bpt), bits per unit vector) to break down the benefit of compression to individual elements. The lower the compression rate, the better. To measure decompression speed, we use *decompression rates* (e.g., triangles per second, unit vectors per second). The higher the decompression rate, the better.
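As a toy illustration of these conventions (the helper names are ours, not the thesis's), the binary prefixes and the ratio measure can be written down in a few lines:

```python
def exp2(x: int) -> int:
    # Power-of-two function from Section 2.1: exp2(x) = 2^x.
    return 2 ** x

def compression_ratio(uncompressed: float, compressed: float) -> float:
    # Ratio of uncompressed over compressed data size, e.g. 26:1.
    return uncompressed / compressed

# 1 KiB is exp2(10) = 1024 bytes ("Ki" = kilobinary, Table 2.1).
kib = exp2(10)

# An uncompressed triangle takes 96 bits of topology; at the
# 3.7 bits per triangle reached in Chapter 5, the ratio is about 26:1.
ratio = compression_ratio(96, 3.7)
```

The same two quantities, bits per element and elements per second, reappear as compression and decompression rates throughout the result sections.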

**2.2 Real-Time Rendering**

In real-time rendering, the time for generating an image (also called frame) may not surpass 67 ms [AMHH08]. The image is generated from a 3D description of geometry, which is typically represented by triangle meshes (cf. Section 2.2.1). To produce appealing images, the geometry is exposed to physically motivated lighting simulations (cf. Section 2.2.2). Real-time rendering is possible with an optimized graphics pipeline implemented on graphics hardware (cf. Section 2.2.3).

**2.2.1 Triangle Meshes**

In real-time rendering, *triangle meshes* are the most common form of representing geometry. A triangle mesh is a set of triangles. A *triangle* itself consists of three different *vertices*. The vertices of a triangle are connected through *edges*. Triangles are connected by sharing common vertices.

Each vertex has one or more *vertex attributes*, or shortly, *attributes*. Common types of attributes include the position of the vertex in 3D space, a 3D unit normal vector, 2D texture coordinates, 3D tangent and bi-tangent vectors, and color values. This list is, of course, incomplete. Any kind of attribute imaginable can be associated with a vertex.

Triangle meshes are often represented in what we refer to as the *explicit representation*. This representation is also called *triangle-vertex representation* or *indexed triangle-set*. In the explicit representation, we store the vertex attributes in so-called *vertex arrays*. Vertex arrays that belong to the same triangle mesh all have *N*_V elements, where *N*_V is the number of vertices.

In the array of triangles **T**, we store for each triangle **T**[*i*] three *vertex indices* **T**[*i*].v0, **T**[*i*].v1, and **T**[*i*].v2. Thereby, **T**[*i*].v0, **T**[*i*].v1, and **T**[*i*].v2 reference elements of the vertex arrays. The array consists of *N*_T elements, where *N*_T is the number of triangles of the mesh. An example of an array of triangles and the associated topology is shown in Figure 2.1.

Note that the order of the triangles in the array **T** is not important for the final appearance of the model. We can therefore order the triangles in any order we want. The order of the vertices of one triangle can also be interchanged, but there is one restriction: the *orientation* must remain, as many algorithms rely on a distinct order. A triangle, represented by a row vector of three vertex indices (v0, v1, v2), does not change its orientation if we use the permutations (v1, v2, v0) and (v2, v0, v1), i.e., we "rotate" the vertex indices. Other permutations change the orientation and are therefore not allowed.

The order of the elements in the attribute arrays does not affect the appearance of the model, either. When reordering the elements in the attribute arrays, we have to adjust the vertex indices of **T** to point to the right attributes.
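A minimal sketch of the explicit representation (the concrete two-triangle mesh is a made-up example; the array names follow the notation above):

```python
# Vertex arrays: one entry per vertex (N_V = 4 here).
positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0),
             (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
normals = [(0.0, 0.0, 1.0)] * 4

# Triangle array T: each element holds three vertex indices (N_T = 2).
T = [(0, 1, 2), (2, 1, 3)]

def rotate_indices(tri):
    """Cyclic permutation (v0, v1, v2) -> (v1, v2, v0).

    This keeps the triangle's orientation; swapping only two
    indices would flip it and is therefore not allowed."""
    v0, v1, v2 = tri
    return (v1, v2, v0)

# The rotated triangle references the same vertices in the same winding.
assert rotate_indices(T[0]) == (1, 2, 0)
```

Reordering either array independently is legal as long as the indices in T are remapped accordingly, which is exactly the freedom the stripification in Chapter 5 exploits.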

**Memory Consumption**

The types of models that we use in this thesis consist of a triangle array, an array of vertex positions, and an array of vertex unit normal vectors. Typically, three integers are used to store a triangle, i.e., 12 B. A vertex position and a unit normal vector use three single-precision floating-point numbers, i.e., 12 B, each. The memory consumption of an uncompressed mesh is

K = N_T · 12 B + N_V · 2 · 12 B.

For large meshes, N_T ≈ 2 · N_V, therefore

K ≈ 24 · N_T B.

Hence, vertex positions and unit normal vectors comprise about 25 % of the size of the triangle mesh, each. The triangle array consumes the remaining portion of about 50 %.
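The size estimate transcribes directly into code (sizes in bytes; the helper name is ours):

```python
def mesh_bytes(n_t: int, n_v: int) -> int:
    # 12 B per triangle (three 4-B indices) plus
    # 12 B position and 12 B unit normal per vertex.
    return n_t * 12 + n_v * 2 * 12

# For large meshes N_T is roughly 2 * N_V, so K is about 24 * N_T bytes:
n_t = 1_000_000
k = mesh_bytes(n_t, n_t // 2)
assert k == 24 * n_t  # 24 MB for a one-million-triangle mesh
```

The split confirms the percentages above: the triangle array contributes 12 · N_T B (50 %), and positions and normals contribute 6 · N_T B (25 %) each.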

**2.2.2 Lighting Models**

In order to produce realistic images, we compute the amount of outgoing color intensity at a point on the surface of a triangle mesh using *lighting models*. Lighting models are expressed in terms of *shading equations*. For each color channel (typically red, green, and blue), we compute the intensity values independently.

One common lighting model is the *diffuse lighting model*. It is also referred to as *Lambertian reflectance* or *Lambert's cosine law*. It reoccurs in many other lighting models as part of their diffuse term. The amount of outgoing diffuse light at a point **p** on a surface is

f_Diffuse = I_D · max(⟨**n**, **l**⟩, 0).   (2.1)

The vectors **n** and **l** have unit length, i.e., ‖**n**‖ = ‖**l**‖ = 1. **l** points towards the light source and **n** is the unit normal vector at **p**. I_D combines the diffuse reflectance characteristics of the material at the point **p** and the amount of light emanating from the light source for one color channel.

The *Blinn-Phong* lighting model [Bli77] builds upon diffuse reflection. It adds specular highlights to the surface, i.e.,

f_Blinn = f_Diffuse + I_S · ⟨n, h⟩^s, (2.2)
where s is the specular exponent. The halfway vector **h** is halfway between **l** and the unit vector **v** pointing towards the camera, i.e.,

h = (v + l) / ‖v + l‖.

Similar to I_D, I_S is the specular color that emanates from the surface. It combines parameters from the light source with material properties.
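As an illustration, the two shading equations above can be evaluated on the CPU. The following Python sketch mirrors Equations (2.1) and (2.2); function and parameter names are ours, and we additionally clamp the specular dot product to zero, as is common in practice:

```python
import math

# Illustrative CPU evaluation of the diffuse and Blinn-Phong models.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(a):
    n = math.sqrt(dot(a, a))
    return tuple(x / n for x in a)

def diffuse(n, l, i_d):
    # Equation (2.1): f_Diffuse = I_D * max(<n, l>, 0)
    return i_d * max(dot(n, l), 0.0)

def blinn_phong(n, l, v, i_d, i_s, s):
    # Equation (2.2): f_Blinn = f_Diffuse + I_S * <n, h>^s, with the
    # halfway vector h = (v + l) / ||v + l||. The clamp on <n, h> is
    # our addition to avoid negative specular terms.
    h = normalize(tuple(a + b for a, b in zip(v, l)))
    return diffuse(n, l, i_d) + i_s * max(dot(n, h), 0.0) ** s

# Light, view, and normal aligned: full diffuse plus full specular term.
print(blinn_phong((0, 0, 1), (0, 0, 1), (0, 0, 1), 0.5, 0.3, 16))  # 0.8
```

In a renderer, the same arithmetic would run per fragment in a GLSL fragment program; the sketch only illustrates the equations.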

These two lighting models are commonly used in real-time rendering and generate plausible images in numerous rendering applications.

For further lighting models, we refer to the relevant literature [SAG∗05, AMHH08, PH10].

**2.2.3 Real-Time Rendering on Graphics Hardware**

Through the course of the previous decades, an algorithm called *rasterization* has very successfully been deployed in the field of real-time rendering. Thereby, 3D geometry, typically represented as triangle meshes, is projected triangle after triangle onto a 2D image plane and then converted to a raster image. The reason for the success of rasterization in the field of real-time rendering over other image generation algorithms, such as ray tracing [Whi80] or micro-polygon pipelines [CCC87], is that special purpose hardware, so-called GPUs, dramatically speeds up the image generation process.

As the methods presented in this thesis are in the domain of real-time rendering on graphics hardware, we summarize how the image generation process applied for rasterization is decomposed into several steps, the so-called *graphics pipeline*. We briefly explain how contemporary graphics hardware used for real-time rendering works and how it can be used for more general purpose tasks.

[Figure: vertex and index buffers feed the vertex shader; processed vertices pass through primitive assembly, the rasterizer, and the fragment shader to per-fragment operations, which write color and depth values to the frame buffer; stream output can write primitives back to vertex buffers.]

**Figure 2.2: OpenGL Graphics Pipeline.** The graphics pipeline is decomposed into programmable stages (green boxes) and fixed-function stages (red boxes). It uses input and output buffers located in GPU memory (orange boxes).

**Graphics Pipeline**

Rasterization is decomposed into several steps, forming the so-called *graphics pipeline*. Common inputs to the pipeline are 3D geometry, lighting models, and materials. The pipeline's output is a 2D image.

The pipeline steps are standardized, and they are accessed by the programmer through application programming interfaces (APIs), such as OpenGL [SA11] and Direct3D [Mic10]. The implementations developed during the course of this thesis make use of OpenGL 4.0 – 4.2. An overview of the pipeline stages that are important for this thesis is shown in Figure 2.2.

The pipeline has programmable stages, shown as green boxes, and fixed-function stages, shown as red boxes. Input and output data are stored in GPU memory, drawn as orange boxes.

In this thesis, we make use of the vertex and fragment shader stages. The other programmable stages, i.e., the geometry shader, tessellation control shader, and tessellation evaluation shader stages, are not used in this thesis and are therefore not shown in Figure 2.2. However, algorithms using these stages can also incorporate methods developed in this thesis. Programs for the programmable stages are implemented using shading languages. We use the OpenGL shading language (GLSL) [Kes11] to specify *vertex* and *fragment programs*.

We specify geometry using the explicit representation. We upload the triangle array **T** into an *index buffer* and the attribute arrays to *vertex buffers*.

Besides triangles, other primitives such as points or lines are supported, too.

The set of attributes associated with each vertex serves as input to the vertex shader stage. There, a *vertex program* is executed independently for each vertex. A vertex program outputs an arbitrary set of per-vertex attributes.

The vertex shader stage is fully programmable using GLSL. Most vertex programs transform per-vertex positions and unit normal vectors specified in local object space coordinates to world space or camera space. Moreover, each vertex position is transformed to *clip space*. Clip space is a vector space that graphics hardware uses to determine the 2D fragment location and its depth value in screen space.

The vertex attributes output by the vertex shader stage are assembled into primitives, e.g., triangles in our case, by the *primitive assembly* stage. The programmer can optionally write the transformed primitives to a vertex buffer using the *stream output* stage. In most cases, the primitives are handed over to the *rasterizer* stage. There, each primitive is rasterized, i.e., it is converted to a set of fragments in screen space. For each fragment, the rasterizer interpolates the attributes output by the vertex shader stage. When activating stream output, the program may prevent the pipeline from executing the rasterizer and all subsequent stages. This is useful when streamed-out vertex buffers are used as input for subsequent rendering passes.

The interpolated attributes are used as input to the *fragment shader* stage. Like the vertex shader stage, the fragment shader stage is fully programmable with GLSL using *fragment programs*. A fragment program is executed for each fragment individually, and it outputs the fragment's final color and depth value (i.e., the distance from the fragment to the camera).

Most of our rendering algorithms use vertex programs to compute camera-space coordinates for unit normal vectors and positions as per-vertex attributes. The interpolated attributes are used in a fragment program to evaluate shading equations.

The fragments computed in the fragment shader stage are combined with results that are already in the frame buffer. The most important operation is depth testing. For each pixel in the frame buffer, the depth value of the fragment closest to the camera is stored in the *depth buffer* (or *z-buffer*). If a new fragment's depth value is closer to the camera than the existing depth value, the old pixel color and depth value are updated.

**Graphics Hardware and Compute APIs**

Both fixed-function stages and programmable stages of the graphics pipeline are implemented efficiently on GPUs. On modern GPUs, the programmable stages operate in a *data-parallel* way: a single instruction is executed on multiple data elements simultaneously. This is also referred to as the single instruction, multiple data (SIMD) paradigm.

For example, as the vertex shader stage transforms vertices independently from each other, all vertices can be transformed in parallel. The same vertex program is executed on different vertices. Likewise, the fragment shader stage processes all fragments in parallel. Each incarnation of a vertex or fragment program that operates on different data is called a *thread*. Shading language program threads are mapped to the computational cores of a GPU transparently to the programmer.

A GPU has significantly more computational cores than a CPU. For example, an Nvidia GeForce 580 GTX consists of 512 cores, whereas CPUs typically have two to eight cores. That is why GPUs are referred to as *many-core* architectures and CPUs as *multi-core* architectures.

For data located on the GPU, very high memory throughput rates are possible (up to 192.4 GB per second), and the computational cores can reach a theoretical peak performance of 1581 billion single-precision floating-point operations per second. In contrast, current 2nd generation Intel Core i7 CPUs with six cores running at 2.8 GHz obtain a bandwidth of 32 GB per second and a theoretical peak performance of 134.4 billion single-precision floating-point operations per second. At first sight, this seems a tremendous performance advantage for GPUs. It should be noted, however, that GPUs only achieve high performance rates for data-parallel algorithms. Not every problem can be recast in a data-parallel fashion. In general, sequential tasks perform much better on CPUs.

There are algorithms other than those realized with the rendering pipeline that can be expressed in a data-parallel fashion. These algorithms can also benefit from the computational power of GPUs. That is why the functionality of GPUs is also exposed through more general purpose data-parallel compute APIs, such as OpenCL [Mun11] or Nvidia's *compute unified device architecture (CUDA)* [Nvi11b]. We use CUDA for this thesis (cf. Chapter 5).

Instead of vertex or fragment programs, we specify programs called *kernels*. The kernel instructions are executed by spawning *threads*. From a programmer's perspective, each thread maps onto one computational core of a GPU, and the threads are executed in parallel.

**2.3 Number Formats for Real Numbers**

In this section, we provide a brief overview of floating-point numbers defined in the IEEE 754-2008 standard [IEE08]. We focus on *how* floats are stored and *what* real numbers they are able to represent. Based on the deeper understanding of floating-point numbers provided in Section 2.3.1, it will turn out during the course of this thesis that *uniformly quantized numbers* (cf. Section 2.3.2) are more appropriate for *transmitting* vertex attributes. However, we use floating-point numbers for computations, as they are widely supported by both GPUs and CPUs.

**2.3.1 Floating-Point Numbers**

The IEEE 754-2008 standard specifies several floating-point formats which are widely supported on today's computer architectures. We limit the explanation to the *normalized binary floating-point* format, as it is predominantly used on GPUs [Bly06, SA11]. Further, we exclude topics such as converting real values to floats, rounding modes, and the arithmetic precision of floating-point operations, and refer the reader to advanced literature on floating-point number systems [Gol91, Knu97, MBdD∗10].

A *normalized binary floating-point number* is a triple (s, e, m) with

• a *sign bit* s ∈ {0, 1}. If s = 0, the number is positive;

• an *exponent* e with N_e bits, where e ∈ [−2^(N_e−1) + 1, …, 2^(N_e−1)];

• a *mantissa* m with N_m bits, where m ∈ [0, …, 2^(N_m) − 1].

We convert a normalized binary floating-point number to a real number with the following mapping:

(s, e, m) ↦ (−1)^s · 2^e · (1 + m / 2^(N_m)),

for −2^(N_e−1) + 1 < e < 2^(N_e−1).

We want to clarify the term *normalized binary floating-point*. The word *binary* is due to the radix 2 as the base of the exponent. The word *normalized* is because of the term 1+ located left of the fraction. Normalization makes floating-point numbers unique, i.e., a real number that is representable by a normalized binary floating-point number maps to exactly one unique triple (s, e, m). From now on, the term *float* abbreviates *normalized binary floating-point number*.

A zero is encoded by a special triple (s, −2^(N_e−1) + 1, 0). A triple (s, −2^(N_e−1) + 1, m) with m ≠ 0 is a so-called *subnormal* number. The standard defines a special mapping for them; however, subnormal numbers play no role in this thesis, as GLSL flushes them to zero [Kes11].
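The decoding mapping above can be made concrete in a few lines of Python; the defaults correspond to the seven-bit mini-float format (N_e = N_m = 3) introduced with Table 2.2, and the function name is ours:

```python
def mini_float_value(s: int, e: int, m: int, n_e: int = 3, n_m: int = 3) -> float:
    """Decode a normalized binary floating-point triple (s, e, m).

    Implements the mapping (-1)^s * 2^e * (1 + m / 2^N_m); the defaults
    correspond to a mini-float format with N_e = N_m = 3.
    """
    if e == -2 ** (n_e - 1) + 1 and m == 0:  # the special zero triple
        return 0.0
    return (-1) ** s * 2.0 ** e * (1 + m / 2 ** n_m)

print(mini_float_value(0, 3, 7))   # largest finite mini-float: 15.0
print(mini_float_value(0, -2, 0))  # smallest positive mini-float: 0.25
```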

**Format**   N_e   N_m   **Total**   f_min            f_max
**Mini**      3     3       7        = 2.5·10^−1      = 1.5·10^1
**Half**      5    10      16        ≈ 6.1·10^−5      ≈ 6.5·10^4
**Single**    8    23      32        ≈ 1.2·10^−38     ≈ 3.4·10^38
**Double**   11    52      64        ≈ 2.2·10^−308    ≈ 1.8·10^308

**Table 2.2: Floating-Point Number Formats.** Three formats — *Half*, *Single*, and *Double* — are supported by current graphics hardware. The format named *Mini* is introduced for explanatory purposes. N_e and N_m are the number of bits for the exponent and the mantissa, respectively. *Total* is the total number of bits including the sign bit. The values f_min and f_max are the smallest and largest finite positive values the format can represent.


**Figure 2.3: Distribution of Mini-Floats.** The horizontal line is the real line. All possible mini-floats in the range of −1 to 1 are marked by vertical lines. The thick lines highlight the boundaries of intervals with common sample densities. Note the non-uniform distribution of the floating-point numbers.

Accuracy and range of floats depend on the number of bits spent on the mantissa and the exponent. From the definition, we directly derive the smallest positive float

f_min = 2^(−2^(N_e−1) + 2),

and the largest finite float, respectively,

f_max = (2 − 2^(−N_m)) · 2^(2^(N_e−1) − 1).

The smallest finite float is −f_max and the largest negative normalized float is −f_min. Table 2.2 lists the formats Half, Single, and Double that are currently supported by GPUs [SA11].
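These two formulas can be checked numerically against Table 2.2; a small Python sketch (function names are ours):

```python
def f_min(n_e: int) -> float:
    # Smallest positive normalized float: 2^(-2^(N_e - 1) + 2)
    return 2.0 ** (-2 ** (n_e - 1) + 2)

def f_max(n_e: int, n_m: int) -> float:
    # Largest finite float: (2 - 2^(-N_m)) * 2^(2^(N_e - 1) - 1)
    return (2 - 2.0 ** -n_m) * 2.0 ** (2 ** (n_e - 1) - 1)

print(f_min(3), f_max(3, 3))   # mini: 0.25 15.0
print(f_max(5, 10))            # half: 65504.0
```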

The table further contains the *Mini* float format that we use for explanatory purposes, but which has no practical relevance. It has a sign bit, three bits for the mantissa, and three bits for the exponent. Figure 2.3 shows the distribution of the mini-floats in the range from −1 to 1. The distribution is similar for other normalized binary floating-point formats. Every vertical line corresponds to a float. In between two floats, there are an infinite number of real values. The most important feature that can be seen from the figure is that floats are distributed non-uniformly across the real line: in the interval [2^i, 2^(i+1)) there are as many floats as in the interval [2^(i+1), 2^(i+2)), which covers a segment of twice the length. For example, the interval [1/2, 1) contains eight floats along a range of 1/2. The same number of floats is also in the interval [1, 2), which is twice as long. In general, in the interval [2^i, 2^(i+1)] the space between two floats is constant, i.e.,

ε_float,i = 2^(i − N_m). (2.3)

Hence, the closer we approach zero, the higher the sample density becomes.

Likewise, the resolution of floats coarsens towards |f_max|.
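Equation (2.3) can be verified directly; the following Python sketch (function name ours) also cross-checks the spacing against the unit-in-the-last-place reported by the standard library for IEEE doubles:

```python
import math

def float_spacing(i: int, n_m: int) -> float:
    # Equation (2.3): spacing between consecutive floats in [2^i, 2^(i+1)]
    return 2.0 ** (i - n_m)

# Mini-float (N_m = 3): the eight floats in [1/2, 1) are spaced 1/16 apart.
print(float_spacing(-1, 3))                    # 0.0625
# Python floats are IEEE doubles (N_m = 52); math.ulp reports the same
# spacing at magnitude 1.
print(float_spacing(0, 52) == math.ulp(1.0))   # True
```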

Also note the gap between −1/4 and 1/4 in the mini-float format of Figure 2.3, with only one sample in between, i.e., 0. The sampling rate increases the closer we approach zero, but after reaching |f_min|, it drops abruptly by a factor of 2^(N_m) and leaves a large gap. A similar gap exists for all other floating-point formats. This gap may be filled with subnormal numbers, however, not by GLSL, which does not support subnormal numbers.

For most rendering tasks, single precision is considered sufficient.

With the advent of Direct3D 10 [Bly06], GPUs adhere to the IEEE 754 standard. GLSL uses the standard [Kes11], too. Double precision is available; however, single-precision performance is typically two [Nvi11b] to five times [AMD11] higher on current hardware. In GLSL, half precision is not natively supported as a data type. Input attributes to vertex programs and texture formats may, however, use half precision to save bandwidth. For carrying out computations, they are converted to single precision [SA11].

**2.3.2 Uniformly Quantized Numbers**

In contrast to floats, *uniformly quantized numbers* are spaced uniformly. We store a uniformly quantized number using an unsigned integer i containing N_u bits. A uniformly quantized number maps to a real value that lies between a lower bound u_min and an upper bound u_max, i.e.,

unpack: i ↦ (u_max − u_min) / (2^(N_u) − 1) · i + u_min, (2.4)

where i ∈ [0, …, 2^(N_u) − 1]. Obviously, the distance between two adjacent representable real numbers is always

ε = (u_max − u_min) / (2^(N_u) − 1). (2.5)

To convert a real number r into a uniformly quantized number, we use the following mapping:

pack: r ↦ ⌊ (r − u_min) / (u_max − u_min) · (2^(N_u) − 1) + σ(r)/2 ⌋. (2.6)

Thereby, σ(r) is the sign function, i.e.,

σ(r) : r ↦ { +1 if r ≥ 0, −1 if r < 0. (2.7)

The plus sign splits Equation (2.6) into a term that inverts Equation (2.4) and a rounding term, i.e., σ(r)/2. The rounding term makes sure that we map to the uniformly quantized number that is closest to r after applying the floor function ⌊·⌋. Note that (2.6) can also be used if r is a float.

When converting a real number r into a uniformly quantized number i = pack(r), i is mapped to another real number q = unpack(i). We say that q is the number to which r gets quantized. This can be expressed as a function

quantize: r ↦ unpack(pack(r)). (2.8)

Note that r and q differ at most by the quantization error of ε/2.

As opposed to floats, GPUs do not offer hardware instructions for operations on uniformly quantized numbers. Therefore, we convert uniformly quantized numbers to floats of sufficient precision and perform arithmetic calculations with floats. This is convenient, as GPUs provide a rich, optimized instruction set for floats. Alternatively, we could implement arithmetic functions directly on uniformly quantized numbers in software with integer operations. However, this is rather complex from a programmer's perspective, and the resulting code is executed more slowly than the equivalent code that uses floating-point instructions.
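Equations (2.4) to (2.8) translate directly into code. The following Python sketch (function names are ours) shows the pack/unpack round-trip for an 8-bit quantization of [0, 1]:

```python
import math

def unpack(i: int, u_min: float, u_max: float, n_u: int) -> float:
    # Equation (2.4): map an N_u-bit integer into [u_min, u_max].
    return (u_max - u_min) / (2 ** n_u - 1) * i + u_min

def pack(r: float, u_min: float, u_max: float, n_u: int) -> int:
    # Equation (2.6), with the rounding term sigma(r)/2 from (2.7).
    sigma = 1.0 if r >= 0 else -1.0
    return math.floor((r - u_min) / (u_max - u_min) * (2 ** n_u - 1) + sigma / 2)

def quantize(r: float, u_min: float, u_max: float, n_u: int) -> float:
    # Equation (2.8): round-trip through the integer representation.
    return unpack(pack(r, u_min, u_max, n_u), u_min, u_max, n_u)

# 8-bit quantization of [0, 1]; the round-trip error stays within eps/2.
eps = 1.0 / (2 ** 8 - 1)
assert abs(quantize(0.7, 0.0, 1.0, 8) - 0.7) <= eps / 2 + 1e-12
print(pack(0.5, 0.0, 1.0, 8))  # 128
```

On the GPU, the unpack step would typically run in a vertex program after fetching the integer from a vertex buffer; the sketch only mirrors the equations.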

**Adaptive Level-of-Precision**

The vertices of a triangle mesh can have all kinds of attributes: unit normal vectors, tangent vectors, colors, texture coordinates, etc., but there is one attribute that is used for sure: vertex positions. Hence, all triangle meshes benefit in terms of memory consumption when compressing positions.

Compressed positions are not exclusively useful for triangle meshes or other polyhedral representations. In point-based graphics [GP07], geometry is defined by a set of non-connected positions. Surfaces, such as Bézier patches, B-Spline surfaces [Far02], non-uniform rational B-splines (NURBS) [PT97], T-splines [SZBN03], subdivision surfaces [PR08], and algebraic surfaces [Sed85], use positions to represent *control points*.

The memory consumption of the positional data of the meshes used in this thesis amounts to one fourth of the total data. Therefore, a significant memory reduction of the positions immediately results in a considerable data reduction for the entire mesh.

In this chapter, we provide simple and efficient ways of reducing the memory consumption of vertex positions. The main idea is to quantize vertex positions according to their distance to the camera: the more distant the vertex positions are from the camera, the less precision and therefore the less memory they require. We adapt precision by adding or removing bits from vertex positions. To allow precision adaption in real-time, we present fast data-parallel algorithms. Moreover, our data structures for storing vertex positions allow random access, despite being compressed. However, a reduced precision may result in image errors. We analyze these visual artifacts and avoid them by constraining the quantization of *critical* vertices.

We achieve compression ratios for vertex positions from 2.5:1 up to 5:1. The reduced precision influences rendering quality and speed only little. Moreover, our techniques are easy to integrate into existing rendering algorithms.

**3.1 Introduction**

A common way of saving memory for triangle meshes is to use *level-of-detail (LOD)* methods. Mesh complexity is adapted by varying the number of vertices, edges, and triangles. Starting from a *base-mesh*, a series of meshes is created, each of which represents a different level of detail. A mesh from that series is called a *level of detail* or, more briefly, a *level*. A level is either coarsened or refined by removing or adding vertices, edges, and triangles. We say that a level is *low* (*high*) if it has few (many) vertices and therefore edges and triangles. For rendering, one level is selected. The advantage of using a lower level is not only reduced memory consumption. The reduced complexity results in faster rendering, too. However, a low level also has less detail, which degrades the final image quality.

The decision which level to choose must be made carefully and constitutes a crucial part of an LOD-method. It is a tradeoff between fidelity and performance. Mostly, the level is selected such that an observer is not able to distinguish the finest level from the lower level that is chosen for rendering.

A typical criterion is the distance from the mesh to the camera: the farther the mesh is from the camera, the coarser the level can be.

One important aspect of LOD methods is the transition from one level to the other. For example, when the mesh comes closer to the camera, the current level is no longer sufficient for the desired image quality. Then, we switch to the next finer level. However, this is an abrupt change of levels between two frames. This may result in so-called *popping artifacts*. The human eye is particularly sensitive to popping artifacts, which should therefore be avoided.

LOD-methods are classified into discrete and continuous methods. *Discrete LOD-methods* have a small, finite set of precomputed levels. Two successive LODs differ by a large number of vertices, edges, and triangles. In con-

[Figure 3.1 plots precision against the number of vertices, contrasting fine and coarse LODs (a) with fine and coarse LOPs (b).]

**Figure 3.1: Difference Between LOP- and LOD-Methods.** The gray box shows the set of potential vertices. (a) The green boxes show the subset of vertices used for a particular LOD. LOD-methods vary the number of vertices. (b) The blue boxes show the subset of vertices of a particular LOP. LOP-methods vary the numerical precision of the vertices.

trast, the difference between two *continuous LODs* is as low as a single vertex. While the former are usually simpler to implement, the latter suffer less from popping artifacts. An overview of various LOD methods is provided in the book by Luebke and co-authors [LRC∗03].

Figure 3.1a recaps the idea of LOD-methods: the gray box in Figure 3.1a shows the set of potential vertices that is used for all LODs of the mesh. From that set of potential vertices, each level singles out a subset, shown as green boxes in the figure. The width of the boxes correlates with the number of vertices. The abscissa represents the number of vertices used for a model, and the ordinate represents the precision of the vertex attributes. Their product is the memory usage. So far, LOD-methods do not change the numerical precision of the vertex attributes and alter the number of vertices only.

**3.1.1 Contributions**

We leverage this unexploited degree of freedom and vary the numerical precision of vertex positions to save memory. We adapt the *level-of-precision (LOP)*. Less precision results in lower memory consumption. The blue boxes in Figure 3.1b indicate the different LOPs. Their height reflects the numerical precision of the vertex attributes. A coarse LOP uses fewer bits than a fine LOP for the vertex attributes. We refer to the level of precision as *bit-level*.

Note that LOP is not an LOD-method. We study LOP-methods with a con- stant number of vertices. LOP-methods can, however, be combined with LOD methods to achieve additional memory savings.

LOP can be used for all vertex attributes, but we consider LOP for vertex positions only. While a lower precision uses less memory, rendering quality can suffer. We trade off memory usage against rendering quality by adapting the bit-level *interactively*: we keep only those bits in GPU memory that are required for the currently used precision.

We introduce two approaches building upon the LOP idea:

• *adaptive precision (AP)* and

• *constrained adaptive precision (CAP)*.

Both methods represent positions as uniformly quantized numbers (cf. Section 2.3.2). During rendering, we choose the bit-level such that it is as low as possible in order to reduce memory while maintaining a high image quality.

Our AP-method strives for *perfect coverage*. That means the pixels covered in the baseline image are the same as those covered in the LOP-image. With *baseline image*, we refer to the image generated with positions represented as single-precision floating-point numbers. An *LOP-image* is an image generated with a lower bit-level using AP or CAP. Perfect coverage is not generally obtainable and, in most cases, we get a *coverage error*. However, we can choose the bit-level such that the coverage error is below a fraction of a pixel.

Yet, AP can be prone to artifacts other than coverage errors, especially for low bit-levels. We observe that errors in the pixel color can occur. We analyze


**(a)** Adaptive Precision (AP) **(b)** Constrained Adaptive Precision (CAP)

**Figure 3.2: Difference Between AP and CAP.** While AP uses one common bit-level for all vertex positions, CAP restricts the bit-level of some vertices to a minimum bit-level.

these artifacts and find out that they are caused by wrong shading computations and erroneous depth values. We show that they can be removed if some vertices use a higher bit-level than others.

For triangle meshes that are prone to these artifacts, we propose to use constrained adaptive precision (CAP). Thereby, we constrain the bit-level of each vertex to a *minimum bit-level*. We present preprocessing algorithms for determining the minimum bit-level for the positions of a mesh.

To realize AP and CAP, we further make the following contributions:

• Changing the LOP is conducted by successively removing and adding low-order bits of positions. It is carried out directly on the GPU by fast data-parallel algorithms. Therefore, only little CPU-to-GPU communication is required.

• Positions are stored in a compact form, which allows for fast random access in a vertex program.
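The core bit-level operation, removing and re-appending low-order bits, can be sketched in a few lines of Python. This is a minimal CPU illustration of the idea, not the thesis's data-parallel GPU implementation; the function names are ours:

```python
def drop_low_bits(i: int, bits: int) -> int:
    # Coarsen the level-of-precision by removing `bits` low-order bits
    # of a quantized coordinate.
    return i >> bits

def restore_low_bits(coarse: int, dropped: int, bits: int) -> int:
    # Refine again by re-appending the previously removed low-order bits.
    return (coarse << bits) | dropped

x = 0b10110110                  # an 8-bit quantized coordinate
coarse = drop_low_bits(x, 3)    # keep only the 5 high-order bits
assert restore_low_bits(coarse, x & 0b111, 3) == x
```

In the actual scheme, the dropped bits would be kept in separate GPU buffers so that a finer bit-level can be restored without re-uploading the full positions.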

The difference between the two approaches is shown in Figure 3.2. AP adapts the LOP of all positions simultaneously: all positions use the same level of precision. CAP adapts the precision of all vertices, too. However, it accounts for the minimum bit-levels of the vertices.

During rendering, both of our approaches choose the bit-level based on a view-dependent criterion. That makes memory usage view-dependent, too.

In typical application scenarios, we observe compression ratios from 2.5:1

**(a)**Float **(b)**AP **(c)**CAP

**Figure 3.3: Positions Stored as Floats, AP, and CAP.** The bottom row shows a close-up of the radiator grill of the Car model. (a) With positions stored with floating-point precision, the image is rendered in 14.5 ms. (b) AP renders the Car in 9.8 ms and consumes 28 % of the memory of floating-point positions. AP is, however, prone to shading artifacts (see bottom row). (c) CAP does not suffer from the artifacts and needs only 38 % of the memory of floating-point positions. The image is rendered in 14.8 ms.

up to 5:1 over positions stored with single-precision oating-point numbers.

Moreover, positions compressed in our format deliver competitive real-time frame rates that are oftentimes higher than those achieved with floating-point positions. LOP-methods are powerful and effective. At the same time, they are easy to implement.

Figure 3.3 shows a result of our AP- and CAP-methods compared against positions stored as single-precision floating-point numbers. While AP delivers a coverage error of less than half a pixel and substantially compresses vertex positions, it is prone to shading artifacts, particularly for models with fine detail (b). We remove these artifacts using CAP (c) and obtain images that are visually indistinguishable from images generated with floating-point positions (a). At the same time, we still achieve significant memory savings.

**3.1.2 Overview**

This chapter is organized as follows: after reviewing prior art in Section 3.2, we introduce the concept of LOP in Section 3.3. Then, we present the LOP methods that handle coverage, shading, and depth errors in Sections 3.4 and 3.5. In Section 3.6, we introduce data-parallel algorithms for precision adaption and show how our data structures integrate into an OpenGL rendering pipeline. After presenting and discussing results (Section 3.7), we conclude with an outlook on further applications of LOP in Section 3.8.

**3.2 Previous Work**

We provide a brief overview of existing approaches that focus on compression of positional data. We limit the discussion to GPU methods as well as to contributions dealing with precision issues. Moreover, we discuss methods that take shading error considerations into account.

**Vertex Clustering**

*Multi-resolution* techniques are frequently applied to reduce geometric complexity. At vertex level, *vertex clustering* algorithms [RB93] partition the object's bounding geometry into uniform cells. Vertices contained in one cell are replaced by a single vertex representative whose position is subject to optimization [Lin00]. An efficient GPU implementation of this clustering exists [DT07], but it is merely used to speed up the generation of discrete LODs and is not used to reduce the memory footprint during rendering.

**Progressive Meshes**

A widespread multi-resolution approach is the progressive meshs (PMs) technique by Hoppe [Hop96]. Starting from a base-mesh, successive edge collapses reduce the number of vertices and triangles. is ne grain control