
Real-Time Geometry Decompression on Graphics Hardware

Echtzeit-Geometriedekompression auf Graphikhardware

Submitted to the Faculty of Engineering of the University of Erlangen-Nuremberg

for the degree of

Doktor-Ingenieur

by

Quirin Nikolaus Meyer

Erlangen, 2012

Faculty of Engineering, University of Erlangen-Nuremberg

Date of submission: 11.05.2012. Date of doctoral defense: 01.08.2012.

Dean: Prof. Dr.-Ing. habil. Marion Merklein
Reviewers: Prof. Dr.-Ing. Marc Stamminger,
Prof. Dr. rer. nat. Michael Guthe

©2012, Copyright Quirin Nikolaus Meyer. All rights reserved.

Abstract

Real-Time Computer Graphics focuses on generating images fast enough to cause the illusion of a continuous motion. It is used in science, engineering, computer games, image processing, and design. Special purpose graphics hardware, a so-called graphics processing unit (GPU), accelerates the image generation process substantially. Therefore, GPUs have become indispensable tools for Real-Time Computer Graphics.

The purpose of GPUs is to create two-dimensional (2D) images from three-dimensional (3D) geometry. Thereby, the 3D geometry resides in GPU memory. However, the ever increasing demand for more realistic images constantly pushes geometry memory consumption. This makes GPU memory a limiting resource in many Real-Time Computer Graphics applications. An effective way of fitting more geometry into GPU memory is to compress the geometry.

In this thesis, we introduce novel algorithms for compressing and decompressing geometry. We propose methods to compress and decompress 3D positions, 3D unit vectors, and the topology of triangle meshes. Thereby, we obtain compression ratios from 2:1 to 26:1. We focus on exploiting the high degree of parallelism available on GPUs for decompression. This allows our decompression techniques to run in real-time and impact rendering speed only little. At the same time, our techniques achieve high image quality: images generated from compressed geometry are visually indistinguishable from images generated from non-compressed geometry. Moreover, our methods are easy to combine with existing rendering techniques. Thereby, a wide range of applications may benefit from our results.

Zusammenfassung

Real-Time Computer Graphics is concerned with generating images fast enough to evoke the illusion of a continuous motion. It is applied in the natural sciences and engineering, in computer games, in image processing, and in design disciplines. Special purpose graphics hardware, referred to as a graphics processing unit (GPU), accelerates the image generation process considerably. For this reason, GPUs have become irreplaceable tools of Real-Time Computer Graphics.

GPUs serve to create two-dimensional (2D) images from three-dimensional (3D) geometry. Thereby, the 3D geometry resides in GPU memory. However, the steadily increasing demand for ever more realistic images also makes geometry memory consumption grow. GPU memory consequently becomes a scarce resource in many Real-Time Computer Graphics applications. An effective means of fitting more geometry into GPU memory is to store the geometry there in compressed form.

In this thesis, novel geometry compression and decompression algorithms are presented. The proposed methods serve to compress and decompress 3D positions, 3D unit vectors, and the topology of triangle meshes. Thereby, compression ratios from 2:1 to 26:1 are achieved. The focus of this work is the exploitation of the high degree of parallelism available on GPUs during decompression. As a result, the decompression methods run in real-time and affect rendering speed only marginally. At the same time, the generated images are of high quality: images generated from compressed geometry are visually indistinguishable from images based on non-compressed geometry. Moreover, the methods proposed here are easy to combine with existing rendering techniques, allowing a wide range of applications to benefit from the results of this work.

Acknowledgements

I would like to thank my advisor Prof. Dr.-Ing. Marc Stamminger for his inspiration, his patience, his trust, his understanding, his advice, his way of motivating people, and his ability to listen. Moreover, I thank Dr.-Ing. Gerd Sußner of Realtime Technology AG (RTT) for initiating and advising the cooperation between RTT and the Chair of Computer Science 9 (Computer Graphics) at the University of Erlangen-Nuremberg. This cooperation inspired many ideas found in this thesis. Therefore, I would like to express my gratitude towards the RTT staff supporting this cooperation, namely Peter Röhner and Cornelia Denk (now at BMW).

Further, I thank Prof. Dr. Günther Greiner for the fruitful discussions on all kinds of mathematical issues. Dr.-Ing. Jochen Süßmuth kindly provided his XeLaTeX template. I have to thank him even more for his invaluable advice.

During my thesis, I had the opportunity to publish with great researchers who happen to be wonderful persons at the same time. Besides Marc, Günther, Jochen, and Gerd, I thank Charles Loop, Ph.D. (Microsoft Research), Prof. Dr. Rolf Wanka (Chair of Computer Science 12) and Prof. Dr. Carsten Dachsbacher (then at University of Stuttgart), as well as Dr.-Ing. Christian Eisenacher, Ben Keinert, Magdalena Prus, Henry Schäfer, Fabian Schönfeld, and Christian Siegl (all from Chair of Computer Science 9).

The fantastic and overwhelming ambiance at our chair was created by my co-authors, as well as Elmar, Frank, Franz, Maddi, Marco, Mickie, Michael, Kai, Roberto, Maria, Manfred, and my officemates Matthias and Jan.

For providing models, I thank Stanford University (Bunny, Buddha, Armadillo, Dragon, Asian Dragon, Statuette), The Digital Michelangelo Project (David), Georgia Institute of Technology (Angel, Horse, Hand), Cyberware Inc. (Dinosaur, Igea, Isis, Rocker Arm), AIM@SHAPE (Neptun), Bangor University, UK (Welsh Dragon), and Cornelia Denk of BMW (Car).


I would like to thank my family, especially my brother Simon and my parents Gertie and Meinhart for encouraging and supporting me much longer than this thesis took. But most of all, I have to thank my girlfriend Linda: not only because she did all the proofreading, but for her love, strength, endurance, trust, and support.

— Quirin Meyer, Erlangen, August 2012

Contents

Acknowledgements
1 Introduction
1.1 Challenges and Benefits
1.2 Contributions and Overview
2 Background
2.1 Notation
2.2 Real-Time Rendering
2.2.1 Triangle Meshes
2.2.2 Lighting Models
2.2.3 Real-Time Rendering on Graphics Hardware
2.3 Number Formats for Real Numbers
2.3.1 Floating-Point Numbers
2.3.2 Uniformly Quantized Numbers
3 Adaptive Level-of-Precision
3.1 Introduction
3.1.1 Contributions
3.1.2 Overview
3.2 Previous Work
3.3 Level-of-Precision
3.3.1 Representing Positions
3.3.2 Bit-Level Adaption
3.3.3 Packed Buffers
3.3.4 Rendering Quality
3.4 Adaptive Precision
3.4.1 LOP Selection
3.4.2 Adaptive Precision Buffer
3.4.3 Crack Removal
3.5 Constrained Adaptive Precision
3.5.1 Shading Error
3.5.2 Depth Error
3.5.3 Coverage Error
3.5.4 Binned Adaptive Precision Buffers
3.6 GPU Implementation
3.6.1 Graphics Pipeline Integration
3.6.2 Bit-Level Adaption
3.6.3 Rendering Algorithm
3.7 Results
3.7.1 Quality
3.7.2 Memory Usage
3.7.3 Rendering Performance
3.7.4 Changing Bit-Levels
3.8 Conclusion and Future Work
4 Unit Vector Compression
4.1 Introduction
4.1.1 Motivation
4.1.2 Overview
4.1.3 Contributions
4.2 Discrete Sets of Unit Vectors
4.2.1 Maximum Angular Quantization Error
4.2.2 Properties of Discrete Sets of Unit Vectors
4.3 Lower Bounds for Unit Normal Vectors
4.3.1 Accuracy
4.3.2 Number
4.3.3 Conclusion
4.4 The Accuracy of Floating-Point Unit Vectors
4.4.1 Redundancies
4.4.2 Maximum Angular Quantization Error
4.4.3 Summary
4.5 Previous Work
4.5.1 Parameterization Methods
4.5.2 Subdivision Methods
4.5.3 Look-Up-Table Methods
4.5.4 Conclusion
4.6 Parameterization Methods
4.7 Parameter Quantization and Error Analysis
4.7.1 Domain Quantization
4.7.2 Range Quantization
4.7.3 Error Analysis of Domain Quantization
4.7.4 Error Analysis of Range Quantization
4.7.5 Compactness Factor
4.8 Results
4.8.1 Quantization Errors
4.8.2 Bit-Budget of 48 Bits
4.8.3 Validation of Errors
4.8.4 Comparison with FPUVs
4.8.5 Quality
4.8.6 Timings
4.8.7 GPU Decompression of Per-Vertex Unit Vectors
4.8.8 Surplus Bits
4.9 Conclusion
4.10 Further Applications
5 Triangle Mesh Topology Compression
5.1 Introduction
5.1.1 Contributions
5.1.2 Overview
5.2 Related Work
5.3 Generalized Triangle Strips
5.3.1 Creating Belts and Stripification
5.3.2 Vertex Order
5.3.3 Compact Representation of Generalized Triangle Strips
5.4 Decompression of Generalized Triangle Strips
5.4.1 Data-Parallel Scan
5.4.2 Data-Parallel Algorithm
5.4.3 Restart Emulation
5.4.4 Alternative Decompression Approaches
5.5 Incremental Vertices
5.6 Data-Parallel Word-Aligned Code
5.6.1 Simple-9 Codes
5.6.2 Data-Parallel Decompression of Simple-9 Codes
5.6.3 Alternatives to Simple-9
5.7 Decompression Pipeline
5.8 Results and Discussion
5.8.1 Compression Rate
5.8.2 Decompression Speed
5.8.3 Impact of Vertex and Triangle Order
5.8.4 Runtime Memory Consumption
5.8.5 Alternative Stripification Methods
5.8.6 Application Fields
5.8.7 Comparison
5.9 Conclusion and Future Work
6 Conclusion and Outlook
Bibliography

List of Figures

2.1 Explicit Representation of a Triangle Mesh
2.2 OpenGL Graphics Pipeline
2.3 Distribution of Mini-Floats
3.1 Difference Between LOP- and LOD-Methods
3.2 Difference Between AP and CAP
3.3 Positions Stored as Floats, AP, and CAP
3.4 Packed and Unpacked Buffers
3.5 Adaptive Precision Example
3.6 Object Space Error to Camera Space Error
3.7 Adaptive Precision Summary
3.8 Crack Removal
3.9 Shading Artifacts Caused by Altering Positions
3.10 Shading Error due to Wrong Normal Vectors
3.11 Per-Vertex Minimum Bit-Level
3.12 Cause of Depth Errors
3.13 Removing Depth Artifacts
3.14 Constrained Adaptive Precision
3.15 Binned Adaptive Precision Buffer
3.16 Bit-Levels of a Binned Adaptive Precision Buffer
3.17 Data-Parallel Unpacking and Packing
3.18 Quality Results of the David Model
3.19 Quality Results of the Car Model
3.20 Popping Artifacts
3.21 Rotation Motion Timings
3.22 Dolly Motion Timings
3.23 Combining LOD and LOP
4.1 Non-Uniformly vs. Uniformly Distributed Unit Vectors
4.2 Artifacts Caused by Poorly Quantized Unit Vectors
4.3 Cause of the Artifacts with Poorly Quantized Unit Vectors
4.4 Floating-Point Unit Vectors Distribution
4.5 Rotation of FPUVs
4.6 Naming Conventions of the Floating-Point Grid
4.7 2D FPUVs Angular Quantization Error
4.8 Example of the Vertical Motion
4.9 Example of the Horizontal Motion
4.10 Special Case of the Diagonal Motion
4.11 Unfolding an Octahedron
4.12 Warped Octahedron
4.13 Cut-off Regions of Parallel Projections
4.14 Domain Quantization
4.15 Range Quantization
4.16 Voronoi Regions of Range Quantization
4.17 Shading Results at a Bit-Budget of 17 Bits
4.18 Angular Error at a Bit-Budget of 17 Bits
4.19 Comparing WCP and OP with a Bit-Budget of 16 Bits
4.20 Comparison Between FPUVs and OP Unit Vectors
5.1 Welsh Dragon
5.2 Overview of our Codec
5.3 Belt Traverser
5.4 Strip-Codes
5.5 Generalized Triangle Strip
5.6 M_Scan and Q_Scan
5.7 T_Kernel
5.8 Data-Parallel Unpacking of Incremental Vertices
5.9 Simple-9 Compression Example
5.10 S_Scan
5.11 D_Scatter
5.12 Decompression Pipeline
5.13 Test Meshes
5.14 Head of Michelangelo's David

List of Tables

2.1 Binary Powers
2.2 Floating-Point Number Formats
3.1 Details of the David and the Car Data Set
3.2 AP and CAP Memory Usage
3.3 AP and CAP Rendering Performance
4.1 Maximum Angular Quantization Error of Floats
4.2 Summary of Parameterization Methods
4.3 Parameters and Unit Vectors of Maximum Error
4.4 Maximum Error and Compactness Factors
4.5 Maximum Error Example
4.6 Maximum Error Experiment
4.7 Mixed Precisions Experiment
4.8 Bits To Maintain FPUV-Error
4.9 CPU Compression and Decompression Timings
4.10 Surplus Bits of Various Parameterization Methods
5.1 Simple-9 Selectors
5.2 Mesh Details
5.3 Compression Rates and Decompression Performance
5.4 Compression Rate Details
5.5 Detailed Decompression Timings
5.6 Rendering Times with our Layout and OpenCCL
5.7 Mesh Details of David's Head
5.8 Timings and Memory Consumption of David's Head


Introduction

In real-time rendering, image sequences are created quickly enough to evoke the illusion of a continuous motion, and the image generation process can instantaneously respond to external influences. The research domain concerned with this task is Real-Time Computer Graphics. It has become an irreplaceable part of many application fields, such as computer games, image processing, scientific visualization, design, and engineering. Almost inseparably linked with Real-Time Computer Graphics is special purpose hardware, so-called GPUs: GPUs significantly speed up the image generation process and oftentimes make real-time image generation possible in the first place.

GPUs are almost as ubiquitous as Real-Time Computer Graphics. Wherever there is Real-Time Computer Graphics, there is virtually always a GPU. They are an integral part of a wide range of devices, such as smartphones, tablets, laptops, workstations, clusters, and supercomputers. A GPU may be very powerful for specific tasks and surpass the performance of traditional central processing units (CPUs) by orders of magnitude.

Despite the impressive power of GPUs, their resources are limited. Depending on the application, the computational speed may be too low, the memory bandwidth too small, or GPU memory may not suffice to fit the necessary data. The reason for the shortage is that the expectations placed on GPUs are constantly pushed. With faster and more accurate sensors and simulations, the amount of data that medical and scientific visualization applications need to process skyrockets. Designers and engineers are steadily increasing their expectations of detail and precision. The ever growing demand for realism raises the necessity for more finely resolved data. All of this results in a steady growth of GPU memory requirements. Data compression is an appropriate and effective means of counteracting memory space shortage.

In GPU memory, we place all information required for rendering an image, such as geometric data and textures, amongst other less space-consuming data. Driven by their importance in computer games, compressed textures are supported by modern GPUs. Textures are compressed lossily, which means the original data is not equal to the data extracted from the compressed data. The ratio of the uncompressed over the compressed data size ranges from 2:1 to 6:1. Further, random access to compressed textures comes at no extra performance cost over uncompressed textures.

While in computer games the major cause of memory shortage is textures, visualization and engineering applications need to process large amounts of geometric data in the form of triangle meshes. These triangle meshes originate from highly detailed construction, sensor, or simulation data and can easily contain several millions of triangles and vertices. As a result, GPU memory consumed by triangle meshes becomes the limiting factor. Therefore, it is necessary to come up with solutions that reduce the size of geometric data.

1.1 Challenges and Benefits

In contrast to on-the-fly texture decompression, built-in GPU triangle-mesh decompression does not exist. In this thesis, we fill this gap and propose several decompression methods targeting different subsets of geometric data.

There are three major challenges:

Parallelization: Although geometry compression and decompression is a well-researched subject, the available algorithms are not suited for GPUs. This is because GPUs achieve their high performance through parallelization of the deployed rendering algorithms. However, geometry decompression algorithms are inherently sequential. As a consequence, they would perform poorly on GPUs and real-time rendering would no longer be possible. Therefore, we need to design new decompression methods that run in parallel on a GPU.


Real-Time: Besides reducing memory consumption, the proposed methods may only have little impact on rendering performance. This precludes methods with poor decompression speed, even if they achieve high compression ratios.

Rendering Quality: For the application scenarios we target, lossy compression is tolerated only if it does not degrade image quality. That means that for the attributes associated with each vertex, such as positions or normal vectors, lossy compression is acceptable only up to a certain degree. Most importantly, the topology, i.e., the way triangles are connected, may not change. That rules out solutions that reduce the number of triangles or vertices in a pre-process.

Parallel geometry decompression allows placing more data in GPU memory. Moreover, it speeds up chronically slow CPU-to-GPU memory transfers or even makes them entirely dispensable. It further accommodates two ongoing developments in computer design:

• The computational power increases while data bandwidth cannot keep pace. Processors can only perform instructions as long as they have the necessary data at hand. Otherwise they run idle. Transmitting compressed data effectively increases the bandwidth and helps keep the processor busy.

• The computational power increases because more and more processors are assembled onto a single chip. However, their individual speed increases only little. Sequential algorithms, as used in traditional geometry decompression, are not able to profit from this development.

Therefore, parallel decompression algorithms are needed to account for these two trends.


1.2 Contributions and Overview

Triangle meshes have been the workhorse of Real-Time Computer Graphics for several decades. They consist of topological information that connects vertices to form triangles. With each vertex, we associate a 3D position. The majority of triangle meshes additionally store with each vertex one 3D unit vector that represents the surface normal at that vertex.

We contribute methods that compress positions, unit normal vectors, and topology independently from each other. The methods can be combined with each other and are simple to integrate into existing rendering algorithms. Therefore, our methods help a wide range of real-time rendering applications reduce their greed for memory.

In Chapter 3, we introduce level-of-precision (LOP), a novel approach for compactly representing vertex positions in GPU memory. Thereby, we adapt the precision of vertex positions based on a view-dependent criterion. GPU memory for vertex positions is reduced by only storing the currently necessary precision. Our compact representations enable fast random access from within a vertex program. Once the view-point changes, we adapt a model's vertex precision. Thanks to our new data-parallel algorithms that run entirely on the GPU, precision is adapted faster than the model is rendered. Furthermore, we allow locally refining vertex position precision to avoid artifacts that would otherwise occur when using reduced precision. The algorithms have been published in [MSGS11].

In Chapter 4, we analyze unit vector representations that are used to encode per-vertex normal vectors. The most common one uses three floating-point numbers. To our knowledge, we are the first to derive the error of this representation. Based on our error analysis, we compare existing unit vector compression methods and study their ability to match this error using less memory. We find that a parameterization based on projecting unit vectors onto the surface of an octahedron performs best. Unit vectors represented with 52 bits in this representation are more than sufficient to achieve the accuracy of three floating-point numbers using 96 bits. We further show that any other unit vector representation saves at most 1.14 bits over the octahedron projection. Our compact unit vector representation is decompressed on the GPU at no extra cost. Some of the findings have also been published in [MSS10].

In Chapter 5, we present a lossless triangle-mesh topology compression scheme. It is designed to leverage GPU data parallelism during decompression. We order triangles coherently to form generalized triangle strips. We unpack generalized triangle strips efficiently, using a novel data-parallel algorithm. Additional compression benefits come from a variable bit-length code, for which we propose a data-parallel decompression algorithm. While uncompressed triangles require 96 bits per triangle, we obtain 3.7 to 7.6 bits per triangle. Thanks to the high degree of parallelism, our decompression algorithm is extremely fast and achieves up to 1.7 billion triangles per second. At the time of finishing this thesis, a paper describing the algorithm has been submitted to a journal [MKSS12].

We provide background information required for this thesis in Chapter 2. In Chapters 3 – 5, we thoroughly describe our contributions. Finally, we conclude the thesis with an outlook on future work (cf. Chapter 6).


Background

Our goal is to combine geometry decompression with real-time rendering. Both fields are well-established education and research domains that draw upon a vast body of literature. In this chapter, we single out the fundamental topics that are most important for this thesis. We provide a concise background and explain concepts that reoccur constantly throughout the main chapters.

In Section 2.2, we summarize elementary concepts of real-time rendering. This includes triangle-mesh representation and its memory consumption, lighting models, the graphics pipeline, and its hardware implementation. Large parts of geometry are defined by real numbers. To compress real numbers efficiently, a solid understanding of the formats approximating real numbers on a computer is given in Section 2.3. But first, we establish some notation.

2.1 Notation

We comprehensively adhere to the following conventions:

• Scalar numbers are written in italics, e.g., x, ΔQ, and so forth.

• Matrices and vectors are written in bold face, e.g., v. To refer to the components of a vector, we use the notation common in C-style programming languages for structure record types (i.e., struct): for example, to access the x component of a vector v, we use v.x. Throughout the thesis, we mostly deal with column vectors, e.g., v = (v.x, v.y, v.z)^T.

• The inner product of two vectors a and b is denoted by angle brackets, i.e., ⟨a, b⟩.


Symbol   Ki       Mi       Gi       Ti       Pi
Factor   exp(10)  exp(20)  exp(30)  exp(40)  exp(50)

Table 2.1: Binary Powers. The abbreviations shown above represent high powers of two.

• Vertical bars ‖·‖_p indicate vector and matrix norms. The index p specifies the particular norm, e.g., p = 2 for the Euclidean vector norm.

• Elements of finite sets and arrays are indexed using the array notation of C-style programming languages. For a set we write S = {S[0], S[1], S[2], . . .} and for an array A = (A[0], A[1], A[2], . . .). Note that indices start with 0. Sets or arrays with scalar elements are written in italics. Matrix and vector element types are written bold-faced.

• Intervals are denoted by brackets and parentheses depending on whether they are closed or open. For example, [a, b] is a closed interval from a through b, and [a, b) is an interval that is closed at the a-end and open at the b-end.
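The C-style access and indexing conventions above can be illustrated in a short sketch (illustrative only; the type and variable names are ours, not the thesis's):

```python
# Illustrative only: the C-style component access (v.x) and zero-based
# indexing (A[0]) conventions described above, sketched in Python.
from dataclasses import dataclass

@dataclass
class Vec3:
    x: float
    y: float
    z: float

v = Vec3(1.0, 2.0, 3.0)
component = v.x            # access the x component, as in "v.x"

A = [Vec3(0.0, 0.0, 1.0), Vec3(0.0, 1.0, 0.0)]
first = A[0]               # indices start at 0: A[0] is the first element
```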

We oen use powers of two, which are usually represented by superscripts, i.e., N. is may, however, become cumbersome to read, particularly in the presence of multiple levels of superscripts, as for example NM. It gets even more complicated, once superscripts themselves contain subscripts, e.g., NMu. In other disciplines, such as mathematics or physics, the same problem occurs when dealing with the natural exponential functionex. It is oentimes replaced byexp(x) := ex. We adopt this convention and de ne power-of-two as a function

exp:x7→x. Its inverse function is the base-two logarithmlogy.

When dealing with compression and decompression techniques, it is in- evitable to provide data size information. e unit forbytesis abbreviated using the letterB. We abbreviate large sizes withbinary pre xes. ey are de- rived from powers of two. For example,  B=exp() = KiB, where


Figure 2.1: Explicit Representation of a Triangle Mesh. The explicit representation is an array T. Each element T[i] consists of three indices.

Ki abbreviates kilobinary; see Table 2.1 for a list of binary powers. For powers of 10, we use the SI prefixes, e.g., M for 10^6, G for 10^9, and so forth.

We use compression ratios (e.g., 3:1 or 5.5:1) to relate the size of uncompressed data to compressed data. The higher a compression ratio, the smaller the compressed data. We also use compression rates (e.g., bits per triangle (bpt), bits per unit vector) to break down the benefit of compression to individual elements. The lower the compression rate, the better. To measure decompression speed, we use decompression rates (e.g., triangles per second, unit vectors per second). The higher the decompression rate, the better.
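A short worked example of the three measures (all sizes and timings below are hypothetical, chosen only to illustrate the definitions):

```python
# Hypothetical mesh: 1,000,000 triangles, uncompressed topology at
# 12 B per triangle (three 32-bit indices), compressed to 500,000 B.

num_triangles = 1_000_000
uncompressed_bytes = num_triangles * 12
compressed_bytes = 500_000

# compression ratio: uncompressed over compressed size (higher is better)
ratio = uncompressed_bytes / compressed_bytes        # 24.0, i.e. 24:1

# compression rate: bits per triangle (lower is better)
bpt = compressed_bytes * 8 / num_triangles           # 4.0 bpt

# decompression rate: triangles per second (higher is better)
decode_seconds = 0.002                               # hypothetical timing
tps = num_triangles / decode_seconds                 # 5.0e8 triangles/s
```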

2.2 Real-Time Rendering

In real-time rendering, the time for generating an image (also called a frame) may not surpass 67 ms [AMHH08]. The image is generated from a 3D description of the geometry, which is typically represented by triangle meshes (cf. Section 2.2.1). To produce appealing images, the geometry is exposed to physically motivated lighting simulations (cf. Section 2.2.2). Real-time rendering is possible with an optimized graphics pipeline implemented on graphics hardware (cf. Section 2.2.3).

2.2.1 Triangle Meshes

In real-time rendering, triangle meshes are the most common form of representing geometry. A triangle mesh is a set of triangles. A triangle itself consists of three different vertices. The vertices of a triangle are connected through edges. Triangles are connected by sharing common vertices.

Each vertex has one or more vertex attributes, or attributes for short. Common types of attributes include the position of the vertex in 3D space, a 3D unit normal vector, 2D texture coordinates, 3D tangent and bi-tangent vectors, and color values. This list is, of course, incomplete. Any kind of attribute imaginable can be associated with a vertex.

Triangle meshes are often represented in what we refer to as the explicit representation. This representation is also referred to as triangle-vertex representation or indexed triangle set. In the explicit representation, we store the vertex attributes in so-called vertex arrays. Vertex arrays that belong to the same triangle mesh all have N_V elements, where N_V is the number of vertices.

In the array of triangles T, we store for each triangle T[i] three vertex indices T[i].v0, T[i].v1, and T[i].v2. Thereby, T[i].v0, T[i].v1, and T[i].v2 reference elements of the vertex arrays. The array consists of N_T elements, where N_T is the number of triangles of the mesh. An example of an array of triangles and the associated topology is shown in Figure 2.1.

Note that the order of the triangles in the array T is not important for the final appearance of the model. We can therefore order the triangles any way we want. The order of the vertices of one triangle can also be interchanged, but there is one restriction: the orientation must remain, as many algorithms rely on a distinct order. A triangle, represented by a row vector of three vertex indices (v0, v1, v2), does not change its orientation if we use the permutations (v1, v2, v0) and (v2, v0, v1), i.e., we "rotate" the vertex indices. Other permutations change the orientation and are therefore not allowed.

The order of the elements in the attribute arrays does not affect the appearance of the model, either. When reordering the elements in the attribute arrays, we have to adjust the vertex indices of T to point to the right attributes.
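The orientation rule can be made concrete in a small sketch (the function names are illustrative, not from the thesis):

```python
# A triangle keeps its orientation under the two cyclic "rotations" of its
# vertex indices; any other permutation flips it. Names are illustrative.

def rotations(tri):
    v0, v1, v2 = tri
    return {(v0, v1, v2), (v1, v2, v0), (v2, v0, v1)}

def same_orientation(a, b):
    # b describes the same oriented triangle as a iff b is a rotation of a
    return tuple(b) in rotations(tuple(a))

keeps = same_orientation((4, 9, 1), (9, 1, 4))   # cyclic rotation
flips = same_orientation((4, 9, 1), (9, 4, 1))   # transposition
```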


Memory Consumption

The types of models that we use in this thesis consist of a triangle array, an array of vertex positions, and an array of vertex unit normal vectors. Typically, three integers are used to store a triangle, i.e., 12 B. A vertex position and a unit normal vector use three single-precision floating-point numbers, i.e., 12 B, each. The memory consumption of an uncompressed mesh is

K = N_T · 12 B + N_V · 2 · 12 B.

For large meshes, N_T ≈ 2 · N_V; therefore,

K ≈ 24 · N_T B.

Hence, vertex positions and unit normal vectors comprise about 25 % of the size of the triangle mesh, each. The triangle array consumes the remaining portion of about 50 %.
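As a quick check of this estimate, a minimal sketch (the 12-byte sizes are those given above; the function name is illustrative):

```python
# K = N_T * 12 B + N_V * 2 * 12 B, as derived above.

def mesh_bytes(num_triangles, num_vertices):
    # 12 B per triangle (three 4-byte indices); 12 B each for a position
    # and a unit normal vector per vertex
    return num_triangles * 12 + num_vertices * 2 * 12

# For large meshes N_T is roughly 2 * N_V, so K is roughly 24 * N_T bytes:
n_t = 1_000_000
n_v = n_t // 2
k = mesh_bytes(n_t, n_v)

triangle_share = n_t * 12 / k   # about 50 % for the triangle array
vertex_share = n_v * 12 / k     # about 25 % each, positions and normals
```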

2.2.2 Lighting Models

In order to produce realistic images, we compute the amount of outgoing color intensity at a point on the surface of a triangle mesh using lighting models. Lighting models are expressed in terms of shading equations. For each color channel (typically red, green, and blue), we compute the intensity values independently.

One common lighting model is the diffuse lighting model. It is also referred to as Lambertian reflectance or Lambert's cosine law. It reoccurs in many other lighting models as part of their diffuse term. The amount of outgoing diffuse light at a point p on a surface is

f_Diffuse = I_D · max(⟨n, l⟩, 0).   (2.1)

The vectors n and l have unit length, i.e., ‖n‖ = ‖l‖ = 1. l points towards the light source and n is the unit normal vector at p. I_D combines the diffuse reflectance characteristics of the material at the point p and the amount of light emanating from the light source for one color channel.


The Blinn-Phong lighting model [Bli77] builds upon diffuse reflection. It adds specular highlights to the surface, i.e.,

f_Blinn = f_Diffuse + I_S · ⟨n, h⟩^s,    (2.2)

where s is the specular exponent. The halfway-vector h is halfway between l and the unit vector v pointing towards the camera, i.e.,

h = (v + l) / ‖v + l‖.

Similar to I_D, I_S is the specular color that emanates from the surface. It combines parameters from the light source with material properties.

These two lighting models are commonly used in real-time rendering and are used to generate plausible images in numerous rendering applications.

For more lighting models, we refer to the relevant literature [SAG05, AMHH08, PH10].

2.2.3 Real-Time Rendering on Graphics Hardware

Through the course of the previous decades, an algorithm called rasterization has very successfully been deployed in the field of real-time rendering. Thereby, 3D geometry, typically represented as triangle meshes, is projected triangle after triangle onto a 2D image plane and then converted to a raster image. The reason for the success of rasterization in the field of real-time rendering over other image generation algorithms, such as ray tracing [Whi80] or micro-polygon pipelines [CCC87], is that special purpose hardware, so-called GPUs, dramatically speeds up the image generation process.

As the methods presented in this thesis are in the domain of real-time rendering on graphics hardware, we summarize how the image generation process applied for rasterization is decomposed into several steps, the so-called graphics pipeline. We briefly explain how contemporary graphics hardware used for real-time rendering works and how it can be used for more general purpose tasks.


Figure 2.2: OpenGL Graphics Pipeline. The graphics pipeline is decomposed into programmable stages (green boxes) and fixed-function stages (red boxes). It uses input and output buffers located in GPU memory (orange boxes).

Graphics Pipeline

Rasterization is decomposed into several steps, forming the so-called graphics pipeline. Common inputs to the pipeline are 3D geometry, lighting models, and materials. The pipeline's output is a 2D image.

The pipeline steps are standardized and they are accessed by the programmer through application programming interfaces (APIs), such as OpenGL [SA11] and Direct3D [Mic10]. The implementations developed during the course of this thesis make use of OpenGL 4.0 – 4.2. An overview of the pipeline stages that are important for this thesis is shown in Figure 2.2.

The pipeline has programmable stages, shown as green boxes, and fixed-function stages, shown by red boxes. Input and output data is stored in GPU memory, drawn with orange boxes.


In this thesis, we make use of the vertex and fragment shader stages. The other programmable stages, i.e., the geometry shader, tessellation control shader, and tessellation evaluation shader stages, are not used in this thesis and therefore not shown in Figure 2.2. However, algorithms using these stages can also incorporate methods developed in this thesis. Programs for the programmable stages are implemented using shading languages. We use the OpenGL shading language (GLSL) [Kes11] to specify vertex and fragment programs.

We specify geometry using the explicit representation. We upload the triangle array T into an index buffer and the attribute arrays to vertex buffers.

Besides triangles, other primitives such as points or lines are supported, too.

The set of attributes associated with each vertex serves as input to the vertex shader stage. There, a vertex program is executed independently for each vertex. A vertex program outputs an arbitrary set of per-vertex attributes. The vertex shader stage is fully programmable using GLSL. Most vertex programs transform per-vertex positions and unit normal vectors specified in local object space coordinates to world space or camera space. Moreover, each vertex position is transformed to clip space. Clip space is a vector space that graphics hardware uses to determine the 2D fragment location and its depth value in screen space.

The vertex attributes output by the vertex shader stage are assembled into primitives, e.g., triangles in our case, by the primitive assembly stage. The programmer can optionally write the transformed primitives to a vertex buffer using the stream output stage. In most cases, the primitives are handed over to the rasterizer stage. There, each primitive is rasterized, i.e., it is converted to a set of fragments in screen space. For each fragment, the rasterizer interpolates the attributes output by the vertex shader stage. When activating stream output, the program may prevent the pipeline from executing the rasterizer and all subsequent stages. This is useful when streamed-out vertex buffers are used as input for subsequent rendering passes.

The interpolated attributes are used as input to the fragment shader stage. Like the vertex shader stage, the fragment shader stage is fully programmable with GLSL using fragment programs. A fragment program is executed for each fragment individually and it outputs the fragment's final color and depth value (i.e., the distance from the fragment to the camera).

Most of our rendering algorithms use vertex programs to compute camera-space coordinates for unit normal vectors and positions as per-vertex attributes. The interpolated attributes are used in a fragment program to evaluate shading equations.

The fragments computed in the fragment shader stage are combined with results that are already in the frame buffer. The most important operation is depth testing. For each pixel in the frame buffer, the depth value of the fragment closest to the camera is stored in the depth buffer (or z-buffer). If a new fragment's depth value is closer to the camera than the existing depth value, the old pixel color and depth value are updated.

Graphics Hardware and Compute APIs

Both fixed-function stages and programmable stages of the graphics pipeline are implemented efficiently on GPUs. On modern GPUs, the programmable stages operate in a data-parallel way: A single instruction is executed on multiple data elements simultaneously. This is also referred to as the single instruction, multiple data (SIMD) paradigm.

For example, as the vertex shader stage transforms vertices independently from each other, all vertices can be transformed in parallel. The same vertex program is executed on different vertices. Likewise, the fragment shader stage processes all fragments in parallel. Each incarnation of a vertex or fragment program that operates on different data is called a thread. Shading language program threads are mapped to the computational cores of a GPU transparently to the programmer.

A GPU has significantly more computational cores than a CPU. For example, an Nvidia GeForce GTX 580 consists of 512 cores, whereas CPUs typically have two to eight cores. That is why GPUs are referred to as many-core architectures and CPUs as multi-core architectures.

For data located on the GPU, very high memory throughput rates are possible (up to 192.4 GB per second), and the computational cores can reach a theoretical peak performance of 1581 billion single-precision floating-point operations per second. In contrast, current 2nd generation Intel Core i7 CPUs with six cores running at 2.8 GHz obtain a bandwidth of 32 GB per second and a theoretical peak performance of 134.4 billion single-precision floating-point operations per second. At first sight, this seems a tremendous performance advantage for GPUs. It should be noted that GPUs only achieve such high performance rates for data-parallel algorithms. Not every problem can be recast in a data-parallel fashion. In general, sequential tasks perform much better on CPUs.

There are algorithms other than those realized with the rendering pipeline that can be expressed in a data-parallel fashion. These algorithms can also benefit from the computational power of GPUs. That is why the functionality of GPUs is also exposed through more general purpose data-parallel compute APIs, such as OpenCL [Mun11] or Nvidia's compute unified device architecture (CUDA) [Nvi11b]. We use CUDA for this thesis (cf. Chapter 5).

Instead of vertex or fragment programs, we specify programs called kernels. The kernel instructions are executed by spawning threads. From a programmer's perspective, each thread maps onto one computational core of a GPU and the threads are executed in parallel.

2.3 Number Formats for Real Numbers

In this section, we provide a brief overview of floating-point numbers defined in the IEEE 754-2008 standard [IEE08]. We focus on how floats are stored and what real numbers they are able to represent. Based on the deeper understanding of floating-point numbers provided in Section 2.3.1, it will turn out during the course of this thesis that uniformly quantized numbers (cf. Section 2.3.2) are more appropriate for transmitting vertex attributes. However, we use floating-point numbers for computations, as they are widely supported by both GPUs and CPUs.

(35)

2.3.1 Floating-Point Numbers

The IEEE 754-2008 standard specifies several floating-point formats which are widely supported on today's computer architectures. We limit the explanation to the normalized binary floating-point format, as it is predominantly used on GPUs [Bly06, SA11]. Further, we exclude topics such as converting real values to floats, rounding modes, and arithmetic precision of floating-point operations, and refer the reader to advanced literature on floating-point number systems [Gol91, Knu97, MBdD10].

A normalized binary floating-point number is a triple (s, e, m) with

• a sign bit s ∈ {0, 1}. If s = 0, the number is positive;

• an exponent e with Ne bits, where e ∈ [−2^(Ne−1) + 1, ..., 2^(Ne−1)];

• a mantissa m with Nm bits, where m ∈ [0, ..., 2^Nm − 1].

We convert a normalized binary floating-point number to a real number with the following mapping:

(s, e, m) ↦ (−1)^s · 2^e · (1 + m / 2^Nm),

for −2^(Ne−1) + 1 < e < 2^(Ne−1).

We want to clarify the term normalized binary floating-point. The binary is due to the radix 2 as the base of the exponent. The word normalized is because of the term 1 + located left of the fraction. Normalization makes floating-point numbers unique, i.e., a real number that is representable by a normalized binary floating-point number maps to exactly one unique triple (s, e, m). From now on, the term float abbreviates normalized binary floating-point number.

A zero is encoded by the special triple (s, −2^(Ne−1) + 1, 0). A triple (s, −2^(Ne−1) + 1, m) with m ≠ 0 is a so-called subnormal number. The standard defines a special mapping for them; however, subnormal numbers play no role in this thesis, as GLSL flushes them to zero [Kes11].
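To make the mapping concrete, a small sketch (the helper name is ours) decodes a triple (s, e, m) for an arbitrary mantissa width Nm; with Nm = 23 it reproduces familiar single-precision values.

```c
#include <assert.h>
#include <math.h>

/* Decode a normalized binary floating-point triple (s, e, m) with an
 * Nm-bit mantissa, following the mapping
 * (s, e, m) -> (-1)^s * 2^e * (1 + m / 2^Nm). */
static double fp_value(int s, int e, long m, int nm)
{
    double frac = 1.0 + (double)m * ldexp(1.0, -nm);  /* 1 + m / 2^Nm */
    return (s ? -1.0 : 1.0) * ldexp(frac, e);         /* * (-1)^s * 2^e */
}
```

For example, fp_value(0, 0, 0, 23) decodes to 1.0 and fp_value(1, 1, 0, 23) to −2.0; the same helper with nm = 3 reproduces the mini-float values discussed below.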


Format   Ne   Nm   Total   fmin            fmax
Mini      3    3       7   2.5 · 10^−1     1.5 · 10^1
Half      5   10      16   6.1 · 10^−5     6.6 · 10^4
Single    8   23      32   1.2 · 10^−38    3.4 · 10^38
Double   11   52      64   2.2 · 10^−308   1.8 · 10^308

Table 2.2: Floating-Point Number Formats. Three formats (Half, Single, and Double) are supported by current graphics hardware. The format named Mini is introduced for explanatory purposes. Ne and Nm are the number of bits for the exponent and the mantissa, respectively. Total is the total number of bits including the sign bit. The values fmin and fmax are the smallest and largest finite positive values the format can represent.

Figure 2.3: Distribution of Mini-Floats. The horizontal line is the real line. All possible mini-floats within the plotted range are marked by vertical lines. The thick lines highlight the boundaries of intervals with common sample densities. Note the non-uniform distribution of the floating-point numbers.

Accuracy and range of floats depend on the number of bits spent for mantissa and exponent. From the definition, we directly derive the smallest positive float,

fmin = 2^(−2^(Ne−1) + 2),

and the largest finite float, respectively,

fmax = (2 − 2^(−Nm)) · 2^(2^(Ne−1) − 1).

The smallest finite float is −fmax and the largest negative normalized float is −fmin. Table 2.2 lists the formats Half, Single, and Double that are currently supported by GPUs [SA11].

The table further contains the Mini float format that we use for explanation purposes, but it has no practical relevance. It has a sign bit, three bits for the mantissa, and three bits for the exponent. Figure 2.3 shows the distribution of the mini-floats over part of the real line. The distribution is similar for other normalized binary floating-point formats. Every vertical line corresponds to a float. In between two floats there are an infinite number of real values. The most important feature that can be seen from the figure is that floats are distributed non-uniformly across the real line: In the interval [2^i, 2^(i+1)) there are as many floats as in the interval [2^(i+1), 2^(i+2)), which covers a segment of twice the length. For example, the interval [1/2, 1) contains eight floats along a range of 1/2. The same number of floats is also in the interval [1, 2), which is twice as long. In general, in the interval [2^i, 2^(i+1)] the space between two floats is constant, i.e.,

ε_float,i = 2^(i − Nm).    (2.3)

Hence, the closer we approach zero, the higher the sample density becomes. Likewise, the resolution of floats coarsens towards |fmax|.

Also note the gap between −fmin and fmin in the mini-float format of Figure 2.3, with only one sample in between, i.e., 0. The sampling rate increases the closer we approach zero, but after reaching |fmin|, it drops abruptly by a factor of 2^Nm and leaves a large gap. A similar gap exists for all other floating-point formats. This gap may be filled with subnormal numbers, however, not by GLSL, which does not support subnormal numbers.

For most rendering tasks, single precision is considered to be sufficient. With the advent of Direct3D 10 [Bly06], GPUs adhere to the IEEE 754 standard. GLSL uses the standard [Kes11], too. Double precision is available; however, single-precision performance is typically two [Nvi11b] to five times [AMD11] faster on current hardware. In GLSL, half precision is not natively supported as a data type. Input attributes to vertex programs and texture formats may, however, use half precision to save bandwidth. For carrying out computations, they are converted to single precision [SA11].


2.3.2 Uniformly Quantized Numbers

In contrast to floats, uniformly quantized numbers are spaced uniformly. We store a uniformly quantized number using an unsigned integer i containing Nu bits. A uniformly quantized number maps to a real value that is between a lower bound umin and an upper bound umax, i.e.,

unpack: i ↦ (umax − umin) / (2^Nu − 1) · i + umin,    (2.4)

where i ∈ [0, ..., 2^Nu − 1]. Obviously, the distance between two adjacent real numbers is always

ε = (umax − umin) / (2^Nu − 1).    (2.5)

To convert a real number r into a uniformly quantized number, we use the following mapping:

pack: r ↦ ⌊ (r − umin) / (umax − umin) · (2^Nu − 1) + σ(r) / 2 ⌋.    (2.6)

Thereby, σ(r) is the sign function, i.e.,

σ(r): r ↦ 1 if r ≥ 0, and −1 if r < 0.    (2.7)

The plus sign splits Equation (2.6) into a term that inverts Equation (2.4) and a rounding term, i.e., σ(r)/2. By the rounding term, we make sure that we map to the one uniformly quantized number that is closest to r, after applying the floor function ⌊·⌋. Note that (2.6) can also be used if r is a float.

When converting a real number r into a uniformly quantized number i = pack(r), i is mapped to another real number q = unpack(i). We say that q is the number to which r gets quantized. This can be expressed as a function

quantize: r ↦ unpack(pack(r)).    (2.8)

Note that r and q differ at most by the quantization error of ε/2.


As opposed to floats, GPUs do not offer hardware instructions for operations on uniformly quantized numbers. Therefore, we convert uniformly quantized numbers to floats of sufficient precision and perform arithmetic calculations with floats. This is convenient, as GPUs provide a rich, optimized instruction set for floats. Alternatively, we could implement arithmetic functions directly on uniformly quantized numbers in software with integer operations. However, this is rather complex from a programmer's perspective, and the resulting code is executed more slowly than the equivalent code that uses floating-point instructions.


Adaptive Level-of-Precision

The vertices of a triangle mesh can have all kinds of attributes: unit normal vectors, tangent vectors, colors, texture coordinates, etc., but there is one attribute that is used for sure: vertex positions. Hence, all triangle meshes benefit in terms of memory consumption when compressing positions.

Compressed positions are not exclusively useful for triangle meshes or other polyhedral representations. In point-based graphics [GP07], geometry is defined by a set of non-connected positions. Surfaces, such as Bézier patches, B-spline surfaces [Far02], non-uniform rational B-splines (NURBS) [PT97], T-splines [SZBN03], subdivision surfaces [PR08], and algebraic surfaces [Sed85], use positions to represent control points.

The memory consumption of the positional data of the meshes used in this thesis amounts to one fourth of the total data. Therefore, a significant memory reduction of the positions immediately results in a considerable data reduction for the entire mesh.

In this chapter, we provide simple and efficient ways of reducing the memory consumption of vertex positions. The main idea is to quantize vertex positions according to their distance to the camera: the more distant the vertex positions are to the camera, the less precision and therefore less memory they require. We adapt precision by adding or removing bits from vertex positions. To allow precision adaptation in real time, we present fast data-parallel algorithms. Moreover, our data structures for storing vertex positions allow random access, despite being compressed. However, a reduced precision may result in image errors. We analyze these visual artifacts and avoid them by constraining the quantization of critical vertices.


We achieve compression ratios for vertex positions from 2.5:1 up to 5:1. The reduced complexity influences rendering quality and speed only little. Moreover, our techniques are easy to integrate into existing rendering algorithms.

3.1 Introduction

A common way of saving memory for triangle meshes is to use level-of-detail (LOD) methods. Mesh complexity is adapted by varying the number of vertices, edges, and triangles. Starting from a base-mesh, a series of meshes is created, each of which represents a different level of detail. A mesh from that series is called a level of detail or, more briefly, a level. A level is either coarsened or refined by removing or adding vertices, edges, and triangles.

We say that a level is low (high) if it has few (many) vertices and therefore edges and triangles. For rendering, one level is selected. The advantage of using a lower level is not only reduced memory consumption. The reduced complexity results in faster rendering, too. However, a low level also has less detail, which degrades the final image quality.

The decision which level to choose must be made carefully and constitutes a crucial part of an LOD method. It is a tradeoff between fidelity and performance. Mostly, the level is selected such that an observer is not able to distinguish the finest level from the lower level that is chosen for rendering. A typical criterion is the distance from the mesh to the camera: the further the mesh is away from the camera, the coarser the level can be.

One important aspect of LOD methods is the transition from one level to the other. For example, when the mesh comes closer to the camera, the current level is no longer sufficient for the desired image quality. Then, we switch to the next finer level. However, this is an abrupt change of levels between two frames. This may result in so-called popping artifacts. The human eye is particularly sensitive to popping artifacts, which should therefore be avoided.

LOD-methods are classified into discrete and continuous methods. Discrete LOD-methods have a small, finite set of precomputed levels. Two successive LODs differ by a large number of vertices, edges, and triangles. In contrast, the difference between two continuous LODs is as low as a single vertex. While the former are usually simpler to implement, the latter suffer less from popping artifacts. An overview of various LOD methods is provided in the book by Luebke and co-authors [LRC03].

Figure 3.1: Difference Between LOP- and LOD-Methods. The gray box shows the set of potential vertices. (a) The green boxes show the subset of vertices used for a particular LOD. LOD-methods vary the number of vertices. (b) The blue boxes show the subset of vertices of a particular LOP. LOP-methods vary the numerical precision of the vertices.

Figure 3.1a recaps the idea of LOD-methods: The gray box in Figure 3.1a shows the set of potential vertices that is used for all LODs of the mesh. From that set of potential vertices, each level singles out a subset, shown as green boxes in the figure. The width of the boxes correlates with the number of vertices. The abscissa represents the number of vertices used for a model, and the ordinate represents the precision of the vertex attributes. Their product is the memory usage. So far, LOD-methods do not change the numerical precision of the vertex attributes and alter the number of vertices only.


3.1.1 Contributions

We leverage the unexploited degree of freedom and vary the numerical precision of vertex positions to save memory. We adapt the level-of-precision (LOP). Less precision results in lower memory consumption. The blue boxes in Figure 3.1b indicate the different LOPs. Their height reflects the numerical precision of the vertex attributes. A coarse LOP uses fewer bits than a fine LOP for the vertex attributes. We refer to the level of precision as bit-level.

Note that LOP is not an LOD-method. We study LOP-methods with a constant number of vertices. LOP-methods can, however, be combined with LOD methods to achieve additional memory savings.

LOP can be used for all vertex attributes, but we consider LOP for vertex positions only. While a lower precision uses less memory, rendering quality can suffer. We trade off memory usage against rendering quality by adapting the bit-level interactively: we keep only those bits in GPU memory that are required for the currently used precision.
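A minimal sketch of the bit-level idea (names are ours): given a coordinate quantized to 32 bits, a coarser bit-level b keeps only the b high-order bits, so only those bits need to reside in GPU memory; restoring a finer level re-attaches the dropped low-order bits.

```c
#include <assert.h>
#include <stdint.h>

/* Truncate a 32-bit uniformly quantized coordinate to bit-level b,
 * 1 <= b <= 32, by zeroing its 32 - b low-order bits. Adjacent values
 * at bit-level b are then 2^(32 - b) integer steps apart, i.e., the
 * quantization grid coarsens as b decreases. */
static uint32_t to_bit_level(uint32_t q, unsigned b)
{
    uint32_t low = (b == 32u) ? 0u : ((1u << (32u - b)) - 1u);
    return q & ~low;   /* keep the b high-order bits */
}
```

For instance, at bit-level 16 only the upper half-word of each coordinate survives; at bit-level 32 the coordinate is unchanged.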

We introduce two approaches building upon the LOP idea:

• adaptive precision (AP) and

• constrained adaptive precision (CAP).

Both methods represent positions as uniformly quantized numbers (cf. Section 2.3.2). During rendering, we choose the bit-level such that it is as low as possible in order to reduce memory while maintaining a high image quality.

Our AP-method strives for perfect coverage. That means the pixels covered in the baseline image are the same as those covered in the LOP-image. With baseline image, we refer to the image generated with positions represented as single-precision floating-point numbers. An LOP-image is an image generated with a lower bit-level using AP or CAP. Perfect coverage is not generally obtainable and, in most cases, we get a coverage error. However, we can choose the bit-level such that the coverage error is below a fraction of a pixel.

Yet, AP can be prone to artifacts other than coverage errors, especially for low bit-levels. We observe that errors in the pixel color can occur. We analyze these artifacts and find out that they are caused by wrong shading computations and erroneous depth values. We show that they can be removed if some vertices use a higher bit-level than others.

Figure 3.2: Difference Between AP and CAP. (a) Adaptive Precision (AP), (b) Constrained Adaptive Precision (CAP). While AP uses one common bit-level for all vertex positions, CAP restricts the bit-level of some vertices to a minimum bit-level.

For triangle meshes that are prone to these artifacts, we propose to use constrained adaptive precision (CAP). Thereby, we constrain the bit-level of each vertex to a minimum bit-level. We present preprocessing algorithms for determining the minimum bit-level for the positions of a mesh.

To realize AP and CAP, we further make the following contributions:

• Changing the LOP is conducted by successively removing and adding low-order bits of positions. It is carried out directly on the GPU by fast data-parallel algorithms. Therefore, only little CPU-to-GPU communication is required.

• Positions are stored in a compact form, which allows for fast random access in a vertex program.

The difference between the two approaches is shown in Figure 3.2. AP adapts the LOP of all positions simultaneously. All positions use the same level of precision. CAP adapts the precision of all vertices, too. However, it accounts for the minimum bit-levels of the vertices.

During rendering, both of our approaches choose the bit-level based on a view-dependent criterion. That makes memory usage also view-dependent. In typical application scenarios, we observe compression ratios from 2.5:1 up to 5:1 over positions stored with single-precision floating-point numbers.

Figure 3.3: Positions Stored as Floats, AP, and CAP. (a) Float, (b) AP, (c) CAP. The bottom row shows a close-up of the radiator grill of the Car model. (a) With positions stored with floating-point precision, the image is rendered in 14.5 ms. (b) AP renders the Car in 9.8 ms and consumes 28 % of the memory of floating-point positions. AP is, however, prone to shading artifacts (see bottom row). (c) CAP does not suffer from the artifacts and only needs 38 % of the memory of floating-point positions. The image is rendered in 14.8 ms.

Moreover, positions compressed in our format deliver competitive real-time frame rates that are oftentimes higher than those achieved with floating-point positions. LOP-methods are powerful and effective. At the same time, they are easy to implement.

Figure 3.3 shows a result of our AP- and CAP-method compared against positions stored as single-precision floating-point numbers. While AP delivers a coverage error less than half a pixel and substantially compresses vertex positions, it is prone to shading artifacts, particularly for models with fine detail (b). We remove these artifacts using CAP (c) and obtain images that are visually indistinguishable from images generated with floating-point positions (a). At the same time, we still achieve significant memory savings.


3.1.2 Overview

This chapter is organized as follows: after reviewing prior art in Section 3.2, we introduce the concept of LOP in Section 3.3. Then, we present the LOP methods that handle coverage, shading, and depth errors in Sections 3.4 and 3.5. In Section 3.6, we introduce data-parallel algorithms for precision adaptation and show how our data structures integrate into an OpenGL rendering pipeline. After presenting and discussing results (Section 3.7), we conclude with an outlook on further applications of LOP in Section 3.8.

3.2 Previous Work

We provide a brief overview of existing approaches that focus on compression of positional data. We limit the discussion to GPU methods as well as to contributions dealing with precision issues. Moreover, we discuss methods that take shading error considerations into account.

Vertex Clustering

Multi-resolution techniques are frequently applied to reduce the geometric complexity. At vertex level, vertex cluster algorithms [RB93] partition the object's bounding geometry into uniform cells. Vertices contained in one cell are replaced by a single vertex representative whose position is subject to optimizations [Lin00]. An efficient GPU implementation of this clustering exists [DT07], but it is merely used to speed up the generation of discrete LODs and it is not used to reduce the memory footprint during rendering.

Progressive Meshes

A widespread multi-resolution approach is the progressive mesh (PM) technique by Hoppe [Hop96]. Starting from a base-mesh, successive edge collapses reduce the number of vertices and triangles. This fine-grained control
