• No results found

Streaming Model Based Volume Ray Casting Implementation for Cell Broadband Engine

N/A
N/A
Protected

Academic year: 2022

Share "Streaming Model Based Volume Ray Casting Implementation for Cell Broadband Engine"

Copied!
23
0
0

Laster.... (Se fulltekst nå)

Fulltekst

(1)

Streaming Model Based Volume Ray Casting

Implementation

for Cell Broadband Engine

Jusub Kim Research Assistant Institute for Advanced Computer Studies University of Maryland, College Park

(2)

480MB/s (60MB/s, 1/8x)

480MB/s (6GB/s)

Background

Interactivity in Volume Rendering

Volume rendering is very expensive

process in terms of computation and communication.

Volume data O(n^3) vs. Surface data

O(n^2)

16MB (256x256x256) Disk

Memory

CPU

48 GFLOPS (24 GFLOPS, 1/2x)

Bandwidth and Computing power

required for 30 frames/sec.

(*): Peak Perf. Latest PC

(3)

Background

Volume Visualization: Overview

Volume Vis.

(src:volvis.org)

Geometric Approximation

Isosurface Rendering

Volume Rendering

No explicit geometric extraction

Marching Cubes

[Lorensen and Cline 87]

Src:http://www.essi.fr/~lingrand/

Ray Casting

[Parker et al. 98, Wald et al. 05]

Ray Casting, [Levoy 90]

Best quality, Slowest Splatting

[Westover 90]

Shear-Warping [Lacroute et al. 94]

Texture Mapping [Cabral et al. 94]

Ci, αi

C = C+(1- α) αi Ci α = α + (1- α) αi Approx. of volume rendering eq.

(src:volvis.org)

(4)

Background

Volume Visualization: Direct volume rendering

•Ray Casting,

Slowest, but Best Quality [Levoy 90]

•Splatting, Fast, but limited flexibility [Westover 90]

•Texture-mapping, Hardware Accelerated [Cabral et al. 94]

•Shear-Warp, Fastest Software-based [Lacroute et al. 94]

src:ahpcrc.org

src:sgi.com

Vi

(5)

Background

Parallel and Heterogeneous Compute Resources

•Multi-cores replace single core (Intel will ship 80-core CPUs in a few yrs)

•Optimized for low memory latency and complex

computation flow

Memory Controller

Hub

System DRAM (8GB) PCI- Express:

~8GB/s

~6.4 GB/s

~6.4 GB/s 24GFLOPS

•Fully programmable

•Optimized for High

throughput and data-parallel computation

~400GFLOPS

Graphics DRAM (768MB)

~80 GB/s

(6)

Motivation

Parallel and Heterogeneous Compute Resources

LS MFC

SPU LS MFC

SPU

LS MFC

SPU LS MFC

SPU

LS MFC

SPU LS MFC

SPU

LS MFC

SPU LS MFC

SPU

LS SPU MFC LS

SPU

MFC LS

SPU MFC LS

SPU MFC LS

SPU MFC LS

SPU MFC LS

SPU MFC LS

SPU MFC L2

PowerPC L1 core L2 PowerPC L1

core Element Interconnect Bus (EIB) PPE

SPE

Memory Interface Controller

Bus Interface Controller Rambus XDR

Rambus FlexIO

IBM Cell Broadband Engine

•200GFlops,

•200GB/s B/W inside

•25GB/s B/W to main memory

Larrabee in 2010 ?

One chip for CPU and GPU

Every CPU is going to be multi-core, we need algorithmic change in our software to make the most of the hardware!

(7)

Volume Ray Casting Implementation for

Heterogeneous Compute Resource Environment

Our strategy

Segment algorithms according to hardware specialty

One that a Traditional CPU can efficiently handle

One that a High-throughput device can efficiently handle

Run each algorithm in the corresponding hardware, but streamline data movement between the two parts

CPU

(PowerPC, Core2Duo)

High-Throughput Device (SPE, GPU)

Data Data Data

(8)

Volume Ray Casting Implementation for

Heterogeneous Compute Resource Environment

CPU

(PowerPC, Core2Duo)

High-Throughput Device

(SPE, GPU)

Tile # Rendered Tile

Volume Ray Casting

Contributing Ray Segment List (Offset, Length)…

•Rendering

Reconstruction/

Shading/

Compositing

•Early Ray Termination

(9)

Volume Ray Casting Implementation for

Heterogeneous Compute Resource Environment IBM Cell Broadband Engine

Heterogeneous (1PPE+8SPEs)

High Computing Power 204.8GFlops, High Communication B/W 204.8GB/s

Asynchronous DMA engine

Very hard to program

LS MFC

SPU LS MFC

SPU

LS MFC

SPU LS MFC

SPU

LS MFC

SPU LS MFC

SPU

LS MFC

SPU LS MFC

SPU

LS SPU MFC LS

SPU

MFC LS

SPU MFC LS

SPU MFC LS

SPU MFC LS

SPU MFC LS

SPU MFC LS

SPU MFC L2

PowerPC L1 core L2 PowerPC L1

core Element Interconnect Bus (EIB) PPE

SPE

Memory Interface Controller

Bus Interface Controller Rambus XDR

Rambus FlexIO

(10)

Volume Ray Casting Implementation for

Heterogeneous Compute Resource Environment IBM Cell Broadband Engine

How we benefit from the Streaming model

Hide the overhead of traversing a tree to skip the empty space

Pre-fetch made possible since SPEs know what to render

Better SIMD utilization since SPEs know what to render

(11)

Volume Ray Casting Implementation for

Heterogeneous Compute Resource Environment IBM Cell Broadband Engine

Work Decomposition & Mapping

Fine Grain Parallelism

Tile size small enough to contain work list &

the result tile image

Tile size large enough to ensure enough work between synchronization

(12)

Volume Ray Casting Implementation for

Heterogeneous Compute Resource Environment IBM Cell Broadband Engine

Streaming Model

PPE (PowerPC) Tile #

Contributing Ray Segment List (Offset, Length)…

Binary Octree

Careful decision

required for leaf node size

(13)

Volume Ray Casting Implementation for

Heterogeneous Compute Resource Environment IBM Cell Broadband Engine

Streaming Model

SPE Rendered Tile

Contributing Ray Segment List (Offset, Length)…

•Pre-fetching

•Rendering

Reconstruction/

Shading/

Compositing

Four sampling points are processed at once using SIMD instructions

Pre-fetching:

Asynchronous DMA + Double Buffering

Asynchronous rendered tile image transfer

(14)

Volume Ray Casting Implementation for

Heterogeneous Compute Resource Environment IBM Cell Broadband Engine

Performance Gap between Heterogeneous Cores Case A: PPE is faster than SPE

SPE

•4-entry Mailbox PPE

(PowerPC)

List size for Tile k List size for Tile k+1 List size for Tile k+2

Work List for Tile k Work List for Tile k+1 Work List for Tile k+2

•For each SPE

“BUFFERING”

(15)

Volume Ray Casting Implementation for

Heterogeneous Compute Resource Environment IBM Cell Broadband Engine

Performance Gap between Heterogeneous Cores Case B: PPE is not fast enough to feed SPE

Tile (i, j) k

k

L(0,0) L(0,1)

L(1,0) L(1,1)

LS(0,0)

L(0,0)

A contributing ray segment in approximation

A contributing ray segment in original

•L(a,b): a contributing ray segment list at (a,b)

•List for a subtile, LS(0,0)=L(0,0) U L(0,1) U L(1,0) U L(1,1)

“APPROXIMATION”

(16)

Volume Ray Casting Implementation for

Heterogeneous Compute Resource Environment IBM Cell Broadband Engine

Performance Gap between Heterogeneous Cores Case B: PPE is not fast enough to feed SPE

“REFINING”

•Every k-th ray

0 1 1 0 0 1 0

Smallest coordinate (x,y,z) of non- empty leaf node

Hash function key=f(x,y,z) table[key]=1

• Hash Table for a subtile is the Union of ones at 4 surrounding corners

•SPE skips rendering a sampling point if the hash table entry of the subvolume it belongs is 0.

•Note that there is no case where non- empty subvolume is recognized as empty subvolume while the other case can occur

(17)

Volume Ray Casting Implementation for

Heterogeneous Compute Resource Environment IBM Cell Broadband Engine

Performance (Data sets)

(18)

Volume Ray Casting Implementation for

Heterogeneous Compute Resource Environment IBM Cell Broadband Engine

Performance (Hidden Traversal Overhead)

(19)

Volume Ray Casting Implementation for

Heterogeneous Compute Resource Environment IBM Cell Broadband Engine

Performance (Hidden Memory Latency)

Foot Aneurism Engine Fuel

w/o prefetch 224 210 133 76

w/ prefetch 113 99 72 38

Local volume, w/o early ray

termination 161 103 85 39

w/ prefetch, w/o early ray

termination 179 112 88 41

(20)

Volume Ray Casting Implementation for

Heterogeneous Compute Resource Environment IBM Cell Broadband Engine

Performance (Load balance)

Average STD among 8 SPEs:

1.7%

(21)

Volume Ray Casting Implementation for

Heterogeneous Compute Resource Environment IBM Cell Broadband Engine

Performance (v.s. Intel Xeon)

(22)

Volume Ray Casting Implementation for

Heterogeneous Compute Resource Environment

Conclusions

New heterogeneous multi-core computer resource

environment in PC architecture requires us to develop new efficient thread-level parallel algorithms in order to fully utilize the hardware potential.

Streaming model with algorithm segmentation according to hardware specialty in volume ray casting allows us to hide algorithmic overhead and memory latency in addition to fully utilizing the hardware.

(23)

Thank you!!!

Referanser

RELATERTE DOKUMENTER

[RPZ02] R EN L., P FISTER H., Z WICKER M.: Object space ewa surface splatting: A hardware accelerated ap- proach to high quality point rendering. L.: Ren- dering antialiased

Most recent approaches build on single-pass ray-casting, where the entire volume is traversed in a single rendering pass using data-dependent looping in the hardware fragment

We introduce a new ray tracing algorithm that exploits the best of previous methods: Similar to bounding volume hierarchies the memory of the acceleration data structure is linear

We compared our algorithm with other recent hardware- based volume rendering algorithms, based on the ray casting paradigm, and on a hybrid version mixing ray casting and

This paper examines the second benefit, large numbers of tightly connected processor nodes, within the context of a parallel ray casting volume rendering algorithm implemented on

In our scheme, a PPE (the PowerPC processor in the Cell B.E.) is responsi- ble for traversing a hierarchical data structure and generat- ing the lists of intersecting voxels along

In this paper, two classes of techniques for efficient re- alizations of the known opacity correction mechanisms in over-sampled volume ray casting using commodity hard- ware

Ljung has published several papers in international conferences and journals including IEEE Visualization, Eurographics conferences, IEEE TVCG and others, on volume rendering of