Streaming Model Based Volume Ray Casting Implementation for Cell Broadband Engine

(1)

Streaming Model Based Volume Ray Casting

Implementation

for Cell Broadband Engine

Jusub Kim Research Assistant Institute for Advanced Computer Studies University of Maryland, College Park

(2)

480MB/s (60MB/s, 1/8x)

480MB/s (6GB/s)

Background

Interactivity in Volume Rendering

 Volume rendering is very expensive

process in terms of computation and communication.

 Volume data O(n^3) vs. Surface data

O(n^2)

16MB (256x256x256) Disk

Memory

CPU

48 GFLOPS (24 GFLOPS, 1/2x)

Bandwidth and Computing power

required for 30 frames/sec.

(*): Peak Perf. Latest PC

(3)

Background

Volume Visualization: Overview

Volume Vis.

(src:volvis.org)

Geometric Approximation

Isosurface Rendering

Volume Rendering

No explicit geometric extraction

Marching Cubes

[Lorensen and Cline 87]

Src:http://www.essi.fr/~lingrand/

Ray Casting

[Parker et al. 98, Wald et al. 05]

Ray Casting, [Levoy 90]

Best quality, Slowest Splatting

[Westover 90]

Shear-Warping [Lacroute et al. 94]

Texture Mapping [Cabral et al. 94]

Ci, αⁱ

C = C+(1- α) αi Ci α = α + (1- α) αi Approx. of volume rendering eq.

(src:volvis.org)

(4)

Background

Volume Visualization: Direct volume rendering

•Ray Casting,

Slowest, but Best Quality [Levoy 90]

•Splatting, Fast, but limited flexibility [Westover 90]

•Texture-mapping, Hardware Accelerated [Cabral et al. 94]

•Shear-Warp, Fastest Software-based [Lacroute et al. 94]

src:ahpcrc.org

src:sgi.com

Vi

(5)

Background

Parallel and Heterogeneous Compute Resources

•Multi-cores replace single core (Intel will ship 80-core CPUs in a few yrs)

•Optimized for low memory latency and complex

computation flow

Memory Controller

Hub

System DRAM (8GB) PCI- Express:

~8GB/s

~6.4 GB/s

~6.4 GB/s 24GFLOPS

•Fully programmable

•Optimized for High

throughput and data-parallel computation

~400GFLOPS

Graphics DRAM (768MB)

~80 GB/s

(6)

Motivation

Parallel and Heterogeneous Compute Resources

LS MFC

SPU LS MFC

SPU

LS MFC

SPU LS MFC

SPU

LS MFC

SPU LS MFC

SPU

LS MFC

SPU LS MFC

SPU

LS SPU MFC LS

SPU

MFC LS

SPU MFC LS

SPU MFC L2

PowerPC L1 core L2 PowerPC L1

core Element Interconnect Bus (EIB) PPE

SPE

Memory Interface Controller

Bus Interface Controller Rambus XDR

Rambus FlexIO

IBM Cell Broadband Engine

•200GFlops,

•200GB/s B/W inside

•25GB/s B/W to main memory

Larrabee in 2010 ?

One chip for CPU and GPU

Every CPU is going to be multi-core, we need algorithmic change in our software to make the most of the hardware!

(7)

Volume Ray Casting Implementation for

Heterogeneous Compute Resource Environment



Our strategy

•

Segment algorithms according to hardware specialty

•

One that a Traditional CPU can efficiently handle

One that a High-throughput device can efficiently handle

•

Run each algorithm in the corresponding hardware, but streamline data movement between the two parts

CPU

(PowerPC, Core2Duo)

High-Throughput Device (SPE, GPU)

Data Data Data

(8)

CPU

(PowerPC, Core2Duo)

High-Throughput Device

(SPE, GPU)

Tile # Rendered Tile



Volume Ray Casting

Contributing Ray Segment List (Offset, Length)…

•Rendering

Reconstruction/

Shading/

Compositing

•Early Ray Termination

(9)

Heterogeneous Compute Resource Environment IBM Cell Broadband Engine

 Heterogeneous (1PPE+8SPEs)

 High Computing Power 204.8GFlops, High Communication B/W 204.8GB/s

 Asynchronous DMA engine

 Very hard to program

LS MFC

SPU LS MFC

SPU

LS MFC

SPU LS MFC

SPU

LS MFC

SPU LS MFC

SPU

LS MFC

SPU LS MFC

SPU

LS SPU MFC LS

SPU

MFC LS

SPU MFC LS

SPU MFC L2

PowerPC L1 core L2 PowerPC L1

core Element Interconnect Bus (EIB) PPE

SPE

Memory Interface Controller

Bus Interface Controller Rambus XDR

Rambus FlexIO

(10)

Heterogeneous Compute Resource Environment IBM Cell Broadband Engine



How we benefit from the Streaming model

•

Hide the overhead of traversing a tree to skip the empty space

•

Pre-fetch made possible since SPEs know what to render

•

Better SIMD utilization since SPEs know what to render

(11)



Work Decomposition & Mapping

•

Fine Grain Parallelism

•

Tile size small enough to contain work list &

the result tile image

•

Tile size large enough to ensure enough work between synchronization

(12)



Streaming Model

PPE (PowerPC) Tile #

 Binary Octree

 Careful decision

required for leaf node size

(13)



Streaming Model

SPE Rendered Tile

•Pre-fetching

•Rendering

Reconstruction/

Shading/

Compositing

 Four sampling points are processed at once using SIMD instructions

 Pre-fetching:

Asynchronous DMA + Double Buffering

 Asynchronous rendered tile image transfer

(14)

 Performance Gap between Heterogeneous Cores Case A: PPE is faster than SPE

SPE

•4-entry Mailbox PPE

(PowerPC)

List size for Tile k List size for Tile k+1 List size for Tile k+2 …

Work List for Tile k Work List for Tile k+1 Work List for Tile k+2 …

•For each SPE

“BUFFERING”

(15)

 Performance Gap between Heterogeneous Cores Case B: PPE is not fast enough to feed SPE

Tile (i, j) k

k

L(0,0) L(0,1)

L(1,0) L(1,1)

LS(0,0)

L(0,0)

A contributing ray segment in approximation

A contributing ray segment in original

•L(a,b): a contributing ray segment list at (a,b)

•List for a subtile, LS(0,0)=L(0,0) U L(0,1) U L(1,0) U L(1,1)

“APPROXIMATION”

(16)

 Performance Gap between Heterogeneous Cores Case B: PPE is not fast enough to feed SPE

“REFINING”

•Every k-th ray

0 1 1 0 0 1 0

…

Smallest coordinate (x,y,z) of non- empty leaf node

Hash function key=f(x,y,z) table[key]=1

• Hash Table for a subtile is the Union of ones at 4 surrounding corners

•SPE skips rendering a sampling point if the hash table entry of the subvolume it belongs is 0.

•Note that there is no case where non- empty subvolume is recognized as empty subvolume while the other case can occur

(17)



Performance (Data sets)

(18)



Performance (Hidden Traversal Overhead)

(19)



Performance (Hidden Memory Latency)

Foot Aneurism Engine Fuel

w/o prefetch 224 210 133 76

w/ prefetch 113 99 72 38

Local volume, w/o early ray

termination 161 103 85 39

w/ prefetch, w/o early ray

termination 179 112 88 41

(20)



Performance (Load balance)

Average STD among 8 SPEs:

1.7%

(21)



Performance (v.s. Intel Xeon)

(22)

 Conclusions

•

New heterogeneous multi-core computer resource

environment in PC architecture requires us to develop new efficient thread-level parallel algorithms in order to fully utilize the hardware potential.

•

Streaming model with algorithm segmentation according to hardware specialty in volume ray casting allows us to hide algorithmic overhead and memory latency in addition to fully utilizing the hardware.

(23)

Streaming Model Based Volume Ray Casting Implementation for Cell Broadband Engine

Background

Interactivity in Volume Rendering

Background

Volume Visualization: Overview

Our strategy

•

•

•

Volume Ray Casting

How we benefit from the Streaming model

•

•

•

Work Decomposition & Mapping

•

•

•

Streaming Model

Streaming Model

Performance (Data sets)

Performance (Hidden Traversal Overhead)

Performance (Hidden Memory Latency)

Performance (Load balance)

Performance (v.s. Intel Xeon)

•

•

Thank you!!!