Streaming Model Based Volume Ray Casting
Implementation
for Cell Broadband Engine
Jusub Kim Research Assistant Institute for Advanced Computer Studies University of Maryland, College Park
480MB/s (60MB/s, 1/8x)
480MB/s (6GB/s)
Background
Interactivity in Volume Rendering
Volume rendering is very expensive
process in terms of computation and communication.
Volume data O(n^3) vs. Surface data
O(n^2)
16MB (256x256x256) Disk
Memory
CPU
48 GFLOPS (24 GFLOPS, 1/2x)
Bandwidth and Computing power
required for 30 frames/sec.
(*): Peak Perf. Latest PC
Background
Volume Visualization: Overview
Volume Vis.
(src:volvis.org)
Geometric Approximation
Isosurface Rendering
Volume Rendering
No explicit geometric extraction
Marching Cubes
[Lorensen and Cline 87]
Src:http://www.essi.fr/~lingrand/
Ray Casting
[Parker et al. 98, Wald et al. 05]
Ray Casting, [Levoy 90]
Best quality, Slowest Splatting
[Westover 90]
Shear-Warping [Lacroute et al. 94]
Texture Mapping [Cabral et al. 94]
Ci, αi
C = C+(1- α) αi Ci α = α + (1- α) αi Approx. of volume rendering eq.
(src:volvis.org)
Background
Volume Visualization: Direct volume rendering
•Ray Casting,
Slowest, but Best Quality [Levoy 90]
•Splatting, Fast, but limited flexibility [Westover 90]
•Texture-mapping, Hardware Accelerated [Cabral et al. 94]
•Shear-Warp, Fastest Software-based [Lacroute et al. 94]
src:ahpcrc.org
src:sgi.com
Vi
Background
Parallel and Heterogeneous Compute Resources
•Multi-cores replace single core (Intel will ship 80-core CPUs in a few yrs)
•Optimized for low memory latency and complex
computation flow
Memory Controller
Hub
System DRAM (8GB) PCI- Express:
~8GB/s
~6.4 GB/s
~6.4 GB/s 24GFLOPS
•Fully programmable
•Optimized for High
throughput and data-parallel computation
~400GFLOPS
Graphics DRAM (768MB)
~80 GB/s
Motivation
Parallel and Heterogeneous Compute Resources
LS MFC
SPU LS MFC
SPU
LS MFC
SPU LS MFC
SPU
LS MFC
SPU LS MFC
SPU
LS MFC
SPU LS MFC
SPU
LS SPU MFC LS
SPU
MFC LS
SPU MFC LS
SPU MFC LS
SPU MFC LS
SPU MFC LS
SPU MFC LS
SPU MFC L2
PowerPC L1 core L2 PowerPC L1
core Element Interconnect Bus (EIB) PPE
SPE
Memory Interface Controller
Bus Interface Controller Rambus XDR
Rambus FlexIO
IBM Cell Broadband Engine
•200GFlops,
•200GB/s B/W inside
•25GB/s B/W to main memory
Larrabee in 2010 ?
One chip for CPU and GPU
Every CPU is going to be multi-core, we need algorithmic change in our software to make the most of the hardware!
Volume Ray Casting Implementation for
Heterogeneous Compute Resource Environment
Our strategy
•
Segment algorithms according to hardware specialty•
One that a Traditional CPU can efficiently handleOne that a High-throughput device can efficiently handle
•
Run each algorithm in the corresponding hardware, but streamline data movement between the two partsCPU
(PowerPC, Core2Duo)
High-Throughput Device (SPE, GPU)
Data Data Data
Volume Ray Casting Implementation for
Heterogeneous Compute Resource Environment
CPU
(PowerPC, Core2Duo)
High-Throughput Device
(SPE, GPU)
Tile # Rendered Tile
Volume Ray Casting
Contributing Ray Segment List (Offset, Length)…
•Rendering
Reconstruction/
Shading/
Compositing
•Early Ray Termination
Volume Ray Casting Implementation for
Heterogeneous Compute Resource Environment IBM Cell Broadband Engine
Heterogeneous (1PPE+8SPEs)
High Computing Power 204.8GFlops, High Communication B/W 204.8GB/s
Asynchronous DMA engine
Very hard to program
LS MFC
SPU LS MFC
SPU
LS MFC
SPU LS MFC
SPU
LS MFC
SPU LS MFC
SPU
LS MFC
SPU LS MFC
SPU
LS SPU MFC LS
SPU
MFC LS
SPU MFC LS
SPU MFC LS
SPU MFC LS
SPU MFC LS
SPU MFC LS
SPU MFC L2
PowerPC L1 core L2 PowerPC L1
core Element Interconnect Bus (EIB) PPE
SPE
Memory Interface Controller
Bus Interface Controller Rambus XDR
Rambus FlexIO
Volume Ray Casting Implementation for
Heterogeneous Compute Resource Environment IBM Cell Broadband Engine
How we benefit from the Streaming model
•
Hide the overhead of traversing a tree to skip the empty space•
Pre-fetch made possible since SPEs know what to render•
Better SIMD utilization since SPEs know what to renderVolume Ray Casting Implementation for
Heterogeneous Compute Resource Environment IBM Cell Broadband Engine
Work Decomposition & Mapping
•
Fine Grain Parallelism•
Tile size small enough to contain work list &the result tile image
•
Tile size large enough to ensure enough work between synchronizationVolume Ray Casting Implementation for
Heterogeneous Compute Resource Environment IBM Cell Broadband Engine
Streaming Model
PPE (PowerPC) Tile #
Contributing Ray Segment List (Offset, Length)…
Binary Octree
Careful decision
required for leaf node size
Volume Ray Casting Implementation for
Heterogeneous Compute Resource Environment IBM Cell Broadband Engine
Streaming Model
SPE Rendered Tile
Contributing Ray Segment List (Offset, Length)…
•Pre-fetching
•Rendering
Reconstruction/
Shading/
Compositing
Four sampling points are processed at once using SIMD instructions
Pre-fetching:
Asynchronous DMA + Double Buffering
Asynchronous rendered tile image transfer
Volume Ray Casting Implementation for
Heterogeneous Compute Resource Environment IBM Cell Broadband Engine
Performance Gap between Heterogeneous Cores Case A: PPE is faster than SPE
SPE
•4-entry Mailbox PPE
(PowerPC)
List size for Tile k List size for Tile k+1 List size for Tile k+2 …
Work List for Tile k Work List for Tile k+1 Work List for Tile k+2 …
•For each SPE
“BUFFERING”
Volume Ray Casting Implementation for
Heterogeneous Compute Resource Environment IBM Cell Broadband Engine
Performance Gap between Heterogeneous Cores Case B: PPE is not fast enough to feed SPE
Tile (i, j) k
k
L(0,0) L(0,1)
L(1,0) L(1,1)
LS(0,0)
L(0,0)
A contributing ray segment in approximation
A contributing ray segment in original
•L(a,b): a contributing ray segment list at (a,b)
•List for a subtile, LS(0,0)=L(0,0) U L(0,1) U L(1,0) U L(1,1)
“APPROXIMATION”
Volume Ray Casting Implementation for
Heterogeneous Compute Resource Environment IBM Cell Broadband Engine
Performance Gap between Heterogeneous Cores Case B: PPE is not fast enough to feed SPE
“REFINING”
•Every k-th ray
0 1 1 0 0 1 0
…
Smallest coordinate (x,y,z) of non- empty leaf node
Hash function key=f(x,y,z) table[key]=1
• Hash Table for a subtile is the Union of ones at 4 surrounding corners
•SPE skips rendering a sampling point if the hash table entry of the subvolume it belongs is 0.
•Note that there is no case where non- empty subvolume is recognized as empty subvolume while the other case can occur
Volume Ray Casting Implementation for
Heterogeneous Compute Resource Environment IBM Cell Broadband Engine
Performance (Data sets)
Volume Ray Casting Implementation for
Heterogeneous Compute Resource Environment IBM Cell Broadband Engine
Performance (Hidden Traversal Overhead)
Volume Ray Casting Implementation for
Heterogeneous Compute Resource Environment IBM Cell Broadband Engine
Performance (Hidden Memory Latency)
Foot Aneurism Engine Fuel
w/o prefetch 224 210 133 76
w/ prefetch 113 99 72 38
Local volume, w/o early ray
termination 161 103 85 39
w/ prefetch, w/o early ray
termination 179 112 88 41
Volume Ray Casting Implementation for
Heterogeneous Compute Resource Environment IBM Cell Broadband Engine
Performance (Load balance)
Average STD among 8 SPEs:
1.7%
Volume Ray Casting Implementation for
Heterogeneous Compute Resource Environment IBM Cell Broadband Engine
Performance (v.s. Intel Xeon)
Volume Ray Casting Implementation for
Heterogeneous Compute Resource Environment
Conclusions
•
New heterogeneous multi-core computer resourceenvironment in PC architecture requires us to develop new efficient thread-level parallel algorithms in order to fully utilize the hardware potential.