Implementation Examples in Scientiﬁc Visualization

(1)

Tutorial: Tensor Approximation in Visualization and Graphics

Implementation Examples in Scientific Visualization

Renato Pajarola, Susanne K. Suter, and Roland Ruiters

(2)

Outline

• Tensor classes in MATLAB and vmmlib

‣ Downloads:

– http://www.sandia.gov/~tgkolda/TensorToolbox

– https://github.com/VMML/vmmlib

‣ Typical tensor operations

‣ Toy examples (see folder: vmmlib_ta_demo)

‣ Test dataset (see folder: vmmlib_ta_demo)

• GPU-based tensor reconstruction

2

(3)

Typical TA Operations

• Create a tensor (memory mapping)

• Unfolding

• TTM

‣ core generation vs. reconstruction

• Create tensor models (Tucker, CP)

• Algorithms (HOSVD, HOOI, HOPM)

3

(4)

Tensor: A Multidimensional Array

• MATLAB: N-way tensor

‣ M = ones(4,3,2); (A 4 x 3 x 2 array)

‣ A = tensor(M,[2 3 4]); (M has 24 elements)

‣ A = tenones([3 4 2]);

‣ A = tenrand([4 3 2]);

• For details on the MATLAB tensor toolbox see toolbox documentation

4

X is a tensor of size 2 x 3 x 4

X(:,:,1) =

1 1 1 1 1 1 X(:,:,2) =

1 1 1 1 1 1

X(:,:,3) =

1 1 1 1 1 1 X(:,:,4) =

1 1 1 1 1 1

I ₁ a

i

₁

= 1 , . . . , I

₁

I ₁ A

i

₂

= 1, . . . , I

₂

I

₂

I

₂

I ₁ A

i

₃

= 1 , . . . , I

₃

I

₃

a ^...

a scalar 1

^st

order tensor or vector 2

^nd

order tensor or matrix 3

^rd

order tensor or volume

(5)

Test Dataset: Hazelnut

5

• A microCT scan of a dried hazelnut (acquired at the UZH)

• I 1 = I 2 = I 3 = 512

• Values: unsigned char (8bit)

I

₂

I

₁

A

I

₃

(6)

A Matrix in vmmlib

6

• I 1 (M) rows

• I 2 (N) columns

• The matrices are per default column-major ordered

• A matrix is an array of I 2 (N) columns, where each column is of size I 1 (M)

matrix< I1, I2, Type >

matrix< 4, 3, unsigned char > m;

Example:

(0, 1, 2) (3, 4, 5) (6, 7, 8) (9, 10, 11) [vmmlib]

I ₁ A

i

₂

= 1, . . . , I

₂

I

₂

(7)

A Tensor3 in vmmlib

7

A

I

₁

I

₂

I

₃

• A tensor3 A in vmmlib is an array of I 3 matrices each of size I 1 times I 2

• The matrices are per default column-major ordered

• For each tensor3, the explicit size and the type of the values is requested

• A tensor3 is internally allocated and deallocated as pointer while the matrices are not

tensor3< I1, I2, I3, Type >

tensor3< 4, 3, 2, unsigned char > t3;

Example:

(0, 1, 2) (3, 4, 5) (6, 7, 8) (9, 10, 11) ***

(12, 13, 14) (15, 16, 17) (18, 19, 20) (21, 22, 23) ***

[vmmlib]

(8)

A Tensor4 in vmmlib

8

A

I

₁

I

₂

I

₃

• A tensor4 in vmmlib is an array of I 4 tensor3s

• For each tensor4, the explicit size and the type of the values is requested

tensor4< I1, I2, I3, I4, Type >

tensor4< 4, 3, 2, 2, unsigned char > t4;

Example:

(0, 1, 2) (3, 4, 5) (6, 7, 8) (9, 10, 11) ***

(12, 13, 14) (15, 16, 17) (18, 19, 20) (21, 22, 23) ***

----

(24, 25, 26) ...

A

I

₁

I

₂

I

₃

[vmmlib]

(9)

Large Data Tensors (in vmmlib)

9

A

I

₁

I

₂

I

₃

! const size_t d = 512;

! typedef tensor3< d,d,d, unsigned char > t3_512u_t;

! typedef t3_converter< d,d,d, unsigned char > t3_conv_t;

! typedef tensor_mmapper< t3_512u_t, t3_conv_t > t3map_t;

!

! std::string in_dir = "./dataset";

! std::string file_name = "hnut512_uint.raw";

! t3_512u_t t3_hazelnut;

! t3_conv_t t3_conv;

! t3map_t t3_mmap( in_dir, file_name, true, t3_conv ); //true -> read-only

! t3_mmap.get_tensor( t3_hazelnut );

[vmmlib]

(10)

Get Slices of a Tensor3

matrix< 512, 512, values_t > slice;

t3.get_frontal_slice_fwd( 256, slice );

10

frontal slices horizontal slices lateral slices

matrix< 512, 512, values_t > slice;

t3.get_horizontal_slice_fwd( 256, slice ); matrix< 512, 512, values_t > slice;

t3.get_lateral_slice_fwd( 256, slice );

[vmmlib]

(11)

Forward Tensor Unfolding (Matricization)

11

I₂ I₃

A

I₁

I₂

I₃ I₂

I₃ I₁

I₂

I₁

A

I₁

I₂ I₃

I₃

I₃ I₃

I₂ I₂

I₁ I₁

A

₍₃₎

A

₍₁₎

A

₍₂₎

I₂ ·I₁ I₁ ·I₃ I₃ ·I₂

Forward Cyclic Unfolding

tensor3< I1, I2, I3, values_t > t3

matrix< I1, I3*I2, values_t > unf_front_fwd;

t3.frontal_unfolding_fwd( unf_front_fwd );

matrix< I2, I1*I2, values_t > unf_horiz_fwd;

t3.horizontal_unfolding_fwd( unf_horiz_fwd );

matrix< I3, I2*I1, values_t > unf_lat_fwd;

t3.lateral_unfolding_fwd( unf_lat_fwd );

forward unfolded tensor (frontal) (0, 1, 2, 12, 13, 14)

(3, 4, 5, 15, 16, 17) (6, 7, 8, 18, 19, 20) (9, 10, 11, 21, 22, 23)

forward unfolded tensor (horizontal) (0, 12, 3, 15, 6, 18, 9, 21)

(1, 13, 4, 16, 7, 19, 10, 22) (2, 14, 5, 17, 8, 20, 11, 23)

forward unfolded tensor (lateral)

(0, 3, 6, 9, 1, 4, 7, 10, 2, 5, 8, 11)

(12, 15, 18, 21, 13, 16, 19, 22, 14, 17, 20, 23) after [Kiers, 2000]

[vmmlib]

(12)

Backward Tensor Unfolding (Matricization)

12 I₁

I₂

I₃

A

I₁

I₂ I₂

I₃ I₃

I₂

I₃

I₃ I₃ I₃

I₁ I₁ I₁

I₂ I₂ I₂

A

₍₂₎

A

₍₁₎

A

₍₃₎

I₂ ·I₃

I₃ ·I₁

I₁ ·I₂

Backward Cyclic Unfolding

tensor3< I1, I2, I3, values_t > t3

matrix< I1, I2*I3, values_t > unf_lat_bwd;

t3.lateral_unfolding_bwd( unf_lat_bwd );

matrix< I2, I3*I1, values_t > unf_front_bwd;

t3.frontal_unfolding_bwd( unf_front_bwd );

matrix< I3, I1*I2, values_t > unf_horiz_bwd;

t3.horizontal_unfolding_bwd( unf_horiz_bwd );

backward unfolded tensor (lateral) (0, 12, 1, 13, 2, 14)

(3, 15, 4, 16, 5, 17) (6, 18, 7, 19, 8, 20) (9, 21, 10, 22, 11, 23)

backward unfolded tensor (frontal) (0, 3, 6, 9, 12, 15, 18, 21)

(1, 4, 7, 10, 13, 16, 19, 22) (2, 5, 8, 11, 14, 17, 20, 23)

backward unfolded tensor (horizontal) (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)

(12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23) after [De Lathauer et al., 2000a]

[vmmlib]

(13)

Example Unfoldings along the Modes 1, 2, and 3

13

...

mode-1 unfolding

mode-3

unfolding

(14)

Tensor Times Matrix Multiplication

14

A B

C

I

_n

I

_n

I

₁

I

₁

J

_n

J

_n

I

₂

C

B _(n) ^I

ⁿ

I

_n

A _(n) ^J

n

J

_n

I

₁

· I

₂

I

₁

· I

₂

n-mode product ⁽ ^B ×

ⁿ

C )

_i₁_...ı_n₋₁ _j_n_i_n₊₁_...i_N

=

I_n i_n

∑

=1

b

_i₁_i₂_...i_N

· c

_j_n_i_n

A = B ×

ⁿ

C ⇔ A

_(n)

= CB

_(n)

[De Lathauer et al., 2000a]

(15)

Tensor Times Matrix Multiplications

15

A B

C

I

_n

I

_n

I

₁

I

₁

J

_n

J

_n

I

₂

t3_ttm::multiply_frontal_fwd( tensor3_b, matrix_c1, tensor3_a1 );

t3_ttm::multiply_horizontal_fwd( tensor3_b, matrix_c2, tensor3_a2 );

t3_ttm::multiply_lateral_fwd( tensor3_b, matrix_c3, tensor3_a3 );

t3_ttm::full_tensor3_matrix_multiplication(

tensor3_b, matrix_c1, matrix_c2, matrix_c3, tensor3_a );

• The T3_TTM is implemented using openMP and BLAS for the parallel matrix-tensor_slice multiplications.

• The full TTM multiplication includes three TTMs: first a TTM along frontal slices, then a TTM along horizontal slices, and finally a TTM along lateral slices.

• Since the tensor3 is an array consisting of frontal slices (matrices), we start first with the frontal slice multiplication. This is optimized for tensors with In > Jn (For example, Tucker core generation). If you

have a situation, where Jn > In (for example Tucker reconstruction), you could rearrange the order of the modes of the TTM multiplications such that the most expensive TTM (the one of the largest tensor) is

performed along frontal slices.

[vmmlib]

(16)

Example TTMs: Core Computation

16

I

₁

I

₂

I

₁

R

₁

A

U

⁽¹⁾^T

B

^�

I

₂

R

₁

I

₃

I

₂

R

₂

_U

⁽²⁾^T

B

^��

B

^�

I

₂

I

₃

I

₃

R

₂

R

₁

I

₃

R

₃

^U

⁽³⁾^T

B

^��

I

₃

R

₁

R

₁

R

₃

R

₂

B = A ×

¹

U

⁽¹⁾^T

×

²

U

⁽²⁾^T

×

³

··· ×

^N

U

^(N⁾^T

B = A ×

¹

U

⁽¹⁾⁽⁻¹⁾

×

²

U

⁽²⁾⁽⁻¹⁾

×

³

··· ×

^N

U

^(N⁾⁽⁻¹⁾

• Three consecutive TTM multiplication (along modes 1,2,3)

• For orthogonal matrices, use the transposes of the three factor matrices (otherwise the (pseudo)-inverses)

• t3_ttm::full_tensor3_matrix_multiplication( A, U1_t, U2_t, U3_t, B );

orthogonal factor matrices

[vmmlib]

(17)

Tucker3 Tensor

17

I₃ I₁

I₂

I₃

I₁

U⁽³⁾ R₃

R₁

R₂ U⁽¹⁾

I₂ U⁽²⁾

≈ B

A �

[vmmlib]

typedef tucker3_tensor< R1, R2, R3, I1, I2, I3, T_value, T_coeff > tucker3_t;

• Define input tensor size (I

1

,I

2

,I

2

)

• Define multilinear rank (R

1

,R

2

,R

3

)

• Define value type and coefficient value type

• Internally always computes with floating point values

• Stores the three factor matrices (I

n

x R

n

) and the core tensor (R

1

,R

2

,R

3

)

• ALS:

‣ if not converged (fit does not improve anymore, tolerance 1e-04)

‣ the ALS stops latest after 10 iteration

• Reconstruction

(18)

Example Code Tucker3 Tensor

18 !typedef tensor3< I1, I2, I3, values_t > t3_t;

! t3_t t3; //after initializing a tensor3, the tensor is still empty

! t3.fill_increasing_values(); //fills the empty tensor with the values 0,1,2,3...

typedef tucker3_tensor< R1, R2, R3, I1, I2, I3, values_t, float > tucker3_t;

! tucker3_t tuck3_dec; //empty tucker3 tensor

!

! //choose initialization of Tucker ALS (init_hosvd, init_random, init_dct)

! typedef t3_hooi< R1, R2, R3, I1, I2, I3, float > hooi_t;

!

! //Example for initialization with init_rand

! tuck3_dec.tucker_als( t3, hooi_t::init_random());

! //Example for initialization with init_hosvd

! tuck3_dec.tucker_als( t3, hooi_t::init_hosvd());

!

! //Reconstruction

! t3_t t3_reco;

! tuck3_dec.reconstruct( t3_reco );

//Reconstruction error (RMSE)

double rms_err = = t3.rmse( t3_reco );

[vmmlib]

I₃ I₁

I₂

I₃

I₁

U⁽³⁾ R₃

R₁

R₂ U⁽¹⁾

I₂ U⁽²⁾

≈ B

A �

(19)

U

⁽¹⁾

I

₁

R

₁

U

⁽²⁾

I

₂

R

₂

U

⁽³⁾

I

₃

R

₃

Example Tucker3 Factor Matrices

19

(20)

CP3 Tensor

20

I₃ I₁

I₂

I₃

I₁

U⁽³⁾

U⁽¹⁾

I₂ U⁽²⁾

A � ≈

R

R R

λ

₁

λ_R

...

[vmmlib]

• Define input tensor size (I

1

,I

2

,I

2

)

• Define rank R

• Define value type and coefficient value type

• Internally always computes with floating point values

• Stores three factor matrices each of size (I

n

x R) and the lambdas R

• ALS:

‣ if not converged (fit does not improve anymore, tolerance 1e-04)

‣ set number of maximum CP ALS iterations

• Reconstruction

(21)

Code Example CP3 Tensor

21

I₃ I₁

I₂

I₃

I₁

U⁽³⁾

U⁽¹⁾

I₂ U⁽²⁾

A � ≈

R

R R

λ

₁

λ_R

...

[vmmlib]

! typedef cp3_tensor< r, a, b, c, values_t, float > cp3_t;

! typedef t3_hopm< r, a, b, c, float > t3_hopm_t;

!

! cp3_t cp3_dec;

!

! //Decomposition or CP ALS

! //choose initialization of Tucker ALS (init_hosvd, init_random)

! int max_cp_iter = 20;

! cp3_dec.cp_als( t3, t3_hopm_t::init_random(), max_cp_iter );

!

//Reconstruction

! t3_t t3_cp_reco;

! cp3_dec.reconstruct( t3_cp_reco );

//Reconstruction error (RMSE)

! rms_err = t3.rmse( t3_cp_reco ) ;

!

(22)

Higher-order SVD (HOSVD)

22

unfold A along mode n (A_n)

start HOSVD for mode n

compute the matrix SVD on A_n

mode n matrix U_n tensor A

stop HOSVD for mode n set R_n left

singular vectors as U_n

[De Lathauwer et al., 2000a]

[vmmlib]

(23)

HOSVD vs. HOEIGS (HOEVD)

23

unfold A along mode n (A_n)

start HOSVD for mode n

compute the matrix SVD on A_n

stop HOSVD for mode n

set R_n left singular vectors as U_n

unfold A along mode n (A_n)

start HOEIGS for mode n

compute the matrix symmetric EIG on C_n

stop HOEIGS for mode n

set eigenvectors (of R_n most significant

eigenvalues) as U_n compute covariance

matrix C_n = A_n A_n^T

[De Lathauwer et al., 2000a]

Higher-order symmetric eigenvalue decomposition

- HOEIGS: [Suter et al.]

- HOEVD: [De Lathauwer et al., 2000a]

[vmmlib] typedef t3_hosvd< R1, R2, R3, I1, I2, I3 > t3_hosvd_t;

//HOSVD modes: eigs_e or svd_e

(24)

Higher-order Orthogonal Iteration (HOOI)

24

invert all matrices, but mode n

start mode-n optimization

multiply tensor with all inverted matrices (TTMs)

optimized tensor A' tensor A,

matrices U

stop mode-n optimization convergence?

init matrices U (random, HOSVD)

compute max Frobenius norm A

set convergence criteria

input tensor A

optimize mode n

compute new mode-n matrix (HOSVD on A')

yes

no

compute core tensor B

compute fit

stop iterations start ALS

matrices U, core tensor B

mode-n optimized tensor A'

[De Lathauwer et al., 2000b]

[vmmlib] typedef t3_hooi< R1, R2, R3, I1, I2, I3 > t3_hooi_t;

(25)

Higher-order Power Method (HOPM)

25

convergence?

init matrices U (random, HOSVD)

compute max Frobenius norm A

set convergence criteria

input tensor A

optimize mode n

yes

no

compute fit

stop iterations start ALS

matrices U, lambdas

multiply each U's transpose with U

start mode-n optimization

normalize new U_n norm -> new lambda

new matrix U_n, new lambda tensor A,

matrices U

stop mode-n optimization

unfold A along mode n to A_n

Khatri Rao product of all Us, but U_n ->

U_krp

piecewise

multiplication of all U^T U -> V

new U_n: multiply A_n with U_krp and V^+

pseudo inverse of V ->

V^+

[De Lathauwer et al., 2000b]

[vmmlib] typedef t3_hopm< R, I1, I2, I3 > t3_hopm_t;

(26)

Tucker Tensor-specific Quantization

26

To appear in an IEEE VGTC sponsored conference proceedings

data management system that divides the data into blocks is an impor- tant basis both to process and to visualize large datasets. Our method is based on the offline decomposition of the original volumetric dataset into small cubical bricks (subvolumes), i.e., third-order tensors, which are approximated, quantized and organized into an octree structure maintained out-of-core. The octree contains data bricks at different resolutions, where each resolution of the volume is represented as a collection of bricks in the subsequent octree hierarchy level.

Each brick has a fixed width B with an overlap of two voxels at each brick boundary for efficiently supporting runtime operations requiring access to neighboring voxels (trilinear interpolation and gradient computation). The width of the brick is flexible, but in this paper is set to B = (28 + 2 + 2) = 32, i.e., one brick is 32³, which has proved small enough to guarantee LOD adaptivity, while coarse enough to permit an effective brick encoding by the analysis of the local structure.

Each octree brick A ∈ R³ is tensor approximated using rank- reduced Tucker decomposition. A Tucker decomposition (see Ap- pendix A) is defined as A�= B ×1 U⁽¹⁾ ×2 U⁽²⁾ ×3 U⁽³⁾, where B is the so called core tensor and U⁽ⁿ⁾ are the factor matrices. A rank- reduced TA along every mode of the dataset is written with the notation: rank-(R₁,R₂,R₃) TA. As illustrated in Fig. 1, we compute for each brick of size B³ a rank-(R,R,R) TA, with R ∈ [1..B−1]. Typically, we use a rank reduction, where R = B/2, i.e., R = 16 for B = 32, following the rank reduction scheme used in other tensor approximation works [27, 23]. The resulting rank-reduced decomposition is quantized to further reduce memory usage (see Sec. 4) and stored in a out-of-core brick database. With each brick, we store a 64-bit binary histogram, which is used for transfer-function-based culling.

...

... ...

lowest resolution

highest resolution

B³bricks

core tensor and basis matrices U

B

A

Fig. 1. Multiresolution octree tensor decomposition hierarchy with B³ sized bricks.

The whole preprocessing is performed in a low-memory setting using a bottom-up process on a brick-by-brick basis, which is repeated until we reach the octree root. Leafs are constructed by sampling the original dataset, while non-leaf bricks are constructed from their previously constructed eight children, which are dequantized, reconstructed, and spatially averaged.

At run-time, an adaptive loader updates a view- and transfer function-dependent working set of bricks. The working set is incre- mentally maintained on the CPU and GPU memory by asynchronously fetching data from the out-of-core brick multiresolution TA structure.

Following the MOVR approach [12, 14], the working set is maintained by an adaptive refinement method guided by the visibility information fed back from the renderer. The adaptive loader maintains on GPU a cache of recently used volume bricks, stored in a 3D texture. At each frame, the loader constructs a spatial index for the current working set in the form of an octree with neighbor pointers.

For rendering and visibility computation, the octree is traversed using a CUDA stack-less octree ray-caster, which employs preintegrated scalar transfer functions to associate optical properties to scalar values, and supports a variety of shading modes [14]. The ray-caster works on reconstructed bricks, and reconstruction steps occur only upon GPU cache misses. The quantized tensor decomposition is dequantized and

reconstructed on demand by the adaptive loader during the visualization on the GPU (see Sec. 5).

In order to permit structural exploration of the datasets, the reconstruction can consider only the K most significant ranks of the tensor decomposition, where K ∈ [1..R] is chosen by the user. The reconstruction rank K can be changed during the visualization process with a rank slider. Lower-rank reductions give a faster outline of the visualized dataset and can highlight structures at specific scales [23], see also Sec.6. Higher K values add more details onto the dataset.

4 ENCODING OF COEFFICIENTS

As mentioned previously, the tensor and factor matrix coefficients take up unnecessary space if maintained as floating point values, see also storage cost analysis in Sec. 6.2. For compact representation of the tensor decomposition and to reduce the disk to host to device bandwidth during rendering, we apply a simple fixed bit length encoding based on tensor-specific quantization. In particular, the factor matrices and the core tensor of the Tucker model have a different distribution of coefficients and thus the quantization approach was selected accordingly, as described below. A fixed bit length approach has been selected in order to simplify parallel decoding on the GPU.

4.1 Factor Matrices and Core Tensor Coefficients

The coefficients of the basis factor matrices U^(1...3) are normalized and distributed between [−1,1], due to the orthonormality of factor matrices in the Tucker model. Therefore, a uniform linear 8- or 16-bit quantization as in Eq. 1 can effectively be applied. We use a single min/max-pair to indicate the quantization range for all three factor matrices to minimize the number of coefficients that need to be loaded by the CUDA kernels.

˜

x_U = (2^Q^U −1) · x−x_min

x_max − x_min (1) As per definition of the Tucker model, the core tensor B captures the contribution of the linear bases combinations, i.e., the energy of the data, in its coefficients. The distribution of the signed coefficients is such that the first entry of the core tensor has an especially high absolute value close to the volume’s norm, capturing most of the data energy, while many other entries concentrate around zero. The prob- ability distribution of the other values between the two extrema is de- creasing with their absolute magnitude in a logarithmic fashion. Hence we apply a logarithmic quantization scheme as in Eq. 2 for the core tensor coefficients, using a separate sign-bit.

|x˜_B| = (2^Q^B − 1) · log₂(1 + |x|)

log₂(1 +|x_max|) (2) Special treatment is given to the one first high energy value mentioned before. It is known that this value, the hot-corner coefficient, is always at position B(0,0,0). Since it is one value and in order to give more space to the quantization range to the other coefficients, we optionally do not quantize this value and store it separately.

Various quantization levels for the other coefficients, Q_U and Q_B, could be used and analyzed. In practice, we have chosen a byte- aligned quantization of Q_U_,_B = 8- or 16-bit as a compromise between the most effective quantization and efficient bit-processing. The ef- fects of quantization as well as other tensor-specific optimizations are reported in Sec. 6.2 where we analyze the quantization error.

4.2 Storage Requirements

The basic storage needed for a volume dataset A of size of I₁ ×I₂ ×I₃, is I₁ · I₂ · I₃ · Q, where Q is the number of bits (bytes) per scalar value. A rank-(R₁,R₂,R₃) tensor approximation, however, only re- quires R₁ · R₂ · R₃ ·Q_B + (I₁ · R₁ + I₂ · R₂ +I₃ · R₃) · Q_U, in addition to three floating point numbers for the quantization ranges of the factor matrices (min/max values) and core tensor (max quantization value), and one floating point value for the hot-corner value. This first coefficient of the core tensor is (optionally) encoded separately from the remaining ones, leading to a reduced quantization range for Eq. 2.

3

To appear in an IEEE VGTC sponsored conference proceedings

data management system that divides the data into blocks is an impor- tant basis both to process and to visualize large datasets. Our method is based on the offline decomposition of the original volumetric dataset into small cubical bricks (subvolumes), i.e., third-order tensors, which are approximated, quantized and organized into an octree structure maintained out-of-core. The octree contains data bricks at different resolutions, where each resolution of the volume is represented as a collection of bricks in the subsequent octree hierarchy level.

Each brick has a fixed width B with an overlap of two voxels at each brick boundary for efficiently supporting runtime operations requiring access to neighboring voxels (trilinear interpolation and gradient computation). The width of the brick is flexible, but in this paper is set to B = (28 + 2 + 2) = 32, i.e., one brick is 32³, which has proved small enough to guarantee LOD adaptivity, while coarse enough to permit an effective brick encoding by the analysis of the local structure.

Each octree brick A ∈ R³ is tensor approximated using rank- reduced Tucker decomposition. A Tucker decomposition (see Ap- pendix A) is defined as A�= B ×1 U⁽¹⁾ ×2 U⁽²⁾ ×3 U⁽³⁾, where B is the so called core tensor and U⁽ⁿ⁾ are the factor matrices. A rank- reduced TA along every mode of the dataset is written with the notation: rank-(R₁,R₂,R₃) TA. As illustrated in Fig. 1, we compute for each brick of size B³ a rank-(R,R,R) TA, with R ∈ [1..B−1]. Typically, we use a rank reduction, where R = B/2, i.e., R = 16 for B = 32, following the rank reduction scheme used in other tensor approximation works [27, 23]. The resulting rank-reduced decomposition is quantized to further reduce memory usage (see Sec. 4) and stored in a out-of-core brick database. With each brick, we store a 64-bit binary histogram, which is used for transfer-function-based culling.

...

... ...

lowest resolution

highest resolution

B³bricks

core tensor and basis matrices U

B

A

Fig. 1. Multiresolution octree tensor decomposition hierarchy with B³ sized bricks.

The whole preprocessing is performed in a low-memory setting using a bottom-up process on a brick-by-brick basis, which is repeated until we reach the octree root. Leafs are constructed by sampling the original dataset, while non-leaf bricks are constructed from their previously constructed eight children, which are dequantized, reconstructed, and spatially averaged.

At run-time, an adaptive loader updates a view- and transfer function-dependent working set of bricks. The working set is incre- mentally maintained on the CPU and GPU memory by asynchronously fetching data from the out-of-core brick multiresolution TA structure.

Following the MOVR approach [12, 14], the working set is maintained by an adaptive refinement method guided by the visibility information fed back from the renderer. The adaptive loader maintains on GPU a cache of recently used volume bricks, stored in a 3D texture. At each frame, the loader constructs a spatial index for the current working set in the form of an octree with neighbor pointers.

For rendering and visibility computation, the octree is traversed using a CUDA stack-less octree ray-caster, which employs preintegrated scalar transfer functions to associate optical properties to scalar values, and supports a variety of shading modes [14]. The ray-caster works on reconstructed bricks, and reconstruction steps occur only upon GPU cache misses. The quantized tensor decomposition is dequantized and

reconstructed on demand by the adaptive loader during the visualization on the GPU (see Sec. 5).

In order to permit structural exploration of the datasets, the reconstruction can consider only the K most significant ranks of the tensor decomposition, where K ∈ [1..R] is chosen by the user. The reconstruction rank K can be changed during the visualization process with a rank slider. Lower-rank reductions give a faster outline of the visualized dataset and can highlight structures at specific scales [23], see also Sec.6. Higher K values add more details onto the dataset.

4 ENCODING OF COEFFICIENTS

As mentioned previously, the tensor and factor matrix coefficients take up unnecessary space if maintained as floating point values, see also storage cost analysis in Sec. 6.2. For compact representation of the tensor decomposition and to reduce the disk to host to device bandwidth during rendering, we apply a simple fixed bit length encoding based on tensor-specific quantization. In particular, the factor matrices and the core tensor of the Tucker model have a different distribution of coefficients and thus the quantization approach was selected accordingly, as described below. A fixed bit length approach has been selected in order to simplify parallel decoding on the GPU.

4.1 Factor Matrices and Core Tensor Coefficients

The coefficients of the basis factor matrices U^(1...3) are normalized and distributed between [−1,1], due to the orthonormality of factor matrices in the Tucker model. Therefore, a uniform linear 8- or 16-bit quantization as in Eq. 1 can effectively be applied. We use a single min/max-pair to indicate the quantization range for all three factor matrices to minimize the number of coefficients that need to be loaded by the CUDA kernels.

˜

x_U = (2^Q^U − 1)· x − x_min

x_max −x_min (1) As per definition of the Tucker model, the core tensor B captures the contribution of the linear bases combinations, i.e., the energy of the data, in its coefficients. The distribution of the signed coefficients is such that the first entry of the core tensor has an especially high absolute value close to the volume’s norm, capturing most of the data energy, while many other entries concentrate around zero. The prob- ability distribution of the other values between the two extrema is de- creasing with their absolute magnitude in a logarithmic fashion. Hence we apply a logarithmic quantization scheme as in Eq. 2 for the core tensor coefficients, using a separate sign-bit.

|x˜_B| = (2^Q^B −1)· log₂(1 +|x|)

log₂(1 + |x_max|) (2) Special treatment is given to the one first high energy value mentioned before. It is known that this value, the hot-corner coefficient, is always at position B(0,0,0). Since it is one value and in order to give more space to the quantization range to the other coefficients, we optionally do not quantize this value and store it separately.

Various quantization levels for the other coefficients, Q_U and Q_B, could be used and analyzed. In practice, we have chosen a byte- aligned quantization of Q_U_,_B = 8- or 16-bit as a compromise between the most effective quantization and efficient bit-processing. The ef- fects of quantization as well as other tensor-specific optimizations are reported in Sec. 6.2 where we analyze the quantization error.

4.2 Storage Requirements

The basic storage needed for a volume dataset A of size of I₁ ×I₂ ×I₃, is I₁ · I₂ · I₃ · Q, where Q is the number of bits (bytes) per scalar value. A rank-(R₁,R₂,R₃) tensor approximation, however, only re- quires R₁ · R₂ · R₃ · Q_B + (I₁ · R₁ +I₂ · R₂ + I₃ · R₃) · Q_U, in addition to three floating point numbers for the quantization ranges of the factor matrices (min/max values) and core tensor (max quantization value), and one floating point value for the hot-corner value. This first coefficient of the core tensor is (optionally) encoded separately from the remaining ones, leading to a reduced quantization range for Eq. 2.

3

R

₁

R

₂

R

₃

B

U

⁽³⁾

U

⁽¹⁾

U

⁽²⁾

I

₁

I

₂

I

₃

R

₁

R

₂

R

₃

• Factor matrices quantization

‣ values between [-1,1]

‣ linear quantization

• Core tensor quantization

‣ many small values; few large values

‣ logarithmic quantization

[Suter et al., 2011]

typedef qtucker3_tensor< R1, R2, R3, I1, I2, I3, T_value, T_coeff > qtucker3_t;

[vmmlib]

(27)

Parallel Tensor Reconstruction

27

parallel computing grid per brick

To appear in an IEEE VGTC sponsored conference proceedings

for the outer product between the corresponding column vectors in the factor matrices

A � = ∑

r

₁

∑

r

₂

∑

r

₃

b _r

₁

_r

₂

_r

₃

· u ⁽ _r ¹

₁

⁾ · u ⁽ _r ²

₂

⁾ · u ⁽ _r ³

₃

⁾ . (A.4) The sum of all theses weighted “subtensors” forms the approxima- tion A � of the original data A (see Fig. 13).

+ ...

= ... +

I

₃

I

₂

I

₁

b

_r₁_r₂_r₃

u⁽¹⁾r₁

u⁽²⁾r₂

u⁽³⁾r₃

A �

Fig. 13. Tensor reconstruction from Eq. A.4 visualized.

Another approach, is to reconstruct each element of the approxi- mated dataset individually, which we call voxel-wise reconstruction approach. Each element a � _i1i2i3 is reconstructed as shown in Eq. A.5, i.e., sum up all core coefficients multiplied with the corresponding co- efficients in the factor matrices (weighted product).

a � _i

₁

_i

₂

_i

₃

= ∑

r

₁

∑

r

₂

∑

r

₃

b _r

₁

_r

₂

_r

₃

· u ⁽ _i

₁

¹ _r ⁾

₁

· u ⁽ _i

₂

² _r ⁾

₂

· u ⁽ _i

₃

³ _r ⁾

₃

(A.5) A third reconstruction approach is to build the n-mode products along every mode [15] (notation: B × ⁿ U ⁽ⁿ⁾ ), which leads to a ten- sor times matrix (TTM) multiplication for each mode, i.e., dimension.

This is analogous to the Tucker model given by Eq. A.3. The n-mode product between a tensor and a matrix is equivalent to a matrix prod- uct as it can be seen from Eq. A.6. In Fig. 14 we visualize the TTM approach using n-mode products.

Y = X × ⁿ U ⇔ Y ₍ _n ₎ = UX ₍ _n ₎ , (A.6) where X ₍ _n ₎ represents the mode-n unfolded tensor, i.e., a matrix.

I

₁ _U⁽¹⁾

R

₁

R

₃

R

₂

×

1

B

^�

R

₃

I

₁

R

₁

(a) TTM 1

I₂

R₂

U⁽²⁾

I₁ R₂

R₃×² B^�

B^�� I₂

I₁

(b) TTM 2

I₃

R₃

U⁽³⁾

I₂ R₃

I₁×3

B^��

A �

I₂

I₃

(c) TTM 3

Fig. 14. TTM: tensor times matrix along the 3 modes (n-mode products).

Backward cyclic reconstruction after Lathauwer et al. [6].

Given the fixed cost of generating an I ₁ × I ₂ × I ₃ grid, the computa- tional overhead factor varies from cubic rank complexity R ₁ · R ₂ · R ₃ in the case of the progressive reconstruction (Eq. A.4) to a linear rank complexity R ₁ for the TTM or the n-mode product reconstruction (Eq. A.5). (For R = 16, the improvement to R ³ = 4 ^� 096 is 256-fold.) A CKNOWLEDGMENTS

This work was supported in part by a grant from the Forschungskredit of the University of Zurich, Switzerland, and by internal research funds at CRS4, Italy. The authors wish to thank Paul Tafforeau for the help of acquiring the tooth dataset, which was scanned at the SLS in Switzerland (experiment number: 20080205).

R ^EFERENCES

[1] vmmlib: A vector and matrix math library. http://vmmlib.sf.net.

[2] B. W. Bader and T. G. Kolda. Algorithm 862: Matlab tensor classes for fast algorithm prototyping. ACM Transactions on Mathematical Soft- ware, 32(4):635–653, 2006.

[3] J. Beyer, M. Hadwiger, T. M¨oller, and L. Fritz. Smooth mixed-resolution GPU volume rendering. In Proc. IEEE/EG Symposium on Volume and Point-Based Graphics, pages 163–170, 2008.

[4] T.-c. Chiueh, C.-k. Yang, T. He, H. Pfister, and A. E. Kaufman. Inte- grated volume compression and visualization. In Proc. IEEE Visualiza- tion, pages 329–336. Computer Society Press, 1997.

[5] C. Crassin, F. Neyret, S. Lefebvre, and E. Eisemann. GigaVoxels: Ray- guided streaming for efficient and detailed voxel rendering. In Proc. Sym- posium on Interactive 3D Graphics and Games, pages 15–22. ACM SIG- GRAPH, 2009.

[6] L. de Lathauwer, B. de Moor, and J. Vandewalle. A multilinear singular value decomposition. SIAM Journal of Matrix Analysis and Applications, 21(4):1253–1278, 2000.

[7] M. Do and M. Vetterli. Pyramidal directional filter banks and curvelets.

In Proc. IEEE Image Processing, volume 3, pages 158–161, 2001.

[8] M. Do and M. Vetterli. The contourlet transform: an efficient direc- tional multiresolution image representation. IEEE Trans. Im. Proc., 14(12):2091–2106, 2005.

[9] K. Engel, M. Hadwiger, J. M. Kniss, C. Rezk-Salama, and D. Weiskopf.

Real-Time Volume Graphics. AK Peters, 2006.

[10] N. Fout, H. Akiba, K.-L. Ma, A. Lefohn, and J. Kniss. High quality ren- dering of compressed volume data formats. In Proceedings Eurographics, pages 77–84, Jun 2005.

[11] N. Fout and K.-L. Ma. Transform coding for hardware-accelerated vol- ume rendering. IEEE Transaction on Visualization and Computer Graph- ics, 13(6):1600–1607, 2007.

[12] E. Gobbetti, F. Marton, and J. A. I. Guiti`an. A single-pass GPU ray casting framework for interactive out-of-core rendering of massive volu- metric datasets. The Visual Computer, 24(7-9):797–806, Jul 2008.

[13] S. Guthe, M. Wand, J. Gonser, and W. Strasser. Interactive rendering of large volume data sets. In Proc. IEEE Visualization, pages 53–60, 2002.

[14] J. A. Iglesias Guiti´an, E. Gobbetti, and F. Marton. View-dependent ex- ploration of massive volumetric models on large scale light field displays.

The Visual Computer, 26(6–8):1037–1047, 2010.

[15] T. G. Kolda and B. W. Bader. Tensor decompositions and applications.

Siam Review, 51(3):455–500, Sep 2009.

[16] P. Ljung, C. Lundstrom, and A. Ynnerman. Multiresolution interblock interpolation in direct volume rendering. In Proc. Eurographics/IEEE TCVG Symposium on Visualization, pages 259–266, 2006.

[17] P. Ljung, C. Winskog, A. Persson, C. Lundstrom, and A. Ynnerman. Full body virtual autopsies using a state-of-the-art volume rendering pipeline.

IEEE Transactions on Visualization and Computer Graphics, 12(5):869–

876, Oct 2006.

[18] NVIDIA. CUDA C best practices guide, version 3.2 edition, Aug 2010.

[19] NVIDIA. CUDA C programming guide, version 3.2 edition, Nov 2010.

[20] R. Parys and G. Knittel. Giga-voxel rendering from compressed data on a display wall. In WSCG, 2009.

[21] F. Rodler. Wavelet based 3D compression with fast random access for very large volume data. In Proc. Pacific Graphics, pages 108–117, 1999.

[22] J. Schneider and R. Westermann. Compression domain volume rendering.

In Proc. IEEE Visualization, pages 293–300, 2003.

[23] S. K. Suter, C. P. Zollikofer, and R. Pajarola. Application of tensor ap- proximation to multiscale volume feature representations. In Proc. Vision, Modeling and Visualization, pages 203–210, 2010.

[24] Y.-T. Tsai and Z.-C. Shih. All-frequency precomputed radiance transfer using spherical radial basis functions and clustered tensor approximation.

ACM Transactions on Graphics, 25(3):967–976, 2006.

[25] J. E. Vollrath, T. Schafhitzel, and T. Ertl. Employing complex GPU data structures for the interactive visualization of adaptive mesh refinement data. In Proc. Volume Graphics, pages 55–58, 2006.

[26] H. Wang, Q. Wu, L. Shi, Y. Yu, and N. Ahuja. Out-of-core tensor approx- imation of multi-dimensional matrices of visual data. ACM Transactions on Graphics, 24(3):527–535, Aug 2005.

[27] Q. Wu, T. Xia, C. Chen, H.-Y. S. Lin, H. Wang, and Y. Yu. Hierarchical tensor approximation of multidimensional visual data. IEEE Transactions on Visualization and Computer Graphics, 14(1):186–199, Feb 2008.

[28] B.-L. Yeo and B. Liu. Volume rendering of DCT-based compressed 3D scalar data. IEEE Transactions on Visualization and Computer Graphics, 1(1):29–43, 1995.

9

+ ...

= ... +

I

₃

I

₂

I

₁

b

_r₁_r₂_r₃

u⁽¹⁾r₁

u⁽²⁾r₂

u⁽r³₃⁾

A �

triple-for-loop

ã

computational cost per voxel is cubic: O(R

³

)

ã ã ã ã ã ã ã ã

(28)

Faster Parallel Tensor Reconstruction

28

parallel computing grid per brick

I

₁

_U

⁽¹⁾

R

₁

tensor times matrix (TTM)

multiplication or n-mode product

R

₁

R

₃

B

^�

R

₃

I

₁

R

₁

R

₃

R

₂

×

¹

B

^�

R

₃

I

₁

(29)

Faster Parallel Tensor Reconstruction

29

parallel computing grid per brick

TTM1 TTM2 TTM3

store intermediate results (B’ and B’’)

I₃

R₃

U⁽³⁾

I₂ R₃

I₁×3

B^��

A�

I₂

I₃ I₂

R₂

U⁽²⁾

I₁ R₂

R₃×² B^�

B^�� I₂ I₁

I₁ _U⁽¹⁾ R₁

R₃

R₂×1

B

B^� R₃

I₁ R₁

computational cost per voxel is linear: O(R)

ã

(30)

Compute Intermediate Tensor B’

30

parallel computing grid per brick

TTM1

I₁ _U⁽¹⁾ R₁

R₃

R₂×1

B

B^� R₃

I₁ R₁

b’ b’ b’ b’ b’ b’ b’ b’

(31)

Compute Intermediate Tensor B’’

31

parallel computing grid per brick

TTM1 TTM2

I₂

R₂

U⁽²⁾

I₁ R₂

R₃×² B^�

B^�� I₂ I₁

I₁ _U⁽¹⁾ R₁

R₃

R₂×1

B

B^� R₃

I₁ R₁

b’’ b’’ b’’ b’’ b’’ b’’ b’’ b’’

Implementation Examples in Scientiﬁc Visualization

Tutorial: Tensor Approximation in Visualization and Graphics

Implementation Examples in Scientific Visualization

Renato Pajarola, Susanne K. Suter, and Roland Ruiters

Outline

• Tensor classes in MATLAB and vmmlib

‣ Downloads:

– http://www.sandia.gov/~tgkolda/TensorToolbox

– https://github.com/VMML/vmmlib

‣ Typical tensor operations

‣ Toy examples (see folder: vmmlib_ta_demo)

‣ Test dataset (see folder: vmmlib_ta_demo)

• GPU-based tensor reconstruction

Typical TA Operations

• Create a tensor (memory mapping)

• Unfolding

• TTM

‣ core generation vs. reconstruction

• Create tensor models (Tucker, CP)

• Algorithms (HOSVD, HOOI, HOPM)

Tensor: A Multidimensional Array

• MATLAB: N-way tensor

‣ M = ones(4,3,2); (A 4 x 3 x 2 array)

‣ A = tensor(M,[2 3 4]); (M has 24 elements)

‣ A = tenones([3 4 2]);

‣ A = tenrand([4 3 2]);

• For details on the MATLAB tensor toolbox see toolbox documentation

I 1 a

i

= 1 , . . . , I

I 1 A

i

= 1, . . . , I

I

I

I 1 A

i

= 1 , . . . , I

I

a ...

a scalar 1

order tensor or vector 2

order tensor or matrix 3

order tensor or volume

Test Dataset: Hazelnut

• A microCT scan of a dried hazelnut (acquired at the UZH)

• I 1 = I 2 = I 3 = 512

• Values: unsigned char (8bit)

I

I

A

I

A Matrix in vmmlib

• I 1 (M) rows

• I 2 (N) columns

• The matrices are per default column-major ordered

• A matrix is an array of I 2 (N) columns, where each column is of size I 1 (M)

matrix< I1, I2, Type >

matrix< 4, 3, unsigned char > m;

I 1 A

i

= 1, . . . , I

I

A Tensor3 in vmmlib

A

I

I

I

• A tensor3 A in vmmlib is an array of I 3 matrices each of size I 1 times I 2

• The matrices are per default column-major ordered

• For each tensor3, the explicit size and the type of the values is requested

• A tensor3 is internally allocated and deallocated as pointer while the matrices are not

tensor3< I1, I2, I3, Type >

tensor3< 4, 3, 2, unsigned char > t3;

A Tensor4 in vmmlib

A

I

I

I

• A tensor4 in vmmlib is an array of I 4 tensor3s

I ₁ a

I ₁ A

I ₁ A

a ^...

I ₁ A

B _(n) ^I

A _(n) ^J

n-mode product ⁽ ^B ×