T. Kuhlen, R. Pajarola, and K. Zhou (Editors)
Optimal Multi-Image Processing Streaming Framework on Parallel Heterogeneous Systems
Linh K. Ha, Jens Krüger, Joao Comba, Sarang Joshi, Cláudio T. Silva
Scientific Computing and Imaging Institute, University of Utah
Abstract
Atlas construction is an important technique in medical image analysis that plays a central role in understanding the variability of brain anatomy. The construction often requires applying image processing operations to multiple images (often hundreds of volumetric datasets), which is challenging in computational power as well as mem- ory requirements. In this paper we introduce MIP, a Multi-Image Processing streaming framework to harness the processing power of heterogeneous CPU/GPU systems. In MIP we introduce specially designed streaming algo- rithms and data structures that provides an optimal solution for out-of-core multi-image processing problems both in terms of memory usage and computational efficiency. MIP makes use of the asynchronous execution mechanism supported by parallel heterogeneous systems to efficiently hide the inherent latency of the processing pipeline of out-of-core approaches. Consequently, with computationally intensive problems, the MIP out-of-core solution could achieve the same performance as the in-core solution. We demonstrate the efficiency of the MIP framework on synthetic and real datasets.
1. Introduction
Multi-image processing is an advanced image process- ing technique that is widely used in medical imag- ing [CMVG96], video processing [BJW95,RDK∗98], as- tronomy [Boy92,Ala92], visual robot control [DM95], vir- tual reality [SSS06,SGSS08], modeling and reconstruc- tion [GSC∗07,HE07], etc.
One example is the atlas construction technique, which plays a central role in medical image analysis, particularly in understanding the variability of brain anatomy [CMVG96, DFBJ07,JDJG04]. The atlas construction projects a large set of images to a common coordinate system, creates a statisti- cal average model of the population, and performs a regres- sion analysis of anatomical structures. This average serves as a deformable template which maps detailed atlas data such as structural, developmental, genetic, pathological, and func- tional information onto the individual or entire population of the brain.
Examples like this one show how multi-image process- ing techniques provide benefits over single-image process- ing, at the expense of introducing major computational chal- lenges: First, they involve huge amounts of data that easily exceed the direct processing capability of the system. Sec- ond, they require massive amounts of computation, which
results in the computations requiring days or even months to complete. As a result, using multi-image techniques of- ten involve super-computing systems [CMVG96] or large- scale clusters to run [SGSS08,HKF∗09], which limits the use of multi-image processing techniques to large laborato- ries. A solution based on commodity hardware will make this technique available to smaller labs, increase the influ- ence of these techniques in research, and presents robust so- lutions for many existing problems.
In this paper, we discuss a solution for the multi-image processing problems on commodity hardware using graphic processing units (GPUs) combined with an out-of-core streaming model. The contributions of our paper are:
• We introduce a high-performance, multi-image process- ing framework with a proof-of-concept optimal streaming model.
• We define basic building blocks of a general framework which allow efficient-implementation multi-image algo- rithms.
• We introduce concepts for implicit and explicit pipelining and prove that these are optimal solutions.
• We analyze reasons for streaming degeneracy and provide a solution based on an order-independent model.
• Our performance analyses serve as the guidance to help
c The Eurographics Association 2011.
Figure 1:Age regression anlysis on the ADNI dataset by computing the average brain atlases at different ages (65, 70, 75, and 80) confirms the proposition that fluid space is larger because brains atrophies overtime. This analysis, however, could only be performed if the system is capable of processing the whole dataset of 300-healthy brain-images (144×192×160)
developers to profile performance and to make quantita- tive decisions.
2. Related work
In the last decade, there is an emerging trend in High Per- formance Computing research community (HPC) to use het- erogeneous processing systems such as Cell processors, FP- GAs, and multi-core GPUs to replace the conventional su- percomputing model. Super-computing systems based on the heterogeneous model have been successfully exploited in some of the fastest computing system, such as the current number one super-computer Tianhe-1A, the first computing system to achieve 2.5 petaflop/s [Meu10].
With hundreds of simple, computation-centric processing cores, the GPU processing model has proven to be highly scalable to many problems especially image processing by providing massive computational power. Modern GPUs can offer a few Tera-flops of peak performance per unit, pro- viding processing power equivalent to a super-computer in mid-90s, while being much more cost-and energy-efficient.
The huge in-core memory bandwidth, which has historically doubled every two years, is another advantage of GPU sys- tems over the conventional processing model, adding sub- stantial speed increases to GPU centric processing model.
There have been a number of image processing applications
implemented on GPUs [RSM10,SRC09,EAK10], most of which achieve from 20x to several magnitudes of speedup over CPU versions. Conceptually, our streaming framework is an extension to the idea of the fast GPU image-processing framework by Haet al.[HKF∗09,HKJS11]. Their method achieved 60x speed up in comparison to an optimized, fully- parallel version running on an eight-core Xeon server for Greedy Iterative Differmorphic Atlas construction problem.
While the use of GPUs appears to be a good solution to the computing requirements of multi-image processing tech- niques, the large memory footprint remains an open prob- lem. Though providing ample memory bandwidth, the size of the on board GPU memory is very limited. But as GPU programs can only access on-board, all required data needs to be present on the card, so out-of-core methods need to be employed.
There are three primary approaches to out-of-core pro- gramming. The first is to use virtual memory based on op- erating system support. It is simple and unified for both in- core and out-of-core processing. However, due to a lack of application-specific knowledge about the data dependence and parallelism, this method often leads to a poor perfor- mance [WGRW02]. The second approach is to use compiler directed I/O to convert a program from in-core to out-of- core [BCK∗95,MDK96,BMK01]. For programs with com- plicated data dependencies this approach is not as effec- tive as the third approach that we use here: the explicit I/O controls by developers. These methods concentrate on tech- niques to improve the cache coherency such ascachingand prefetching[Mac94,CDS05,CESL∗03,HYWL07,BWP04]
to reduce the I/O necessary for blocks already in main mem- ory and/or by overlapping I/O operations with main-memory computations. The methods exploit particular computational properties of each individual problem as part of the algo- rithm design. While the explicit I/O controls are mostly application-specific, our method is able to be applied to a wide class of applications such as the out-of-core multi- image processing.
Our out-of-core strategy exploits two key performance concepts: prefetching and data-transfer-hiding based on an asynchronous streaming execution model. Asynchronous processing is a pipeline-concurrent execution model that ex- ploits the availability of multiple execution units in the sys- tem to run independent tasks concurrently. This strategy re- duces idle stages and increases the resource usage. It can also hide data transfer by prefetching data. When processing units finish current tasks, they can start the next tasks with- out delay. In many circumstances, using this model signif- icantly increases the overall system throughput.The similar concept has been successfully applied in video processing pipeline where video encoding/decoding/composition and filtering inherently require out-of-core multi-image stream processing [BJW95,RDK∗98].
The asynchronous processing is realized with streaming
I1
Multi-Input Single-Output (MISO)
•add, mul, sub, divide, normalized
•convolution, filter
•max, min, range
•average, accumulate I2
In
O1
O2
On
I1
I2
In
O Multi-Input Multi-Output
(MIMO)
Figure 2:Basic Multi-Image Operators
models for both tasks and data. Streaming is an efficient model for parallel processing in that a task is divided into smaller entities to allow their parallel executions. A stream is an abstraction of an execution unit; in particular, it rep- resents a sequence of commands that are executed or ac- cessed in a particular order. Pure data streams determine data parallelism processing model, while pure task streams determine the task parallelism model. A stream in practice may be either data-based or tasked-based or even a mixture.
The only restriction in a stream is the execution order that is satisfied by a sequential consistency model [Lam79], which makes a stream equivalent to a synchronous process. Differ- ent streams, on the other hand, may execute their commands out-of-order with respect to each other.
3. Multi-Image Processing Operators
As we can see from an example of the atlas construction Al- gorithm1[HKJS11], a multi-image algorithm involves sev- eral multi-image operations, most of which are direct exten- sions of single-image processing operations through a for loop over all the input. We build our multi-image processing framework upon the single-image high-performance multi- scale processing framework proposed by Haet al.[HKF∗09]
so that we are able to exploit the optimized performance of the existing framework.
We define the multi-image processing framework using a construction method that builds regular multi-image op- erators from basic building blocks. This strategy allows us a fine-grained and multi-level parallelism in that we could exploit different execution strategies on each implementa- tion level to make use of available resources. Here, we clas- sify basic multi-image operators into two main groups based on Flynn’s taxonomy [Fly72]: the Multiple-Input-Multiple- Output operators (MIMO) and the Multiple-Input-Single- Output operators (MISO).
The basic MIMO operators are defined as functions with equal numbers of inputs and outputs, whereas the n-th out- put image depends solely on the n-th input image(Figure2a).
These functions are the most frequently used in multi-image processing as they are direct extensions of single-image op- erations. Examples for such operations include adding, shift- ing, scaling, smoothing, filtering, de-noising images, and normalizing the intensity range.
Complex MIMO operations: n - inputs , m - outputs I1
I2
In
O1
O2
Om
I1
I2
In
O1
m instances of MISO I1
I2
In
O2
Figure 3:General Multi-Input Multi-Output operators
sliding window MIMO
I1 I2 I3 I4 I5 In
O2 O3 O4 O5 On
O1
I1 I2 I3 I4 I5 In
O2 O3 O4 O5 On
O1
current-buffer new input
basic MIMO-equivalent Sliding window MIMO operations
Figure 4:Sliding window MIMO operators
The MISO operators, as illustrated in Figure2b, produce a single or few outputs. Examples for such operations in- clude the computation of an average image, the image en- ergy, cross-correlation, cross-product of images, and finding the maximal and minimal values.
The implementation of general multi-image operators is based on a decomposition strategy that breaks a complex function into multiple, basic operations. For example, a gen- eral MIMO function that has a number of outputsMwhich is different from the number of inputsN, and the k-th out- put depends on multiple inputs, could be implemented asM instances of a MISO operator as shown on Figure3.
Another group of frequently used multi-image operators is the sliding-window operator (Figure 4a). This operator computes an output image based on all values in a fixed- size sliding window of the input. This window moves as we compute the next output image. As shown on Figure 4b, if we keep an input buffer with the size of the sliding window, as the window moves, we need to replace an entry of the window with the new input data. In other words, the compu- tation of a current output requires only a single input. Algo- rithmically, it is equivalent to the basic MIMO model.
Overall, we can implement arbitrarily complex multi- image functions based on the basic MIMO and MISO func- tions. We focus our discussion on how to efficiently imple- ment these out-of-core operators.
Note that the framework of Ha et al. already has sup- port for multi-image and large data processing through the GPU-cluster implementation using MPI. It also offers a multi-GPU implementation to exploit available computing resources and to increase the amount of in-core GPU mem- ory on a single processing node. Both approaches, however, have the limitation that they depend on the total amount of system memory. The out-of-core approach we introduce here, however, has no restrictions on data input and can
process the entire 3D-image brain dataset in a PC desk- top equipped with commodity GPUs. Hence, our solution is more complete and accessible to researchers and scientists.
We also offer a more flexible solution to existing methods with two levels of streaming operations: out-of-core GPU in-core-CPU, and fully out-of-core. The former utilizes the availability of the larger CPU memory system; in some cases the CPU (but not the GPU) memory may be sufficient for the entire computation. In the latter case, the dataset does not even fit into CPU memory and the data must be trans- ferred through two memory levels: between disks and CPUs, and between CPUs and GPUs. We show that our streaming strategies could be generalized through multiple memory hi- erarchy levels. In the following discussion, GPUs are pro- cessing devices in the first out-of-core level; consequently, in-core memory refers to the GPU global memory while the CPU memory plays the role of storage devices.
4. MIP Out-of-core Framework
We use a synchronous implementation of the MIMO (Algo- rithm2) and MISO (Algorithm3) operators as references for the correctness and performance improvement of our asynchronous implementations. We compare different meth- ods to implement out-of-core multi-image operations: an implicit model, a hardware-aware model, and a hardware- independent model. We will prove that the proposed strate- gies are optimal. But first, lets do some analyses on the best achievable performance of an asynchronous algorithm.
4.1. Asynchronous optimal performance analyses To evaluate the performance, we use a typical hardware con- figuration with three components: one computational unit (GPU) and two data transfer units(one for uploading, the other for downloading data). For performance analysis, we use following notation:
• n: the number of input images
• ns: the number of execution units
• τi,j: the runtime of thei-th execution unit on thej-th input image.
• Ts,Ta : the total synchronous/asynchronous processing time
• Tu,Te,Td: the uploading, executing, and downloading run- time per image (Figures5,6,7).
• Tithe total amounts of time spent by the execution uniti
• Tu=n×Tu,Te=n×Te,Td=n×Td: the total amounts of time spent on upload, execution and download process.
• Tmax = max(T1,T2,· · · Tns) the maximum amounts of time spent by a single execution unit.
Our analysis is based on the assumption that all images have similar sizes, and therefore require almost the same amount of running time. This assumption is normally sat- isfied with pre-processing multi-image data.
stream1
stream2
streamn-1
1 1 1
2 2
n-1
n n
streamn
2
n
1 1 1 2 2 2
no-stream
Implicit-MIMO
upload execution download
CPU to GPU memory transfer
GPU to CPU memory transfer GPU program execution
n-1 n-1
Ts=n×(Tu+Te+Td)
Ta=n×Tmax+ (Tu+Te+Td−Tmax) Tmax=max(Tu, Te, Td)
Figure 5:Implicit processing model for MIMOs
Upload
Stream 1
1 1 2
2 3
2 3
4
n-2 n-1
n n
n-1 n
Hardware-aware MIMO with 2 DMAs and 1 ALU
Ta=n×Tmax+ (Tu+Teu+Ted+Td−2×Tmax)
Execution Stream Download Stream
Figure 6: Pipeline processing model for Multiple Images Multiple Output
First, we determine the optimal asynchronous runtime, which we use as a reference to evaluate the efficiency of proposed implementation method. In the ideal case, all ex- ecution units run independently parallel. However, as a sin- gle execution entity, they perform tasks in sequential order.
The total amounts of time that an execution unit spends is Ti=∑nj=1τi,j that equals n×τi whereτi runtime ofi-th stream on a single-image. Since the multi-image operation is only completed when all the execution units have com- pleted their tasks, the runtime the entire operation will be at leastTmax=max(T1,T2,· · · Tns)orTa≥ Tmax=n×τmax. This is the optimum runtime that the system can accomplish.
Note that with the hardware configuration of upload, exe- cution, and download unitsτmax =Tmax=max(Tu,Te,Td) (Figure5).
4.2. Implicit Streaming Model
Theimplicit streaming model(Algorithms4) is solely based on data parallelism that assigns each image to a stream which works as a logical execution unit that performs the entire processing pipeline (Figure5). As streams operate on different memory spaces, the data transfer on a stream can be overlapped with processing tasks on other streams. This is a contrast to theexplicit streams(Algorithms5 6):hardware- awareandhardware-independentmodels, which depend on task parallelism. The former maps each hardware execution unit to a single stream while the latter delineates a stream to a fixed function.
Figure5illustrates the execution of an implicit streaming model for a MIMO problem (Algorithm4). It can be seen that with the number of images being significantly larger than the number of streams, the overall processing time is
approximately n×Tmax which is the optimal runtime of asynchronous processing.
4.3. Hardware-Aware Streaming Model
The execution of the hardware-aware processing model for MIMO problems is illustrated in Figure 6. In this model, there are three streams mapping to three execution devices.
Timing analysis of the method shows that the processing time in this case is also optimal. Because the hardware- aware model reflects the actual execution of asynchronous processes in the system, it requires developers prior informa- tion about the architecture of the underlying system. That is, it requires different implementation on different hardware.
4.4. Hardware-Independent Streaming Model
The last processing strategy, the hardware-independent model, is a generalization of the hardware-aware model. In- stead of decomposing tasks based on actual hardware con- figuration, we assume that there exists one special execu- tion unit for every task, and we can assign each task a sin- gle stream. In the case of MIMO operations, there are three primary tasks to apply to each image: the data upload, the processing, and the data download. On a system with two data transfer units and one processing unit, it results in a streaming scheme similar to hardware-aware models; con- sequently, this model also achieves the optimal runtime.
Normally, however, there are more tasks than the actual number of execution units. In this case it is possible that sev- eral tasks are mapped to the same execution unit, for exam- ple, data uploading and downloading will map to the same unit in a single-data-unit system. The question is how effi- cient it is when it incorrectly predicts the underlying sys- tems, in particular, when there are multiple streams sharing the same execution unit.
Data independence results in no performance loss, as the system can instantly switch between one task and the other.
This function is done automatically as sharing info is avail- able only at the system level. Figure7shows the runtime analysis of an optimal solution for MIMO operation on a system with one DMA and one ALU using the hardware- ware and hardware-independent implementation. The result shows that although the hardware-independent model incor- rectly predicts the underlying execution system, it still per- forms optimally.
4.5. Discussion on streaming modes
The primary advantage of the implicit approach is that devel- opers are relieved from the burden of asynchronous schedul- ing. Furthermore, the stream has the same execution flow as processing a single-image, no further change is required, and no synchronization is needed since each stream works on different data. However, it has several disadvantages:
MIMO - Single DMA system
1 1
1 2
2 3
3
n-2 n-1
n n
n-1 4
Execution 1
1 2
2 3 n
Tu
Sync-point
n-1 (a) Hardware-aware model
(b) Hardware-independent model
1 3 2 n n-1 n
n
Ta=Tu+ (n−1)×Tm+Ted
Ta=Tu+Teu+ (n−2)×Tm+Ted+Td Upload-
Download
Upload
Execution Download
Figure 7: Although the hardware-independent model mis- matchs the system configuration, the performance is still op- timal
• The method does not reduce the memory usage and all the data must be loaded in-core. Hence, this method cannot be used for out-of-core processing.
• It requires the capability of decomposing input data and combining output results that is not always satisfied.
• Although automatic scheduling hides executions from de- velopers, understanding the physical execution is essen- tial to profile the performance and to estimate the benefit of the method. This estimation is an important factor for making optimization decisions.
• The performance efficiency of the implicit streaming model is largely dependent on the scheduling algorithm used by the operating system or the concurrent controller.
In fact, the optimal scheduling problem is NP-hard. This explains why, in practice, this approach does not always provide the predicted optimal performance.
• The implicit model has an order-dependency that limits the execution of the streams. Particularly, all streams ex- ecute in the same order of the logical flow: uploading- processing-downloading. However, flexible reordering is an effective strategy to handle degenerate cases, including synchronous functions calls.
Most of the weaknesses of the implicit model can be han- dled by explicit approaches.
• Explicit methods require a much lower memory footprint, which is equal to the number of hardware devices with the hardware-aware model or number of decomposed tasks with the hardware-independent model. That means they are suitable for out-of-core processing.
• As it is always possible to divide an out-of-core algorithm into three primary tasks, it is easier to decompose tasks than partition data.
• The explicit method uses an explicit scheduler. That means the execution is controlled, providing several ben- efits. First, developers can profile the performance before they actually run it. Second, it reduces the complexity of scheduling problem to trivial mapping, so it is even opti-
mal without any automatic scheduler supports. And third, it helps to understand why degeneracy happens, how it af- fects the performance, and how to deal with it.
5. Re-ordering stages in streaming models
The aforementioned approaches are simple and theoretically optimal. They are straightforward to transfer from single- image processing to multi-image processing through the generalization of basic multi-image operators. However, the optimal performance is hardly achieved in practice, the pri- mary reason as we show here is the streaming degeneracy.
5.1. Forced Synchronizations
There are three primary reasons that degeneracies may ap- pear in streaming models
• Synchronous function calls
• Asynchronous stream mismatches
• Cross-stream function calls
The most common reason for an unintended synchronous function call is that the application requires an external call to a library function that was designed for synchronization execution. Another reason is the mixed use of synchronous and asynchronous functions.
Even when all functions support asynchronous execution, they might be designed using different schemes. The strate- gies are often incompatible and cannot work together effi- ciently. For example, a kernel function defined to run on a logical stream, named 0, is incapable of running in-parallel with a data-transfer function on the physical stream with the same identity. These functions frequently require ex- plicit synchronization to switch between the different asyn- chronous modes.
Cross-stream calls occur when the implementation re- quires data access and computation to or from different streams. As a result, the compiler force these streams to synchronize at cross-reference points to preserve the seman- tic order of the original program. One example is the tra- ditional implementation of the class of reduction functions in CUDA. Though the computations run in-core on GPU- devices, the output of these functions, which are typically used for branching on CPU-host, require the result to be copied from device memory to host memory. This opera- tion is a cross-stream function between the computational stream on the devices and the transfer data stream between devices and host. The popularity of the reduction functions is the main obstacle for applying asynchronous models on ex- isting GPU architectures. Our solution for the reduction-like function is an on-device model that outputs the result only to device memory. It requires subsequent functions to use on- device parameters, and to delay or remove the branching in the codes.
i i i
i
i i i-1
i-2
i+1
i-1
i+1 i
i i+1 i
i-1 i-1 i-1
i i+1 i+2
loop order (time)
orderindependenc
loop order (time) upload execution download
Stage Reordering
original execution order
Figure 8:The transformation from a synchronous model to an explicit streaming model preserves semantic correctness
5.2. Re-ordering Pipeline Stages
Still, in many cases, a forced synchronization is unavoidable, though negative effects can be minimized using a reordering technique. This out-of-order execution, is applied in modern compilers to reduce the number of mis-predicted branches, to avoid data spilling, to keep the instruction pipelines filled, and especially to allow parallel execution on a system of multi-processors.
In this case of streaming with degeneracy, the reordering optimization cannot be done automatically using the com- piler. The reason is that the uploading and downloading are the IO processes which have the side effects. This constrains the order of function execution and requires the compiler- generated code to execute in the same order as it appears in the API levels. Even worse, the forced synchronous func- tions impose a restriction in the order of the outputs. So re- ordering without compiler support must be done explicitly.
Allowing different streams working on independent im- ages allows our explicit models to break the order-execution dependency inside the loop, replacing it with an equivalent order-independent streaming model. As shown in Figure8, the order dependency of the original loop is still preserved in the order of loop execution. In other words, the logical correctness of the processing model is guaranteed by con- struction.
As the order of streams inside a loop becomes unim- portant, we can change the order of streams at the API level from the regular order of upload-process-download to upload-download-process, or process-upload-download.
The ability to change the processing order allows stream- ing optimization. This optimization is particularly effective when asynchronous stream degeneracy is unavoidable.
In the implicit model, when the synchronizations exists in the execution process, it is unable to overlap the uploading and downloading stream as the uploading process has to fin- ish before the synchronization points, while the download- ing only happens after the synchronization points. As shown on Figure9, changing the order of streams in the code using the explicit model allows the upload and download stream
Linh K. Ha et al. / Multi-Image Processing Streaming Framework
i+1 i
i-1 i
i-1 i-2
i+1 i
i-1 i
i-2
i+1 i-1
i+2 i+1
i i+2
i sync-
point Stage U-D-EU-E-D
Reordering Stage reordering benefit
with a forced-incident sync-point in MIMOs
sync- point
sync- point
sync- point sync- point
sync- point
Figure 9: Streaming optimization using reordering tech- nique. As shown on the figure it is able to eliminate the neg- ative effect of forced-synchronous function
1
1 1
1 1 2
2
2 2
5 5 3
3 4
2 4 6 3
4 3
Functional streams
Out-of-Core MIP Hardware-independent Streaming Mode
upload
execution download
CPU to GPU memory transfer GPU to CPU memory transfer GPU program execution
d-upload disk to CPU memory transfer d-download CPU to disk memory transfer
Figure 10: The implementation of hardware-independent model for “full” out-of-core multi-image processing
to be fully overlapped even when a synchronization point is present. Thus, reordering helps reduce the run-time per iter- ation as well as the overall run-time. The ability to seman- tically reorder the stream execution in the code allows us to adapt a performance heuristic that profiles the performance and selects the optimal order.
6. Extension to a full out-of-core framework
The extension from the partial out-of-core model with one level of memory hierarchy to a full out-of-core model with a two-memory levels comes naturally with the hardware- independent model. By adding two more stages to the algo- rithm decomposition-the upload from disk to CPU memory and download from the CPU memory to disk-we realize the transition to a fully out-of-core model. The execution of this model for MIMO operation was displayed on Figure10.
Using the same logic as the partial out-of-core model, we can prove that the hardware-independent model for out-of- core processing is optimal. Note that we use the term “full”
to mean that the data could be stored on the hard drives of a single machine. However, our hardware-independent model could be further extended to the other out-of-core models, such as the data stream on the network and the system with higher memory hierarchy levels, and we can still prove that the proposed models are optimal.
7. Results
The system we used in our experiment is a PC desktop, Intel Core i7-980X, 12-GB DDR3 1600, with a single NVIDIA GTX 480. Communication from the host to GPU is via the external x16 PCIe bus and is controlled by a single DMA.
10 40 65 100 120 130 200
0.08 347 53 322 720 670 674 674 672 672 672
0.3 347 204 322 874 673 679 679 672 672 672
0.5 347 334 322 1003 679 682 692 672 672 672
0.77 347 515 322 1185 849 853 873 683 683 683
0.93 347 619 322 1289 953 957 977 690 690 690
1 347 671 322 1339 1006 1010 1028 695 695 695
1.54 347 1031 322 1700 1366 1370 1390 1057 1057 1057
Ratio U E D Sync Impl U_ED UED
0.08 0.3 0.5 0.77 0.93 1 1.54
347 53 322 720 670 674 672
347 204 322 874 673 679 672
347 334 322 1003 679 682 672
347 515 322 1185 849 853 683
347 619 322 1289 953 957 690
347 671 322 1339 1006 1010 695
347 1031 322 1700 1366 1370 1057
0 1000 2000
0.08 0.3 0.5 0.77 0.93 1 1.54
Asynchronous runtime - ideal condition
Time(ms)
Processing ratio r = E / (U + D) Sync
Implicit=U_ED=UD_E UED=EDU=DUE
E U,D
Figure 11: Runtime comparison of different streaming strategies in ideal conditions.
The program is compiled with CUDA NVCC 3.1. Run-time of each function is measured in milliseconds.
We made a synthetic test on a data set of 32 volumes, sized 256×256×256. The test mimics a typical out-of-core multi-image processing program using three processes: up- load, execution, and download. Note that the execution time and data-transfer times scale propotionally to the number of images and the sizes of the image, we also achieve similar performance curves with different number of images rang- ing from 10 to 180 (the maximum number of volumes we can fit onto the 12GB of memory).
The existing architecture on commodity hardware has sin- gle DMA unit, so the upload and download process has to be performed sequentially. This information allows a two-device, hardware-aware model with only two memory buffers. There are two options for its implementation: (1) the upload of thek-th volume in parallel with the execu- tion and the download of(k−1)-th volume (U_ED);(2) the upload and execution of the k-th volume in parallel with the download of(k−1)-th volume (UE_D). Our hardware- independent model still decomposes the algorithm into three processes regardless of the system configuration. There are six permutations for the implementation of the hardware independent model, however, here we report the perfor- mance for three permutations: (1) regular upload-execution- download (UED) (2) execution-download-upload (EDU) (3) download-upload-execution (DUE).
7.1. Full asynchronous processing
First, we perform our test using the ideal cases, full asyn- chronous processing function, without a single synchronous call in the execution. Here we measure the influence of the ratio betwen computation and data transfer (processing ra- tio) on the performance of different asynchronous process- ing models, denotere=E/(U+D). This ratio indicates different types of out-of-core functions: data-transfer dom- inance (r<<1), processing dominance (r>>1), and bal- anced functions (r≈1). In the ideal case, the results on Fig- ure11show:
7
Function U E D Sync Impl Hrd-aware Hrd-indp
Max 347 13 0 360 349 349 349
Energy 692 20 0 710 698 700 698
Averaging 347 20 11 378 360 363 361
Normalization 347 28 322 694 696 687 677
Gaussian 347 431 322 1099 735 770 678
Atlas 201446 213423 1359583 555204 NA 372567 340356
Table 1:Runtime comparison of regular functions in practices with different streaming strategies.
Weight Ratio U E1 E2 E D Sync Impl U_ED UE_D UED EDU DUE
10 40 65 100 120 130 150 150
0.05 347 53 997 1050 322 1698 1663 1654 1389 1340 1054 1652
0.2 347 204 820 1024 322 1698 1664 1506 1389 1191 1054 1498
0.32 347 334 694 1028 322 1698 1664 1380 1389 1067 1055 1369
0.5 347 515 514 1029 322 1698 1664 1370 1389 1056 1199 1199
0.6 347 619 411 1030 322 1698 1664 1370 1389 1056 1296 1101
0.65 347 671 360 1031 322 1698 1664 1370 1389 1054 1346 1055
0.75 347 773 257 1030 322 1698 1664 1370 1457 1124 1448 1055
0.95 347 976 51 1027 332 1698 1667 1372 1651 1317 1650 1054
Ratio U E1 E2 E Sync Impl U_ED UE_D UED EDU DUE
0.05 0.2 0.32 0.5 0.6 0.65 0.75 0.95
347 53 997 1050 1698 1663 1654 1389 1340 1054 1652
347 204 820 1024 1698 1664 1506 1389 1191 1054 1498
347 334 694 1028 1698 1664 1380 1389 1067 1055 1369
347 515 514 1029 1698 1664 1370 1389 1056 1199 1199
347 619 411 1030 1698 1664 1370 1389 1056 1296 1101
347 671 360 1031 1698 1664 1370 1389 1054 1346 1055
347 773 257 1030 1698 1664 1370 1457 1124 1448 1055
347 976 51 1027 1698 1667 1372 1651 1317 1650 1054
0 1000 2000
0.05 0.2 0.32 0.5 0.6 0.65 0.75 0.95
Streaming with degeneracy
Time (ms)
Synchronous ratio r = (E1/(E2+E1))
Upload, Download E1
E2 E = E1 + E2
UED DUE EDU
UE_D U_ED
Sync Implicit
Figure 12: Runtime comparison of different streaming strategies in degenerate conditions.
• In all the tests, the three hardware independent imple- mentations give us the same performance. The hardware- aware and implicit models give similar runtimes. The U_ED is slightly faster than UE_D since the upload takes a bit longer than the download.
• If the function is transfer-dominant (re<0.5), all the models give optimal solutions.
• When the execution time is larger than the upload or the downloading time, the first two models still give strong performance, approximatelyTu+Te. However, it is not the optimal ofmax(Tu+Td,Te)achieved with the hardware-independent model. The hardware-independent model is faster than the hardware aware model in this case because the awareness from the hardware system re- quires that a double-image memory buffer is used instead of a triple one used by a hardware independent model.
In this configuration, it is impossible for the hardware- aware models to have a single stream with both the upload and the download when the other stream is only process- ing. This condition is required to achieve the best perfor- mance.
• When the function is balanced or processing-dominant (re≥1), the hardware-independent model gives the op- timal runtimeTeand the data transfer is completely hid- den. Note that this is also the condition for MIP out-of- core functions to outperform MIP in-core implementation since the in-core version will spendTe+n×Tu.
• The asynchronous function gives the best speedup in com- parison to the synchronous models when the loads be- tween two execution units are balanced (re=1).
7.2. Synchronous functions
Second, we test the result with the use of a synchronous function. Here we fix the run-time of three basic processes but change the position of the synchronous function inside the execution process to measure the influence of sync points inside the functions to different streaming models through the synchonous ratiors=E1/(E1+E2). With the existence of the synchronous function, the results in Figure12show:
• The position of the sync point within the asynchronous code directly affects the performance of the given imple- mentations.
• The three hardware-independent implementations give us different performance characteristics. No single hardware-inpendent implementation gives us the best run- ning time overall. However, the best result always is achieved with one of the hardware-independent imple- mentations.
• The implicit model no longer gives us the optimal result, and is as slow as the synchronous implementation. It sim- ply cannot find a schedule for asynchronous execution.
• The hardware-aware model could not give us optimal re- sults in all the tests. However, it is still far better than the implicit model. Note that their two implementations also give different runtimes.
Though we show the results with execution-dominant func- tion here, we also draw the same conclusions from transfer- dominant and balanced functions.
7.3. Regular out-of-core functions
On the third experiment, we focus on the regular out-of-core function sets such as a maximum value of all images, nor- malization, averaging, Gaussian filtering, product (energy computation), and atlas building. The results from Table 1 confirm that when the computation only requires simple functions (max, product, normalization, averaging, etc. ), the asynchronous streaming does give you the benefit of hiding the computational cost. However, it is negligible in compar- ison to the transfer cost. As the complexity of the functions increases (for example, Gaussian filtering function), we start seeing significant benefits of asynchronous streaming strate- gies, especially with the hardware independent model. In at- las construction, which is taken on the ADNI dataset that we mentioned on Figure1, as we increase the complexity of computational functions and reduce the cost of data transfer by merging all the functions together on a single loop, we
yield significant performance improve over the synchronous out-of-core version. The performance is compared to the in- core performance (execution time only) while we could pro- cess a significant amount of data much larger than that of an in-core version.
Overall, our results confirm our theoretical analysis. All the strategies are able to achieve optimal performance; how- ever, only the hardware-independent model gives the best performance in all the tests. In the degenerate cases, the im- plicit model completely fails. An presence of synchroniza- tion points makes it impossible to find a efficient sched- ule automatically. Note that in this case, a greedy approach, which immediately executes whenever the resource is avail- able, also fails. The hardware-aware model gives better performance even with the degenerate cases, although it is optimal. It is always possible to find the best runtime between hardware-independent implementations. In other words, the optimal performance is always achievable with the hardware-independent model.
8. Conclusions
In this paper, we have presented an optimized, parallel, multi-image processing framework on heterogeneous com- modity systems extending from the existing single-image, parallel processing framework. We have introduced multi- image operators, serving as the connection between the single-image processing model and the multi-image process- ing variant. We proposed two basic multi-image operators:
the MIMO and the MISO, which are utilized to construct other multi-image operators, allowing us to build a complete multi-image processing framework. We have also presented optimal streaming models for the multi-image processing framework. We have analyzed the advantages and disadvan- tages of various streaming strategies, and proposed a gener- alized streaming model based on functional decomposition that is optimal, hardware-independent, and highly scalable on future hardware. Our experimental results show that our hardware-independent model adapts to underlying hardware configurations, out-performs other streaming strategies, and gives optimal performance in all tests.
We also evaluated the efficiency of streaming models, and presented an quantitative evaluation that serves as a model for developers. We have investigated an optimal streaming strategy in unfavorable conditions based on reordering from order-independent properties of the explicit-streaming mod- els. We also give insight to the causes of unfavorable stream- ing conditions that help developers locate the performance degradation points in their implementations. Though we use a GPU computational model to illustrate the efficiency, our framework makes no specific assumptions about the under- lying architecture and hence can be generalized to any het- erogeneous parallel processing systems.
9. Appendix
1:Input:Nvolume inputs
2:Output: Template atlas volume
3:fork=1 tomax_itersdo
4: Fix imagesIik, compute the template ˆIk=N1∑Ni=1Iki wi
∑N i=1wi
5: fori=1 toNdo{loop over the images}
6: Fix the template ˆIk, solve pairwise-matching problem betweenIikand ˆIk
7: Update deformed imageIikwith current velocity
8: end for
9:end for
Algorithm 1:Atlas construction framework
1:Input:Ninput images
2:Output:Nprocessed output images
3:fork=1 toNdo
4: Upload thek-th image from the storage device to the processing device
5: Process the input in-core on the processing device
6: Download the output image back to the storage device
7:end for
Algorithm 2:Synchronous out-of-core MIMO operators
1:Input:Ninput volumes
2:Output: few numbers(sum, max/min, etc) or few images
3:fork=1 toNdo
4: Upload thek-th image from the storage device to the processing device
5: Process the input in-core on the precessing device
6: Update the accumulated output buffer on the processing device
7:end for
8:Write the final output to the storage device
Algorithm 3:Synchronous out-of-core MISO operators
1:Input:Ninput volumes
2:Output:Nprocessed output volumes
3:fork=1 toNdo
4: Load the dataiImg[k]from storage device to processing device,dkon the stream k-th
5:end for
6:fork=1 toNdo
7: Apply the operator on datado=oper(dk)on the streamk-th
8:end for
9:fork=1 toNdo
10: Write outputdoto the storage deviceoImg[k]on the streamk-th
11:end for
Algorithm 4:Implicit pipelining MIMO operator
1:Input:Ninput volumes, device input buffersdi[3]and device input buffersdo[3]
2:Output:Nprocessed output volumes
3:fork=1 toN+2do
4: ifk<=Nthen
5: Load the dataiImg[k]from storage device to device bufferdi[k%3]on theup- loadstream
6: end if
7: ifk>1 andk−1<=Nthen
8: Apply the operator on device bufferdo[(k−1)%3] =oper(di[(k−1)%3])on executionstream
9: end if
10: ifk>2 andk−2<=Nthen
11: Write outputdo[(k−2)%3]to the storage deviceoImg[(k−2)]on thedown- loadstream
12: end if
13: Synchronize streams
14:end for
Algorithm 5:Explicit pipelining MIMO operator
References
[Ala92] ALATTARA.: A probabilistic filter for eliminating tem- poral noise in time-varying image sequences. InCircuits and Sys-