• No results found

Parallel Programming Patterns

N/A
N/A
Protected

Academic year: 2022

Share "Parallel Programming Patterns"

Copied!
34
0
0

Laster.... (Se fulltekst nå)

Fulltekst

(1)

hetcomp.com Johan Seland

SINTEF Petroleum Development Workshop – Session 3 Trondheim - 9. December 2010

Parallel Programming Patterns

(2)

• Introduction and vocabulary

• Limits to performance

Amdahls Law vs Gustafson Law

• Concurrency

Domain decomposition

Task parallelism

• Synchronization

Fences

Barriers

Mutex es

Semaphores

• Testing parallel programs

Overview

(3)

hetcomp.com

A Pattern is:

• General

• Reusable

• Based on a proven design

• Not directly translatable into code

• Originated in architecture

Christopher Alexander 1977

• Now part of the Software Eng. Vocabulary

Gang of Four, 1987

• There is also Anti-Patterns

A word about patterns

(4)

• Task

Sequence of instructions that operate together

• Thread (process)

An executing task

• Core

The underlying hardware executing a thread

• Load balancing/scheduling

The mapping of threads to cores

• Race condition

The outcomes varies as the scheduling of threads varies

• Deadlock

A cycle of threads that are waiting on each other

Vocabulary

(5)

hetcomp.com

• Single Instruction, Single Data (SISD)

A sequential computer

Example: Mobile phones, low-end laptops

• Single Instruction, Multiple Data (SIMD)

A single instruction applied to multiple data streams

Example: Vector unit of CPUs, some GPUs

• Multiple Instruction, Single Data (MISD)

Multiple instructions on a single data stream.

Example: Fault tolerant systems (space shuttle)

• Multiple Instruction, Multiple Data (MIMD)

Multiple processors simultaneously executing different instructions on different data

Example: Multi-Core CPUs, clusters, some GPUs

Flynns Taxonomy of Computer Systems

Most TOP500 supercomputers

are based on MIMD

(6)

Limits to performance

(7)

hetcomp.com

• Presented by Gene Amdahl in 1967

Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities

• Find maximum expected improvement performance

• Overly pessimistic in practice

• Contradicted by Gustafsons Law

• Result: Theoretical speedup is limited by serial part of code

Amdahls Law

(8)

• Total running time of serial program is given by:

• Using P processors we get:

• The relative speedup is:

T T

T

T

total

( 1 )

setup compute

( 1 )

final

( ) compute(1)

total P setup final

P

T T T T

( ) (1)

( )

total total

S P T

T P

Amdahls Law - Equations

(9)

hetcomp.com

• The serial fraction is:

• Inserting this into S(P) give Amdahls Law:

( ) (1)

( )

total total

S P T

T P

1) , 1

( 0

setup final

total

T T

T

1

1

( ) (

(1)

(1) 1

)

total

ta

P l

P

to

S P T

T

Amdahls Law - Equations II

(10)

• Assume an infinite number of processors

• The maximum performance increase is bound by the serial fraction

1 1

lim ( ) lim 1

P P

P

S P

Amdahls Law - Equations III

(11)

hetcomp.com

- 2,00 4,00 6,00 8,00 10,00 12,00 14,00 16,00 18,00 20,00

1 16 256 4 096

Speed up

# cores

10 % 50 % 75 % 90 % 95 % 99 %

Parallel Portion

Plot of Amdahls law

Logarithmic scale

(12)

• Amdahls law does not incorporate increased problem size

• We are interested in solving the largest possible problem in reasonable time

• Assume γis independent of P

( )

(1 ( )

(

( )

1 )

)

total setup compute final

setup final scaled

total

scaled scaled

T P T

T T

P

T P T

T

S P P

Gustafson Law

(13)

hetcomp.com

- 500,00 1 000,00 1 500,00 2 000,00 2 500,00 3 000,00 3 500,00 4 000,00 4 500,00

- 500 1 000 1 500 2 000 2 500 3 000 3 500 4 000 4 500

Speed up

# cores

10 % 50 % 75 % 90 % 95 % 99 %

Parallel Portion

Plot of Gustafson law

(14)

- 500,00 1 000,00 1 500,00 2 000,00 2 500,00 3 000,00 3 500,00 4 000,00 4 500,00

- 500 1 000 1 500 2 000 2 500 3 000 3 500 4 000 4 500

Speedup

# cores

Gustafsons Law

6,00 8,00 10,00 12,00 14,00 16,00 18,00 20,00

Speedup

Amdahls Law

(15)

hetcomp.com

• Experience shows that Amdahls law is overly pessimistic

But you will always have some serial parts

• Many real world scenarios demonstrate almost linear speedup

• Some cases see superlinear speedup!

• Both models are simplified

Parallelism also introduces overhead

• Don’t forget Donald Knuth: Premature Optimization is the root of all evil

• Is the potential speedup worth the extra effort?

Up front and maintenance wide?

Dicussion

(16)

Concurrency

(17)

hetcomp.com

Definition of CONCURRENCE

a : the simultaneous occurrence of events or circumstances b :the meeting ofconcurrentlines in a point

Definition of CONCURRENT

a : operating or occurring at the same time b :running parallel

from Merriam Webster

Concurrency

(18)

• Concurrency can be found at many levels

• Concurrency exists in two forms:

1. Data parallel 2. Task parallel

• Not mutually exclusive

A complex program will have both

The line between them is blurred

Concurrency

(19)

hetcomp.com

• The same task is executed as many threads on different pieces of the data

• Examples:

Rendering

(Dense) linear algebra

FFTs

Max/Min computations

Web servers

Databases

Data parallelism

(20)

• Different, independent tasks

• Linked by sharing data

• Examples:

GUI code

Logging

Loading data

Writing data

Networking

Monitoring data?

• Hard to find enough tasks to scale to 10++ cores

Task parallelism

(21)

hetcomp.com

• Concurrency should be identified early

• #Cores on target hardware should be known before choosing algorithm

• Good serial algorithms seldom make good parallel ones

Discussion

(22)

Synchronization

(23)

hetcomp.com

• Most parallel programs require tasks to communicate

• Synchronization must be explicitly handled

• Difficult to enforce automatically

Threads are assumed to follow an agreed upon protocol

• Synchronization is expensive

Slows down the program

• Common source of bugs

Hard to find

Hard to reproduce

Synchronization

(24)

int getNextId() {

static int lastIdUsed = 0;

return ++lastIdUsed;

}

• Assume two threads call getNextId()

Illustrating the problem

Thread one Thread two lastIdUsed

43 44 44

44 43 44

43 43 43

(25)

hetcomp.com

• Modern CPUs have complex cache hierarchies

Typically three levels deep

• A fenceensures that all threads have a consistent view of memory

• Typically invoked by higher level primitives

• Only meaningful in a shared memory setting

Memory fences

RAM Level 2

Level 1 Level 1

Thread 1:

v[i] = 42;

Thread 2:

foo(v[i]);

v[i]

v[i]

v[i]

v[i]

(26)

• Synchronization point:

Every thread must arrive before continuing

• Typically used at the end of an iteration

• Explicitly written by the programmer

• Useful in cluster and shared memory processing

Barriers

(27)

hetcomp.com

• Mutex = Mutual Exclusion

• Protects against the simultaneous use of a common resource

Example: global variable, network card, write to file

• The mutex is a lock that protects the resource

• Threads must acquire the mutex before entering a critical section

• If the mutex is busy the thread must wait (spin on the lock)

• Remember to release the mutex!

• Coding a mutex is not trivial – use libraries (which generally use HW)

Mutex

(28)

int main() {

omp_lock_t lock;

omp_init_lock( &lock );

#pragma omp parallel shared (lock) { // non-critical section

omp_set_lock( &lock );

// critical section…

omp_unset_lock( &lock );

// non-critical section

OpenMP Lock Example

Declare and init lock

Wait or Aquire lock

Release lock

(29)

hetcomp.com

• Controls access to common resources (note the plural form)

• Records how many units of a resource is available (counting semaphore)

Hands out a permit to the resource

• Example:

A library with ten study rooms and ten keys

A librarian (semaphore) hands out keys to the rooms

Students (threads) must wait if there are no free keys

• A mutex can be seen as a binary-semaphore

A mutex has the concept of a “owner”

Semaphores

(30)

• An object designed to be safely used by several threads

• Often implemented using mutex/semaphores

Java, C# has language support

• Monitors often have mechanism for signaling callers when they are “ready”

• Mutex/Semaphore: caller is responsible

• Monitor: callee is responsible

Monitors

(31)

hetcomp.com

Release lock before returning

class A { private:

Lock l;

int lastIdUsed;

public:

int getNextId() { l.aquire();

int id = ++lastIdUsed;f l.release();

return id; }

} }

Monitor example

int main() { A a;

int id = a.getNextId();

}

Client code does need not worry about locking

(32)

• Notoriously hard

• Ensure unicore algorithm is correct

• Parameterize the “test space” – make it as big as possible

Vary number of cores

Vary hardware platform

Vary compiler settings

Vary input data

Run tests in different order

Run automatically

Debugging and testing parallel programs

(33)

hetcomp.com

• Synchronization protocol must be agreed upon

• Common source of bugs

Conclusion

(34)

Reading list

Referanser

RELATERTE DOKUMENTER

OB Outer Barrel pCT Proton CT PDU Protocol Data Unit PRU Proton CT Readout Unit RISC Reduced Instruction Set Architecture SCADA Supervisory Control and Data Acquisition SCU

The revealed patterns of sexual risk behaviours for example close to 50% of men having multiple partners and 78% of the population have never used a condom; it is likely that

The Fortran 2008 standard also includes a parallel programming model based primarily upon the coarray distributed data structure.. The advent of support for Fortran 2008 coarrays in

A normalized data packet is also passed onto the next chip when an on-chip processor completes interpolating and the last pixel on the scan line, Xend, does not

The PAVLOV machine is a two-dimensional array of SIMD --- Single Instruction Multiple Data - Processing Elements that would operate as a coprocessor for

In order to decrease computation time of each frame of simulation, we chose to make parallel algorithms, using the parallel programming interface Athapascan developped in the

Instead of tracking a single particle to get a single curve, we can use multiple particles to get complicated patterns. We suggest releasing new particles at intervals along the

This work studies the performance and scalability characteristics of “hybrid” parallel programming and execution as applied to raycasting volume rendering – a staple