COSC 6374 Parallel Computation
Parallel Computer Architectures

Some slides on network topologies are based on a similar presentation by Michael Resch, University of Stuttgart.

Edgar Gabriel
Spring 2009


Flynn's Taxonomy
- SISD: Single instruction, single data
  - Classical von Neumann architecture
- SIMD: Single instruction, multiple data
- MISD: Multiple instructions, single data
  - Non-existent, just listed for completeness
- MIMD: Multiple instructions, multiple data
  - Most common and general parallel machine


Single Instruction Multiple Data (I)
- Also known as array processors
- A single instruction stream is broadcast to multiple processors, each having its own data stream

[Figure: a control unit sends one instruction stream to four processors, each with its own data stream]


Single Instruction Multiple Data (II)
- Interesting detail: handling of if-conditions
  - First, all processors for which the if-condition is true execute the corresponding code section; the other processors are on hold
  - Second, all processors for which the if-condition is not true execute the corresponding code section; the other processors are on hold
- Some architectures in the early 90s used SIMD (MasPar, Thinking Machines)
- No SIMD machines available today
- SIMD concept used in the processors of your graphics card


Multiple Instructions Multiple Data (I)
- Each processor has its own instruction stream and input data
- Most general case: every other scenario can be mapped to MIMD
- Further breakdown of MIMD is usually based on the memory organization
  - Shared memory systems
  - Distributed memory systems


Shared memory systems (I)
- All processes have access to the same address space
  - E.g. PC with more than one processor
- Data exchange between processes by writing/reading shared variables
  - Shared memory systems are easy to program
  - Current standard in scientific programming: OpenMP
- Two versions of shared memory systems available today
  - Symmetric multiprocessors (SMP)
  - Non-uniform memory access (NUMA) architectures


Symmetric multi-processors (SMPs)
- All processors share the same physical main memory
- Memory bandwidth per processor is the limiting factor for this type of architecture
- Typical size: 2-16 processors

[Figure: four CPUs attached to a single shared memory]


SMP processors: Example
- AMD 8350 quad-core Opteron (Barcelona)
- Private L1 cache: 32 KB data, 32 KB instruction
- Private L2 cache: 512 KB unified
- Shared L3 cache: 2 MB unified

[Figure: four cores, each with private L1 and L2 caches, sharing one L3 cache; a crossbar connects to 3 HyperTransport links and 2 memory controllers]


SMP processors: Example (II)
- Intel X7350 quad-core (Tigerton)
- Private L1 cache: 32 KB instruction, 32 KB data
- Shared L2 cache: 4 MB unified cache

[Figure: four cores, each with a private L1 cache; each pair of cores shares an L2 cache; 1066 MHz FSB]


SMP systems: Example (III)
- Intel X7350 quad-core (Tigerton) multi-processor configuration

[Figure: four sockets (Socket 0 to Socket 3) holding cores C0 to C15, with shared L2 caches; all sockets attach to a Memory Controller Hub (MCH), which connects to four memory banks at 8 GB/s each]


NUMA architectures (I)
- Some memory is closer to a certain processor than other memory
  - The whole memory is still addressable from all processors
  - Depending on which data item a processor retrieves, the access time might vary strongly

[Figure: groups of CPUs, each group attached to its own memory, with the groups connected to each other]


NUMA architectures (II)
- Reduces the memory bottleneck compared to SMPs
- More difficult to program efficiently
  - First-touch policy: a data item will be located in the memory of the processor which touches the data item first
  - The relative location of threads/processes to each other matters
- To reduce the effects of non-uniform memory access, caches are often used
  - ccNUMA: cache-coherent non-uniform memory access architectures
- Largest example as of today: SGI Origin with 512 processors


NUMA systems: Example
- AMD 8350 quad-core Opteron (Barcelona) multi-processor configuration

[Figure: four sockets (Socket 0 to Socket 3), each with four cores, a shared L3 cache, and local memory attached at 8 GB/s; the sockets are connected by HyperTransport (HT) links]


Distributed memory machines (I)
- Each processor has its own address space
- Communication between processes by explicit data exchange
  - Sockets
  - Message passing
  - Remote procedure call / remote method invocation

[Figure: five CPUs, each with its own memory, connected by a network interconnect]


Distributed memory machines (II)
- Performance of a distributed memory machine strongly depends on the quality of the network interconnect and the topology of the network interconnect
  - Off-the-shelf technology: e.g. Fast Ethernet, Gigabit Ethernet
  - Specialized interconnects: Myrinet, InfiniBand, Quadrics, ...


Distributed memory machines (III)
- Two classes of distributed memory machines:
  - Massively parallel processing systems (MPPs)
    - Tightly coupled environment
    - Single system image (specialized OS)
  - Clusters
    - Off-the-shelf hardware and software components such as Intel P4, AMD Opteron, etc.
    - Standard operating systems such as Linux, Windows, BSD UNIX


Hybrid systems
- E.g. clusters of multi-processor nodes

[Figure: several nodes, each with multiple CPUs sharing one memory, connected by a network interconnect]


Network topologies (I)
- Important metrics:
  - Latency: minimal time to send a very short message from one processor to another
    - Units: ms, µs
  - Bandwidth: amount of data which can be transferred from one processor to another in a certain time frame
    - Units: Bytes/sec (KB/s, MB/s, GB/s); Bits/sec (Kb/s, Mb/s, Gb/s); baud


Network topologies (II)

Metric        Description                                           Optimal parameter
------------  ----------------------------------------------------  ----------------------------------
Link          A direct connection between two processors
Path          A route between two processors                        As many as possible
Distance      Minimum length of a path between two processors       Small
Diameter      Maximum distance in a network                         Small
Degree        Number of links that connect to a processor           Small (costs), large (redundancy)
Connectivity  Minimum number of links that have to be cut to
              separate the ...
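
The distance, diameter, and degree metrics listed above can be checked by hand on small networks. The following is a minimal sketch and not part of the original slides: the three example topologies (ring, hypercube, fully connected network) and all function names are assumptions chosen purely for illustration. Each network is stored as a dictionary that maps a processor id to the set of processors it has a direct link to.

```python
from collections import deque


def ring(n):
    """Ring of n processors: processor i is linked to its two neighbours."""
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}


def hypercube(dim):
    """Hypercube of 2**dim processors: links join ids that differ in one bit."""
    return {i: {i ^ (1 << b) for b in range(dim)} for i in range(2 ** dim)}


def fully_connected(n):
    """Every processor has a direct link to every other processor."""
    return {i: set(range(n)) - {i} for i in range(n)}


def distances_from(graph, source):
    """Breadth-first search: minimum number of links from source to each node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def diameter(graph):
    """Maximum distance between any two processors in the network."""
    return max(max(distances_from(graph, node).values()) for node in graph)


def degree(graph):
    """Largest number of links attached to a single processor."""
    return max(len(neighbours) for neighbours in graph.values())


if __name__ == "__main__":
    examples = [
        ("ring(8)", ring(8)),
        ("hypercube(3)", hypercube(3)),
        ("fully_connected(8)", fully_connected(8)),
    ]
    for name, graph in examples:
        print(f"{name}: diameter={diameter(graph)}, degree={degree(graph)}")
```

For eight processors the ring has diameter 4 and degree 2, the hypercube has diameter 3 and degree 3, and the fully connected network has diameter 1 and degree 7. This illustrates the trade-off in the table: a smaller diameter is bought with a larger degree, that is, with more links per processor and therefore higher cost.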