Achieving performance under OpenMP on ccNUMA and software distributed shared memory systems

CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE
Concurrency Computat.: Pract. Exper. 2002; 14:713–739 (DOI: 10.1002/cpe.646)

Achieving performance under OpenMP on ccNUMA and software distributed shared memory systems

B. Chapman*,†, F. Bregier, A. Patil and A. Prabhakar
Department of Computer Science, University of Houston, Houston, TX 77204-3010, U.S.A.

SUMMARY
OpenMP is emerging as a viable high-level programming model for shared memory parallel systems. It was conceived to enable easy, portable application development on this range of systems, and it has also been implemented on cache-coherent Non-Uniform Memory Access (ccNUMA) architectures. Unfortunately, it is hard to obtain high performance on the latter architecture, particularly when large numbers of threads are involved. In this paper, we discuss the difficulties faced when writing OpenMP programs for ccNUMA systems, and explain how the vendors have attempted to overcome them. We focus on one such system, the SGI Origin 2000, and perform a variety of experiments designed to illustrate the impact of the vendor's efforts. We compare codes written in a standard, loop-level parallel style under OpenMP with alternative versions written in a Single Program Multiple Data (SPMD) fashion, also realized via OpenMP, and show that the latter consistently provides superior performance. A carefully chosen set of language extensions can help us translate programs from the former style to the latter (or to compile directly, but in a similar manner). Syntax for these extensions can be borrowed from HPF, and some aspects of HPF compiler technology can help the translation process. It is our expectation that an extended language, if well compiled, would improve the attractiveness of OpenMP as a language for high-performance computation on an important class of modern architectures. Copyright © 2002 John Wiley & Sons, Ltd.
KEY WORDS: shared memory parallel programming; OpenMP; ccNUMA architectures; restructuring; data locality; data distribution; software distributed shared memory

*Correspondence to: B. Chapman, Department of Computer Science, University of Houston, Houston, TX 77204-3010, U.S.A.
†E-mail: chapman@cs.uh.edu
Contract/grant sponsor: NASA Ames Research Center; contract/grant number: NCC2-5394
Contract/grant sponsor: NSF; contract/grant number: NSF ACI 99-82160
Received 3 March 2001; Revised 27 September 2001

1. INTRODUCTION

Many modern supercomputers are cache-coherent Non-Uniform Memory Access architectures (ccNUMAs). Such systems consist of a number of nodes, with memory and one or more processors, connected via a fast interconnect. Memory is globally addressed; code running on a processor may reference data stored at any location throughout the machine. Whenever remote data is accessed, the system fetches the data and, if necessary, invalidates copies in cache belonging to other processors. Although distributed, memory is thus shared across the machine. Their design overcomes major limitations of SMP architectures, and very large ccNUMA systems have been constructed.

Such systems have distributed memory, and application programs may rely on corresponding programming models such as MPI or PVM, or they may exploit the fact that memory is shared. Given the effort undertaken to provide shared memory, the latter seems a natural approach. However, in contrast to uniform shared memory systems, the cost of accessing data at run time depends on its location relative to the processor that requires it. Distributed memory programming models force the user to code remote fetches explicitly, and thus the application developer will take this aspect of the program's behavior into consideration.
Shared memory programming models, by their very nature, do not provide for remote data fetches; they give the user no control over the relative placement of data and computation, and thus no direct influence on the cost incurred by data references in a program.

In this paper, we discuss the use of the shared memory programming model OpenMP on ccNUMA architectures, and consider how to write scalable OpenMP applications despite this problem. We show a programming style in which the locality of data and code is taken into account. We also evaluate this programming style on a software distributed shared memory (SDSM) system in order to determine whether it can also be expected to improve program performance on such systems. Our results confirm its usefulness for the system considered. Unfortunately, it is not an intuitive style, and creating OpenMP code in this form requires considerable effort. Thus one of the major benefits of OpenMP is partially lost in the process.

In recognition of the problems posed by OpenMP usage on ccNUMA architectures, hardware vendors have provided extensions to OpenMP to help a user manage the locality of data with respect to the work. The sets of extensions differ, compromising the portability of OpenMP code. We believe that a carefully considered, and modest, set of such extensions could be the key to enabling users to obtain performance on ccNUMA architectures. They could be used to translate an intuitively developed OpenMP code to a higher-performing equivalent program, or could be directly compiled. In this paper we focus on the experiments performed, and only briefly indicate the set of directives that we have begun to implement.

The remainder of this paper is organized as follows. In Section 2, we describe the characteristics of a ccNUMA system, and then briefly introduce the SGI Origin 2000, on which our experiments were performed, before discussing the OpenMP programming model and its usage.
We explain the shortcomings of OpenMP as a programming model for such architectures, and outline the language features introduced by SGI and Compaq that are intended to overcome those shortcomings on their respective systems. This is followed by a report in Section 3 on a set of simple OpenMP programming experiments carried out on a Silicon Graphics Origin 2000 at NCSA, some of which make use of SGI's OpenMP extensions. We subsequently show a second set of experiments that use an alternative OpenMP programming style in Section 4. Section 4.2 reports upon our experiences using this programming style on TreadMarks, an SDSM system from Rice University. Finally, in Section 5 we briefly discuss a possible set of language extensions to OpenMP that might be used together with the straightforward 'loop-level' programming style (as employed in the first set of experiments) and either translated to the alternative programming style, or compiled directly. We compare these with the features proposed by Compaq and SGI, before discussing related work in Section 6 and reaching some conclusions in Section 7.

2. PROGRAMMING ccNUMA ARCHITECTURES

Several vendors market ccNUMA systems (Compaq, HP, SGI, Sun). These machines emulate shared memory while providing significantly higher levels of total performance. A typical ccNUMA platform is made up of a collection of shared memory parallel (SMP) nodes, or modules, each of which has internal local shared memory; the individual memories together comprise the global memory. The entire memory is globally addressed, and thus accessible to all processors; however, non-local memory accesses are more expensive than local ones.
The memory hierarchy thus consists of one or more levels of cache associated with an individual processor, a node-local memory, and remote memory (main memory that is not physically located on the accessing node). A cache-coherent system assumes responsibility not only for fetching and storing remote data, but also for ensuring consistency among the copies of a data item. If data saved in a cache are updated by another processor, then the value in the cache must be invalidated. Thus such systems behave as shared memory machines with respect to their cache management schemes.

2.1. The Silicon Graphics Origin 2000

The SGI Origin 2000 is such a system [1]. It is organized as a hypercube, where each node consists of a number of processors and memory that is local to them. The size of Origin systems is growing rapidly, and a 1024-processor machine is expected to appear soon. Figure 1 shows the high-level architecture of the Origin 2000. A node is formed by a pair of MIPS R12000 processors, connected through a hub, together with a portion of the shared memory. Multiple nodes are connected in a hypercube through a switch-based interconnect. One router is needed to connect up to eight processors, since two pairs are connected directly via their hubs; two routers are needed to connect 16 processors, four for 32 processors, and so on. Each MIPS R12000 processor has two levels of two-way set associative cache, where the first level of cache is on-chip and provides a 32 kB data cache and a 32 kB instruction cache (32-byte line size). The second level of cache is off-chip; it typically provides 1–4 MB of unified cache for both data and instructions (128-byte line size). All caches utilize a Least Recently Used algorithm for cache line replacement. While data only exist in either local or remote memory, copies of the data can exist in various processor caches. Keeping these copies consistent is the responsibility of the cache-coherence protocol of the hubs.
The directory-based coherence removes the broadcast bottleneck that limits snoopy bus-based coherence. Latency of access to the level 1 cache is approximately 5.5 ns; for the level 2 cache, latency is 10 times this amount. Latency of access to local memory is another six times as expensive, whereas latency to remote memory ranges from around twice to nearly four times that for local memory, depending on the number of routers traversed. SGI reports that the bidirectional bandwidth is ∼620 MBps for up to three routers (32 processors) and thereafter is ∼310 MBps. However, the experienced cost of a remote memory access depends not only on the distance of its location, i.e. the number of hops required, but also on contention for bandwidth. Contention can have a severe impact on performance; it can arise as the result of many non-local references within a single code, or may be caused by the activities of other independent applications running on the same machine, and its effect is thus unpredictable. Seidel [2] has studied access delays related to the memory hierarchy on the SGI Origin 2000.

[Figure 1. Architecture of the SGI Origin 2000: nodes of two processors with caches, connected through a hub to local memory and directory, an I/O crossbar, and the scalable interconnect network.]

The operating system supports data allocation at the granularity of a physical page (16 kB). It attempts to allocate memory for a process on the same node on which it runs. However, results are not guaranteed [3]. Default strategies may be set by the user or the site. Typically, a default first-touch page allocation policy is used that aims to allocate a page from the local memory of the processor incurring the page fault. In an optional round-robin data allocation policy, pages are allocated to each processor in a round-robin fashion.
The Origin provides hardware and software support for page migration. The de facto standards HPF [4], OpenMP [5], and MPI [6] may be used on the Origin. Additional programming models include hybrid MPI+OpenMP (e.g. [7]), shmem, and pthreads [8]. We introduce the main features of OpenMP below, before discussing its use on ccNUMA platforms.

2.2. OpenMP

OpenMP consists of a set of compiler directives and library routines for explicit SMP programming. It was developed by a consortium of vendors to provide a portable programming interface for the growing number of SMP systems on the market. OpenMP directives may be inserted into Fortran, C, or C++ code in order to specify how the program's computations are to be distributed among the executing threads at runtime. Based largely upon the earlier X3H5 standardization effort [9], it is a familiar programming model, enables relatively fast, incremental application development, and has thus rapidly gained acceptance. Experts have reported being able to port substantial codes to (true) SMP systems in a matter of hours [10]. Compilers for OpenMP are readily available.

The language extension is based upon the fork-join execution model, in which a single master thread begins execution and spawns worker threads to perform computations in parallel regions. It is the task of the user to specify parallel regions for execution by a team of threads. Directives are thus provided for declaring such regions, for sharing work among threads, and for synchronizing them. Worksharing directives spread loop iterations among threads, or divide the work into a set of parallel sections. Parallel regions may differ in the number of threads assigned to them at runtime, and the assignment of work to threads may also be dynamically determined.
Typically, there will be one thread per executing processor at runtime.

The user is also responsible for detecting and dealing with race conditions in the code, as well as for thread coordination. Critical regions and atomic updates may be defined; variable updates may be ordered, or a set of statements may be executed only by the first thread to reach them. The OpenMP library implements locks.

Threads may have private copies of some of the program's data. OpenMP permits private variables, with potentially different values on each thread, within parallel regions. Fortran programs may have threadprivate common blocks, and the latest version also permits private variables to be SAVEd. Since OpenMP is a shared memory model, sequence and storage association is not affected by its directives.

An innovative feature of OpenMP is the inclusion of so-called orphan directives, directives that are dynamically, rather than lexically, within the scope of a parallel region. They greatly simplify the code migration process, since code must no longer be extracted from subprograms in order to specify worksharing. Moreover, the fact that programs may be incrementally migrated means that, in contrast to a global programming model such as MPI or HPF, a programmer need not consider the flow of control and data throughout an entire program while inserting directives.

2.3. OpenMP on the Origin 2000

OpenMP was developed as a high-level programming model for creating applications that are able to exploit the processing power of SMP systems, and it is also implementable on ccNUMA systems. SGI provides OpenMP on the Origin 2000, and recommends its use. The threads that execute a parallel region may run on processors across multiple nodes of the machine. When a loop is parallelized via a worksharing construct, for instance, loop iterations are executed on threads that may be located on different processors spread across the system, and the program data may also be spread across the different node memories.
It is likely that a thread will reference data stored in non-local memory. However, remote memory accesses can increase the memory latency by a significant factor; furthermore, if data is not stored where it is (mainly) used, network traffic may cause substantial bottlenecks.

In order to obtain high performance from ccNUMA systems, it is thus important that the placement of data and computations over the memories is such that the data accessed by each thread is largely local to the processor on which that thread is currently executing. OpenMP does not provide any mechanisms for controlling the placement of data relative to computation, so the vendor has supplied a number of means with which this may be influenced. 'First touch' data allocation, whereby data is allocated to memory local to the processor that first accesses it, can be beneficial if used correctly; if the initialization is not parallelized, however, it may lead to the storage of large data structures on a single node. In such cases, the performance degradation can be significant. The operating system also provides for automatic page migration, whereby data that is remotely accessed may be transparently migrated.