Microsoft's Parallel Computing Platform: Applied Research in ...

Microsoft Research Faculty Summit 2008 Microsoft's Parallel Computing Platform Applied Research in a Product Setting Joe Duffy Lead Software Development Engineer Parallel Computing Platform (Visual Studio) Parallel Computing Platform overview What is the Parallel Computing Platform? Part of Microsofts Parallel Computing Initiative Microsofts Parallel Computing Initiative encompasses the vision, strategy, and innovative technologies for delivering transformative, natural, and immersive

personal computing experiences, that harness the computing power of manycore architectures. What were trying to do: Enable parallel computing for Windows developers (C++, .NET) Drive shift to ubiquitous parallelism Why? To cope with the hardware reality ahead To enable immersive, rich user experiences Reinvigorate the client Manycore impacts the entire stack Applications Constructing Parallel Applications

Libraries Languages, Compilers and Tools Executing Fine-Grain Parallel Applications Concurrency Runtime OS/Hypervisor Hardware Coordinating System Resources and Services Change is hard Visual Studio has shipped 9 major versions C++: C Runtime Library (version 9), Win32 1.0 in 1993 .NET Framework (version 3.5) 1.0 in 2001

Lots of legacy, i.e. code already written Win32: >100,000 methods .NET: >10,000 types; >100,000 methods Even more customer code Compatibility is paramount Cultural: Never break a customer app during upgrade Hard to be revolutionary Incubation is a good way to ignore legacy (for a while) This is a big reason its so hard to tech transfer! Transferring research to product Incubate Research MSR partnerships

Universities Academic publications Small # of researchers (often 1) Explore research in a product setting Research mashups Consider legacy Fail early and often: feedback loop w/ research Medium sized team: 1-

10 Productize Integrate incubation results into real products Speak w/ customers Compromise due to legacy Proper QA Reliability, bugfixing, polish, Large team: 1-50+

What is under development in PCP? Concurrency Runtime (i) Concurrency Runtime Maps logical (tasks) to physical concurrency (OS threads & processors) Data structures mirror hardware topology: packages, nodes, and caches --- to optimize for locality Work stealing task queues User-mode cooperative scheduling of blocked tasks Process-level resource management C++ runtime publicly exposed to community OpenMP, .NET runtime hidden, used by abstractions

Concurrency Runtime (ii) Parallel Parallel Language Language Integrated Integrated Query Query Agents Agents Language Language Managed Managed Library Library

Native Native Library Library CCR CCR C++ C++ Bindings Bindings Parallel Parallel Pattern Pattern Library Library Task Task Parallel Parallel Library

Library Concurrent CLR Concurrent CLR Work-Stealing Asynchronous Work-Stealing Asynchronous Data Thread Data Thread Pool Pool APIs Agents APIs

Agents Structures Structures Managed Native Managed Native Task Task Task Scheduler Scheduler (Concurrency (Concurrency Task Scheduler Scheduler (Concurrency (Concurrency Runtime) Runtime) Runtime)

Runtime) Resource Resource Manager Manager (ConcRT) (ConcRT) Windows: Windows: OS OS Threads Threads Language Language Concurrency Runtime (iii) Hard problems Engineering complexity Achieving sufficiently capable but simple architecture

Efficiency: static over-expression, dynamic load balance Real-world workloads: what to optimize for? What does the 5-10 year hardware architecture look like? In order? Out of order? SIMD? MIMD? GPU? Memory architecture? Influenced by R. D. Blumofe, C. E. Leiserson. Scheduling multithreaded computations by work stealing. 1994. H. J. Lu. Heterogeneous multithreaded computing. 1995. G. E. Blelloch, P. Gibbons, Y. Matias. Provably efficient scheduling for languages with fine-grained parallelism. 1999. Task parallelism (i) .NET and C++ offerings Task Parallel Library (.NET) Parallel Pattern Library (C++) Primary features

Object-oriented task abstractions Logical lifecycle & well-defined state transitions Parent/child relationships (structured concurrency) First class cancelation Common exceptions story Dataflow (aka, futures) Nonblocking continuations (aka, promises) Task parallelism (ii) Hard problems Clean & nonobtrusive design Fine-grained efficiency: reducing task footprint Safety with the addition of concurrency Relationship to message passing, nonblocking Influenced by H.C. Baker Jr., C. Hewitt. The Incremental Garbage Collection of Processes. 1977.

A.N. Hambermann, I.R. Nassi. Efficient Implementation of Ada Tasks. 1980. J. E. Richardson, M. J. Carey, D. T. Schuh. The Design of the E Programming Language. 1993. A. Rossberg, D. L. Botlan, G. Tack, T. Brunklaus, G. Smolka. Alice ML Through the Looking Glass. 2004. Imperative data parallelism (i) TPL & PPL Parallel for loops (integer ranges, iterators) Recursion (dynamic limits) Reductions In-place sorts E.g., simple for loop Parallel.For(0, N, i => { }); parallel_for(0, N, [&](int i) { });

E.g., also over .NET enumerables and C++ iterators Parallel.ForEach(e, t => { }); parallel_for_each(e->begin(), e->end(), [&](T * t) { }); Imperative data parallelism (ii) Hard problems Balance between static & dynamic scheduling Load imbalance Integration into an existing side-effectful ecosystem Lots of good research: too much, in fact Influenced by L. Lamport. The parallel execution of DO loops. 1974. W. D. Hillis, G. L. Steele, Jr. Data parallel algorithms. 1986. C. D. Polychronopoulos, D. J. Kuck. Guided selfscheduling: A practical scheme for parallel

supercomputers. 1987. Declarative data parallelism (i) PLINQ == Parallel (P) LINQ-to-Objects (LINQ) Visual Studio 2008 introduced LINQ SQL-like operators to unify DB, in-memory, and XML data access Can query across multiple data source: join XML w/ in-memory, etc. C# & VB have language syntax PLINQ auto-parallelizes in-memory and XML parts E.g., filter, project, sort int[] arr = some big array ; var q = from x in arr where p(x) // p: predicate/filter orderby k(x) // k: key selection routine select f(x); // f: data transformation float[] out = q.ToArray(); // Execute!

Declarative data parallelism (ii) Hard problems Loss of control => perceived efficiency New way of expressing problems Nonfunctional problems dont map well: graphics, etc. Expectations about order (e.g., queries over arrays) Heuristics: how parallel to go? Decision criteria? Influenced by D.J. DeWitt, R.H. Gerber, G. Graefe, M.L. Heytens, et. al. GAMMA: a high performance dataflow database machine. 1986. G. E. Blelloch. NESL: A nested data-parallel language. 1995. G. Graefe. Volcano: An extensible and parallel query evaluation system. 1994. D. Tarditi, S. Puri, J. Oglesby. Accelerator: Using data parallelism to program GPUs for general-purpose uses. 2006. M.M. T. Chakravarty. R. Leshchinskiy, S.P. Jones. Data parallel Haskell: a status report. 2007.

Scalable containers (i) Communication in shared-memory often container-oriented Currently ad-hoc, race-prone, not scalable, In both .NET and C++ we offer new containers Primary goal is scalability, not theoretical (lock-free) In some cases, we use lock-free to improve reliability Combination of non-blocking and fine-grained locking Stack, queue, deque, linked list, hashtable, set, Some coordination-oriented: blocking & bounding Managed equivalents to .NET containers C++ are STL-like Scalable containers (ii) Hard problems

Pushing the right abstractions Expected semantic parity w/ sequential collections Enumeration/iteration Synchronization approach Performance of CAS on current generation hardware (~100-500 cycles) ABA in C++ Influenced by M. M. Michael, M. L. Scott. Simple, fast, and practical non-blocking and blocking concurrent queue algorithms. 1996. T. Harris. A pragmatic implementation of non-blocking linked lists. 2001. M. M. Michael. High-performance dynamic lock-free hash tables and listbased sets. 2002. M. M. Michael. Hazard pointers: Safe memory reclamation for lock-free objects. 2004. M. M. Michael. ABA prevention using single-word instructions. 2004. Memory models

Memory models are a constant signal Large amounts of lock-free code Written to abstract memory models .NET 2.0: stronger than JMM (but harder to implement) C++: weak, exposes underlying hardware MM Hard problems Resisting the temptation to use too much (intellectually stimulating) Pushing the edge hereweve found code-gen bugs Influenced by L. Lamport. Time, clocks, and the ordering of events in a distributed system. 1978. L. Lamport. How to make a multiprocessor computer that correctly executes multiprocess programs. 1979. M. P. Herlihy, J. M. Wing. Linearizability: A correctness condition for concurrent objects. 1990. J. Manson, B. Pugh. The Java memory model. 2004, 2005.

Verification All of this lock-free and concurrency oriented code Its hard enough to write But even harder to verify its correctness Hard problems Area of active new research: impossible to keep up Lack of programming model support (contracts) Brute force testing is costly (time & hardware matrix) Influenced by Y. Yu, T. Rodeheffer, W. Chen. RaceTrack: Efficient detection of data race conditions via adaptive tracking. 2005. M. Musuvathi, S. Qadeer. CHESS: Systematic stress testing of concurrent software. 2007. S. Burckhardt. Memory model sensitive analysis of concurrent data types. 2007.

What is PCP incubating? Transactional memory TM C# syntax & CLR JIT compiler support Many approaches: optimistic write-in-place, buffered writes, ... Hard problems Existing ecosystem: extension points, I/O Strong vs. weak atomicity Atomicity safety: publication, privatization Efficiency Keeping up with the research community Influenced by Herlihy, Moss. Transactional memory: Architectural support for lock-free data structures. 1993.

T. Harris, K. Fraser. Language support for lightweight transactions. 2003. M.F. Spear, V.J. Marathe, L. Dalessandro, M.L. Scott. Privatization techniques for software transactional memory. 2007. C.C. Minh, M. Trautmann, J.W. Chung, A. McDonald, et. al. An effective hybrid transactional memory system with strong isolation guarantees. 2007. Asynchronous agents Orchestration language Coarse-grained agent-based concurrency Loosely coupled with contracts All interactions across isolation boundaries via message passing channel ChSum { input int [] Add; output int Sum; Start: { Add -> Computing; -> End; } Computing: { Sum -> Start; } } agent A : ChSum { A() {

while (true) { int [] in = receive(this::Add); this::Sum <- SumArray(in); } } } agent B : ChSum { B() { this::Add => SumArray => this::Sum } } Influenced by G Andrews, R Olsson: The SR Programming Language, 1993 ISO 8652 (a.k.a. Ada) C A R Hoare: Communicating Sequential Processes, 1978 M Fndrich, M Aiken, C Hawblitzel, O Hodson, G Hunt, J Larus, S Levi: Language Support for Fast and Reliable Message-based Communication in Singularity OS, 2006 Statically provable thread safety Many constructs provided lead to traps

int c = 0; Parallel.ForEach(0, N, i => c++); Functional languages help: think Haskell (all side-effects are monads) Hence purity & immutability as 1st class constructs can help But Win32, .NET, and our customers are very imperative Instead: integrated isolation support in the type system Hard problems Legacy! Lack of unified research Influenced by J. M. Lucassen, D. K. Gifford. Polymorphic effect systems. 1988. J. Hughes. Why functional programming matters. 1989. P. Wadler. Linear types can change the world! 1990. M. Abadi, C. Flanagan, S.N. Freund. Types for safe locking: Static race detection for Java. 2006. M. Barnett, K.R.M. Leino, W. Schulte. The Spec# programming system: An overview. 2004.

In conclusion Parting thoughts; and a challenge We are paying attention A big part of our job is mining & aggregating research Applying it in a product setting often entails (surprisingly a lot) more than the research suggests (legacy) Weve taken the first few steps down a long road Starting with mechanisms (ship) Exploring encompassing programming models (incubate) A challenge A significant portion of customers dont have CS degrees Even those that do have a big concurrency learning curve How can we get them to write parallel code? Well be looking to you for help in the coming years

Q&A Thank-you [email protected]

Recently Viewed Presentations

  • Aedes Mosquito May 2017 - Naval Hospital Bremerton

    Aedes Mosquito May 2017 - Naval Hospital Bremerton

    Monaghan AJ, Morin CW, Steinhoff DF, Wilhelmi O, Hayden M, Quattrochi DA, Reiskind M, Lloyd AL, Smith K, Schmidt CA, Scalf PE, Ernst K. On the Seasonal Occurrence and Abundance of the Zika Virus Vector Mosquito Aedes Aegypti in the...
  • Costa&#x27;s Levels of Questions - TypePad

    Costa's Levels of Questions - TypePad

    Level 2 Questions. Examples. Using a plot diagram, analyze the plot of the short story. Compare the square root of 49 to the square root of 64. Which is greater? Diagram and order the stages of photosynthesis. Compare and contrast...
  • PowerPoint-Präsentation


    Thoracoabdominal resection of retrocrural residual cancers. Aktuelle Urologie. Histologische Evaluation der Cochleaimplantation mittels Supramealem Zugang (SMA) Otorhinolaryngologia Nova . Cheek dimples in Greek children and adolescents. International Journal of Anthropology
  • Improving Stream Flow Estimates in NHDPlus

    Improving Stream Flow Estimates in NHDPlus

    Tim Bondelid- NHDPlus Team. Consulting Engineer. 2012 ESRI International User Conference . July 23-27, 2012. San Diego, CA. NHD Plus ... Water-quality modeling (SPARROW) Regional and national stream flow assessments goal of National Stream Flow Information Program.
  • Introduction to Astronomy

    Introduction to Astronomy

    This heat allows fusion to occur in a shell of material surrounding the core… Due to the higher central temperature, the star's luminosity is greater than before… This increased energy production causes the outer part of the star to expand...
  • ATmega103 Timer0 and Interrupts 23/02/20 Tim Sumner, Imperial

    ATmega103 Timer0 and Interrupts 23/02/20 Tim Sumner, Imperial

    ATmega103 Timer0 and Interrupts The Need for Processor Interrupts Processor Interrupts The ATmega103 Timers and Interrupts The ATmega103 Memory Map Interrupts mapping Interrupts mapping Using interrupts Global enable via the status register Masks to work at bit level within devices...
  • LESSON 3 - Weebly

    LESSON 3 - Weebly

    WE DO. A market economy is a type of economic system where supply and demand regulate the economy, rather than government intervention. A true free market economy is an economy in which all resources are owned by individuals.
  • TA 101 Think and Analyze - IITK

    TA 101 Think and Analyze - IITK

    What does GOD (google) say about sectional views? In section view drawings, hidden line representation is omitted in that part of the view with the section lining. A correct conventional representation of the full section in the front view omits...