
Main Memory
ECE/CS 752 Fall 2017
Prof. Mikko H. Lipasti
University of Wisconsin-Madison
Lecture notes based on notes by Jim Smith and Mark Hill
Updated by Mikko Lipasti

Readings
Read on your own:
  Review: Shen & Lipasti Chapter 3
  W.-H. Wang, J.-L. Baer, and H. M. Levy. Organization of a two-level virtual-real cache hierarchy, Proc. 16th ISCA, pp. 140-148, June 1989 (B6). Online PDF
  Read Sec. 1, skim Sec. 2, read Sec. 3: Bruce Jacob, The Memory System: You Can't Avoid It, You Can't Ignore It, You Can't Fake It, Synthesis Lectures on Computer Architecture 2009 4:1, 1-77. Online PDF
To be discussed in class:
  Review #1 due 11/1/2017: Andreas Sembrant, Erik Hagersten, David Black-Schaffer, The Direct-to-Data (D2D) cache: navigating the cache hierarchy with a single lookup, Proc. ISCA 2014, June 2014. Online PDF
  Review #2 due 11/3/2017: Jishen Zhao, Sheng Li, Doe Hyun Yoon, Yuan Xie, and Norman P. Jouppi. 2013. Kiln: closing the performance gap between systems with and without persistence support. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-46). ACM, New York, NY, USA, 421-432. Online PDF
  Review #3 due 11/6/2017: T. Sha, M. Martin, A. Roth, NoSQ: Store-Load Communication without a Store Queue, in Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, 2006. Online PDF

Outline: Main Memory
DRAM chips
Memory organization
  Interleaving
  Banking
Memory controller design
Hybrid Memory Cube
Phase Change Memory (reading)
Virtual memory
TLBs
Interaction of caches and virtual memory (Wang et al.)
Large pages, virtualization

DRAM Chip Organization
[Figure: DRAM chip organization, showing row address, row decoder, word lines, bitlines, memory cell array (each cell: a transistor and a capacitor at a wordline/bitline crossing), sense amps, row buffer, column address, column decoder, data bus]
Optimized for density, not speed
Data stored as charge in capacitor
Discharge on reads => destructive reads
Charge leaks over time: refresh every 64 ms
Cycle time roughly twice access time
Need to precharge bitlines before access

DRAM Chip Organization
[Figure: same DRAM chip organization diagram]
Current generation DRAM
  8 Gbit @ 25 nm
  Up to 1600 MHz synchronous interface
  Data clock 2x (3200 MHz), double-data rate so 3200 MT/s peak
Address pins are time-multiplexed
  Row address strobe (RAS)
  Column address strobe (CAS)

DRAM Chip Organization
[Figure: same DRAM chip organization diagram]
New RAS results in:
  Bitline precharge
  Row decode, sense
  Row buffer write (up to 8K)
New CAS
  Read from row buffer
  Much faster (3-4x)
Streaming row accesses desirable

Simple Main Memory
Consider these parameters:
  10 cycles to send address
  60 cycles to access each word
  10 cycles to send word back
Miss penalty for a 4-word block: (10 + 60 + 10) x 4 = 320 cycles
How can we speed this up?

Wider (Parallel) Main Memory
Make memory wider
  Read out all words in parallel
Memory parameters:
  10 cycles to send address
  60 cycles to access a double word
  10 cycles to send it back
Miss penalty for a 4-word block: 2 x (10 + 60 + 10) = 160 cycles
Costs:
  Wider bus
  Larger minimum expansion unit (e.g. paired DIMMs)

Interleaved Main Memory
Break memory into M banks
  Word A is in bank A mod M, at location A div M within that bank
Banks can operate concurrently and independently
Each bank has:
  Private address lines
  Private data lines
  Private control lines (read/write)
[Figure: four banks (Bank 0 to Bank 3) and the address fields Byte in Word, Word in Doubleword, Bank, Doubleword in bank]

Interleaved and Parallel Organization
[Figure: four DRAM organizations, serial vs. parallel and non-interleaved vs. interleaved]

Interleaved Memory Examples
Ai = address to bank i
Ti = data transfer
[Figure: timing diagrams. Unit stride: A0, A1, A2, A3 start accesses in banks 0, 1, 2, 3, followed by transfers T0 to T3. Stride 3: the accesses visit banks 0, 3, 2, 1, followed by transfers T0 to T3.]
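The miss-penalty arithmetic for the simple and wider organizations, and the bank mapping rule (word A in bank A mod M at location A div M), can be checked with a short sketch. The function names are made up for illustration, and the model assumes accesses are fully serialized, as in the slide arithmetic.

```python
def miss_penalty(block_words, words_per_access,
                 send_addr=10, access=60, send_data=10):
    """Cycles to fetch a block when each access returns `words_per_access`
    words and successive accesses do not overlap (slide parameters)."""
    accesses = -(-block_words // words_per_access)  # ceiling division
    return accesses * (send_addr + access + send_data)


def bank_and_location(word_addr, num_banks):
    """Word A lives in bank A mod M, at location A div M."""
    return word_addr % num_banks, word_addr // num_banks


def banks_touched(start, stride, count, num_banks):
    """Bank visited by each of `count` accesses with the given stride."""
    return [bank_and_location(start + i * stride, num_banks)[0]
            for i in range(count)]


print(miss_penalty(4, 1))         # simple memory: 320
print(miss_penalty(4, 2))         # double-word-wide memory: 160
print(banks_touched(0, 1, 4, 4))  # unit stride: [0, 1, 2, 3]
print(banks_touched(0, 3, 4, 4))  # stride 3:    [0, 3, 2, 1]
print(banks_touched(0, 4, 4, 4))  # stride 4:    [0, 0, 0, 0]
```

The stride-4 case shows why a stride that shares a factor with the number of banks is a problem: every access lands in the same bank, so the banks cannot work concurrently.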

Interleaved Memory Summary
Parallel memory adequate for sequential accesses
  Load cache block: multiple sequential words
  Good for writeback caches
Banking useful otherwise
  If many banks, choose a prime number
Can also do both
  Within each bank: parallel memory path
  Across banks
    Can support multiple concurrent cache accesses (nonblocking)

DDR SDRAM Control
Raise level of abstraction: commands
Activate row
  Read row into row buffer
Column access
  Read data from addressed row
Bank Precharge
  Get ready for new row access

[Figure: DDR SDRAM with banks 0 to N-1, each with address input, row decoder, memory array, sense amplifiers (row buffer), column decoder, and data output; bank state diagram with Idle and Active states and the transitions Row Activation, Column Access, and Bank Precharge]

DDR SDRAM Timing
Read access
[Figure: read access timing diagram with Clock, Command (CMD), and Data signals]

Constructing a Memory System
Combine chips in parallel to increase access width
  E.g. eight 8-bit wide DRAMs for a 64-bit parallel access
  DIMM: Dual Inline Memory Module
Combine DIMMs to form multiple ranks
Attach a number of DIMMs to a memory channel
  Memory Controller manages a channel (or two lock-step channels)
Interleave patterns:
  Rank, Row, Bank, Column, [byte]
  Row, Rank, Bank, Column, [byte]
    Better dispersion of addresses
    Works better with power-of-two ranks

Memory Controller and Channel
[Figure: one DDR SDRAM controller driving one channel with DIMM 0, DIMM 1, and DIMM 2; every DRAM chip on each DIMM contains banks B0 to B3; the channel carries chip (DIMM) select, data, and address and command signals]

Memory Controllers
Contains buffering
  In both directions
Schedulers manage resources
  Channel and banks
[Figure: memory controller block diagram. Labels: Cache Commands and Addresses, Cache Data Bus, Arrival Time Assignment, Cache Line Read Buffer, Cache Line Write Buffer, Transaction Buffer, Bank 0 Requests ... Bank n-1 Requests, per-bank schedulers, Channel Scheduler, SDRAM Command/Address Bus, SDRAM Data Bus, Control Path, Command/Response Path, Data Path]

Resource Scheduling
An interesting optimization problem
Example:
  Precharge: 3 cycles
  Row activate: 3 cycles
  Column access: 1 cycle
  FR-FCFS: 20 cycles
  StrictFIFO: 56 cycles
Request Sequence (Bank, Row, Column):
  (0,0,0) (0,1,0) (0,0,1) (0,1,3) (1,0,0) (1,1,1) (1,0,0) (1,1,2)
[Figure: bank state diagram (Idle, Active; Row Activation, Column Access, Bank Precharge) and a per-request command timeline over cycles 0 to 19, where P = bank Precharge, A = row Activation, C = Column Access]

DDR SDRAM Policies
Goal: try to maximize requests to an open row (page)
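The benefit of serving requests to an open row together can be illustrated with the request sequence from the scheduling example. This is a simplified sketch, not the scheduler from the slides: it assumes commands never overlap across banks and that every bank initially needs a precharge and activate. Under those assumptions strict FIFO order reproduces the 56-cycle figure. The row-hit-first reordering here is only a stand-in for FR-FCFS; it gives 32 cycles because it does not model the overlap of precharge and activate in one bank with column accesses in another, which a real FR-FCFS schedule also exploits.

```python
PRECHARGE, ACTIVATE, COLUMN = 3, 3, 1   # cycle costs from the example

# (bank, row, column) in arrival order
REQUESTS = [(0, 0, 0), (0, 1, 0), (0, 0, 1), (0, 1, 3),
            (1, 0, 0), (1, 1, 1), (1, 0, 0), (1, 1, 2)]


def serialized_cycles(requests):
    """Total cycles when commands are issued one after another.

    Assumption: the first access to each bank is a row miss
    (precharge + activate), like any access to a row that is not open.
    """
    open_row = {}                       # bank -> currently open row
    cycles = 0
    for bank, row, _col in requests:
        if open_row.get(bank) != row:   # row miss: P, A, then C
            cycles += PRECHARGE + ACTIVATE
            open_row[bank] = row
        cycles += COLUMN                # row hit costs only C
    return cycles


def row_hit_first(requests):
    """Serve the oldest request, then every pending request to the same
    (bank, row), before moving on to the next oldest request."""
    pending = list(requests)
    order = []
    while pending:
        bank, row, _col = pending[0]
        order += [r for r in pending if r[:2] == (bank, row)]
        pending = [r for r in pending if r[:2] != (bank, row)]
    return order


print(serialized_cycles(REQUESTS))                 # strict FIFO: 56
print(serialized_cycles(row_hit_first(REQUESTS)))  # row hits grouped: 32
```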

Close row policy
  Always close row, hides precharge penalty
  Lost opportunity if next access to same row
Open row policy
  Leave row open
  If an access to a different row, then penalty for precharge
Also performance issues related to rank interleaving
  Better dispersion of addresses

Memory Scheduling Contest
Clean, simple infrastructure
Traces provided
Very easy to make fair comparisons
Comes with 6 schedulers
Also targets power-down modes (not just page open/close scheduling)
Three tracks:
  1. Delay (or Performance)
  2. Energy-Delay Product (EDP)
  3. Performance-Fairness Product (PFP)

Future: Hybrid Memory Cube
Micron proposal [Pawlowski, Hot Chips 11]

Hybrid Memory Cube MCM
Micron proposal [Pawlowski, Hot Chips 11]

Network of DRAM
Traditional DRAM: star topology
HMC: mesh, etc. are feasible

Hybrid Memory Cube
High-speed logic segregated in chip stack
3D TSV for bandwidth

High Bandwidth Memory (HBM)
[Shmuel Csaba Otto Traian]
High-speed serial links vs. 2.5D silicon interposer
Commercialized, HBM2/HBM3 on the way

Future: Resistive memory
PCM: store bit in phase state of material
Alternatives:
  Memristor (HP Labs)
  STT-MRAM
Nonvolatile
Dense: crosspoint architecture (no access device)
Relatively fast for read
Very slow for write (also high power)
Write endurance often limited
  Write leveling (also done for flash)
  Avoid redundant writes (read, cmp, write)
  Fix individual bit errors (write, read, cmp, fix)

Main Memory and Virtual Memory
Use of virtual memory
  Main memory becomes another level in the memory hierarchy
  Enables programs with address space or working set that exceed physically available memory
    No need for programmer to manage overlays, etc.
    Sparse use of large address space is OK
  Allows multiple users or programs to timeshare limited amount of physical memory space and address space
Bottom line: efficient use of expensive resource, and ease of programming

Virtual Memory
Enables
  Use more memory than system has
  Think program is only one running
    Don't have to manage address space usage across programs
    E.g. think it always starts at address 0x0
  Memory protection
    Each program has private VA space: no one else can clobber it
  Better performance
    Start running a large program before all of it has been loaded from disk

Virtual Memory Placement
Main memory managed in larger blocks
  Page size typically 4K-16K
Fully flexible placement; fully associative
  Operating system manages placement
  Indirection through page table
  Maintain mapping between:
    Virtual address (seen by programmer)
    Physical address (seen by main memory)

Virtual Memory Placement
Fully associative implies expensive lookup?
  In caches, yes: check multiple tags in parallel
In virtual memory, expensive lookup is avoided by using a level of indirection
  Lookup table or hash table
  Called a page table

Virtual Memory Identification
  Virtual Address   Physical Address   Dirty bit
  0x20004000        0x2000             Y/N
Similar to cache tag array
  Page table entry contains VA, PA, dirty bit
Virtual address:
  Matches programmer view; based on register values
  Can be the same for multiple programs sharing same system, without conflicts
Physical address:
  Invisible to programmer, managed by O/S
  Created/deleted on demand basis, can change

Virtual Memory Replacement
Similar to caches:
  FIFO
  LRU; overhead too high
    Approximated with reference bit checks
    Clock algorithm intermittently clears all bits
  Random
O/S decides, manages
  CS537

Virtual Memory Write Policy
Write back
  Disks are too slow to write through
Page table maintains dirty bit
  Hardware must set dirty bit on first write
  O/S checks dirty bit on eviction
  Dirty pages written to backing store
    Disk write, 10+ ms
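The reference-bit approximation of LRU mentioned under replacement can be sketched as follows. This is the common sweeping-hand variant of the clock algorithm, which clears reference bits as the hand passes over them rather than clearing all bits at once; the class name and interface are assumptions for illustration.

```python
class ClockReplacer:
    """Approximate LRU using one reference bit per physical frame."""

    def __init__(self, num_frames):
        self.pages = [None] * num_frames   # page held by each frame
        self.ref = [False] * num_frames    # reference bit per frame
        self.hand = 0                      # next frame to examine

    def access(self, page):
        """Return True on a hit, False on a page fault."""
        if page in self.pages:
            self.ref[self.pages.index(page)] = True   # set on reference
            return True
        # Fault: sweep the hand, clearing reference bits, until a frame
        # whose bit is already clear is found. That frame is the victim.
        while self.ref[self.hand]:
            self.ref[self.hand] = False
            self.hand = (self.hand + 1) % len(self.pages)
        self.pages[self.hand] = page
        self.ref[self.hand] = True
        self.hand = (self.hand + 1) % len(self.pages)
        return False


clock = ClockReplacer(3)
hits = [clock.access(p) for p in [1, 2, 3, 4, 2, 5]]
print(hits)          # [False, False, False, False, True, False]
print(clock.pages)   # [4, 2, 5]
```

Page 2 survives the final fault because it was referenced after the previous sweep, while page 3, which was not, is evicted.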

Virtual Memory Implementation
Caches have fixed policies, hardware FSM for control, pipeline stall
VM has very different miss penalties
  Remember disks are 10+ ms!
Hence engineered differently

Page Faults
A virtual memory miss is a page fault
  Physical memory location does not exist
  Exception is raised, save PC
  Invoke OS page fault handler
    Find a physical page (possibly evict)
    Initiate fetch from disk
  Switch to other task that is ready to run
  Interrupt when disk access complete
  Restart original instruction
Why use O/S and not hardware FSM?
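The page-fault sequence above, combined with the write-back policy, can be modeled in a few lines. This is a toy sketch under stated assumptions: eviction is FIFO for simplicity, the disk fetch and task switch are reduced to comments, and the `DemandPager` class is invented for illustration.

```python
from collections import OrderedDict


class DemandPager:
    """Toy demand-paging model with present and dirty state per page."""

    def __init__(self, num_frames):
        self.num_frames = num_frames
        self.resident = OrderedDict()   # virtual page -> dirty bit, in load order
        self.faults = 0
        self.writebacks = 0

    def access(self, vpage, write=False):
        if vpage not in self.resident:              # page fault
            self.faults += 1
            if len(self.resident) == self.num_frames:
                # Find a physical page: evict the oldest resident page.
                _victim, dirty = self.resident.popitem(last=False)
                if dirty:
                    self.writebacks += 1            # dirty page goes to backing store
            # Fetch from disk. A real OS would run another task meanwhile
            # and restart the faulting instruction when the disk interrupts.
            self.resident[vpage] = False            # page arrives clean
        if write:
            self.resident[vpage] = True             # hardware sets the dirty bit


pager = DemandPager(num_frames=2)
pager.access(0, write=True)
pager.access(1)
pager.access(2)          # evicts dirty page 0: one write-back
pager.access(0)          # evicts clean page 1: no write-back
print(pager.faults, pager.writebacks)   # 4 1
```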

Address Translation
  VA           PA       Dirty   Ref   Protection
  0x20004000   0x2000   Y/N     Y/N   Read/Write/Execute
O/S and hardware communicate via PTE
How do we find a PTE?
  &PTE = PTBR + page number * sizeof(PTE)
  PTBR is private for each program
    Context switch replaces PTBR contents

Address Translation
[Figure: address translation datapath; labels: Virtual Page Number, Offset, PTBR, +, D, VA]
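The PTE address formula and the resulting translation can be written out directly. The sketch assumes 4 KB pages and 4-byte PTEs, and uses a Python dict in place of the in-memory table; the PTBR value and the mapping in the example are made up.

```python
PAGE_SIZE = 4096   # 4 KB pages (assumption)
PTE_SIZE = 4       # bytes per page table entry (assumption)


def pte_address(ptbr, vaddr):
    """&PTE = PTBR + page number * sizeof(PTE)."""
    return ptbr + (vaddr // PAGE_SIZE) * PTE_SIZE


def translate(page_table, vaddr):
    """Replace the virtual page number with the physical page number and
    keep the page offset unchanged. A missing entry raises KeyError,
    which stands in for a page fault."""
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    ppn = page_table[vpn]
    return ppn * PAGE_SIZE + offset


page_table = {0x20004: 0x2}                        # virtual page -> physical page
print(hex(pte_address(0x100000, 0x20004ABC)))      # 0x180010
print(hex(translate(page_table, 0x20004ABC)))      # 0x2abc
```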

Page Table Size
How big is page table?
  2^32 / 4K * 4B = 4 MB per program (!)
  Much worse for 64-bit machines
To make it smaller
  Use limit register(s)
    If VA exceeds limit, invoke O/S to grow region
  Use a multi-level page table
  Make the page table pageable (use VM)

Multilevel Page Table
[Figure: multilevel page table walk; labels: Offset, PTBR, and an adder at each level]
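The size arithmetic, and the reason a multi-level table helps for sparse address spaces, can be sketched as below. The two-level layout (10-bit outer index, 10-bit inner index, 12-bit offset for a 32-bit VA) is an assumed split chosen for illustration, and nested dicts stand in for tables in memory.

```python
def flat_table_bytes(va_bits, page_bytes, pte_bytes):
    """One PTE per virtual page, whether or not the page is used."""
    return (2 ** va_bits // page_bytes) * pte_bytes


class TwoLevelTable:
    """Assumed split of a 32-bit VA: 10-bit outer, 10-bit inner, 12-bit offset."""

    def __init__(self):
        self.outer = {}                      # outer index -> inner table

    def map(self, vpn, ppn):
        self.outer.setdefault(vpn >> 10, {})[vpn & 0x3FF] = ppn

    def lookup(self, vaddr):
        vpn, offset = vaddr >> 12, vaddr & 0xFFF
        inner = self.outer[vpn >> 10]        # first memory reference
        ppn = inner[vpn & 0x3FF]             # second memory reference
        return (ppn << 12) | offset

    def table_bytes(self, pte_bytes=4):
        # One outer table plus only the inner tables actually allocated,
        # each with 1024 entries.
        return (1 + len(self.outer)) * 1024 * pte_bytes


print(flat_table_bytes(32, 4096, 4))         # 4194304 bytes = 4 MB
print(flat_table_bytes(64, 4096, 8) == 2 ** 55)   # True: far too large

table = TwoLevelTable()
table.map(0x00000, 0x5)                      # lowest virtual page
table.map(0xFFFFF, 0x9)                      # highest virtual page
print(hex(table.lookup(0xFFFFF123)))         # 0x9123
print(table.table_bytes())                   # 12288 bytes for this sparse mapping
```

The price of the smaller footprint is the extra memory reference per level on each table walk.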

Hashed Page Table
Use a hash table or inverted page table
  PT contains an entry for each real address
    Instead of entry for every virtual address
  Entry is found by hashing VA
  Oversize PT to reduce collisions: #PTE = 4 x (#phys. pages)

Hashed Page Table
[Figure: hashed page table lookup; labels: Virtual Page Number, Offset, PTBR, Hash, PTE0, PTE1, PTE2]
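A hashed page table sized at four entries per physical page might look like the sketch below. Collision handling by linear probing and the use of Python's built-in `hash` are assumptions for illustration; real designs typically hash into a group of PTEs and search or chain within it.

```python
class HashedPageTable:
    def __init__(self, num_phys_pages):
        self.size = 4 * num_phys_pages       # oversized 4x to reduce collisions
        self.slots = [None] * self.size      # each slot: (vpn, ppn) or None

    def _probe(self, vpn):
        # Python's hash() stands in for the hardware hash function.
        start = hash(vpn) % self.size
        for i in range(self.size):
            yield (start + i) % self.size    # linear probing (assumption)

    def insert(self, vpn, ppn):
        for idx in self._probe(vpn):
            if self.slots[idx] is None or self.slots[idx][0] == vpn:
                self.slots[idx] = (vpn, ppn)
                return
        raise MemoryError("page table full")

    def lookup(self, vpn):
        """Return the physical page number, or None for an unmapped page."""
        for idx in self._probe(vpn):
            entry = self.slots[idx]
            if entry is None:
                return None                  # empty slot ends the search
            if entry[0] == vpn:              # the tag must match the full VPN
                return entry[1]
        return None


table = HashedPageTable(num_phys_pages=4)    # 16 slots
table.insert(3, 7)
table.insert(19, 8)                          # collides with VPN 3 (19 mod 16 == 3)
print(table.lookup(3), table.lookup(19), table.lookup(35))   # 7 8 None
```

Because each entry stores the full virtual page number, a colliding page is detected by the tag compare and costs extra probes, which is the source of the additional memory references on hash collisions.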

High-Performance VM
VA translation
  Additional memory reference to PTE
  Each instruction fetch/load/store now 2 memory references
    Or more, with multilevel table or hash collisions
  Even if PTEs are cached, still slow
Hence, use special-purpose cache for PTEs
  Called TLB (translation lookaside buffer)
  Caches PTE entries
  Exploits temporal and spatial locality (just a cache)

Translation Lookaside Buffer
[Figure: two TLB organizations; labels: Tag, Index]
Set associative (a) or fully associative (b)
Both widely employed

Interaction of TLB and Cache
Serial lookup: first TLB then D-cache
  Excessive cycle time
Virtually Indexed Physically Tagged L1
  Parallel lookup of TLB and cache
  Faster cycle time
  Index bits must be untranslated
    Restricts size of n-associative cache to n x (virtual page size)
    E.g. 4-way SA cache with 4KB pages max. size is 16KB

Virtual Memory Protection
Each process/program has private virtual address space
  Automatically protected from rogue programs
Sharing is possible, necessary, desirable
  Avoid copying, staleness issues, etc.
Sharing in a controlled manner
  Grant specific permissions
    Read
    Write
    Execute
    Any combination

Protection
Process model
  Privileged kernel
  Independent user processes
Privileges vs. policy
  Architecture provided primitives
  OS implements policy
  Problems arise when h/w implements policy
Separate policy from mechanism!

Protection Primitives
User vs kernel
  At least one privileged mode
  Usually implemented as mode bits
How do we switch to kernel mode?
  Protected gates or system calls
  Change mode and continue at pre-determined address
Hardware to compare mode bits to access rights
  Only access certain resources in kernel mode
  E.g. modify page mappings

Protection Primitives
Base and bounds
  Privileged registers
  base <= address <= bounds

Segmentation
  Multiple base and bound registers
  Protection bits for each segment
Page-level protection (most widely used)
  Protection bits in page table entry
  Cache them in TLB for speed

VM Sharing
Share memory locations by:
  Map shared physical location into both address spaces:
    E.g. PA 0xC00DA becomes:
      VA 0x2D000DA for process 0
      VA 0x4D000DA for process 1
  Either process can read/write shared location
However, causes synonym problem

VM Homonyms
Process-private address space
  Same VA can map to multiple PAs:
    E.g. VA 0xC00DA becomes:
      PA 0x2D000DA for process 0
      PA 0x4D000DA for process 1
  Either process can install line into the cache
However, causes homonym problem

Virtually-Addressed Caches
Virtually-addressed caches are desirable
  No need to translate VA to PA before cache lookup
  Faster hit time, translate only on misses
However, VA homonyms & synonyms cause problems
  Can end up with homonym blocks in the cache
  Can end up with two copies of same physical line
    Causes coherence problems [Wang et al. reading]
Solutions to homonyms:
  Flush caches/TLBs on context switch
  Extend cache tags to include PID or ASID
    Effectively a shared VA space (PID becomes part of address)
  Enforce global shared VA space (PowerPC)
    Requires another level of addressing (EA->VA->PA)
Solutions to synonyms:
  Prevent multiple copies through reverse address translation
  Or, keep pointers in PA L2 cache [Wang et al.]

Additional issues
Large page support
  Most ISAs support 4K/1M/1G
  Page table & TLB designs must support
Renewed interest in segments as an alternative
  Recent work from Multifacet [Basu thesis, 2013][Gandhi thesis, 2016]
  Can be complementary to paging
Multiple levels of translation in virtualized systems
  Virtual machines run unmodified OS
  Each OS manages translations, page tables
  Hypervisor manages translations across VMs
  Hardware still has to provide efficient translation

Summary: Main Memory
DRAM chips
Memory organization
  Interleaving
  Banking
Memory controller design
Hybrid Memory Cube
Phase Change Memory (reading)
Virtual memory
TLBs
Interaction of caches and virtual memory (Wang et al.)
Large pages, virtualization
