Spring 2014

CS-590.26 Lecture C

Bruce Jacob, David Wang

University of Crete

SLIDE 1

## **CS-590.26, Spring 2014**

# High Speed Memory Systems: Architecture and Performance Analysis

# Memory System Organization and System Controller

#### Credit where credit is due:

Slides contain original artwork (© Jacob, Wang 2005)




SLIDE 2

# UNIVERSITY OF MARYLAND

#### **Memory System Organization**




SLIDE 3

#### Where is the data?



Rank?
Bank?
Row?
Column?



Rank Address = ?
Bank Address = ?

Row address = ?

**Column Address?** 




SLIDE 4

#### **Channel I**



"PC Class" memory system.

1 physical channel of DDR SDRAM



Intel i850 DRDRAM memory system: 2 physical channels, 1 logical channel



Intel 875P DDR SDRAM memory system: 2 physical channels, 1 logical channel




SLIDE 5

#### **Channel II**



Two Channels: 64 bit wide per channel



Two Channels: 64 bit wide per channel




SLIDE 6

#### Rank I



A rank is a "bank" of chips that responds to a single command and returns data.

The term "bank" was already in use for the arrays inside a DRAM chip, hence "rank".




SLIDE 7

#### Rank II





RDRAM system: <= 32 ranks




SLIDE 8

#### Bank



"Banks" of indepedent memory arrays inside of a DRAM Chip

SDRAM/DDR SDRAM system: 4 banks RDRAM system: "32" split or 16 full banks




SLIDE 9

#### Row

DRAM devices arranged in parallel in a given rank



one row spanning multiple DRAM devices




SLIDE 10

#### Column

DRAM devices arranged in parallel in a given rank



Column = Smallest unit of data moved in memory system




SLIDE 11

#### Where's the data? Part 1
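One way to answer the question is to slice the physical address into rank, bank, row, and column fields. The sketch below is a minimal illustration, not the mapping of any particular controller: the field order, the field widths, and the names `FIELDS`, `BYTE_OFFSET_BITS`, and `decode` are all assumptions, sized for a hypothetical 128 MB system with 2 ranks, 4 banks, 4096 rows, 512 columns, and a 64-bit channel.

```python
from typing import NamedTuple


class DramAddress(NamedTuple):
    rank: int
    bank: int
    row: int
    column: int


# Assumed layout, lowest-order field first. Real controllers choose their
# own orderings; this one is purely illustrative.
BYTE_OFFSET_BITS = 3  # 64-bit (8-byte) wide channel
FIELDS = (("column", 9), ("bank", 2), ("row", 12), ("rank", 1))


def decode(phys_addr: int) -> DramAddress:
    """Split a physical address into DRAM coordinates under the assumed layout."""
    bits = phys_addr >> BYTE_OFFSET_BITS  # drop the byte-within-column offset
    values = {}
    for name, width in FIELDS:
        values[name] = bits & ((1 << width) - 1)  # take the low `width` bits
        bits >>= width
    return DramAddress(**values)
```

With this layout, consecutive columns of one row sit at consecutive addresses, and the rank bit is the most significant one. Other orderings trade row locality against bank parallelism.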






SLIDE 12

#### Where's the data? Part 2








SLIDE 13



#### **Bare Chips**

#### Bare DIPs shoved into sockets: 18 chips, each x1, for an 18-bit-wide data bus




SLIDE 14

# **Memory Modules I**

#### **Organizing chips into modules**



Put chips on PCB, make a module



FPM / EDO / SDRAM / etc.




SLIDE 15

# **Memory Modules II**



same electrical contact

front side of 30 pin SIMM



back side of 30 pin SIMM



**Single Inline Memory Module** 


SLIDE 16

# **Memory Modules III**



electrically different contact

front side of DIMM



back side of DIMM



**Dual Inline Memory Module** 


SLIDE 17

## **Memory Modules IV**

#### **Registered DIMM**



One extra cycle to buffer and distribute address.

More chips (load) can be placed on module




SLIDE 18

# **Memory Modules V**

| Capacity | Device density | Number of ranks | Devices per rank | Device width | Number of banks | Number of rows | Number of columns |
|----------|----------------|-----------------|------------------|--------------|-----------------|----------------|-------------------|
| 128 MB   | 64 Mbit        | 1               | 16               | x4           | 4               | 4096           | 1024              |
| 128 MB   | 64 Mbit        | 2               | 8                | x8           | 4               | 4096           | 512               |
| 128 MB   | 128 Mbit       | 1               | 8                | x8           | 4               | 4096           | 1024              |
| 128 MB   | 256 Mbit       | 1               | 4                | x16          | 4               | 8192           | 512               |

Four different configurations for a 128 MB SDRAM DIMM
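The four rows can be cross-checked with simple arithmetic: module capacity is ranks × devices per rank × device density, device density is banks × rows × columns × device width, and the data bus width is devices per rank × device width. The sketch below does that check; the class and method names are illustrative, and the numbers are copied from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DimmConfig:
    density_mbit: int
    ranks: int
    devices_per_rank: int
    device_width: int  # data bits per device (x4, x8, x16)
    banks: int
    rows: int
    columns: int

    def module_capacity_mb(self) -> int:
        # Mbit across all devices, divided by 8 bits per byte
        return self.ranks * self.devices_per_rank * self.density_mbit // 8

    def device_bits(self) -> int:
        # Every (bank, row, column) location holds `device_width` bits
        return self.banks * self.rows * self.columns * self.device_width

    def data_bus_width(self) -> int:
        # Devices in a rank operate in parallel to fill the data bus
        return self.devices_per_rank * self.device_width


CONFIGS = [
    DimmConfig(64, 1, 16, 4, 4, 4096, 1024),
    DimmConfig(64, 2, 8, 8, 4, 4096, 512),
    DimmConfig(128, 1, 8, 8, 4, 4096, 1024),
    DimmConfig(256, 1, 4, 16, 4, 8192, 512),
]

for cfg in CONFIGS:
    print(cfg.module_capacity_mb(), cfg.device_bits() // 2**20, cfg.data_bus_width())
```

Each configuration works out to 128 MB and a 64-bit data bus, which is why the memory controller needs to know the organization, not just the capacity.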




SLIDE 19

#### **SPD: Serial Presence Detect**

# SPD: Tiny EEPROM

**Contains Parameters** 

- Speed settings
- Configurations
- Programmed by module maker






SLIDE 20

#### **Kingston SDRAM DIMM**

8 Chips. 128 Mbit each. (Infineon)



PC133 CAS 3

**Dual Inline Memory Module** 




SLIDE 21

# System Controller



Heavy demand placed on memory system

Heavier still in SMP/SMT/CMP system

System Controller == System traffic cop




SLIDE 22

#### **System Controller**



Problem remains (exacerbated?) even if controller integrated onto CPU




SLIDE 23

#### **Memory Request Overview**



<sup>\*\*</sup> Steps not required for some processors/system controllers; protocol dependent.

**Progression of a Memory Read Transaction Request Through Memory System** 




SLIDE 24

## "Memory Latency"



A: Transaction request may be delayed in Queue

B: Transaction request sent to Memory Controller

C: Transaction converted to Command Sequences (may be queued)

D: Command(s) sent to DRAM

E<sub>1</sub>: Requires only a **CAS**, or

E<sub>2</sub>: Requires **RAS + CAS**, or

E<sub>3</sub>: Requires **PRE + RAS + CAS**

F: Data is staged at controller

G: Transaction sent back to CPU

"DRAM Latency" = A + B + C + D + E + F + G




SLIDE 25

# **Small System Topologies**



Classic small system topology (Lots of systems) (including multicore + on-chip MC)



Point-to-point processor-controller system topology

(AMD Athlon/Alpha EV6/PPC 970)



Integrated system controller system topology (AMD Opteron/Alpha EV7 etc.)

represents point of synchronization\*. (for local access)




SLIDE 26


## **System Controller: Athlon**



**MRO:** Memory Request Organizer

**APC:** AGP PCI Controller block

**MCT:** Memory Controller (SDRAM/DDR/DRDRAM)


SLIDE 27

#### MRO: Memory Request Organizer

- Request crossbar responsible for scheduling memory read and write requests from BIU, PCI, AGP
- Serves as the coherence point
- Requests are reordered to minimize page conflict and maximize page hits
- Anti-starvation mechanism by aging of entries
- Arbitration bypassed during idle conditions to improve latency
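The reordering and anti-starvation ideas in the list above can be sketched as a tiny scheduler. This is a hypothetical model, not AMD's actual logic: the `Request` fields, the `MAX_AGE` threshold, and the tie-breaking rules are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(eq=False)  # compare by identity so list.remove() drops this exact entry
class Request:
    bank: int
    row: int
    age: int = 0  # scheduling rounds this request has waited


MAX_AGE = 4  # assumed starvation threshold


def pick_next(queue: list, open_rows: dict):
    """Pick the next request: aged entries first, then page hits, then FIFO."""
    if not queue:
        return None
    oldest = max(queue, key=lambda r: r.age)
    if oldest.age >= MAX_AGE:
        choice = oldest  # anti-starvation: an aged entry overrides reordering
    else:
        hits = [r for r in queue if open_rows.get(r.bank) == r.row]
        choice = hits[0] if hits else queue[0]  # prefer a page hit
    queue.remove(choice)
    for r in queue:
        r.age += 1  # everything left behind gets older
    open_rows[choice.bank] = choice.row  # the chosen row is now the open one
    return choice
```

For example, with row 7 open in bank 0, a queue holding requests for rows 5 and 7 of bank 0 serves row 7 first, unless the row 5 request has already waited `MAX_AGE` rounds.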




SLIDE 28

#### **AMD Athlon Controller:**

| Chip version     | Tech & voltage | Max core speed | Die size (pad limited) | No. of pins |
|------------------|----------------|----------------|------------------------|-------------|
| SDRAM, 1P, 2xAGP | 0.35 µm, 3.3 V | 100 MHz        | 107 mm<sup>2</sup>     | 492         |
| SDRAM, 2P, 2xAGP | 0.35 µm, 3.3 V | 100 MHz        | 130 mm<sup>2</sup>     | 656         |
| DDR, 1P, 4xAGP   | 0.25 µm, 2.5 V | 133 MHz        | 133 mm<sup>2</sup>     | 553         |
| DDR, 2P, 4xAGP   | 0.25 µm, 2.5 V | 133 MHz        |                        |             |
| RDRAM, 1P, 4xAGP | 0.25 µm, 2.5 V | 133 MHz        | 107 mm<sup>2</sup>     | 492         |




SLIDE 29

#### Cache Coherency I






SLIDE 30

#### **Cache Coherency II**



Snoop Request: Do you have cache line 0x001CA980?

Memory Fetch: Give me data for 0x001CA980.




SLIDE 31

#### **Cache Coherency IIIa**



**Snoop Response: No** 

SDRAM MCT: RAS to rank 2, bank 0, row 0x00842

SDRAM MCT: CAS to rank 2, bank 0, col 0x0C3

SDRAM MCT: Here's the data.




SLIDE 32

#### **Cache Coherency IIIb**



**Snoop Response: Yes, I have this cache line** 

SDRAM MCT: RAS to rank 2, bank 0, row 0x00842

SDRAM MCT: CAS to rank 2, bank 0, col 0x0C3

MRO: Here's the data.




SLIDE 33

#### Why worry about CC? Part 1



What if distance to DRAM is shorter than distance to cache (in another CPU)?
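A rough way to see why this matters: if the snoop and the DRAM fetch are issued in parallel, the DRAM data cannot be forwarded until the snoop response says no other cache holds the line. The sketch below is a simplified model under that assumption; the function name, its parameters, and the numbers are illustrative only.

```python
def read_latency_ns(snoop_ns: float, dram_ns: float,
                    cache_transfer_ns: float, snoop_hit: bool) -> float:
    """Latency of a read when the snoop and the DRAM fetch start together.

    Assumed model: on a snoop hit the data comes from the other CPU's cache
    after the snoop completes; on a miss the DRAM data is usable only once
    both the DRAM access and the snoop response have arrived.
    """
    if snoop_hit:
        return snoop_ns + cache_transfer_ns
    return max(snoop_ns, dram_ns)


# DRAM is "closer" than the remote cache: the snoop, not the DRAM, sets the latency.
print(read_latency_ns(snoop_ns=80.0, dram_ns=60.0, cache_transfer_ns=20.0, snoop_hit=False))
```

In this model, making the DRAM faster does nothing for the read once `dram_ns` drops below `snoop_ns`.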




SLIDE 34

# Why worry about CC? Part 2



#### Intel P6 system bus read transaction latency breakdown



Processors can grab request address off of shared bus in shared multi-drop topology

System controller rebroadcasts the request address to aid snooping in a point-to-point topology




SLIDE 35

# **Multiple Clock Domains I**



Most clock domains are integer multiples of each other




SLIDE 36

# **Multiple Clock Domains II**



What if clock domains are not integer multiples of each other?




SLIDE 37

# **Multiple Clock Domains III**








SLIDE 38

# **Multiple Clock Domains IV**



Data transfer from 100 MHz clock domain to 133 MHz clock domain (Latency Optimal)



Data transfer from 100 MHz clock domain to 133 MHz clock domain (Bandwidth Optimal)
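The cost of a non-integer ratio can be seen by computing how long data launched on each source clock edge waits for the next destination clock edge. The sketch below assumes ideal clocks whose edges coincide at t = 0, ignores setup and hold margins, and treats "133 MHz" as exactly 400/3 MHz; the function name and parameters are illustrative.

```python
from fractions import Fraction
from math import ceil


def crossing_waits_ns(src_mhz, dst_mhz, edges: int) -> list:
    """Wait from each of the first `edges` source edges to the next destination edge."""
    src_period = Fraction(1000) / src_mhz  # period in ns
    dst_period = Fraction(1000) / dst_mhz
    waits = []
    for k in range(edges):
        t = k * src_period                              # time of k-th source edge
        next_dst = ceil(t / dst_period) * dst_period    # first dst edge at or after t
        waits.append(next_dst - t)
    return waits


# 100 MHz -> 133 MHz: the 3:4 ratio makes the wait vary from edge to edge.
print(crossing_waits_ns(100, Fraction(400, 3), 3))
# 400 MHz -> 800 MHz: every source edge lines up with a destination edge.
print(crossing_waits_ns(400, 800, 3))
```

Under these assumptions the 100 MHz to 133 MHz case repeats a pattern of 0 ns, 5 ns, and 2.5 ns waits every 30 ns, while the 1:2 case never waits.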




SLIDE 39

# **Multiple Clock Domains V**



Data transfer from 400 MHz clock domain to 800 MHz clock domain (Latency Optimal)



Data transfer from 400 MHz clock domain to 800 MHz clock domain (Bandwidth Optimal)




SLIDE 40

## **Multiple Clock Domains VI**



**Processor to Processor Bus Interface** 

Fractional multipliers could impact performance, but we may not have a choice




SLIDE 41


# **Multiple Clock Domains VII**






SLIDE 42

#### **AMD Opteron**





Same steps, just all inside the same chip


SLIDE 43

#### **Summary**

- System Controller is a "traffic cop"
- Traffic cop may have to deal with clock domain synchronization issues
- Handles cache coherency for small-scale SMP configurations
- "Memory Latency" depends on lots of little things, not just speed of DRAM.

