Spring 2014 CS-590.26 Lecture F Bruce Jacob David Wang University of Crete SLIDE 1

# UNIVERSITY OF MARYLAND

### **DRAM Reliability:** Parity, ECC, Chipkill, Scrubbing

### **Alpha Particles:**

- Soft errors were a big problem for early DRAM chips.
- Low-energy alpha particles were discovered to be the culprit, but where were they coming from?
- A paper published by Intel in 1979 caused the industry to pay close attention to material purity in silicon processing and packaging.
- Now largely considered a "solved problem".

#### **Terrestrial Neutrons:**

- High-energy cosmic rays originate in space, but ...
- collisions with the atmosphere generate secondary particles; "terrestrial neutrons" make up the main part of the flux.
- The neutron flux depends on altitude.
- IBM claims 5950 failures per billion device-hours at sea level, and 0 failures in an underground vault with 50 feet of rock completely shielding the test setup.
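To put the sea-level figure in perspective, here is a rough conversion from failures per billion device-hours to expected failures per year. The device count of 1000 is a made-up illustration, not a number from the slides, and the calculation assumes a constant failure rate.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def expected_failures_per_year(failures_per_billion_hours, num_devices):
    """Expected failures per year, assuming a constant failure rate."""
    return failures_per_billion_hours * num_devices * HOURS_PER_YEAR / 1e9


per_device = expected_failures_per_year(5950, 1)     # roughly 0.05 per device-year
per_system = expected_failures_per_year(5950, 1000)  # hypothetical 1000-device system

print(f"per device: {per_device:.4f} failures/year")
print(f"1000 devices: {per_system:.1f} failures/year")
```

Under these assumptions a single device fails about once every 19 years, while a 1000-device installation would see roughly one failure per week.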
# **Parity: "For Farmers"**

- Detects any odd number of bit errors (an even number goes unnoticed)
- No error correction capability
- Overhead: 1 bit per byte

### **Error Correcting Code I**

- Also based on "parity checking", but more sophisticated
- Error detection AND correction capability
- Overhead: depends on the scheme

### **Error Correcting Code IIa**

#### **Single-bit Error Correction (SEC)**

Requires n+1 check bits to provide SEC for 2<sup>n</sup> data bits.

### **Error Correcting Code IIb**

#### **SEC** Encoding Example

$$D = \{11001110\} \longrightarrow R = \{011110011110\}$$

Bit positions are numbered 0001 to 1100 from the left. Check bits sit at positions 0001, 0010, 0100 and 1000; the data bits fill the remaining positions in order.

### **Error Correcting Code IIc**

#### **SEC** Verification Example

Stored:

$$R = \{011110011110\}$$

Read back:

$$R = \{011110011100\}$$

One bit error. Can we detect and correct?

#### **Recompute check bits**

Here + is addition modulo 2 (XOR).

$$R_{0001} = R_{0011} + R_{0101} + R_{0111} + R_{1001} + R_{1011} = 1 + 1 + 0 + 1 + 0 = 1$$

$$R_{0010} = R_{0011} + R_{0110} + R_{0111} + R_{1010} + R_{1011} = 1 + 0 + 0 + 1 + 0 = 0$$

$$R_{0100} = R_{0101} + R_{0110} + R_{0111} + R_{1100} = 1 + 0 + 0 + 0 = 1$$

$$R_{1000} = R_{1001} + R_{1010} + R_{1011} + R_{1100} = 1 + 1 + 0 + 0 = 0$$

#### XOR old check bits against new check bits

| | $R_{1000}$ | $R_{0100}$ | $R_{0010}$ | $R_{0001}$ | |
|---|------------|------------|------------|------------|----------------------------------------|
| | 1 | 1 | 1 | 0 | Old |
| + | 0 | 1 | 0 | 1 | New |
| | 1 | 0 | 1 | 1 | Difference! Bit position 11 is rotten |

### **Error Correcting Code IIIa**

#### What about multi-bit errors?

Stored:

$$R = \{011110011110\}$$

Read back:

$$R = \{011110011101\}$$

Multi-bit error. Can we detect and correct?

#### Recompute check bits

$$R_{0001} = R_{0011} + R_{0101} + R_{0111} + R_{1001} + R_{1011} = 1 + 1 + 0 + 1 + 0 = 1$$

$$R_{0010} = R_{0011} + R_{0110} + R_{0111} + R_{1010} + R_{1011} = 1 + 0 + 0 + 1 + 0 = 0$$

$$R_{0100} = R_{0101} + R_{0110} + R_{0111} + R_{1100} = 1 + 0 + 0 + 1 = 0$$

$$R_{1000} = R_{1001} + R_{1010} + R_{1011} + R_{1100} = 1 + 1 + 0 + 1 = 1$$

#### XOR old check bits against new check bits

| | $R_{1000}$ | $R_{0100}$ | $R_{0010}$ | $R_{0001}$ | |
|---|------------|------------|------------|------------|-------------|
| | 1 | 1 | 1 | 0 | Old |
| + | 1 | 0 | 0 | 1 | New |
| | 0 | 1 | 1 | 1 | Difference! |

Oops: the difference points to bit position 7, but bit position 7 is NOT rotten. SEC alone miscorrects a 2-bit error.

### **Error Correcting Code IIIb**

#### What about multi-bit errors?

Single Error Correction, Double Error Detection (SECDED) requires n+2 check bits to provide SECDED for 2<sup>n</sup> data bits.

### **Error Correcting Code IIIc**

#### What about multi-bit errors - Redux

The leading bit is R<sub>0</sub>, an even-parity bit over the whole word.

Stored:

$$R = \{0011110011110\}$$

Read back:

$$R = \{0011110011101\}$$

Multi-bit error. Can we detect and correct?

#### Recompute check bits

$$R_{0001} = R_{0011} + R_{0101} + R_{0111} + R_{1001} + R_{1011} = 1 + 1 + 0 + 1 + 0 = 1$$

$$R_{0010} = R_{0011} + R_{0110} + R_{0111} + R_{1010} + R_{1011} = 1 + 0 + 0 + 1 + 0 = 0$$

$$R_{0100} = R_{0101} + R_{0110} + R_{0111} + R_{1100} = 1 + 0 + 0 + 1 = 0$$

$$R_{1000} = R_{1001} + R_{1010} + R_{1011} + R_{1100} = 1 + 1 + 0 + 1 = 1$$

#### XOR old check bits against new check bits

| | $R_{1000}$ | $R_{0100}$ | $R_{0010}$ | $R_{0001}$ | |
|---|------------|------------|------------|------------|-------------|
| | 1 | 1 | 1 | 0 | Old |
| + | 1 | 0 | 0 | 1 | New |
| | 0 | 1 | 1 | 1 | Difference! |

The **XOR** of the check bits tells us there is an error, but the R<sub>0</sub> parity says all is well. This is a 2-bit error; it cannot be corrected.

# **Error Correcting Code IV**

- **SECDED** needs n + 2 check bits to protect 2<sup>n</sup> data bits
- A data bus width of $64 = 2^6$ means 6 + 2 = 8 check bits to provide SECDED protection
- Logic depth of n + 1 = 7 to compute the XOR parity for the $0^{th}$ bit
- May cost additional cycle(s) of read latency

# Weaknesses of ECC?

The error rate is given in failures per bit. There are always more DRAM storage bits in the next-generation system.

Memory Systems Architecture and Performance Analysis Spring 2005 ENEE 759H Lecture12.fm Bruce Jacob David Wang University of Maryland ECE Dept.

#### **Multi-bit Error Correction I**

The elements of GF(2²), written as 2-bit column vectors:

$$0 = \begin{bmatrix} 0 \\ 0 \end{bmatrix} \qquad 1 = \begin{bmatrix} 0 \\ 1 \end{bmatrix} \qquad \alpha = \begin{bmatrix} 1 \\ 0 \end{bmatrix} \qquad \alpha^2 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$$

Parity check matrix in GF(2²). Apply transform matrices:

$$T_0 = \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix} \qquad T_1 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \qquad T_{\alpha} = \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix} \qquad T_{\alpha^2} = \begin{bmatrix} 0 & 1 \\ 1 & 1 \end{bmatrix}$$

FIGURE 30.12: Locating a single bit and 2-adjacent bit error in a 64-bit word.
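As a rough sanity check of the GF(2²) vectors and transform matrices above, the sketch below verifies that multiplying a symbol's column vector by T<sub>x</sub> gives the same result as multiplying the symbol by x in the field. It assumes the field is built on the reduction rule α² = α + 1 and that the top vector entry is the coefficient of α; the helper names (`mat_vec`, `gf4_mul`, `matrices_match_field`) are illustrative and do not come from the slides.

```python
# Symbols as (top, bottom) column vectors: value = top*alpha + bottom.
VEC = {"0": (0, 0), "1": (0, 1), "a": (1, 0), "a2": (1, 1)}

# Transform matrices from the slide, stored row by row.
T = {
    "0": ((0, 0), (0, 0)),
    "1": ((1, 0), (0, 1)),
    "a": ((1, 1), (1, 0)),
    "a2": ((0, 1), (1, 1)),
}


def mat_vec(m, v):
    """2x2 matrix times 2-vector over GF(2)."""
    return tuple((row[0] & v[0]) ^ (row[1] & v[1]) for row in m)


def gf4_mul(x, y):
    """Multiply two GF(4) elements, assuming alpha^2 = alpha + 1."""
    a, b = x
    c, d = y
    # (a*al + b)(c*al + d) = ac*al^2 + (ad + bc)*al + bd
    #                      = (ac + ad + bc)*al + (ac + bd)
    return ((a & c) ^ (a & d) ^ (b & c), (a & c) ^ (b & d))


def matrices_match_field():
    """True if T[x] applied to y equals x*y for every pair of symbols."""
    return all(
        mat_vec(T[x], VEC[y]) == gf4_mul(VEC[x], VEC[y])
        for x in VEC
        for y in VEC
    )


print(matrices_match_field())
```

Seen this way, a 2-adjacent bit error is just one nonzero symbol error: the three nonzero vectors correspond to a flip of the lower bit, the upper bit, or both bits of a pair.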
#### A two-bit error in positions 32,33 results in syndrome 11110011

Table 30.3: Error location table for the 2-adjacent error correction algorithm, taken from US Patent #5,490,155 (Compaq's Advanced ECC implementation). Rows are indexed by syndrome bits S3 S2 S1 S0, columns by syndrome bits S7 S6 S5 S4. C0 to C7 denote check bits; an entry "a,b" marks a 2-adjacent error.

| S3 S2 S1 S0 | 0000 | 0001 | 0010 | 0011 | 0100 | 0101 | 0110 | 0111 | 1000 | 1001 | 1010 | 1011 | 1100 | 1101 | 1110 | 1111 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0000 | | C4 | C5 | | C6 | 5 | 3 | 1 | C7 | 0 | 4 | 2 | | 2,3 | 0,1 | 4,5 |
| 0001 | C0 | 51 | 49 | 47 | 63 | 33 | | | 61 | | 28 | | 59 | | | 30,31 |
| 0010 | C1 | 46 | 50 | 48 | 58 | 31 | | | 62 | | 32 | | 60 | | | 28,29 |
| 0011 | | 48,49 | 46,47 | 50,51 | 60,61 | 29 | | | 58,59 | | | | 62,63 | | | 32,33 |
| 0100 | C2 | 57 | 52 | 54,55 | 11 | 35 | | | 9 | 19 | | | 7 | 17 | | |
| 0101 | 45 | 39 | 23 | 21 | 37 | | | | | | | | | | | |
| 0110 | 43 | | | | | | | | 24 | | | | | | | 12,13 |
| 0111 | 41 | | | | | | | | | | 14 | | 26,27 | | | |
| 1000 | C3 | 55 | 56 | 52,53 | 6 | | 16 | | 10 | | 34 | | 8 | | 18 | |
| 1001 | 40 | | | | 27 | | | | | | | | | | | 14,15 |
| 1010 | 44 | 20 | 38 | 22 | | | | | 36 | | | | | | | |
| 1011 | 42 | | | | | | | | | | | | 24,25 | | | |
| 1100 | | 53 | 54 | 56,57 | 8,9 | | | 18,19 | 6,7 | | | 16,17 | 10,11 | | | 34,35 |
| 1101 | 42,43 | | | | 25 | | | | | | 12 | | | | | |
| 1110 | 40,41 | | | | | 15 | | | 26 | | | | | | | |
| 1111 | 44,45 | 22,23 | 20,21 | 38,39 | | | | | | | | | 36,37 | | | |

A syndrome of 11110011 (column 1111, row 0011) points to bad bits 32,33.

#### Multi-bit Error Correction II

- Each pair of bit positions is treated as a single symbol.
- Combine with bit steering to cover failures across address boundaries.
- Different algorithms exist, with varying levels of complexity.
- Should try to work within the established framework of (72, 64) DIMMs.
- Otherwise, custom memory modules are needed for specialized systems.

### "Chipkill" I

Architect the memory system so there is no Single Point of Failure that could bring down the system.

### "Chipkill" II

SECDED requires n + 2 check bits to protect 2<sup>n</sup> data bits. Wider interface: 128 data bits need 9 check bits.

1. Deploy a more advanced algorithm to detect and repair multi-bit errors, with 128 data bits and 16 check bits, or 256:32.
2. Architect the memory system so there is no Single Point of Failure that could bring down the system.

Deploy method 1, method 2, or a combination of both to protect against multi-bit errors.

**Bit-Steering**

#### **Problems Remain**

### **Scrubbing**

The soft error model is based on Single Event Upsets caused by alpha particles or cosmic rays.

"Scrubbing" simply reads data out to the controller, corrects any correctable error(s), and writes the data back into memory before multi-bit errors build up and become uncorrectable.

#### **Serverworks Grand Champion HE**

- 128-bit ECC algorithm: 16-bit detection, 8-bit correction.
- Memory scrubbing
- Spare memory
- Memory mirroring
- Hot-plug memory card

#### **What about Rambus?**

Each "access" to DRAM is serviced by a single DRAM chip. One DRAM chip provides 8 consecutive beats of data, 16 bits wide per beat.

- The ECC version is designed with an 18-bit-wide interface. It provides SECDED protection, not chipkill.

#### **Interleaved Device Mode**

- Each chip provides 2 bits of data for every read request
- Provides effective chipkill capability when used in a multiple-channel configuration
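One way to see why spreading bits across chips matters: if no chip supplies more than one bit of any ECC word, a dead chip shows up as a single-bit error in each word, which SECDED can repair. The toy model below is only a sketch under stated assumptions. It reuses the small 8-data-bit code with check bits at positions 1, 2, 4, 8 plus an overall parity bit at index 0; the 13-chip layout, the "dead chip inverts its bits" failure model, and all function names are hypothetical and are not the actual Rambus interleaving.

```python
DATA_POS = (3, 5, 6, 7, 9, 10, 11, 12)  # positions 1, 2, 4, 8 hold check bits


def encode(data_bits):
    """8 data bits -> 13-bit SECDED word; index 0 is the overall parity bit."""
    word = [0] * 13
    for pos, bit in zip(DATA_POS, data_bits):
        word[pos] = bit
    for check in (1, 2, 4, 8):
        for pos in DATA_POS:
            if pos & check:  # check bit covers positions with this index bit set
                word[check] ^= word[pos]
    word[0] = sum(word[1:]) % 2  # even parity over all 13 bits
    return word


def decode(word):
    """Return (status, data_bits); status is 'ok', 'corrected' or 'uncorrectable'."""
    word = list(word)
    syndrome = 0
    for pos in range(1, 13):
        if word[pos]:
            syndrome ^= pos
    parity_ok = sum(word) % 2 == 0
    if syndrome == 0 and parity_ok:
        status = "ok"
    elif not parity_ok and syndrome <= 12:
        word[syndrome] ^= 1  # syndrome 0 means the parity bit itself flipped
        status = "corrected"
    else:
        status = "uncorrectable"  # nonzero syndrome with good parity: 2-bit error
    return status, [word[p] for p in DATA_POS]


def read_with_dead_chip(words, chip_of_bit, dead_chip):
    """Invert every bit supplied by the dead chip (a crude failure model)."""
    return [
        [bit ^ (chip_of_bit[i] == dead_chip) for i, bit in enumerate(w)]
        for w in words
    ]


data = [[1, 1, 0, 0, 1, 1, 1, 0], [0] * 8, [1, 0, 1, 0, 1, 0, 1, 0]]
words = [encode(d) for d in data]

one_bit_per_chip = list(range(13))               # chip i supplies bit i of each word
two_bits_per_chip = [i // 2 for i in range(13)]  # chip k supplies bits 2k and 2k+1

spread = [decode(w) for w in read_with_dead_chip(words, one_bit_per_chip, 11)]
packed = [decode(w) for w in read_with_dead_chip(words, two_bits_per_chip, 5)]

print([status for status, _ in spread])
print([status for status, _ in packed])
```

In the first layout every word is recovered; in the second, the same chip failure produces 2-bit errors that SECDED can only detect. A real design that takes 2 bits per chip per request would need those bits to land in different ECC words, or use a symbol-based code, to get the same effect.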