Excellent real-world study:
DRAM Errors in the Wild: A Large-Scale Field Study (pdf)
This paper provides the first large-scale study of DRAM
memory errors in the field. It is based on data collected
from Google’s server fleet over a period of more than two
years making up many millions of DIMM days. The DRAM
in our study covers multiple vendors, DRAM densities and
technologies (DDR1, DDR2, and FBDIMM).
The paper addresses the following questions: How common are memory errors in practice? What are their statistical properties? How are they affected by external factors, such as temperature, and system utilization? And how do they vary with chip-specific factors, such as chip density, memory technology and DIMM age?
We find that in many aspects DRAM errors in the field behave very differently than commonly assumed. For example, we observe DRAM error rates that are orders of magnitude
higher than previously reported, with FIT rates (failures in time per billion device hours) of 25,000 to 70,000 per Mbit and more than 8% of DIMMs affected per year. We provide strong evidence that memory errors are dominated by hard errors, rather than soft errors, which most previous work focuses on. We find that, out of all the factors that impact a DIMM’s error behavior in the field, temperature has a surprisingly small effect. Finally, unlike commonly feared, we don’t observe any indication that per-DIMM error rates increase with newer generations of DIMMs.
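To get a feel for what the FIT figures quoted above mean, here is a rough back-of-the-envelope sketch. It assumes a hypothetical 1 GiB DIMM (8192 Mbit) and that the per-Mbit rate scales linearly with capacity; neither assumption comes from the paper, and the function name is made up for illustration.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def expected_errors_per_year(fit_per_mbit: float, capacity_mbit: float) -> float:
    """Average errors per year for a device of the given capacity.

    FIT is failures per 10**9 device-hours; here the rate is quoted per Mbit,
    so it is multiplied by the capacity in Mbit (a linear-scaling assumption).
    """
    errors_per_hour = fit_per_mbit * capacity_mbit / 1e9
    return errors_per_hour * HOURS_PER_YEAR


# Hypothetical DIMM size, chosen only for illustration: 1 GiB = 8 * 1024 Mbit.
capacity_mbit = 8 * 1024

low = expected_errors_per_year(25_000, capacity_mbit)
high = expected_errors_per_year(70_000, capacity_mbit)
print(f"roughly {low:.0f} to {high:.0f} errors per DIMM per year on average")
```

Under these assumptions the average works out to somewhere between about 1,800 and 5,000 errors per DIMM per year. Since the abstract also says only a little over 8% of DIMMs are affected per year, that average is presumably concentrated in a minority of modules rather than spread evenly.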
Interesting that the majority of memory errors are hard errors. Hard memory errors are unrecoverable, meaning the memory must be physically replaced as failed, whereas soft memory errors can be corrected by overwriting the memory with the correct value. This suggests to me that the value of ECC is fairly limited.
There are two kinds of errors that can typically occur in a memory system. The first is called a repeatable or hard error. In this situation, a piece of hardware is broken and will consistently return incorrect results. A bit may be stuck so that it always returns "0" for example, no matter what is written to it. Hard errors usually indicate loose memory modules, blown chips, motherboard defects or other physical problems. They are relatively easy to diagnose and correct because they are consistent and repeatable.
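The difference between the two kinds of error can be illustrated with a toy model. This is only a sketch: `ToyCell` and `error_persists_after_rewrite` are made-up names, and a real memory test works on physical addresses, not a Python object.

```python
class ToyCell:
    """One byte of simulated memory.

    stuck_mask / stuck_value model a hard fault: bits set in stuck_mask
    always read back as the corresponding bits of stuck_value.
    """

    def __init__(self, stuck_mask: int = 0, stuck_value: int = 0):
        self.stuck_mask = stuck_mask & 0xFF
        self.stuck_value = stuck_value & self.stuck_mask
        self._data = 0

    def write(self, value: int) -> None:
        self._data = value & 0xFF

    def flip(self, bit: int) -> None:
        """Simulate a transient (soft) upset of a single stored bit."""
        self._data ^= 1 << bit

    def read(self) -> int:
        # Stuck bits override whatever was stored.
        return (self._data & ~self.stuck_mask & 0xFF) | self.stuck_value


def error_persists_after_rewrite(cell: ToyCell, pattern: int) -> bool:
    """Rewrite the correct value and check whether the read still disagrees."""
    cell.write(pattern)
    return cell.read() != pattern


# Soft error: a one-off bit flip that disappears once the value is rewritten.
healthy = ToyCell()
healthy.write(0xFF)
healthy.flip(3)
soft_error_seen = healthy.read() != 0xFF
soft_error_persists = error_persists_after_rewrite(healthy, 0xFF)

# Hard error: bit 3 is stuck at 0, so rewriting never helps.
faulty = ToyCell(stuck_mask=0x08, stuck_value=0x00)
hard_error_persists = error_persists_after_rewrite(faulty, 0xFF)

print(soft_error_seen, soft_error_persists, hard_error_persists)
```

Note that a bit stuck at 0 only shows up when the test pattern has a 1 in that position, which is why memory testers typically write complementary patterns.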
It appears that all of the servers in the study used ECC, so we can't know how the error rates compare between ECC and non-ECC memory.
This paper studied the incidence and characteristics of
DRAM errors in a large fleet of commodity servers. Our
study is based on data collected over more than 2 years and
covers DIMMs of multiple vendors, generations, technologies, and capacities. All DIMMs were equipped with error
correcting logic (ECC) to correct at least single bit errors.