I woke up this morning to what's a first for me; one of my systems had logged DRAM ECC error notifications. Three of them, in fact, for ... I wouldn't be too worried about a single correctable ECC error. The almost exactly ten minutes (down to a few microseconds, in fact) in between the errors being logged could be simply for RAM scrubbing happening every ten minutes; unfortunately, on this particular system, the scrub interval is not exposed as a setting.
The Wikipedia page on memory scrubbing says:
"Over 8% of DIMM modules experience at least one correctable error per year. This can be a problem for DRAM and SRAM based memories. The probability of a soft error at any individual memory bit is very small.".
"In order to not disturb regular memory requests from the CPU and thus prevent decreasing performance, scrubbing is usually only done during idle periods. As the scrubbing consists of normal read and write operations, it may increase power consumption for the memory compared to non-scrubbing operation. Therefore, scrubbing is not performed continuously but periodically. For many servers, the scrub period can be configured in the BIOS setup program."
That page links to the manual for the SuperMicro X9SRA motherboard, which explains the scrub interval:
"Patrol Scrub
Patrol Scrubbing is a process that allows the CPU to correct correctable memory errors detected on a memory module and send the correction to the requestor (the original source). When this item is set to Enabled, the North Bridge will read and write back one cache line every 16K cycles, if there is no delay caused by internal processing. By using this method, roughly 64 GB of memory behind the North Bridge will be scrubbed every day. The options are Enabled and Disabled.".
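So scrubbing is not the cause: at roughly 64 GB per day, a full pass over anything more than a few hundred megabytes of RAM takes much longer than ten minutes. It is possible that there is a bad bit. Although a failure can appear suddenly, it seems odd that it would go away and come back, especially when it occurs this frequently.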
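As a rough check of that reasoning, the arithmetic can be sketched in a few lines of Python. The only figure taken from the manual is the "roughly 64 GB per day" rate; the 16 GB memory size is a hypothetical value, and the scrub rate of the system in the question is unknown, so treat this as an estimate under those assumptions rather than a measurement.

```python
# Patrol scrub rate quoted in the SuperMicro X9SRA manual: roughly 64 GB per day.
# Assumption: the system in question scrubs at a comparable rate.
SCRUB_GB_PER_DAY = 64.0
MINUTES_PER_DAY = 24 * 60


def full_pass_minutes(installed_gb: float) -> float:
    """Minutes for the scrubber to walk all of memory once at the quoted rate."""
    return installed_gb / SCRUB_GB_PER_DAY * MINUTES_PER_DAY


def gb_covered_in(minutes: float) -> float:
    """How much memory the scrubber covers in the given number of minutes."""
    return SCRUB_GB_PER_DAY * minutes / MINUTES_PER_DAY


if __name__ == "__main__":
    # 16 GB is a hypothetical size; substitute the real amount of installed RAM.
    print(f"full pass over 16 GB: {full_pass_minutes(16) / 60:.1f} hours")
    print(f"covered in 10 minutes: {gb_covered_in(10) * 1024:.0f} MB")
```

At the quoted rate the scrubber only gets through about 455 MB in ten minutes, so it would revisit the same faulty cell every ten minutes only on a machine with less memory than that.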
"How seriously should I take this? What's a good response; order replacement RAM right away and schedule to install it ASAP, treat this as just a momentary glitch, or be on toes to replace RAM if it happens again but no specific action right now?"
Pavel Machek, who wrote the nohammer kernel module, says:
"It is fairly hard to do rowhammer by accident, so if you are hitting it, someone is probably doing it on purpose. ... Well, there's more than three orders of magnitude difference between cosmic rays and rowhammer. IIRC cosmic rays are expected to cause 2 bit flips a year... rowhammer can do bitflip in 10 minutes, and that is old version, not one of the optimized ones.".
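The "more than three orders of magnitude" remark can be checked with the two figures in the quote. This is only a sketch of the arithmetic: the 2 flips per year is the number the quote recalls from memory ("IIRC"), and the rowhammer rate assumes one flip every ten minutes, sustained for a whole year.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60

cosmic_flips_per_year = 2                         # figure recalled in the quote
rowhammer_flips_per_year = MINUTES_PER_YEAR / 10  # one flip every 10 minutes

ratio = rowhammer_flips_per_year / cosmic_flips_per_year
print(f"ratio: {ratio:.0f}x, about {math.log10(ratio):.1f} orders of magnitude")
```

Under those assumptions the gap is a little over four orders of magnitude, which is consistent with the quote.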
You can swap the RAM modules around and see whether the reported error follows the module, stays with the slot, or shows up somewhere else.
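To compare error counts before and after a swap, the Linux EDAC counters in sysfs can be read. The sketch below assumes a Linux system with an EDAC driver loaded, which exposes `ce_count` (corrected) and `ue_count` (uncorrected) files under `/sys/devices/system/edac/mc`. The function name `read_counts` is made up for this example, and the exact set of files varies by driver and kernel version.

```python
from pathlib import Path

# Standard EDAC sysfs location on Linux; only present when an EDAC driver is loaded.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")

# Per-controller and per-chip-select-row counters of corrected/uncorrected errors.
PATTERNS = (
    "mc*/ce_count",
    "mc*/ue_count",
    "mc*/csrow*/ce_count",
    "mc*/csrow*/ue_count",
)


def read_counts(root=EDAC_ROOT):
    """Return a dict such as {"mc0/ce_count": 3, "mc0/csrow1/ce_count": 3}."""
    counts = {}
    if not root.is_dir():  # no EDAC driver loaded, or not a Linux system
        return counts
    for pattern in PATTERNS:
        for path in sorted(root.glob(pattern)):
            try:
                value = int(path.read_text().strip())
            except (OSError, ValueError):
                continue  # unreadable or unexpected content: skip this counter
            counts[path.relative_to(root).as_posix()] = value
    return counts


if __name__ == "__main__":
    for name, value in read_counts().items():
        print(f"{name}: {value}")
```

Record the output, swap the modules, and compare after the next error. If the non-zero `csrow` counter moves with the module, the module is suspect. If it stays with the slot, look at the board or the memory controller.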
HPE recommends (for a faulty memory module):
"SYMPTOM: The below error message is found in the OS logs:
host1 kernel: Northbridge Error (node X): DRAM ECC error detected on the NB.
FIX:
1. Identify the Memory module number that has failed (if mentioned in the error)
2. Check IML for Error relating to Memory module. Ex Proc x slot x
3. Update System BIOS
4. If no errors are found run diagnostics and replace the memory module (5-6 loops of Memory Diagnostics to isolate the memory module)"
Suggested course of action:
- Swapping the RAM between sockets will tell you whether the fault is in a specific RAM module or in other circuitry.
- As long as you get no more than a single-bit error every few days, there is no need to panic (no rush).
- If you are getting hit every 10 minutes, you could be under attack.
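The rule of thumb above can be sketched as a small triage helper that looks at the gaps between logged errors. Everything here is an assumption made for illustration: the function name `triage`, the 15-minute and 2-day thresholds (one reading of "every 10 minutes" and "every few days"), and the example timestamps, which are not taken from the poster's log.

```python
from datetime import datetime, timedelta


def triage(timestamps,
           urgent_gap=timedelta(minutes=15),  # assumed cutoff for "every 10 minutes"
           calm_gap=timedelta(days=2)):       # assumed cutoff for "every few days"
    """Classify a list of ECC error timestamps by the shortest gap between them."""
    times = sorted(timestamps)
    if len(times) < 2:
        return "single event: keep watching"
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    shortest = min(gaps)
    if shortest <= urgent_gap:
        return "errors minutes apart: investigate now"
    if shortest >= calm_gap:
        return "errors days apart: no rush"
    return "in between: swap modules and keep monitoring"


if __name__ == "__main__":
    # Hypothetical timestamps, ten minutes apart.
    start = datetime(2019, 1, 1, 3, 0, 0)
    events = [start + timedelta(minutes=10 * i) for i in range(3)]
    print(triage(events))
```

Using the shortest gap rather than the average is a deliberately cautious choice, so that a single burst of closely spaced errors is enough to trigger the urgent case.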
See also: "Defending against Rowhammer in the kernel" and "ECCploit: ECC Memory Vulnerable to Rowhammer Attacks After All". For ARM processors, there is: "GuardION Android patches to mitigate DMA-based Rowhammer attacks on ARM".