Based on kernel version 4.9. Page generated on 2016-12-21 14:33 EST.
EDAC - Error Detection And Correction
=====================================

"bluesmoke" was the name for this device driver when it
was "out-of-tree" and maintained at sourceforge.net -
bluesmoke.sourceforge.net. That site is mostly archaic now and can be
used only for historical purposes.

When the subsystem was pushed into 2.6.16 for the first time, it was
renamed to 'EDAC'.

PURPOSE
-------

The 'edac' kernel module's goal is to detect and report hardware errors
that occur within the computer system running under Linux.

MEMORY
------

Memory Correctable Errors (CE) and Uncorrectable Errors (UE) are the
primary errors being harvested. These types of errors are harvested by
the 'edac_mc' device.

Detecting CE events, then harvesting those events and reporting them,
*can* be, but is not necessarily, a predictor of future UE events. With
CE events only, the system can and will continue to operate as no data
has been damaged yet.

However, preventive maintenance and proactive part replacement of memory
DIMMs exhibiting CEs can reduce the likelihood of the dreaded UE events
and system panics.

OTHER HARDWARE ELEMENTS
-----------------------

A new feature for EDAC, the edac_device class of device, was added in
the 2.6.23 version of the kernel.

This new device type allows for non-memory types of ECC hardware detectors
to have their states harvested and presented to userspace via the sysfs
interface.

Some architectures have ECC detectors for L1, L2 and L3 caches,
along with DMA engines, fabric switches, main data path switches,
interconnections, and various other hardware data paths. If the hardware
reports it, then an edac_device device probably can be constructed to
harvest and present that to userspace.


PCI BUS SCANNING
----------------

In addition, PCI devices are scanned for PCI Bus Parity and SERR Errors
in order to determine if errors are occurring during data transfers.

The presence of PCI Parity errors must be examined with a grain of salt.
There are several add-in adapters that do *not* follow the PCI specification
with regard to Parity generation and reporting. The specification says
the vendor should tie the parity status bits to 0 if they do not intend
to generate parity. Some vendors do not do this, and thus the parity bit
can "float" giving false positives.

There is a PCI device attribute located in sysfs that is checked by
the EDAC PCI scanning code. If that attribute is set, PCI parity/error
scanning is skipped for that device. The attribute is:

        broken_parity_status

and is located in /sys/devices/pci<XXX>/0000:XX:YY.Z directories for
PCI devices.


VERSIONING
----------

EDAC is composed of a "core" module (edac_core.ko) and several Memory
Controller (MC) driver modules. On a given system, the CORE is loaded
and one MC driver will be loaded. Both the CORE and the MC driver (or
edac_device driver) have individual versions that reflect the current
release level of their respective modules.

Thus, to "report" on what version a system is running, one must report
both the CORE's and the MC driver's versions.


LOADING
-------

If 'edac' was statically linked with the kernel then no loading
is necessary. If 'edac' was built as modules then simply modprobe
the 'edac' pieces that you need. You should be able to modprobe
hardware-specific modules and have the dependencies load the necessary
core modules.

Example:

$> modprobe amd76x_edac

loads both the amd76x_edac.ko memory controller module and the edac_core.ko
core module.
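
One way to confirm which EDAC pieces ended up loaded after a modprobe is to
filter the module list. The following is a minimal sketch, assuming the usual
/proc/modules layout in which the module name is the first field; the function
name list_edac_modules and its optional file argument are illustrative and
not part of EDAC.

```shell
#!/bin/sh
# List loaded modules whose name contains "edac".
# Reads a file in /proc/modules format, where the module name is the
# first whitespace-separated field. The optional argument (a path to
# such a file) is an assumption made so the function is easy to try out.
list_edac_modules() {
    modules_file="${1:-/proc/modules}"
    awk '$1 ~ /edac/ { print $1 }' "$modules_file"
}
```

Calling list_edac_modules with no argument reads /proc/modules on the running
system. If EDAC was statically linked, no modules will be listed.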


SYSFS INTERFACE
---------------

EDAC presents a 'sysfs' interface for control and reporting purposes. It
lives in the /sys/devices/system/edac directory.

Within this directory there currently reside 2 components:

        mc      memory controller(s) system
        pci     PCI control and status system



Memory Controller (mc) Model
----------------------------

Each 'mc' device controls a set of DIMM memory modules. These modules
are laid out in a Chip-Select Row (csrowX) and Channel table (chX).
There can be multiple csrows and multiple channels.

Memory controllers allow for several csrows, with 8 csrows being a
typical value. Yet, the actual number of csrows depends on the layout of
a given motherboard, memory controller and DIMM characteristics.

Dual channels allow for 128 bit data transfers to/from the CPU from/to
memory. Some newer chipsets allow for more than 2 channels, like Fully
Buffered DIMMs (FB-DIMMs). The following example will assume 2 channels:


                  Channel 0       Channel 1
        ===================================
        csrow0  | DIMM_A0       | DIMM_B0 |
        csrow1  | DIMM_A0       | DIMM_B0 |
        ===================================

        ===================================
        csrow2  | DIMM_A1       | DIMM_B1 |
        csrow3  | DIMM_A1       | DIMM_B1 |
        ===================================

In the above example table there are 4 physical slots on the motherboard
for memory DIMMs:

        DIMM_A0
        DIMM_B0
        DIMM_A1
        DIMM_B1

Labels for these slots are usually silk-screened on the motherboard.
Slots labeled 'A' are channel 0 in this example. Slots labeled 'B' are
channel 1. Notice that there are two csrows possible on a physical DIMM.
These csrows are allocated their csrow assignment based on the slot into
which the memory DIMM is placed. Thus, when 1 DIMM is placed in each
Channel, the csrows cross both DIMMs.

Memory DIMMs come single or dual "ranked". A rank is a populated csrow.
Thus, 2 single ranked DIMMs, placed in slots DIMM_A0 and DIMM_B0 above
will have 1 csrow, csrow0. csrow1 will be empty. On the other hand,
when 2 dual ranked DIMMs are similarly placed, then both csrow0 and
csrow1 will be populated. The pattern repeats itself for csrow2 and
csrow3.

The representation of the above is reflected in the directory
tree in EDAC's sysfs interface. Starting in directory
/sys/devices/system/edac/mc each memory controller will be represented
by its own 'mcX' directory, where 'X' is the index of the MC.


        ..../edac/mc/
                   |
                   |->mc0
                   |->mc1
                   |->mc2
                   ....

Under each 'mcX' directory each 'csrowX' is again represented by a
'csrowX', where 'X' is the csrow index:


        .../mc/mc0/
                |
                |->csrow0
                |->csrow2
                |->csrow3
                ....

Notice that there is no csrow1, which indicates that csrow0 is composed
of single ranked DIMMs. This should also apply in both Channels, in
order to have dual-channel mode be operational. Since both csrow2 and
csrow3 are populated, this indicates a dual ranked set of DIMMs for
channels 0 and 1.


Within each of the 'mcX' and 'csrowX' directories are several EDAC
control and attribute files.


'mcX' directories
-----------------

In 'mcX' directories are EDAC control and attribute files for
this 'X' instance of the memory controllers.

For a description of the sysfs API, please see:
        Documentation/ABI/testing/sysfs-devices-edac



'csrowX' directories
--------------------

When CONFIG_EDAC_LEGACY_SYSFS is enabled, sysfs will contain the csrowX
directories. As this API doesn't work properly for Rambus, FB-DIMMs and
modern Intel Memory Controllers, this is being deprecated in favor of
dimmX directories.

In the 'csrowX' directories are EDAC control and attribute files for
this 'X' instance of csrow:


Total Uncorrectable Errors count attribute file:

        'ue_count'

        This attribute file displays the total count of uncorrectable
        errors that have occurred on this csrow. If panic_on_ue is set
        this counter will not have a chance to increment, since EDAC
        will panic the system.


Total Correctable Errors count attribute file:

        'ce_count'

        This attribute file displays the total count of correctable
        errors that have occurred on this csrow. This count is very
        important to examine. CEs provide early indications that a
        DIMM is beginning to fail. This count should be monitored
        for non-zero values, and any such values reported to the
        system administrator.


Total memory managed by this csrow attribute file:

        'size_mb'

        This attribute file displays, in count of megabytes, the memory
        that this csrow contains.


Memory Type attribute file:

        'mem_type'

        This attribute file will display what type of memory is currently
        on this csrow. Normally, either buffered or unbuffered memory.
        Examples:
                Registered-DDR
                Unbuffered-DDR


EDAC Mode of operation attribute file:

        'edac_mode'

        This attribute file will display what type of Error detection
        and correction is being utilized.


Device type attribute file:

        'dev_type'

        This attribute file will display what type of DRAM device is
        being utilized on this DIMM.
        Examples:
                x1
                x2
                x4
                x8


Channel 0 CE Count attribute file:

        'ch0_ce_count'

        This attribute file will display the count of CEs on this
        DIMM located in channel 0.


Channel 0 UE Count attribute file:

        'ch0_ue_count'

        This attribute file will display the count of UEs on this
        DIMM located in channel 0.


Channel 0 DIMM Label control file:

        'ch0_dimm_label'

        This control file allows this DIMM to have a label assigned
        to it. With this label in the module, when errors occur
        the output can provide the DIMM label in the system log.
        This becomes vital for panic events to isolate the
        cause of the UE event.

        DIMM Labels must be assigned after booting, with information
        that correctly identifies the physical slot with its
        silk screen label. This information is currently very
        motherboard specific and determination of this information
        must occur in userland at this time.


Channel 1 CE Count attribute file:

        'ch1_ce_count'

        This attribute file will display the count of CEs on this
        DIMM located in channel 1.


Channel 1 UE Count attribute file:

        'ch1_ue_count'

        This attribute file will display the count of UEs on this
        DIMM located in channel 1.


Channel 1 DIMM Label control file:

        'ch1_dimm_label'

        This control file allows this DIMM to have a label assigned
        to it. With this label in the module, when errors occur
        the output can provide the DIMM label in the system log.
        This becomes vital for panic events to isolate the
        cause of the UE event.

        DIMM Labels must be assigned after booting, with information
        that correctly identifies the physical slot with its
        silk screen label. This information is currently very
        motherboard specific and determination of this information
        must occur in userland at this time.
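
The per-csrow 'ce_count' and 'ue_count' files described above lend themselves
to a simple monitoring script. The following is a minimal sketch, assuming the
legacy csrowX directories are present (CONFIG_EDAC_LEGACY_SYSFS); the function
name report_csrow_errors and its optional root-directory argument are
illustrative and not part of EDAC.

```shell
#!/bin/sh
# Print "mcX/csrowY ce=<n> ue=<n>" for every csrow directory found under
# an EDAC memory-controller root, and return non-zero if any csrow has a
# non-zero CE or UE count. The root argument is an assumption added so
# the function can be pointed at a copy of the tree.
report_csrow_errors() {
    root="${1:-/sys/devices/system/edac/mc}"
    status=0
    for csrow in "$root"/mc*/csrow*; do
        [ -d "$csrow" ] || continue        # the glob matched nothing
        ce=$(cat "$csrow/ce_count" 2>/dev/null || echo 0)
        ue=$(cat "$csrow/ue_count" 2>/dev/null || echo 0)
        ce=${ce:-0}
        ue=${ue:-0}
        mc=$(basename "$(dirname "$csrow")")
        echo "$mc/$(basename "$csrow") ce=$ce ue=$ue"
        if [ "$ce" -gt 0 ] || [ "$ue" -gt 0 ]; then
            status=1
        fi
    done
    return "$status"
}
```

A cron job or monitoring agent could call report_csrow_errors and alert the
system administrator when it returns non-zero.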



SYSTEM LOGGING
--------------

If logging for UEs and CEs is enabled, then system logs will contain
information indicating that errors have been detected:

EDAC MC0: CE page 0x283, offset 0xce0, grain 8, syndrome 0x6ec3, row 0,
channel 1 "DIMM_B1": amd76x_edac

EDAC MC0: CE page 0x1e5, offset 0xfb0, grain 8, syndrome 0xb741, row 0,
channel 1 "DIMM_B1": amd76x_edac


The structure of the message, using the first example above, is:
        the memory controller                   (MC0)
        Error type                              (CE)
        memory page                             (0x283)
        offset in the page                      (0xce0)
        the byte granularity                    (grain 8)
                or resolution of the error
        the error syndrome                      (0x6ec3)
        memory row                              (row 0)
        memory channel                          (channel 1)
        DIMM label, if set prior                (DIMM_B1)
        and then an optional, driver-specific message that may
                have additional information.

Both UEs and CEs with no info will lack all but memory controller, error
type, a notice of "no info" and then an optional, driver-specific error
message.


PCI Bus Parity Detection
------------------------

On Header Type 00 devices, the primary status is looked at for any
parity error regardless of whether parity is enabled on the device or
not. (The spec indicates parity is generated in some cases.) On Header
Type 01 bridges, the secondary status register is also looked at to see
if a parity error occurred on the bus on the other side of the bridge.


SYSFS CONFIGURATION
-------------------

Under /sys/devices/system/edac/pci are control and attribute files as follows:


Enable/Disable PCI Parity checking control file:

        'check_pci_parity'


        This control file enables or disables the PCI Bus Parity scanning
        operation. Writing a 1 to this file enables the scanning. Writing
        a 0 to this file disables the scanning.

        Enable:
        echo "1" >/sys/devices/system/edac/pci/check_pci_parity

        Disable:
        echo "0" >/sys/devices/system/edac/pci/check_pci_parity


Parity Count:

        'pci_parity_count'

        This attribute file will display the number of parity errors that
        have been detected.



MODULE PARAMETERS
-----------------

Panic on UE control file:

        'edac_mc_panic_on_ue'

        An uncorrectable error will cause a machine panic. This is usually
        desirable. It is a bad idea to continue when an uncorrectable error
        occurs - it is indeterminate what was uncorrected and the operating
        system context might be so mangled that continuing will lead to further
        corruption. If the kernel has MCE configured, then EDAC will never
        notice the UE.

        LOAD TIME: module/kernel parameter: edac_mc_panic_on_ue=[0|1]

        RUN TIME:  echo "1" > /sys/module/edac_core/parameters/edac_mc_panic_on_ue


Log UE control file:

        'edac_mc_log_ue'

        Generate kernel messages describing uncorrectable errors. These errors
        are reported through the system message log system. UE statistics
        will be accumulated even when UE logging is disabled.

        LOAD TIME: module/kernel parameter: edac_mc_log_ue=[0|1]

        RUN TIME: echo "1" > /sys/module/edac_core/parameters/edac_mc_log_ue


Log CE control file:

        'edac_mc_log_ce'

        Generate kernel messages describing correctable errors. These
        errors are reported through the system message log system.
        CE statistics will be accumulated even when CE logging is disabled.

        LOAD TIME: module/kernel parameter: edac_mc_log_ce=[0|1]

        RUN TIME: echo "1" > /sys/module/edac_core/parameters/edac_mc_log_ce


Polling period control file:

        'edac_mc_poll_msec'

        The time period, in milliseconds, for polling for error information.
        Too small a value wastes resources.
        Too large a value might delay necessary handling of errors and
        might lose valuable information for locating the error. 1000
        milliseconds (once each second) is the current default. Systems
        which require all the bandwidth they can get may increase this.

        LOAD TIME: module/kernel parameter: edac_mc_poll_msec=<milliseconds>

        RUN TIME: echo "1000" > /sys/module/edac_core/parameters/edac_mc_poll_msec


Panic on PCI PARITY Error:

        'panic_on_pci_parity'


        This control file enables or disables panicking when a parity
        error has been detected.


        module/kernel parameter: edac_panic_on_pci_pe=[0|1]

        Enable:
        echo "1" > /sys/module/edac_core/parameters/edac_panic_on_pci_pe

        Disable:
        echo "0" > /sys/module/edac_core/parameters/edac_panic_on_pci_pe



EDAC device type
----------------

In the header file, edac_core.h, there is a series of edac_device structures
and APIs for the EDAC_DEVICE.

User space access to an edac_device is through the sysfs interface.

At the location /sys/devices/system/edac (sysfs) new edac_device devices will
appear.

There is a three level tree beneath the above 'edac' directory. For example,
the 'test_device_edac' device (found at the bluesmoke.sourceforge.net website)
installs itself as:

        /sys/devices/system/edac/test-instance

In this directory are various controls, a symlink and one or more 'instance'
directories.

The standard default controls are:

        log_ce          boolean to log CE events
        log_ue          boolean to log UE events
        panic_on_ue     boolean to 'panic' the system if a UE is encountered
                        (default off, can be set true via startup script)
        poll_msec       time period between POLL cycles for events

The test_device_edac device adds at least one custom control of its own:

        test_bits       which in the current test driver does nothing but
                        show how it is installed.
                        A ported driver can add one or more such controls
                        and/or attributes for specific uses.
                        One out-of-tree driver uses controls here to allow
                        for ERROR INJECTION operations to hardware
                        injection registers.

The symlink points to the 'struct dev' that is registered for this edac_device.

INSTANCES
---------

One or more instance directories are present. For the 'test_device_edac' case:

        test-instance0


In this directory there are two default counter attributes, which are totals of
the counters in deeper subdirectories.

        ce_count        total of CE events of subdirectories
        ue_count        total of UE events of subdirectories

BLOCKS
------

At the lowest directory level is the 'block' directory. There can be 0, 1
or more blocks specified in each instance.

        test-block0


In this directory the default attributes are:

        ce_count        which is the counter of CE events for this 'block'
                        of hardware being monitored
        ue_count        which is the counter of UE events for this 'block'
                        of hardware being monitored


The 'test_device_edac' device adds 4 attributes and 1 control:

        test-block-bits-0       for every POLL cycle this counter
                                is incremented
        test-block-bits-1       every 10 cycles, this counter is bumped once,
                                and test-block-bits-0 is set to 0
        test-block-bits-2       every 100 cycles, this counter is bumped once,
                                and test-block-bits-1 is set to 0
        test-block-bits-3       every 1000 cycles, this counter is bumped once,
                                and test-block-bits-2 is set to 0


        reset-counters          writing ANY thing to this control will
                                reset all the above counters.


Use of the 'test_device_edac' driver should enable any others to create their own
unique drivers for their hardware systems.

The 'test_device_edac' sample driver is located at the
bluesmoke.sourceforge.net project site for EDAC.
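
The three-level layout described above (device, instance, block) can be walked
with a short script. The following is a minimal sketch, assuming that every
instance and block directory carries the default 'ce_count' and 'ue_count'
files; the function name dump_edac_device and its argument are illustrative
and not part of EDAC.

```shell
#!/bin/sh
# Walk an edac_device directory (device / instance / block) and print the
# CE and UE counters of each instance, followed by those of its blocks.
# A subdirectory is treated as an instance or block only if it contains a
# ce_count file, so the symlink and other entries are skipped.
dump_edac_device() {
    dev_dir="$1"
    for inst in "$dev_dir"/*/; do
        [ -f "${inst}ce_count" ] || continue
        echo "$(basename "$inst") ce=$(cat "${inst}ce_count") ue=$(cat "${inst}ue_count")"
        for blk in "$inst"*/; do
            [ -f "${blk}ce_count" ] || continue
            echo "  $(basename "$blk") ce=$(cat "${blk}ce_count") ue=$(cat "${blk}ue_count")"
        done
    done
}
```

For the test driver described above, the call might look like
dump_edac_device /sys/devices/system/edac/test-instance, with each block line
indented beneath its instance.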


NEHALEM USAGE OF EDAC APIs
--------------------------

This chapter documents some EXPERIMENTAL mappings of the EDAC API to handle
the Nehalem EDAC driver. They will likely be changed in future versions
of the driver.

Due to the way Nehalem exports Memory Controller data, some adjustments
were made in the i7core_edac driver. This chapter covers those differences.

1) On Nehalem, there is one Memory Controller per Quick Path Interconnect
   (QPI). In the driver, the term "socket" means one QPI. This is
   associated with a physical CPU socket.

   Each MC has 3 physical read channels, 3 physical write channels and
   3 logic channels. The driver currently sees it as just 3 channels.
   Each channel can have up to 3 DIMMs.

   The minimum known unit is the DIMM. There is no information about csrows.
   As the EDAC API treats the csrow as the minimum unit, the driver
   sequentially maps channel/dimm into different csrows.

   For example, supposing the following layout:
        Ch0 phy rd0, wr0 (0x063f4031): 2 ranks, UDIMMs
          dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400
          dimm 1 1024 Mb offset: 4, bank: 8, rank: 1, row: 0x4000, col: 0x400
        Ch1 phy rd1, wr1 (0x063f4031): 2 ranks, UDIMMs
          dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400
        Ch2 phy rd3, wr3 (0x063f4031): 2 ranks, UDIMMs
          dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400
   The driver will map it as:
        csrow0: channel 0, dimm0
        csrow1: channel 0, dimm1
        csrow2: channel 1, dimm0
        csrow3: channel 2, dimm0

   That is, the driver exports one DIMM per csrow.

   Each QPI is exported as a different memory controller.

2) Nehalem MC has the ability to generate errors.
   The driver implements this functionality via some error injection nodes:

   For injecting a memory error, there are some sysfs nodes, under
   /sys/devices/system/edac/mc/mc?/:

   inject_addrmatch/*:
      Controls the error injection mask register. It is possible to specify
      several characteristics of the address to match an error code:
         dimm = the affected dimm. Numbers are relative to a channel;
         rank = the memory rank;
         channel = the channel that will generate an error;
         bank = the affected bank;
         page = the page address;
         column (or col) = the address column.
      Each of the above values can be set to "any" to match any valid value.

      At driver init, all values are set to any.

      For example, to generate an error at rank 1 of dimm 2, for any channel,
      any bank, any page, any column:
                echo 2 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/dimm
                echo 1 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/rank

        To return to the default behaviour of matching any, you can do:
                echo any >/sys/devices/system/edac/mc/mc0/inject_addrmatch/dimm
                echo any >/sys/devices/system/edac/mc/mc0/inject_addrmatch/rank

   inject_eccmask:
       specifies which bits will be affected.

   inject_section:
       specifies which ECC cache section will get the error:
                3 for both
                2 for the highest
                1 for the lowest

   inject_type:
       specifies the type of error, being a combination of the following bits:
                bit 0 - repeat
                bit 1 - ecc
                bit 2 - parity

   inject_enable starts the error generation when a value other
   than 0 is written.

   All inject vars can be read. Root permission is needed to write them.

   The datasheet states that the error will only be generated after a write
   to an address that matches inject_addrmatch. It seems, however, that
   reading will also produce an error.
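
   Since inject_type is a combination of bits, the value to write can be
   computed from the flag names listed above. The following is a minimal
   sketch; the helper name inject_type_value is illustrative and not part of
   the driver, and it only computes the number without touching sysfs.

```shell
#!/bin/sh
# Compose an inject_type value from the flag names listed above:
# repeat = bit 0 (value 1), ecc = bit 1 (value 2), parity = bit 2 (value 4).
# Prints the combined value; returns non-zero for an unknown flag name.
inject_type_value() {
    value=0
    for flag in "$@"; do
        case "$flag" in
            repeat) value=$((value | 1)) ;;
            ecc)    value=$((value | 2)) ;;
            parity) value=$((value | 4)) ;;
            *) echo "unknown flag: $flag" >&2; return 1 ;;
        esac
    done
    echo "$value"
}
```

   For instance, inject_type_value repeat ecc prints 3, which could then be
   written to the inject_type node of the chosen memory controller.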

   For example, the following code will generate an error for any write access
   at socket 0, on any DIMM/address on channel 2:

   echo 2 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/channel
   echo 2 >/sys/devices/system/edac/mc/mc0/inject_type
   echo 64 >/sys/devices/system/edac/mc/mc0/inject_eccmask
   echo 3 >/sys/devices/system/edac/mc/mc0/inject_section
   echo 1 >/sys/devices/system/edac/mc/mc0/inject_enable
   dd if=/dev/mem of=/dev/null seek=16k bs=4k count=1 >& /dev/null

   For socket 1, replace "mc0" with "mc1" in the above commands.

   The generated error message will look like:

   EDAC MC0: UE row 0, channel-a= 0 channel-b= 0 labels "-": NON_FATAL (addr = 0x0075b980, socket=0, Dimm=0, Channel=2, syndrome=0x00000040, count=1, Err=8c0000400001009f:4000080482 (read error: read ECC error))

3) Nehalem specific Corrected Error memory counters

   Nehalem has some registers to count memory errors. The driver uses those
   registers to report Corrected Errors on devices with Registered DIMMs.

   However, those counters don't work with Unregistered DIMMs. As the chipset
   offers some counters that also work with UDIMMs (but with a worse level of
   granularity than the default ones), the driver exposes those registers for
   UDIMM memories.

   They can be read by looking at the contents of all_channel_counts/

   $ for i in /sys/devices/system/edac/mc/mc0/all_channel_counts/*; do echo $i; cat $i; done
        /sys/devices/system/edac/mc/mc0/all_channel_counts/udimm0
        0
        /sys/devices/system/edac/mc/mc0/all_channel_counts/udimm1
        0
        /sys/devices/system/edac/mc/mc0/all_channel_counts/udimm2
        0

   What happens here is that errors on different csrows, but at the same
   dimm number will increment the same counter.
   So, in this memory mapping:
        csrow0: channel 0, dimm0
        csrow1: channel 0, dimm1
        csrow2: channel 1, dimm0
        csrow3: channel 2, dimm0
   The hardware will increment udimm0 for an error at the first dimm at either
        csrow0, csrow2 or csrow3;
   The hardware will increment udimm1 for an error at the second dimm at either
        csrow0, csrow2 or csrow3;
   The hardware will increment udimm2 for an error at the third dimm at either
        csrow0, csrow2 or csrow3;

4) Standard error counters

   The standard error counters are generated when an mcelog error is received
   by the driver. Since, with UDIMMs, this is counted by software, it is
   possible that some errors could be lost. With RDIMMs, they display the
   contents of the registers.

AMD64_EDAC REFERENCE DOCUMENTS USED
-----------------------------------
amd64_edac module is based on the following documents
(available from http://support.amd.com/en-us/search/tech-docs):

1. Title:  BIOS and Kernel Developer's Guide for AMD Athlon 64 and AMD
           Opteron Processors
   AMD publication #: 26094
   Revision: 3.26
   Link: http://support.amd.com/TechDocs/26094.PDF

2. Title:  BIOS and Kernel Developer's Guide for AMD NPT Family 0Fh
           Processors
   AMD publication #: 32559
   Revision: 3.00
   Issue Date: May 2006
   Link: http://support.amd.com/TechDocs/32559.pdf

3. Title:  BIOS and Kernel Developer's Guide (BKDG) For AMD Family 10h
           Processors
   AMD publication #: 31116
   Revision: 3.00
   Issue Date: September 07, 2007
   Link: http://support.amd.com/TechDocs/31116.pdf

4. Title: BIOS and Kernel Developer's Guide (BKDG) for AMD Family 15h
          Models 30h-3Fh Processors
   AMD publication #: 49125
   Revision: 3.06
   Issue Date: 2/12/2015 (latest release)
   Link: http://support.amd.com/TechDocs/49125_15h_Models_30h-3Fh_BKDG.pdf

5.
   Title: BIOS and Kernel Developer's Guide (BKDG) for AMD Family 15h
          Models 60h-6Fh Processors
   AMD publication #: 50742
   Revision: 3.01
   Issue Date: 7/23/2015 (latest release)
   Link: http://support.amd.com/TechDocs/50742_15h_Models_60h-6Fh_BKDG.pdf

6. Title: BIOS and Kernel Developer's Guide (BKDG) for AMD Family 16h
          Models 00h-0Fh Processors
   AMD publication #: 48751
   Revision: 3.03
   Issue Date: 2/23/2015 (latest release)
   Link: http://support.amd.com/TechDocs/48751_16h_bkdg.pdf

CREDITS:
========

Written by Doug Thompson <dougthompson@xmission.com>
7 Dec 2005
17 Jul 2007     Updated

(c) Mauro Carvalho Chehab
05 Aug 2009     Nehalem interface

EDAC authors/maintainers:

        Doug Thompson, Dave Jiang, Dave Peterson et al,
        Mauro Carvalho Chehab
        Borislav Petkov
        original author: Thayne Harbaugh