September 28, 2006

Tips & tricks

Red Hat's customer service and support teams receive technical support questions from users all over the world. Red Hat technicians add the questions and answers to Red Hat Knowledgebase on a daily basis. Access to Red Hat Knowledgebase is free. Red Hat Magazine offers a preview into the Red Hat Knowledgebase by highlighting some of the most recent entries.

How do Red Hat Enterprise Linux 4 Update 4 and above support machine check exceptions (MCE) on the revision F AMD Opteron chip?

Red Hat Enterprise Linux 4 Update 4 supports the AMD Opteron Rev F MCE threshold counters.

Revision F of the AMD Opteron processor adds support for MCE threshold counters for DRAM. These counters allow a user with root access to specify a threshold of correctable ECC errors that can be taken from the DRAM controller before an MCE is issued. This feature lets administrators of large server systems ignore infrequent ECC errors caused by cosmic radiation but be alerted via the MCE mechanism when a DRAM chip is failing.

Three major sections of note are:

  1. Sysfs Interfaces

    With Update 4, the sysfs interface will be created under:

    /sys/devices/system/threshold/threshold[i]/bank[j]
    

    where [i] is the number of the CPU on which the threshold register is located and [j] is the number of the MCA bank that the threshold register belongs to.

    There may be up to five banks per CPU; however, currently only the fifth bank, MC4_MISC, contains a valid threshold counter for DRAM ECC errors.

    The following files will be created per valid threshold register:

    error_count (R/W)
    - read: output the current error count in hex
    - write: reset the count
    
    interrupt_enable (R/W)
    - read: output 1 if interrupt enabled, else 0
    - write: writing 0 will disable, non-0 will enable interrupts
    
    threshold_limit (R/W)
    - read: output the current threshold limit in hex
    - write: set a new threshold limit
    

    The interrupt_enable may be changed without affecting the error_count. The threshold_limit may be changed without affecting the error_count if the new limit is not below the current error count. The threshold_limit must be: 0x0 < limit < 0xFFF.

    When the error_count reaches the threshold_limit, the error_count will be fixed at the threshold_limit and will not increment any longer.

    The user must reset the error_count in order for counting to resume (see the shell sketch after this list).

  2. Threshold Interrupt

    When the error_count reaches the threshold_limit and interrupt_enable is set, the processor will generate an interrupt with THRESHOLD_APIC_VECTOR. The driver services the interrupt by logging an entry to the mcelog with a software-defined bank number.

  3. Mcelog

    The mcelog resides in /dev/mcelog and can be read directly by the user, or by the user-land program 'mcelog', which decodes various machine check exception dumps.

    Example#1: decodes a fatal machine check exception message in ASCII to stdout

    bash# mcelog --k8 --ascii
    

    Example#2: redirects output to system log

    bash# mcelog --syslog --k8 /dev/mcelog

    Refer to the mcelog manpage for more detailed usage.
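As an illustration of the sysfs interface described in section 1, here is a minimal shell sketch. The directory shown (CPU 0, MCA bank 4) follows the layout quoted above and may differ between kernel builds, and the values written are examples only.

    bash# cd /sys/devices/system/threshold/threshold0/bank4
    bash# cat error_count                # current correctable ECC error count, in hex
    bash# echo 256 > threshold_limit     # set a new limit; must satisfy 0x0 < limit < 0xFFF
    bash# echo 1 > interrupt_enable      # generate an MCE interrupt when the limit is reached
    bash# echo 0 > error_count           # any write resets the count so counting resumes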

How do I find out whether a given piece of PCI hardware is handled by the current kernel?

Each PCI device is flagged with a constructor (vendor) and a model (device) identifier that make it unique. A list of such IDs is available on SourceForge. A kernel module can detect the hardware it should handle using those identifiers. A list of such IDs mapped to module names is available in the modules.pcimap file that comes with a given kernel (/lib/modules/`uname -r`/modules.pcimap).
Detecting the constructor and model IDs can be done with the lspci command. 'lspci' on its own lists the available hardware in human-readable form. 'lspci -n' gives the same list with the actual constructor and model IDs. Find the piece of hardware that needs to be checked with lspci, then find the line with the same PCI address (the leftmost column) in the output of 'lspci -n'. If there is a line matching those IDs in modules.pcimap, chances are that this piece of hardware will be handled by the kernel.
Example Scenario: Will an Intel Corporation 82546GB Gigabit Ethernet card be handled by a Red Hat Enterprise Linux 4 system running a 2.6.9-34.0.1.EL kernel?

lspci gives:

02:01.0 Ethernet controller: Intel Corporation 82546GB Gigabit Ethernet Controller (rev 03)

lspci -n gives this result for the '02:01.0' device:
02:01.0 Class 0200: 8086:1079 (rev 03)

The constructor (vendor) code is 8086 (Intel Corporation). The model (device) ID is 1079.
The /lib/modules/2.6.9-34.0.1.EL/modules.pcimap file contains the following line, which matches those numbers:
e1000                0x00008086 0x00001079 0xffffffff 0xffffffff 0x00000000 0x00000000 0x0

This card should work with the e1000 module on a Red Hat Enterprise Linux 4 system with the 2.6.9-34.0.1.EL kernel.
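
Following the same steps, the check can be run from the command line as sketched below. The PCI address 02:01.0 and the 8086:1079 IDs are taken from the example above; substitute the values reported for the device being checked.

    bash# lspci                               # note the PCI address of the device (e.g. 02:01.0)
    bash# lspci -n -s 02:01.0                 # show its numeric constructor:model IDs (e.g. 8086:1079)
    bash# grep -i 0x00008086 /lib/modules/$(uname -r)/modules.pcimap | grep -i 0x00001079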

Why does one get a "cannot preserve ownership" warning when files are moved to an NFS mount point?

This happens because commands like cp and mv try to copy ACL information. The warning is reported even though there may not be any ACL information to transfer, because these programs cannot differentiate between a filesystem that does not support ACLs and one that supports them but has ACLs disabled. The warning is harmless and the file should be copied successfully.
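
As a quick illustration, the sequence below copies a file to a hypothetical NFS mount at /mnt/nfs. The exact wording of the warning varies between coreutils versions, but the copy itself completes.

    bash# cp -p /etc/hosts /mnt/nfs/          # a "cannot preserve ownership"-style warning may be printed
    bash# ls -l /mnt/nfs/hosts                # the file is there despite the warning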

Why am I getting SCSI errors when using device mapper multipathing with my MSA series and StorageWorks SAN?

The MSA series and StorageWorks Storage Area Networks (SANs) are mostly active/passive SANs, which means that one path is always active and the other path does nothing until the active side fails. It is not possible to safely use device mapper multipathing on these SANs, as it requires a special driver that is not provided with Red Hat Enterprise Linux 4. The vendor provides special firmware that changes the SAN to active/active, which allows it to work with device mapper multipathing. Please contact the vendor if you wish to use device mapper multipathing with this SAN.
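
For reference, on a system where device mapper multipathing is configured, the current multipath topology and the state of each path can be listed as shown below; this only reports the active/passive grouping described above, it does not work around it.

    bash# multipath -ll                       # list multipath devices, their path groups and path states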

The information provided in this article is for your information only. The origin of this information may be internal or external to Red Hat. While Red Hat attempts to verify the validity of this information before it is posted, Red Hat makes no express or implied claims to its validity.

This article is protected by the Open Publication License, V1.0 or later. Copyright © 2006 by Red Hat, Inc.