Recovering and Troubleshooting the Managed System: Integrated Dell Remote Access Controller 6 (iDRAC6) Enterprise for Blade Servers Version 2.2 User Guide
This section explains how to perform tasks related to diagnosing and troubleshooting a remote managed system using iDRAC6 utilities. It contains the following subsections:
Trouble indicators: Helps you find messages and other system indications that can lead to a diagnosis of the problem
Problem-solving tools: Describes iDRAC6 tools that you can use to troubleshoot your system
Troubleshooting and frequently asked questions: Answers to typical situations you may encounter
Safety First For You and Your System
To perform certain procedures in this section, you must work with the chassis, the Dell PowerEdge system, or other hardware modules. Do not attempt to service the system hardware except as explained in this guide and elsewhere in your system documentation.
CAUTION: Many repairs may only be done by a certified service technician. You should only perform troubleshooting and simple repairs as authorized in your product documentation, or as directed by the online or telephone service and support team. Damage due to servicing that is not authorized by Dell is not covered by your warranty. Read and follow the safety instructions that came with the product.
Trouble Indicators
This section describes indications that there may be a problem with your system.
LED Indicators
LEDs on the chassis or on components installed in the chassis are generally the first indicators of system trouble. The following components and modules have status LEDs:
Chassis LCD display
Servers
Fans
CMCs
I/O modules
Power supplies
The single LED on the chassis LCD summarizes the status of all of the components in the system. A solid blue LED on the LCD indicates that no fault conditions have been detected in the system. A blinking amber LED on the LCD indicates that one or more fault conditions have been detected.
If the chassis LCD has a blinking amber LED, you can use the LCD menu to locate the component that has a fault. See the Dell Chassis Management Controller Firmware User Guide for help using the LCD.
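The summarizing rule for the chassis LCD LED can be illustrated with a minimal sketch. The function name and the dictionary layout are hypothetical and are not part of any Dell tool; only the two LED states come from the description above.

```python
def chassis_lcd_led(component_faults):
    """Return the LCD LED state for a mapping of component name -> fault flag.

    Hypothetical helper that mirrors the rule described above, nothing more.
    """
    # One or more detected fault conditions make the LED blink amber.
    if any(component_faults.values()):
        return "blinking amber"
    # No fault conditions have been detected in the system.
    return "solid blue"


status = {"Servers": False, "Fans": True, "Power supplies": False}
print(chassis_lcd_led(status))  # blinking amber
```

The sketch only reproduces the summary behavior; locating the faulty component still requires the LCD menu.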
Table 20-1 describes the meanings of the LED indicators on the Dell PowerEdge system:
Table 20-1. Blade Server LED Indicators
LED indicator
Meaning
solid green (only for power button)
The server is powered on. Absence of the green LED means the server is not powered on.
solid blue
iDRAC6 is healthy.
flashing amber
iDRAC6 has detected a fault condition or may be in the process of updating firmware.
flashing blue
A user has activated the locator ID for this server.
Hardware Trouble Indicators
Indications that a module has a hardware problem include the following:
Failure to power up
Noisy fans
Loss of network connectivity
Battery, temperature, voltage, or power monitoring sensor alerts
Hard drive failures
USB media failure
Physical damage caused by dropping, water, or other external stress
When these kinds of problems occur, inspect the damage caused, and then try to correct the problem using these strategies:
Reseat the module and restart it
Try inserting the module into a different bay in the chassis
Try replacing hard drives or USB keys
Reconnect or replace the power and network cables
If these steps do not correct the problem, consult the Hardware Owner's Manual for specific troubleshooting information for the hardware device.
Other Trouble Indicators
Table 20-2. Trouble Indicators
Look for:
Action:
Alert messages from the systems management software
See the systems management software documentation.
Problem Solving Tools
This section describes iDRAC6 utilities you can use to diagnose problems with your system, especially when you are trying to solve problems remotely.
Checking the system health
Checking the System Event Log for error messages
Checking the POST codes
Viewing the last crash screen
Viewing the Most Recent Boot Sequences
Checking the Server Status Screen on the LCD for Error Messages
Viewing iDRAC6 log
Viewing system information
Identifying the managed server in the chassis
Using the diagnostics console
Managing power on a remote system
Checking the System Health
When you log in to iDRAC6 Web interface, the System Summary screen displays the health of the system components. Table 20-3 describes the meaning of the system health indicators.
Table 20-3. Server Health Indicators
Indicator
Description
A green check mark indicates a healthy (normal) status condition.
A yellow triangle containing an exclamation point indicates a warning (noncritical) status condition.
A red X indicates a critical (failure) status condition.
A question mark icon indicates that the status is unknown.
Click any component on the Server Health section to see information about the component. Sensor readings are displayed for batteries, temperatures, voltages, and power monitoring, helping to diagnose some types of problems. iDRAC6 and CMC information screens provide useful current status and configuration information.
Checking the System Event Log (SEL)
The SEL Log screen displays messages for events that occur on the managed server.
To view the System Event Log, perform the following steps:
Click System and then click the Logs tab.
Click System Event Log to display the System Event Log screen.
The System Event Log screen displays a system health indicator (see Table 20-3), a time stamp, and a description of the event.
Click the appropriate System Event Log button to continue (see Table 20-4).
Table 20-4. SEL Buttons
Button
Action
Print
Prints the SEL in the sort order that it appears in the window.
Clear Log
Clears the SEL.
NOTE: The Clear Log button appears only if you have Clear Logs permission.
Save As
Opens a pop-up window that enables you to save the SEL to a directory of your choice.
NOTE: If you are using Internet Explorer and encounter a problem when saving, be sure to download the Cumulative Security Update for Internet Explorer, located on the Microsoft® Support website at support.microsoft.com.
Refresh
Reloads the SEL screen.
Checking the POST Codes
The Post Codes screen displays the last system POST code prior to booting the operating system. POST codes are progress indicators from the system BIOS that mark the stages of the boot sequence from Power-on Reset and allow you to diagnose faults related to system boot-up.
NOTE: View the text for POST code message numbers in the LCD display or in the Hardware Owner's Manual.
To view the POST codes, perform the following steps:
Click System, the Logs tab, and then Post Code.
The Post Code screen displays a system health indicator (see Table 20-3), a hexadecimal code, and a description of the code.
Click the appropriate Post Code button to continue (see Table 20-5).
Viewing the Last Crash Screen
The Last Crash Screen screen displays the most recent crash screen, which includes information about the events that occurred before the system crash. The last system crash image is saved in the iDRAC6 persistent store and is remotely accessible.
To view the Last Crash Screen screen, perform the following steps:
Click System, the Logs tab, and then Last Crash Screen.
The Last Crash Screen screen provides the buttons shown in Table 20-6:
NOTE: The Save and Delete buttons do not appear if there is no saved crash screen.
Table 20-6. Last Crash Screen Buttons
Button
Action
Print
Prints the Last Crash Screen screen.
Save
Opens a pop-up window that enables you to save the Last Crash Screen to a directory of your choice.
Delete
Deletes the Last Crash Screen screen.
Refresh
Reloads the Last Crash Screen screen.
NOTE: Due to fluctuations in the Auto Recovery timer, the Last Crash Screen may not be captured when the System Reset Timer is configured with a value that is too high. The default setting is 480 seconds. Use Server Administrator or IT Assistant to set the System Reset Timer to 60 seconds and ensure that the Last Crash Screen functions properly. See "Configuring the Managed Server to Capture the Last Crash Screen" for additional information.
Viewing the Most Recent Boot Sequences
If you experience boot problems, you can view the screen activity of what happened during the last three boot sequences from the Boot Capture screen. Playback of the boot screens occurs at a rate of 1 frame per second. iDRAC6 records fifty frames during boot time.
NOTE: You must have administrator privileges to view playback of the Boot Capture sequences.
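The playback figures above imply a simple calculation, sketched here. The constant and function names are illustrative, and the total assumes that every stored sequence holds the full fifty frames.

```python
FRAMES_PER_BOOT = 50   # frames iDRAC6 records during boot time
PLAYBACK_FPS = 1       # playback rate in frames per second
STORED_SEQUENCES = 3   # the last three boot sequences are available


def playback_seconds(frames=FRAMES_PER_BOOT, fps=PLAYBACK_FPS):
    """Time needed to replay one boot sequence at the given frame rate."""
    return frames / fps


print(playback_seconds())                  # 50.0
print(STORED_SEQUENCES * FRAMES_PER_BOOT)  # 150
```

At these rates, replaying one full sequence takes about 50 seconds.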
Table 20-7. Boot Capture Options
Button/Option
Description
Select the boot sequence
Allows you to select the boot sequence to load and play.
Boot Capture 1: Loads the most recent boot sequence.
Boot Capture 2: Loads the second most recent boot sequence, which occurred prior to Boot Capture 1.
Boot Capture 3: Loads the third most recent boot sequence, which occurred prior to Boot Capture 2.
Save As
Creates a compressed .zip file that contains all boot capture images of the current sequence. The user must have administrator privileges to perform this action.
Previous Screen
Takes you to previous screen, if any, in the replay console.
Play
Starts the screenplay from current screen in the replay console.
Pause
Pauses the screenplay on the current screen being displayed in the replay console.
Stop
Stops the screenplay and loads the first screen of that boot sequence.
Next Screen
Takes you to next screen, if any, in the replay console.
Print
Prints the Boot Capture image that appears on the screen.
Refresh
Reloads the Boot Capture screen.
Checking the Server Status Screen for Error Messages
When the amber LED is flashing and a particular server has an error, the main Server Status Screen on the LCD highlights the affected server in orange. Use the LCD navigation buttons to highlight the affected server, then press the center button. Error and warning messages are displayed on the second line. The following table lists all of the error messages and their severity.
Table 20-8. Server Status Screen
Severity
Message
Cause
Warning
System Board Ambient Temp: Temperature sensor for System Board, warning event
Server ambient temperature crossed a warning threshold
Critical
System Board Ambient Temp: Temperature sensor for System Board, failure event
Server ambient temperature crossed a failure threshold
Critical
System Board CMOS Battery: Battery sensor for System Board, failed was asserted
CMOS battery is not present or has no voltage
Warning
System Board System Level: Current sensor for System Board, warning event
Current crossed a warning threshold
Critical
System Board System Level: Current sensor for System Board, failure event
Current crossed a failure threshold
Critical
CPU<number> <voltage sensor name>: Voltage sensor for CPU<number>, state asserted was asserted
Voltage out of range
Critical
System Board <voltage sensor name>: Voltage sensor for System Board, state asserted was asserted
Voltage out of range
Critical
CPU<number> Status: Processor sensor for CPU<number>, IERR was asserted
CPU failure
Critical
CPU<number> Status: Processor sensor for CPU<number>, thermal tripped was asserted
CPU overheated
Critical
CPU<number> Status: Processor sensor for CPU<number>, configuration error was asserted
Incorrect processor type or in wrong location
Critical
CPU<number> Status: Processor sensor for CPU<number>, presence was deasserted
Required CPU is missing or not present
Critical
System Board Video Riser: Module sensor for System Board, device removed was asserted
Required module was removed
Critical
Mezz B<slot number> Status: Add-in Card sensor for Mezz B<slot number>, install error was asserted
Incorrect Mezzanine card installed for I/O fabric
Critical
Mezz C<slot number> Status: Add-in Card sensor for Mezz C<slot number>, install error was asserted
Incorrect Mezzanine card installed for I/O fabric
Critical
Backplane Drive <number>: Drive Slot sensor for Backplane, drive removed
Storage drive was removed
Critical
Backplane Drive <number>: Drive Slot sensor for Backplane, drive fault was asserted
Storage drive failed
Critical
System Board PFault Fail Safe: Voltage sensor for System Board, state asserted was asserted
This event is generated when the system board voltages are not at normal levels
Critical
System Board OS Watchdog: Watchdog sensor for System Board, timer expired was asserted
iDRAC6 watchdog timer expired and no action is set
Critical
System Board OS Watchdog: Watchdog sensor for System Board, reboot was asserted
iDRAC6 watchdog detected that the system has crashed (timer expired because no response was received from Host) and the action is set to reboot
Critical
System Board OS Watchdog: Watchdog sensor for System Board, power off was asserted
iDRAC6 watchdog detected that the system has crashed (timer expired because no response was received from Host) and the action is set to power off
Critical
System Board OS Watchdog: Watchdog sensor for System Board, power cycle was asserted
iDRAC6 watchdog detected that the system has crashed (timer expired because no response was received from Host) and the action is set to power cycle
Critical
System Board SEL: Event Log sensor for System Board, log full was asserted
The SEL device detects that only one entry can be added to the SEL before it is full
This event is generated in association with a CPU IERR and indicates which device caused the CPU IERR
Warning
PCIE NonFatal Er: Non Fatal I/O Group sensor, PCIe error (<location>)
This event is generated in association with a CPU IERR
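Most messages in Table 20-8 follow a common shape: a sensor name, a colon, a sensor type, an optional "for <entity>" part, and an event description. The sketch below splits a message into those parts. The pattern and field names are assumptions inferred from the table rows, not a documented message grammar, so messages in other forms may not match.

```python
import re

# Assumed shape: "<name>: <type> sensor[ for <entity>], <event>"
MESSAGE_PATTERN = re.compile(
    r"^(?P<name>[^:]+):\s*(?P<sensor_type>.+?) sensor"
    r"(?: for (?P<entity>[^,]+))?,\s*(?P<event>.+)$"
)


def parse_status_message(message):
    """Split an LCD status message into its parts, or return None."""
    match = MESSAGE_PATTERN.match(message.strip())
    return match.groupdict() if match else None


parts = parse_status_message(
    "System Board CMOS Battery: Battery sensor for System Board, "
    "failed was asserted"
)
print(parts["sensor_type"])  # Battery
print(parts["entity"])       # System Board
print(parts["event"])        # failed was asserted
```

For rows without a "for <entity>" part, such as the PCIe entry, the entity field is None.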
Viewing iDRAC6 Log
iDRAC6 Log is a persistent log maintained in iDRAC6 firmware. The log contains a list of user actions (such as log in, log out, and security policy changes) and alerts issued by iDRAC6. The log is erased after an iDRAC6 firmware update.
Whereas the System Event Log (SEL) contains records of events that occur in the managed server, iDRAC6 Log contains records of events that occur in iDRAC6.
To access iDRAC6 Log, perform the following steps:
iDRAC6 Log provides the information in Table 20-9.
Table 20-9. iDRAC6 Log Information
Field
Description
Date/Time
The date and time (for example, Dec 19 16:55:47).
iDRAC6 sets its clock from the managed server's clock. When iDRAC6 initially starts and is unable to communicate with the managed server, the time is displayed as the string System Boot.
Source
The interface that caused the event.
Description
A brief description of the event and the user name that logged in to iDRAC6.
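If saved log entries are processed with a script, the Date/Time field needs special handling because it carries no year and may hold the string System Boot instead of a time. The following is a minimal sketch, assuming the format shown in the example above (Dec 19 16:55:47) and English month abbreviations; the function name and the year parameter are hypothetical.

```python
from datetime import datetime


def parse_log_time(text, year):
    """Parse an iDRAC6 Log Date/Time value; return None for 'System Boot'.

    The log value has no year, so the caller supplies one (an assumption
    of this sketch, not something iDRAC6 provides).
    """
    text = text.strip()
    if text == "System Boot":
        # iDRAC6 could not yet read the managed server's clock.
        return None
    return datetime.strptime(f"{year} {text}", "%Y %b %d %H:%M:%S")


print(parse_log_time("Dec 19 16:55:47", year=2010))  # 2010-12-19 16:55:47
print(parse_log_time("System Boot", year=2010))      # None
```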
Using iDRAC6 Log Buttons
iDRAC6 Log screen provides the following buttons (see Table 20-10).
Table 20-10. iDRAC6 Log Buttons
Button
Action
Print
Prints iDRAC6 Log screen.
Clear Log
Clears iDRAC6 Log entries.
NOTE: The Clear Log button only appears if you have Clear Logs permission.
Save As
Opens a pop-up window that enables you to save iDRAC6 Log to a directory of your choice.
NOTE: If you are using Internet Explorer and encounter a problem when saving, be sure to download the Cumulative Security Update for Internet Explorer, located on the Microsoft Support website at support.microsoft.com.
Refresh
Reloads iDRAC6 Log screen.
Viewing System Information
The System Details screen displays information about the following system components:
Identifying the Managed Server in the Chassis
The Dell PowerEdge M1000e chassis holds up to sixteen servers. To locate a specific server in the chassis, you can use the iDRAC6 Web interface to turn on a blue flashing LED on the server. When you turn on the LED, you can specify the number of seconds that you want the LED to flash to ensure that you can reach the chassis while the LED is still flashing. Entering 0 leaves the LED flashing until you disable it.
In the Identify Server Timeout field, enter the number of seconds that you want the LED to blink. Enter 0 if you want the LED to remain flashing until you disable it.
Click Apply.
A blue LED on the server will flash for the number of seconds you specified.
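The timeout rule can be expressed as a short sketch. The function is hypothetical and only models the behavior described above, where a value of 0 means the LED flashes until it is disabled.

```python
def led_is_flashing(timeout_seconds, elapsed_seconds):
    """Model of the Identify Server Timeout rule (illustrative only)."""
    if timeout_seconds < 0 or elapsed_seconds < 0:
        raise ValueError("times must not be negative")
    if timeout_seconds == 0:
        # 0 keeps the LED flashing until it is explicitly disabled.
        return True
    return elapsed_seconds < timeout_seconds


print(led_is_flashing(300, 120))  # True
print(led_is_flashing(300, 300))  # False
print(led_is_flashing(0, 86400))  # True
```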
If you entered 0 to leave the LED flashing, follow these steps to disable it:
Using the Diagnostics Console
iDRAC6 provides a standard set of network diagnostic tools (see Table 20-11) that are similar to the tools included with Microsoft® Windows® or Linux-based systems. You can access these network debugging tools from the iDRAC6 Web interface.
To access the Diagnostics Console screen, perform the following steps:
Click System → iDRAC6 → Troubleshooting.
Select the Diagnostics Console tab.
Table 20-11 describes the commands that can be entered on the Diagnostics Console screen. Enter a command and click Submit. The debugging results appear in the Diagnostics Console screen.
Click the Clear button to clear the results displayed by the previous command.
To refresh the Diagnostics Console screen, click Refresh.
Table 20-11. Diagnostic Commands
Command
Description
arp
Displays the contents of the Address Resolution Protocol (ARP) table. ARP entries may not be added or deleted.
ifconfig
Displays the contents of the network interface table.
netstat
Prints the content of the routing table.
ping <IP Address>
Verifies that the destination IP address is reachable from iDRAC6 with the current routing-table contents. A destination IP address must be entered in the field to the right of this option. An Internet Control Message Protocol (ICMP) echo packet is sent to the destination IP address based on the current routing-table contents.
ping6 <IPv6 Address>
Verifies that the destination IPv6 address is reachable from iDRAC6 with the current routing-table contents. A destination IPv6 address must be entered in the field to the right of this option. An Internet Control Message Protocol (ICMP) echo packet is sent to the destination IPv6 address based on the current routing-table contents.
traceroute <IP Address>
Used to determine the route taken by packets across an IP network.
traceroute6 <IPv6 Address>
Used to determine the route taken by packets across an IPv6 network.
gettracelog
Displays iDRAC6 trace log. See "gettracelog" for more information.
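Before submitting a command, it can help to check that it is one of the commands in Table 20-11 and that the address family matches the command. The sketch below does this with the Python standard library. It assumes that the commands without a listed argument take none and that the address is a literal IP address; the actual console may accept more than this sketch allows.

```python
import ipaddress

# Commands from Table 20-11 that are listed without an argument.
NO_ARGUMENT = {"arp", "ifconfig", "netstat", "gettracelog"}
# Commands that take one address, with the IP version each expects.
ADDRESS_VERSION = {"ping": 4, "traceroute": 4, "ping6": 6, "traceroute6": 6}


def is_valid_diagnostic_command(line):
    """Return True if the line looks like a Table 20-11 command."""
    parts = line.split()
    if not parts:
        return False
    command, args = parts[0], parts[1:]
    if command in NO_ARGUMENT:
        return not args
    if command in ADDRESS_VERSION and len(args) == 1:
        try:
            address = ipaddress.ip_address(args[0])
        except ValueError:
            return False
        return address.version == ADDRESS_VERSION[command]
    return False


print(is_valid_diagnostic_command("ping 192.168.0.120"))   # True
print(is_valid_diagnostic_command("ping6 192.168.0.120"))  # False
print(is_valid_diagnostic_command("traceroute6 ::1"))      # True
```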
Managing Power on a Remote System
iDRAC6 enables you to remotely perform several power management actions on the managed server. Use the Power Management screen to perform an orderly shutdown through the operating system when rebooting and powering on and off.
NOTE: You must have Execute Server Action Commands permission to perform power management actions. See "Adding and Configuring iDRAC6 Users" for help configuring user permissions.
Click System, then click the Power Management → Power Control tab.
Select a Power Control Operation, for example Reset System (warm boot). Table 20-12 provides information about Power Control Actions.
Click Apply to perform the selected action.
Table 20-12. Power Control Actions
Power On System
Turns on the system power (equivalent to pressing the power button when the system power is off).
Power Off System
Turns off the system power (equivalent to pressing the power button when the system power is on).
NMI (Non-Masking Interrupt)
Sends a high-level interrupt to the operating system, which causes the system to halt operation to allow for critical diagnostic or troubleshooting activities.
Graceful Shutdown
Attempts to cleanly shut down the operating system, then powers off the system. It requires an ACPI (Advanced Configuration and Power Interface) aware operating system, which allows for system directed power management.
NOTE: A graceful shutdown of the server operating system may not be possible when the server software stops responding, or if you are not logged in as an administrator at a local Windows console. In these cases, you must specify a forced reboot instead of a graceful shutdown of Windows. In addition, depending on the version of the Windows OS, there might be a policy configured around the shutdown process that modifies shutdown behavior when triggered from iDRAC6. See Microsoft's documentation for the local computer policy "Shutdown: Allow system to be shut down without having to log on."
Reset System (warm boot)
Reboots the system without powering off (warm boot).
Troubleshooting and Frequently Asked Questions
The LED on the server is flashing blue.
A user has activated the locator ID for the server. This signal helps the user identify the server in the chassis. See "Identifying the Managed Server in the Chassis" for information about this feature.
How can I find the IP address of iDRAC6?
From CMC Web interface:
Click Chassis → Servers, then click the Setup tab.
Click Deploy.
Read the IP address for your server from the table that is displayed.
From the iKVM:
Reboot the server and enter iDRAC6 Configuration Utility by pressing <Ctrl><E>.
Watch for the IP address which displays during BIOS POST.
Select the "Dell CMC" console in the OSCAR to log in to CMC through a local serial connection. CMC RACADM commands can be issued from this connection. See the Dell Chassis Management Controller Administrator Reference Guide for a complete list of CMC RACADM subcommands.
Use the local RACADM getsysinfo command to view iDRAC6 IP address.
On the Main Menu, highlight Server and press the check button.
Select the server whose IP address you seek and press the check button.
How can I find the IP address of CMC?
From iDRAC6 Web interface:
Click System → Remote Access → CMC.
CMC IP address is displayed on the CMC Summary screen.
From the iKVM:
Select the "Dell CMC" console in the OSCAR to log in to CMC through a local serial connection. CMC RACADM commands can be issued from this connection. See the Dell Chassis Management Controller Administrator Reference Guide for a complete list of CMC RACADM subcommands.
$ racadm getniccfg -m chassis
NIC Enabled = 1
DHCP Enabled = 1
Static IP Address = 192.168.0.120
Static Subnet Mask = 255.255.255.0
Static Gateway = 192.168.0.1
Current IP Address = 10.35.155.151
Current Subnet Mask = 255.255.255.0
Current Gateway = 10.35.155.1
Speed = Autonegotiate
Duplex = Autonegotiate
NOTE: The above action can also be performed with remote RACADM.
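If you capture the output of this command in a script, the "name = value" lines can be turned into a lookup table. The following sketch assumes the output has one setting per line; the sample text repeats the values from the example above, and the function name is hypothetical.

```python
SAMPLE_OUTPUT = """\
NIC Enabled = 1
DHCP Enabled = 1
Static IP Address = 192.168.0.120
Static Subnet Mask = 255.255.255.0
Static Gateway = 192.168.0.1
Current IP Address = 10.35.155.151
Current Subnet Mask = 255.255.255.0
Current Gateway = 10.35.155.1
Speed = Autonegotiate
Duplex = Autonegotiate
"""


def parse_settings(output):
    """Turn 'name = value' lines into a dictionary; skip other lines."""
    settings = {}
    for line in output.splitlines():
        name, separator, value = line.partition("=")
        if separator:
            settings[name.strip()] = value.strip()
    return settings


config = parse_settings(SAMPLE_OUTPUT)
print(config["Current IP Address"])  # 10.35.155.151
print(config["DHCP Enabled"])        # 1
```

In this sample DHCP is enabled, so the current address, rather than the static address, is the one the CMC is using.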
iDRAC6 network connection is not working.
Ensure that the LAN cable is connected to CMC.
Ensure that the NIC settings and the IPv4 or IPv6 settings are enabled, and that either Static or DHCP is selected for your network.
I inserted the server into the chassis and pressed the power button, but nothing happened.
iDRAC6 requires up to 2 minutes to initialize before the server can power up.
Check the CMC power budget. The chassis power budget may have been exceeded.
I have forgotten iDRAC6 administrative user name and password.
You must restore iDRAC6 to its default settings.
Reboot the server and press <Ctrl><E> when prompted to enter iDRAC6 Configuration Utility.
On iDRAC6 Configuration Utility menu, highlight Reset to Default and press <Enter>.
NOTE: You can also reset iDRAC6 from local RACADM by issuing racadm racresetcfg.
How can I change the name of the slot for my server?
Log in to CMC Web interface.
Open the Chassis tree and click Servers.
Click the Setup tab.
Enter the new name for the slot in the row for your server.
Click Apply.
When starting a console redirection session from iDRAC6 Web interface, an ActiveX security popup appears.
iDRAC6 may not be a trusted site. To prevent the security popup from appearing every time you begin a console redirection session, add iDRAC6 to the trusted site list in the client browser:
Click Tools → Internet Options → Security → Trusted sites.
Click Sites and enter the IP address or the DNS name of iDRAC6.
Click Add.
Click Custom Level.
In the Security Settings window, select Prompt under Download unsigned ActiveX Controls.
When I start a console redirection session, the viewer screen is blank.
If you have Virtual Media privilege but not Console Redirection privilege, you are able to start the viewer so that you can access the virtual media feature, but the managed server's console will not display.
iDRAC6 is not responding during boot.
Remove and reinsert the server.
Check CMC Web interface to see if iDRAC6 appears as an upgradable component. If it does, follow the instructions in "Updating iDRAC6 Firmware Using CMC."
If this does not correct the problem, contact technical support.
When attempting to boot the managed server, the power indicator is green, but there is no POST or no video at all.
This can happen if any of the following conditions is true:
Memory is not installed or is inaccessible.
The CPU is not installed or is inaccessible.
The video riser card is missing or improperly connected.
Also, look for error messages in iDRAC6 log from iDRAC6 Web interface or from the LCD.