Storage area networks (SANs) can be complicated and temperamental beasts, especially when they're poorly managed. Troubleshooting is tough because a good design is not always obvious, and Fibre Channel (FC) standards are just loose enough to make interoperability a concern. This technical tip reviews some common SAN problem areas, describes how to diagnose problems and offers suggestions for preventing them in the first place.
A million things can go wrong in a complex storage network. Narrowing a problem down, based on its symptoms, to a probable cause in one of a few areas should speed troubleshooting and resolution. Each failure type can be grouped into one of the following areas:
Interoperability problems
Although FC SANs have been around for 15 or more years, not all devices interoperate well, and many SAN problems result from non-interoperable components. All storage vendors publish some form of support matrix that documents tested and supported configurations of storage array microcode, SAN switch firmware and host hardware/software.
Exceeding the capacity limits
It is probably obvious that saturating SAN ports will cause bottlenecks, and those bottlenecks can surface as elusive application problems. It is usually easy to look at a host or storage port on the SAN and determine whether it is 100% busy, but it is tougher to determine whether an overloaded inter-switch link (ISL) is the culprit. Sometimes the I/O itself isn't the bottleneck; instead, limits such as fan ratios (the number of host bus adapters, or HBAs, zoned to a storage port) or the number of switches in a fabric are exceeded, causing connectivity issues.
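Both checks described above come down to simple arithmetic. Here is a minimal Python sketch, assuming you can already export port throughput figures and a list of HBA-to-storage-port zoning pairs from your own tools. The function names, port names and the limit of two HBAs per port are hypothetical; the real limit comes from your storage vendor.

```python
from collections import defaultdict


def utilization_pct(throughput, link_speed):
    """Return utilization as a percentage; both values must use the same unit."""
    if link_speed <= 0:
        raise ValueError("link speed must be positive")
    return 100.0 * throughput / link_speed


def fan_ratios(zoned_pairs):
    """Count the distinct HBAs zoned to each storage port."""
    hbas_per_port = defaultdict(set)
    for hba, storage_port in zoned_pairs:
        hbas_per_port[storage_port].add(hba)
    return {port: len(hbas) for port, hbas in hbas_per_port.items()}


def ports_over_limit(ratios, max_hbas_per_port):
    """Return the storage ports whose fan ratio exceeds the given limit."""
    return sorted(port for port, count in ratios.items() if count > max_hbas_per_port)


# Hypothetical zoning export: (HBA name, storage port name) pairs.
pairs = [
    ("hba1", "array_port_a"),
    ("hba2", "array_port_a"),
    ("hba3", "array_port_a"),
    ("hba1", "array_port_b"),
]
print(ports_over_limit(fan_ratios(pairs), max_hbas_per_port=2))
```

The same utilization function applies to an ISL as to an edge port, provided you feed it the ISL's own throughput counters.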
Incorrect configuration or zoning
Bad or incorrect zoning is one of the most common causes of SAN problems. Maybe that is because zoning is the part of the SAN we change most often, or because zones contain those tricky 16-digit hexadecimal world wide names (WWNs).
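Since a WWN is 16 hexadecimal digits, a mistyped one can often be caught before it ever reaches a zone. The following is a minimal Python sketch, not any vendor's tool: the function name is hypothetical and the sample WWN is made up for illustration.

```python
import re

_SEPARATORS = re.compile(r"[:\-\s]")
_SIXTEEN_HEX = re.compile(r"[0-9a-f]{16}")


def normalize_wwn(raw):
    """Return a WWN as colon-separated lowercase hex pairs, or raise ValueError."""
    digits = _SEPARATORS.sub("", raw).lower()
    if not _SIXTEEN_HEX.fullmatch(digits):
        raise ValueError(f"not a 16-digit hexadecimal WWN: {raw!r}")
    return ":".join(digits[i:i + 2] for i in range(0, 16, 2))


# A made-up WWN, written two different ways, normalizes to the same string.
print(normalize_wwn("10:00:00:00:C9:2B:5A:11"))
print(normalize_wwn("10000000c92b5a11"))
```

Normalizing every WWN to one spelling before comparing it with the documentation also removes a whole class of false mismatches caused by upper case versus lower case or colons versus no colons.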
Flaky connections and cables
It seems that when fiber cables fail, they rarely fail completely. Instead, they die a slow, painful and intermittent death. On the way to the grave, they often give applications and administrators fits.
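Because the failure is intermittent, a single look at a port's error counter tells you little; the pattern over time is what gives a dying cable away. Here is a minimal Python sketch, assuming you periodically record a cumulative error counter for each port. Which counter you sample and how you collect it depends on your switch; the names and the threshold below are hypothetical.

```python
def error_deltas(samples):
    """Convert cumulative counter samples into per-interval increases.

    A drop in the counter (for example after a reset) is treated as zero.
    """
    return [max(0, later - earlier) for earlier, later in zip(samples, samples[1:])]


def suspect_ports(counter_history, min_bad_intervals=2):
    """Return ports whose error counter rose in at least min_bad_intervals intervals."""
    suspects = {}
    for port, samples in counter_history.items():
        bad_intervals = sum(1 for delta in error_deltas(samples) if delta > 0)
        if bad_intervals >= min_bad_intervals:
            suspects[port] = bad_intervals
    return suspects


# Hypothetical cumulative error counts, one sample per polling interval.
history = {
    "port1": [0, 0, 0, 0, 0],
    "port2": [5, 5, 9, 9, 14],
    "port3": [0, 3, 3, 3, 3],
}
print(suspect_ports(history))
```

Counting bad intervals rather than total errors is a deliberate choice here: one burst during a known cable move is less interesting than errors that keep coming back.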
Storage array configuration issues
Each brand of storage array is managed a little differently, but all share some basic concepts. Logical unit numbers (LUNs) must be created and assigned to a host HBA through a front-end SAN port. Problems often arise when the storage administrator makes a typo while configuring the array.
Host configuration issues
A lot can go wrong on a server. Servers represent a large proportion of the SAN component stack, including the volume manager, operating system, multipathing software, HBA driver, HBA firmware and HBA hardware. Each of these components must be configured as per the storage vendor's specifications, or you're asking for trouble.
SAN hardware failures
I purposely listed hardware failures last on the list of common SAN problems because, while hardware is usually the first place we look, it's rarely the problem. Today's SAN hardware is very reliable, but it does fail occasionally. Common failures that can affect host access are SFP port failures, port card failures and entire switch failures.
SAN troubleshooting requires an intimate knowledge of the desired configuration and the expected behavior of a particular system. When a problem occurs, it's helpful to narrow it down by eliminating the properly functioning components in three basic areas: SAN, hosts and storage. Ask yourself these questions:
Is it the SAN?
Have any SAN changes occurred recently? Ask around, check the SAN logs and compare the running configuration to the documentation. Is the SAN reporting events or errors that may be related? Look for failed ports, recent port logouts or fabric rebuilds.
Is it the host?
Can other hosts see the storage in question? Can this host see other storage? Is the HBA logged into the fabric? Have any recent host changes occurred? Are there any SAN-related messages in the host's system message logs?
Is it the storage?
Can other hosts see the storage in question? Is the storage port logged into the fabric? Have any changes occurred on the storage array recently? Are the storage array logs reporting errors?
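Two of the questions above, whether other hosts can see the storage and whether this host can see other storage, already form a small elimination grid. The Python sketch below is a rough heuristic under that simplification, not a diagnostic procedure; the function name and the ordering of the returned areas are my own assumptions.

```python
def areas_to_check(other_hosts_see_storage, host_sees_other_storage):
    """Suggest which areas to examine first, based on two visibility checks."""
    if other_hosts_see_storage and not host_sees_other_storage:
        # The storage works for others and this host sees nothing: suspect the host.
        return ["host"]
    if host_sees_other_storage and not other_hosts_see_storage:
        # The host works elsewhere and nobody sees this storage: suspect the storage.
        return ["storage"]
    if other_hosts_see_storage and host_sees_other_storage:
        # Both ends work elsewhere, so look at what joins this particular pair:
        # zoning on the SAN and LUN assignment on the array.
        return ["SAN", "storage"]
    # Nothing works on either side, so start with the shared fabric.
    return ["SAN"]


print(areas_to_check(other_hosts_see_storage=True, host_sees_other_storage=False))
```

The remaining questions (fabric logins, recent changes, logs) then confirm or overturn the first guess.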
Check the support matrices
Make a regular practice of reviewing support matrices and checking your configuration against what is currently supported. Manufacturers are constantly finding new bugs that get fixed in new code. Keep your software versions current and supported and you'll avoid a lot of problems.
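If you keep both your inventory and the relevant slice of the vendor's matrix in a structured form, the review can be partly automated. This is a minimal Python sketch under that assumption. Vendors publish their matrices in many formats, so the dictionary layout, component names and version strings here are all hypothetical.

```python
def unsupported_components(installed, supported):
    """Return components whose installed version is not in the supported set.

    installed maps a component name to its version string.
    supported maps a component name to a set of supported version strings.
    A component missing from the matrix is reported as unsupported.
    """
    problems = {}
    for component, version in installed.items():
        if version not in supported.get(component, set()):
            problems[component] = version
    return problems


# Made-up versions, for illustration only.
installed = {"hba_driver": "8.2", "hba_firmware": "2.10", "switch_firmware": "7.0"}
supported = {
    "hba_driver": {"8.2", "8.3"},
    "hba_firmware": {"2.10"},
    "switch_firmware": {"7.1", "7.2"},
}
print(unsupported_components(installed, supported))
```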
Document the SAN
This one is huge. It is so important when troubleshooting a problem to understand what the design intent was. Make sure the documentation records hosts, HBAs, WWNs and where they connect. It should include the storage, storage ports and their WWNs. Finally, the SAN documentation should describe the fabrics, ISLs, zone sets, zones and zone members.
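Documentation that records zones and zone members as data, rather than only as a diagram, can be compared with the running fabric mechanically. Below is a minimal Python sketch, assuming you can export the running zone set into the same shape as the documentation. The function name, zone names and shortened WWNs are hypothetical.

```python
def zone_drift(documented, running):
    """Compare documented zones with the running zone set.

    Both arguments map a zone name to a set of member WWNs.
    Returns only the zones that differ, with the members that are
    missing from the fabric and the members nobody documented.
    """
    drift = {}
    for zone in sorted(set(documented) | set(running)):
        doc_members = documented.get(zone, set())
        run_members = running.get(zone, set())
        missing = doc_members - run_members
        unexpected = run_members - doc_members
        if missing or unexpected:
            drift[zone] = {"missing": sorted(missing), "unexpected": sorted(unexpected)}
    return drift


# Shortened, made-up member names stand in for full WWNs.
documented = {"zone_db1": {"wwn_a", "wwn_b"}, "zone_web1": {"wwn_c", "wwn_d"}}
running = {"zone_db1": {"wwn_a", "wwn_b"}, "zone_web1": {"wwn_c", "wwn_x"}}
print(zone_drift(documented, running))
```

An empty result means the fabric matches the design intent; anything else is either an undocumented change or a mistake, and both are worth knowing about before a problem occurs.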
Baseline the SAN performance
Unless you record what is happening on an average day, it will be tough to determine if a busy port is normal or the culprit during a problem. Minimally, record the average port utilization for every port in the SAN.
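A baseline is only useful if you can compare today's numbers against it. The Python sketch below records the average utilization per port and flags ports running well above their norm. Storing the standard deviation as well and using a three-sigma threshold are my own assumptions, not part of the advice above, and the port names and sample values are made up.

```python
from statistics import mean, pstdev


def build_baseline(history):
    """Map each port to the (mean, standard deviation) of its utilization samples."""
    return {port: (mean(samples), pstdev(samples)) for port, samples in history.items()}


def unusual_ports(baseline, current, sigmas=3.0):
    """Return ports whose current utilization is far above their baseline mean."""
    flagged = []
    for port, value in current.items():
        if port not in baseline:
            continue  # no history for this port, so nothing to compare against
        average, spread = baseline[port]
        if value > average + sigmas * spread:
            flagged.append(port)
    return sorted(flagged)


# Hypothetical utilization percentages recorded on ordinary days.
history = {"port1": [20, 22, 18, 20], "port2": [80, 85, 90, 85]}
baseline = build_baseline(history)
print(unusual_ports(baseline, {"port1": 60, "port2": 90}))
```

In this example port2 is the busier port in absolute terms, yet port1 is the one behaving abnormally, which is exactly the distinction a baseline exists to make.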
Plan your changes
To avoid administrator-induced outages, use the SAN documentation to define changes before they happen. If you are making any decisions about what to do when you're executing the change, you're doing it wrong. Also, it is too easy to forget to document a change after it has occurred.
Back up the configurations
After every day of SAN changes, back up and safely store the switch configuration. This will ensure that you can roll back changes quickly from a backup if a switch fails or gets totally messed up during a change. Believe me, it happens a lot and you'll be glad you have a backup when it hits you.
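How you export a switch configuration is vendor-specific, so the Python sketch below covers only the "safely store" half: it copies an already-exported file into a dated archive and lets you verify the copy by checksum. The function names, file naming scheme and .cfg extension are hypothetical.

```python
import hashlib
import shutil
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def archive_config(exported_file, archive_dir, switch_name, now=None):
    """Copy an exported switch configuration into a dated archive.

    Returns the path of the archived copy. The timestamp is in UTC so
    that backups from different sites sort in the same order.
    """
    now = now or datetime.now(timezone.utc)
    archive = Path(archive_dir)
    archive.mkdir(parents=True, exist_ok=True)
    target = archive / f"{switch_name}-{now:%Y%m%dT%H%M%SZ}.cfg"
    shutil.copy2(exported_file, target)
    return target
```

After archiving, comparing sha256_of on the source and on the returned path confirms the copy is intact, and keeping one file per change day gives you a known-good version to roll back to.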
Troubleshooting SANs can be a non-issue when certain things are under control. Consider these best practices day to day to prevent a huge issue when something does go wrong.
About the author: Brian Peterson is an independent IT infrastructure analyst. He has a deep background in enterprise storage and open-systems computing platforms. He has consulted with hundreds of enterprise customers who struggled with the challenges of disaster recovery, scalability, technology refreshes and controlling costs.