Archive for August, 2012

While doing some final testing and handover for upgrading Ports of Auckland’s VMware hosts to ESXi 5, we discovered that on boot the host appeared to hang while loading the vmw_vaai_cx module. We had already successfully rebooted this test blade a few times, but had just re-enabled the FC ports to allow it access to the SAN volumes again. Given the module name, I suspected the hang was related to the storage, so I did a quick bit of research and found this article.

http://kb.vmware.com/kb/1016106

The long and the short of it: if you are using Microsoft Failover Clusters with shared volumes (e.g. quorum disks or data disks) in VMware, by using physical-mode RDMs, then you need to flag those volumes as perennially reserved (i.e. permanently reserved) on each host that has access to them. This stops the host from repeatedly attempting to get a reservation and inspect the volume. In our case we have 15 RDM volumes, so the boot time was fairly excessive.

For each volume you need to get the naa ID of the volume and then run an esxcli command on each host for each volume.

A neat way to get a list of all RDM disks and their naa IDs is to use this PowerCLI command:

Get-VM | Get-HardDisk -DiskType "RawPhysical","RawVirtual" | Select Parent,Name,DiskType,ScsiCanonicalName,DeviceName

Then you just need to pick out the naa IDs that are connected to more than one VM to identify the likely MSCS candidates.
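As a rough sketch of that last step, the output can be grouped by naa ID and filtered down to LUNs seen by more than one VM. The rows below are hypothetical stand-ins for the Get-HardDisk output (the VM names and naa IDs are made up), so the snippet runs without a vCenter connection:

```powershell
# Hypothetical rows standing in for the Get-HardDisk output above;
# the VM names and naa IDs are invented for illustration only.
$rdmDisks = @(
    [pscustomobject]@{ Parent = 'SQLNODE1'; ScsiCanonicalName = 'naa.600000000000000000000000000000a1' }
    [pscustomobject]@{ Parent = 'SQLNODE2'; ScsiCanonicalName = 'naa.600000000000000000000000000000a1' }
    [pscustomobject]@{ Parent = 'FILESRV1'; ScsiCanonicalName = 'naa.600000000000000000000000000000b2' }
)

# Group by naa ID and keep only the LUNs attached to more than one distinct VM
$sharedLuns = $rdmDisks |
    Group-Object -Property ScsiCanonicalName |
    ForEach-Object {
        $vms = @($_.Group | ForEach-Object { "$($_.Parent)" } | Sort-Object -Unique)
        if ($vms.Count -gt 1) {
            [pscustomobject]@{ NaaId = $_.Name; VMs = $vms -join ', ' }
        }
    }

$sharedLuns
```

Against a live environment, $rdmDisks would presumably be replaced with the Get-VM | Get-HardDisk pipeline shown above.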

To flag the volumes you then just need to run esxcli (I prefer the PowerCLI version, which makes it easy to apply the command to a group of hosts):

$esxcli = Get-EsxCli -VMHost ESXhost

$esxcli.storage.core.device.setconfig($false, "naa.60030d90544d5353514c322d46455247", $true)
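To apply this across several hosts and volumes, the two commands above can be wrapped in a pair of loops. This is only a sketch: the function name, the -DryRun switch and the host names are my own inventions for illustration, and the live path assumes PowerCLI is loaded and Connect-VIServer has already been run.

```powershell
# Sketch only. Set-RdmPerenniallyReserved, -DryRun and the host names are
# hypothetical; the live path assumes PowerCLI and an existing vCenter session.
function Set-RdmPerenniallyReserved {
    param(
        [string[]]$HostNames,
        [string[]]$NaaIds,
        [switch]$DryRun   # list the host/LUN pairs without changing anything
    )
    foreach ($hostName in $HostNames) {
        if (-not $DryRun) {
            $esxcli = Get-EsxCli -VMHost $hostName
        }
        foreach ($naa in $NaaIds) {
            if (-not $DryRun) {
                # Same call and argument order as the command above
                $null = $esxcli.storage.core.device.setconfig($false, $naa, $true)
            }
            [pscustomobject]@{ VMHost = $hostName; Device = $naa; Applied = (-not $DryRun) }
        }
    }
}

# Dry run against two placeholder hosts, using the naa ID from the example above
Set-RdmPerenniallyReserved -HostNames 'esx01', 'esx02' -NaaIds 'naa.60030d90544d5353514c322d46455247' -DryRun
```

Running with -DryRun first gives a chance to eyeball the host and volume pairs before anything is changed on the hosts.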

Well, tomorrow is my last day with the Ports of Auckland (I know, you all thought I’d be here forever). I’m off to take up a role as Technical Solutions Designer with vifx (www.vifx.co.nz). For me it’s an exciting new opportunity to specialise and dig my roots deeper into virtualisation, datacenter automation and storage, though the move is tinged with sadness at leaving such a unique place as the port.

It’s been a hell of a ride being immersed in the 24/7 365 world of the port and being involved in everything from storage to SQL, from Exchange to CCTV systems. Driving past monstrous rolling Tonka toys and going up the quayside cranes is an experience you won’t find in an office block. Most importantly, I leave behind a great team of guys who all mucked in together and shared more than a few laughs and 4 consecutive TechEd/Techfests.

Now my focus is going to be on virtualisation, storage and of course the Cloud (yes, I have come to accept that the term is here; I’m just going to have to suck it up and use it like I mean it). Hopefully that newfound focus will be reflected in this blog with a series of in-depth posts. I’m weighing up what to focus on next; ideas, anyone?

Anyway, thanks to the port and the team for a great four years, so long and thanks for all the fish and always have a towel close by.

Disclaimer: I’m throwing this post out there before fully proving this, but the hope is that this will either help someone else or elicit information from someone else. I’m not a network engineer but I’ve been down and dirty in the world of network packets a fair amount.

UPDATE 15/8/2012: So I can confirm that we have resolved the issue described below. Long story short, it would seem that with the newer IBM AMM Bladecenter firmware revisions it is necessary for the AMM management network to be on a different subnet from the normal production network. Our guess is that the AMM becomes confused about where network packets should be directed: out of the external interface or onto the internal blade network.

Background:

We run two IBM Bladecenter H chassis with a single AMM module in each. We’ve had them going on 4 years, initially with no trouble, and have then progressively run into bugs as the firmware has been updated to support newer models of blade servers. Just recently I’ve been working on ensuring we are running recent stable firmware on these servers in preparation for upgrading to ESXi 5.

In terms of networking, our internal network has traditionally been a flat layer 2 network; more recently we introduced layer 3 core switching and VLANs into the mix. Regardless, we have historically had our management ethernet interfaces on the same layer 2 subnet as our servers and clients. In the diagram below is a segment of our network as it relates to our Bladecenter for this scenario (in fact it’s much more complex than that, but I’ve cut out the extra faff). This has worked fine for years.

The Problem:

So in the last few days I updated our DR Bladecenter AMM to version 3.62R (our other Bladecenter in production is running 3.62C).

After the restart I noticed that pinging the device was very patchy; our monitoring system was reporting the AMM, plus the management interfaces of the Cisco and Brocade modules in the Bladecenter, as being down. Pings either came back in a good sub-1ms or not at all; there was no variation in latency.

The normal troubleshooting routine began. Plugging in to the AMM via crossover cable showed a stable connection, there was nothing of note in the AMM logs, and resetting and re-seating the AMM made no difference. Fixing the speed and duplex of the connection made no difference either.

A bit of research brought me to this IBM RETAIN tip: ibm.co/Pf7mux.

The tip effectively states that if you have the AMM on the same subnet as the blade servers, then server network connectivity could be lost, as the AMM may send a proxy ARP reply to a request directed at the server. WTF!?

What bugs me is I’ve looked at the install guides and there is no warning in regards to this flat configuration and for years it’s been fine, with issues only coming up now with the 3.62 firmware.

We’re now trying to confirm that we have this exact problem by moving the AMM onto its own VLAN, as well as monitoring the AMM’s network traffic to get a better view of what is happening.