I’m throwing this post out there before fully proving this, but the hope is that this will either help someone else or elicit information from someone else. I’m not a network engineer but I’ve been down and dirty in the world of network packets a fair amount.
UPDATE 15/8/21012: So I can confirm that we have resolved the issue described below. Long story short, it would seem with the newer IBM AMM Bladecenter firmware revisions that it is necessary for the AMM management network to be on a different subnet than the normal production network. Our guess is that the AMM becomes confused about where network packets should be directed, either on the external interface or the internal blade network.
We run two IBM Bladecenter H chassis with a single AMM module in each.We’ve had them going on 4 years, initially with no trouble and then progressively run into bugs, as firmware has been update, to support newer models of blade servers. Just recently I’ve been working on ensuring we are running on recent stable firmware for these servers in preparation for upgrading to ESXi 5.
In terms of networking, our internal network has traditionally been a flat layer2 network, more recently we introduced layer3 core switching and vlans into the mix. Regardless, we have historically had our management ethernet interfaces on the same layer2 subnet as our servers and clients. In the diagram below is a segment of our network as it relates to our Bladecenter for this scenario (In fact its much more complex than that but I’ve cut out the extra faff). This has worked fine for years.
So in the last few days I updated our DR Bladecenter AMM to version 3.62R (our other Bladecenter in production is running 3.62C).
After the restart I noticed that pinging to the device was very patchy; our monitoring system was reporting the AMM plus the cisco and brocade modules management interfaces in the blade, as being down. Pinging either showed good sub 1ms or nothing at all, there was no variation in latency.
The normal troubleshooting routine began, plugging in to the AMM via crossover cable showed a stable connection, nothing in the AMM logs of note, resetting and re-seating the AMM made no difference. Fixing the speed and duplex of the connection made no difference either.
A bit of research brought me to this IBM retain tip ibm.co/Pf7mux.
The tip effectively states that if you have the AMM on the same subnet as the blade servers, then server network connectivity could be lost as the AMM may send a proxy arp to a request directed at the server. WTF!?
What bugs me is I’ve looked at the install guides and there is no warning in regards to this flat configuration and for years it’s been fine, with issues only coming up now with the 3.62 firmware.
We’re now trying to confirm that we have this exact problem by moving the AMM onto its own vlan as well as monitor the AMM’s network traffic to get a better view of what is happening.