
This is part of a series of posts on automating Windows and deploying systems in the real world of enterprise operations. I’ve focused the series on those nuances and problems that I’ve hit trying to deploy the various software stacks found in a typical Windows-centric enterprise.

At some point we will hit the need to use DSC resources in Chef recipes. The reality is that the Chef cookbooks for Windows only go so far, and most Microsoft products outside of Windows itself have no coverage at all. The DSC resources that are available are now open source, just like Chef cookbooks, and the collection is expanding all the time.

I suspect we will see more and more DSC resources come out of Microsoft’s product teams as well.

Chef DSC Resources

There are a couple of key requirements for computers to leverage DSC in Chef. Windows Management Framework 5 (February Preview or later) is needed for DSC to be available, and Chef Client 12.5 or later is required to provide the two Chef resources that allow DSC resources to be consumed in cookbooks.

UPDATE: Chef Client 12.6 has just been released, which removes the limitations below (see the changelog for full details; strangely, the inclusion of the timeout attribute isn’t in the changelog but was included in the master branch). A recent build of WMF 5 is still required, as that is where Microsoft relaxed the LCM requirements.
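As a quick sanity check before letting Chef loose on a node, a minimal sketch like this confirms that WMF 5 and its DSC cmdlets are in place (nothing Chef-specific is assumed beyond the prerequisites above):

# Check the PowerShell/WMF version on the node - expect a Major version of 5 or higher
$PSVersionTable.PSVersion

# Check that the WMF 5 DSC cmdlets are present; Invoke-DscResource is what Chef's dsc_resource builds on
Get-Command Invoke-DscResource -ErrorAction SilentlyContinue

# Show the current Local Configuration Manager settings, including RefreshMode
Get-DscLocalConfigurationManager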

There are two Chef resources for consuming DSC: dsc_script and dsc_resource. These are effectively mutually exclusive, as they require the DSC Local Configuration Manager’s RefreshMode to be configured to Push (dsc_script) or Disabled (dsc_resource).
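For reference, here is a minimal sketch of a WMF 5 meta-configuration that sets the LCM RefreshMode (the configuration name and output path are placeholders; pick Push or Disabled to suit the Chef resource you intend to use):

# Meta-configuration to set the LCM RefreshMode ('Push' for dsc_script, 'Disabled' for dsc_resource)
[DSCLocalConfigurationManager()]
configuration LcmRefreshMode
{
    Node 'localhost'
    {
        Settings
        {
            RefreshMode = 'Disabled'
        }
    }
}

# Compile the meta-configuration MOF and apply it to the local node
LcmRefreshMode -OutputPath 'C:\Dsc\LcmRefreshMode'
Set-DscLocalConfigurationManager -Path 'C:\Dsc\LcmRefreshMode' -Verbose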

Currently the main limitation of dsc_resource is a hard timeout of 10 minutes, so it’s not useful for DSC resources like xSQLServerSetup that can run for quite some time. I’ve logged this issue with Chef on GitHub. On the plus side, dsc_resource is a closer analogue of other Chef resources than dsc_script is, which makes writing declarations easy if you have prior Chef experience.

dsc_script is more flexible in terms of being able to use more of the DSC functionality, such as passing configuration data or using PowerShell code within the DSC definition. The downside is that it converts the declaration into a MOF file, so securing credentials requires the DSC method for encrypting strings. This involves having a certificate that supports encryption loaded on the computer and defining that certificate’s thumbprint in your Chef recipe. This leads to a bit of double handling: you should use Chef Vault to secure the credentials in the recipe, but then have to effectively decrypt and re-encrypt the secured string for DSC to embed into the MOF file.
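To show what that double handling rests on, this is a rough sketch of the configuration data DSC expects for certificate-based credential encryption (the certificate path and thumbprint are placeholders, and the certificate must already be installed on the target node):

# DSC configuration data pointing at the encryption certificate for this node
$configData = @{
    AllNodes = @(
        @{
            NodeName        = 'localhost'
            # Public key used at compile time to encrypt credentials in the MOF (placeholder path)
            CertificateFile = 'C:\Certs\DscPublicKey.cer'
            # Thumbprint of the matching certificate installed on the node, used by the LCM to decrypt
            Thumbprint      = 'A1B2C3D4E5F60718293A4B5C6D7E8F9012345678'
        }
    )
}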

Personally I prefer the dsc_resource approach as it’s clean and simple to use in a Chef recipe and I don’t have to deal with the re-encryption of secure strings (and all the troubleshooting pain it brings).

Getting Modules Installed

There are a few ways you can achieve this: you could bundle up the modules as zip files and distribute them within a Chef cookbook, or you can use the new package manager in WMF 5.

The easy way – WMF5 Package Manager

By using the Chef resource “powershell_script” you can invoke the Install-Package cmdlet and use the Get-Package cmdlet as a guard. Note the use of the -Force switch, which will force the install even if the package repository isn’t trusted. This isn’t good practice; a better way is to pre-define the trusted repositories and drop the -Force switch (a sketch of that follows the example below).

powershell_script "install dsc module" do
  code 'Install-Package -Name "xResourceModule" -Force'
  not_if 'if (Get-Package -Name "xResourceModule" -ErrorAction SilentlyContinue) { $true } else { $false }'
end
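As a rough sketch of that tidier approach (assuming the module comes from the PowerShell Gallery; adjust the repository name otherwise), you can mark the repository as trusted once and then drop -Force from the install:

# Mark the PowerShell Gallery as a trusted repository so installs no longer need -Force
Set-PSRepository -Name 'PSGallery' -InstallationPolicy Trusted

# The recipe's install and guard can then drop the -Force switch
Install-Package -Name 'xResourceModule'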

There are some good references online for PowerShell’s package manager.

Managing Configuration Drift


Much has been written on the rise of the DevOps culture and the use of configuration-as-code tools such as Chef, Puppet and DSC.

However, once you have your infrastructure configuration built as code and deployed, how do you validate that it is actually being applied as expected?

Once you have more than a handful of nodes and multiple roles, you can’t easily just log on to the servers manually and check (not to mention that this is totally counterintuitive in the pets-versus-cattle approach to infrastructure management).

I’ve run into Guardrail, a tool from a start-up called Scriptrock. Its whole reason for existing is to solve the problem of validating deployed configuration. It takes the philosophy that you can’t rely on the police to police themselves: if Chef or DSC says it applied the config, can you be sure it actually did?

Guardrail can also take a config policy and scaffold out a Chef, Puppet or DSC configuration to speed up automating deployments.

I wholeheartedly recommend checking it out. They have a basic hosted trial option that allows for a rapid proof of concept.

Remember my post from a few months ago where I announced my departure from the Ports of Auckland? Well it turns out that the call of the port is far too strong for me to ignore.

As such I’m returning to the world of oversized Tonka toys and round-the-clock operations. Yep, you heard right, and it will be the exact same role that I left. A big thanks to the team and management at the Port for welcoming me back.

A lot of what I mentioned in my departure post all those months ago about what I would miss turned out to be true. On top of that, I’ve rapidly discovered that I miss the day-to-day “cut and thrust” of managing a complex IT environment with demands that only a true 24/7 operation brings.
Being the owner of an infrastructure platform and being ultimately responsible for it were things I didn’t realise I’d miss, but I have. If I make a decision or choice for that environment then I’m the one that has to live with it, usually at 3am on a Sunday morning 😉

I’ve also longed to get back to the DataCore SanSymphony virtual SAN I deployed at the Port. I’ve caught up with my old team a few times since I left, and their positive comments about SanSymphony reinforced my belief that it was the right choice for the Port. I certainly want to be around for the major hot upgrade to V and to see the benefits that the new version will deliver! I also need to tick off my DCIE and upgrade training soon.

I think that also confirms that this blog will be focused on SanSymphony and vSphere from now on (though I can see AWS popping up here and there as well).

Anyway, roll on the return to the PoAL family; you can always leave the port, but the port never leaves you.

In doing some final testing and handover for upgrading Ports of Auckland’s VMware hosts to ESXi 5, we discovered that on boot the host appeared to hang while loading the vmw_vaai_cx module. We had already successfully rebooted this test blade a few times, but had just re-enabled the FC ports to allow it access to the SAN volumes again. The module name suggested this was storage related, so I did a quick bit of research and found this article:

http://kb.vmware.com/kb/1016106

The long and the short of it: if you are using Microsoft Failover Clusters with shared volumes (e.g. quorum disks or data disks) in VMware, presented as physical-mode RDMs, then you need to flag to each host that has access to those volumes that they are perennially reserved. This stops the host from repeatedly attempting to get a reservation and inspect the volume during boot. In our case we have 15 RDM volumes, so the boot time was fairly excessive.

For each volume you need to get its naa ID and then run an esxcli command against it on every host that can see it.

A neat way to get a list of all RDM disks and their naa IDs is to use this PowerCLI command:

Get-VM | Get-HardDisk -DiskType "RawPhysical","RawVirtual" | Select Parent,Name,DiskType,ScsiCanonicalName,DeviceName

Then you just need to find the naa IDs that are attached to more than one VM; those are the likely MSCS candidates (a quick way to do this is sketched below).
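Building on the command above, a rough PowerCLI sketch of that filtering could look like this:

# Group the RDMs by naa ID and keep those attached to more than one VM (likely MSCS shared disks)
Get-VM | Get-HardDisk -DiskType "RawPhysical","RawVirtual" |
    Group-Object -Property ScsiCanonicalName |
    Where-Object { $_.Count -gt 1 } |
    Select-Object Name, Count    # Name here is the naa ID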

To flag the volumes you then just need to run esxcli (I prefer the PowerCLI version, which makes it easy to apply the command to a group of hosts):

$esxcli = Get-EsxCli -VMHost "ESXhost"

$esxcli.storage.core.device.setconfig($false, "naa.60030d90544d5353514c322d46455247", $true)
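And to run it across a group of hosts, something along these lines works (the cluster name is a placeholder for your own environment; the naa ID is just the example above):

# Flag a shared RDM as perennially reserved on every host in a cluster
$naaId = "naa.60030d90544d5353514c322d46455247"
foreach ($vmhost in Get-Cluster "ProdCluster" | Get-VMHost) {
    $esxcli = Get-EsxCli -VMHost $vmhost
    $esxcli.storage.core.device.setconfig($false, $naaId, $true)
}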

Well, tomorrow is my last day with the Ports of Auckland (I know, you all thought I’d be here forever). I’m off to take up a role as Technical Solutions Designer with vifx (www.vifx.co.nz). For me it’s an exciting new opportunity to specialise and dig my roots deeper into virtualisation, datacenter automation and storage, though the move is tinged with sadness at leaving such a unique place as the port.

It’s been a hell of a ride being immersed in the 24/7, 365 world of the port and being involved in everything from storage to SQL, from Exchange to CCTV systems. Driving past monstrous rolling Tonka toys and going up the quayside cranes is an experience you won’t find in an office block. Most importantly, I leave behind a great team of guys who all mucked in together, had more than a few laughs and shared four consecutive TechEd/Techfests.

Now my focus is going to be on virtualisation, storage and of course the Cloud (yes, I have come to accept that the term is here to stay; I’m just going to have to suck it up and use it like I mean it). Hopefully that new-found focus will be reflected in this blog with a series of in-depth posts. I’m weighing up what to focus on next; ideas, anyone?

Anyway, thanks to the port and the team for a great four years, so long and thanks for all the fish and always have a towel close by.

Disclaimer: I’m throwing this post out there before fully proving this, but the hope is that it will either help someone else or elicit more information. I’m not a network engineer, but I’ve been down and dirty in the world of network packets a fair amount.

UPDATE 15/8/2012: I can confirm that we have resolved the issue described below. Long story short, it would seem that with the newer IBM BladeCenter AMM firmware revisions it is necessary for the AMM management network to be on a different subnet from the normal production network. Our guess is that the AMM becomes confused about where network packets should be directed, either out of the external interface or onto the internal blade network.

Background:

We run two IBM BladeCenter H chassis with a single AMM module in each. We’ve had them going on four years, initially with no trouble, but we have progressively run into bugs as firmware has been updated to support newer models of blade servers. Just recently I’ve been working on ensuring we are running recent, stable firmware for these servers in preparation for upgrading to ESXi 5.

In terms of networking, our internal network has traditionally been a flat layer 2 network; more recently we introduced layer 3 core switching and VLANs into the mix. Regardless, we have historically had our management Ethernet interfaces on the same layer 2 subnet as our servers and clients. The diagram below shows the segment of our network that relates to our BladeCenter for this scenario (in fact it’s much more complex than that, but I’ve cut out the extra faff). This has worked fine for years.

The Problem:

So in the last few days I updated our DR BladeCenter AMM to version 3.62R (our other BladeCenter in production is running 3.62C).

After the restart I noticed that pinging the device was very patchy; our monitoring system was reporting the AMM, plus the Cisco and Brocade modules’ management interfaces in the chassis, as being down. Pinging any of them either showed good sub-1ms responses or nothing at all; there was no variation in latency.

The normal troubleshooting routine began: plugging in to the AMM via a crossover cable showed a stable connection, there was nothing of note in the AMM logs, and resetting and re-seating the AMM made no difference. Fixing the speed and duplex of the connection made no difference either.

A bit of research brought me to this IBM RETAIN tip: ibm.co/Pf7mux.

The tip effectively states that if you have the AMM on the same subnet as the blade servers, then server network connectivity can be lost, as the AMM may send a proxy ARP reply to a request directed at the server. WTF!?

What bugs me is that I’ve looked at the install guides and there is no warning regarding this flat configuration; it has been fine for years, with issues only coming up now with the 3.62 firmware.

We’re now trying to confirm that we have this exact problem by moving the AMM onto its own VLAN, as well as monitoring the AMM’s network traffic to get a better view of what is happening.

I’ve just been deploying Alastair Cooke’s awesome AutoLab into VMware Workstation 8 on Ubuntu. Now, Linux is a bit of a new beast for me, being a long-time Windows guy, but we’ve all got to broaden our horizons now and then.

As I was registering the pre-built VMs, a warning popped up about promiscuous mode not being able to be enabled. Not paying full attention and being keen to get a lab up and going, I ignored the warning and carried on.

The lab progressed smoothly until I went to add the hosts to the VC and saw a warning in the PowerShell script that adds and configures the hosts. I’d chosen the full automation option, including deploying a nested guest.

After a quick bit of investigation with the first tool of choice (ping), I came to the conclusion that there was something not quite right on any network using VLANs. Having hit this before in the lab I built for VCP5 practice, and with the faint memory of a warning earlier in the day, I went googling.

Very quickly I found this VMware KB: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=287. Effectively, Workstation configures permissions on vmnet0 through vmnet3 so that only root can enable promiscuous mode, which needs to be turned on for nested ESXi with VLANs configured to work.

The moral of the story: do one thing at a time, and pay heed to warning messages.