Archive for the ‘Uncategorized’ Category

This is part of a series of posts on automating Windows and deploying systems in the real world of enterprise operations. I've focused the series on those nuances and problems that I've hit trying to deploy the various software stacks found in a typical Windows-centric enterprise.

At some point we will hit the need to use DSC resources in Chef recipes. The reality is that the Chef cookbooks for Windows only go so far, and most Microsoft products outside of Windows itself have no coverage at all. The DSC resources available are now open source, just like Chef cookbooks, and the collection is expanding all the time.

I suspect we will see more and more DSC resources come out of Microsoft’s product teams as well.

Chef DSC Resources

There are a couple of key requirements for computers to leverage DSC in Chef. Windows Management Framework 5 (Feb Preview or better) is needed for DSC to be available, and Chef Client 12.5 or later is required to provide the two Chef resources that allow DSC resources to be consumed in cookbooks.

UPDATE: Chef Client 12.6 has just been released, which removes the limitations below (see the changelog for full details; strangely, the inclusion of the timeout attribute isn't in the changelog but was included in the master branch). A recent version of WMF 5 is still required, as Microsoft relaxed the LCM requirements.

There are two Chef resources for using DSC with Chef: dsc_script and dsc_resource. These are effectively mutually exclusive, as they require the DSC Local Configuration Manager's RefreshMode to be set to Push (dsc_script) or Disabled (dsc_resource).
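For reference, a minimal sketch of flipping the LCM into the Disabled refresh mode (for dsc_resource) using a WMF 5 meta-configuration; the output path is just a placeholder:

# Meta-configuration for the Local Configuration Manager's RefreshMode.
# Use 'Disabled' for dsc_resource, or 'Push' if you're using dsc_script.
[DSCLocalConfigurationManager()]
configuration LcmRefreshMode
{
    Node 'localhost'
    {
        Settings
        {
            RefreshMode = 'Disabled'
        }
    }
}

# Compile the meta-configuration MOF and apply it to the local machine.
LcmRefreshMode -OutputPath 'C:\Temp\LcmRefreshMode'
Set-DscLocalConfigurationManager -Path 'C:\Temp\LcmRefreshMode' -Verbose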

Currently the main limitation of dsc_resource is its hard timeout of 10 minutes, so it's not useful for DSC resources like xSQLServerSetup that can run for some time; I've logged this issue with Chef on GitHub. On the other hand, it is a closer analogue of other Chef resources than dsc_script is, which makes writing declarations easy if you have prior Chef experience.
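As a rough sketch of what a dsc_resource declaration looks like in a recipe (the xSmbShare module and its property values here are purely illustrative, not something from this series):

# Consume a DSC resource (xSmbShare) directly from a Chef recipe.
# Any DSC resource module installed on the node can be referenced this way.
dsc_resource 'application share' do
  resource :xSmbShare
  module_name 'xSmbShare'
  property :Name, 'AppShare'
  property :Path, 'C:\Shares\App'
  property :Ensure, 'Present'
end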

dsc_script is more flexible in terms of being able to use more of the DSC functionality, such as passing configuration data or using PowerShell code within the DSC definition. The downside is that it compiles the declaration into a MOF file, so securing credentials requires the DSC method for encrypting strings. That involves having a certificate that supports encryption loaded on the computer and specifying that certificate's thumbprint in your Chef recipe. This leads to a bit of double handling: you should use Chef Vault to secure the credentials in the recipe, but then have to effectively decrypt and re-encrypt the secured string for DSC to embed in the MOF file.
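To give a feel for the shape of a dsc_script declaration, here's a rough, untested sketch. The account name, vault item names and the configuration data keys are all placeholders (and the exact key used to reference the encryption certificate should be checked against the DSC credential-encryption documentation); the password is assumed to come from Chef Vault via the chef-vault helper.

# Hypothetical vault item; in practice use whatever data bag and item hold the credential.
svc_password = chef_vault_item('secrets', 'svc_app')['password']

dsc_script 'application service account' do
  code <<-EOH
    # Arbitrary PowerShell is allowed here and runs when the MOF is compiled.
    $secure = ConvertTo-SecureString '#{svc_password}' -AsPlainText -Force
    $cred   = New-Object -TypeName System.Management.Automation.PSCredential -ArgumentList 'svc_app', $secure

    User AppServiceAccount
    {
      UserName = 'svc_app'
      Password = $cred
      Ensure   = 'Present'
    }
  EOH
  # The configuration data is where the encryption certificate is referenced so the
  # credential isn't embedded in the MOF as plain text (key name assumed from memory).
  configuration_data <<-EOH
    @{
      AllNodes = @(
        @{
          NodeName      = 'localhost'
          CertificateID = '<thumbprint of the encryption certificate>'
        }
      )
    }
  EOH
end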

Personally I prefer the dsc_resource approach as it’s clean and simple to use in a Chef recipe and I don’t have to deal with the re-encryption of secure strings (and all the troubleshooting pain it brings).

Getting Modules Installed

There are a few ways you can achieve this: you could bundle the modules up as zip files and distribute them within a Chef cookbook, or you can use the new package manager in WMF 5.

The easy way – WMF5 Package Manager

By using the Chef resource powershell_script you can invoke the Install-Package cmdlet and use the Get-Package cmdlet as a guard. Note the use of the -Force switch, which forces the install even if the package repository isn't trusted. That isn't good practice; a better way is to pre-define the trusted repositories and drop the -Force switch (a sketch of that follows the example below).

powershell_script "install dsc module" do
  code 'install-package -name "xResourceModule" -force
  not_if 'if(get-package -name "xResourceModule"){$true}else{$false}

end
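If you'd rather avoid -Force, one option is to mark the repository as trusted up front; a sketch, assuming the module is coming from the PSGallery repository (that repository name is just an example):

# Mark the repository as trusted so installs from it no longer prompt or need -Force.
Set-PSRepository -Name 'PSGallery' -InstallationPolicy Trusted

# The module install can then drop the -Force switch.
Install-Package -Name 'xResourceModule' -Source 'PSGallery'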

Some good references for PowerShell's package manager are below:


Managing Configuration Drift

Posted: October 7, 2014 in Uncategorized

Much has been written on the rise of the devops culture and the use of configuration-as-code tools such as Chef, Puppet and DSC.

However once you have your infrastructure configuration built as code and deployed, how do you validate that it is actually being applied as expected?

Once you have more than a handful of nodes and multiple roles you can't easily just log on to the servers and check manually (not to mention that this is totally counterintuitive in the pets-vs-cattle approach to infrastructure management).

I've run into a tool called Guardrail from a start-up called ScriptRock. Its whole reason for existing is to solve the problem of validating deployed configuration. It takes the philosophy that you can't rely on the police to police themselves: if Chef or DSC says it applied the config, can you be sure it actually did?

Guardrail can also take a config policy and scaffold out a Chef, Puppet or DSC configuration to speed up automating deployments.

I wholeheartedly recommend checking it out. They have a basic hosted trial option that allows for a rapid proof of concept.

Remember my post from a few months ago where I announced my departure from the Ports of Auckland? Well it turns out that the call of the port is far too strong for me to ignore.

As such I'm returning to the world of oversized Tonka toys and round-the-clock operations. Yep, you heard right, and it will be the exact same role that I left. A big thanks to the team and management at the Port for welcoming me back.

A lot of what I mentioned in my departure post all those months ago about what I would miss turned out to be true. On top of that, I've rapidly discovered that I miss the day-to-day "cut and thrust" of managing a complex IT environment, with demands that only a true 24/7 operation brings.
Being the owner of an infrastructure platform and being ultimately responsible for it were things I didn't realise I'd miss, but I have. If I make a decision or choice for that environment then I'm the one that has to live with it, usually at 3am on a Sunday morning 😉

I've also longed to get back to the Datacore SanSymphony virtual SAN I deployed at the Port. I've caught up with my old team a few times since I left, and their positive comments about SanSymphony reinforced my belief that it was the right choice for the Port. I certainly want to be around for the major hot upgrade to V and to see the benefits that the new version will deliver! I also need to tick off my DCIE and upgrade training soon.

I think that also confirms that this blog will be focused on SanSymphony and vSphere from now on (though I can see AWS popping up here and there as well).

Anyway, roll on the return to the PoAL family; you can always leave the port, but the port never leaves you.

In doing some final testing and handover for upgrading Ports of Auckland's VMware hosts to ESXi 5, we discovered that on boot the host appeared to hang at loading the vmw_vaai_cx module. We had already successfully rebooted this test blade a few times, but had just re-enabled the FC ports to allow it access to the SAN volumes again. Suspecting from the module name that this was related to storage, I did a quick bit of research and found this article.

http://kb.vmware.com/kb/1016106

The long and the short of it: if you are using Microsoft Failover Clusters with shared volumes (e.g. quorum disks or data disks) in VMware, using physical-mode RDMs, then you need to flag to each host that has access to those volumes that they are perennially reserved. This stops the host from repeatedly attempting to get a reservation and inspect the volume at boot. In our case we have 15 RDM volumes, so the boot time was fairly excessive.

For each volume you need to get its naa ID and then run an esxcli command against it on each host.

A neat way to get a list of all RDM disks and their naa IDs is to use this PowerShell command:

Get-VM | Get-HardDisk -DiskType "RawPhysical","RawVirtual" | Select Parent,Name,DiskType,ScsiCanonicalName,DeviceName

Then you just need to identify the naa IDs that are connected to more than one VM to find the likely MSCS candidates.
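A rough (untested) way to do that grouping in PowerCLI, building on the command above:

# Find RDM naa IDs presented to more than one VM (the likely MSCS shared disks).
Get-VM | Get-HardDisk -DiskType "RawPhysical","RawVirtual" |
  Group-Object -Property ScsiCanonicalName |
  Where-Object { $_.Count -gt 1 } |
  Select-Object @{N='NaaId';E={$_.Name}},
                @{N='VMs';E={ ($_.Group | ForEach-Object { $_.Parent.Name } | Sort-Object -Unique) -join ', ' }}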

To flag the volumes you then just need to run esxcli (I prefer the PowerCLI version, which makes it easy to apply the command to a group of hosts):

$esxcli = Get-EsxCli -VMHost ESXhost

$esxcli.storage.core.device.setconfig($false, "naa.60030d90544d5353514c322d46455247", $true)
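To apply it across a group of hosts in one pass, something along these lines works; this is a sketch only, the cluster name is a placeholder and the naa list should contain the IDs identified above:

# naa IDs of the shared MSCS RDM volumes identified earlier (add the rest of yours here).
$rdmIds = @('naa.60030d90544d5353514c322d46455247')

foreach ($vmHost in Get-Cluster 'ProductionCluster' | Get-VMHost) {
    $esxcli = Get-EsxCli -VMHost $vmHost
    foreach ($id in $rdmIds) {
        # Arguments are: detached ($false), device id, perennially reserved ($true).
        $esxcli.storage.core.device.setconfig($false, $id, $true)
    }
}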

Well, tomorrow is my last day with the Ports of Auckland (I know, you all thought I'd be here forever). I'm off to take up a role as Technical Solutions Designer with vifx (www.vifx.co.nz). For me it's an exciting new opportunity to specialise and dig my roots deeper into virtualisation, datacenter automation and storage, though the move is tinged with sadness at leaving such a unique place as the port.

It's been a hell of a ride being immersed in the 24/7, 365 world of the port and being involved in everything from storage to SQL, from Exchange to CCTV systems. Driving past monstrous rolling Tonka toys and going up the quayside cranes is an experience you won't find in an office block. Most importantly, I leave behind a great team of guys who all mucked in together, had more than a few laughs and shared four consecutive TechEd/TechFests.

Now my focus is going to be on virtualisation, storage and, of course, the Cloud (yes, I have come to accept that the term is here; I'm just going to have to suck it up and use it like I mean it). Hopefully that new-found focus will be reflected in this blog with a series of in-depth posts. I'm weighing up what to focus on next; ideas, anyone?

Anyway, thanks to the port and the team for a great four years. So long, and thanks for all the fish, and always have a towel close by.

Disclaimer: I'm throwing this post out there before fully proving this, but the hope is that it will either help someone else or elicit information from others. I'm not a network engineer, but I've been down and dirty in the world of network packets a fair amount.

UPDATE 15/8/2012: So I can confirm that we have resolved the issue described below. Long story short, it would seem that with the newer IBM Bladecenter AMM firmware revisions it is necessary for the AMM management network to be on a different subnet from the normal production network. Our guess is that the AMM becomes confused about where network packets should be directed: to the external interface or to the internal blade network.

Background:

We run two IBM Bladecenter H chassis with a single AMM module in each. We've had them going on four years, initially with no trouble, then progressively ran into bugs as firmware was updated to support newer models of blade servers. Just recently I've been working on ensuring we are running recent, stable firmware for these servers in preparation for upgrading to ESXi 5.

In terms of networking, our internal network has traditionally been a flat layer 2 network; more recently we introduced layer 3 core switching and VLANs into the mix. Regardless, we have historically had our management Ethernet interfaces on the same layer 2 subnet as our servers and clients. The diagram below shows the segment of our network that relates to the Bladecenter in this scenario (in fact it's much more complex than that, but I've cut out the extra faff). This has worked fine for years.

The Problem:

So in the last few days I updated our DR Bladecenter AMM to version 3.62R (our other Bladecenter in production is running 3.62C).

After the restart I noticed that pinging the device was very patchy; our monitoring system was reporting the AMM, plus the Cisco and Brocade modules' management interfaces in the blade chassis, as being down. Pinging either showed good sub-1ms responses or nothing at all; there was no variation in latency.

The normal troubleshooting routine began: plugging into the AMM via a crossover cable showed a stable connection, there was nothing of note in the AMM logs, and resetting and re-seating the AMM made no difference. Fixing the speed and duplex of the connection made no difference either.

A bit of research brought me to this IBM retain tip ibm.co/Pf7mux.

The tip effectively states that if you have the AMM on the same subnet as the blade servers, then server network connectivity can be lost, as the AMM may send a proxy ARP response to a request directed at the server. WTF!?

What bugs me is that I've looked at the install guides and there is no warning regarding this flat configuration; for years it's been fine, with issues only coming up now with the 3.62 firmware.

We're now trying to confirm that we have this exact problem by moving the AMM onto its own VLAN, as well as monitoring the AMM's network traffic to get a better view of what is happening.

I've just been deploying Alistair Cooke's awesome AutoLab into VMware Workstation 8 on Ubuntu. Now, Linux is a bit of a new beast for me, being a long-time Windows guy, but we've all got to broaden our horizons now and then.

As I was registering the pre-built VMs, a warning popped up about promiscuous mode not being able to be enabled. Not paying full attention, and being keen to get a lab up and going, I ignored the warning and carried on.

The lab progressed smoothly until I went to add the hosts to the VC and saw a warning in the PowerShell script that adds and configures the hosts. I'd chosen the full automation, including deploying a nested guest.

After a quick bit of investigation with the first tool of choice (ping), I quickly came to the conclusion that something was not quite right on any network using VLANs. Having hit this before in the lab I built for VCP5 practice, and with the faint memory of a warning earlier in the day, I went a-googling.

Very quickly I found this VMware KB: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=287. Effectively, Workstation configures permissions on vmnet0 through vmnet3 so that only root can enable promiscuous mode, which you need turned on for nested ESXi with VLANs configured to work.

The moral of the story: do one thing at a time and pay heed to warning messages.

I'm loath to use the word "Cloud"; it's become a sales and marketing catch cry and an easy way to encapsulate whichever strategy or technology stack is being pushed. The danger now is that whole IT strategies are being simplified and boiled down into the simple question of whether and when to move to a "Cloud Strategy".

Let's pick apart the term "Cloud" into its conceptual technological pieces; there are four strands that make up the generally accepted term "Cloud". (Yes, you could argue that some do or don't apply, especially when you consider SaaS, but it still holds true in general terms.)

  1. Virtualisation
  2. Automation
  3. Self Service
  4. Service Orientated Model

Potentially, if you are thinking "public cloud", then you can add the following:

  1. Outsourcing

Now when you step back a bit and look at that breakdown, you'll see that there's nothing really new here. Granted, virtualisation in its current form has only really been around for ten years or so in the x86 world, thanks to VMware, but the concept has been around for a lot longer.

My point, however, is that all of these concepts have been around for a long time on their own. There's nothing new here, and some companies have embraced some or all of them to varying degrees, depending on the payback to the company in real dollar terms.

What's changed now is that the whole set of conceptual strategies has been bundled up into a new shiny package and labelled the "Cloud". There is no singular, one-size-fits-all cloud strategy; when setting any IT strategy, the question should be, 'What do we need to focus on to enable our company to succeed in its business strategy? Which concepts and technologies should we invest in, based on their projected payback to the business?'

Even within those base concepts there are options and directions to take: do we need to deliver self-provisioning of servers and automated scale-out provisioning? Or do we need to focus on delivering on-demand replicas of production systems for test and development?

At least in a New Zealand context, I can't see businesses en masse being willing to invest (capex or opex) in the full gamut of solutions potentially available under the cloud umbrella if they don't show a real payback. In fact, my fear is that the whole repackaging and presentation of the cloud is going to have the effect of obscuring real benefits, because the whole "Cloud" concept seems so big and difficult.

The company I work for is a prime example of some parts of the cloud package having greater value than others. Generally our application stack is fairly static: new applications are rolled out relatively infrequently, and demand for those applications doesn't fluctuate heavily. However, change within those applications happens frequently; modifications to settings, code and functionality are happening all the time. Our current static set of test and development systems is frequently out of sync with production, or is being used for multiple projects. This makes it difficult to ensure that the behaviours tested are replicated in production. For us, the ability to quickly deploy cloned subsets of production into isolated test environments would be of high value, whereas automated breathe-in, breathe-out scale-out of applications is of little value.

Simply put, adopting a “Cloud” strategy, without putting in the time to understand the business problems and possible solutions is likely to leave you with a shiny package with nothing of value inside.

vCloud Director across sites

Posted: March 24, 2012 in Uncategorized

A good summary of using vCloud Director across multiple sites, and thus of what VMware supports. The simple answer is that if your sites have low-latency links (20ms or lower), then you can leverage one vCloud installation.