TL;DR: We rewrote our internal tooling to use Terraform instead of Chef Provisioning, with much pain and yet much gain in productivity, and we released 0.12 of Cloudsmith with major new features such as support for Open-Source repositories. There was much rejoicing!
Stay A While And Listen
In the past few months, we've been primarily focused on rebuilding our Infrastructure-as-Code (IaC) toolchain for infrastructure orchestration, which is one of those delightful tenets of DevOps that allows us to automate the provisioning of our (extensive) server farm. Indeed it states in our Tao of Cloudsmith that we must strive to Automate Everything and we certainly try to. We haven't quite found a way to automate ourselves out of a job just yet, but it's in the pipeline. 😉
For those of you without IaC (if you have it, bear with me), imagine this:
Your team are tasked with producing an exact replica of Production, either for replacement or for another environment such as Staging; the Production environment is a 20-strong server farm (in the Cloud) comprising web servers, database servers, cache servers, asynchronous task worker servers, middleware/queue servers, and storage servers. These have been meticulously built at the platform-level manually or mostly manually (e.g. within AWS) and hook in with another half a dozen platform services (such as Simple Storage Service, EC2 Autoscaling, CloudWatch, Elastic File System, Identity/Access Management, Elastic Load Balancers, Simple Queue Service, etc.).
Even with the assistance of Configuration Management (CM) tools like Chef (or Puppet or Ansible, etc.), and Package Management (PM) tools like Cloudsmith (or those other competitors that definitely don't exist), this still seems like an incredibly daunting process, right? Right. Even boiling the problem down to replacing one server can be tricky depending on just how manual the process was to construct it in the first place. This is where something like IaC pays dividends beyond the initial spend on the cognitive load of learning and deploying it.
Introducing Infrastructure-as-Code (IaC)
By using a specialized declarative language to describe exactly what your infrastructure should look like, from the top of the platform level right down to machine configuration, you will be able to take an environment like Production and produce a replica such as Staging just by altering the configuration and telling the IaC tool to orchestrate everything for you. I'm skipping a few steps here, such as completing IaC beyond infrastructure orchestration with the aforementioned CM/PM tools, but the bottom line remains true; it is a business-redefining capability that enables you to reduce the need for expensive Humans (those pesky engineers).
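To make the "just alter the configuration" idea a little more concrete, here is a minimal sketch in Python rather than a real IaC language. Everything in it is an assumption for illustration: the `Environment` class, its fields, the `describe` helper and the server counts are all hypothetical, and a real tool such as Terraform expresses this in its own configuration language.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Environment:
    """Hypothetical, heavily simplified description of one environment."""
    name: str
    web_servers: int
    worker_servers: int
    instance_type: str


def describe(env: Environment) -> List[str]:
    """Expand the description into the named resources it implies."""
    resources = [f"{env.name}-web-{i}" for i in range(1, env.web_servers + 1)]
    resources += [f"{env.name}-worker-{i}" for i in range(1, env.worker_servers + 1)]
    return resources


# Illustrative numbers only; they are not our real server counts.
production = Environment("production", web_servers=4, worker_servers=6, instance_type="large")

# Staging is the same description with a handful of values altered.
staging = replace(production, name="staging", web_servers=1, worker_servers=1, instance_type="small")
```

The point of the sketch is that Staging is not built by hand; it is derived from the same description as Production, so the two can only differ in the values you deliberately changed.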
So back to the task at hand; this is what our team was required to do recently, and although it took us 8-10 weeks to perform the switch-over to full IaC, we were able to produce an entire Staging stack (including all of those servers above) in 30 minutes. We then rolled out a major Production release 3 weeks ago using the new tooling, and it took two hours. Production inevitably took a little longer because we had to bring down the current Production first, including confirming backups, testing the move on Staging, and then orchestrating the new setup. Of course we had some teething problems in the new deployment, but any problems were promptly executed with a deft swoop of the IaC tool. Apply!
Wait a second, backtrack there, I hear you say - Doesn't Cloudsmith sing and dance the praises of DevOps? Shouldn't Cloudsmith already have IaC and all the amazing things it brings? Don't you Automate Everything? Yes, yes and yeee... mostly! Here's where the blog turns into a discussion on lessons learned, on "How Not To Do Infrastructure-as-Code". As you'll see, it's a story of unfortunate decisions and backing the wrong horse, but it shows that technical decisions can have extremely long-reaching consequences.
How Not To Do Infrastructure-as-Code
Back in late 2014 we first started to integrate Chef as our CM tool of choice, and although it was beyond painful (and still is at times), we reaped the reward of being able to build a server programmatically (notice how this thematically matches and complements infrastructure orchestration). It was then in 2015 that infrastructure orchestration really started to make a decent impact on the DevOps community, and we noticed. We noticed that we didn't have it, and therefore needed it. So in late 2015 we began our journey that would end up resembling Dante's Inferno and the nine circles of Hell.
Out of the tools available for infrastructure orchestration, we narrowed it down to two, both of which were still cutting their teeth and were early days Alpha-level products: Chef Provisioning and Hashicorp Terraform. Both were Open-Source and gaining traction. Having built our CM tooling on Chef, it was a natural decision at the time to go with Chef Provisioning - Chef with the capability to configure infrastructure beyond just machines. What's not to like? We love Open-Source and contribute when we can, so we even contributed to the project to help it grow and to implement some AWS resources that we needed for our infrastructure, such as the initial PR for CloudWatch Alarm support.
However, by the following year it was still severely lacking in many areas. It was missing critical resources that we needed to build out parts of the infrastructure (e.g. RDS Parameter/Option Groups, SQS Queues, EC2 Autoscaling Notifications, S3 Buckets, IAM Policies, etc.) and we had acquired a serious amount of local hackage (the technical term) required to run it properly. This involved all sorts of duct-taping to make it usable for targeting different environments. It was a complete and utter mess, some of which was perhaps our fault. The result was an IaC toolchain that was flaky at best, didn't always work, wasn't convergent or fully idempotent, didn't like running in parallel, had no dry-run to plan changes, and had no easy way of storing state remotely (so collaborating with others was painful).
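For anyone wondering what "convergent", "idempotent" and "dry-run" mean in practice, the following is a toy sketch in Python, not how Chef Provisioning or Terraform actually work internally. The `plan` and `apply` functions and the resource names are hypothetical: state is modelled as a plain dictionary of resource name to settings, and nothing touches a real platform.

```python
def plan(desired: dict, actual: dict) -> list:
    """Dry run: list the changes needed, without making any of them."""
    actions = []
    for name, config in sorted(desired.items()):
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != config:
            actions.append(("update", name))
    for name in sorted(actual):
        if name not in desired:
            actions.append(("delete", name))
    return actions


def apply(desired: dict, actual: dict) -> dict:
    """Converge: return the state that results from carrying out the plan."""
    new_state = dict(actual)
    for action, name in plan(desired, actual):
        if action == "delete":
            del new_state[name]
        else:
            new_state[name] = desired[name]
    return new_state


# Hypothetical resources, for illustration only.
desired = {"web-1": {"size": "small"}, "queue": {"fifo": False}}
actual = {"web-1": {"size": "large"}, "old-cache": {"size": "small"}}

first_plan = plan(desired, actual)   # what would change, shown before anything happens
state = apply(desired, actual)       # converge on the desired description
second_plan = plan(desired, state)   # idempotent: nothing left to do on a second run
```

A tool with these properties lets you review `first_plan` before committing to it, and running it again afterwards is a no-op. Those were exactly the guarantees we were missing.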
This isn't to say that it was an awful product, but it was an awful experience for us and one that I regret immeasurably. We had gained automation in one way, and lost our souls' worth of development-time in another. By 2016 we had started to see external proof that we had chosen the wrong tool, such as obvious rot that had set in (in which Chef Provisioning was not keeping up with the rate of change to match platforms such as AWS) and Noah Kantrowitz (a thought-leader in the Chef community known as @coderanger) stating that Chef Provisioning is unlikely to be a good choice for new projects, and that the project had more-or-less been descoped. Turns out he was right about it not being a good choice! Gulp.
Yep, technical leadership guilt-trip time! Cue masses of finger-pointing and chants of "Who did it ... Who did it!" - The critical mass really hit though when earlier this year we needed to make critical fixes to Production and Chef Provisioning completely failed us. It had stopped working in strange and obscure ways with our newer releases of Chef, and the rot that had set in by not keeping up with AWS meant that there were more and more things that we couldn't deploy. For example, Application Load Balancer (ELB 2.0) was released by AWS, and this was not supported by Chef Provisioning at all, at the time. There was also no Staging environment to assist with testing Production changes because Chef Provisioning wasn't well suited to doing that either. We would have gone to red alert, but that would have meant changing the light bulb.
Rewriting History As We Know It
So the executive decision was taken to rip out Chef Provisioning and to replace it with what was now a much more mature product. One that has kept up to date with the rapid change of AWS. One that now has an extremely strong backing of the community behind it. The King is Dead, Long Live the King: Hashicorp Terraform. Yes, exactly the same tool we had evaluated 2 years ago and passed over in favor of Chef. To be fair it was much less polished at the time and it was anyone's game then. So here we are now at the end of the IaC journey. With the assistance of Terraform describing most of our infrastructure, we finally got that Staging environment (and more) and the capability for high velocity releases to Production with support for the latest AWS components.
We're still not completely there yet for that 100% automation, but we're on the path to it and things are looking much brighter now. The bottom line is that the path to DevOps bliss can be dark and full of terrors (OK, I stole that one), and that unfortunate decisions can turn into long-term pain and regret. Accruing technical debt to fix bad decisions is only going to lead to more pain, and sometimes it's better to address an issue at its root cause. Although it resulted in an 8-10 week tangent for us, the result is that we can now focus on other more obvious customer-oriented goals, such as implementing additional package formats and features like Package Retention, Geo/IP Restrictions, and a better REST API (with those sweet, sweet, integrations that DevOps love).
So that leads neatly on to our latest release, 0.12, which primarily contains:
Open-Source Repository Support: With a generous allowance of 10GB storage and 100GB bandwidth (and more can be requested), for free! This is a bit different from our competitors (who don't exist, probably), in that you don't need to activate an Open-Source plan to use them. Every single plan offers the ability to create Open-Source repositories, and the features available to them match your actual tier. So if you enjoy the use of metrics and statistics, you can have them on your Open-Source repositories too. To get an idea of what they are like, you can see a live example of an Open-Source Repository on the Cloudsmith account.
Pricing Plan Calculator: You can see it in action at the bottom of the plans/pricing page, but we now have support for a calculator to help you decide on which plan is the best based on your storage, bandwidth and feature requirements. We don't list other package formats apart from Raw since they're available on all tiers.
Fixes All The Things (TM): OK, not actually all the things, but we fixed many things between 0.10 (0.11 was our Terraform-powered release 3 weeks ago) and the current release. This involves lots of page speed improvements, optimization for user on-boarding and workflow (with more to come) and security fixes.
If you want to keep on top of our releases, we maintain a list of release notes that detail changes and we also move things about on our Trello-based roadmap to show you what's coming up and what's been released.
That's A Wrap
That's all until next time (should you wish to hear about it), but I leave you with one final thought: It seemed obvious to us in hindsight, but until you have a fully IaC-capable infrastructure, you don't realise just how much time you waste on building, deploying, rebuilding, redeploying, fighting fires, remembering how to fight fires, etc.
How Actually To Do Infrastructure-as-Code (IaC)
If you're going to build a software-based service (at any scale), and you want to maintain a reasonable level of velocity (and sanity), do your own research and integrate the best tools/practices you can for IaC/DevOps:
Use an Infrastructure Orchestration tool to describe, version and manage your infrastructure programmatically, such as Hashicorp Terraform.
Use a Configuration Management (CM) tool to describe, version and manage your server configurations/contents programmatically, such as Chef, Puppet, Ansible, etc.
Use a Package Management (PM) tool to store, version and distribute your application dependencies and server software, such as Cloudsmith. 😊
Use a Continuous Integration (CI) / Continuous Deployment (CD) tool to automate these and bring your testing/deployment pipeline together, such as Jenkins, TravisCI, CircleCI, Chef Automate, etc.
If you haven't given Cloudsmith a chance yet, please endeavor to come and try it out. We don't bite and we're always happy to help you through the process of creating repositories and automating your Package Management roundtrip. We're also good for having a chat about DevOps in general, especially if we can learn something new from how you do it.