Case Study

Writing Amazon's Internal Runbook for the Move to AWS

Authoring the internal migration runbook for Amazon's company-wide mandate to move applications off VMs and onto AWS, in the era before Lambda, ECS, or most managed services existed, when CloudFormation was brand new. The primitives were EC2, S3, EBS, and early RDS, and the architectural judgment we developed then is exactly what later abstractions let engineers stop thinking about.

Client: Amazon.com (Merchants@ organization)
Role: Senior Support Engineer
Period: 2011
Scale: Company-wide mandate · Pre-managed-services era

In 2011, AWS was approximately five years old as a public product, and the AWS that existed in 2011 is not the AWS you know today. EC2 was there, and S3, and EBS, and RDS had launched in late 2009 with MySQL support. ELB existed. There were a few instance families, and they behaved in ways that are now quaint: instance store was ephemeral and you were expected to know it, EBS volumes had surprising throughput cliffs, and there were no managed services in the modern sense. CloudFormation arrived only in February of that year. Lambda was three years away. ECS was four. The ecosystem of books, blog posts, reference architectures, and Stack Overflow answers that today’s engineers take for granted did not exist. You figured it out from first principles, or you asked the person at the next desk who had figured it out from first principles the week before.

Inside Amazon, there was an internal mandate known as MAWS — Move to AWS. Every team was being pushed to get off the company’s internal virtualization platforms and onto AWS, both because AWS was better and because Amazon was trying to validate the platform it was selling externally by forcing itself to eat its own cooking. The Merchants@ organization, where I was working, had dozens of applications and services in various stages of needing to migrate. Someone needed to write down how. That turned out to be me.

What I wrote became the internal migration runbook: a prescriptive mapping of common application shapes to the AWS primitives that existed, with enough architectural guidance that a team could follow it without making the mistakes the earliest adopters had already made. It was boring in the way that good operational documentation is always boring — mostly a long list of “if your system looks like X, do Y, watch out for Z.” It ended up being used as the baseline for how the broader organization migrated.

This is a retrospective on what that work was actually like, and what it suggests about how to think about infrastructure migrations where no established pattern yet exists.

The thing about first-principles cloud

Modern cloud architects spend most of their time making choices between managed services. Do you use ECS or EKS? Aurora or Postgres on RDS? SQS or MSK? API Gateway or ALB? The decisions matter, but they’re abstraction-layer decisions — the underlying primitives are mostly hidden and rarely interacted with directly.

In 2011, there was no abstraction layer to choose between. There was EC2. There was S3. There was EBS. There was RDS, which was real but limited. There was ELB, which was new and had surprising failure modes. What you built was what you built from these pieces, and if you wanted anything more sophisticated — automatic deployment, rolling updates, service discovery, self-healing clusters — you built it yourself.

This was liberating and terrifying in equal measure. Liberating because the architectural decisions were real — you chose your failure domains, your blast radius, your state management, your durability guarantees, all explicitly. Terrifying because you were responsible for getting them right, and nobody else had a pattern to copy because nobody else had done it yet.

What I came to appreciate, in the course of writing MAWS documentation, is that the primitives force you to understand what you’re actually doing. When you launch an EC2 instance and it fails, the instance is gone. The state on its local disk is gone. If your application stored anything important on that disk, you lost it. This is obvious once you’ve been told, but in 2011 a lot of engineers had never been told: they came from on-premises environments where “the server crashed” meant someone would go restart it and the local state would still be there. The first thing the runbook had to do was re-educate everyone about what cloud instances actually were.

The prescriptive mapping approach

The documentation I wrote took the form of patterns. Each pattern was a common shape an application could have on the old infrastructure, paired with the AWS-native way to run the same application. For each pattern I described the mapping, called out the differences in behavior the team would need to handle, and flagged the mistakes that previous teams had made.

A few of the patterns:

Single-instance web application with local state. The mapping was: move the state to S3 or RDS, then put the stateless web tier behind ELB across multiple availability zones. The trap was that teams would move the compute to EC2 but leave the state on EBS on that single instance — which meant they had migrated to the cloud without gaining any of the availability benefits and had in fact lost some, because an EC2 instance was less durable than their old VM had been.

Shared filesystem for multi-instance coordination. In 2011, EFS didn’t exist. NFS on an EC2 instance was an option but an unreliable one. The pattern we pushed was: refactor the application to use S3 as the coordination point, with file writes as object puts and file reads as object gets. This was architectural surgery in some cases, but the alternative was migrating the problem into the cloud rather than out of it.
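A minimal sketch of what that refactor amounts to, written in Python purely for illustration; it is not code from the runbook. The `ObjectStore` interface, the `InMemoryStore` stand-in, and the `path_to_key` naming scheme are all hypothetical. A real backend would wrap an S3 client behind the same two methods.

```python
from typing import Dict, Protocol


class ObjectStore(Protocol):
    """Minimal put/get surface; a real backend would wrap an S3 client."""

    def put(self, key: str, data: bytes) -> None: ...

    def get(self, key: str) -> bytes: ...


class InMemoryStore:
    """Stand-in for S3 so the sketch runs without credentials."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        # Whole-object overwrite: there is no append or partial write.
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]


def path_to_key(path: str, prefix: str = "shared") -> str:
    """Map an old shared-filesystem path to an object key (hypothetical scheme)."""
    return prefix + "/" + path.strip("/")


def write_file(store: ObjectStore, path: str, data: bytes) -> None:
    """What used to be a file write becomes a single object put."""
    store.put(path_to_key(path), data)


def read_file(store: ObjectStore, path: str) -> bytes:
    """What used to be a file read becomes a single object get."""
    return store.get(path_to_key(path))
```

The surgery is in what the interface leaves out: no appends, no locks, no directory renames. An application that relied on any of those had to be restructured around whole-object writes.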

Database-backed internal tool. RDS for MySQL was the obvious target. The trap was that RDS at the time had limitations — no stored procedures in some engine versions, no access to the underlying filesystem, restrictions on replication topology — that applications had often grown dependent on without realizing it. The runbook included a pre-migration audit checklist that teams had to work through before they started the migration. Half the value was in catching these dependencies before they caused a failed cutover.
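Part of an audit like this can be mechanized. The sketch below, in Python for illustration, scans SQL statements for constructs that tend to signal a dependency on the database host. The pattern list is my own assumption about what such a check might look for, not the runbook's actual checklist, and `audit_sql` is a hypothetical helper.

```python
import re
from typing import Dict, List

# Illustrative patterns only: assumptions, not the runbook's checklist.
AUDIT_PATTERNS: Dict[str, str] = {
    "reads a file on the database host": r"\bLOAD\s+DATA\s+INFILE\b",
    "writes a file on the database host": r"\bINTO\s+(OUTFILE|DUMPFILE)\b",
    "defines a stored procedure": r"\bCREATE\s+PROCEDURE\b",
    "reconfigures replication": r"\bCHANGE\s+MASTER\s+TO\b",
}


def audit_sql(statements: List[str]) -> List[str]:
    """Return one finding per (statement, matched pattern) pair."""
    findings = []
    for number, statement in enumerate(statements, start=1):
        for reason, pattern in AUDIT_PATTERNS.items():
            if re.search(pattern, statement, flags=re.IGNORECASE):
                findings.append(f"statement {number}: {reason}")
    return findings
```

A regex pass like this only catches what appears in SQL text. Dependencies hidden in cron jobs, shell scripts, or operator habits still need the manual checklist.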

Batch processing job. The pattern was: EC2 spot instances for the compute, S3 for input and output, SQS for work queue. This was the cloud-native version of what most of the batch jobs had been doing on dedicated batch servers, and it was the migration pattern that delivered the cleanest wins — batch is embarrassingly parallel, state is minimal, and spot pricing made the economics dramatic.
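The reason this shape tolerates spot interruptions is the queue's visibility timeout: a message that a worker received but never deleted becomes claimable again. Here is a small Python sketch of that mechanic, using an in-memory `WorkQueue` as a stand-in for SQS. The class, the `run_worker` function, and the `fail_on` parameter are hypothetical names for illustration, and time is passed in explicitly so the behavior is deterministic.

```python
from typing import Callable, Collection, Dict, List, Optional


class WorkQueue:
    """In-memory stand-in for a queue with SQS-style visibility timeouts."""

    def __init__(self, visibility_timeout: float) -> None:
        self._timeout = visibility_timeout
        self._visible_at: Dict[str, float] = {}  # message -> time it is claimable

    def send(self, message: str) -> None:
        self._visible_at[message] = 0.0

    def receive(self, now: float) -> Optional[str]:
        for message, visible_at in self._visible_at.items():
            if visible_at <= now:
                # Hide the message; it reappears if the worker never deletes it.
                self._visible_at[message] = now + self._timeout
                return message
        return None

    def delete(self, message: str) -> None:
        self._visible_at.pop(message, None)


def run_worker(
    queue: WorkQueue,
    process: Callable[[str], None],
    now: float,
    fail_on: Collection[str] = (),
) -> List[str]:
    """Drain every message claimable at `now`; return the ones completed."""
    done = []
    while (message := queue.receive(now)) is not None:
        if message in fail_on:
            continue  # simulates a worker lost before it could delete
        process(message)  # should be idempotent: a message may be retried
        queue.delete(message)  # delete only after the output is written
        done.append(message)
    return done
```

Because delivery is at-least-once, `process` has to be safe to run twice on the same input. Writing each output under a key derived from the input key is one simple way to get that.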

Low-latency customer-facing service. This was the hardest pattern. Network latency inside AWS at the time was good but not uniformly good, and instances had noisy-neighbor issues in ways that weren’t well-documented. The runbook’s guidance here was essentially: test your latency distributions under load in the target region before you commit, pick instance families based on measured behavior rather than spec sheets, and expect to iterate on instance sizing after launch.
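"Latency distribution" here means looking at percentiles rather than averages, since a noisy neighbor shows up in the tail. A minimal Python sketch using the nearest-rank method follows; the function names and the sample numbers are made up for illustration, not measurements from the migration.

```python
from typing import Dict, Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in 1..100")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


def summarize(samples_ms: Sequence[float]) -> Dict[str, float]:
    """Summarize a load-test run by median, tail, and worst case."""
    return {
        "p50": percentile(samples_ms, 50),
        "p99": percentile(samples_ms, 99),
        "max": percentile(samples_ms, 100),
    }


# Made-up samples: nine ordinary requests and one stall.
samples = [12, 11, 13, 12, 11, 12, 14, 12, 11, 250]
summary = summarize(samples)
```

With these samples the median is 12 ms while the p99 is 250 ms. The mean, 35.8 ms, describes neither, which is why the guidance was to compare instance families on measured tails rather than on averages or spec sheets.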

What was actually hard

Three things, mostly.

Getting teams to accept that ephemeral instance disk was actually ephemeral. I cannot count the number of design reviews where I pointed out that the team was storing important data on the instance’s local storage and the team told me it would be fine because they had backups. Backups are not the same as durability. The number of teams who learned this the hard way, in the period before the runbook was widely adopted, was larger than it should have been.

AMI hygiene. Today AMIs are managed centrally, baked through well-understood pipelines, and updated routinely. In 2011, teams rolled their own AMIs with ad-hoc tooling and often had no clear policy on how long an AMI was good for or when it needed to be rebuilt. The runbook spent a surprising amount of space on what a reasonable AMI policy looked like: how to handle base image updates, how to handle security patches, how to make sure your instance-launch scripts still worked six months after the AMI was baked. This is infrastructure that later tools like Packer and EC2 Image Builder commoditized, but in 2011 it was bespoke.

The humans and the organizational dynamics. Some teams embraced the migration; some resisted it. The teams that resisted had reasons — legitimate ones in some cases, inertia in others. The runbook had to work for both audiences. I learned a lot about how to write documentation that was genuinely useful to the skeptic without being patronizing, and genuinely useful to the enthusiast without being dismissive of the real tradeoffs. The phrase “you might consider” got a lot of work.

Why the runbook survived

The runbook outlasted my time at Amazon and, as I understand it, continued to be used and adapted long after. I think there are two reasons.

The first is that it was prescriptive in the right places. Teams migrating under time pressure need decisions made for them on the parts of the problem that aren’t unique to their application. The runbook made those decisions. You didn’t have to reinvent your AMI baking strategy; you used the one the runbook described. You didn’t have to decide whether to use ELB or a homegrown load balancer; you used ELB with the configuration the runbook specified. The prescription reduced the surface area of decisions each team had to make.

The second is that it left the team’s actual application architecture alone wherever it could. The runbook didn’t tell teams to rewrite their applications to be more cloud-native beyond what the move itself forced, such as getting state off local disk; that wasn’t the contract. It told them how to run the applications they already had on AWS primitives. Later migrations could refactor. The first migration just needed to land.

This is, I think, the correct way to approach any infrastructure migration where the target platform is less familiar than the source. Don’t try to improve the application at the same time as you’re moving it. Move it, stabilize it on the new platform, then iterate. Teams that try to do both at once fail at both. The MAWS runbook was, in a quiet way, enforcing that discipline.

The lineage that still matters

The specific advice in the runbook is largely obsolete now. No one needs to be told in 2026 how to bake an AMI, what to do with ephemeral instance storage, or how to design around the limitations of RDS circa 2011. The tools have abstracted most of it away.

But the architectural judgment required in 2011 — the instinct for where the failure domains are, what state lives where, which dependencies survive what kinds of partitions — is the same judgment that matters today when you’re making decisions between managed services. The layer of abstraction has moved up, but what you’re deciding about hasn’t changed. I find that the engineers who built systems on the primitives in 2011, 2012, 2013 tend to make better decisions today than engineers who never had to — because they understand what the abstractions are abstracting over.

That’s the thing about being in the origination layer of an industry pattern. The patterns become commodity; the judgment that produced them becomes scarce.

