Changing the calculus of containers in the cloud

By Werner Vogels, CTO AWS

April 12, 2018

Just over two years ago, I wrote about what happens under the hood of Amazon ECS. Last year at re:Invent, we launched AWS Fargate, and today I want to explore how Fargate fundamentally changes the landscape of container technology.

I spend a lot of time talking to our customers and leaders at Amazon about innovation. One of the things I've noticed is that ideas and technologies which dramatically change the way we do things are rarely new. They're often the combination of an existing concept with an approach, technology, or capability in a particular way that's never been successfully tried before.

The rapid embrace of containers in the past four years is the result of blending old technology (containers) with a new toolchain and workflow (i.e., Docker), and the cloud. In our industry, four years is a long time, but I think we've only just started exploring how this combination of code packaging, well-designed workflows, and the cloud can reshape the ability of developers to quickly build applications and innovate.

Containers solve a fundamental code portability problem and enable new infrastructure patterns on the cloud. Having a consistent, immutable unit of deployment to work with lets you abstract away all the complexities of configuring your servers and deployment pipelines every time you change your code or want to run your app in a different place. But containers also put another layer between your code and where it runs. They are an important, but incremental, step on the journey of being able to write code and have it run in the right place, with the right scale, with the right connections to other bits of code, and the right security and access controls.

Solving these higher-order problems of deploying, scheduling, and connecting containers across environments gave us container management tools. Container orchestration has never seemed cloud native to me. Managing a large server cluster and optimizing the scheduling of containers, all backed by a complex distributed state store, runs counter to the premise of the cloud. Customers choose the cloud to pay as they go, to avoid guessing capacity, to get deep operational control without operational burden, to build loosely coupled services with a limited blast radius so failures stay contained, and to self-service everything they need to run their code.

You should be able to write your code and have it run, without having to worry about configuring complex management tools, open source or not. This is the vision behind AWS Fargate. With Fargate, you don't need to stand up a control plane, choose the right instance type, or configure all the other components of your application stack like networking, scaling, service discovery, load balancing, security groups, permissions, or secrets management. You simply build your container image, define how and where you want it to run, and pay for the resources you need. Fargate has native integrations to Amazon VPC, Auto Scaling, Elastic Load Balancing, IAM roles, and Secrets Management. We've taken the time to make Fargate production ready with a 99.99% uptime SLA and compliance with PCI, SOC, ISO, and HIPAA.

With AWS Fargate, you can provision resources to run your containers at a much finer grain than with an EC2 instance. You select exactly the CPU and memory your code needs, and the amount you pay scales exactly with the number of containers you run. You don't have to guess at capacity to handle spiky traffic, and you get the benefit of perfect scale, which lets you offload a ton of operational effort onto the cloud. MiB for MiB, this might make cloud native technologies like Fargate look more expensive than traditional VM infrastructure on paper. But if you look at the total cost of running an app, we believe most applications will be cheaper with Fargate because you pay only for what you need. Our customers running Fargate see big savings in the developer hours required to keep their apps running smoothly.
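To make that pricing model concrete, here is a quick back-of-the-envelope calculation in Python. The rates used are Fargate's launch pricing in us-east-1 ($0.0506 per vCPU-hour, $0.0127 per GB of memory per hour) and are for illustration only; check the current pricing page before relying on them.

    # Illustrative Fargate cost model using launch-era us-east-1 rates.
    # Rates are examples only; consult the AWS pricing page for current values.
    VCPU_PER_HOUR = 0.0506   # USD per vCPU-hour
    GB_PER_HOUR = 0.0127     # USD per GB of memory per hour

    def fargate_task_cost(vcpu: float, memory_gb: float, hours: float) -> float:
        """Cost of one task sized to exactly the CPU and memory it needs."""
        return (vcpu * VCPU_PER_HOUR + memory_gb * GB_PER_HOUR) * hours

    # A small 0.25 vCPU / 0.5 GB task running for a full day:
    print(f"${fargate_task_cost(0.25, 0.5, 24):.3f}")  # prints $0.456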

The entire ecosystem of container orchestration solutions arose out of necessity because there was no way to natively procure a container in the cloud. Whether you use Kubernetes, Mesos, Rancher, Nomad, ECS, or any other system no longer matters, because with Fargate there is nothing to orchestrate. The only thing you have to manage is the construction of the applications themselves. AWS Fargate finally makes containers cloud native.

I think the next area of innovation we will see, after moving away from thinking about underlying infrastructure, is application and service management. How do you interconnect the different containers that run independent services, ensure visibility, and manage traffic patterns and security for multiple services at scale? How do independent services discover one another? How do you define access to common data stores? How do you define and group services into applications? Cloud native is about having as much control as you want, and I am very excited to see how the container ecosystem will evolve over the next few years to give you more control with less work. We look forward to working with the community to innovate on the cloud native journey on behalf of our customers.

Getting Started

AWS Fargate already seamlessly integrates with Amazon ECS. You just define your application as you do for Amazon ECS: you package it into task definitions, specify the CPU and memory needed, define the networking and IAM policies that each container needs, and upload everything to Amazon ECS. After everything is set up, AWS Fargate launches and manages your containers for you.
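As a minimal sketch of what that looks like with the AWS SDK for Python (boto3), the following registers a Fargate-compatible task definition and runs it. The image URI, subnet, security group, and role ARN are placeholders you would replace with your own.

    import boto3

    ecs = boto3.client("ecs", region_name="us-east-1")

    # Register a task definition that is Fargate-compatible: awsvpc
    # networking and task-level CPU/memory are required for Fargate.
    ecs.register_task_definition(
        family="web-app",
        requiresCompatibilities=["FARGATE"],
        networkMode="awsvpc",
        cpu="256",      # 0.25 vCPU
        memory="512",   # 512 MiB
        executionRoleArn="arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
        containerDefinitions=[{
            "name": "web",
            "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/web:latest",
            "portMappings": [{"containerPort": 80, "protocol": "tcp"}],
            "essential": True,
        }],
    )

    # Launch the task on Fargate; there are no instances to choose or manage.
    ecs.run_task(
        cluster="default",
        taskDefinition="web-app",
        launchType="FARGATE",
        count=1,
        networkConfiguration={"awsvpcConfiguration": {
            "subnets": ["subnet-0123456789abcdef0"],
            "securityGroups": ["sg-0123456789abcdef0"],
            "assignPublicIp": "ENABLED",
        }},
    )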

AWS Fargate support for Amazon EKS, the Amazon Elastic Container Service for Kubernetes, will be available later in 2018.

Looking back at 10 years of compartmentalization at AWS


At AWS, we don't mark many anniversaries. But every year when March 14th comes around, it's a good reminder that Amazon S3 originally launched on Pi Day, March 14, 2006. The Amazon S3 team still celebrates with homemade pies!

March 26, 2008 doesn't have any delicious desserts associated with it, but that's the day we launched Availability Zones for Amazon EC2. A concept that changed infrastructure architecture, Availability Zones are now at the core of both AWS and customer reliability and operations.

Powering the virtual instances and other resources that make up the AWS Cloud are real physical data centers with AWS servers in them. Each data center is highly reliable and has redundant power, including UPS and generators. Even though the network design for each data center is massively redundant, interruptions can still occur.

Availability Zones draw a hard line around the scope and magnitude of those interruptions. No two zones are allowed to share low-level core dependencies, such as a power supply or a core network. Different zones can't even be in the same building, although a single zone is sometimes large enough to span several buildings.

We launched with three autonomous Availability Zones in our US East (N. Virginia) Region. By using zones, and failover mechanisms such as Elastic IP addresses and Elastic Load Balancing, you can provision your infrastructure with redundancy in mind. When two instances are in different zones, and one suffers from a low-level interruption, the other instance should be unaffected.
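For illustration, here is a small boto3 sketch that spreads instances across two zones so a low-level interruption in one leaves the other unaffected. The AMI ID and instance type are placeholders.

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Launch one instance in each of two Availability Zones.
    for zone in ["us-east-1a", "us-east-1b"]:
        ec2.run_instances(
            ImageId="ami-0123456789abcdef0",  # placeholder AMI
            InstanceType="t2.micro",
            MinCount=1,
            MaxCount=1,
            Placement={"AvailabilityZone": zone},
        )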

How Availability Zones have changed over the years

Availability Zones were originally designed for physical redundancy, but over time they have been reused for more and more purposes. Zones shape how we build, deploy, and operate software, as well as how we enforce security controls between our largest systems.

For example, many AWS services are now built so that as much functionality as possible can be autonomous within an Availability Zone. The calls used to launch and manage EC2 instances, fail over an RDS instance, or handle the health of instances behind a load balancer, all work within one zone.

This design has a double benefit. First, if an Availability Zone does lose power or connectivity, the remaining zones are unaffected. The second benefit is even more powerful: if there is an error in the software, the risk of that error affecting other zones is minimized.

We maximize this benefit when we deploy new versions of our software, or operational changes such as a configuration edit, as we often do so zone-by-zone, one zone in a Region at a time. Although we automate, and don't manage instances by hand, our developers and operators know not to build tools or procedures that could impact multiple Availability Zones. I'd wager that every new AWS engineer knows within their first week, if not their first day, that we never want to touch more than one zone at a time.
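In pseudocode terms, that discipline looks something like the sketch below. The deploy and health-check functions are hypothetical stand-ins for whatever automation a team actually uses; the point is the one-zone-at-a-time loop with a bake period between zones.

    import time

    ZONES = ["us-east-1a", "us-east-1b", "us-east-1c"]
    BAKE_SECONDS = 600  # time to let problems surface before moving on

    def deploy_to_zone(zone: str, version: str) -> None:
        """Hypothetical hook: push `version` to hosts in one zone only."""
        print(f"deploying {version} to {zone}")

    def zone_is_healthy(zone: str) -> bool:
        """Hypothetical hook: check the zone's alarms and metrics."""
        return True

    def rollout(version: str) -> None:
        for zone in ZONES:            # never touch two zones at once
            deploy_to_zone(zone, version)
            time.sleep(BAKE_SECONDS)  # bake: blast radius stays one zone
            if not zone_is_healthy(zone):
                raise RuntimeError(f"halting rollout: {zone} is unhealthy")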

Availability Zones run deep in our AWS development and operations culture, at every level. AWS customers can think of zones in terms of redundancy, "Use two or more Availability Zones for reliability." At AWS, we think of zones in terms of isolation, "Stay within the Availability Zone, as much as possible."

Silo your traffic or not – you choose

When your architecture does stay within an Availability Zone as much as possible, there are further benefits. One is that latency within a zone is incredibly low: today, packets between EC2 instances in the same zone take just tens of microseconds to reach each other.

Another benefit is that redundant zonal architectures make it easier to recover from complex issues and emergent behaviors. If all of the calls between the various layers of a service stay within one Availability Zone, then when issues occur, they can be remediated quickly by removing the entire zone from service, without needing to identify the layer or component that was the trigger.

Many of you also use this kind of "silo" pattern in your own architecture, where Amazon Route 53 or Elastic Load Balancing can be used to choose an Availability Zone to handle a request, but can also be used to keep subsequent internal requests and dependencies within that same zone. This is only meaningful because of the strong boundaries and separation between zones at the AWS level.
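One concrete way to keep requests siloed, sketched below with boto3, is to disable cross-zone load balancing on a Network Load Balancer so that each load balancer node forwards only to targets in its own zone. The load balancer ARN is a placeholder, and this is one option among several (zonal Route 53 records are another).

    import boto3

    elbv2 = boto3.client("elbv2", region_name="us-east-1")

    # With cross-zone load balancing disabled, each NLB node forwards
    # only to targets in its own Availability Zone, so a request's
    # downstream calls stay siloed within that zone.
    elbv2.modify_load_balancer_attributes(
        LoadBalancerArn="arn:aws:elasticloadbalancing:us-east-1:"
                        "123456789012:loadbalancer/net/my-nlb/0123456789abcdef",
        Attributes=[{"Key": "load_balancing.cross_zone.enabled",
                     "Value": "false"}],
    )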

Regional isolation

Not too long after we launched Availability Zones, we also launched our second Region, EU (Ireland). Early in the design, we considered operating a seamless global network, with open connectivity between instances in each Region. Services such as S3 would have behaved as "one big S3," with keys and data accessible and mutable from either location.

The more we thought through this design, the more we realized that there would be risks of issues and errors spreading between Regions, potentially resulting in large-scale interruptions that would defeat our most important goals:

  • To provide the highest levels of availability
  • To allow Regions to act as standby sites for each other
  • To provide geographic diversity and lower latencies to end users

Our experience with the benefits of Availability Zones meant that instead we doubled down on compartmentalization, and decided to isolate Regions from each other with our hardest boundaries. Since then, and still today, our services operate autonomously in each Region, full stacks of S3, DynamoDB, Amazon RDS, and everything else.

Many of you still want to be able to run workloads and access data globally. For our edge services such as Amazon CloudFront, Amazon Route 53, and AWS Lambda@Edge, we operate over 100 points of presence. Each is its own Availability Zone with its own compartmentalization.

As we develop and ship our services that span Regions, such as S3 cross-region object replication, Amazon DynamoDB global tables, and Amazon VPC inter-region peering, we take enormous care to ensure that the dependencies and calling patterns between Regions are asynchronous and ring-fenced with high-level safety mechanisms that prevent errors from spreading.

Doubling down on compartmentalization, again

With the phenomenal growth of AWS, it's humbling to see how many customers are served by even our smallest Availability Zones. For some time now, many of our services have been operating service stacks that are compartmentalized even within zones.

For example, AWS HyperPlane—the internal service that powers NAT gateways, Network Load Balancers, and AWS PrivateLink—is internally subdivided into cells that each handle a distinct set of customers. If there are any issues with a cell, the impact is limited not just to an Availability Zone, but to a subset of customers within that zone. Of course, all sorts of automation immediately kick in to mitigate any impact to even that subset.
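HyperPlane's internals aren't public, but the core idea of cell-based partitioning can be sketched in a few lines: hash each customer to a fixed cell, so that any one cell's failure touches only the slice of customers mapped to it. The cell count here is arbitrary, chosen purely for illustration.

    import hashlib

    NUM_CELLS = 8  # arbitrary cell count for illustration

    def cell_for_customer(customer_id: str) -> int:
        """Stable hash of customer -> cell, so a cell failure affects
        only the customers assigned to that cell."""
        digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % NUM_CELLS

    # Every call for this customer is routed to the same cell:
    print(cell_for_customer("customer-42"))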

Ten years after launching Availability Zones, we're excited that we're still relentless about reducing the impact of potential issues. We firmly believe it's one of the most important strategies for achieving our top goals of security and availability. We now have 54 Availability Zones, across 18 geographic Regions, and we've announced plans for 12 more. Beyond that geographic growth, we'll be extending the concept of compartmentalization that underlies Availability Zones deeper and deeper, to be more effective than ever.
