AWS Cloud Best Practices

Difference between Traditional and Cloud Computing Environments

  • IT Assets as Provisioned Resources
    • Quick deployment time
    • No hardware commitment
  • Global, Available, and Scalable Capacity
  • Higher-Level Managed Services
    Beyond EC2 instances, you have access to a broad set of managed storage, database, analytics, application, and deployment services that are provisioned and maintained with minimal effort on your part.
  • Built-In Security
  • Architecting for Cost
    • AWS provides fine-grained billing, which enables you to track the costs associated with all aspects of your solutions
  • Operations on AWS
    • Infrastructure as Code
    • Higher levels of automation of operational processes through supporting services, e.g. AWS Auto Scaling and self-healing architectures
    • Full automation through DevOps processes for delivery pipeline and management
  • Disposable Resources Instead of Fixed Servers

Design Principles

  • Scalability
  • Disposable resources instead of fixed servers
  • Automation
  • Loose coupling
  • Services, not servers
  • Databases
  • Managing increasing volumes of data
  • Removing single points of failure
  • Optimize for cost
  • Caching
  • Security


Scalability

  • Scaling Vertically
    Increase in the specifications of an individual resource, such as upgrading a server with a larger hard drive or a faster CPU

  • Scaling Horizontally
    Increase in the number of resources, such as adding more hard drives to a storage array or adding more servers to support an application

  • Stateless Applications

    • An application that does not need knowledge of previous interactions and does not store session information
    • Given the same input, provides the same response to any end user
    • Can scale horizontally because any available compute resources can service any requests
  • Distributing the load

    • Push model
      • Elastic Load Balancing (Application Load Balancer, Network Load Balancer), Amazon Route 53 DNS-based load balancing
    • Pull model
      • for asynchronous, event-driven workloads
      • SQS, Amazon Kinesis
      • compute resources pull and consume messages, processing them in a distributed fashion
  • Stateful Components

    • Can be scaled with session affinity
    • Session Affinity
      • Bind all the transactions of a session to a specific compute resource
      • Existing sessions do not directly benefit from the introduction of newly launched compute nodes
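
The pull model above can be sketched with nothing but the Python standard library. This is a minimal illustration, not an SQS or Kinesis client: the in-process `queue.Queue` stands in for the managed queue, and the names `run_workers` and `worker` are hypothetical.

```python
import queue
import threading


def run_workers(messages, worker_count=3):
    """Let a pool of workers pull messages from a shared queue."""
    work = queue.Queue()  # stand-in for a managed queue or stream (assumption)
    for message in messages:
        work.put(message)

    results = []
    results_lock = threading.Lock()

    def worker(name):
        while True:
            try:
                # Each worker pulls the next message whenever it is free.
                message = work.get_nowait()
            except queue.Empty:
                return  # queue drained, worker exits
            processed = (name, message.upper())
            with results_lock:
                results.append(processed)

    threads = [
        threading.Thread(target=worker, args=(f"worker-{i}",))
        for i in range(worker_count)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


results = run_workers(["order-1", "order-2", "order-3", "order-4", "order-5"])
print(sorted(message for _, message in results))
```

Because no worker owns a particular message, adding workers raises throughput without any routing change, which is the property that makes stateless consumers easy to scale horizontally.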

Instantiating Compute Resources

  • Bootstrapping
    • You can set up new EC2 instances with user data scripts and cloud-init directives
    • You can use simple scripts and configuration management tools such as Chef or Puppet
  • Golden Images
    • Can be used to launch EC2 instances, Amazon RDS DB instances, and Amazon Elastic Block Store (Amazon EBS) volumes
    • Results in faster start times and removes dependencies on configuration services or third-party repositories
    • Important in auto-scaled environments to quickly and reliably launch additional resources as a response to demand changes.
  • Containers
    • Docker—an open-source technology that allows you to build and deploy distributed applications inside software containers.
    • Launching from Docker image
    • Amazon Elastic Container Service (Amazon ECS) and AWS Fargate
    • Alternative container environment: Kubernetes and Amazon Elastic Kubernetes Service (Amazon EKS)
  • Hybrid
    • Some parts are in a golden image, while others are configured dynamically through a bootstrapping action.

Infrastructure as Code

  • AWS CloudFormation templates give you an easy way to create and manage a collection of related AWS resources, and to provision and update them in an orderly and predictable fashion
  • CloudFormation templates can live with your application in your version control repository
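
As a rough sketch of what such a template can look like, the snippet below builds a minimal CloudFormation template as a Python dictionary and serializes it to JSON. It combines a golden image (passed in as a parameter) with a bootstrapping script in user data. The parameter name `GoldenImageId`, the resource name `WebServer`, the instance type, and the script contents are all illustrative assumptions.

```python
import json

# Hypothetical bootstrapping script, run by the instance on first boot.
USER_DATA = "#!/bin/bash\nyum install -y httpd\nsystemctl enable --now httpd\n"


def build_template():
    """Return a minimal CloudFormation template as a plain dictionary."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "One web server launched from a golden image",
        "Parameters": {
            # The AMI is supplied at deploy time rather than hard-coded.
            "GoldenImageId": {"Type": "AWS::EC2::Image::Id"},
        },
        "Resources": {
            "WebServer": {
                "Type": "AWS::EC2::Instance",
                "Properties": {
                    "ImageId": {"Ref": "GoldenImageId"},
                    "InstanceType": "t3.micro",
                    # Dynamic configuration layered on top of the image.
                    "UserData": {"Fn::Base64": USER_DATA},
                },
            }
        },
    }


template_json = json.dumps(build_template(), indent=2, sort_keys=True)
print(template_json)
```

The resulting JSON file can be committed next to the application code, so infrastructure changes go through the same review and history as code changes.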

Automation, Infrastructure Management, and Deployment

  • Serverless Management and Deployment
    • AWS CodePipeline, AWS CodeBuild, and AWS CodeDeploy support automating the deployment pipeline for serverless applications
  • AWS Elastic Beanstalk:
    • You can use this service to deploy and scale web applications and services developed with Java, .NET, PHP, Node.js, Python, Ruby, Go, and Docker on familiar servers such as Apache, Nginx, Passenger, and IIS.
    • Developers can simply upload their application code, and the service automatically handles all the details, such as resource provisioning, load balancing, auto scaling, and monitoring.
  • Amazon EC2 Recovery
    • Create a CloudWatch alarm that monitors an EC2 instance and automatically recovers it if it becomes impaired
  • AWS Systems Manager
    • You can automatically collect software inventory, apply OS patches, create a system image to configure Windows and Linux operating systems, and execute arbitrary commands.
  • Auto Scaling
    • You can maintain application availability and scale your Amazon EC2, Amazon DynamoDB, Amazon ECS, and Amazon Elastic Kubernetes Service (Amazon EKS) capacity up or down automatically according to the conditions you define
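
To make "scale up or down according to the conditions you define" concrete, here is a simplified proportional scaling rule. It is an illustrative assumption, not the algorithm AWS Auto Scaling actually uses; the function name `desired_capacity` and its parameters are hypothetical.

```python
import math


def desired_capacity(current, metric_value, target_value, minimum, maximum):
    """Size the fleet so the per-instance metric moves toward the target.

    Simplified rule (assumption): capacity scales in proportion to how far
    the observed metric is from the target, clamped to [minimum, maximum].
    """
    if current < 1 or target_value <= 0:
        raise ValueError("current must be >= 1 and target_value must be > 0")
    proposed = math.ceil(current * metric_value / target_value)
    return max(minimum, min(maximum, proposed))


# Four instances at 90% average CPU with a 60% target: scale out.
scale_out = desired_capacity(4, 90.0, 60.0, minimum=2, maximum=10)
# Four instances at 20% average CPU: scale in, but not below the minimum.
scale_in = desired_capacity(4, 20.0, 60.0, minimum=2, maximum=10)
print(scale_out, scale_in)
```

The minimum and maximum bounds matter as much as the rule itself: the minimum protects availability during quiet periods and the maximum caps cost when a metric spikes.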

Alarms and Events

  • Amazon CloudWatch alarms
  • Amazon CloudWatch Events
  • AWS Lambda scheduled events
  • AWS WAF security automation

Services, Not Servers

  • Loose Coupling
  • Well-Defined Interfaces
    • Components interact with each other only through specific, technology-agnostic interfaces, such as RESTful APIs
    • Can modify the underlying implementation without affecting other components
  • Amazon API Gateway
    • Fully managed service that makes it easy for developers to create, publish, maintain, monitor, and secure APIs at any scale
  • Service Discovery
    • Because each service can be running across multiple compute resources, there needs to be a way for each service to be addressed
    • For an EC2-hosted service, a simple way to achieve service discovery is through Elastic Load Balancing (ELB).
    • Because each load balancer gets its own hostname, you can consume a service through a stable endpoint.
  • Asynchronous Integration
    • Another form of loose coupling between services
    • One component generates events and another consumes them
    • SQS, Amazon Kinesis, cascading Lambda events, AWS Step Functions, or Amazon Simple Workflow Service
    • Decouples components and introduces additional resiliency
  • Managed Services
    • Provide building blocks that developers can consume to power their applications
  • Serverless Architectures
    • Can reduce the operational complexity of running applications. It is possible to build both event-driven and synchronous services for mobile, web, analytics, CDN business logic, and IoT without managing any server infrastructure. These architectures can reduce costs because you don’t have to manage or pay for underutilized servers, or provision redundant infrastructure to implement high availability.


Databases

  • Anti-Patterns
    • If your application primarily indexes and queries data with no need for joins or complex transactions—especially if you expect a write throughput beyond the constraints of a single instance—consider a NoSQL database instead
    • If your schema cannot be denormalized and the application requires joins or complex transactions, consider a relational database (RDBMS) instead
    • Large binary files should be stored in Amazon S3 with metadata in the database.
  • Databases
    • managed database services that offer enterprise performance at an open-source cost
    • AWS offers different database technologies based on your workload
    • Scalability
      • Relational databases (RDBMS) can scale up by upgrading to a larger instance and can scale read capacity horizontally by adding more read replicas
      • Write capacity can be scaled horizontally by data partitioning or sharding: data is split across multiple database schemas, each running on its own autonomous primary DB instance. RDS removes the operational overhead of running those instances, but sharding still introduces complexity in your application
  • Data Warehouse
    • combines transactional data from disparate sources to make them available for analysis and decision making
    • Amazon Redshift is a managed data warehouse service providing a scalable, highly available and cost-effective solution.
  • Search
    • Searching enables datasets to be queried that are not precisely structured. AWS supports search services:
      • Amazon Elasticsearch Service (Amazon ES)
      • Amazon CloudSearch
  • Graph Databases
    • Uses graph structures for queries
    • A graph consists of edges (relationships) that directly relate nodes (data entities) in the store.
    • Because relationships are stored directly, hierarchical structures that are complex to query in relational systems can be retrieved quickly.
  • Managing Increasing Volumes of Data
    • Data lake architecture
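
The application-side complexity of sharding mentioned under Scalability above mostly comes down to routing: every read and write must first find the right primary. A minimal sketch, assuming hash-based partitioning on a customer key (the shard names and the function `shard_for` are hypothetical):

```python
import hashlib

# Hypothetical identifiers for four independent primary DB instances.
SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]


def shard_for(key, shards=SHARDS):
    """Map a partition key to a shard with a stable hash.

    hashlib is used instead of the built-in hash(), which is randomized
    per process for strings and would route the same key differently
    after a restart.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]


for customer in ("customer-1", "customer-2", "customer-3"):
    print(customer, "->", shard_for(customer))
```

Note that with plain modulo routing, changing the number of shards remaps most keys, so resharding requires moving data. That is part of the complexity the notes above refer to.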

Removing Single Points of Failure

  • Introducing redundancy
    • Standby Redundancy
      When a resource fails, functionality is recovered on a secondary resource with the failover process. During the failover time, the resource remains unavailable.
    • Active Redundancy
      Requests are distributed to multiple redundant compute resources. When one of them fails, the rest can simply absorb a larger share of the workload.
  • Detect Failure
    • You should aim for automatic failure detection and reaction. ELB, Route 53 health checks, Auto Scaling groups, and similar mechanisms help you recover from failure automatically.
    • Design Good Health Checks
      Configuring the right health checks for your application helps determine your ability to respond correctly and promptly to a variety of failure scenarios. The health checks should reliably assess the health of the back-end nodes. A simple TCP check only confirms that a port accepts connections; it does not show whether the web server can actually serve requests.
  • Durable Data Storage
    • Synchronous Replication
      The transaction is acknowledged only after being durably stored in both the primary location and its replicas. This will protect the integrity of data in the event of failure. In this case, the primary node is coupled with the replicas.
    • Asynchronous Replication
      Decouples the primary node from its replicas at the cost of replication lag. It is used to horizontally scale the system’s read capacity for queries that can tolerate that lag.
    • Quorum-based replication
      Combines synchronous and asynchronous replication to overcome the challenges of large-scale distributed database systems. Replication to multiple nodes can be managed by defining the minimum number of nodes that must participate in a successful write operation.
    • Examples:
      • Redis in Amazon ElastiCache uses asynchronous replication: recent transactions can be lost in the event of a failover
      • RDS with Multi-AZ provides synchronous replication to keep data on the standby node up-to-date with the primary.
  • Automated Multi-Data Center Resilience
    • Disaster Recovery Plan
      (Could consider failover to a distant second data center in the event of major disruption)
      • Low probability but huge impact risk
    • Availability Zones (AZs) provide a solution for short disruptions
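
The quorum-based replication described under Durable Data Storage above can be illustrated with a toy model. This is a sketch under simplifying assumptions: replicas are in-memory objects, the `Replica` class and `quorum_write` function are hypothetical, and there is no rollback or repair of partially applied writes as a real system would need.

```python
class Replica:
    """In-memory stand-in for one storage node (assumption)."""

    def __init__(self, name, available=True):
        self.name = name
        self.available = available
        self.data = {}

    def write(self, key, value):
        if not self.available:
            return False  # node is down, no acknowledgement
        self.data[key] = value
        return True


def quorum_write(replicas, key, value, write_quorum):
    """Report success only if at least write_quorum replicas acknowledge."""
    acks = sum(1 for replica in replicas if replica.write(key, value))
    return acks >= write_quorum


replicas = [Replica("az-a"), Replica("az-b"), Replica("az-c", available=False)]

# Two of three nodes are enough, so one failed node does not block writes.
ok_with_two = quorum_write(replicas, "order-1", "paid", write_quorum=2)
# Requiring all three behaves like fully synchronous replication.
ok_with_three = quorum_write(replicas, "order-2", "paid", write_quorum=3)
print(ok_with_two, ok_with_three)
```

Lowering the quorum trades durability guarantees for availability and latency; raising it moves the system toward the synchronous case, where any unavailable replica blocks writes.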

Fault Isolation and Traditional Horizontal Scaling

  • The measures above are insufficient if there is something harmful about the requests themselves.
  • The same requests that caused the primary instances to fail could be replayed against the failover instances and make them fail as well
  • Shuffle Sharding
    • Fault-isolating improvement
    • Instances are grouped into shards
    • Each customer is assigned to a specific shard
    • The impact of a failure is reduced in direct proportion to the number of shards
    • The client could try every endpoint in a set of sharded resources, until one succeeds, making the client fault tolerant
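
A small sketch of shuffle sharding, assuming eight instances and virtual shards of two instances each (both numbers, and the names `shard_for_customer` and `call_with_retry`, are illustrative choices rather than anything prescribed by the notes above):

```python
import hashlib
from itertools import combinations

INSTANCES = [f"instance-{i}" for i in range(8)]
# Every distinct pair of instances is one virtual shard: C(8, 2) = 28.
VIRTUAL_SHARDS = list(combinations(INSTANCES, 2))


def shard_for_customer(customer_id):
    """Deterministically assign a customer to one virtual shard."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(VIRTUAL_SHARDS)
    return VIRTUAL_SHARDS[index]


def call_with_retry(customer_id, healthy):
    """Try each endpoint in the customer's shard until one is healthy."""
    for instance in shard_for_customer(customer_id):
        if instance in healthy:
            return instance
    return None  # every endpoint in this customer's shard is down


shard = shard_for_customer("customer-a")
healthy = set(INSTANCES) - {shard[0]}  # first instance in the shard failed
print(shard, "->", call_with_retry("customer-a", healthy))
```

If one customer's requests take down both instances in its virtual shard, only customers assigned to that exact pair lose service entirely. Customers that share just one of the two instances still have a healthy endpoint to retry against.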

Optimize for Cost

  • Right Sizing
    • Benchmarking may help in understanding the instance type and number of instances you require
  • Cost Optimization is an iterative process
    • Application and its usage will evolve over time
  • Elasticity
    • Autoscaling can help optimize the cost
    • Automate turning off non-production workloads when not in use
    • Replace Amazon EC2 workloads with managed services where possible
  • Take advantage of the variety of purchasing options
    • Reserved Instances
    • Spot Instances


Caching

  • Application Data Caching
    • Amazon ElastiCache
    • Amazon DynamoDB Accelerator (DAX)
      Fully managed, highly available, in-memory cache for DynamoDB that reduces response times from milliseconds to microseconds, even at high throughput
  • Edge Caching
    • Static content cached at Amazon CloudFront edge locations
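
Application data caching is commonly implemented with a cache-aside pattern: check the cache first, and only fall back to the backing store on a miss. The sketch below uses an in-process dictionary with a time-to-live as a stand-in for a service such as ElastiCache. The `TTLCache` class and `load_from_database` function are hypothetical.

```python
import time


class TTLCache:
    """Tiny cache-aside helper with per-entry expiry (illustrative only)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock      # injectable so expiry can be tested
        self._entries = {}      # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]     # cache hit: backing store is not touched
        value = loader(key)     # cache miss or expired: load and store
        self._entries[key] = (now + self.ttl, value)
        return value


calls = []


def load_from_database(key):
    """Stand-in for a slow query; records how often it is invoked."""
    calls.append(key)
    return f"value-for-{key}"


cache = TTLCache(ttl_seconds=60)
first = cache.get_or_load("user:1", load_from_database)
second = cache.get_or_load("user:1", load_from_database)
print(first, second, len(calls))
```

The time-to-live bounds how stale a cached value can get. Choosing it is a trade-off between load on the backing store and freshness of the data.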


Security

  • AWS WAF (Web Application Firewall)
  • IAM
    • Granular set of policies for access control of users
    • IAM roles can be assigned to EC2 instances to grant them access to AWS resources
  • Data Encryption (in transit / at rest)
  • AWS is responsible for the security of underlying cloud infrastructure
  • You are responsible for securing the workloads you deploy to AWS
  • Amazon Cognito
    • Allows client devices to access AWS resources through temporary tokens with fine-grained permissions
  • Security as Code
    • AWS CloudFormation
  • Real-Time Auditing
    • AWS Config
    • Amazon Inspector
    • AWS Trusted Advisor
    • AWS CloudTrail
    • Amazon CloudWatch Logs
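
As an illustration of the granular IAM policies mentioned above, the snippet below generates a least-privilege policy document that allows read-only access to a single S3 bucket. The bucket name and the function `read_only_bucket_policy` are hypothetical; the document structure follows the standard IAM JSON policy format.

```python
import json


def read_only_bucket_policy(bucket_name):
    """Build an IAM policy document granting read-only access to one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself.
                "Sid": "ListBucket",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket_name}",
            },
            {
                # Reading applies to the objects inside the bucket.
                "Sid": "ReadObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            },
        ],
    }


policy_json = json.dumps(read_only_bucket_policy("example-app-assets"), indent=2)
print(policy_json)
```

Keeping policies as generated or versioned documents like this fits the Security as Code idea: permissions can be reviewed, diffed, and deployed the same way as the rest of the infrastructure.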