Business Continuity

Concepts
Disaster Recovery Architectures
Storage Options
Compute Options
HA Approaches for Databases
Network Options
Failure Mode and Effects Analysis (FMEA)

Concepts

Business Continuity (BC)
- Seeks to minimize business activity disruption when something unexpected happens
Disaster Recovery (DR)
- Act of responding to these events that threaten business continuity
High Availability
- Designing in redundancies to reduce the chance of impacting service levels
Fault Tolerance
- Designing in the ability to tolerate faults
Service Level Agreement (SLA)
- An agreed goal or target for a given service on its performance or availability
Recovery Time Objective (RTO)
- Target time within which business processes must be restored to their agreed service levels after a disruption
Recovery Point Objective (RPO)
- Acceptable amount of data loss measured in time.

The Business Continuity Plan specifies the RTO and RPO. These two metrics then determine how much to invest in High Availability, and they also define the Disaster Recovery process.
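As a rough illustration of how RTO and RPO constrain a design, the sketch below checks a candidate recovery setup against the two targets. The class, the function name, and the example figures are hypothetical; the only modelling assumption is that worst-case data loss equals one full backup interval.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    rto: timedelta  # maximum tolerated time to restore service
    rpo: timedelta  # maximum tolerated window of lost data


def meets_objectives(objectives, restore_time, backup_interval):
    """Return (rto_ok, rpo_ok) for a candidate recovery design.

    Assumption: worst-case data loss is one full backup interval,
    because a failure can occur just before the next backup runs.
    """
    rto_ok = restore_time <= objectives.rto
    rpo_ok = backup_interval <= objectives.rpo
    return rto_ok, rpo_ok


# Hypothetical targets: restore within 4 hours, lose at most 1 hour of data.
targets = RecoveryObjectives(rto=timedelta(hours=4), rpo=timedelta(hours=1))

# Nightly backups with a 2-hour restore meet the RTO but miss the RPO.
print(meets_objectives(targets, timedelta(hours=2), timedelta(hours=24)))
```

A design that misses the RPO, as in this example, needs more frequent backups or continuous replication, which is where the extra investment comes in.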

Types of Disaster:

Hardware Failure
Deployment Failure
Load Induced (e.g. DDoS attacks)
Data Induced
Credential Expiration
Dependency
Infrastructure
Identifier Exhaustion

Disaster Recovery Architectures

Backup and Restore

Minimal entry point into AWS
Minimal effort to configure
Least flexible option: AWS acts only as off-site backup storage

Pilot Light

Minimal environment on standby on AWS for failover
Switching to AWS requires manual intervention
It may take several minutes or hours to spin up the environment
AMIs should be kept up to date with their on-prem counterparts

Warm Standby

Services are already up and running
Could be considered a shadow environment or production staging
Resources could scale up to meet the incoming demand
Process can be automated

Multi-Site

Ready at all times to take full production load
Fails over in seconds or less
Little or no intervention required to fail over
Most expensive DR option: the duplicate capacity can be seen as wasteful
Failover can be automated through Route 53 health checks
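The Route 53 health-check failover mentioned above can be sketched as a pair of failover record sets. The helper below only builds the request body as a plain dict; the domain name, IP addresses, and health check ID are placeholders, and the function names are hypothetical.

```python
def failover_record(name, ip, role, set_id, health_check_id=None, ttl=60):
    """Build one Route 53 failover record set as a plain dict."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": set_id,
        "Failover": role,
        "TTL": ttl,
        "ResourceRecords": [{"Value": ip}],
    }
    # Route 53 stops answering with the primary when its health check fails.
    if health_check_id is not None:
        record["HealthCheckId"] = health_check_id
    return record


def failover_change_batch(name, primary_ip, secondary_ip, primary_health_check_id):
    """Build a change batch with a health-checked primary and a secondary."""
    records = [
        failover_record(name, primary_ip, "PRIMARY", "primary",
                        primary_health_check_id),
        failover_record(name, secondary_ip, "SECONDARY", "secondary"),
    ]
    return {
        "Changes": [
            {"Action": "UPSERT", "ResourceRecordSet": record}
            for record in records
        ]
    }


# Placeholder values for illustration only.
batch = failover_change_batch(
    "app.example.com.", "192.0.2.10", "198.51.100.10", "hc-placeholder"
)
print(len(batch["Changes"]))
```

With boto3, a dict of this shape would typically be passed as the `ChangeBatch` argument of `change_resource_record_sets` on a Route 53 client, together with the hosted zone ID.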

Storage Options

Amazon EBS

Annual Failure rate less than 0.2%
Availability target of 99.999% (replicated within a single AZ)
Vulnerable to AZ failure
Easy to snapshot; snapshots are stored in S3 and are multi-AZ durable
You can copy snapshots to other regions
Supports RAID configurations
- Because EBS operates over the network, RAID levels above RAID1 are not recommended: the parity writes of RAID5 and RAID6 consume part of the IOPS available to the volumes

EBS RAID Configurations

RAID0 (Striping)
- No Redundancy
- Highest Speed Reads and Writes
- Highest Capacity
RAID1 (Mirroring)
- 1 drive can safely fail
- Slight decrease in reads and writes
- Capacity reduced by 50% due to mirroring
RAID5
- Redundancy: 1 drive can safely fail
- Minimum of 3 drives: the equivalent of 2 drives stores the data
- The equivalent of 1 drive stores parity (spread across all drives), which is used to recreate lost data
- Good reads (similar to RAID0), but slow writes
- Capacity of (n-1)/n
RAID6
- 2 drives can safely fail
- Minimum of 4 drives needed (2 drives' worth of parity)
- Same reads as RAID0, but the slowest writes
- Capacity of (n-2)/n
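The capacity figures above can be worked through with a small calculator. This is a minimal sketch: the function name is hypothetical, and RAID1 is modelled as every drive holding a full copy, which gives 50% for the usual two-drive mirror.

```python
# Minimum number of drives for each RAID level listed above.
MIN_DRIVES = {0: 2, 1: 2, 5: 3, 6: 4}


def usable_fraction(level, drives):
    """Return the fraction of raw capacity available for data."""
    if level not in MIN_DRIVES:
        raise ValueError(f"unsupported RAID level: {level}")
    if drives < MIN_DRIVES[level]:
        raise ValueError(f"RAID{level} needs at least {MIN_DRIVES[level]} drives")
    if level == 0:
        return 1.0                       # striping, no redundancy
    if level == 1:
        return 1 / drives                # every drive holds a full copy
    if level == 5:
        return (drives - 1) / drives     # one drive's worth of parity
    return (drives - 2) / drives         # RAID6: two drives' worth of parity


for level, drives in [(0, 2), (1, 2), (5, 3), (6, 4)]:
    print(f"RAID{level} with {drives} drives: {usable_fraction(level, drives):.0%}")
```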

S3 Storage

Standard: 99.99% availability = 52 minutes/year
Standard Infrequent Access: 99.9% = 9 hours/year
One-zone Infrequent Access: 99.5% availability = 2 days/year
Eleven 9s of durability: 99.999999999%
Standard and Standard-IA have multi-AZ durability; One Zone-IA has only single-AZ durability
S3 is a backing service for many AWS services
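The downtime figures quoted for each storage class follow directly from the availability percentage. A minimal sketch of the arithmetic, assuming a 365-day year (the function name is hypothetical):

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_hours_per_year(availability_percent):
    """Convert an availability percentage into allowed downtime per year."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


for label, availability in [
    ("Standard", 99.99),
    ("Standard-IA", 99.9),
    ("One Zone-IA", 99.5),
]:
    hours = downtime_hours_per_year(availability)
    print(f"{label}: {hours:.2f} hours/year ({hours * 60:.0f} minutes)")
```

This gives roughly 0.88 hours (about 52 minutes), 8.76 hours, and 43.8 hours (just under 2 days) respectively, matching the rounded values in the list.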

Amazon EFS

Implementation of the NFS file system
True file system, as opposed to block storage (EBS) or object storage (S3)
File locking, strong consistency, concurrently accessible
Each file object and metadata is stored across multiple AZs
Can be accessed from all AZs concurrently
Mount targets are highly available

Other Options

Amazon Storage Gateway
Snowball
Glacier

Compute Options

Up-to-Date AMIs are critical for rapid fail-over
AMIs can be copied to other regions for safety or DR staging
Horizontally scalable architectures are preferred
Reserving capacity (zonal Reserved Instances or On-Demand Capacity Reservations) is the only way to guarantee that resources will be available when needed
Auto Scaling and Elastic Load Balancing work together to provide automatic recovery by maintaining minimum instances
Route 53 Health Checks also provide “self-healing” redirection of traffic

HA Approaches for Databases

If possible, choose DynamoDB (DDB) over RDS because of its inherent fault tolerance
Otherwise, choose Aurora because of its redundancy and automatic recovery features
If Aurora can’t be used, choose multi-AZ RDS
Frequent RDS snapshots can protect against data corruption or failure, and they won’t impact the performance of a multi-AZ deployment
Regional replication is also an option, but will not be strongly consistent
If hosting database on EC2, you have to develop your own HA plan

Redshift

Redshift originally did not support multi-AZ deployments; Multi-AZ is now available, but only for RA3 clusters
Otherwise, the best HA option is a multi-node cluster, which supports replication and node recovery
A single-node Redshift cluster does not support data replication; after a failure you have to restore from a snapshot stored in S3

Memcached

Does not support replication, so a node failure always loses the data cached on that node
Use multiple nodes in the cluster to minimize data loss on node failure
Launch nodes across the available AZs to minimize data loss on AZ failure

Redis

Use multiple nodes in each shard and distribute the nodes across multiple AZs
Enable multi-AZ on the replication group to permit automatic failover if the primary node fails
Schedule regular backups of your Redis cluster

Network Options

Subnets should be created in different AZs, resources should be allocated in multiple AZs
Create at least 2 VPN tunnels to your Virtual Private Gateway
Direct Connect is not HA by default; establish a secondary connection via another Direct Connect (ideally with another provider) or use a VPN as a backup
Route 53’s health checks provide a basic level of failover by redirecting DNS resolution away from unhealthy endpoints
Elastic IPs give you the flexibility to change backing assets without impacting name resolution
For multi-AZ redundancy of NAT Gateways, create a gateway in each AZ and route each private subnet to the gateway in its own AZ
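The per-AZ NAT Gateway layout can be sketched as a mapping from each private subnet to the gateway in its own AZ. The function name and all resource IDs below are hypothetical placeholders; the dict keys mirror the `DestinationCidrBlock` and `NatGatewayId` parameters of the EC2 CreateRoute call.

```python
def local_nat_routes(private_subnets, nat_gateways):
    """Map each private subnet to a default route via its local NAT gateway.

    private_subnets: {subnet_id: availability_zone}
    nat_gateways:    {availability_zone: nat_gateway_id}
    """
    routes = {}
    for subnet_id, az in private_subnets.items():
        if az not in nat_gateways:
            # Falling back to another AZ's gateway would reintroduce
            # the cross-AZ dependency this layout is meant to avoid.
            raise ValueError(f"no NAT gateway in {az} for {subnet_id}")
        routes[subnet_id] = {
            "DestinationCidrBlock": "0.0.0.0/0",
            "NatGatewayId": nat_gateways[az],
        }
    return routes


# Placeholder IDs for illustration only.
subnets = {"subnet-a": "us-east-1a", "subnet-b": "us-east-1b"}
gateways = {"us-east-1a": "nat-a", "us-east-1b": "nat-b"}
for subnet_id, route in local_nat_routes(subnets, gateways).items():
    print(subnet_id, "->", route["NatGatewayId"])
```

Raising an error for an AZ without a gateway is a design choice in this sketch: it surfaces the gap at planning time instead of silently routing through another AZ.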

Failure Mode and Effects Analysis (FMEA)

FMEA is a systematic process to examine:

What could go wrong
What impact it might have
What is the likelihood of it occurring
What is our ability to detect and react

Severity * Probability * Detection = Risk Priority Number (RPN)

Steps

Round up Possible Failures
Assign scores for each failure mode: customer impact, likelihood, detect and react -> calculate RPN
Prioritize on Risk - implement mitigation plan, additional redundancy
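The steps above can be sketched as a small scoring routine. The failure modes are taken from the disaster types listed earlier, but the scores are made up for illustration, and the 1 to 10 scale (where a higher detection score means the failure is harder to detect) is an assumption, not something these notes specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    name: str
    severity: int     # customer impact, assumed scale 1-10
    probability: int  # likelihood of occurring, assumed scale 1-10
    detection: int    # difficulty to detect and react, assumed scale 1-10

    @property
    def rpn(self):
        """Risk Priority Number = Severity * Probability * Detection."""
        return self.severity * self.probability * self.detection


def prioritize(modes):
    """Return failure modes ordered from highest to lowest RPN."""
    return sorted(modes, key=lambda mode: mode.rpn, reverse=True)


# Hypothetical scores for illustration only.
modes = [
    FailureMode("Hardware failure", severity=8, probability=2, detection=2),
    FailureMode("Credential expiration", severity=7, probability=3, detection=8),
    FailureMode("Deployment failure", severity=6, probability=5, detection=3),
]

for mode in prioritize(modes):
    print(f"{mode.rpn:4d}  {mode.name}")
```

With these scores, credential expiration ranks first (RPN 168) even though hardware failure has the highest severity, because it is the hardest to detect; that is the kind of result the prioritization step is meant to surface.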