Business Continuity
Concepts
- Business Continuity (BC)
- Seeks to minimize business activity disruption when something unexpected happens
- Disaster Recovery (DR)
- Act of responding to these events that threaten business continuity
- High Availability
- Designing in redundancies to reduce the chance of impacting service levels
- Fault Tolerance
- Design in the ability to tolerate faults
- Service Level Agreement (SLA)
- An agreed goal or target for a given service on its performance or availability
- Recovery Time Objective (RTO)
- Time taken after a disruption to restore business processes to their agreed service levels
- Recovery Point Objective (RPO)
- Acceptable amount of data loss measured in time.
The Business Continuity Plan specifies the RTO and RPO. These two metrics then determine how much to invest in High Availability, and they define the process for Disaster Recovery.
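As a minimal sketch of how RTO and RPO drive design decisions, the check below compares a backup-and-restore plan against stated objectives. The class name, function name, and all numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class RecoveryObjectives:
    rto_minutes: float  # maximum tolerable time to restore service
    rpo_minutes: float  # maximum tolerable window of lost data


def meets_objectives(objectives, backup_interval_minutes, restore_minutes):
    """Return (rpo_ok, rto_ok) for a hypothetical backup-and-restore plan."""
    # Worst case, the failure happens just before the next backup runs,
    # so up to one full backup interval of data is lost.
    rpo_ok = backup_interval_minutes <= objectives.rpo_minutes
    # The restore must complete within the recovery time objective.
    rto_ok = restore_minutes <= objectives.rto_minutes
    return rpo_ok, rto_ok


# Hypothetical targets: restore within 4 hours, lose at most 1 hour of data.
objectives = RecoveryObjectives(rto_minutes=240, rpo_minutes=60)

# Daily backups with a 3-hour restore: RTO is met, RPO is not.
print(meets_objectives(objectives, backup_interval_minutes=1440, restore_minutes=180))
# (False, True)
```

In this example a failed RPO check suggests more frequent backups or replication, while a failed RTO check suggests moving toward a standby environment.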
Types of Disaster:
- Hardware Failure
- Deployment Failure
- Load Induced (e.g. DDoS attacks)
- Data Induced
- Credential Expiration
- Dependency
- Infrastructure
- Identifier Exhaustion
Disaster Recovery Architectures
- Backup and Restore
- Requires only a minimal entry point into AWS
- Minimal effort to configure
- Least flexible, off-site backup storage
- Pilot Light
- Minimal environment on standby on AWS for failover
- Switching to AWS will require a manual intervention
- It may take several minutes or hours to spin up an environment
- AMIs should be up-to-date with on-prem counterparts
- Warm Standby
- Services are already up and running
- Could be considered a shadow environment or production staging
- Resources could scale up to meet the incoming demand
- Process can be automated
- Multi-Site
- Ready at all times to take full production load
- Fails over in seconds or less
- No or little intervention required to fail over
- Most expensive DR option; can be considered wasteful
- Failover can be configured to happen automatically through Route 53 health checks
Storage Options
Amazon EBS
- Annual Failure rate less than 0.2%
- Availability target of 99.999% (replicated within a single AZ)
- Vulnerable to AZ failure
- Easy to snapshot; snapshots are stored on S3 and are multi-AZ durable
- You can copy snapshots to other regions
- Supports RAID configurations
- Because EBS operates over the network, RAID levels higher than RAID1 are not recommended
EBS RAID Configurations
- RAID0 (Striping)
- No Redundancy
- Highest Speed Reads and Writes
- Highest Capacity
- RAID1 (Mirroring)
- 1 drive can safely fail
- Slight decrease in reads and writes
- Capacity reduced by 50% due to mirroring
- RAID5
- Redundancy: 1 drive can safely fail
- Minimum of 3 drives; with 3 drives, the equivalent of 2 drives stores the data
- The equivalent of 1 drive stores parity (distributed across the drives), which is used to recreate the data
- Good reads (similar to RAID0), but slow writes
- Capacity of (n-1)/n
- RAID6
- 2 drives can safely fail
- Minimum of 4 drives needed (2 parity)
- Same reads as RAID0, but slowest writes
- Capacity of (n-2)/n
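The capacity figures above can be sketched as a small calculation. The function name is hypothetical, RAID1 is modeled here as a simple two-drive mirror (an assumption that matches the 50% figure above), and the minimum drive counts are the conventional ones for each level.

```python
def usable_capacity_gib(level, drives, drive_size_gib):
    """Usable capacity of an array of equally sized drives (illustrative only)."""
    minimum_drives = {0: 2, 1: 2, 5: 3, 6: 4}
    if level not in minimum_drives:
        raise ValueError(f"unsupported RAID level: {level}")
    if drives < minimum_drives[level]:
        raise ValueError(f"RAID{level} needs at least {minimum_drives[level]} drives")

    if level == 0:
        # Striping: every drive holds data, no redundancy.
        return drives * drive_size_gib
    if level == 1:
        # Assumption: a two-drive mirror, so half the raw capacity is usable.
        if drives != 2:
            raise ValueError("this sketch models RAID1 as exactly 2 drives")
        return drive_size_gib
    if level == 5:
        # One drive's worth of space is used for parity: (n-1)/n usable.
        return (drives - 1) * drive_size_gib
    # RAID6: two drives' worth of space is used for parity: (n-2)/n usable.
    return (drives - 2) * drive_size_gib


for level, drives in [(0, 4), (1, 2), (5, 3), (6, 4)]:
    print(f"RAID{level} with {drives} x 100 GiB:", usable_capacity_gib(level, drives, 100), "GiB")
```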
S3 Storage
- Standard: 99.99% availability = 52 minutes/year
- Standard Infrequent Access: 99.9% = 9 hours/year
- One-zone Infrequent Access: 99.5% availability = 2 days/year
- Eleven 9s of durability: 99.999999999%
- Standard & Standard-IA have multi-AZ durability. One-zone has only single-AZ durability
- S3 is a backing service for many AWS services
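The downtime figures quoted for each storage class follow directly from the availability percentage. A minimal sketch of that arithmetic, assuming a 365-day year (the function and constant names are hypothetical):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability_percent):
    """Expected unavailable minutes per year for a given availability target."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


for name, availability in [
    ("Standard", 99.99),
    ("Standard-IA", 99.9),
    ("One-zone IA", 99.5),
]:
    minutes = downtime_minutes_per_year(availability)
    print(f"{name}: {minutes:.1f} min/year = {minutes / 60:.1f} h/year")
```

This gives roughly 52.6 minutes, 8.8 hours, and 1.8 days per year respectively, consistent with the rounded values in the list.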
Amazon EFS
- Implementation of the NFS file system
- True file system as opposed to block EBS or object storage (S3)
- File locking, strong consistency, concurrently accessible
- Each file object and metadata is stored across multiple AZs
- Can be accessed from all AZs concurrently
- Mount targets are highly available
Other Options
- Amazon Storage Gateway
- Snowball
- Glacier
Compute Options
- Up-to-Date AMIs are critical for rapid fail-over
- AMIs can be copied to other regions for safety or DR staging
- Horizontally scalable architectures are preferred
- Reserved Instances are the only way to guarantee that resources will be available when needed
- Auto Scaling and Elastic Load Balancing work together to provide automatic recovery by maintaining minimum instances
- Route 53 Health Checks also provide “self-healing” redirection of traffic
HA Approaches for Databases
- If possible, choose DynamoDB (DDB) over RDS because of its inherent fault tolerance
- Choose Aurora because of redundancy and automatic recovery features
- If Aurora can’t be used, choose multi-AZ RDS
- Frequent RDS snapshots can protect against data corruption or failure, and they won’t impact the performance of a multi-AZ deployment
- Regional replication is also an option, but will not be strongly consistent
- If hosting database on EC2, you have to develop your own HA plan
Redshift
- Currently Redshift doesn’t support multi-AZ deployments
- Best HA option is to use multi-node cluster which supports replication and node recovery
- A single-node Redshift cluster does not support data replication; in case of failure you will have to restore from a snapshot stored in S3
Memcached
- Does not support replication
- Use multiple nodes in each shard to minimize data loss on node failure
- Launch multiple nodes across available AZs to minimize data loss on AZ failure
Redis
- Use multiple nodes in each shard and distribute the nodes across multiple AZs
- Enable multi-AZ on the replication group to permit automatic failover if the primary node fails
- Schedule regular backups of your Redis cluster
Network Options
- Create subnets in different AZs and allocate resources across multiple AZs
- Create at least 2 VPN tunnels to your Virtual Private Gateway
- Direct Connect is not HA by default; establish a secondary connection via another Direct Connect (ideally with another provider) or use a VPN
- Route 53’s Health Checks provide a basic level of redirection for DNS resolution
- Elastic IPs give you the flexibility to change backing assets without impacting name resolution
- For multi-AZ redundancy of NAT Gateways, create a gateway in each AZ with routes for private subnets to use the local gateway
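Why redundant tunnels, connections, and gateways help can be sketched with the standard parallel-availability formula. This assumes the redundant paths fail independently, which is why a second Direct Connect through another provider is preferable. The function name and the 99% figure are hypothetical.

```python
def combined_availability(availabilities):
    """Availability of redundant paths, assuming independent failures.

    The service is down only when every path is down at the same time.
    """
    all_down = 1.0
    for availability in availabilities:
        all_down *= 1.0 - availability
    return 1.0 - all_down


# Hypothetical example: two independent links, each available 99% of the time.
print(f"{combined_availability([0.99, 0.99]):.4%}")
```

If the two paths share a failure cause (same provider, same AZ, same device), the real figure will be lower than this estimate.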
Failure Mode and Effects Analysis (FMEA)
FMEA is a systematic process to examine:
- What could go wrong
- What impact it might have
- What is the likelihood of it occurring
- What is our ability to detect and react
Severity * Probability * Detection = Risk Priority Number (RPN)
Steps
- Round up Possible Failures
- Assign scores for each failure mode: customer impact, likelihood, detect and react -> calculate RPN
- Prioritize on Risk - implement mitigation plan, additional redundancy
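The steps above can be sketched as a short scoring script. The 1 to 10 scale for each score is an assumption (a common FMEA convention, not stated in these notes), and the failure modes and their scores are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    severity: int     # customer impact: 1 (minor) to 10 (severe), assumed scale
    probability: int  # likelihood: 1 (rare) to 10 (frequent), assumed scale
    detection: int    # detect and react: 1 (caught quickly) to 10 (hard to detect)

    @property
    def rpn(self):
        # Risk Priority Number = Severity * Probability * Detection
        return self.severity * self.probability * self.detection


def prioritize(modes):
    """Order failure modes from highest to lowest RPN."""
    return sorted(modes, key=lambda mode: mode.rpn, reverse=True)


# Hypothetical failure modes and scores.
modes = [
    FailureMode("Expired TLS certificate", severity=8, probability=3, detection=2),
    FailureMode("AZ outage", severity=9, probability=2, detection=1),
    FailureMode("Bad deployment", severity=6, probability=5, detection=4),
]

for mode in prioritize(modes):
    print(mode.rpn, mode.name)
# 120 Bad deployment
# 48 Expired TLS certificate
# 18 AZ outage
```

The highest-RPN items are the first candidates for a mitigation plan or additional redundancy.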