Data Stores

Amazon S3
Amazon EBS (Elastic Block Storage)
Amazon EFS (Elastic File System)
- EFS Performance Considerations
Amazon Storage Gateway
Amazon WorkDocs
Database on EC2
Amazon RDS
Amazon DynamoDB
Amazon Redshift
Amazon Neptune
Amazon ElastiCache
Other Database Options
Database Comparison

Amazon S3

“Secure, durable, highly scalable object storage at a very low cost. You can store and retrieve any amount of data, at any time, from anywhere on the web through a simple web service interface. You can write, read, and delete objects containing from zero to 5 TB of data.”

Amazon EBS (Elastic Block Storage)

To be used with EC2
Bound to a single AZ
Use snapshots for backup
Snapshots can be shared across different accounts
Change AZ by launching a volume in different AZ from snapshot
Convert from unencrypted to encrypted volume through snapshot
Snapshots consume storage incrementally
- AWS reorganizes the necessary data when a snapshot is deleted, so that all remaining snapshots stay restorable
- Snapshot lifecycle policies help automate the creation and deletion of the snapshots
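
The incremental behaviour described above can be illustrated with a toy model. This is a sketch only, not how EBS works internally: the `SnapshotStore` class and its block bookkeeping are hypothetical, and it assumes each snapshot simply records which stored block version it needs.

```python
class SnapshotStore:
    """Toy model of incremental snapshots (hypothetical, not the EBS internals)."""

    def __init__(self):
        self.blocks = {}      # (block_id, version) -> data actually stored
        self.snapshots = {}   # snapshot name -> {block_id: version}
        self._last = {}       # block_id -> (latest version, latest data)

    def snapshot(self, name, volume):
        """Store only the blocks that changed since the previous snapshot."""
        manifest = {}
        for block_id, data in volume.items():
            version, old = self._last.get(block_id, (0, None))
            # Write a new block if the data changed or the old copy was freed.
            if data != old or (block_id, version) not in self.blocks:
                version += 1
                self.blocks[(block_id, version)] = data
                self._last[block_id] = (version, data)
            manifest[block_id] = version
        self.snapshots[name] = manifest

    def delete(self, name):
        """Drop a snapshot but keep every block another snapshot still needs."""
        del self.snapshots[name]
        needed = {(b, v) for m in self.snapshots.values() for b, v in m.items()}
        for key in list(self.blocks):
            if key not in needed:
                del self.blocks[key]

    def restore(self, name):
        """Rebuild the full volume contents recorded by one snapshot."""
        return {b: self.blocks[(b, v)] for b, v in self.snapshots[name].items()}


store = SnapshotStore()
volume = {0: "aaa", 1: "bbb", 2: "ccc"}
store.snapshot("snap-1", volume)   # first snapshot stores all three blocks
volume[1] = "BBB"
store.snapshot("snap-2", volume)   # second snapshot stores only the changed block
store.delete("snap-1")             # frees only the old version of block 1
```

After the deletion, `store.restore("snap-2")` still returns the complete volume, which is the restore-ability guarantee the notes refer to.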

Amazon EFS (Elastic File System)

Based on NFS (Network File System)
Multi-AZ storage
Pay based on the usage
Mount points in single or many AZs
NFS is not an encrypted protocol by itself; use caution when mounting EFS over the Internet
EFS is as durable and available as S3
- AWS DataSync - good alternative for synchronizing on-premises storage with EFS / S3
Beware of the cost!
- EFS is about 3 times more expensive than EBS and about 20 times more expensive than S3

EFS Performance Considerations

Burst credits are allocated over time to control throughput
- New file systems start with 2.1 TiB of burst credits; the baseline rate is 50 MiB/s per TiB of storage, with a burst rate of 100 MiB/s
- Tracked by the BurstCreditBalance metric in Amazon CloudWatch
Supports 2 different performance modes:
- General purpose - default mode for FS with workload up to 7000 IOPS
- Max I/O - workloads demanding higher than 7000 IOPS, optimized for applications where tens / hundreds / thousands of EC2 instances are accessing the file system
  - Systems scale to higher level of aggregate throughput
  - Tradeoff on slightly higher latencies for file operations
If the application can handle async writes, you can trade consistency for speed by enabling asynchronous writes.
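
As a rough worked example of the burst credit arithmetic, the sketch below estimates how long a small file system can sustain the burst rate. It is a simplified model under stated assumptions: credits refill at the baseline rate and drain at the actual throughput, the baseline is taken as 50 MiB/s per TiB stored, and the function name `burst_seconds` is made up for illustration.

```python
MIB_PER_TIB = 1024 * 1024


def burst_seconds(stored_tib, credit_tib=2.1, baseline_per_tib=50.0, burst_rate=100.0):
    """Seconds a file system can sustain burst_rate before credits run out.

    Simplified model: credits (in MiB) refill at the baseline rate and drain
    at the actual throughput, so the net drain is burst_rate - baseline.
    """
    baseline = baseline_per_tib * stored_tib   # MiB/s earned back while running
    net_drain = burst_rate - baseline          # MiB/s of credits actually spent
    if net_drain <= 0:
        return float("inf")                    # baseline already covers the burst rate
    return credit_tib * MIB_PER_TIB / net_drain


# A file system holding 0.1 TiB, bursting continuously from a full initial balance.
hours = burst_seconds(stored_tib=0.1) / 3600
```

Under these assumptions a 0.1 TiB file system can burst for roughly 6.4 hours, which is why a falling BurstCreditBalance is worth alarming on for small file systems.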

Amazon Storage Gateway

Virtual machine that can run on premises or on EC2
Provides local storage resources backed by Amazon S3 and Glacier
Contains logic to synchronize data back-and-forth to S3
Useful in cloud migrations

Running Modes

File Gateway
- Allow on-prem or EC2 instances to store objects in S3 via NFS or SMB mount point
Volume Gateway
- Volume Gateway Stored Mode / Gateway-stored Volumes
  - Async replication of data from on-prem to S3, uses iSCSI interface
- Volume Gateway Cached Mode / Gateway-cached volumes
  - Primary data stored in S3 with frequent access data cached locally on-prem, uses iSCSI interface
Tape Gateway / Gateway-Virtual Tape Library
- Virtual media changer and tape library for use with existing backup software, uses iSCSI interface

Cost Model

The following cost components should be considered when using AWS Storage Gateway:

gateway usage
snapshot storage usage
volume storage usage
virtual tape shelf storage
virtual tape library storage
retrieval from virtual tape shelf
data transfer out

Use Cases

Move on-premises backups to the cloud
Shift on-premises storage to cloud-backed file shares
Provide low-latency access for on-premises applications to cloud data

Amazon WorkDocs

Amazon’s alternative to Dropbox / Google Drive
Secure, fully managed file collaboration service
Can integrate with AD for SSO
Web, mobile and native clients (no Linux client yet)
HIPAA, PCI DSS and ISO compliant
SDK available for creating complementary apps

Database on EC2

Run any database with full control and ultimate flexibility
Self-managed backups, redundancy, patching, scale
Good option to run databases not supported by RDS yet

Amazon RDS

Managed database service
Supports most-popular database engines
Structured, relational databases
Automated backups and patching in pre-defined maintenance windows
Push-button scaling, replication and redundancy
Multi-AZ RDS
- Standby instance replication is synchronous
- The standby can be promoted to master at any time without data loss
Read replication is asynchronous
Read replicas can serve regional users
MariaDB is an open-source fork of MySQL

Note: Non-transactional storage engines like MyISAM don’t support replication; you must use InnoDB (XtraDB on MariaDB)

RDS Anti-Patterns

Large BLOBS - use S3
Automated scalability - use DynamoDB
Name/Value data structure - use DynamoDB
Data is not well structured or is unpredictable - use DynamoDB
Unsupported by RDS database - use EC2
Complete control over the database - use EC2

Amazon DynamoDB

Key-value store
Managed multi-AZ NoSQL data store
Cross-Region Replication option
Defaults to eventual consistency reads
SDK supports strong read consistency via a parameter
- May slow down read in case of outages in the write AZ
Priced on throughput
- Read/Write Capacity Units
Autoscale capacity adjusts per configured min/max levels
- DynamoDB won’t scale down
On-Demand Capacity provides flexible capacity at a small premium cost
Achieve ACID compliance with DynamoDB Transactions
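
To make throughput pricing and the two read consistency levels concrete, here is a small capacity calculation. It is a sketch for provisioned capacity that assumes one read capacity unit covers one strongly consistent read per second of an item up to 4 KB (or two eventually consistent reads), and one write capacity unit covers one write per second of an item up to 1 KB. The function names are made up for illustration.

```python
import math


def read_capacity_units(item_kb, reads_per_sec, strongly_consistent=False):
    """Read capacity units needed for a steady read rate."""
    units_per_read = math.ceil(item_kb / 4)      # reads are metered in 4 KB steps
    total = units_per_read * reads_per_sec
    # Eventually consistent reads cost half as much as strongly consistent ones.
    return total if strongly_consistent else math.ceil(total / 2)


def write_capacity_units(item_kb, writes_per_sec):
    """Write capacity units needed for a steady write rate."""
    return math.ceil(item_kb) * writes_per_sec   # writes are metered in 1 KB steps


strong = read_capacity_units(6, 100, strongly_consistent=True)
eventual = read_capacity_units(6, 100)
writes = write_capacity_units(6, 50)
```

With boto3, a strongly consistent read is requested by passing `ConsistentRead=True` to calls such as `get_item` or `query`; the default is an eventually consistent read.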

Relational vs NoSQL

Relational - structured data
NoSQL - self-contained records

NoSQL Indexes

Primary Key is used to create internal hash
Composite Primary Key consists of a partition key and a sort key
- Can have duplicate of partition keys as long as the sort key is different
Global Secondary Index
- If you want a fast query of attributes outside the primary key
Local Secondary Index
- You know the partition key and want to quickly query on some other attribute
There is a limit to the number of indexes and attributes per index
Indexes take up storage space
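
The key and index ideas above can be sketched with a small in-memory stand-in. This is not the DynamoDB API: the `ToyTable` class, its method names and the sample attribute names are hypothetical, and only a composite primary key plus one global secondary index are modelled.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


class ToyTable:
    """In-memory stand-in for a table with a composite primary key (hypothetical)."""

    def __init__(self, partition_key, sort_key, gsi_key):
        self.pk, self.sk, self.gsi_key = partition_key, sort_key, gsi_key
        self.partitions = defaultdict(dict)   # partition value -> {sort value: item}
        self.gsi = defaultdict(list)          # GSI value -> items (extra storage)

    def put(self, item):
        partition = self.partitions[item[self.pk]]
        old = partition.get(item[self.sk])
        if old is not None and self.gsi_key in old:
            self.gsi[old[self.gsi_key]].remove(old)   # keep the index in sync
        partition[item[self.sk]] = item
        if self.gsi_key in item:
            self.gsi[item[self.gsi_key]].append(item)

    def query(self, pk_value, sk_from=None, sk_to=None):
        """Items in one partition, ordered by sort key, optionally range-limited."""
        rows = self.partitions.get(pk_value, {})
        keys = sorted(rows)
        lo = 0 if sk_from is None else bisect_left(keys, sk_from)
        hi = len(keys) if sk_to is None else bisect_right(keys, sk_to)
        return [rows[k] for k in keys[lo:hi]]

    def query_gsi(self, value):
        """Look up items by an attribute outside the primary key."""
        return list(self.gsi.get(value, []))


orders = ToyTable("customer_id", "order_date", gsi_key="status")
orders.put({"customer_id": "c1", "order_date": "2023-01-05", "status": "shipped"})
orders.put({"customer_id": "c1", "order_date": "2023-02-10", "status": "open"})
orders.put({"customer_id": "c2", "order_date": "2023-01-20", "status": "open"})

recent_c1 = orders.query("c1", sk_from="2023-02-01")   # same partition key, range on sort key
open_orders = orders.query_gsi("open")                 # spans partitions via the index
```

The second dictionary is a full extra copy of the indexed references, which mirrors the note that indexes take up storage space.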

Amazon Redshift

Fully managed, clustered, petabyte-scale data warehouse
Extremely cost-effective as compared to some other on-premises data warehouse platforms
PostgreSQL compatible with JDBC and ODBC drivers available; compatible with most BI tools out of the box
Features parallel processing and columnar data stores which are optimized for complex queries
Option to query directly from data files on S3 via Redshift Spectrum
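
Why a columnar layout suits analytic queries can be shown with a tiny illustration. It is a sketch of the general idea only, not of Redshift's storage engine; the sample rows and function names are invented for the example.

```python
def to_columns(rows):
    """Pivot a list of row dicts into one list per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def sum_rowwise(rows, name):
    """Aggregate one field from row storage; every full row has to be read."""
    total, values_read = 0, 0
    for row in rows:
        values_read += len(row)
        total += row[name]
    return total, values_read


def sum_columnar(columns, name):
    """Aggregate one field from column storage; only that column is read."""
    column = columns[name]
    return sum(column), len(column)


rows = [
    {"order_id": 1, "region": "EU", "customer": "c1", "amount": 120.0},
    {"order_id": 2, "region": "US", "customer": "c2", "amount": 80.0},
    {"order_id": 3, "region": "EU", "customer": "c3", "amount": 200.0},
]
row_total, row_reads = sum_rowwise(rows, "amount")
col_total, col_reads = sum_columnar(to_columns(rows), "amount")
```

Both calls return the same total, but the row-wise scan touches every value in every row while the columnar scan touches only the `amount` column; the gap grows with the number of columns.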

Data Lake

Large repository for a variety of data
Query raw data without extensive pre-processing
Lessen time from data collection to data value
Identify correlations between disparate data sets
Data can be located on Amazon S3 and queried from BI tools using Amazon Redshift Spectrum

Amazon Neptune

Fully-managed Graph database
Optimized to deal with relationships between objects
- Allows you to store interrelationships and query them very efficiently
Supports open graph APIs for both Gremlin and SPARQL
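
The kind of relationship query a graph database is built for can be sketched in plain Python. This does not use Gremlin or SPARQL; the adjacency dictionary, the names in it and the `within_hops` function are all made up for illustration.

```python
from collections import deque


def within_hops(graph, start, max_hops):
    """Return {node: hops} for every node reachable from start in <= max_hops."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue                      # do not expand past the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen[neighbor] = seen[node] + 1
                queue.append(neighbor)
    del seen[start]                       # the start node is not its own contact
    return seen


follows = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave", "erin"],
    "dave": ["frank"],
}
friends_of_friends = within_hops(follows, "alice", max_hops=2)
```

In a relational schema the same question needs one self-join per hop, which is the workload a graph store is optimized to avoid.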

Amazon ElastiCache

Fully managed implementations of 2 popular in-memory data stores - Redis and Memcached
Push-button scalability for memory, writes and reads
In-memory key/value store - not persistent in the traditional sense…
Billed by node size and hours of use

Use Cases

Web Session Store
- Stateless application
Database Caching
- Offload load from database servers, return results faster to users
Leaderboards
- Provide live leaderboard for millions of users in your mobile app
Streaming Data Dashboards
- Provide a landing spot for streaming sensor data on the factory floor, providing live real-time dashboard displays.
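
The database caching use case usually follows a cache-aside pattern, sketched below. A plain dictionary stands in for the Redis or Memcached node, and the `CacheAside` class, the `load_from_db` callback and the TTL value are assumptions for illustration rather than any ElastiCache API.

```python
import time


class CacheAside:
    """Cache-aside reads with a TTL; a dict stands in for the cache node."""

    def __init__(self, load_from_db, ttl_seconds=60.0, clock=time.monotonic):
        self.load_from_db = load_from_db   # hypothetical function that queries the database
        self.ttl = ttl_seconds
        self.clock = clock
        self.store = {}                    # key -> (expires_at, value)

    def get(self, key):
        entry = self.store.get(key)
        if entry is not None and entry[0] > self.clock():
            return entry[1]                              # cache hit, database untouched
        value = self.load_from_db(key)                   # cache miss, go to the database
        self.store[key] = (self.clock() + self.ttl, value)
        return value

    def invalidate(self, key):
        """Call after a write so the next read reloads fresh data."""
        self.store.pop(key, None)


db_queries = []


def slow_lookup(key):
    db_queries.append(key)                 # record each trip to the "database"
    return key.upper()


cache = CacheAside(slow_lookup, ttl_seconds=30)
first = cache.get("user:1")                # miss, loads from the database
second = cache.get("user:1")               # hit, served from memory
```

Only the first call reaches `slow_lookup`; repeated reads within the TTL are served from memory, which is how the cache offloads the database.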

Redis vs Memcached

Memcached

Simple, no-frills, straight-forward
You need to scale out and in as demand changes
You need to run on multiple CPU cores and threads
You need to cache objects (e.g. database query results)

Redis

You need encryption
You need HIPAA compliance
Support for clustering
You need complex datatypes
You need high-availability (replication)
Pub/Sub capability
Geospatial Indexing
Backup and Restore

Other Database Options

Amazon Athena

SQL Engine overlaid on S3, based on Presto
Query raw data objects as they sit in an S3 bucket
Use or convert your data to Parquet format if possible for a big performance jump
Similar in concept to Redshift Spectrum

Amazon Athena vs Amazon Redshift Spectrum

Athena: Data lives mostly on S3 without the need to perform joins with other data sources
Redshift Spectrum: Want to join S3 data with existing Redshift tables or create union products
Supports Apache Parquet, JSON and Apache ORC formats

Amazon Quantum Ledger Database

Based on blockchain concepts
Provides an immutable and transparent journal as a service without having to set up and maintain an entire blockchain framework
Centralized design (as opposed to decentralized consensus-based design for common blockchain frameworks) allows for higher performance and scalability
Append-only concept where each record contributes to the integrity of the chain
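
The append-only idea, where each record contributes to the integrity of the chain, can be illustrated with a hash-chained log. This is a toy model and not the QLDB implementation: the `Journal` class, its entry layout and the choice of SHA-256 over JSON are assumptions made for the sketch.

```python
import hashlib
import json

GENESIS = "0" * 64   # placeholder hash that precedes the first entry


def _digest(prev_hash, data):
    payload = json.dumps(data, sort_keys=True)
    return hashlib.sha256((prev_hash + payload).encode()).hexdigest()


class Journal:
    """Append-only, hash-chained log (toy model, not the QLDB internals)."""

    def __init__(self):
        self.entries = []

    def append(self, data):
        prev_hash = self.entries[-1]["hash"] if self.entries else GENESIS
        entry = {"data": data, "prev": prev_hash, "hash": _digest(prev_hash, data)}
        self.entries.append(entry)
        return entry["hash"]

    def verify(self):
        """Recompute every hash; any edited record breaks the chain from there on."""
        prev_hash = GENESIS
        for entry in self.entries:
            if entry["prev"] != prev_hash or entry["hash"] != _digest(prev_hash, entry["data"]):
                return False
            prev_hash = entry["hash"]
        return True


journal = Journal()
journal.append({"account": "A", "amount": 100})
journal.append({"account": "A", "amount": -30})
intact = journal.verify()
```

Changing any earlier record afterwards makes `verify()` return `False`, because every later hash depends on it.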

Amazon Managed Blockchain

Fully managed blockchain framework supporting open source frameworks of Hyperledger Fabric and Ethereum
Distributed consensus-based concept consisting of a network, members (other AWS accounts), nodes (instances) and potentially applications

Amazon Timestream Database

Fully managed database service specifically built for storing and analyzing time-series data
Alternative to DynamoDB or Redshift and includes some built-in analytics like interpolation and smoothing
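
For readers unfamiliar with the two analytics named above, here is what interpolation and smoothing do to a time series. The sketch uses plain Python rather than Timestream's own query functions, and the function names and sample readings are invented for illustration.

```python
def interpolate(points, t):
    """Linearly interpolate a value at time t.

    `points` is a list of (time, value) pairs with strictly increasing times.
    """
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        if t0 <= t <= t1:
            return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
    raise ValueError("t is outside the sampled range")


def moving_average(values, window):
    """Smooth a series with a trailing moving average over `window` samples."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Temperature readings with a gap between t=10 and t=30.
readings = [(0, 20.0), (10, 22.0), (30, 26.0)]
estimate_at_20 = interpolate(readings, 20)          # fills the missing sample
smooth = moving_average([1.0, 2.0, 3.0, 4.0], window=2)
```

Interpolation fills gaps left by irregular sensors, and smoothing damps noise before values are charted or compared.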

Use Cases

Industrial Machinery
Sensor Networks
Equipment Telemetry

Amazon DocumentDB (with MongoDB compatibility)

AWS’s invention that emulates the MongoDB API so it acts like MongoDB to existing clients and drivers
Fully managed with all the good stuff (multi-AZ, HA, scalability, integrated with KMS, S3 backups)
An option if you currently use MongoDB and want to get out of the server management

Amazon ElasticSearch

Stores and indexes documents (JSON)
Usually used as part of the ELK stack:
- Elasticsearch - search and storage
- Logstash - intake
- Kibana - analytics

Database Comparison

Database on EC2
- Ultimate control over database
- Preferred DB not available under RDS
Amazon RDS
- Need traditional database for OLTP
- Your data is well-formed and structured
Amazon DynamoDB
- Name/value pair data or unpredictable data structure
- In-memory performance with persistence
Amazon Redshift
- Massive amounts of data
- Primary OLAP workloads
Amazon Neptune
- Relationships between objects are a major portion of data value
Amazon ElastiCache
- Fast temporary storage for small amounts of data
- Highly volatile data
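
The comparison above can be condensed into a small decision helper, useful as a study mnemonic. It is a sketch only: the function name, its parameters and the order in which the checks run are assumptions, and real workloads often justify a different choice.

```python
def suggest_data_store(needs_full_control=False, engine_supported_by_rds=True,
                       volatile_cache=False, relationship_heavy=False,
                       workload="oltp", structured=True):
    """Map the rules of thumb from the comparison list to a service name."""
    if needs_full_control or not engine_supported_by_rds:
        return "Database on EC2"       # ultimate control, or engine missing from RDS
    if volatile_cache:
        return "Amazon ElastiCache"    # fast temporary storage for volatile data
    if relationship_heavy:
        return "Amazon Neptune"        # relationships carry most of the value
    if workload == "olap":
        return "Amazon Redshift"       # massive data, primarily OLAP
    if not structured:
        return "Amazon DynamoDB"       # name/value or unpredictable structure
    return "Amazon RDS"                # well-formed, structured OLTP data


choice = suggest_data_store(workload="olap")
```

For example, `suggest_data_store(structured=False)` points to DynamoDB, while the defaults describe a classic structured OLTP workload and return RDS.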