Data Stores
Amazon S3
“Secure, durable, highly scalable object storage at a very low cost. You can store
and retrieve any amount of data, at any time, from anywhere on the web through
a simple web service interface. You can write, read, and delete objects containing
from zero to 5 TB of data.”
Amazon EBS (Elastic Block Store)
- To be used with EC2
- Bound to a single AZ
- Use snapshots for backup
- Snapshots can be shared across different accounts
- Change AZ by launching a volume in a different AZ from a snapshot
- Convert from unencrypted to encrypted volume through snapshot
- Snapshots consume storage incrementally
- AWS reorganizes the necessary data on snapshot deletion to guarantee that all remaining snapshots can still be restored
- Snapshot lifecycle policies help automate the creation and deletion of the snapshots
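The incremental storage and the reorganization on deletion described above can be illustrated with a small toy model. This is only a conceptual sketch, not how EBS is implemented internally: the `SnapshotChain` class, its method names and the block-dictionary representation of a volume are all assumptions made for illustration, and the model ignores blocks that are removed from a volume.

```python
from typing import Dict, List, Tuple


class SnapshotChain:
    """Toy model of incremental snapshots for a single volume (hypothetical)."""

    def __init__(self) -> None:
        # Each entry is (snapshot_id, blocks changed since the previous snapshot).
        self._snapshots: List[Tuple[str, Dict[int, bytes]]] = []
        self._last_state: Dict[int, bytes] = {}

    def _index_of(self, snapshot_id: str) -> int:
        for index, (name, _) in enumerate(self._snapshots):
            if name == snapshot_id:
                return index
        raise KeyError(snapshot_id)

    def _state_at(self, index: int) -> Dict[int, bytes]:
        # Replay every delta up to and including the given snapshot.
        state: Dict[int, bytes] = {}
        for _, delta in self._snapshots[: index + 1]:
            state.update(delta)
        return state

    def create(self, snapshot_id: str, volume: Dict[int, bytes]) -> int:
        """Store only the blocks that differ from the previous snapshot."""
        delta = {
            block: data
            for block, data in volume.items()
            if self._last_state.get(block) != data
        }
        self._snapshots.append((snapshot_id, delta))
        self._last_state = dict(volume)
        return len(delta)  # number of blocks this snapshot actually stores

    def restore(self, snapshot_id: str) -> Dict[int, bytes]:
        return self._state_at(self._index_of(snapshot_id))

    def delete(self, snapshot_id: str) -> None:
        """Remove a snapshot, moving blocks that later snapshots still need."""
        index = self._index_of(snapshot_id)
        _, removed = self._snapshots.pop(index)
        if index < len(self._snapshots):
            next_id, next_delta = self._snapshots[index]
            # The next snapshot's own changes win over the inherited blocks.
            self._snapshots[index] = (next_id, {**removed, **next_delta})
        self._last_state = self._state_at(len(self._snapshots) - 1)


chain = SnapshotChain()
volume = {0: b"boot", 1: b"data-v1"}
stored_a = chain.create("snap-a", volume)   # full copy: 2 blocks
volume[1] = b"data-v2"
stored_b = chain.create("snap-b", volume)   # only the changed block
volume[2] = b"logs"
stored_c = chain.create("snap-c", volume)   # only the new block

chain.delete("snap-a")                      # later snapshots stay restorable
restored_b = chain.restore("snap-b")
restored_c = chain.restore("snap-c")
```

In this model, deleting the oldest snapshot does not break the newer ones, because the blocks they depend on are folded into the next snapshot in the chain.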
Amazon EFS (Elastic File System)
- Based on NFS (Network File System)
- Multi-AZ storage
- Pay based on the usage
- Mount points in single or many AZs
- NFS is not an encrypted protocol; be cautious when mounting EFS over the Internet
- EFS is as durable and available as S3
- AWS DataSync - good alternative for synchronizing on-premises storage with EFS / S3
- Beware of the cost!
- EFS is about 3 times more expensive than EBS and about 20 times more expensive than S3
- Burst credits are allocated over time to control throughput
- Credit balance starts at 2.1 TiB, with a baseline rate of 50 MiB/s and a burst rate of 100 MiB/s per TiB of storage
- Reported by the BurstCreditBalance metric in Amazon CloudWatch
- Supports 2 different performance modes:
- General purpose - default mode for FS with workload up to 7000 IOPS
- Max I/O - workloads demanding higher than 7000 IOPS, optimized for applications where tens / hundreds / thousands of EC2 instances are accessing the file system
- Systems scale to higher level of aggregate throughput
- Tradeoff on slightly higher latencies for file operations
- If the application can tolerate asynchronous writes, you can trade consistency for speed by enabling them.
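To make the burst credit figures above more concrete, here is a rough back-of-the-envelope sketch. It assumes that the 2.1 TiB figure is the starting credit balance, that credits accrue at the baseline rate, and that they are spent at the rate the file system is actually driven. The function name `seconds_of_burst` is made up for this example.

```python
MIB_PER_TIB = 1024 * 1024


def seconds_of_burst(credit_balance_mib: float,
                     baseline_mib_s: float,
                     burst_mib_s: float) -> float:
    """Time a file system can sustain burst_mib_s before credits run out.

    Assumption: credits accrue at the baseline rate and are spent at the
    driven rate, so the balance drains at (burst - baseline) MiB/s.
    """
    if burst_mib_s <= baseline_mib_s:
        return float("inf")  # at or below baseline the balance never drains
    return credit_balance_mib / (burst_mib_s - baseline_mib_s)


initial_credits_mib = 2.1 * MIB_PER_TIB
duration_s = seconds_of_burst(initial_credits_mib,
                              baseline_mib_s=50,
                              burst_mib_s=100)
print(f"{duration_s / 3600:.1f} hours of bursting")  # prints "12.2 hours of bursting"
```

Under these assumptions a 1 TiB file system could burst continuously for roughly half a day before falling back to its baseline rate, which is why watching BurstCreditBalance matters for sustained workloads.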
AWS Storage Gateway
- Virtual machine that can run on premises or on EC2
- Provides local storage resources backed by Amazon S3 and Glacier
- Contains logic to synchronize data back-and-forth to S3
- Useful in cloud migrations
Running Modes
- File Gateway
- Allows on-prem or EC2 instances to store objects in S3 via an NFS or SMB mount point
- Volume Gateway
- Volume Gateway Stored Mode / Gateway-stored Volumes
- Async replication of data from on-prem to S3, uses iSCSI interface
- Volume Gateway Cached Mode / Gateway-cached volumes
- Primary data stored in S3 with frequent access data cached locally on-prem, uses iSCSI interface
- Tape Gateway / Gateway-Virtual Tape Library
- Virtual media changer and tape library for use with existing backup software, uses iSCSI interface
Cost Model
Following cost model components should be considered when using AWS Storage Gateway:
- gateway usage
- snapshot storage usage
- volume storage usage
- virtual tape shelf storage
- virtual tape library storage
- retrieval from virtual tape shelf
- data transfer out
Use Cases
- Move on-premises backups to the cloud
- Shift on-premises storage to cloud-backed file shares
- Provide low-latency access for on-premises applications to cloud data
Amazon WorkDocs
- Amazon’s alternative to Dropbox / Google Drive
- Secure, fully managed file collaboration service
- Can integrate with AD for SSO
- Web, mobile and native clients (no Linux client yet)
- HIPAA, PCI DSS and ISO compliant
- SDK available for creating complementary apps
Database on EC2
- Run any database with full control and ultimate flexibility
- Self-managed backups, redundancy, patching, scale
- Good option to run databases not supported by RDS yet
Amazon RDS
- Managed database service
- Supports most-popular database engines
- Structured, relational databases
- Automated backups and patching in pre-defined maintenance windows
- Push-button scaling, replication and redundancy
- Multi-AZ RDS
- Standby instance replication is Synchronous
- Standby can be promoted to master at any point in time without data loss
- Read replication is asynchronous
- Read replicas serve regional users
- MariaDB is an open-source fork of MySQL
- Note: Non-transactional storage engines like MyISAM don’t support replication; you must use InnoDB (XtraDB on MariaDB)
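The difference between synchronous standby replication and asynchronous read replication can be sketched with a toy model. Nothing here reflects how RDS is actually implemented; the `Node` and `Primary` classes and the `ship_to_replica` method are hypothetical names used only to show why a standby holds every committed change while a read replica may lag.

```python
from typing import List


class Node:
    """A database node holding an ordered log of committed changes."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.log: List[str] = []


class Primary(Node):
    def __init__(self, name: str, standby: Node, replica: Node) -> None:
        super().__init__(name)
        self.standby = standby
        self.replica = replica
        self._pending: List[str] = []  # changes not yet shipped to the replica

    def commit(self, change: str) -> None:
        # Synchronous: the commit completes only once the standby has the change.
        self.log.append(change)
        self.standby.log.append(change)
        # Asynchronous: the read replica receives the change some time later.
        self._pending.append(change)

    def ship_to_replica(self, max_changes: int) -> None:
        """Deliver up to max_changes queued changes to the read replica."""
        batch = self._pending[:max_changes]
        self._pending = self._pending[max_changes:]
        self.replica.log.extend(batch)

    def replica_lag(self) -> int:
        return len(self._pending)


standby, replica = Node("standby"), Node("replica")
primary = Primary("primary", standby, replica)
for i in range(5):
    primary.commit(f"txn-{i}")
primary.ship_to_replica(3)

# The standby is identical to the primary; the replica is two changes behind.
print(standby.log == primary.log, replica.log, primary.replica_lag())
```

In this sketch a failover to the standby loses nothing, while reads from the replica may return slightly stale data until the remaining changes arrive.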
RDS Anti-Patterns
- Large BLOBS - use S3
- Automated scalability - use DynamoDB
- Name/Value data structure - use DynamoDB
- Data is not well structured or is unpredictable - use DynamoDB
- Database engine not supported by RDS - use EC2
- Complete control over the database - use EC2
Amazon DynamoDB
- Key-value store
- Managed multi-AZ NoSQL data store
- Cross-Region Replication option
- Defaults to eventual consistency reads
- SDK supports strong read consistency via a parameter
- May slow down read in case of outages in the write AZ
- Priced on throughput
- Read/Write Capacity Units
- Autoscale capacity adjusts per configured min/max levels
- DynamoDB won’t scale down
- On-Demand Capacity provides flexible capacity at a small premium cost
- Achieve ACID compliance with DynamoDB Transactions
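Since pricing is based on read and write capacity units, a small calculator helps show how item size and read consistency affect the bill. The sketch below assumes the commonly documented unit sizes: one read capacity unit covers one strongly consistent read per second of an item up to 4 KB, or two eventually consistent reads, and one write capacity unit covers one write per second of an item up to 1 KB. The function names are made up for this example.

```python
import math


def read_capacity_units(item_size_bytes: int, reads_per_second: int,
                        strongly_consistent: bool = False) -> int:
    """Estimate the RCUs needed for a steady read workload.

    Assumption: 1 RCU = one strongly consistent read/s of up to 4 KB,
    or two eventually consistent reads/s of the same size.
    """
    units_per_read = max(1, math.ceil(item_size_bytes / 4096))
    total = units_per_read * reads_per_second
    return total if strongly_consistent else math.ceil(total / 2)


def write_capacity_units(item_size_bytes: int, writes_per_second: int) -> int:
    """Estimate the WCUs needed (assumption: 1 WCU = one write/s of up to 1 KB)."""
    units_per_write = max(1, math.ceil(item_size_bytes / 1024))
    return units_per_write * writes_per_second


item_size = 6 * 1024  # a 6 KB item
eventual_rcu = read_capacity_units(item_size, reads_per_second=100)
strong_rcu = read_capacity_units(item_size, reads_per_second=100,
                                 strongly_consistent=True)
wcu = write_capacity_units(item_size, writes_per_second=50)
print(eventual_rcu, strong_rcu, wcu)  # prints "100 200 300"
```

The example shows why defaulting to eventually consistent reads is attractive: the same read workload needs half the provisioned read capacity.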
Relational vs NoSQL
- Relational - structured data
- NoSQL - self-contained records
NoSQL Indexes
- Primary Key is used to create internal hash
- Composite Primary Key consists of a partition key and a sort key
- Can have duplicate partition keys as long as the sort key is different
- Global Secondary Index
- If you want a fast query of attributes outside the primary key
- Local Secondary Index
- You know the partition key and want to quickly query on some other attribute
- There is a limit to the number of indexes and attributes per index
- Indexes take up storage space
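The key and index concepts above can be mimicked with plain dictionaries. This is a simplified sketch, not DynamoDB's API or storage engine: `ToyTable`, `global_secondary_index` and `query_by_local_index` are hypothetical helpers, and the orders data is invented for illustration.

```python
from collections import defaultdict
from typing import Any, Dict, Iterator, List

Item = Dict[str, Any]


class ToyTable:
    """Items grouped by partition key, addressed by sort key inside a partition."""

    def __init__(self, partition_key: str, sort_key: str) -> None:
        self.partition_key = partition_key
        self.sort_key = sort_key
        # partition key value -> {sort key value -> item}
        self._partitions: Dict[Any, Dict[Any, Item]] = defaultdict(dict)

    def put(self, item: Item) -> None:
        pk, sk = item[self.partition_key], item[self.sort_key]
        self._partitions[pk][sk] = item  # same pk + sk overwrites the item

    def query(self, pk_value: Any, sk_prefix: str = "") -> List[Item]:
        """Return a partition's items in sort key order, optionally filtered."""
        rows = self._partitions.get(pk_value, {})
        return [rows[sk] for sk in sorted(rows) if str(sk).startswith(sk_prefix)]

    def items(self) -> Iterator[Item]:
        for rows in self._partitions.values():
            yield from rows.values()


def global_secondary_index(table: ToyTable, attribute: str) -> Dict[Any, List[Item]]:
    """Second lookup structure keyed by an attribute outside the primary key."""
    index: Dict[Any, List[Item]] = defaultdict(list)
    for item in table.items():
        if attribute in item:
            index[item[attribute]].append(item)  # the extra copy costs storage
    return index


def query_by_local_index(table: ToyTable, pk_value: Any, attribute: str) -> List[Item]:
    """Same partition key, but ordered by a different attribute."""
    rows = [row for row in table.query(pk_value) if attribute in row]
    return sorted(rows, key=lambda row: row[attribute])


orders = ToyTable(partition_key="customer_id", sort_key="order_date")
orders.put({"customer_id": "c1", "order_date": "2024-01-05", "status": "shipped", "total": 30})
orders.put({"customer_id": "c1", "order_date": "2024-02-11", "status": "pending", "total": 12})
orders.put({"customer_id": "c2", "order_date": "2024-01-20", "status": "pending", "total": 99})

february = orders.query("c1", sk_prefix="2024-02")
pending = global_secondary_index(orders, "status")["pending"]
cheapest_first = query_by_local_index(orders, "c1", "total")
```

Both index helpers build additional structures on top of the base table, which mirrors the point that indexes take up storage space.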
Amazon Redshift
- Fully managed, clustered, petabyte-scale data warehouse
- Extremely cost-effective compared to some on-premises data warehouse platforms
- PostgreSQL compatible with JDBC and ODBC drivers available; compatible with most BI tools out of the box
- Features parallel processing and columnar data stores which are optimized for complex queries
- Option to query directly from data files on S3 via Redshift Spectrum
Data Lake
- Large repository for a variety of data
- Query raw data without extensive pre-processing
- Lessen time from data collection to data value
- Identify correlations between disparate data sets
- Data can be located on AWS S3 and queried from BI tools using Amazon Redshift Spectrum
Amazon Neptune
- Fully-managed Graph database
- Optimized to deal with relationships between objects
- Lets you store interrelationships and query them very efficiently
- Supports open graph APIs for both Gremlin and SPARQL
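To illustrate what a relationship-centric query looks like, here is a minimal in-memory sketch of a labeled graph with a bounded traversal. It is not Gremlin or SPARQL and says nothing about Neptune's internals; `ToyGraph`, `within_hops` and the sample data are hypothetical.

```python
from collections import deque
from typing import Dict, Set, Tuple


class ToyGraph:
    """Directed graph whose edges carry a label such as 'follows'."""

    def __init__(self) -> None:
        self._edges: Dict[str, Set[Tuple[str, str]]] = {}

    def add_edge(self, source: str, label: str, target: str) -> None:
        self._edges.setdefault(source, set()).add((label, target))
        self._edges.setdefault(target, set())

    def neighbours(self, vertex: str, label: str) -> Set[str]:
        return {t for (l, t) in self._edges.get(vertex, set()) if l == label}

    def within_hops(self, start: str, label: str, max_hops: int) -> Set[str]:
        """Vertices reachable from start via `label` edges in at most max_hops."""
        seen = {start}
        frontier = deque([(start, 0)])
        while frontier:
            vertex, depth = frontier.popleft()
            if depth == max_hops:
                continue  # do not expand past the hop limit
            for nxt in self.neighbours(vertex, label):
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, depth + 1))
        seen.discard(start)
        return seen


graph = ToyGraph()
graph.add_edge("alice", "follows", "bob")
graph.add_edge("bob", "follows", "carol")
graph.add_edge("carol", "follows", "dave")
graph.add_edge("alice", "likes", "post-1")

two_hops = graph.within_hops("alice", "follows", max_hops=2)
```

A question like "who does alice reach within two follows" is a short traversal here, whereas in a relational schema it would typically require self-joins on a relationship table.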
Amazon ElastiCache
- Fully managed implementations of 2 popular in-memory data stores - Redis and Memcached
- Push-button scalability for memory, writes and reads
- In-memory key/value store - not persistent in the traditional sense…
- Billed by node size and hours of use
Use Cases
- Web Session Store
- Database Caching
- Take load off database servers, return results faster to users
- Leaderboards
- Provide live leaderboard for millions of users in your mobile app
- Streaming Data Dashboards
- Provide a landing spot for streaming sensor data on the factory floor, providing live real-time dashboard displays.
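The database caching use case is often implemented with a cache-aside pattern. The sketch below uses an in-process dictionary with expiry as a stand-in for Redis or Memcached, so it shows only the control flow; `TTLCache`, `cached_query` and `slow_query` are hypothetical names, and a real deployment would talk to an ElastiCache endpoint through a client library instead.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLCache:
    """In-process stand-in for a cache node: values expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: treat as a miss
            return None
        return value

    def set(self, key: str, value: Any) -> None:
        self._entries[key] = (self._clock() + self._ttl, value)


def cached_query(cache: TTLCache, key: str, run_query: Callable[[], Any]) -> Any:
    """Cache-aside: try the cache first, fall back to the database on a miss.

    Note: a result of None is treated as a miss in this simplified version.
    """
    value = cache.get(key)
    if value is None:
        value = run_query()
        cache.set(key, value)
    return value


db_calls = {"count": 0}


def slow_query() -> list:
    db_calls["count"] += 1  # pretend this is an expensive database round trip
    return ["alice", "bob"]


cache = TTLCache(ttl_seconds=60)
first = cached_query(cache, "top-users", slow_query)
second = cached_query(cache, "top-users", slow_query)  # served from the cache
```

Only the first call reaches the database; the second is answered from memory until the entry expires, which is the load reduction the use case describes.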
Redis vs Memcached
Memcached
- Simple, no-frills, straight-forward
- You need to scale out and in as demand changes
- You need to run multiple CPU cores and threads
- You need to cache objects (e.g. database query results)
Redis
- You need encryption
- You need HIPAA compliance
- Support for clustering
- You need complex datatypes
- You need high-availability (replication)
- Pub/Sub capability
- Geospatial Indexing
- Backup and Restore
Other Database Options
Amazon Athena
- SQL engine overlaid on S3, based on Presto
- Query raw data objects as they sit in an S3 bucket
- Use or convert your data to Parquet format if possible for a big performance jump
- Similar in concept to Redshift Spectrum
Amazon Athena vs Amazon Redshift Spectrum
- Athena: Data lives mostly on S3 without the need to perform joins with other data sources
- Redshift Spectrum: Want to join S3 data with existing Redshift tables or create union products
- Supports Apache Parquet, JSON and Apache ORC formats
Amazon Quantum Ledger Database
- Based on blockchain concepts
- Provides immutable and transparent journal as a service without having to setup and maintain an entire blockchain framework
- Centralized design (as opposed to decentralized consensus-based design for common blockchain frameworks) allows for higher performance and scalability
- Append-only concept where each record contributes to the integrity of the chain
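The idea that each record contributes to the integrity of the chain can be shown with a simple hash chain. This is a generic illustration of the append-only concept, not QLDB's actual journal format or API; `ToyJournal`, `GENESIS_HASH` and the sample documents are assumptions for this sketch.

```python
import hashlib
import json
from typing import Any, Dict, List

GENESIS_HASH = "0" * 64  # placeholder hash that precedes the first entry


def _digest(previous_hash: str, document: Dict[str, Any]) -> str:
    # Each hash covers the document and the hash of the entry before it.
    payload = previous_hash + json.dumps(document, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ToyJournal:
    """Append-only log where every entry is chained to its predecessor."""

    def __init__(self) -> None:
        self._entries: List[Dict[str, Any]] = []

    def append(self, document: Dict[str, Any]) -> str:
        previous = self._entries[-1]["hash"] if self._entries else GENESIS_HASH
        entry_hash = _digest(previous, document)
        self._entries.append({"document": dict(document), "hash": entry_hash})
        return entry_hash

    def verify(self) -> bool:
        """Recompute the chain; any edited entry breaks every later hash."""
        previous = GENESIS_HASH
        for entry in self._entries:
            if _digest(previous, entry["document"]) != entry["hash"]:
                return False
            previous = entry["hash"]
        return True


journal = ToyJournal()
journal.append({"vehicle": "VIN-1", "owner": "alice"})
journal.append({"vehicle": "VIN-1", "owner": "bob"})
intact = journal.verify()

# Tampering with history (reaching into internals for demonstration only).
journal._entries[0]["document"]["owner"] = "mallory"
after_tampering = journal.verify()
```

Because every hash depends on the previous one, rewriting an old record is detectable without any distributed consensus, which matches the centralized design described above.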
Amazon Managed Blockchain
- Fully managed blockchain framework supporting open source frameworks of Hyperledger Fabric and Ethereum
- Distributed consensus-based concept consisting of a network, members (other AWS accounts), nodes (instances) and potentially applications
Amazon Timestream Database
- Fully managed database service specifically built for storing and analyzing time-series data
- Alternative to DynamoDB or Redshift and includes some built-in analytics like interpolation and smoothing
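Interpolation and smoothing are standard time-series operations, and a plain Python sketch shows what they do to a series of readings. This is not Timestream's query language or its exact algorithms; the helper names are hypothetical, and the code assumes evenly spaced samples with known first and last values.

```python
from typing import List, Optional


def interpolate_linear(values: List[Optional[float]]) -> List[float]:
    """Fill None gaps between known samples by linear interpolation."""
    known = [i for i, v in enumerate(values) if v is not None]
    if not known or known[0] != 0 or known[-1] != len(values) - 1:
        raise ValueError("first and last samples must be known")
    filled = list(values)
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled


def moving_average(values: List[float], window: int) -> List[float]:
    """Trailing moving average; early points use the samples seen so far."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Temperature readings with two missing samples from a sensor.
readings = [20.0, None, None, 26.0, 25.0]
filled = interpolate_linear(readings)       # [20.0, 22.0, 24.0, 26.0, 25.0]
smoothed = moving_average(filled, window=2)  # [20.0, 21.0, 23.0, 25.0, 25.5]
```

Having such functions built into the database means gaps from dropped sensor readings can be handled at query time rather than in application code.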
Use Cases
- Industrial Machinery
- Sensor Networks
- Equipment Telemetry
Amazon DocumentDB
- with MongoDB compatibility
- AWS’s invention that emulates the MongoDB API so it acts like MongoDB to existing clients and drivers
- Fully managed with all the good stuff (multi-AZ, HA, scalability, integrated with KMS, S3 backups)
- An option if you currently use MongoDB and want to get out of the server management
Amazon Elasticsearch
- Stores and indexes documents (JSON)
- Commonly used as part of the ELK stack:
- Elasticsearch - search and storage
- Kibana - analytics
- Logstash - intake
Other intake solutions:
Database Comparison
- Database on EC2
- Ultimate control over database
- Preferred DB not available under RDS
- Amazon RDS
- Need traditional database for OLTP
- Your data is well-formed and structured
- Amazon DynamoDB
- Name/value pair data or unpredictable data structure
- In-memory performance with persistence
- Amazon Redshift
- Massive amounts of data
- Primary OLAP workloads
- Amazon Neptune
- Relationships between objects are a major portion of the data's value
- Amazon ElastiCache
- Fast temporary storage for small amounts of data
- Highly volatile data