Data Stores


Amazon S3

“Secure, durable, highly scalable object storage at a very low cost. You can store and retrieve any amount of data, at any time, from anywhere on the web through a simple web service interface. You can write, read, and delete objects containing from zero to 5 TB of data.”

Amazon EBS (Elastic Block Store)

  • To be used with EC2
  • Bound to a single AZ
  • Use snapshots for backup
  • Snapshots can be shared across different accounts
  • Change AZ by launching a volume in different AZ from snapshot
  • Convert from unencrypted to encrypted volume through snapshot
  • Snapshots consume storage incrementally
    • AWS reorganizes the necessary data on snapshot deletion to guarantee that all remaining snapshots can still be restored
    • Snapshot lifecycle policies help automate the creation and deletion of the snapshots
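
The incremental behaviour described above can be sketched with a toy model. This is only an illustration of the idea (store changed blocks, fold data forward on deletion); the class and method names are hypothetical and it is not how EBS is actually implemented.

```python
class Volume:
    """Toy model of incremental snapshots (not the real EBS implementation)."""

    def __init__(self):
        self.blocks = {}      # block index -> current data
        self.dirty = set()    # blocks changed since the last snapshot
        self.snapshots = []   # oldest first: {"id": str, "blocks": {index: data}}

    def write(self, index, data):
        self.blocks[index] = data
        self.dirty.add(index)

    def snapshot(self, snap_id):
        # Store only the blocks that changed since the previous snapshot.
        delta = {i: self.blocks[i] for i in self.dirty}
        self.snapshots.append({"id": snap_id, "blocks": delta})
        self.dirty.clear()
        return len(delta)

    def delete_snapshot(self, snap_id):
        pos = next(i for i, s in enumerate(self.snapshots) if s["id"] == snap_id)
        removed = self.snapshots.pop(pos)
        if pos < len(self.snapshots):
            # Fold blocks the next snapshot does not override into it,
            # so every later snapshot stays restorable.
            successor = self.snapshots[pos]["blocks"]
            for index, data in removed["blocks"].items():
                successor.setdefault(index, data)
        else:
            # The newest snapshot was deleted: its blocks must be
            # captured again by the next snapshot.
            self.dirty.update(removed["blocks"])

    def restore(self, snap_id):
        # Replay deltas from the oldest snapshot up to the requested one.
        state = {}
        for snap in self.snapshots:
            state.update(snap["blocks"])
            if snap["id"] == snap_id:
                return state
        raise KeyError(snap_id)


vol = Volume()
vol.write(0, "a")
vol.write(1, "b")
first_size = vol.snapshot("s1")    # 2 blocks stored
vol.write(1, "c")
second_size = vol.snapshot("s2")   # only 1 changed block stored
vol.delete_snapshot("s1")
restored = vol.restore("s2")       # still a full image of the volume
```

The second snapshot stores a single block, yet it can still be restored in full after the first snapshot is deleted, because the deletion moves the still-needed block forward.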

Amazon EFS (Elastic File System)

  • Based on NFS (Network File System)
  • Multi-AZ storage
  • Pay based on the usage
  • Mount points in single or many AZs
  • NFS is not an encrypted protocol by itself; be cautious about mounting EFS over the Internet
  • EFS is as durable and available as S3
    • AWS DataSync - good alternative for synchronizing on-premises storage with EFS / S3
  • Beware of the cost!
    • EFS is about 3 times more expensive than EBS and about 20 times more expensive than S3

EFS Performance Considerations

  • Burst credits are allocated over time to control throughput
    • A new file system starts with a credit balance of 2.1 TiB, a baseline rate of 50 MiB/s per TiB of storage and a burst rate of 100 MiB/s
    • Tracked by the BurstCreditBalance metric in Amazon CloudWatch
  • Supports 2 different performance modes:
    • General purpose - default mode for FS with workload up to 7000 IOPS
    • Max I/O - workloads demanding higher than 7000 IOPS, optimized for applications where tens / hundreds / thousands of EC2 instances are accessing the file system
      • Systems scale to higher level of aggregate throughput
      • Tradeoff on slightly higher latencies for file operations
  • If the application can tolerate asynchronous writes, you can trade consistency for speed by enabling them.
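
A rough back-of-the-envelope sketch of the burst credit figures above. It assumes a simplified model in which credits accrue at the baseline rate and are spent at the actual throughput, so bursting drains only the difference; the function name is hypothetical and real EFS accounting may differ in detail.

```python
MIB_PER_TIB = 1024 * 1024


def burst_duration_hours(credit_tib, baseline_mib_s, burst_mib_s):
    """Hours a file system can sustain burst_mib_s before credits run out."""
    if burst_mib_s <= baseline_mib_s:
        return float("inf")  # at or below baseline, credits never drain
    drain_mib_s = burst_mib_s - baseline_mib_s
    return credit_tib * MIB_PER_TIB / drain_mib_s / 3600


# Initial 2.1 TiB balance, 50 MiB/s baseline, bursting at 100 MiB/s.
hours = burst_duration_hours(2.1, 50, 100)
print(f"{hours:.1f} h")  # 12.2 h
```

Under these assumptions the initial balance covers roughly half a day of continuous bursting, which is why watching BurstCreditBalance matters for sustained workloads.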

Amazon Storage Gateway

  • Virtual machine that can run on premises or on EC2
  • Provides local storage resources backed by AWS S3 and Glacier
  • Contains logic to synchronize data back-and-forth to S3
  • Useful in cloud migrations

Running Modes

  • File Gateway
    • Allows on-prem servers or EC2 instances to store objects in S3 via an NFS or SMB mount point
  • Volume Gateway
    • Volume Gateway Stored Mode / Gateway-stored Volumes
      • Async replication of data from on-prem to S3, uses iSCSI interface
    • Volume Gateway Cached Mode / Gateway-cached volumes
      • Primary data stored in S3 with frequent access data cached locally on-prem, uses iSCSI interface
  • Tape Gateway / Gateway-Virtual Tape Library
    • Virtual media changer and tape library for use with existing backup software, uses iSCSI interface

Cost Model

The following cost components should be considered when using AWS Storage Gateway:

  • gateway usage
  • snapshot storage usage
  • volume storage usage
  • virtual tape shelf storage
  • virtual tape library storage
  • retrieval from virtual tape shelf
  • data transfer out

Use Cases

  • Move on-premises backups to the cloud
  • Shift on-premises storage to cloud-backed file shares
  • Provide low-latency access for on-premises applications to cloud data

Amazon WorkDocs

  • Amazon’s alternative to Dropbox / Google Drive
  • Secure, fully managed file collaboration service
  • Can integrate with AD for SSO
  • Web, mobile and native clients (no Linux client yet)
  • HIPAA, PCI DSS and ISO compliant
  • SDK available for creating complementary apps

Database on EC2

  • Run any database with full control and ultimate flexibility
  • Self-managed backups, redundancy, patching, scale
  • Good option to run databases not supported by RDS yet

Amazon RDS

  • Managed database service
  • Supports most-popular database engines
  • Structured, relational databases
  • Automated backups and patching in pre-defined maintenance windows
  • Push-button scaling, replication and redundancy
  • Multi-AZ RDS
    • Standby instance replication is Synchronous
    • The standby can be promoted to master at any point in time without data loss
  • Read replication is asynchronous
  • Read replicas serve regional users
  • MariaDB is an open-source fork of MySQL
  • Note: Non-transactional storage engines like MyISAM don’t support replication; you must use InnoDB (XtraDB on MariaDB)

RDS Anti-Patterns

  • Large BLOBS - use S3
  • Automated scalability - use DynamoDB
  • Name/Value data structure - use DynamoDB
  • Data is not well structured or is unpredictable - use DynamoDB
  • Unsupported by RDS database - use EC2
  • Complete control over the database - use EC2

Amazon DynamoDB

  • Key-value store
  • Managed multi-AZ NoSQL data store
  • Cross-Region Replication option
  • Defaults to eventually consistent reads
  • SDK supports strongly consistent reads via a parameter
    • May slow down reads in case of outages in the write AZ
  • Priced on throughput
    • Read/Write Capacity Units
  • Autoscale capacity adjusts per configured min/max levels
    • DynamoDB won’t scale down
  • On-Demand Capacity provides flexible capacity at a small premium cost
  • Achieve ACID compliance with DynamoDB Transactions
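
A small sketch of how throughput pricing translates into capacity units. It assumes the commonly documented provisioned-capacity rules (one read unit covers a strongly consistent read of up to 4 KB per second, eventually consistent reads cost half, one write unit covers a write of up to 1 KB per second); the function names are hypothetical, and the figures should be checked against the current DynamoDB documentation.

```python
import math


def read_capacity_units(item_kb, reads_per_sec, strongly_consistent=False):
    """Estimate read capacity units for a steady read rate."""
    units_per_read = math.ceil(item_kb / 4)   # reads are metered in 4 KB steps
    total = units_per_read * reads_per_sec
    if not strongly_consistent:
        total = total / 2                     # eventual consistency costs half
    return math.ceil(total)


def write_capacity_units(item_kb, writes_per_sec):
    """Estimate write capacity units for a steady write rate."""
    return math.ceil(item_kb) * writes_per_sec  # writes are metered in 1 KB steps


print(read_capacity_units(6, 100, strongly_consistent=True))  # 200
print(read_capacity_units(6, 100))                            # 100
print(write_capacity_units(2.5, 10))                          # 30
```

The example shows why the default eventually consistent reads are attractive: the same workload needs half the read capacity.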

Relational vs NoSQL

  • Relational - structured data
  • NoSQL - self-contained records

NoSQL Indexes

  • Primary Key is used to create internal hash
  • Composite Primary Key consists of a partition key and a sort key
    • Can have duplicate of partition keys as long as the sort key is different
  • Global Secondary Index
    • If you want a fast query of attributes outside the primary key
  • Local Secondary Index
    • You know the partition key and want to quickly query on some other attribute
  • There is a limit to the number of indexes and attributes per index
  • Indexes take up storage space
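
The key and index concepts above can be illustrated with an in-memory toy table. This is a sketch only: `ToyTable`, `put`, `query` and `index_by` are hypothetical names, and the structure merely mimics the idea of a partition key plus sort key and of a secondary index on a non-key attribute.

```python
from collections import defaultdict


class ToyTable:
    """In-memory stand-in for a table with a composite primary key."""

    def __init__(self):
        # partition key -> {sort key: item}
        self._partitions = defaultdict(dict)

    def put(self, partition_key, sort_key, item):
        # Same partition key is fine as long as the sort key differs;
        # writing the same full key again overwrites the item.
        self._partitions[partition_key][sort_key] = item

    def query(self, partition_key, sk_from=None, sk_to=None):
        """Return items of one partition, ordered by sort key."""
        rows = self._partitions.get(partition_key, {})
        return [
            rows[sk]
            for sk in sorted(rows)
            if (sk_from is None or sk >= sk_from)
            and (sk_to is None or sk <= sk_to)
        ]

    def index_by(self, attribute):
        """Build a lookup on a non-key attribute (loosely like a secondary index)."""
        index = defaultdict(list)
        for rows in self._partitions.values():
            for sk in sorted(rows):
                item = rows[sk]
                if attribute in item:
                    index[item[attribute]].append(item)
        return dict(index)


table = ToyTable()
table.put("alice", "2024-01-05", {"order": "A1", "status": "shipped"})
table.put("alice", "2024-02-10", {"order": "A2", "status": "open"})
table.put("bob", "2024-01-07", {"order": "B1", "status": "open"})

alice_orders = table.query("alice")
alice_recent = table.query("alice", sk_from="2024-02-01")
open_orders = table.index_by("status")["open"]
```

Note that the index is a second copy of the data, which matches the point that indexes take up storage space.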

Amazon Redshift

  • Fully managed, clustered, petabyte-scale data warehouse
  • Extremely cost-effective compared to some on-premises data warehouse platforms
  • PostgreSQL compatible with JDBC and ODBC drivers available; compatible with most BI tools out of the box
  • Features parallel processing and columnar data stores, which are optimized for complex queries
  • Option to query directly from data files on S3 via Redshift Spectrum
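
To see why a columnar layout suits analytical queries, here is a minimal sketch with hypothetical sample data. It only illustrates the storage idea (an aggregate touches one column instead of every full row) and says nothing about Redshift's actual on-disk format.

```python
def to_columns(rows):
    """Pivot a list of row dicts into one list per column."""
    if not rows:
        return {}
    return {name: [row[name] for row in rows] for name in rows[0]}


# Row-oriented: each record is kept together.
orders = [
    {"order_id": 1, "region": "eu", "amount": 20.0},
    {"order_id": 2, "region": "us", "amount": 35.5},
    {"order_id": 3, "region": "eu", "amount": 4.5},
]

# Column-oriented: each column is kept together, so an aggregate
# over "amount" never has to read "order_id" or "region".
columns = to_columns(orders)
total_amount = sum(columns["amount"])
print(total_amount)  # 60.0
```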

Data Lake

  • Large repository for a variety of data
  • Query raw data without extensive pre-processing
  • Lessen time from data collection to data value
  • Identify correlations between disparate data sets
  • Data can be located on AWS S3 and queried from BI tools using Amazon Redshift Spectrum

Amazon Neptune

  • Fully-managed Graph database
  • Optimized to deal with relationships between objects
    • Lets you store interrelationships and query them very efficiently
  • Supports open graph APIs for both Gremlin and SPARQL

Amazon ElastiCache

  • Fully managed implementations of 2 popular in-memory data stores - Redis and Memcached
  • Push-button scalability for memory, writes and reads
  • In-memory key/value store - not persistent in the traditional sense…
  • Billed by node size and hours of use

Use Cases

  • Web Session Store
    • Stateless application
  • Database Caching
    • Offload load from database servers, return results faster to users
  • Leaderboards
    • Provide live leaderboard for millions of users in your mobile app
  • Streaming Data Dashboards
    • Provide a landing spot for streaming sensor data on the factory floor, providing live real-time dashboard displays.
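
The database caching use case is often implemented with a cache-aside pattern. The sketch below uses a plain in-process dictionary with a time-to-live as a stand-in for Redis or Memcached; `TTLCache`, `get_user` and the `load_from_db` callback are hypothetical names introduced only for illustration.

```python
import time


class TTLCache:
    """Tiny in-process cache with expiry, standing in for a cache node."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._data = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired entries count as a miss
            return None
        return value

    def set(self, key, value):
        self._data[key] = (self._clock() + self._ttl, value)


def get_user(user_id, cache, load_from_db):
    """Cache-aside read: try the cache first, fall back to the database."""
    cached = cache.get(user_id)
    if cached is not None:
        return cached
    value = load_from_db(user_id)  # slow path: hits the database
    cache.set(user_id, value)      # populate the cache for later reads
    return value
```

Repeated calls to `get_user` within the TTL are served from memory, so the database only sees the first request and the ones that arrive after expiry.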

Redis vs Memcached

Memcached

  • Simple, no-frills, straight-forward
  • You need to scale out and in as demand changes
  • You need to run multiple CPU cores and threads
  • You need to cache objects (e.g. database query results)

Redis

  • You need encryption
  • You need HIPAA compliance
  • Support for clustering
  • You need complex datatypes
  • You need high-availability (replication)
  • Pub/Sub capability
  • Geospatial Indexing
  • Backup and Restore

Other Database Options

Amazon Athena

  • SQL engine overlaid on S3, based on Presto
  • Query raw data objects as they sit in an S3 bucket
  • Use or convert your data to Parquet format if possible for a big performance jump
  • Similar in concept to Redshift Spectrum

Amazon Athena vs Amazon Redshift Spectrum

  • Athena: Data lives mostly on S3 without the need to perform joins with other data sources
  • Redshift Spectrum: Want to join S3 data with existing Redshift tables or create union products
  • Supports Apache Parquet, JSON and Apache ORC formats

Amazon Quantum Ledger Database

  • Based on blockchain concepts
  • Provides immutable and transparent journal as a service without having to setup and maintain an entire blockchain framework
  • Centralized design (as opposed to decentralized consensus-based design for common blockchain frameworks) allows for higher performance and scalability
  • Append-only concept where each record contributes to the integrity of the chain
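
The append-only idea, where each record contributes to the integrity of the chain, can be sketched as a simple hash chain. This is a minimal illustration assuming SHA-256 over a JSON payload; the `Journal` class is hypothetical and does not reflect QLDB's actual journal format or API.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash, data):
    payload = json.dumps({"prev": prev_hash, "data": data}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class Journal:
    """Append-only log in which every entry commits to the one before it."""

    def __init__(self):
        self._entries = []

    def append(self, data):
        prev_hash = self._entries[-1]["hash"] if self._entries else GENESIS
        entry = {"data": data, "prev": prev_hash, "hash": _digest(prev_hash, data)}
        self._entries.append(entry)
        return entry["hash"]

    def verify(self):
        """Recompute every hash; any edited record breaks the chain."""
        prev_hash = GENESIS
        for entry in self._entries:
            expected = _digest(prev_hash, entry["data"])
            if entry["prev"] != prev_hash or entry["hash"] != expected:
                return False
            prev_hash = entry["hash"]
        return True


journal = Journal()
journal.append({"account": "42", "balance": 100})
journal.append({"account": "42", "balance": 80})
intact = journal.verify()  # True while nothing has been altered
```

Changing any earlier record after the fact makes `verify()` return False, because its stored hash no longer matches and every later entry depends on it.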

Amazon Managed Blockchain

  • Fully managed blockchain framework supporting open source frameworks of Hyperledger Fabric and Ethereum
  • Distributed consensus-based concept consisting of a network, members (other AWS accounts), nodes (instances) and potentially applications

Amazon Timestream Database

  • Fully managed database service specifically built for storing and analyzing time-series data
  • Alternative to DynamoDB or Redshift and includes some built-in analytics like interpolation and smoothing
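
For readers unfamiliar with the two analytics mentioned above, here is a plain-Python sketch of linear interpolation and moving-average smoothing over hypothetical sensor readings. It shows the concepts only; Timestream exposes such operations through its own query functions, which are not reproduced here.

```python
def interpolate(points, t):
    """Linearly interpolate a value at time t.

    points: list of (time, value) pairs with strictly increasing times.
    """
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        if t0 <= t <= t1:
            return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
    raise ValueError("t is outside the sampled range")


def moving_average(values, window):
    """Smooth a series by averaging each run of `window` consecutive values."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


readings = [(0, 10.0), (10, 20.0), (20, 40.0)]  # (seconds, temperature)
print(interpolate(readings, 5))             # 15.0
print(moving_average([1, 2, 3, 4, 5], 3))   # [2.0, 3.0, 4.0]
```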

Use Cases

  • Industrial Machinery
  • Sensor Networks
  • Equipment Telemetry

Amazon DocumentDB

  • Document database with MongoDB compatibility
  • AWS’s invention that emulates the MongoDB API so it acts like MongoDB to existing clients and drivers
  • Fully managed with all the good stuff (multi-AZ, HA, scalability, integrated with KMS, S3 backups)
  • An option if you currently use MongoDB and want to get out of the server management

Amazon Elasticsearch

  • Stores and indexes documents (JSON)
  • Commonly used as part of the ELK stack:
    • Elasticsearch - search and storage
    • Kibana - analytics
    • Logstash - intake

Other intake solutions:

  • CloudWatch
  • Firehose
  • IoT

Database Comparison

  • Database on EC2
    • Ultimate control over database
    • Preferred DB not available under RDS
  • Amazon RDS
    • Need traditional database for OLTP
    • Your data is well-formed and structured
  • Amazon DynamoDB
    • Name/value pair data or unpredictable data structure
    • In-memory performance with persistence
  • Amazon Redshift
    • Massive amounts of data
    • Primary OLAP workloads
  • Amazon Neptune
    • Relationships between objects are a major portion of data value
  • Amazon ElastiCache
    • Fast temporary storage for small amounts of data
    • Highly volatile data