Deep Dive on Amazon S3 & Amazon Glacier Storage Management (AWS re:Invent 2017)

Storage Management on S3

  • Organize
    • Object Tagging
  • Monitor and Analyze
    • S3 Inventory
    • Amazon CloudWatch
    • Storage Class Analysis
    • AWS CloudTrail
  • Act
    • Cross-Region Replication
    • Event Notification
    • Lifecycle Policy
  • Security Management
    • AWS KMS
    • AWS IAM
    • Bucket Permissions Check
    • Encryption Status in S3 Inventory
    • Default Encryption
    • AWS Trusted Advisor
    • Amazon Macie

User Permission Management By Tagging

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject"
            ],
            "Resource": "arn:aws:s3:::Project-bucket/*",
            "Condition": {"StringEquals": {"s3:ExistingObjectTag/Project": "x"}}
        }
    ]
}
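
The same policy can be generated per project tag. A minimal sketch, assuming the condition key s3:ExistingObjectTag/<key> (the key S3 evaluates against tags already stored on an object); the function name and the bucket and tag values are hypothetical.

```python
import json


def build_tag_read_policy(bucket: str, tag_key: str, tag_value: str) -> dict:
    """Return an IAM policy that allows GetObject only on objects carrying the tag."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
                # s3:ExistingObjectTag matches tags already stored on the object.
                "Condition": {
                    "StringEquals": {f"s3:ExistingObjectTag/{tag_key}": tag_value}
                },
            }
        ],
    }


# Hypothetical bucket name and tag, for illustration only.
policy = build_tag_read_policy("project-bucket", "Project", "x")
print(json.dumps(policy, indent=4))
```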

S3 Inventory

  • Generates a CSV or ORC file listing all objects in an S3 bucket that match the filter criteria.
  • Feeds business workflows and applications such as secondary indexes, garbage collection, data auditing, and offline analytics.

Features:

  • Save time
  • Daily or Weekly delivery
  • Delivery notification
  • Delivery to S3 bucket
  • Same set of metadata as the LIST API
  • Can add size, last modified date, storage class, ETag, or replication status
  • Object-level Encryption Status
  • Encrypt Inventory with SSE-S3 or SSE-KMS
  • CSV or ORC output format
  • Query with Athena, Redshift Spectrum or any Hive tools
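
The features above map onto an inventory configuration document. A minimal sketch that builds one as a plain dict, in the shape boto3's put_bucket_inventory_configuration accepts as InventoryConfiguration; the function name, configuration ID, prefix, and bucket ARN are hypothetical, and no AWS call is made here.

```python
def build_inventory_config(config_id, dest_bucket_arn, frequency="Daily", fmt="ORC"):
    """Build an S3 Inventory configuration dict (no AWS call is made)."""
    if frequency not in ("Daily", "Weekly"):
        raise ValueError("frequency must be 'Daily' or 'Weekly'")
    if fmt not in ("CSV", "ORC"):
        raise ValueError("fmt must be 'CSV' or 'ORC'")
    return {
        "Id": config_id,
        "IsEnabled": True,
        "IncludedObjectVersions": "Current",
        "Schedule": {"Frequency": frequency},
        # Optional metadata columns added on top of the default bucket/key listing.
        "OptionalFields": [
            "Size",
            "LastModifiedDate",
            "StorageClass",
            "ETag",
            "ReplicationStatus",
            "EncryptionStatus",
        ],
        "Destination": {
            "S3BucketDestination": {
                "Bucket": dest_bucket_arn,
                "Format": fmt,
                "Prefix": "inventory",
                # Encrypt the report itself with SSE-S3.
                "Encryption": {"SSES3": {}},
            }
        },
    }


# Hypothetical ID and destination bucket.
inventory_config = build_inventory_config(
    "weekly-orc-inventory", "arn:aws:s3:::example-inventory-bucket", frequency="Weekly"
)
```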

S3 Inventory can be queried with Amazon Athena:

CREATE EXTERNAL TABLE my_inventory_table(
    `bucket` string,
    `key` string,
    `version_id` string,
    `is_latest` boolean, 
    `is_delete_marker` boolean, 
    `size` bigint, 
    `last_modified_date` timestamp, 
    `e_tag` string,
    `storage_class` string, 
    `is_multipart_uploaded` boolean,
    `replication_status` string,
    `encryption_status` string
)
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://bucketname/inventory/output_destination/hive';
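
Once the table exists and its partitions are loaded, the encryption_status column can drive an audit query. A minimal sketch that renders such a query as a string, assuming 'NOT-SSE' is the value S3 Inventory reports for unencrypted objects; the function name and the dt value are hypothetical, and the query is not executed here.

```python
import re


def unencrypted_objects_query(table: str, dt: str) -> str:
    """Render an Athena query listing objects without server-side encryption."""
    # Values are interpolated into SQL, so restrict them to safe characters.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", table):
        raise ValueError("unexpected table name")
    if not re.fullmatch(r"[0-9-]+", dt):
        raise ValueError("unexpected dt partition value")
    return (
        f"SELECT key, size, storage_class FROM {table} "
        f"WHERE dt = '{dt}' AND encryption_status = 'NOT-SSE' "
        "ORDER BY size DESC"
    )


# Hypothetical partition value; use a dt that exists under the inventory prefix.
query = unencrypted_objects_query("my_inventory_table", "2017-12-01-00-00")
print(query)
```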

Storage Class Analysis

  • Data-driven storage management for S3
  • Daily Storage Class Analysis
  • Export Analysis data to your S3 Bucket
  • Filter by Bucket, Prefix, or Object Tags

Process:

  1. Monitors access patterns to understand your storage usage
  2. After 30 days, recommends when to move objects to Standard-Infrequent Access (Standard-IA)
  3. Export file includes a daily report of storage, retrieved bytes, and GETs by object age

Object-Level Logging

  • Logs object-level read and write events (data events) to AWS CloudTrail
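
Object-level logging is configured on a trail through event selectors. A minimal sketch of one selector, in the shape boto3's CloudTrail put_event_selectors accepts in EventSelectors; the function name and bucket name are hypothetical, and no AWS call is made.

```python
def build_s3_data_event_selector(bucket, prefix="", read_write_type="All"):
    """Build a CloudTrail event selector that logs S3 object-level events."""
    if read_write_type not in ("ReadOnly", "WriteOnly", "All"):
        raise ValueError("read_write_type must be ReadOnly, WriteOnly, or All")
    return {
        "ReadWriteType": read_write_type,
        "IncludeManagementEvents": True,
        "DataResources": [
            {
                "Type": "AWS::S3::Object",
                # A trailing slash after the bucket name covers every object in it.
                "Values": [f"arn:aws:s3:::{bucket}/{prefix}"],
            }
        ],
    }


# Hypothetical bucket; log only write events on objects under logs/.
selector = build_s3_data_event_selector("example-bucket", "logs/", "WriteOnly")
```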

Cross-Region Replication (CRR)

Use cases:

  • Compliance
  • Lower latency
  • Security

Features:

  • Ownership overwrite for cross-account CRR
  • Support SSE-KMS Encrypted objects
  • Choose any S3 Storage Class as target
  • Choose any AWS region as target
  • Bi-directional replication
  • Lifecycle Policy
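
Several of these features are options on the replication rule itself. A minimal sketch that builds a replication configuration in the shape boto3's put_bucket_replication accepts as ReplicationConfiguration; the function name, rule ID, ARNs, and account ID are hypothetical, and no AWS call is made.

```python
def build_crr_config(role_arn, dest_bucket_arn, storage_class="STANDARD_IA",
                     dest_account=None, replica_kms_key_arn=None):
    """Build a cross-region replication configuration dict."""
    destination = {"Bucket": dest_bucket_arn, "StorageClass": storage_class}
    if dest_account:
        # Ownership overwrite: replicas become owned by the destination account.
        destination["Account"] = dest_account
        destination["AccessControlTranslation"] = {"Owner": "Destination"}
    rule = {
        "ID": "crr-all-objects",
        "Prefix": "",  # empty prefix replicates every object
        "Status": "Enabled",
        "Destination": destination,
    }
    if replica_kms_key_arn:
        # Opt in to replicating SSE-KMS objects and name the key used for replicas.
        rule["SourceSelectionCriteria"] = {"SseKmsEncryptedObjects": {"Status": "Enabled"}}
        destination["EncryptionConfiguration"] = {"ReplicaKmsKeyID": replica_kms_key_arn}
    return {"Role": role_arn, "Rules": [rule]}


# Hypothetical role, bucket, and account identifiers.
crr_config = build_crr_config(
    "arn:aws:iam::111111111111:role/example-crr-role",
    "arn:aws:s3:::example-replica-bucket",
    dest_account="222222222222",
)
```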

Automate with Trigger-Based Workflows: Amazon S3 Event Notifications

  • Notifications when objects are created (Put, Post, Copy, or Multipart Upload) or deleted
  • Filter on prefixes and suffixes
  • Trigger workflows with Amazon SNS, Amazon SQS, and AWS Lambda functions
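
On the receiving side, a Lambda function reads the bucket and key out of each notification record. A minimal sketch of such a handler, assuming the standard S3 event record layout; the sample event is hand-written for illustration and the bucket and key are hypothetical.

```python
from urllib.parse import unquote_plus


def handler(event, context=None):
    """Extract (event name, bucket, key) from each S3 notification record."""
    results = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        results.append({
            "event": record["eventName"],
            "bucket": s3["bucket"]["name"],
            # Object keys arrive URL-encoded in notifications, so decode them.
            "key": unquote_plus(s3["object"]["key"]),
        })
    return results


# Hand-written sample event with a single record.
sample_event = {
    "Records": [
        {
            "eventName": "ObjectCreated:Put",
            "s3": {
                "bucket": {"name": "example-bucket"},
                "object": {"key": "logs/2017-06/my+file.gz"},
            },
        }
    ]
}
processed = handler(sample_event)
```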

Default Encryption

  • Automatically encrypts all objects written to your Amazon S3 bucket
  • Choose SSE-S3 or SSE-KMS
  • Makes it easy to satisfy compliance needs
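
The SSE-S3 versus SSE-KMS choice is a single rule in the bucket's encryption configuration. A minimal sketch that builds it in the shape boto3's put_bucket_encryption accepts as ServerSideEncryptionConfiguration; the function name and key ARN are hypothetical, and no AWS call is made.

```python
def build_default_encryption(kms_key_id=None):
    """Build a default-encryption configuration: SSE-KMS if a key is given, else SSE-S3."""
    if kms_key_id:
        default = {"SSEAlgorithm": "aws:kms", "KMSMasterKeyID": kms_key_id}
    else:
        default = {"SSEAlgorithm": "AES256"}  # SSE-S3
    return {"Rules": [{"ApplyServerSideEncryptionByDefault": default}]}


sse_s3 = build_default_encryption()
# Hypothetical key ARN.
sse_kms = build_default_encryption("arn:aws:kms:us-east-1:111111111111:key/example-key-id")
```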

Amazon Macie

  • Security service that uses machine learning to automatically discover, classify and protect sensitive data in AWS
  • Recognizes sensitive data
  • Continuously monitors data access
  • Provides dashboards and alerts

Alert Logic Use Case on Amazon S3

S3 Object Management

  • S3 Object Keys use hash prefix for performance: logmsgs-001:/X-OGA/11543.2016-03/...
  • S3 Objects written with two Tags
    • Customer identifier (cid=1234567890)
    • Date (date=2017-06)
  • AWS KMS used to generate data encryption keys
    • Customer Master Key (CMK) for each data type with automatic rotation enabled
    • Data Keys generated per-customer/per-month

Tags with Lifecycle Expiration Policies

  • Per Customer Expiration Rule
  • Uses cid and date tags as filter
  • Independent of object creation time

<Rule>
    <ID>expiration-12345</ID>
    <Status>Enabled</Status>
    <Filter>
        <And>
            <Tag>
                <Key>cid</Key>
                <Value>12345</Value>
            </Tag>
            <Tag>
                <Key>date</Key>
                <Value>2015-09</Value>
            </Tag>
        </And>
    </Filter>
    <Expiration>
        <!-- Selection depends entirely on the tag values; S3 requires Days to be a positive integer -->
        <Days>1</Days>
    </Expiration>
</Rule>
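
Per-customer rules like the one above lend themselves to generation. A minimal sketch that builds the rule as a dict in the shape boto3's put_bucket_lifecycle_configuration accepts inside Rules; the function name, the rule ID pattern, and the default of one day are assumptions (S3 documents Days as a positive integer).

```python
def build_expiration_rule(cid: str, month: str, days: int = 1) -> dict:
    """Build a lifecycle rule expiring objects tagged with this customer and month."""
    if days < 1:
        raise ValueError("days must be a positive integer")
    return {
        "ID": f"expiration-{cid}-{month}",  # hypothetical naming pattern
        "Status": "Enabled",
        "Filter": {
            # Both tags must match for the rule to apply.
            "And": {
                "Tags": [
                    {"Key": "cid", "Value": cid},
                    {"Key": "date", "Value": month},
                ]
            }
        },
        "Expiration": {"Days": days},
    }


expiration_rule = build_expiration_rule("12345", "2015-09")
```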

Tags with Lifecycle Transition Policies

  • One Transition Rule per month
  • Uses date tag as filter

<Rule>
    <ID>transition-ia-3months</ID>
    <Status>Enabled</Status>
    <Filter>
        <Tag>
            <Key>date</Key>
            <Value>2016-07</Value>
        </Tag>
    </Filter>
    <Transition>
        <StorageClass>STANDARD_IA</StorageClass>
    </Transition>
</Rule>
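
With one transition rule per month, the rule set can be generated from a range of months. A minimal sketch, assuming a single-tag filter on date and a hypothetical transition age of 30 days (the notes do not state the value used); the function names and rule ID pattern are also hypothetical.

```python
def months_between(start: str, end: str) -> list:
    """List 'YYYY-MM' strings from start to end, inclusive."""
    year, month = map(int, start.split("-"))
    end_year, end_month = map(int, end.split("-"))
    months = []
    while (year, month) <= (end_year, end_month):
        months.append(f"{year:04d}-{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return months


def build_transition_rule(month: str, days: int = 30) -> dict:
    """Build a lifecycle rule moving objects tagged with this month to Standard-IA."""
    return {
        "ID": f"transition-ia-{month}",
        "Status": "Enabled",
        "Filter": {"Tag": {"Key": "date", "Value": month}},
        # days is an assumed value, not taken from the talk.
        "Transitions": [{"Days": days, "StorageClass": "STANDARD_IA"}],
    }


transition_rules = [build_transition_rule(m) for m in months_between("2016-11", "2017-02")]
```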

Demonstrated Scale of the Storage Solution (AWS re:Invent 2017)

  • Scaled workload 100x successfully
    • 140PB/month of customer data
    • 30k writes/second sustained
    • Write latency 200ms at 95th percentile
    • Read latency 125ms at 95th percentile
  • Limited only by resources driving traffic