CRISP-DM on AWS

CRISP-DM

  • Stands for Cross-Industry Standard Process for Data Mining
  • Is a framework for data science projects

Phases of CRISP-DM

  1. Business Understanding
  2. Data Understanding
  3. Data Preparation
  4. Modeling
  5. Evaluation
  6. Deployment

Business Understanding

1. Understanding business requirements

  • Questions from the business perspective which need answering
  • Highlight project’s critical features
  • People and resources required

2. Analyzing supporting information

  • List required resources and assumptions
  • Analyze associated risks
  • Plan for contingencies
  • Compare costs and benefits

3. Converting to a Data Mining or Machine Learning problem

  • Review the machine learning question
  • Create a technical data mining objective
  • Define the criteria for a successful outcome of the project

4. Preparing a preliminary plan

  • Number and duration of stages
  • Dependencies
  • Risks
  • Goals
  • Evaluation methods
  • Tools and techniques

Data Understanding

  1. Data collection
    • Detail the various sources and the steps to extract the data
    • Analyze the data for additional requirements
    • Consider other data sources
  2. Data properties
    • Describe the data, the amount of data used, and metadata properties
    • Find key features and relationships in the data
    • Use tools and techniques to explore data properties (see the pandas sketch after this list)
  3. Quality
    • Verify attributes
    • Identify missing data
    • Reveal inconsistencies
    • Report solutions
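
A minimal pandas sketch of these data-understanding steps, assuming the data has already been extracted locally (the file name and columns are hypothetical):

```python
import pandas as pd

# Load the extracted dataset (hypothetical file name)
df = pd.read_csv("customers.csv")

# Data properties: amount of data, attribute types, summary statistics
print(df.shape)        # number of records and attributes
print(df.dtypes)       # data type of each attribute
print(df.describe())   # summary statistics for numeric columns

# Find key features and relationships in the data
print(df.corr(numeric_only=True))  # pairwise correlations between numeric attributes

# Quality: identify missing data and inconsistencies
print(df.isnull().sum())      # missing values per attribute
print(df.duplicated().sum())  # duplicated records
```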

AWS tools for Data Understanding

  • Amazon Athena
    • Run interactive SQL queries on Amazon S3 data (see the boto3 sketch after this list)
    • Schema-on-read
    • Serverless
  • Amazon QuickSight
    • Fast, cloud-powered business intelligence and data visualization service
  • AWS Glue
    • Managed Extract-Transform-Load (ETL) service
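
Athena can be driven programmatically as well as from the console. A minimal boto3 sketch, assuming an existing Athena database, table, and results bucket (all names here are hypothetical):

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Submit an interactive SQL query against data stored in S3 (schema-on-read)
response = athena.start_query_execution(
    QueryString="SELECT customer_id, COUNT(*) AS orders FROM sales GROUP BY customer_id",
    QueryExecutionContext={"Database": "analytics"},  # hypothetical database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # hypothetical bucket
)

# Athena is asynchronous: poll get_query_execution() with this id,
# then read the results from the S3 output location.
print(response["QueryExecutionId"])
```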

Data Preparation Tasks & Modeling

Data Preparation Tasks

1. Final Dataset Selection

  • Total size
  • Included and Excluded columns
  • Record selection
  • Data type

2. Data Preparation

  • Cleaning (see the pandas sketch after this list)
    • Missing data
      • Drop rows
      • Substitute a default value or the mean value
      • Use statistical methods to estimate the value
    • Address noisy values
  • Transformation
    • Derive additional attributes from the original ones
    • Normalization
    • Attribute transformation
  • Merging
    • Merge data sources into the final dataset
    • Null values introduced here may require another cleaning iteration
  • Formatting
    • Rearrange attributes
    • Randomly shuffle the data
    • Remove constraints of the modeling tool (e.g., unsupported Unicode characters)
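
A minimal pandas sketch of these preparation tasks (the file names and columns are hypothetical):

```python
import pandas as pd

df = pd.read_csv("customers.csv")  # hypothetical input

# Cleaning: drop rows missing the target, fill a numeric gap with the mean
df = df.dropna(subset=["churned"])
df["age"] = df["age"].fillna(df["age"].mean())

# Transformation: derive an additional attribute and min-max normalize another
df["spend_per_order"] = df["total_spend"] / df["orders"]
df["total_spend"] = (df["total_spend"] - df["total_spend"].min()) / (
    df["total_spend"].max() - df["total_spend"].min()
)

# Merging: join a second source; nulls introduced here may need another cleaning pass
regions = pd.read_csv("regions.csv")  # hypothetical second source
df = df.merge(regions, on="customer_id", how="left")

# Formatting: randomly shuffle the records before modeling
df = df.sample(frac=1, random_state=42).reset_index(drop=True)
```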

Data Modeling

1. Model selection and creation

  • Identify the modeling technique (regression for numeric values, recurrent neural networks for sequence prediction…)
  • Note constraints of the modeling technique and tool

2. Model testing plan

  • Test/train dataset split (30% test, 70% train)
  • Model evaluation criterion

3. Parameter tuning/testing

  • Build multiple models with different parameter settings
  • Describe the trained models and report on the findings (see the scikit-learn sketch after this list)
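
A minimal scikit-learn sketch of the testing plan and parameter tuning, assuming the prepared DataFrame from the earlier pandas sketch (the feature and target columns are hypothetical):

```python
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression

X = df[["age", "total_spend", "spend_per_order"]]  # hypothetical features
y = df["churned"]                                  # hypothetical target

# Model testing plan: 70% train / 30% test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Parameter tuning: build multiple models with different parameter settings
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
)
search.fit(X_train, y_train)

# Evaluate the best model against the evaluation criterion (accuracy here)
print(search.best_params_, search.score(X_test, y_test))
```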

AWS Tools for Data Preparation and Modeling

  • Amazon EMR + Spark (see the PySpark sketch after this list)
    • IPython notebooks, Zeppelin notebooks, RStudio
    • Scala, Python, R, Java, SQL
    • Cost savings: leverage Spot Instances for task nodes
  • Amazon EC2 + Deep Learning AMI
    • GPU CUDA support for training
    • Preinstalled deep learning frameworks
      • MXNet, TensorFlow, Caffe2, Torch, Keras, Theano…
    • Includes the Anaconda data science platform for Python, with popular libraries such as NumPy and scikit-learn
    • RStudio can also be installed on the EC2 Deep Learning AMI
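
A minimal PySpark sketch of the kind of preparation job that would run on an EMR cluster, e.g. via spark-submit (the S3 paths are hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("data-prep").getOrCreate()

# Read raw data from S3 (hypothetical bucket), then clean and aggregate it at scale
df = spark.read.csv("s3://my-raw-data/sales/", header=True, inferSchema=True)
summary = (
    df.dropna(subset=["customer_id"])
      .groupBy("customer_id")
      .agg(F.sum("amount").alias("total_spend"), F.count("*").alias("orders"))
)

# Write the prepared dataset back to S3 for the modeling step
summary.write.mode("overwrite").parquet("s3://my-prepared-data/customers/")
```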

Evaluation

  • Accuracy of the model
  • Model generalization on unseen/unknown data
  • Evaluation of the model using existing business criteria

1. Reviewing the project

  • Assess the steps taken in each phase
  • Perform quality assurance checks

2. Make the final decision to deploy or not

Based on the complete evaluation and the business goals' acceptance criteria, we decide whether the model will be deployed. This requires careful analysis of the false positives and false negatives (see the sketch below).
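
A minimal scikit-learn sketch of that analysis, reusing the tuned model and test split from the modeling sketch above:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

y_pred = search.predict(X_test)

# Accuracy of the model on unseen data (generalization)
print("accuracy:", accuracy_score(y_test, y_pred))

# Confusion matrix: the false positives and false negatives that
# drive the deploy / don't-deploy decision
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(f"true neg={tn}, false pos={fp}, false neg={fn}, true pos={tp}")
```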

Running Jupyter Notebook on EC2 Instance

  1. Create an instance using the Deep Learning AMI
  2. Connect to the instance using SSH
  3. Run screen so the notebook server keeps running if the SSH session disconnects (read more …)
  4. Run Jupyter Notebook: jupyter notebook --no-browser OR jupyter notebook --no-browser --ip=0.0.0.0 --port=[choose your port]
  5. Copy-paste the URL containing the token into your browser and access the example notebooks

Deployment

1. Planning deployment

  • Runtime
    • Amazon EC2 instances
    • Amazon EC2 Container Service (ECS)
    • AWS Lambda
      • A trained model can be saved to S3 and then loaded in the Lambda function (see the sketch after this list)
  • Application Deployment
    • AWS CodeDeploy
    • AWS OpsWorks
    • AWS Elastic Beanstalk
  • Infrastructure Deployment
    • AWS CloudFormation
    • AWS OpsWorks
    • AWS Elastic Beanstalk
  • Code Management
    • AWS CodeCommit
    • AWS CodePipeline
    • AWS Elastic Beanstalk
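
A minimal sketch of that Lambda pattern, assuming a scikit-learn model serialized with joblib to a hypothetical S3 bucket and key:

```python
import boto3
import joblib

s3 = boto3.client("s3")
MODEL_PATH = "/tmp/model.joblib"  # Lambda's writable scratch space

# Download and deserialize the trained model once per container,
# outside the handler, so warm invocations reuse it
s3.download_file("my-model-bucket", "models/model.joblib", MODEL_PATH)  # hypothetical bucket/key
model = joblib.load(MODEL_PATH)

def handler(event, context):
    # Expect a list of feature vectors in the invocation payload
    predictions = model.predict(event["instances"])
    return {"predictions": predictions.tolist()}
```

Note that scikit-learn and joblib are not part of the default Lambda runtime, so they would need to be bundled in the deployment package or a Lambda layer.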

2. Maintenance and monitoring

Monitoring

  • Amazon CloudWatch (see the boto3 sketch after this list)
  • AWS CloudTrail
  • AWS Elastic Beanstalk
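
A minimal boto3 sketch for publishing a custom model metric to CloudWatch so it can be monitored and alarmed on (the namespace, metric name, and value are hypothetical):

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Publish a custom metric, e.g. the share of positive predictions,
# so drift in the model's behavior can be tracked over time
cloudwatch.put_metric_data(
    Namespace="ChurnModel",  # hypothetical namespace
    MetricData=[{
        "MetricName": "PositivePredictionRate",
        "Value": 0.12,
        "Unit": "None",
    }],
)
```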

3. Final report

  • Highlight processes used in the project
  • Analyze if all the goals for the project were met
  • Detail the findings of the project
  • Identify and explain the model used and reason behind using the model
  • Identify the customer groups to target using this model

4. Project review

  • Assess the outcomes of the project
  • Summarize the results and write thorough documentation
  • Document common pitfalls encountered
  • Review whether the right ML solution was chosen
  • Generalize the whole process to make it useful for the next iteration