CRISP-DM on AWS

CRISP-DM

Stands for Cross-Industry Standard Process for Data Mining
Is a framework for data science projects

Phases of CRISP-DM

Business Understanding
Data Understanding
Data Preparation
Modeling
Evaluation
Deployment

Business Understanding

1. Understanding business requirements

Questions from the business perspective which need answering
Highlight the project's critical features
People and resources required

2. Analyzing supporting information

List required resources and assumptions
Analyze associated risks
Plan for contingencies
Compare costs and benefits

3. Converting to a Data Mining or Machine Learning problem

Review machine learning question
Create technical data mining objective
Define the criteria for a successful project outcome

4. Preparing a preliminary plan

Number and duration of stages
Dependencies
Risks
Goals
Evaluation methods
Tools and techniques

Data Understanding

Data collection
- Detail the various sources and the steps to extract data
- Analyze data for additional requirements
- Consider other data sources
Data properties
- Describe the data, amount of data used, and metadata properties
- Find key features and relationships in the data
- Use tools and techniques to explore data properties
Quality
- Verify attributes
- Identify missing data
- Reveal inconsistencies
- Report solutions
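
The quality checks above (verifying attributes, identifying missing data, revealing inconsistencies) can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function name `profile_records`, the sample records, and the expected types are hypothetical and not part of CRISP-DM or any AWS service.

```python
from collections import Counter


def profile_records(records, expected_types):
    """Count missing values and type mismatches per attribute."""
    missing = Counter()
    mismatched = Counter()
    for row in records:
        for name, expected in expected_types.items():
            value = row.get(name)
            if value is None or value == "":
                # Absent, None, or empty string all count as missing.
                missing[name] += 1
            elif not isinstance(value, expected):
                # Present, but not of the type the attribute should have.
                mismatched[name] += 1
    return {
        "rows": len(records),
        "missing": dict(missing),
        "mismatched": dict(mismatched),
    }


# Hypothetical sample data with one missing age, one missing country,
# and one age stored as text instead of a number.
records = [
    {"age": 34, "country": "DE"},
    {"age": None, "country": "FR"},
    {"age": "41", "country": ""},
]
report = profile_records(records, {"age": int, "country": str})
print(report)
```

A report like this is a natural input for the "Report solutions" step: each non-zero count points at an attribute that needs a cleaning rule.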

AWS tools for Data Understanding

Amazon Athena
- Run interactive SQL queries on Amazon S3 data
- Schema-on-read
- Serverless
Amazon QuickSight
- Fast, cloud-powered business intelligence and data visualization service
AWS Glue
- Managed Extract-Transform-Load (ETL) service
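
As a rough sketch of how an interactive Athena query on S3 data might be started from Python: the helper below takes an Athena client as a parameter (in practice this would typically come from `boto3.client("athena")`) and calls `start_query_execution`. The helper name, database name, table name, and S3 output location are assumptions made for illustration.

```python
def start_athena_query(athena_client, sql, database, output_location):
    """Submit a SQL query to Athena and return its execution ID.

    athena_client is expected to behave like boto3.client("athena").
    Athena runs the query asynchronously and writes the results to the
    given S3 output location.
    """
    response = athena_client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


# Example call (hypothetical names, requires AWS credentials):
#   import boto3
#   query_id = start_athena_query(
#       boto3.client("athena"),
#       "SELECT COUNT(*) FROM events",
#       "example_database",
#       "s3://example-bucket/athena-results/",
#   )
```

Passing the client in keeps the helper easy to exercise with a stub before any AWS resources exist.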

Data Preparation Tasks & Modeling

Data Preparation Tasks

1. Final Dataset Selection

Total size
Included and Excluded columns
Record selection
Data type

2. Data Preparation

Cleaning
- Missing data
  - Dropping rows
  - Add default value or mean value
  - Use statistical methods to calculate the value
- Address noise values
Transformation
- Derive additional attributes from the original
- Normalization
- Attribute transformation
Merging
- Merging data into the final data set
- Null values introduced may require a cleaning iteration
Formatting
- Rearrange attributes
- Randomly shuffle data
- Remove content that violates constraints of the modeling tool (Unicode characters …)
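
Three of the preparation steps above (filling missing data with the mean, normalization, and random shuffling) can be illustrated with the standard library alone. This is a sketch under simple assumptions: a single numeric column, `None` marking missing values, and min-max scaling as the normalization method. The function names are hypothetical.

```python
import random
from statistics import mean


def impute_mean(values):
    """Replace None entries with the mean of the known values."""
    known = [v for v in values if v is not None]
    fill = mean(known)
    return [fill if v is None else v for v in values]


def min_max_normalize(values):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def shuffled(rows, seed=0):
    """Return a shuffled copy; a fixed seed keeps the order reproducible."""
    rng = random.Random(seed)
    out = list(rows)
    rng.shuffle(out)
    return out


ages = [20, None, 40, 30]
filled = impute_mean(ages)          # the missing entry becomes 30
scaled = min_max_normalize(filled)  # [0.0, 0.5, 1.0, 0.5]
print(filled, scaled, shuffled(scaled))
```

In a real project the fill value and the min/max would be computed on the training data only and then reused for new data, so that evaluation data does not leak into the preparation step.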

Data Modeling

1. Model selection and creation

Identify the modeling technique (regression for numeric values, recurrent NN for sequence prediction…)
Constraints of the modeling technique and tool

2. Model testing plan

Test/train dataset split (for example, 30% test, 70% train)
Model evaluation criterion
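
A 70/30 train/test split like the one mentioned above can be sketched as follows. This is a minimal standard-library version; the function name and the default seed are illustrative choices, and a library routine would normally be used instead.

```python
import random


def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle rows reproducibly and split them into (train, test)."""
    rng = random.Random(seed)
    shuffled_rows = list(rows)
    rng.shuffle(shuffled_rows)
    n_test = round(len(shuffled_rows) * test_fraction)
    return shuffled_rows[n_test:], shuffled_rows[:n_test]


train, test = train_test_split(range(10))
print(len(train), len(test))  # 7 3
```

Shuffling before splitting matters when the source data is ordered (for example by date or by class), because an unshuffled split would give the model a test set that looks different from its training set.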

3. Parameter tuning/testing

Build multiple models with different parameter settings
Describe the trained models and report on the findings
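
Building several models with different parameter settings and reporting on each can be sketched with a tiny k-nearest-neighbours classifier, where the parameter being varied is k. The data points, labels, and candidate values of k below are invented for illustration only.

```python
from collections import Counter


def knn_predict(train, x, k):
    """Predict a label by majority vote among the k nearest training points."""
    neighbours = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


def accuracy(train, validation, k):
    """Fraction of validation points whose label is predicted correctly."""
    hits = sum(knn_predict(train, x, k) == y for x, y in validation)
    return hits / len(validation)


# Hypothetical one-dimensional data: (feature value, label).
train = [(1.0, "low"), (2.0, "low"), (3.0, "low"), (3.6, "high"),
         (7.0, "high"), (8.0, "high"), (9.0, "high")]
validation = [(2.5, "low"), (3.4, "low"), (7.5, "high"), (4.0, "high")]

# One model per parameter setting, each scored on the same held-out data.
results = {k: accuracy(train, validation, k) for k in (1, 3, 5)}
best_k = max(results, key=results.get)

for k, score in results.items():
    print(f"k={k}: validation accuracy {score:.2f}")
print("selected k:", best_k)
```

The printed table is the "report on the findings" in miniature: every setting is recorded, not only the winner, so the choice can be reviewed later.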

AWS Tools for Data Preparation and Modeling

Amazon EMR + Spark
- IPython notebooks, Zeppelin notebooks, RStudio
- Scala, Python, R, Java, SQL
- Cost savings: leverage Spot Instances for task nodes
Amazon EC2 + Deep Learning AMI
- GPU CUDA support for training
- Preinstalled deep learning frameworks
  - MXNet, TensorFlow, Caffe2, Torch, Keras, Theano…
- Includes the Anaconda Python data science platform with popular libraries such as NumPy and scikit-learn
- You can install RStudio on the EC2 Deep Learning AMI

Evaluation

1. Evaluating the model

Accuracy of the model
Model generalization on unseen/unknown data
Evaluation of the model using existing business criteria

2. Reviewing the project

Assess the steps taken in each phase
Perform quality assurance checks

3. Make the final decision to deploy or not

Based on the complete evaluation and the acceptance criteria tied to the business goals, decide whether the model will be deployed. This requires careful analysis of the false positives and false negatives.
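
The error analysis behind the deployment decision can be made concrete by counting the four possible prediction outcomes and deriving precision and recall from them. The labels and predictions below are made-up sample values, and the function names are illustrative.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1  # predicted positive, actually negative
        elif a == positive:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn}


def precision_recall(counts):
    """Return (precision, recall), using 0.0 when a denominator is zero."""
    predicted_pos = counts["tp"] + counts["fp"]
    actual_pos = counts["tp"] + counts["fn"]
    precision = counts["tp"] / predicted_pos if predicted_pos else 0.0
    recall = counts["tp"] / actual_pos if actual_pos else 0.0
    return precision, recall


actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]
counts = confusion_counts(actual, predicted)
precision, recall = precision_recall(counts)
print(counts, precision, recall)
```

Which error type weighs more is a business question: a false positive and a false negative rarely cost the same, so the acceptance criteria should state a threshold for each.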

Running Jupyter Notebook on EC2 Instance

Create instance using Deep Learning AMI
Connect to the instance using SSH
Run: screen (keeps the session alive if the SSH connection drops)
Run Jupyter Notebook: jupyter notebook --no-browser, or, to listen on all network interfaces on a chosen port, jupyter notebook --no-browser --ip=0.0.0.0 --port=[choose your port]
Copy the URL containing the token, paste it into the browser, and open the example

Deployment

1. Planning deployment

Runtime
- Amazon EC2 instances
- Amazon Elastic Container Service (ECS, formerly EC2 Container Service)
- AWS Lambda
  - A trained model can be saved to S3 and then loaded in the Lambda function
Application Deployment
- AWS CodeDeploy
- AWS OpsWorks
- AWS Elastic Beanstalk
Infrastructure Deployment
- AWS CloudFormation
- AWS OpsWorks
- AWS Elastic Beanstalk
Code Management
- AWS CodeCommit
- AWS CodePipeline
- AWS Elastic Beanstalk
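
The Lambda runtime option listed above (trained model saved to S3, then loaded in the function) might look roughly like this. It is a sketch under several assumptions: the model is a small linear model stored as JSON with `weights` and `bias` fields, the S3 client is passed in (in Lambda it would typically be `boto3.client("s3")`), and the bucket and key names are hypothetical.

```python
import json

# Cache at module level so warm invocations skip the S3 download.
_MODEL_CACHE = {}


def load_model(s3_client, bucket, key):
    """Fetch and parse the model from S3 once, then reuse the cached copy."""
    cache_key = (bucket, key)
    if cache_key not in _MODEL_CACHE:
        body = s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
        _MODEL_CACHE[cache_key] = json.loads(body)
    return _MODEL_CACHE[cache_key]


def predict(model, features):
    """Score a feature vector with a linear model (assumed model format)."""
    return model["bias"] + sum(w * x for w, x in zip(model["weights"], features))


def make_handler(s3_client, bucket, key):
    """Build a Lambda-style handler bound to one model location."""
    def handler(event, context):
        model = load_model(s3_client, bucket, key)
        return {"prediction": predict(model, event["features"])}
    return handler


# In a Lambda deployment (hypothetical bucket and key):
#   import boto3
#   lambda_handler = make_handler(boto3.client("s3"), "example-bucket", "model.json")
```

Loading outside the per-request path is the main design point: only a cold start pays for the download, which keeps typical invocation latency low.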

2. Maintenance and monitoring

Monitoring

Amazon CloudWatch
AWS CloudTrail
AWS Elastic Beanstalk
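
As one possible way to feed model behaviour into CloudWatch, the sketch below publishes a custom prediction-latency metric through `put_metric_data`. The CloudWatch client is passed in (typically `boto3.client("cloudwatch")`); the namespace, metric name, and dimension name are assumptions chosen for this example.

```python
def build_latency_metric(model_name, latency_ms):
    """Build one CloudWatch metric datum for a prediction latency sample."""
    return {
        "MetricName": "PredictionLatency",  # hypothetical metric name
        "Dimensions": [{"Name": "ModelName", "Value": model_name}],
        "Value": float(latency_ms),
        "Unit": "Milliseconds",
    }


def publish_latency(cloudwatch_client, model_name, latency_ms,
                    namespace="ExampleML/Models"):
    """Send the latency sample to CloudWatch under a custom namespace."""
    cloudwatch_client.put_metric_data(
        Namespace=namespace,
        MetricData=[build_latency_metric(model_name, latency_ms)],
    )


# Example call (requires AWS credentials):
#   import boto3
#   publish_latency(boto3.client("cloudwatch"), "churn-model", 12.5)
```

Once a metric like this exists, a CloudWatch alarm on it can flag a deployed model whose latency drifts outside the range agreed during planning.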

3. Final report

Highlight processes used in the project
Analyze if all the goals for the project were met
Detail the findings of the project
Identify and explain the model used and reason behind using the model
Identify the customer groups to target using this model

4. Project review

Assess the outcomes of the project
Summarize the results and write thorough documentation
Common pitfalls
Choosing the right ML solution
Generalize the whole process to make it useful for the next iteration