CRISP-DM on AWS
CRISP-DM
- Stands for Cross-Industry Standard Process for Data Mining
- Is a framework for data science projects
Phases of CRISP-DM
Business Understanding
1. Understanding business requirements
- Questions from the business perspective which need answering
- Highlight project’s critical features
- People and resources required
2. Analyzing supporting information
- List required resources and assumptions
- Analyze associated risks
- Plan for contingencies
- Compare costs and benefits
3. Converting to a Data Mining or Machine Learning problem
- Review machine learning question
- Create technical data mining objective
- Define the criteria for successful outcome of the project
4. Preparing a preliminary plan
- Number and duration of stages
- Dependencies
- Risks
- Goals
- Evaluation methods
- Tools and techniques
Data Understanding
- Data collection
- Detail the various sources and the steps to extract the data
- Analyze data for additional requirements
- Consider other data sources
- Data properties
- Describe the data, amount of data used, and metadata properties
- Find key features and relationships in the data
- Use tools and techniques to explore data properties
- Quality
- Verifying attributes
- Identifying missing data
- Reveal inconsistencies
- Report solution
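A minimal pandas sketch of these exploration and quality checks, assuming the data has already been extracted to a local CSV (the file path and the 'category' column are placeholders):

    import pandas as pd

    # Load the extracted dataset (placeholder path)
    df = pd.read_csv("dataset.csv")

    # Describe the data: size, types, basic statistics
    print(df.shape)
    print(df.dtypes)
    print(df.describe(include="all"))

    # Identify missing data per attribute
    print(df.isnull().sum())

    # Reveal inconsistencies, e.g. duplicate records or unexpected categories
    print(df.duplicated().sum())
    print(df["category"].unique())  # 'category' is a placeholder column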
AWS tools for Data Understanding
- Amazon Athena (see the boto3 sketch after this list)
- Run interactive SQL queries on Amazon S3 data
- Schema-on-read
- Serverless
- Amazon QuickSight
- Fast, cloud-powered business intelligence and data visualization service
- AWS Glue
- Managed Extract-Transform-Load (ETL) service
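A rough sketch of running an interactive Athena query from Python with boto3; the database, table, bucket, and column names are placeholders and error handling is omitted:

    import time
    import boto3

    athena = boto3.client("athena")

    # Submit an interactive SQL query against data in S3 (schema-on-read, serverless)
    query_id = athena.start_query_execution(
        QueryString="SELECT col_a, COUNT(*) AS cnt FROM my_table GROUP BY col_a",
        QueryExecutionContext={"Database": "my_database"},
        ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
    )["QueryExecutionId"]

    # Wait for the query to finish (simplified polling loop)
    while True:
        state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)

    # Fetch the result rows if the query succeeded
    if state == "SUCCEEDED":
        rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]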
Data Preparation Tasks & Modeling
Data Preparation Tasks
1. Final Dataset Selection
- Total size
- Included and Excluded columns
- Record selection
- Data type
2. Data Preparation (see the pandas sketch after this list)
- Cleaning
- Missing data
- Dropping rows
- Add default value or mean value
- Use statistical methods to calculate the value
- Address noisy values
- Transformation
- Derive additional attributes from the original
- Normalization
- Attribute transformation
- Merging
- Merging data into the final data set
- Null values introduced may require a cleaning iteration
- Formatting
- Rearrange attributes
- Randomly shuffle data
- Satisfy constraints of the modeling tool (e.g., remove unsupported Unicode characters …)
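A minimal pandas/scikit-learn sketch of the cleaning, transformation, and formatting steps above; the CSV path and the column names ('target', 'age', 'income', 'household_size') are placeholders:

    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler

    df = pd.read_csv("dataset.csv")  # placeholder path

    # Cleaning: drop rows missing the target, fill numeric gaps with the column mean
    df = df.dropna(subset=["target"])
    df["age"] = df["age"].fillna(df["age"].mean())

    # Transformation: derive an additional attribute and normalize numeric columns
    df["income_per_member"] = df["income"] / df["household_size"]
    numeric_cols = ["age", "income", "income_per_member"]
    df[numeric_cols] = MinMaxScaler().fit_transform(df[numeric_cols])

    # Formatting: randomly shuffle the records before modeling
    df = df.sample(frac=1, random_state=42).reset_index(drop=True)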
Data Modeling
1. Model selection and creation
- Identify modeling technique (Regression for numeric values, Recurrent NN for sequence prediction…)
- Constraints of the modeling technique and tool
2. Model testing plan
- Test/train dataset split (30% test, 70% train)
- Model evaluation criterion
3. Parameter tuning/testing
- Build multiple models with different parameter settings
- Describe the trained models and report on the findings
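A hedged scikit-learn sketch of the testing plan and parameter tuning above, using synthetic placeholder data in place of the prepared dataset:

    import numpy as np
    from sklearn.model_selection import train_test_split, GridSearchCV
    from sklearn.linear_model import Ridge

    # Placeholder data standing in for the prepared dataset
    X = np.random.rand(1000, 3)
    y = X @ np.array([2.0, -1.0, 0.5]) + np.random.rand(1000) * 0.1

    # Test/train dataset split (30% test, 70% train)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    # Build multiple models with different parameter settings
    search = GridSearchCV(Ridge(), param_grid={"alpha": [0.1, 1.0, 10.0]}, cv=5)
    search.fit(X_train, y_train)

    # Report on the findings against the evaluation criterion (here: R^2 on the held-out test set)
    print(search.best_params_, search.score(X_test, y_test))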
AWS Tools for Data Preparation and Modeling
- Amazon EMR + Spark (see the PySpark sketch after this list)
- IPython notebooks, Zeppelin notebooks, RStudio
- Scala, Python, R, Java, SQL
- Cost savings: Leverage Spot instances for task nodes
- AWS EC2 + Deep Learning AMI
- GPU CUDA support for training
- Preinstalled deep learning frameworks
- MXNet, TensorFlow, Caffe2, Torch, Keras, Theano…
- Includes the Anaconda Python data science platform with popular libraries like NumPy and scikit-learn
- You can install RStudio on the EC2 Deep Learning AMI
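A minimal PySpark sketch of exploring S3 data on an EMR cluster; the bucket, prefix, and 'category' column are placeholders:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # On EMR a SparkSession is usually provided; this creates one if not
    spark = SparkSession.builder.appName("data-understanding").getOrCreate()

    # Read raw data from S3 (placeholder bucket and prefix)
    df = spark.read.csv("s3://my-bucket/raw/", header=True, inferSchema=True)

    # Simple exploration and aggregation
    df.printSchema()
    df.groupBy("category").agg(F.count("*").alias("cnt")).show()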
Evaluation
- Accuracy of the model
- Model generalization on unseen/unknown data
- Evaluation of the model using existing business criteria
1. Reviewing the project
- Assess the steps taken in each phase
- Perform quality assurance checks
2. Make the final decision to deploy or not
Based on the complete evaluation and the business goals' acceptance criteria, we decide whether the model will be deployed or not. This requires careful analysis of the false positives and false negatives.
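A small scikit-learn sketch of that false positive / false negative analysis, using placeholder labels in place of real predictions on unseen data:

    from sklearn.metrics import confusion_matrix

    # Placeholder actual vs. predicted labels for a binary classifier
    y_true = [0, 0, 1, 1, 1, 0, 1, 0]
    y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

    # Unpack the confusion matrix: true negatives, false positives, false negatives, true positives
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print(f"false positives={fp}, false negatives={fn}, accuracy={(tp + tn) / len(y_true):.2f}")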
Running Jupyter Notebook on EC2 Instance
- Create instance using Deep Learning AMI
- Connect to the instance using SSH
- Run:
screen
(read more …)
- Run Jupyter Notebook:
jupyter notebook --no-browser
OR jupyter notebook --no-browser --ip=0.0.0.0 --port=[choose your port]
- Copy and paste the URL containing the token into the browser and access the examples
Deployment
1. Planning deployment
- Runtime
- AWS EC2 Instances
- Amazon EC2 Container Service (ECS)
- AWS Lambda
- Trained model could be saved to S3 and then loaded in the Lambda function (see the sketch after this list)
- Application Deployment
- AWS CodeDeploy
- AWS OpsWorks
- AWS Elastic Beanstalk
- Infrastructure Deployment
- AWS CloudFormation
- AWS OpsWorks
- AWS Elastic Beanstalk
- Code Management
- AWS CodeCommit
- AWS CodePipeline
- AWS Elastic Beanstalk
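A hedged sketch of the Lambda option above: a handler that loads a pickled model from S3 at cold start and serves predictions. The bucket, key, and event shape (an API Gateway proxy JSON body) are assumptions, and the deployment package would need the model's library (e.g., scikit-learn) bundled:

    import json
    import pickle
    import boto3

    s3 = boto3.client("s3")

    # Download and unpickle the trained model once per container, outside the handler
    # (placeholder bucket and key)
    s3.download_file("my-model-bucket", "models/model.pkl", "/tmp/model.pkl")
    with open("/tmp/model.pkl", "rb") as f:
        model = pickle.load(f)

    def handler(event, context):
        # Assumes a JSON body with a list of feature values
        features = json.loads(event["body"])["features"]
        prediction = model.predict([features])[0]
        return {"statusCode": 200, "body": json.dumps({"prediction": float(prediction)})}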
2. Maintenance and monitoring
Monitoring
- Amazon CloudWatch
- AWS CloudTrail
- AWS Elastic Beanstalk
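A minimal boto3 sketch of publishing a custom model-quality metric to CloudWatch for monitoring; the namespace, metric name, and value are placeholders:

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    # Publish a custom metric, e.g. the model's live prediction error (placeholder values)
    cloudwatch.put_metric_data(
        Namespace="MyModel/Monitoring",
        MetricData=[{"MetricName": "PredictionError", "Value": 0.12, "Unit": "None"}],
    )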
3. Final report
- Highlight processes used in the project
- Analyze if all the goals for the project were met
- Detail the findings of the project
- Identify and explain the model used and reason behind using the model
- Identify the customer groups to target using this model
4. Project review
- Assess the outcomes of the project
- Summarize the results and write thorough documentation
- Common pitfalls
- Choosing the right ML solution
- Generalize the whole process to make it useful for the next iteration