
Elements of Data Science

AWS Certified MLS | 05 Nov 2019



    What is Data Science?

    Data Science Definition Processes and systems to extract knowledge or insights from data, either structured or unstructured. (Wikipedia)

    Machine Learning Artificial intelligence techniques in which machines improve their predictions by learning from large amounts of input data.

    Learning Is the process of estimating the underlying function $f$ by mapping data attributes to some target value.

    Training Set Is a set of labeled examples $(x, f(x))$ where $x$ is the input variables and $f(x)$ is the observed target truth.

    Goal Given a training set, find an approximation $\hat{f}$ of $f$ that best generalizes, i.e. predicts labels for new examples. Quality is measured by metrics such as error rate or sum of squared errors.

    Features Also referred to as Attributes, Independent Variables, or Predictors.

    Label Also referred to as Target, Outcome, Class, Dependent Variable, or Response.

    Dimensionality Refers to the number of Features.

    Types of Learning

    Key Issues in ML

    Data Quality

    Model Quality

    Computation Speed and Scalability AWS SageMaker

    Supervised Methods

    Linear Methods

    Univariate Linear Regression

    Multivariate Linear Regression

    Logistic Regression and Linear Separability

    The intermediary variable $z$ is a linear combination of the features and is passed through the sigmoid function:

    $ z = w_0 + w_1 x_1 + \dots + w_n x_n $

    $ \hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}} $

    Logistic regression finds the best weight vector by fitting the training data,

    where $x_1, \dots, x_n$ are the features, $w_0, \dots, w_n$ are the weights, and $\sigma(z)$ is interpreted as the probability of the positive class.
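
    A minimal sketch, assuming scikit-learn's LogisticRegression on a small made-up two-feature dataset (the data and values are illustrative only):

    from sklearn.linear_model import LogisticRegression
    import numpy as np
    
    # Toy dataset: points with large feature values belong to class 1
    X = np.array([[0.1, 0.2], [0.3, 0.4], [0.2, 0.1], [0.9, 0.8], [0.7, 0.9], [0.8, 0.6]])
    y = np.array([0, 0, 0, 1, 1, 1])
    
    model = LogisticRegression()
    model.fit(X, y)
    
    # Learned weight vector and intercept define z = w . x + w_0
    print(model.coef_, model.intercept_)
    # Predicted class probabilities come from the sigmoid of z
    print(model.predict_proba([[0.5, 0.5]]))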

    Linear Separability


    Problem Formulation and Exploratory Data Analysis

    Data collection

    AWS provides a comprehensive toolkit for sharing and analyzing data at any scale.

    Data Sampling

    Sampling Methods

    Issues with Sampling

    Consider using validation data that was gathered after your training data was gathered.

    Data Labeling

    Labeling Components

    Amazon Mechanical Turk

    Managing Labelers

    Sampling and Treatment Assignment

    Exploratory Data Analysis

    Domain Knowledge

    Amazon ML Solutions Lab

    Data Schema

    Pandas DataFrame Merge/Join

    import pandas as pd
    
    # Employees with their job titles, and the VP responsible for each job title
    df = pd.DataFrame({"Name": ["John", "Bob", "Jim", "Kate"], "Job": ["Accountant", "Programmer", "Programmer", "Marketing"]})
    df_1 = pd.DataFrame({"VP": ["Tom", "Andy", "Kate"], "Job": ["Accountant", "Programmer", "Marketing"]})
    
    # Inner join on the shared "Job" column keeps only rows present in both frames
    df_merged = df.merge(df_1, on="Job", how="inner")
    
    print(df_merged)
    

    Data Statistics

    1. Look into each Feature one at a time
    2. Assess Interactions between the Features (relationships)

    Descriptive Statistics

    import pandas as pd
    import seaborn as sb
    import matplotlib.pyplot as plt
    
    df = pd.DataFrame({"Name": ["John", "Bob", "Jim", "Kate"], "Job": ["Accountant", "Programmer", "Programmer", "Marketing"], "Salary": [1000, 2500, 2750, 1800]})
    df_1 = pd.DataFrame({"VP": ["Tom", "Andy", "Kate"], "Job": ["Accountant", "Programmer", "Marketing"]})
    
    df_merged = df.merge(df_1, on="Job", how="inner")
    
    # Frequency of each job title
    print(df_merged["Job"].value_counts())
    # Count plot of the categorical column (seaborn's distplot is deprecated)
    sb.countplot(x="Job", data=df_merged)
    plt.show()
    

    Basic Plots

    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_breast_cancer
    
    dataset = load_breast_cancer()
    cols = [
        'V10', 'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19',
        'V20', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'V29',
        'V30', 'V31', 'V32', 'V33', 'V34', 'V35', 'V36', 'V37', 'V38', 'V39'
    ]
    
    df = pd.DataFrame(dataset['data'], columns=cols)
    df['target'] = dataset.target
    
    # show the first few rows
    print(df.head())
    # show data type for each column
    print(df.info())
    # show summary statistics for each column
    print(df.describe())
    # check the target variable properties
    print(df['target'].value_counts())
    
    # Density Plot
    df['V11'].plot.kde()
    plt.show()
    # Histogram
    df['V11'].plot.hist()
    plt.show()
    # Box Plot
    df.boxplot(['V11'])
    plt.show()
    # Scatter Plots (detecting relationship between variables)
    df.plot.scatter(x='V11', y='V12')
    plt.show()
    # Scatter Matrix Plot
    pd.plotting.scatter_matrix(df[['V11', 'V21', 'V31']], figsize=(15,15))
    plt.show()
    

    Correlations

    Correlation values are between -1 and 1.

    Correlation Matrices Measure the linear dependence between features; can be visualized with heat maps

    Correlation Matrix Heatmap Apply color coding to the correlation matrix for easy detection of correlation among the attributes

    Generating Heatmap

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_breast_cancer
    
    dataset = load_breast_cancer()
    cols = [
        'V10', 'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19',
        'V20', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'V29',
        'V30', 'V31', 'V32', 'V33', 'V34', 'V35', 'V36', 'V37', 'V38', 'V39'
    ]
    
    df = pd.DataFrame(dataset['data'], columns=cols)
    
    col = ['V10', 'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19']
    heatmap = np.corrcoef(df[col].values.T)
    
    fig, ax = plt.subplots(figsize=(15, 15))
    # Correlation values range from -1 to 1, so fix the color scale accordingly
    im = ax.imshow(heatmap, cmap='PiYG', vmin=-1, vmax=1)
    fig.colorbar(im)
    ax.grid(False)
    # Annotate each cell with its correlation value
    for i in range(len(heatmap)):
        for j in range(len(heatmap)):
            ax.text(j, i, round(heatmap[i, j], 2), ha="center", va="center", color="w")
    
    ax.set_xticks(np.arange(len(col)))
    ax.set_yticks(np.arange(len(col)))
    ax.set_xticklabels(col)
    ax.set_yticklabels(col)
    
    plt.show()
    

    Generating Heatmap Using Seaborn

    import pandas as pd
    import numpy as np
    import seaborn
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_breast_cancer
    
    dataset = load_breast_cancer()
    cols = [
        'V10', 'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19',
        'V20', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'V29',
        'V30', 'V31', 'V32', 'V33', 'V34', 'V35', 'V36', 'V37', 'V38', 'V39'
    ]
    
    df = pd.DataFrame(dataset['data'], columns=cols)
    
    col = ['V10', 'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19']
    heatmap = np.corrcoef(df[col].values.T)
    
    seaborn.heatmap(heatmap, yticklabels=col, xticklabels=col, cmap='PiYG', annot=True)
    plt.show()
    

    Data Issues


    Data Processing and Feature Engineering

    Encoding Categorical Variables

    Encoding Ordinals

    Types of Categorical Variables

    Pandas supports a special dtype="category" for categorical columns, as sketched below.
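
    A minimal sketch of the category dtype; the column name and category order are illustrative assumptions:

    import pandas as pd
    
    df = pd.DataFrame({"size": ["S", "M", "L", "M", "S"]})
    
    # Convert to the "category" dtype; an explicit order marks it as ordinal
    df["size"] = df["size"].astype(pd.CategoricalDtype(categories=["S", "M", "L"], ordered=True))
    
    print(df["size"].dtype)      # category
    print(df["size"].cat.codes)  # integer codes: 0, 1, 2, 1, 0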

    Encoding Categorical Variables Example

    import pandas as pd
    from sklearn.preprocessing import LabelEncoder
    
    df = pd.DataFrame([
        ['house', 3, 2572, 'S', 1372000, 'Y'],
        ['apartment', 2, 1386, 'N', 699000, 'N'],
        ['house', 3, 1932, 'L', 800000, 'N'],
        ['house', 1, 851, 'M', 451000, 'Y'],
        ['apartment', 1, 600, 'N', 324000, 'N']
    ])
    
    df.columns = ['type', 'bedrooms', 'area', 'garden_size', 'price', 'loan_approved']
    print(df)
    
    # Converting garden_size using mapping
    mapping = dict({'N': 0, 'S': 5, 'M': 10, 'L': 20})
    df['num_garden_size'] = df['garden_size'].map(mapping)
    
    # Converting label loan_approved using LabelEncoder
    loan_enc = LabelEncoder()
    df['num_loan_approved'] = loan_enc.fit_transform(df['loan_approved'])
    
    print(df)
    

    Encoding Nominals

    One-Hot Encoding Explode nominal attributes into many binary attributes, one for each discrete value

    from sklearn.preprocessing import LabelEncoder
    from sklearn.preprocessing import OneHotEncoder
    import pandas as pd
    
    df = pd.DataFrame({"Fruits": ['Apple', 'Banana', 'Banana', 'Mango', 'Banana']})
    
    type_labelenc = LabelEncoder()
    num_type = type_labelenc.fit_transform(df["Fruits"])
    
    print(num_type) 
    # output: 
    # [0 1 1 2 1]
    
    print(num_type.reshape(-1, 1))
    # output: 
    # [[0]
    # [1]
    # [1]
    # [2]
    # [1]]
    
    
    type_enc = OneHotEncoder()
    type_trans = type_enc.fit_transform(num_type.reshape(-1, 1)).toarray()
    
    print(type_trans)
    # output:
    # [[1. 0. 0.]
    # [0. 1. 0.]
    # [0. 1. 0.]
    # [0. 0. 1.]
    # [0. 1. 0.]]
    
    

    Using Pandas’ function:

    import pandas as pd
    
    df = pd.DataFrame({"Fruits": ['Apple', 'Banana', 'Banana', 'Mango', 'Banana']})
    
    dummies = pd.get_dummies(df)
    print(dummies)
    

    Encoding with Many Classes

    Handling Missing Values

    Most ML algorithms cannot handle missing values automatically.

    Check the Missing Values using Pandas

    import pandas as pd
    
    df = pd.DataFrame({
        "Fruits": ["Banana", "Apple", "Mango", "Mango", "Apple"],
        "Number": [5, None, 3, None, 1]
    })
    
    # Display the total number of missing values for each column
    print(df.isnull().sum())
    
    # Display the total number of missing values for each row
    print(df.isnull().sum(axis=1))
    

    Important to Consider:

    Treating Missing Values

    Dropping The Missing Values

    import pandas as pd
    
    df = pd.DataFrame({
        "Fruits": ["Banana", "Apple", "Mango", "Mango", "Apple"],
        "Number": [5, None, 3, None, 1]
    })
    
    # Drop the rows with null values
    print(df.dropna())
    
    # Drop the columns with null values
    print(df.dropna(axis=1))
    

    Imputing (Replacing) the Missing Values

    Imputing using the mean strategy:

    from sklearn.impute import SimpleImputer
    import numpy as np
    
    # Missing entries are represented as np.nan, SimpleImputer's default missing-value marker
    arr = np.array([
        [5, 3, 2, 2],
        [3, np.nan, 1, 9],
        [5, 2, 7, np.nan]
    ])
    
    imputer = SimpleImputer(strategy='mean')
    imp = imputer.fit_transform(arr)
    print(imp)
    

    Feature Engineering

    Filtering and Scaling

    Filter Examples:

    Scaling:

    Scaling Transformation in Sklearn:

    Standard Scaler

    from sklearn.preprocessing import StandardScaler
    import numpy as np
    
    arr = np.array([
        [5, 3, 2, 2],
        [2, 3, 1, 9],
        [5, 2, 7, 6]
    ], dtype=float)
    
    scale = StandardScaler()
    print(scale.fit_transform(arr))
    print(scale.scale_)
    

    MinMaxScaler (produces values between 0 and 1)

    from sklearn.preprocessing import MinMaxScaler
    import numpy as np
    
    arr = np.array([
        [5, 3, 2, 2],
        [2, 3, 1, 9],
        [5, 2, 7, 6]
    ], dtype=float)
    
    scale = MinMaxScaler()
    print(scale.fit_transform(arr))
    print(scale.scale_)
    

    Transformation

    Polynomial Transformation

    from sklearn.preprocessing import PolynomialFeatures
    import numpy as np
    import pandas as pd
    
    df = pd.DataFrame({'a': np.random.rand(5), 'b': np.random.rand(5)})
    
    cube = PolynomialFeatures(degree=3)
    cube_features = cube.fit_transform(df)
    
    cube_df = pd.DataFrame(cube_features, columns=[
        '1', 'a', 'b', 'a^2', 'ab', 'b^2', 'a^3', 'ba^2', 'ab^2', 'b^3'
    ])
    
    print(cube_df)
    

    Radial Basis Function
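
    A minimal sketch of a Gaussian radial basis function used as a derived feature; the center and gamma values are illustrative assumptions:

    import numpy as np
    
    def rbf(x, center, gamma=1.0):
        # Gaussian RBF: squared distance to a center mapped into (0, 1]
        return np.exp(-gamma * np.sum((x - center) ** 2, axis=-1))
    
    X = np.array([[0.0], [0.5], [1.0], [2.0]])
    center = np.array([1.0])
    
    # New feature: similarity of each sample to the chosen center
    print(rbf(X, center, gamma=2.0))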

    Text-Based Features

    Bag-of-words Model
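
    A minimal sketch of a bag-of-words representation, assuming scikit-learn's CountVectorizer and two made-up sentences:

    from sklearn.feature_extraction.text import CountVectorizer
    
    docs = [
        "the cat sat on the mat",
        "the dog sat on the log",
    ]
    
    vectorizer = CountVectorizer()
    bow = vectorizer.fit_transform(docs)
    
    # Vocabulary learned from the corpus and the per-document word counts
    print(vectorizer.get_feature_names_out())
    print(bow.toarray())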


    Model Training, Tuning, and Debugging

    Supervised Learning: Neural Networks

    Supervised Learning: K-Nearest Neighbors

    Characteristics of K-Nearest Neighbor Algorithm
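
    A minimal sketch, assuming scikit-learn's KNeighborsClassifier on a made-up dataset; k=3 is an arbitrary illustrative choice:

    from sklearn.neighbors import KNeighborsClassifier
    import numpy as np
    
    X = np.array([[1, 1], [1, 2], [2, 2], [8, 8], [8, 9], [9, 9]])
    y = np.array([0, 0, 0, 1, 1, 1])
    
    # Non-parametric and distance-based: prediction is a majority vote of the k nearest points
    knn = KNeighborsClassifier(n_neighbors=3)
    knn.fit(X, y)
    
    print(knn.predict([[2, 1], [9, 8]]))  # expected: [0 1]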

    Supervised Learning: Linear and Non-Linear Support Vector Machines

    Types of SVM

    Supervised Learning: Decision Trees and Random Forests

    Building a Decision Tree

    Types of Decision Trees

    Common Decision Tree Algorithms

    Model Training: Validation Set

    Splitting Data: Training, Testing, Validation

    Training Validation Testing Datasets
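
    A minimal sketch of a train/validation/test split using two calls to scikit-learn's train_test_split; the 60/20/20 ratio and the breast cancer dataset are illustrative assumptions:

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    
    X, y = load_breast_cancer(return_X_y=True)
    
    # First split off the test set, then carve a validation set out of the remainder
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)
    
    print(len(X_train), len(X_val), len(X_test))  # roughly 60% / 20% / 20%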

    Model Training: Bias Variance Tradeoff

    Bias

    Variance

    $ \text{Total Error}(x) = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error} $

    Learning Curve

    Model Debugging: Error Analysis

    Model Tuning: Regularization

    Regularization Techniques

    Regularization in Linear Models

    L1 regularization (Lasso): $ penalty = \sum_{j=1}^{n} \lvert w_j \rvert $

    L2 regularization (Ridge): $ penalty = \sum_{j=1}^{n} w_j^2 $

    L2 Regularization In Neural Network

    $ penalty = \frac{\lambda}{2m} \sum_{j=1}^{n} \lVert w^{[j]} \rVert^2 $

    $n$ - the number of layers, $w^{[j]}$ - the weight matrix for the $j^{th}$ layer, $m$ - the number of inputs, $\lambda$ - the regularization parameter

    Scikit Learn Support
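
    A minimal sketch of L1- and L2-regularized linear models in scikit-learn (Lasso and Ridge); the diabetes dataset and alpha values are illustrative assumptions:

    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import Ridge, Lasso
    
    X, y = load_diabetes(return_X_y=True)
    
    # L2 penalty (Ridge) shrinks the weights; L1 penalty (Lasso) can drive some to exactly zero
    ridge = Ridge(alpha=1.0).fit(X, y)
    lasso = Lasso(alpha=0.1).fit(X, y)
    
    print(ridge.coef_)
    print(lasso.coef_)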

    Model Tuning: Hyperparameter Tuning

    Parameter vs Hyperparameter

    Tuning Hyperparameters
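
    A minimal sketch of hyperparameter tuning with scikit-learn's GridSearchCV; the model and parameter grid are illustrative assumptions:

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import GridSearchCV
    from sklearn.neighbors import KNeighborsClassifier
    
    X, y = load_breast_cancer(return_X_y=True)
    
    # Try several values of the hyperparameter n_neighbors with 5-fold cross validation
    param_grid = {"n_neighbors": [1, 3, 5, 7, 9]}
    search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
    search.fit(X, y)
    
    print(search.best_params_)
    print(search.best_score_)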

    Model Tuning

    Training Data Tuning

    Possible Issues and Solutions

    Feature Set Tuning

    Dimensionality Reduction

    Model Tuning: Feature Extraction

    Feature Selection vs Feature Extraction

    Model Tuning: Bagging/Boosting

    Bagging

    Boosting
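
    A minimal sketch contrasting a bagging ensemble (RandomForestClassifier) with a boosting ensemble (AdaBoostClassifier); the dataset and hyperparameters are illustrative assumptions:

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
    from sklearn.model_selection import cross_val_score
    
    X, y = load_breast_cancer(return_X_y=True)
    
    # Bagging: trees trained independently on bootstrap samples, predictions averaged
    bagging = RandomForestClassifier(n_estimators=100, random_state=0)
    # Boosting: weak learners trained sequentially, each focusing on the previous errors
    boosting = AdaBoostClassifier(n_estimators=100, random_state=0)
    
    print(cross_val_score(bagging, X, y, cv=5).mean())
    print(cross_val_score(boosting, X, y, cv=5).mean())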


    Model Evaluation and Model Productionizing

    Using ML Models in Production

    Aspects to Consider

    Types of Production Environments

    Model Evaluation Metrics

    Confusion Matrix

    Metrics

    $ Accuracy = \frac{ TP + TN }{ TP + TN + FP + FN }$

    $ Precision = \frac{ TP }{ TP + FP }$

    $ Recall = \frac{ TP }{ TP + FN }$

    $ F_1 = \frac{ 2 \times Precision \times Recall }{ Precision + Recall }$
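
    A minimal sketch computing these metrics with scikit-learn on made-up labels and predictions:

    from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
    
    # Made-up ground truth and predictions for illustration
    y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
    y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]
    
    print(confusion_matrix(y_true, y_pred))  # rows: actual, columns: predicted
    print(accuracy_score(y_true, y_pred))
    print(precision_score(y_true, y_pred))
    print(recall_score(y_true, y_pred))
    print(f1_score(y_true, y_pred))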

    Cross Validation

    Cross validation is a model validation technique for assessing how well a model's predictions generalize. A portion of the data, referred to as the test set, is withheld from the training cycle and used only for evaluation.

    K-Fold Cross Validation

    Steps

    Leave-One-Out Cross Validation

    Stratified K-fold Cross Validation
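
    A minimal sketch of k-fold and stratified k-fold cross validation with cross_val_score; the pipeline and dataset are illustrative assumptions:

    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score, KFold, StratifiedKFold
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    
    X, y = load_breast_cancer(return_X_y=True)
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    
    # Plain k-fold: split into k chunks regardless of class balance
    print(cross_val_score(model, X, y, cv=KFold(n_splits=5)).mean())
    # Stratified k-fold: each fold preserves the class proportions
    print(cross_val_score(model, X, y, cv=StratifiedKFold(n_splits=5)).mean())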

    Metrics for Linear Regression

    Mean Squared Error: $ MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $

    Coefficient of determination: $ R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} $
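
    A minimal sketch computing both metrics with scikit-learn; the diabetes dataset and a plain LinearRegression fit are illustrative assumptions:

    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error, r2_score
    from sklearn.model_selection import train_test_split
    
    X, y = load_diabetes(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    
    model = LinearRegression().fit(X_train, y_train)
    y_pred = model.predict(X_test)
    
    print(mean_squared_error(y_test, y_pred))
    print(r2_score(y_test, y_pred))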

    Using ML Models in Production: Storage

    Considerations

    Model and Pipeline Persistence
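
    A minimal sketch of persisting a fitted pipeline with joblib; the pipeline contents and file name are illustrative assumptions:

    import joblib
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    
    X, y = load_breast_cancer(return_X_y=True)
    
    # Fit a preprocessing + model pipeline, save it to disk, then reload and reuse it
    pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    pipeline.fit(X, y)
    
    joblib.dump(pipeline, "model_pipeline.joblib")
    restored = joblib.load("model_pipeline.joblib")
    print(restored.predict(X[:5]))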

    Model Deployment

    Using ML Models in Production: Monitoring and Maintenance

    Monitoring Considerations:

    Expected Changes

    Using ML Models in Production: Using AWS

    Common Mistakes