Machine Learning Summary

General Idea

  • No Free Lunch Theorem (no single model is best for every problem)
  • Use cross-validation (CV) for hyperparameter tuning and model selection
  • Machine Learning comes with assumptions
  • Types of learning:
    • Supervised Learning: with response y
    • Unsupervised Learning: without response y
    • Semi-supervised Learning: a small amount of labeled data with a large amount of unlabeled data
  • Types of Model:
    • Regression: the response is quantitative
    • Classification: the response is qualitative
  • Trade-offs in Machine Learning:
    • Accuracy vs. Interpretability
    • Bias vs. Variance
    • Complexity vs. Scalability
    • Domain-knowledge vs. Data-driven
    • More data vs. Better algorithm

Preprocessing Data

  • Sampling: select a subset of observations
  • Feature Extraction: Select input variables
    • Lasso Regularization
    • Kernel in the Convolution Operator
  • Scale Data: Unify the unit, especially important for distance measure models
    • Normalization: scaling the data as x_nrm = (x - min(x)) / (max(x) - min(x)) is called normalization.
    • Standardization: scaling the data as x_std = (x - mean(x)) / sqrt(Var(x)) is called standardization. Here you want the features to have zero mean and unit variance.
    • In Python, when scaling, one should fit the scaler or standardizer on the training set only, and then use the fitted object to transform the test set (see the sketch at the end of this section).
    • In Python, Normalizer works on rows instead of columns. In most cases, MinMaxScaler or StandardScaler will be more helpful
  • Handling Missing Data: Using Data Imputation to deal with the missing data
    • In Python, we use SimpleImputer
  • Label Encoding: convert categorical (discrete) labels into numeric codes
    • In Python, we use LabelEncoder, OneHotEncoder
  • Construct Train and Test set: Split dataset to train set and test set
    • In Python, we use model_selection.train_test_split
  • Generate polynomial and interaction features: extension of the Linear Model
    • In Python, we use PolynomialFeatures
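  • A minimal end-to-end sketch of the preprocessing steps above (the toy data frame and column names are made up for illustration); note that the imputer and scaler are fit on the training set only and then reused to transform the test set:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.impute import SimpleImputer

# toy data frame with a missing value (hypothetical column names)
df = pd.DataFrame({'x1': [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
                   'x2': [10.0, 9.0, 8.0, 7.0, 6.0, 5.0],
                   'y':  [0, 1, 0, 1, 0, 1]})
X, y = df[['x1', 'x2']], df['y']

# split first, so nothing is fit on the test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# impute missing values: fit on train, transform both
imp = SimpleImputer(strategy='mean').fit(X_train)
X_train_imp, X_test_imp = imp.transform(X_train), imp.transform(X_test)

# standardize: fit on train once, then transform both train and test
scaler = StandardScaler().fit(X_train_imp)
X_train_std, X_test_std = scaler.transform(X_train_imp), scaler.transform(X_test_imp)

# optional: polynomial / interaction features as an extension of the linear model
poly = PolynomialFeatures(degree=2, include_bias=False)
X_train_poly = poly.fit_transform(X_train_std)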

Resampling

  • Repeatedly drawing samples from a training set and refitting a model on each sample to obtain more information about the fitted model
  • If we are in a data-rich situation, a good approach for both model selection and model assessment is to randomly divide the dataset into three parts: training set, validation set, and test set (50%, 25%, 25%)
    • The training set is used to fit the models
    • The validation set is used to estimate prediction error for model selection
    • The test set is used for assessment of the prediction error of the final chosen model
  • Model Assessment: having chosen a final model, estimating its prediction error on new data
    • Test error: the average prediction error of a machine learning method on new observations
    • Train error: the average loss over the training sample
    • Test error ≠ train error
      • normally the training error keeps decreasing as the model becomes more complex; the test error decreases at first, but then rises again because of overfitting
      • There are some methods to make mathematical adjustments to the training error rate in order to estimate the test error rate: Cp statistic, AIC, BIC
  • Model Selection: estimating the performance of different models in order to choose the best one
    • Validation set approach: fit models on the training set, evaluate them on the validation set, and pick the one with the lowest validation error rate (MSE)

      • Advantage: Conceptually simple and easy implementation
      • Disadvantage:
        • The validation set error rate (MSE) can be highly variable
        • Fewer observations are left in the training set
        • Thus, the validation set error rate may tend to overestimate the test error rate for the model fit on the entire data set
      • We can use Validation set for early stopping, such as in Neural Network and Boosting tree
    • Leave One Out CV:
      • a single observation is used for the validation set and the remaining observations (n-1) make up the training set. Repeat the process n times, and find the average error rate
      • Advantage:
        • LOOCV has far less bias and won’t overestimate the test error rate
        • There is no randomness in the training/validation splits, so we always get the same results
      • Disadvantage:
        • LOOCV is computationally expensive and has higher variance
    • K-fold CV: usually 5-fold or 10-fold
      • shuffle the data once at the beginning, then keep the order fixed
      • randomly divide the data set of into K folds
      • one fold as validation set, K-1 folds as training set
      • repeat K times and get the average
      • each of the K validation folds is distinct from the other K – 1 folds used for training
    • LOOCV vs. K-fold CV
      • In LOOCV, all trained models are very close to one another, so LOOCV does not model sampling from the population very well
      • In K-fold CV, the K models are different enough, so the K-fold CV error models repeated sampling from the population at the expense of some bias. Therefore, we choose K = 5 or K = 10, which empirically give a balanced variance-bias trade-off
  • CV on Classification Problems
    • Rather than using MSE, we use the number of misclassified observations
    • If there is a class imbalance issue:
      • we can use precision, recall, or F scores to evaluate the performance
      • sometimes stratified cross validation is used, which seeks to ensure that each fold has the same ratio of each class as the original data set.
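      • A small illustrative sketch of plain vs. stratified K-fold CV on an imbalanced classification problem (the simulated data, KNN classifier, and F1 scoring are just example choices):
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# imbalanced toy data (roughly 90% / 10%)
X, y = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=0)
clf = KNeighborsClassifier(n_neighbors=5)

# plain K-fold: folds may not preserve the class ratio
cv_plain = KFold(n_splits=5, shuffle=True, random_state=0)
# stratified K-fold: each fold keeps roughly the original class ratio
cv_strat = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# with class imbalance, F1 is often more informative than accuracy
print(cross_val_score(clf, X, y, cv=cv_plain, scoring='f1').mean())
print(cross_val_score(clf, X, y, cv=cv_strat, scoring='f1').mean())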
  • Bootstrap
    • The bootstrap is used to quantify uncertainty (variance) associated with a given estimator or machine learning method; it is a general tool for assessing statistical accuracy
    • Monte Carlo Simulation
      • To study the statistical properties of f(z), you draw multiple samples z(i) from the population, estimate fˆ(z(i)) from each sample, and build a distribution of the fˆ(z(i))
    • Bootstrap: randomly selected samples with the same size as the original sample are drawn from the training sample with replacement. This is done B times, producing B bootstrap datasets. Then we refit the model to each of the bootstrap datasets, and examine the behavior of the fits over the B replications.
    • The bootstrap approach allows us to use a computer to mimic the process of obtaining new data sets; this enables us to estimate the variability of our estimate without generating additional samples.
    • In more complex data situations, figuring out the appropriate way to generate bootstrap samples can require some thought
      • For example, if the data is a time series, we cannot simply sample the observations with replacement. However, we can instead create blocks of consecutive observations, and sample those with replacement. Then, we paste together sampled blocks to obtain a bootstrap dataset.
    • Bootstrap Percentile confidence interval
    1. Create B bootstrap datasets from original dataset with the same size N.
    2. Calculate the mean of each bootstrap sample m1, m2 ,…,mB. Consider them as a sample of sample means
    3. Order the sample means and call them m(1), m(2),…,m(B)
    4. The middle (1-α)·B of the ordered sample means yields the (1-α) confidence interval for the mean (see the sketch at the end of this section)
    • Each bootstrap sample has significant overlap with the original data, around 2/3

      • Solution: leave-one-out bootstrap, but has a training-set-size bias
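    • A minimal NumPy sketch of the bootstrap percentile confidence interval from steps 1-4 above (the data are simulated for illustration):
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=100)   # original sample of size N
B, alpha = 1000, 0.05

# 1-2. create B bootstrap datasets (same size N, sampled with replacement) and record their means
boot_means = np.array([rng.choice(data, size=len(data), replace=True).mean() for _ in range(B)])

# 3-4. order the sample means and take the middle (1 - alpha) fraction
lower, upper = np.percentile(boot_means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
print(f'{1 - alpha:.0%} bootstrap percentile CI for the mean: ({lower:.3f}, {upper:.3f})')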

Algorithm

KNN Classifier
  • Lazy learning algorithm: no training needed, just measure distances at prediction time
  • Choice of K: it is better to have an odd number
    • small K : high variance, low bias (if k = 1, training error will be 0)
    • large K: low variance and high bias
  • Advantages:
    • Simple to implement
    • Few tuning parameters: just K and the distance metric
    • Flexible: classes do not need to be linearly separable
  • Disadvantages:
    • Computationally expensive: we need to calculate the distance from the new observation to all training samples
    • Sensitive to imbalanced dataset: may get poor results for infrequent classes
    • Sensitive to irrelevant inputs: irrelevant inputs make distances less meaningful for identifying similar neighbors
  • From homework:
    • In Python, we use KNeighborsClassifier
    • Use different K to fit the model, comparing the train/test error to choose the optimal K
    • We can calculate the confusion matrix, true positive rate, true negative rate, precision score, f1 score to evaluate the performance
    • We can use different distance metrics in KNeighborsClassifier:
      • Euclidean/Minkowski: default, p = 2,
        sqrt(sum((x - y)^2))
      • Manhattan: p = 1, sum(|x - y|)
      • Chebyshev: p → ∞, max(|x - y|)
      • Mahalanobis: sqrt((x - y)’ V^-1 (x - y))
knn = KNeighborsClassifier(n_neighbors = 5, metric = 'mahalanobis',
                           metric_params = {'V': np.cov(X_train.values, rowvar = False)})
  • The majority polling decision can be replaced by weighted decision so that closer neighbors of a query point will have a greater influence than neighbors which are further away. Just change weights from “uniform” to “distance”
knn = KNeighborsClassifier(n_neighbors = 5, weights = 'distance')

Linear Regression

  • It assumes that the dependence of Y on X1,X2,… Xp is linear

  • The function is like Y = β0 + β1X1 + β2X2 + · · · + βpXp + ε

    • Assumptions of error terms:

      • ε1,ε2,…,εn, are uncorrelated since correlated error terms will affect the calculations of the standard error and confidence interval
      • independent from X
      • E[ε] = 0
      • error terms have a constant variance
  • We try to decrease the reducible error; the irreducible error (e.g. from missing or noisy data) cannot be reduced by a better model, so we set it aside.

  • Least square approach: choose the model that minimizes the RSS (Residual Sum of Squares)

  • An estimate (anything with a hat) is always a random variable. The standard deviation of the distribution of an estimate is its standard error (SE) (low variance is desirable). The standard error of an estimator reflects how it varies under repeated sampling.

  • These standard errors can be used to compute confidence intervals and perform hypothesis testing

  • A 95% confidence interval is defined as a range of values such that under repeated sampling (a very large number of times), 95% of times, the true parameter is within the confidence interval that we find

  • Hypothesis Testing: find out if X is significant or not

    • Single feature:

      • small sample size: using t-test
      • large sample size: using z-test
        • the z-score statistic follows a standard normal distribution N(0,1).
        • Note: This statement is true for any maximum likelihood estimate, provided that the number of samples is large
    • Multiple features:
      • Using F-test
    • In software, we can use P-value to evaluate the significance (small P-value is desirable)
  • Assessing the overall accuracy of the model:

    • RSE (Residual Standard Error): estimate the standard deviation of the noise
    • TSS(Total Sum of Squares): sum of the square of difference between each yi and y mean
    • Regression SS: Explained variation attributable to the linear relationship between X and Y
    • RSS: variation left unexplained by the linear relationship between X and Y
    • TSS = Regression SS + RSS
    • R^2 = Regression SS / TSS: the ratio of variation explained to total variation (we want higher R^2)
      • In simple linear regression setting (only X and Y), R^2 = r ^2, where r is the correlation between X and Y (r = Sxy/SxSy)
      • R^2 is between 0 and 1, r is between 1 and -1
  • Multiple Linear Regression:

    • in the ideal case the predictors are uncorrelated, so each coefficient can be estimated and tested separately
    • β0: the intercept, i.e. the expected value of Y when all Xi are zero
    • Variable Selection (Naive Method):
      • Forward Selection:

        • begin with the null model – Y = β0
        • fit p simple linear regressions and add to the null model the variable that results in the lowest RSS – Y = β0 + β1X1 …
        • continue until some stopping rule is satisfied (e.g. when all remaining variables have a P-value above some threshold) – Y = β0 + β1X1 + β2X2 + · · · + βpXp + ε
      • Backward Selection:
        • Start with all variables in the model – Y = β0 + β1X1 + β2X2 + · · · + βpXp + ε
        • Remove the variable with the largest P-value, so the new (p-1) variables model is fit – Y = β0 + β1X1 + β2X2 + · · · + βp-1Xp-1 + ε
        • Continue until a stopping rule is reached (e.g. we may stop when all remaining variables have a significant P-value defined by some significance threshold)
    • Regression model with qualitative predictors
      • Create dummy variables:

        • Xi1 = {1 if the person is Asian, 0 if the person is not Asian}
        • Xi2 = {1 if the person is Caucasian, 0 if the person is not Caucasian}
        • Yi = β0 + β1Xi1 + β2Xi2 + εi
          • Yi = β0 + β1 + εi, if the i th person is Asian
          • Yi = β0 + β2 + εi, if the i th person is Caucasian
          • Yi = β0 + εi, if the i th person is African American – baseline
        • There will always be one fewer dummy variable than the number of levels. If you use the same number of dummy variables as the number of levels, one of your variables would be “co-linear” with others.
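        • A small sketch of creating dummy variables in Python (the column and level names are made up); drop_first=True keeps one fewer dummy than the number of levels, so the dropped level becomes the baseline:
import pandas as pd

# hypothetical qualitative predictor with three levels
df = pd.DataFrame({'ethnicity': ['Asian', 'Caucasian', 'African American', 'Asian'],
                   'balance': [500, 600, 550, 700]})

# drop_first=True keeps K-1 dummies and avoids the collinearity mentioned above
dummies = pd.get_dummies(df['ethnicity'], drop_first=True)
X = pd.concat([df[['balance']], dummies], axis=1)
print(X.head())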
    • Extensions of the linear model: add Polynomial or Interaction terms to the model
      • Interactions: combine multiple predictors as a new predictor

        • Hierarchy Principle: if we include an interaction term in the model, we should also include the main effects, even if the P-values of the main effects are large (not significant); this is better for interpretation
        • Interaction can also be between Quantitative and Qualitative
      • We can use R^2 to evaluate if adding those terms is useful or not
  • Potential problems when fitting a linear regression model (detail: ISLR 3.3.3)

    • Non-linearity of the response-predictor relationships

      • Residual plots are a useful graphical tool for identifying non-linearity. If the pattern of the residual plot is obviously non-linear, we should consider adding non-linear terms to the model.
    • Correlation of error terms: if the error terms are correlated (e.g. time-series data), we may have an unwarranted sense of confidence in our model. The estimated standard errors will tend to underestimate the true standard errors, so the confidence intervals we compute will be narrower than they should be.
      • Plot the residuals from our model as a function of time. If the error terms are positively correlated, then we may see tracking in the residuals—that is, adjacent residuals may have similar values.
    • Non-constant variance of error terms (violate the assumption)
      • One can identify non-constant variances in the errors, or heteroscedasticity, from the presence of a funnel shape in the residual plot.
      • One possible solution is to transform the response Y using a concave function such as log Y or √Y. Such a transformation results in a greater amount of shrinkage of the larger responses, leading to a reduction in heteroscedasticity.
    • Outliers: observations for which the response yi is unusual given the predictor xi
      • If we believe that an outlier has occurred due to an error in data collection or recording, then one solution is to simply remove the observation. However, care should be taken, since an outlier may instead indicate a deficiency with the model, such as a missing predictor.
    • High-leverage points: observations with high leverage have an unusual value of xi
      • high-leverage observations tend to have a sizable impact on the estimated regression line; like outliers, we want to remove those high-leverage observations
      • In order to quantify an observation’s leverage, we compute the leverage statistic. A large value of this statistic indicates an observation with high leverage.
    • Collinearity: two or more predictor variables are closely related to one another.
      • collinearity results in a decline in the t-statistic, so the power of the hypothesis test is reduced
      • A simple way to detect collinearity is to look at the correlation matrix of the predictors
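      • A quick sketch of checking collinearity with the correlation matrix; as a common alternative not mentioned above, it also computes variance inflation factors with statsmodels (the simulated predictors are hypothetical):
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# hypothetical predictors, x2 nearly a copy of x1 (collinear)
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
df = pd.DataFrame({'x1': x1, 'x2': x1 + rng.normal(scale=0.01, size=200),
                   'x3': rng.normal(size=200)})

# simple check: look at the correlation matrix of the predictors
print(df.corr())

# alternative check: variance inflation factors (large VIF -> collinearity)
X = sm.add_constant(df)
vifs = [variance_inflation_factor(X.values, i) for i in range(1, X.shape[1])]
print(dict(zip(df.columns, vifs)))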
  • Advantages:

    • Simple approach to supervised learning
  • From homework

    • Dataset structure (how many rows/columns) – df.shape
    • Pairwise scatterplots of all the variables – sns.pairplot (seaborn)
    • Statistic summary of dataset – df.describe()
    • Predictors can all be statistically significant and yet have very different R^2 values, so their plots can look quite different and may misleadingly suggest that some predictors are not significant
    • In Python, when we fit a multiple regression model and we want to know their P-values, we can do the following steps (sm is statsmodels.api)
from patsy import dmatrices
y,X = dmatrices('PE ~ AT + V + AP + RH', data = ccpp, return_type= 'dataframe')
model = sm.OLS(y,X)
result1 = model.fit()
print(result1.pvalues)
  • In Python, if we want to know P-values, we need to use statsmodels.api library
X = ccpp.iloc[:,10]
X1 = sm.add_constant(X)
model = sm.OLS(y, X1)
result = model.fit()
coef.append(result.params)
print(result.summary())

Logistic Regression

  • Classification General Idea:

    • Often we are more interested in estimating the probabilities that X belongs to each category in set C (set of categories)
    • Multi-class classification: there are more than two classes in a classification task. The assumption is that each sample is assigned to one and only one label (e.g. an animal can be a horse or a bird, but cannot be both at the same time)
    • Multi-label classification: each sample has a set of target labels (e.g. topics that are relevant for a document)
    • We can also have both multi-class and multi-label in a dataset
  • Logistic Regression can guarantee the output is between 0 and 1
  • For single feature:
    • Parameters control shape and location of sigmoid curve

      • β0 controls location of midpoint
      • β1 controls slope of rise
    • The odds is p(x) / (1 - p(x)):
      • the decision boundary of the logistic classifier is p(x) / (1 - p(x)) = 1, i.e. p(x) = 0.5
      • log odds (logit): log(p(x) / (1 - p(x))) = β0 + β1x
    • One can see class in a binary classification problem as a Bernoulli random variable that can take two values 0 and 1:
      Pr(Y=1|X=x) = p(x)
      Pr(Y=0|X=x) = 1-p(x)

      • can be rewritten as Pr(Y=y|X=x) = [p(x)]^y [1-p(x)]^(1-y), y = 0 or 1

      • we assume that we have an independent sample, so for each pair (xi, yi), we have
        Pr(Yi=yi|Xi=xi) = [p(xi)]^yi [1-p(xi)]^(1-yi), yi = 0 or 1

      • Since we know xi, yi (data points), we just need to find the β0 and β1 that maximize the likelihood; this is called maximum likelihood estimation

        • In R, we use glm function to fit the model
        • In Python, we use linear_model.LogisticRegression
      • In a linear model, if the errors belong to a normal distribution, the least squares estimators are also the maximum likelihood estimators.

  • Multiple Features:
    • the decision boundary is still p(x) = 0.5
    • sometimes we may face a confounding problem (correlated features); it will affect the coefficients we get, but we can still use the model for data-driven decision-making
    • we may also face class imbalance problem
      • Marginal Imbalance: one class is rare compared to the other classes (Pr(Y=1) ≈ 0)

        • < 15% imbalanced
        • < 5% seriously imbalanced
      • Conditional Imbalance: it is easy to predict the correct labels in most cases
        • Pr(Y=1| X=0) ≈ 0 Pr(Y=1 | X=1) ≈ 1
      • Solution: SMOTE (Synthetic Minority Over-Sampling Technique)
from imblearn.over_sampling import SMOTE

smo = SMOTE()
X_smo, y_smo = smo.fit_resample(x_train,Y_train)
  • Case-Control sampling: we do not have the population data. Therefore, if our sample data is skewed, we can still estimate the regression parameters βj accurately (if our model is correct), but the constant term β0 will be incorrect.

    • Can correct the estimated intercept by a simple transformation (detail on p70)
    • Often cases are rare and we take them all; up to five times that number of controls is sufficient.
    • When you balance the data, you are increasing variance. We are making our algorithm overfit to the minority data.
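    • A hedged sketch of the intercept correction mentioned above, assuming the standard adjustment β0* = β0^ + log(π/(1−π)) − log(π̃/(1−π̃)), where π is the true population prevalence of cases and π̃ is the case proportion in the sample (all numbers below are made up):
import numpy as np

beta0_hat = -2.5   # intercept estimated from the case-control sample (hypothetical)
pi = 0.01          # assumed true prevalence of cases in the population
pi_tilde = 0.50    # proportion of cases in the balanced case-control sample

# corrected intercept: beta0* = beta0_hat + log(pi/(1-pi)) - log(pi_tilde/(1-pi_tilde))
beta0_corrected = beta0_hat + np.log(pi / (1 - pi)) - np.log(pi_tilde / (1 - pi_tilde))
print(beta0_corrected)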
  • Logistic Regression with more than two classes:

    • In R, we can use glmnet package
    • Here there is a linear function for each class
    • Multi-class logistic regression is also referred to as multinomial regression
  • From homework:
    • Sometimes, though you have multiple classes, if you only want to classify one class from the others, you can regard it as a binary classification problem. Take your target class as 1, others as 0.
    • To break each column in the dataset into (approximately) equal length time series, we can use np.array_split()
    • Applying Python’s Recursive Feature Elimination, which is a backward selection algorithm, to logistic regression.
      • from sklearn.feature_selection import RFECV – use RFECV to determine the number of features used in RFE
      • from sklearn.feature_selection import RFE – select which specific features to use
    • Normally, we can use linear_model.LogisticRegression, but if we want p-values, we need to use statsmodels
    • Singular Matrix Issue when using Logistic Regression in Python
      • This is because the default solver, Newton's method, needs the Hessian matrix of the training data to be invertible. If it is not, we need to switch to a quasi-Newton method, for example BFGS, which does not need to compute the inverse of the Hessian (see the sketch below).
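      • A small sketch combining the two notes above: getting P-values from statsmodels and switching to the quasi-Newton BFGS solver (the simulated data frame and variable names are hypothetical):
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# hypothetical data: binary response y and two predictors
rng = np.random.default_rng(0)
df = pd.DataFrame({'x1': rng.normal(size=200), 'x2': rng.normal(size=200)})
df['y'] = (df['x1'] + 0.5 * df['x2'] + rng.normal(size=200) > 0).astype(int)

# statsmodels gives P-values, unlike sklearn's LogisticRegression
model = smf.logit('y ~ x1 + x2', data=df)

# method='bfgs' is a quasi-Newton solver: it avoids inverting the Hessian,
# which fails when the Hessian of the training data is singular
result = model.fit(method='bfgs', maxiter=200)
print(result.pvalues)
print(result.summary())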
    • If we need to read folders from a big folder, we can use os.listdir() to get those folders. Then, we can try to read each csv file in each folder
import pandas as pd
import os

path = 'AReM'
folders = os.listdir(path)
folders.sort()
datasets = []
for folder in folders:
    list_ = os.listdir(path + '/' + folder)
    list_.sort()
    list_.sort(key = len)
    print(folder)
    for file in list_:
        print(file)
        df = pd.read_csv(path + '/' + folder + '/' + file, index_col = None,
                         header = 4, usecols = [0,1,2,3,4,5,6])
        datasets.append(df)
  • L1-penalized Binary logistic regression:

    • For L1-penalized logistic regression, you may want to use normalized/standardized features
    • CV for λ, the weight of L1 penalty (or C, the budget)
from sklearn.linear_model import LogisticRegressionCV

C_ = []
score = []
range_Cs = [1e-4, 1e-2, 1, 1e2, 1e4]
lrcv = LogisticRegressionCV(Cs = range_Cs, penalty = 'l1', solver = 'liblinear', cv = 5, max_iter = 300)
lr_new = lrcv.fit(X, y)
C_.append(lr_new.C_)
score.append(lr_new.score(X, y))
  • Multi-class classification

    • you may need to encode the class first
lrcv_multi = LogisticRegressionCV(Cs = range_Cs, penalty = 'l1', solver = 'saga', cv =5, multi_class = 'multinomial', max_iter = 10000)

(Bayesian) Discriminant Analysis

  • Here the approach is to model the distribution of X in each of the classes separately, and then use Bayes Theorem to flip things around and obtain Pr(Y|X) – posterior probability
  • When we use normal (Gaussian) distributions for each class, this leads to linear or quadratic discriminant analysis. Other distributions can be used as well, but we focus on normal distribution first
  • Bayes Theorem:

    • A, B: events
    • P(A|B): posterior probability; probability of A given B is true
    • P(B|A): probability of B given A is true
    • P(A): prior probability
    • P(B): independent probability of B
  • Bayes Theorem for classification:

    • fk(x) = Pr(X = x|Y = k) is the density for X in class k. Here we will use normal densities for these, separately in each class
    • πk = Pr(Y = k) is the marginal or prior probability for class k
    • Bayes’s Optimal Classifier: classify to the largest posterior
  • Why discriminant analysis?
    • When the classes are well-separated, logistic regression is unstable

      • The algorithms that solve logistic regression are iterative, and with perfectly separated classes the coefficient estimates diverge (the fitted sigmoid approaches a step function), so they never converge. You can force the algorithm to stop, but then the estimates β̂i and the P-values will be unreliable, though you may still get a good classifier
    • If n << p, the linear discriminant model is more stable
    • Linear discriminant analysis is popular when we have more than two response classes, because it also provides low-dimensional views of the data (in Python, we can choose how many components to use)
      • a discriminant variable (component) is a change of variables: a linear combination of the original features.

This is possible because there’s an inherent dimension reduction in LDA. LDA makes distance comparison in the space spanned by different class means. Two distinct points lie on a 1d line; three distinct points lie on a 2d plane. Similarly, K class means lie on a hyperplane with the number of dimensions at most (K-1). When making distance comparison in this space, distances orthogonal to this subspace would add no information since they contribute equally for each class. Hence, by restricting distance comparisons to this subspace only would not lose any information useful for LDA classification. That means, we can safely transform our task from a p-dimensional problem to a (K-1)-dimensional problem by an orthogonal projection of the data onto this subspace. When p is much larger than K, this is a considerable drop in the number of dimensions.
– from https://towardsdatascience.com/linear-discriminant-analysis-explained-f88be6c1e00b
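  • A brief sketch of LDA in sklearn, including the low-dimensional view mentioned above (at most K − 1 discriminant components); the iris data set is used only as an example:
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)          # 3 classes, 4 features

# at most K - 1 = 2 discriminant components for 3 classes
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)            # projection onto the (K-1)-dim subspace
print(X_lda.shape)                         # (150, 2)
print(lda.predict(X[:5]))                  # class predictions from the posterior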

Linear Discriminant Analysis
  • When p = 1 (only one predictor):

    • Gaussian density: fk(x) = (1/(sqrt(2π)σk)) exp(-(x - μk)^2 / (2σk^2))
    • Back to the Bayes formula, we assume all σk = σ (the same variance in every class)

      • we want to find the optimal k = k* that maximizes pk(x)
      • this simplifies to the discriminant score δk(x) = x·μk/σ^2 - μk^2/(2σ^2) + log(πk), which is a linear function of x
  • When p > 1 (multiple features):
    • Density: fk(x) = (2π)^(-p/2) |Σ|^(-1/2) exp(-(1/2)(x - μk)^T Σ^(-1) (x - μk))
    • Discriminant function: δk(x) = x^T Σ^(-1) μk - (1/2) μk^T Σ^(-1) μk + log(πk)

      • for simplification: δk(x) = x^T·C + C0 (C is a vector, C0 is a constant), i.e.
        δk(x) = ck0 + ck1x1 + ck2x2 + . . . + ckpxp
  • From δk(x) to probabilities
    • Once we have estimates δˆk(x), we can turn these into estimates of the class probabilities, so classifying to the largest δˆk(x) amounts to classifying to the class k for which Pr^(Y = k|X = x) is largest. This is close to what logistic regression does; we just find δˆk(x) in a different way.

        • When K = 2, we classify to class 2 if Pr^(Y = 2|X = x) ≥ 0.5, else to class 1

Quadratic Discriminant Analysis

  • By altering the forms of fk(x), we get different classifiers
  • With Gaussians but different covariance matrix Σ in each class, we get quadratic discriminant analysis (we will not assume all σk = σ)

Naive Bayes

  • With conditional independence model in each class we get Naïve Bayes. For Gaussian, this means the Σk are diagonal.

    • assume fk(x1, x2, …,xn) = fk(x1) fk(x2)… fk(xn), i.e. the features are conditionally independent within each class
    • New sample is classified to Y=k if πkΠi fk(xi) is maximal.
  • There is a problem: since we assume conditional independence, the final product will be 0 if any one of the conditional probabilities is 0 (e.g. a feature value never observed in a class). Therefore, if there is a class imbalance or sparse-data issue, the estimate will not be accurate

    • So, we can use some methods to improve the estimation:

      • Laplace
      • m-estimate
    • In Python (sklearn), it is recommended to turn Laplace smoothing on
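    • A tiny sketch of Laplace smoothing in sklearn's MultinomialNB, where alpha = 1.0 (the default) corresponds to Laplace (add-one) smoothing; the count data below are made up:
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# made-up word-count features for 6 documents, 2 classes
X = np.array([[2, 1, 0], [3, 0, 0], [0, 2, 3], [0, 1, 4], [1, 1, 1], [0, 0, 5]])
y = np.array([0, 0, 1, 1, 0, 1])

# alpha=1.0 is Laplace (add-one) smoothing: it keeps zero counts from
# forcing a whole class-conditional product to zero
nb = MultinomialNB(alpha=1.0)
nb.fit(X, y)
print(nb.predict([[1, 0, 2]]))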
  • Advantage:

    • good for computation (from joint probability to conditional probability)
    • Rather robust to isolated noise samples, since we average over large samples
    • Handles missing value by ignoring them (do not disregard the record/data point, just disregard the missing feature)
    • Rather robust to irrelevant attributes
  • Disadvantage:

    • strong assumption
    • Not robust to redundant attributes (correlated attributes), because they break down the conditional independence assumption
  • Classifier Summary (ISLR 4.5)

    • For a two class problem, LDA has the same form as LR. The difference is how the parameters are estimated

      • LR uses maximum likelihood based on Pr(Y|X) (known as discriminative learning: it learns the conditional probability distribution)
      • LDA uses the mean and variance estimated from a normal distribution, based on Pr(X,Y) (known as generative learning: it learns the joint probability distribution)
    • Despite these differences, in practice the results of LR and LDA are often very similar
    • LR can also add quadratic terms in the model
    • If Gaussian assumptions are met, LDA may be better than LR; But if Gaussian assumptions are not met, LR may be better
    • LR is good for Bernoulli classification (K=2)
    • LDA is useful when n is small, or the classes are well separated, and Gaussian assumptions are reasonable. Also when K>2.
    • KNN can dominate LDA and LR if the decision boundary is highly non-linear, but KNN cannot tell us which predictor is more important
    • QDA is a compromise between the non-parametric KNN method and the LDA and LR methods, since it can model a wider range of decision boundaries (quadratic terms)
    • Naïve Bayes is useful when p is very large.
  • Evaluation method for Classification Algorithms:

    • Confusion Matrix:

      • True Positive
      • False Positive
      • True Negative
      • False Negative
    • Sensitivity / Recall score/ TPR = TP/(TP + FN)
    • Specificity / TNR = TN/(TN + FP)
    • FPR = FP/(TN+FP) – 1 - Specificity
    • Positive Predictive Value / Precision = TP/(TP + FP)
    • Negative Predictive Value = TN/(TN+FN)
    • F1 score = 2* (Precision * Recall) / (Precision + Recall)
      • F1 score might be better than accuracy if there is imbalanced class issue
    • Fβ score: skewed compromise between precision and recall
      • Fβ = (1 + β^2) * (Precision * Recall) / (β^2 * Precision + Recall)

        • β>1, emphasis on Recall
        • β between [0,1), emphasis on Precision
        • β = 1, Fβ = F1
    • Evaluation for multiple-class:
      • A macro average just averages the individually calculated scores of each class; it weights each class equally

        • PRE macro = (PRE1 + PRE2 + … +PREk) / k
      • A micro average calculates the metric by first pooling all instances of each class; it weights each instance equally
        • PRE micro = (TP1+TP2+…+TPk) / (TP1+TP2+…+TPk + FP1+FP2+…+FPk)
    • Note that all of the measures used to evaluate types of error can be computed over both training and test sets
    • To optimize the model, we can change threshold to some value in [0,1]
    • ROC plot & AUC score:
      • FPR is horizontal axis, TPR is vertical axis
      • Area under ROC curve is AUC score
      • We can show multiple class in one plot
  • From Homework:

    • Gaussian Naive Bayes: from sklearn.naive_bayes import GaussianNB
    • Multinomial priors Naive Bayes: from sklearn.naive_bayes import MultinomialNB
    • Confusion Matrix: cm = metrics.confusion_matrix (y_train,y_pred)
    • Evaluation scores: sklearn.metrics
    • ROC & AUC: ROC and AUC need y_score (probabilities or decision scores), not y_pred
# for binary class
y_pred = model.predict(X_train)
y_score = model.decision_function(X_train)
roc = metrics.roc_curve(y_train, y_score)
roc_auc = metrics.roc_auc_score(y_train, y_score)
plt.plot(roc[0],roc[1],label = 'ROC Curve')
plt.plot([0,1],[0,1])   # diagonal line
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.legend()
plt.show()

# for multiple class
from sklearn.preprocessing import label_binarize
from itertools import cycle

n_classes = 7
labels = ['bending1','bending2','cycling','lying','sitting','standing','walking']
fpr = dict()
tpr = dict()
roc_auc = dict()
y_test = label_binarize(y_test, classes=list(range(n_classes)))   # Binarize labels in a one-vs-all fashion
y_score_ = multi_y_score[4]
for i in range(n_classes):
    fpr[i], tpr[i], _ = metrics.roc_curve(y_test[:,i], y_score_[:,i])
    roc_auc[i] = metrics.auc(fpr[i], tpr[i])
colors = cycle(['blue', 'red', 'green', 'pink', 'black', 'yellow', 'orange'])
for i, color in zip(range(n_classes), colors):
    plt.plot(fpr[i], tpr[i], color=color, lw=1.5,
             label='ROC curve of class {0} (AUC = {1:0.2f})'.format(labels[i], roc_auc[i]))
plt.plot([0, 1], [0, 1], 'k', lw=1.5)
plt.xlim([-0.05, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC for multi-class data')
plt.legend(loc="lower right")
plt.show()
Feature Selection & Regularization
  • To improve the simple linear model, the ordinary least squares fitting can be replaced with some alternative fitting procedures. It is good for:

    • Prediction Accuracy: especially when p > n, to control variance
    • Model Interpretability: By removing irrelevant features — that is, by setting the corresponding coefficient estimates to zero — we can obtain a model that is more easily interpreted.
  • Three classes of methods:
    • Subset Selection: Identifying a subset of the p predictors that are related to the response and fitting a model using least squares on the reduced set of variables
    • Shrinkage/ Regularization: Fitting a model involving all p predictors, but the estimated coefficients are shrunken towards zero relative to the least squares estimates
    • Dimension Reduction: The p predictors are projected into a M-dimensional subspace, where M < p
  • Subset Selection:
    • The best way is to fit the model with every one, every two … every p predictors and find the best one with smallest RSS or largest R^2 in each size as M0, M1,…,Mp. Then, select a single best model from among M0, . . . , Mp using cross-validated prediction error, Cp (AIC), BIC, or adjusted R^2

      • BUT the best-subset approach cannot be applied with very large p: it is computationally expensive and will lead to overfitting and high variance of the coefficient estimates
    • Stepwise Selection
      • Forward Stepwise Selection:

        • Start with null model, add the term one by one (one with smallest RSS or largest R^2). Then, we will get p+1 models with different number of terms M0, M1, … Mp. Select a single best model from among M0, . . . , Mp using cross-validated prediction error, Cp (AIC), BIC, or adjusted R^2
      • Backward Stepwise Selection:
        • Begins with the full model containing all p predictors and then iteratively removes the least useful predictor, one at a time. So, we will also get M0, M1, …, Mp (this time, Mp has the fewest predictors). Then, select a single best model from among M0, . . . , Mp using cross-validated prediction error, Cp (AIC), BIC, or adjusted R^2
      • NOTE: Backward stepwise selection requires n>p. In contrast, Forward stepwise selection can be used when n<p. Therefore, forward stepwise selection is the only viable subset method when p is very large
      • Both Forward and Backward stepwise selection only need 1+ p*(p+1)/2 models, way easier than the best subset approach
      • BUT both Forward and Backward stepwise selection cannot guarantee to yield the best model
    • Evaluation of the optimal model
      • We don’t use RSS or R^2 for selecting the best model since they are about training error, but we want low test error
      • We can indirectly estimate test error by making an adjustment to the training error to account for the bias due to overfitting
        • These techniques adjust the training error for the model size, and can be used to select among a set of models with different numbers of variables

          • Mallow’s Cp: small is desirable
          • AIC: small is desirable
            • In the case of the linear model with Gaussian errors, maximum likelihood and least squares are the same thing, and Cp and AIC are equivalent.
          • BIC: small is desirable; results in the selection of smaller models than Cp
          • Adjusted R^2: large is desirable
            • The intuition is that once all of the correct variables have been included in the model, adding an additional noise variable leads to only a very small decrease in RSS, but it increases d, so (n - d - 1) decreases; hence RSS/(n - d - 1) increases and the adjusted R^2 decreases.
      • We can directly estimate test error using CV or other approaches
        • We just need to use CV to choose the optimal k, and return the model Mk^
        • This approach provides a direct estimate of the test error, and doesn’t require an estimate of the error variance σ^2
  • Shrinkage / Regularization Method (Ridge and Lasso)
    • we can fit a model containing all p predictors using a technique that constrains or regularizes the coefficient estimates, or equivalently, that shrinks the coefficient estimates towards zero
    • It may not be immediately obvious why this should improve the fit, but it turns out that shrinking the coefficient estimates can significantly reduce their variance
      • var(yˆ) = var(βˆ0+βˆ1X1+…+βˆpXp), so βˆi goes smaller, the variance becomes smaller
    • λ≥ 0 is a tuning parameter, to be determined separately by CV, we choose the one with smallest error rate. Finally, the model is refit using all observations and the best λ. Note that β0 is not regularized
    • λ ∈ {10^-4, 10^-3, 10^-2, 10^-1, 1, 10, 100, 1000, 10000}
    • Ridge Regression (L2 penalty)
      • it is best to apply ridge regression after standardizing the predictors
      • when we pick the optimal λ, we can reduce the variance significantly, with very little increase in bias
    • Lasso Regression (L1 penalty)
      • The L1 penalty has the effect of forcing some of the coefficient estimates to be exactly equal to zero when the tuning parameter λ is sufficiently large
      • the Lasso performs variable selection
    • Why Lasso can force coefficient estimates to exactly 0?
      • value of constraint: s; Hyper parameter, found by CV

        • when λ goes bigger, s becomes smaller; when λ goes smaller, s becomes bigger
      • To plot the relation, RSS can be shown as a contour (ellipse’s function). As β^ goes smaller, the RSS becomes bigger and the contour becomes bigger. When the contour first contacts the constraint area, we pick that β^. The lasso constraint has corners at each of the axes, and so the ellipse will often intersect the constraint region at an axis. However, since ridge regression has a circular constraint with no sharp points, this intersection will not generally occur on an axis, and so the ridge regression coefficient estimates will be exclusively non-zero, but just close to 0.
      • constraint regions for different power (q):
        • q > 1: does not encourage sparsity (cannot force coefficients to exactly 0)
        • q < 1: non-convex constraint region -> non-convex optimization problem (can force coefficients to 0)
      • The constraint region should be a closed region
    • Conclusion:
      • We cannot say Lasso nor the Ridge Regression dominate the other, it depends
      • In general, we expect Lasso to be better when only a small subset of predictors is related to the response. However, we never know a priori how many predictors are related to the response; we can use CV to determine which approach is better on a particular data set
      • Ridge regression more or less shrinks every dimension of the data by the same proportion; Lasso more or less shrinks all coefficients toward 0 by a similar amount, and sufficiently small coefficients are shrunken all the way to 0.
    • Elastic Net (combine Ridge and Lasso)
      • Lasso is good for eliminating irrelevant variables, but it does not handle redundant variables
      • Therefore, if we have groups of very correlated variables, we can use Elastic Net to combine Ridge and Lasso.
      • α ∈ [0,1]. When α = 1, it reduces to the L1 penalty (Lasso); when α = 0, it reduces to the squared L2 penalty (Ridge)
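      • A short sketch of the elastic net in sklearn; note that sklearn's l1_ratio plays the role of α above, while its alpha parameter is the overall penalty weight λ (the data are simulated):
import numpy as np
from sklearn.linear_model import ElasticNetCV

# simulated data with a redundant (highly correlated) pair of predictors
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
X[:, 1] = X[:, 0] + rng.normal(scale=0.05, size=200)   # x1 ~ x0
y = 3 * X[:, 0] + 2 * X[:, 2] + rng.normal(size=200)

# l1_ratio mixes the L1 and L2 penalties; alpha (i.e. lambda) is chosen by CV
enet = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0], cv=5, random_state=0)
enet.fit(X, y)
print(enet.l1_ratio_, enet.alpha_)
print(enet.coef_)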
  • Dimension Reduction Methods
    • Transform the predictors and then fit a least squares model using the transformed variables. Only feature transformation, not feature selection
    • Let Z1, Z2, …, ZM represent M < p linear combinations of our original p predictors
    • Dimension reduction serves to constrain the estimated βj coefficients, so it will reduce variance
    • Principal Component Regression (unsupervised):
      • The first principal component is that (normalized) linear combination of the variables with the largest variance. Subject to Σφ1j^2 = 1. Since we are only interested in variance, we assume that each of the variables in X (vector) has been centered to have mean zero (that is, the column means of X are zero)

        • Z1 = φ11X1 + φ12X2 +…+ φ1pXp (φ1j can be found by eigen decomposition)
        • The first principal component loading vector has a very special property: it defines the line in p-dimensional space that is closest to the n observations (using average squared Euclidean distance as a measure of closeness)
      • The second principal component has largest variance, subject to being uncorrelated with the first PC (orthogonal to the 1st PC)
      • Each principal component loading vector is unique, up to a sign flip (the sign may be different, but value is unique)
      • In PCR, the number of principal components, M, is typically chosen by CV
      • When performing PCR, standardizing each predictor prior to generating the principal components is recommended. This standardization ensures that all variables are on the same scale. PCA was performed after standardizing each variable to have mean zero and standard deviation one. In the absence of standardization, the high-variance variables will tend to play a larger role in the principal components obtained, and the scale on which the variables are measured will ultimately have an effect on the final PCR model.
        • In certain settings (e.g. genes), however, the variables may be measured in the same units. In this case, we might not wish to scale the variables to have standard deviation one before performing PCA.
      • Proportion of Variance Explained (PVE)
        • Evaluation of each Principal component (how much information that PCs catch)
        • the PVE of the mth principal component is given by PVE_m = (Σi z_im^2) / (Σj Σi x_ij^2), i.e. the variance of the mth component divided by the total variance
        • The PVE of each principal component is a positive quantity
        • In total, there are min(n − 1, p) principal components, and their PVEs sum to one
          • we are usually not interested in all of the PCs; we would like to use the smallest number of principal components required to get a good understanding of the data
          • We typically decide on the number of principal components required to visualize the data by examining a scree plot, and find the elbow point to determine the number of PCs to use
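        • A minimal sketch of computing the PVE of each component in sklearn; explained_variance_ratio_ is exactly the PVE, and the cumulative sum is what one inspects in a scree plot (the data are simulated):
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

# simulated data: 100 observations, 6 variables
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))

pca = PCA()
pca.fit(scale(X))                     # standardize before PCA, as noted above

pve = pca.explained_variance_ratio_   # PVE of each principal component
print(pve)                            # positive quantities that sum to 1
print(np.cumsum(pve))                 # cumulative PVE; look for the elbow in a scree plot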
      • In R, we use biplot to plot the PCA. It displays both the principal component scores and the principal component loadings
      • Drawback: Since PCR does not use response Y, the response does not supervise the identification of the PCs, so there is no guarantee that the directions that best explain the predictors will also be the best directions to use for predicting the response
    • Partial Least Squares (a supervised alternative to PCR)
      • Like PCR, PLS is a dimension reduction method, which first identifies a new set of features Z1,…,ZM that are linear combinations of the original features, and then fits a linear model via least squares using these M new features. But unlike PCR, PLS identifies these new features in a supervised way—that is, it makes use of the response Y in order to identify new features that not only approximate the old features well, but also that are related to the response.
      • from Stanford STATS 202 Lecture Notes
    • PLS vs. PCR
      • To form each component, PLS give more weight to the variables/residuals that are more correlated with the response
      • To form each residual, PLS removes the information explained by the previous components
      • In practice, PLS often performs no better than ridge regression or PCR
      • Compared to PCR, PLS has less bias and more variance (a stronger tendency to overfit)

Tree-based Methods

  • Decision Tree (Regression)

    • Easy to interpret, but not competitive on prediction accuracy
    • These involve stratifying or segmenting the predictor space into a number of simple regions
    • Tree-building Process
      1. Divide the predictor space to non-overlapping regions R1,R2,…,Rj
      2. For every observation that falls into the region Rj, we make the same prediction, which is simply the mean of the response values for the training observations in Rj
      • The goal is to find R1, R2, …, Rj that minimize the RSS
    • BUT it is computationally infeasible to consider every possible partition of the feature space into J regions, so we take a top-down, greedy approach known as recursive binary splitting
      • It is greedy because at each step of the tree-building process, the best split is made at that particular step, rather than looking ahead and picking a split that will lead to a better tree in some future step
      • We first select the predictor Xj and cut point s such that splitting into the two regions {X|Xj < s} and {X|Xj >= s} leads to the smallest RSS
      • Then, we split one of two previously identified regions with cut point s and go on (minimize the RSS)
      • The process continues until a stopping criterion is reached or we can prune the tree
    • Terminology for Trees
      • The points along the tree where the predictor space is split are referred to as internal nodes
      • Regions R1, R2, …, Rj are known as terminal nodes
    • Prune a Tree
      • If there are too many splits, the tree may overfit the data
      • A good strategy is to grow a very large tree and then prune it back in order to obtain a subtree. A smaller tree with fewer splits might lead to lower variance and better interpretation at the cost of a little bias.
      • Intuitively, our goal is to select a subtree that leads to the lowest test error rate, and we can use cost complexity pruning method. Rather than considering every possible subtree, we consider a sequence of trees indexed by a nonnegative tuning parameter α (chosen by CV)
        • α controls a trade-off between the subtree’s complexity and its fit to the training data
        • When we get the optimal α, and then return to the full data set and obtain the subtree corresponding to the chosen α
      • Try to minimize this equation
        • |T| indicates the number of terminal nodes of the tree T
        • yˆRm is the mean of the training observations in Rm (sample mean)
      • from ISLR
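      • A brief sketch of cost-complexity pruning with sklearn (not the exact ISLR algorithm, but the same idea: compute the sequence of subtrees indexed by α, then choose α by CV); the regression data are simulated:
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

# cost_complexity_pruning_path returns the sequence of effective alphas,
# each corresponding to a subtree of the fully grown tree
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)
alphas = np.clip(path.ccp_alphas, 0, None)   # guard against tiny negative values

# choose alpha by cross-validation; GridSearchCV refits on the full training data
gscv = GridSearchCV(DecisionTreeRegressor(random_state=0),
                    param_grid={'ccp_alpha': alphas}, cv=5,
                    scoring='neg_mean_squared_error')
gscv.fit(X, y)
print(gscv.best_params_)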
  • Decision Tree (Classification)
    • Similar to Regression, but response is Qualitative and we predict by the most commonly occurring class
    • We cannot use RSS; a natural alternative is the classification error rate, the fraction of training observations in the region that do not belong to the most common class: E = 1 − max_k(pˆmk)
    • However, classification error is not sufficiently sensitive for tree-growing, and in practice two other measures are preferable when we care about node purity
      • Gini Index (data purity):

        • if pˆmk is close to 0 or 1, G will be small. A small G indicates that a node contains predominantly observations from a single class
      • Deviance/cross-entropy
        • Since 0 ≤ pˆmk ≤ 1, it follows that 0 ≤ −pˆmk log(pˆmk); the entropy will take a value near zero if the pˆmk’s are all near zero or near one. We want a small entropy score
      • It turns out that the Gini index and the cross-entropy are very similar numerically.
      • Classification error rate, Gini index, and Cross-Entropy might be used when pruning the tree, but the classification error rate is preferable if prediction accuracy of the final pruned tree is the goal
    • Advantages:
      • Easy interpretation
      • Display graphically
      • Trees can easily handle qualitative predictors without the need to create dummy variables
    • Disadvantages:
      • Low predictive accuracy
      • Trees can be very non-robust. A small change in the data can cause a large change in the final estimated tree.
      • BUT we can use ensemble method: Bagging, Random Forest, Boosting
  • Bagging
    • Bootstrap aggregation
    • reduce variance, improve predictive accuracy
    • While bagging can improve predictions for many regression methods, it is particularly useful for decision trees.
      • fˆ_bag(x) = (1/B) Σb fˆ*b(x), where fˆ*b(x) is the prediction of the model fit on the bth bootstrap set
    • For classification, we record the class predicted by each of the B trees and take majority vote: the overall prediction is the most commonly occurring class among the B predictions
      • Weighted Majority Polling: assign more votes to classifiers that you trust more (a measure of confidence can be 1/variance)
    • Out-of-bag Error Estimation (test error measurement)
      • on average, each bagged tree makes use of around two-thirds of the observations. The remaining one-third of the observations not used to fit a given bagged tree are referred to as the out-of-bag (OOB) observations
      • We can predict the response for the ith observation using each of the trees in which that observation was OOB. This will yield around B/3 predictions for the ith observation, which we average. This average OOB prediction result will be the test error for bagged model
      • Drawbacks: it is no longer clear which variables are most important to the procedure, so bagging improves prediction accuracy at the expense of interpretability
      • We can obtain an overall summary of the importance of each predictor using the RSS (for bagging regression trees) or the Gini index (for bagging classification trees)
        • In the case of bagging regression trees, we can record the total amount that the RSS is decreased due to splits over a given predictor, averaged over all B trees. A large value indicates an important predictor.
        • Similarly, in the context of bagging classification trees, we can add up the total amount that the Gini index is decreased by splits over a given predictor, averaged over all B trees.
  • Random Forest
    • Improved Bagged Trees by way of a small tweak that de-correlates the trees. This reduces the variance when we average the trees.
    • A random selection of m predictors is chosen as split candidates from the full set of p predictors, m ≈ √p
      • Using a small value of m in building a random forest will typically be helpful when we have a large number of correlated predictors (e.g. gene experiment)
    • In building a random forest, at each split in the tree, the algorithm is not even allowed to consider a majority of the available predictors
      • Reason: suppose there is one very strong predictor along with a number of other moderately strong predictors. In bagging, almost every tree will split on the strong predictor first, which makes all the bagged trees look similar (highly correlated). Random forests overcome this problem and make the result more accurate
  • Boosting
    • Like bagging, boosting is a general approach that can be applied to many statistical learning methods for regression or classification, not only for decision tree
    • Each tree in bagging is independent of the other trees, but each tree in boosting is grown using information from previously grown trees (use residuals to improve the models)
    • Boosting does not involve bootstrap sampling; instead each tree is fit on a modified version of the original data set.
    • Unlike fitting a single large decision tree to the data, which amounts to fitting the data hard and potentially overfitting, the boosting approach instead learns slowly
    • from ISLR
      • Given the current model, we fit a decision tree to the residuals from the model. We then add this new decision tree into the fitted function in order to update the residuals
      • Each of these trees can be rather small, with just a few terminal nodes, determined by the parameter d in the algorithm
      • By fitting small trees to the residuals, we slowly improve fˆ in areas where it does not perform well. The shrinkage parameter λ slows the process down even further, allowing more and different shaped trees to attack the residuals
    • Three tuning parameters of Boosting Tree:
      • B: Number of trees. We need to use CV to select B. If B is too large, boosting can overfit.

        • BUT, Training boosting trees is not fast, so sometimes we may not want to CV since it takes too much time. So, we can use a validation set for early stopping. When the validation error starts to increase, we stop.
      • λ: Shrinkage parameter. A small positive number. Typical values are 0.01 or 0.001
      • d: Number of splits in each tree. It controls the complexity of the boosted ensemble. Often d=1 works well
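      • A small sketch mapping the three tuning parameters above onto sklearn's GradientBoostingRegressor (B -> n_estimators, λ -> learning_rate, d -> max_depth), with early stopping on an internal validation split; the data are simulated:
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

# B = n_estimators, lambda = learning_rate, d = max_depth (stumps when d = 1);
# n_iter_no_change uses an internal validation split for early stopping
gbr = GradientBoostingRegressor(n_estimators=2000, learning_rate=0.01, max_depth=1,
                                n_iter_no_change=10, validation_fraction=0.1,
                                random_state=0)
gbr.fit(X_train, y_train)
print(gbr.n_estimators_, gbr.score(X_val, y_val))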
    • Extremely Randomized Trees
      • The idea is to make the trees even weaker, but to compensate that with the large number of estimators in the ensemble
      • Not only the features and samples are randomized, but also we don’t choose the optimal predictor to split, we just randomly grow each tree
      • This makes weaker trees, but makes training each tree faster
    • Boosting Classification:
      • Boosting iteratively learns weak classifiers

        • A weak classifier is one whose error rate is only slightly better than random guessing
      • The purpose of boosting is to combine a sequence of weak classifier to be a strong classifier. The predictions from all of them are then combined through a weighted majority vote to produce the final prediction.
        • Up-weight data points that are difficult, i.e. incorrectly classified in the previous round
        • Down-weight data points that are easy, i.e. correctly classified in the previous round
      • Boosting classification intuition:
        • We adaptively weigh each data case
        • Data cases which are wrongly classified get high weight (the algorithm will focus on them)
        • Each boosting round learns a new (simple) classifier on the weighted dataset
        • These classifiers are weighted to combine them into a single powerful classifier
        • Classifiers that obtain low training error rates have high weight
        • We stop by CV or Early Stopping using validation set
    • Advantage:
      • Boosting is remarkably resistant to overfitting, and it is fast and simple
      • It improves the performance of many kinds of machine learning algorithms, not only decision tree
    • Boosting does not work when:
      • There is not enough data
      • Base learner is too weak or too strong (e.g. Boosting cannot improve Random Forest to 99.99% accuracy)
      • Susceptible to noisy data
  • Stacking (ensemble method)
    • Basic idea: use the output of multiple classifiers as input to a meta-model (meta-learner). We ‘stack’ the meta-model on top of the base models
    • However, Naïve implementation of stacking prefers over-fitted models
      • Underlying problem: the outputs of the base models have been adapted to the labels. To avoid a preference for overfitted models, the inputs to the meta-model should not have seen the labels for the data points themselves
    • Usually, the meta-model is relatively simple (e.g. linear regression or logistic regression)
    • Empirical Recommendation: do not use one of your base learners as the meta-learner.
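    • A compact sketch of stacking with sklearn's StackingClassifier, which trains the meta-model on out-of-fold base predictions (internal CV) and so addresses the label-leakage concern above; the base and meta learners here are just example choices:
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

base_learners = [('rf', RandomForestClassifier(n_estimators=100, random_state=0)),
                 ('knn', KNeighborsClassifier(n_neighbors=5))]

# a simple logistic regression is stacked on top as the meta-learner;
# cv=5 means the meta-model only sees out-of-fold base predictions
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression(), cv=5)
print(cross_val_score(stack, X, y, cv=5).mean())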
  • From Homework:
    • SMOTE is essentially a time-consuming method (4 hours for one procedure)
    • Use a data imputation technique to deal with the missing values in the data set
from sklearn.impute import SimpleImputer
import numpy as np

data1 = train.iloc[:,5:128].replace('?', np.nan)
imp_mean = SimpleImputer(missing_values = np.nan, strategy = 'mean')
imp_mean.fit(data1)
new_data = imp_mean.transform(data1)
new_data = pd.DataFrame(new_data)
  • Fit a Ridge regression model on the training set, with λ chosen by cross-validation.
from sklearn.linear_model import RidgeCV

range_ = [10e-3, 10e-2, 10e-1, 1, 10, 100, 1000]
clf = RidgeCV(alphas = range_, cv = 5)
clf.fit(X_train, y_train)
y_pred_l2 = clf.predict(X_test)
l2_test_error = mean_squared_error(y_test,y_pred_l2)
l2_test_error
  • Fit a Lasso regression model on the training set, with λ chosen by cross-validation.
from sklearn.linear_model import LassoCV

# without standardization
reg = LassoCV(alphas = range_, cv = 5, random_state = 0)
reg.fit(X_train, y_train)
y_pred_l1 = reg.predict(X_test)
l1_test_error = mean_squared_error(y_test, y_pred_l1)

# with standardization
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scale = scaler.fit_transform(X_train)
X_test_scale = scaler.transform(X_test)  # transform (do not refit) the test samples with the scaler fitted on the training set
reg.fit(X_train_scale, y_train)
y_pred_l1_scale = reg.predict(X_test_scale)
l1_test_error_scale = mean_squared_error(y_test, y_pred_l1_scale)
  • Fit a PCR model on the training set, with M (the number of principal components) chosen by cross-validation

    • sklearn does not have an implementation of PCA and regression combined, so we need to do PCA first to get all components. Then, use CV to choose the number of components we use and use those chosen components to make prediction
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.preprocessing import scale
import matplotlib.pyplot as plt

pca = PCA()
X_reduced = pca.fit_transform(scale(X_train))  # do not forget to scale the data
print(len(pca.components_))

# 10-fold CV with shuffle
n = len(X_reduced)
kf_10 = KFold(n_splits=10, shuffle=True, random_state=0)
regr = LinearRegression()
mse_train = []

# MSE with only the intercept
score = -1 * cross_val_score(regr, np.ones((n, 1)), y_train.ravel(),
                             cv=kf_10, scoring='neg_mean_squared_error').mean()
mse_train.append(score)

# MSE using CV for the 122 principal components, adding one component at a time
for i in np.arange(1, 123):
    score = -1 * cross_val_score(regr, X_reduced[:, :i], y_train.ravel(),
                                 cv=kf_10, scoring='neg_mean_squared_error').mean()
    mse_train.append(score)

# Plot results
plt.figure(figsize=(10, 8))
plt.plot(mse_train, '-v')
plt.xlabel('Number of principal components in regression')
plt.ylabel('MSE')
# From the plot, we can choose M = 7

X_reduced_test = pca.transform(scale(X_test))[:, :7]  # do not forget to scale the data
regr = LinearRegression()
regr.fit(X_reduced[:, :7], y_train)

# Prediction with test data
y_pred_PCR = regr.predict(X_reduced_test)
PCR_test_error = mean_squared_error(y_test, y_pred_PCR)
  • Use XGBoost to fit the model tree. Determine α (the L1 regularization term) using cross-validation.
import xgboost as xgb
from sklearn.model_selection import GridSearchCV

dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

param = {'reg_alpha': [1e-3, 1e-2, 0.1, 1, 100, 1000]}
gscv = GridSearchCV(estimator=xgb.XGBRegressor(learning_rate=0.1,
                                               objective='reg:squarederror',
                                               seed=123),
                    param_grid=param, cv=5)
gscv.fit(X_train, y_train)
gscv.best_params_   # alpha = 1

xg_reg = xgb.XGBRegressor(objective='reg:squarederror', colsample_bytree=0.3,
                          learning_rate=0.1, alpha=1)
xg_reg.fit(X_train, y_train)
y_pred_xg = xg_reg.predict(X_test)
xg_test_error = mean_squared_error(y_test, y_pred_xg)
  • Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(oob_score=True)
rf.fit(x_train, Y_train)

y_pred_train = rf.predict(x_train)
y_pred_test = rf.predict(x_test)
y_score_train = rf.predict_proba(x_train)[:, 1]  # predicted probability of the positive class
y_score_test = rf.predict_proba(x_test)[:, 1]

# OOB error
oob_error = 1 - rf.oob_score_
  • In a univariate tree, only one input dimension is used at each split. In a multivariate tree, or model tree, all input dimensions can be used at a decision node, so it is more general. In univariate classification trees, the decision rule at a node involves a single feature (Xj > s) and the class at a leaf is decided by majority vote. In model trees, a (linear) model that relies on all of the variables determines the split at a node (i.e. instead of Xj > s, the decision rule is Σ βjXj > s). Likewise, in the regression setting, instead of predicting with the average response in the region associated with each leaf (as a regression tree does), a model tree fits a linear regression model to determine the value associated with that leaf.

    • Use Python to call Weka (Java) and train Logistic Model Trees for classification

Support Vector Machines

  • Hyperplane

    • A hyperplane in P dimensions is a flat affine subspace of dimension P-1 (e.g. In p = 2 dimensions a hyperplane is a line)
    • In general the equation for a hyperplane has the form
      β0 + β1X1 + β2X2 + . . . + βpXp =0, if β0 = 0, the hyperplane goes through the origin
    • The vector β= (β1,β2,···,βp) is called the normal vector — it points in a direction
      orthogonal to the surface of a hyperplane
    • If f(x) = β0 + β1X1 + β2X2 + . . . + βpXp
      • f(x)>0 goes for one class
      • f(x)<0 goes for another class
  • Maximal Margin Classifier
    • Among all separating hyperplanes, find the one that makes the biggest gap or margin between the two classes
    • The fact that the maximal margin hyperplane depends directly on only a small subset of the observations (the support vectors) is an important property
    • For non-separable or noisy data, the maximal margin classifier cannot perform well, so we need the support vector classifier, which maximizes a
      soft margin
  • Support Vector Classifier (linear boundary, soft margin)
    • Greater robustness to individual observations
    • Better classification of most of the training observations
    • The support vector classifier allows some points to violate the margin, or even the hyperplane, with slack variables εi measuring the violations
      • If εi = 0 then the ith observation is on the correct side of the margin
      • If εi> 0 then the ith observation is on the wrong side of the margin, so it violates the margin
      • If εi > 1 then it is on the wrong side of the hyperplane
    • BUT we need a hyperparameter C, a regularization parameter, that determines how tolerant we can be. The tuning parameter C is like a budget for the amount that the margin can be violated by the n observations
      • If C=0, then there is no budget for violations to the margin, so all ε ’s = 0,
        which leads to the maximal margin hyperplane optimization problem
      • For C > 0, no more than C observations can be on the wrong side of the hyperplane, because if an observation is on the wrong side of the hyperplane then εi >1, and the optimization problem requires that the sum of εi ’s be less than C
      • When C is large, we have a wider margin (more violations tolerated): higher bias, lower variance
      • When C is small, we have a narrower margin and it is easy to overfit: low bias, but high variance
      • C can be chosen by CV, and controls the bias-variance trade-off of the statistical learning technique
    • Support Vectors
      • Only observations that either lie on the margin or that violate the margin will affect the hyperplane, and hence the classifier obtained, which are support vectors
      • The fact that the support vector classifier’s decision rule is based only on a potentially small subset of the training observations (the support vectors) means that it is quite robust to the behavior of observations that are far away from the hyperplane (outliers)
        • Compared to LDA: LDA needs the mean and within-class covariance matrix computed using all of the observations
        • Compared to logistic regression: unlike LDA, logistic regression has very low sensitivity to observations far from the decision boundary. In fact, the support vector classifier and logistic regression are closely related.
    • However, sometimes a linear boundary simply won’t work, no matter what value of C
  • Support Vector Machines
    • Feature Expansion (add non-linear terms):

      • Enlarge the space of features by including non-linear terms; hence go from a p-dimensional space to an M > p dimensional space
      • Fit a support vector classifier in the enlarged space (e.g. β0 + β1X1 + β2X1^2 + β3X2 = 0), this results in non-linear decision boundaries in the original space
    • BUT polynomials (especially high-dimensional ones) get wild rather fast. It's difficult to control the computations
    • There is a more elegant and controlled way to introduce nonlinearities in support-vector classifiers — through the use of kernels (any linear model can be turned into a non-linear model by applying the kernel trick: replace its features (predictors) by a kernel function). A kernel is a function that quantifies the similarity of two observations through an inner product. K(x, xi) denotes the kernel function, and its form depends on the type of kernel used

      • For SVC, the Kernel is the usual inner product, which is called a linear kernel, because φ(x) = x
      • Polynomial kernel of degree d: computes the inner products needed for degree-d polynomial basis functions (d > 1). In practice, we need CV for both (C, d). Intuitively, it is like combining several linear SVCs into a non-linear SVM
      • Radial Kernel (Gaussian Kernel): In practice, we need CV for both (C, gamma)
        • if a given test observation x* is far away from a training observation xi in terms of Euclidean distance, then the exponent will be very negative, and so K(x*,xi) will be very tiny. Therefore, xi will play virtually no role in f(x*)
        • The radial kernel has very local behavior: only nearby training observations have an effect on the class label of a test observation (similar to KNN)
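      • A short, hedged sketch of how the polynomial and radial kernels map to scikit-learn arguments, with the cross-validation over (C, d) and (C, gamma) mentioned above (X_train, y_train and the parameter grids are placeholder assumptions):
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Polynomial kernel: tune (C, d) by cross-validation
poly_cv = GridSearchCV(SVC(kernel='poly'),
                       {'C': [0.1, 1, 10], 'degree': [2, 3, 4]}, cv=5)

# Radial (Gaussian) kernel: tune (C, gamma) by cross-validation
rbf_cv = GridSearchCV(SVC(kernel='rbf'),
                      {'C': [0.1, 1, 10], 'gamma': [1e-3, 1e-2, 1e-1]}, cv=5)
# poly_cv.fit(X_train, y_train); rbf_cv.fit(X_train, y_train)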
    • Multi-class SVM (more than 2 class)
      • OVA (One Versus All)

        • Fit K different 2-class SVM classifiers fˆk(x), k = 1, . . . , K; each class versus the rest. Classify x to the class for which fˆk(x) is largest.
      • OVO (One Versus One)
        • Fit all pairwise classifiers fˆkl(x). Classify x to the class that wins the most pairwise competitions.
      • If K is not too large, use OVO
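      • A minimal sketch of the OVA and OVO wrappers in scikit-learn (X_train, y_train are placeholders; LinearSVC is an arbitrary base classifier):
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import LinearSVC

ova = OneVsRestClassifier(LinearSVC())   # K classifiers: each class vs. the rest
ovo = OneVsOneClassifier(LinearSVC())    # K(K-1)/2 pairwise classifiers
# ova.fit(X_train, y_train); ovo.fit(X_train, y_train)
# Note: SVC itself already uses a pairwise (OVO) scheme internally;
# decision_function_shape='ovr' only changes the shape of its decision values.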
    • Multi-class and Multi-label SVM
      • Binary Relevance: The simplest technique, which treats each label as a separate binary or multi-class classification problem

        • If there are three labels Y1, Y2,Y3, for a new data point x*, predict Y1, Y2,Y3 separately
        • Note 1: Any classifier (e.g. Naïve Bayes, logistic regression, SVM) can be used for predicting each label
        • Note 2: Each label may give rise to a binary or multi-class classification problem
      • Classifier Chains: the first classifier is trained just on the input data and then each next classifier is trained on the input space and all the previous classifiers in the chain.
        • Note: This is quite similar to binary relevance, the only difference being it forms chains in order to preserve label correlation
      • Label Power-set: this method transforms the problem into a multi-class problem: one multi-class classifier is trained on all unique label combinations found in the training data
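      • A hedged sketch of binary relevance and classifier chains with scikit-learn (X_train and a label-indicator matrix Y_train are placeholder names; logistic regression is an arbitrary base classifier):
from sklearn.multioutput import MultiOutputClassifier, ClassifierChain
from sklearn.linear_model import LogisticRegression

# Binary relevance: one independent classifier per label
br = MultiOutputClassifier(LogisticRegression(max_iter=1000))

# Classifier chain: each classifier also sees the previously predicted labels in the chain
chain = ClassifierChain(LogisticRegression(max_iter=1000), order='random', random_state=0)
# br.fit(X_train, Y_train); chain.fit(X_train, Y_train)
# A label power-set transformation is available in third-party packages such as scikit-multilearn.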
    • Evaluation metrics for multi-label (and multi-class) classification problems:
      • Hamming Loss: The fraction of the wrong labels to the total number of labels
      • From Homework: Calculate the average Hamming distance, Hamming score, and Hamming loss between the true labels and the labels assigned by clusters
        • Hamming Loss = 1 - Hamming Score
def hamming_distance(y_true, y_major):
    x_, y_ = y_true.shape
    distance = 0
    for i in range(x_):
        for j in range(y_):
            if y_true.iloc[i][j] != y_major[j]:
                distance = distance + 1
    return distance / x_

def hamming_score(y_true, y_major):
    x_, y_ = y_true.shape
    scores = 0
    for i in range(x_):
        count = 0
        for j in range(y_):
            if y_true.iloc[i][j] == y_major[j]:
                count = count + 1
        score = count / y_
        scores = scores + score
    return scores / x_

def hamming_loss(y_true, y_major):
    x_, y_ = y_true.shape
    loss = 0
    for i in range(x_):
        count = 0
        for j in range(y_):
            if y_true.iloc[i][j] != y_major[j]:
                count = count + 1
        loss = loss + count / y_
    return loss / x_
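        • For binary label-indicator matrices, scikit-learn can compute the same fraction directly (y_true_bin and y_pred_bin are placeholder 0/1 arrays of shape [n_samples, n_labels]):
from sklearn.metrics import hamming_loss
loss = hamming_loss(y_true_bin, y_pred_bin)   # fraction of wrongly predicted labels
score = 1 - loss                              # the corresponding Hamming score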
  • SVM and Logistic Regression

    • Similar loss function: Due to the similarities between their loss functions, logistic regression and the support vector classifier often give very similar results

      • SVM: hinge loss function
      • Logistic Regression:
        • we can rewrite the logistic regression when Y is -1 or 1 (instead of 0,1) as
        • negative log likelihood loss function
    • Loss + Penalty
      • the original penalty in SVC is a L2 penalty, but it can be changed to L1
      • Because the hinge loss is not differentiable, the solution paths for different values of λ may have jumps. So, sometimes we replace the hinge loss with a differentiable squared hinge loss
      • When λ is large, the βi are small, more violations to the margin are tolerated -> large C -> high bias, low variance
      • When λ is small, the βi are large, fewer violations to the margin are tolerated -> small C -> low bias, high variance
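      • A tiny numpy sketch comparing the loss curves discussed above as a function of the margin m = y·f(x), with y in {-1, +1}; purely illustrative:
import numpy as np

m = np.linspace(-3, 3, 200)                  # margin values y * f(x)
hinge = np.maximum(0, 1 - m)                 # SVM hinge loss
squared_hinge = np.maximum(0, 1 - m) ** 2    # differentiable squared hinge loss
logistic = np.log(1 + np.exp(-m))            # logistic regression negative log-likelihood
# Plotting hinge vs. logistic shows why the two classifiers often give similar results.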
    • SVM vs. LR Summary
      • When the classes are well separated, SVMs tend to behave better than logistic regression; so does LDA.
      • In more overlapping regimes, logistic regression is often preferred.
      • SVMs are popular in high-dimensional classification problems with p>>n
      • If you wish to estimate probabilities, LR is the choice
      • For nonlinear boundaries, kernel SVMs are popular
        • In fact, Kernel can be used on other classification algorithms
  • From Homework:
    • When you do multi-label tasks, SMOTE also needs to be done separately on each label
    • Train an L2-penalized SVM for each of the labels, using Gaussian kernels and one-versus-all classifiers
      • Separate the multi-labels and solve it one by one
# separate the labels
y_train_1 = y_train['Family']
y_train_2 = y_train['Genus']
y_train_3 = y_train['Species']
y_test_1 = y_test['Family']
y_test_2 = y_test['Genus']
y_test_3 = y_test['Species']

from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

def svm_l2(x, y):
    svm = SVC(kernel='rbf', decision_function_shape='ovr')
    c_range = np.logspace(-2, 10, 13)
    gamma_range = np.logspace(-9, 3, 13)
    parameters = {'C': c_range, 'gamma': gamma_range}
    gscv = GridSearchCV(svm, param_grid=parameters, cv=10)
    gscv.fit(x, y)
    return gscv.best_params_, gscv.best_score_, gscv.best_estimator_

family_result = svm_l2(X_train_std, y_train_1)
genus_result = svm_l2(X_train_std, y_train_2)
species_result = svm_l2(X_train_std, y_train_3)
  • Train an L1-penalized SVM. The convention is to use the L1 penalty with a linear kernel
from sklearn.svm import LinearSVC

def svm_l1(x, y):
    svc = LinearSVC(penalty='l1', multi_class='ovr', dual=False, random_state=0, max_iter=100000)
    c_range = np.logspace(-2, 10, 13)
    parameters = {'C': c_range}
    gscv = GridSearchCV(svc, param_grid=parameters, cv=10)
    gscv.fit(x, y)
    return gscv.best_params_, gscv.best_score_, gscv.best_estimator_

Unsupervised Learning

  • Unsupervised learning is more subjective than supervised learning as there is no simple goal for the analysis. It is often easier to obtain unlabeled data
  • We discuss two methods:
    • principal components analysis: a tool used for data visualization (low-dimensional display) or for data pre-processing before supervised techniques are applied
    • clustering: a broad class of methods for discovering unknown subgroups in data.

Clustering

  • Clustering refers to a very broad set of techniques for finding subgroups, or clusters, in a data set; the observations within each group should be quite similar to each other
  • We must define what it means for two or more observations to be similar or different (e.g. distance or correlation?)
    • this is often a domain-specific consideration that must be made based on knowledge of the data being studied
  • K-Means Clustering
    • Collectively Exhaustive: each observation belongs to at least one of the K clusters
    • Mutually Exclusive: the clusters are non-overlapping: no observation belongs to more than one cluster
    • A good clustering is one for which the within-cluster variation is as small as possible
    • within-cluster variation (with Euclidean distance): W(Ck) = (1/|Ck|) Σ over pairs i, i′ in Ck of Σ j=1..p (xij − xi′j)²
      • |Ck| denotes the number of observations in the kth cluster
    • we try to minimize Total WCV
    • A local optimum K-Means Clustering Algorithm
      1. Find K using TWCV, Elbow Chart
      2. Randomly assign a number, from 1 to K, to each of the observations
        These serve as initial cluster assignments for the observations
      3. Iterate until the cluster assignments stop changing:
        a) For each of the K clusters, compute the cluster centroid. The kth cluster centroid is the vector of the p feature means for the observations in the kth cluster
        b) Assign each observation to the cluster whose centroid is closest (where closest is defined using Euclidean distance)
    • K-Means is guaranteed to find a local minimum of the TWCV, but not the global minimum, because the result depends on the randomly selected initial centroids.
    • There are two approaches to avoid this problem
      1. Repeat K-means: repeat the algorithm with several random initializations of the centroids and pick the clustering result that has small intra-cluster distances and large inter-cluster distances.
      2. K-Means++: K-Means++ is a smart centroid initialization technique.
    • K-Means ++ Algorithm
      • from https://towardsdatascience.com/understanding-k-means-k-means-and-k-medoids-clustering-algorithms-ad9c9fbf47ca
      • K-Means++ is a smart centroid initialization technique and the rest of the algorithm is the same as that of K-Means.
      • The steps to follow for centroid initialization are:
        • Pick the first centroid point (C1) randomly
        • Compute the distance of every point in the dataset from the centroids that have already been selected
          • di: distance of point xi from its nearest already-chosen centroid, di = min over j = 1,…,m of ||xi − cj||
          • m: number of centroids already picked
        • Choose the next centroid at random, with probability proportional to di (the standard K-Means++ choice is proportional to di²), so far-away points are more likely to be picked
        • Repeat the above two steps until k centroids have been chosen
      • In Python, just use kmeans = KMeans(n_clusters = 3, init = 'k-means++'), which is the default
    • How to choose K?
      • If we already know how many clusters we want, it is easy
      • Otherwise, it is hard, and there is no perfect answer
      • Several approaches:
        • Elbow Chart

          • Plot the TWCV against K. The curve decreases as K increases; find the "elbow" point. Within-cluster variation measures how tightly grouped the clusters are.
          • Plot the TBCV (Total Between-Cluster Variation) against K. The curve increases as K increases; find the "elbow" point. Between-cluster variation measures how spread apart the groups are from each other
        • CH index (combine WCV & BCV): Ideally we’d like our clustering assignments C to simultaneously have a small W and a large B
          • choose the value of K with the largest score CH(K)
          • BUT CH index is not defined for K=1
        • Silhouette Analysis
          • Assume the data have been clustered via any technique, such as k-means, into k clusters
          • For each sample xi, let ai be the average distance between xi and all other data within the same cluster. Can interpret ai as a measure of how well xi is assigned to its cluster (the smaller the value, the better the assignment)
          • Let bi be the lowest average distance of xi to all points in any other cluster that does not include xi. The cluster with this lowest average dissimilarity is said to be the “neighboring cluster” of xi, because it is the next best fit cluster for point xi
          • Define si = (bi − ai) / max(ai, bi); then −1 ≤ si ≤ 1
            • si close to one means that xi is appropriately clustered
            • si close to negative one means that xi should be clustered in its neighboring cluster
            • si near zero means that xi is on the border of two natural clusters
          • The average si over the entire dataset is a measure of how appropriately the data have been clustered
          • Therefore, we prefer a higher average si (a silhouette/elbow sketch follows after the CH-index homework code below)
      • Cross Validation for K, still need to find elbow point
        • can use the sum of the squared distances to the centroids for k-means as objective function
    • From Homework: CH index for choosing K
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score
from collections import defaultdict
import operator

k_value = range(2, 51)   # the CH index is not defined for k = 1
ch_score = defaultdict(list)
best_k = []
for t in range(1, 51):     # repeat the simulation 50 times
    for k in k_value:
        kmeans = KMeans(n_clusters=k)
        kmeans.fit(X)
        score = calinski_harabasz_score(X, kmeans.labels_)
        ch_score[k].append(score)
    best = max(ch_score.items(), key=operator.itemgetter(1))[0]
    best_k.append(best)
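    • A complementary sketch for choosing K with the elbow chart (TWCV, via KMeans' inertia_) and the average silhouette score described above (X is the same placeholder data matrix; the range of K values is arbitrary):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

ks = range(2, 11)
twcv = []
sil = []
for k in ks:
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=0).fit(X)
    twcv.append(km.inertia_)                       # total within-cluster variation
    sil.append(silhouette_score(X, km.labels_))    # average s_i over the dataset

plt.plot(ks, twcv, '-o'); plt.xlabel('K'); plt.ylabel('TWCV (inertia)')  # look for the elbow
best_k_silhouette = ks[int(np.argmax(sil))]        # prefer the K with the highest average s_i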
  • K-Medoids Clustering

    • Similar to K-means, but in each step we do not compute the centroid of each cluster
    • We compute the medoid, the data point that has the smallest average distance to the other points in the cluster, so the "centroid" is always an actual point in the cluster
  • Hierarchical Clustering
    • Hierarchical clustering does not require choosing K in advance
    • The most common type of hierarchical clustering is bottom-up or agglomerative clustering: the dendrogram is built from the leaves, combining clusters up to the trunk
    • The approach in words:
      • Start with each point in its own cluster.
      • Identify the closest two clusters and merge them.
      • Repeat.
      • Ends when all points are in a single cluster.
    • For any two observations, we can look for the point in the tree where branches containing those two observations are first fused. The height of this fusion, as measured on the vertical axis, indicates how different the two observations are.
      • observations that fuse at the very bottom of the tree are quite similar to each other
      • whereas observations that fuse close to the top of the tree will tend to be quite different
    • Nested Cluster
      • Clusters obtained by cutting the dendrogram at a given height are necessarily nested within the clusters obtained by cutting the dendrogram at any greater height.
      • In other words, the height of the cut to the dendrogram serves the same role as the K in K-means clustering: it controls the number of clusters obtained
      • Hierarchical clustering can sometimes yield worse (i.e. less accurate) results than K-means clustering for a given number of clusters
    • From ISLR
    • Dissimilarity between Groups
      • The concept of dissimilarity between a pair of observations needs to be extended to a pair of groups of observations
      • This extension is achieved by developing the notion of linkage, which defines the dissimilarity between two groups of observations
        • Complete: Maximal inter-cluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the largest of these dissimilarities.
        • Average: Mean inter-cluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the average of these dissimilarities.
        • Single: Minimal inter-cluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the smallest of these dissimilarities.
        • Centroid: Dissimilarity between the centroid for cluster A (a mean vector of length p) and the centroid for cluster B. Centroid linkage can result in undesirable inversions, whereby two clusters are fused at a height below either of the individual clusters in the dendrogram. This can lead to difficulties in visualization as well as in interpretation of the dendrogram
      • Average and complete linkage are generally preferred over single linkage, as they tend to yield more balanced dendrograms
      • The resulting dendrogram typically depends quite strongly on the type of linkage used
    • Choice of Dissimilarity Measure
      • Normally, we use Euclidean distance
      • We can also use correlation-based distance, that focuses on the shapes of observation profiles rather than their magnitudes
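      • A sketch of agglomerative clustering with SciPy (X is a placeholder data matrix): build the linkage matrix with a chosen dissimilarity and linkage, draw the dendrogram, then cut it to obtain clusters.
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from scipy.spatial.distance import pdist

d = pdist(X, metric='euclidean')          # pairwise dissimilarities (could also use 'correlation')
Z = linkage(d, method='average')          # 'complete', 'average', 'single', or 'centroid'
dendrogram(Z)                             # visualize; the cut height plays the role of K
labels = fcluster(Z, t=3, criterion='maxclust')   # cut the tree into 3 clusters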
    • Practical Issue: both for K-Means and Hierarchical Clustering
      • We need to consider whether or not the variables should be scaled to have standard deviation one before the dissimilarity between the observations is computed. It depends on what you want.
      • In the case of hierarchical clustering
        • What dissimilarity measure should be used?
        • What type of linkage should be used?
        • Where should we cut the dendrogram in order to obtain clusters?
      • K-Means Clustering: How to choose K?
      • No single best solution, validation first, do multiple times, find the best one
      • Outliers may be an issue because they do not really belong to any cluster. Mixture models, which amount to a soft version of K-means clustering, are an attractive approach for accommodating the presence of such outliers
      • clustering methods generally are not very robust to changes of the data. So, we recommend clustering subsets of the data in order to get a sense of the robustness of the clusters obtained.
    • Most importantly, we must be careful about how the results of a clustering analysis are reported. These results should not be taken as the absolute truth about a data set. Rather, they should constitute a starting point for the development of a scientific hypothesis and further study, preferably on an independent data set.
  • Variable Selection for Clustering (different clusters with different features)
    • The goal of feature selection for unsupervised learning is to find the smallest feature subset that best uncovers “interesting natural” groupings (clusters) from data according to the chosen criterion.
    • k depends on the feature subset, so in each step of subset selection, the best k has to be found
    • A large number of k-means variations have been proposed to handle feature selection.
      Most start by clustering the data into k clusters and then assign a weight to each feature. A feature that minimizes the within-cluster distance / maximizes the between-cluster distance is preferred and hence gets a higher weight.
  • Fisher’s Linear Discriminant Analysis
    • Whereas PCA seeks directions that are efficient for representation, discriminant analysis seeks directions that are efficient for discrimination. It is in some sense a supervised version of PCA and can provide greater separation between different groups
    • Once the transformation from the p-dimensional original feature space to a lower dimensional subspace is done using PCA or Fisher’s LDA, classification methods can be used to train pertinent classifiers.

Semi-supervised Learning

  • Unlabeled data is cheaper to get
  • Assumption: Examples from the same class follow a coherent distribution. Unlabeled data can give a better sense of the class separation boundary
  • So, when we only have a few labeled data, we can use unlabeled data to separate the class better
  • Two types of Semi-supervised Learning Problem
    • Transductive: Produce label only for the available unlabeled data. The output of the method is not a classifier
    • Inductive: Not only produce label for unlabeled data, but also produce a classifier
  • To solve Inductive problem, there are two algorithmic approaches
    • Classifier based methods (Self-Training): Start from initial classifier(s), and
      iteratively enhance it (them)

      • Co-Training, Yarowsky Algorithm, and Their Combination
    • Data based methods: Discover an inherent geometry in the data, and exploit it in finding a good classifier
      • Manifold Regularization, Harmonic Mixtures, Information Regularization
  • Self-Training (Bootstrap)
  • Yarowsky Algorithm
    • Algorithm Process

      • Train supervised model on labeled data L
      • Test on unlabeled data U
      • Add the most confidently classified members of U to L (in an SVM, "most confidently" means the points farthest from the decision boundary)
      • Repeat
    • Advantages:
      • The simplest semi-supervised learning method.
      • A wrapper method, applies to existing (complex) classifiers.
      • Often used in real tasks like natural language processing.
    • Disadvantages:
      • Early mistakes could reinforce themselves.
    • Refinement:
      • Heuristic solutions, e.g. “un-label” an instance if its confidence falls below a threshold
      • Reduce the weight of unlabeled data to increase the influence of the more accurate labeled data
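    • A hedged sketch of the self-training / Yarowsky idea using scikit-learn's SelfTrainingClassifier (available in newer sklearn versions; X and y are placeholders, and unlabeled points are marked with label -1):
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

# y contains the true labels for the labeled points and -1 for the unlabeled ones
base = SVC(kernel='rbf', probability=True, gamma='auto')
self_training = SelfTrainingClassifier(base, threshold=0.75)  # only add confidently classified points
# self_training.fit(X, y)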
  • Co-Training (learn from each other)
    • Instances contain two sufficient sets of features

      • i.e. an instance is x=(x1,x2)
      • Each set of features is called a View
    • Two views are independent given the label
    • Two views are consistent: optimal c(x) = c1(x1) = c2(x2)
      • c1: A classifier trained on View 1, label some observations
      • c2: A classifier trained on View 2, label some observations
  • Yarowsky + Co-Training
    • Using Yarowsky Algorithm, but view is switched in each iteration
    • Uses all the unlabeled data for training
    • In each iteration, only one of the classifiers labels the data
      • e.g. Train->use c1 label->Train-> use c2 label -> Train …
    • Has been used successfully with neural networks and support vector machines
  • General Multi-view Learning
    • Train multiple diverse models using different views on labeled dataset L. Those instances in unlabeled dataset U which most models agree on are placed in L
  • Semi-Supervised SVM (S3VM)
    • Maximize the margin over both L and U. The decision surface is placed in low-density regions of the data
    • Assumes classes are “well-separated”
    • Can also try to keep the class proportion on the two sides similar to the labeled proportion (e.g. if the labeled data are 30% + and 70% −, try to keep 30%/70% for U)
  • Cluster-and-Label Approach
    • Assumption: Clusters coincide with decision boundaries
    • BUT Poor results if this assumption is wrong
    • Process
      • Cluster labeled and unlabeled data (mix)
      • For each cluster, train a classifier based on the labeled points within that cluster
      • Label all data in each cluster using the classifier designed for that cluster.
      • Train a model based on the whole data (that is now labeled)
    • Clustering both U and L as Pre-processing for Supervised learning
      • Disregard labels & cluster
      • Build a local model for each cluster
      • Assume that there is a test point X*. Determine in what cluster it falls first, and then use the model associated with that cluster to label it
  • Passive Learning
    • Randomly select data points to label each time, until every instance has been labeled
  • Active Learning
    • Types of Active Learning

      • Stream-Based Active Learning

        • Consider one unlabeled example at a time. Decide whether to query its label or ignore it
      • Pool-Based Active Learning
        • Given a large unlabeled pool of examples
        • Rank examples in order of informativeness (Uncertainty/Surprise/Doubt = Informative; In SVM, that will be the point that is closest to the margin)
        • Query the labels for the most informative example(s)
    • Process:
      • Active Learning proceeds in rounds
      • Each round has a current model(learned using the labeled data seen so far)
      • The current model is used to assess informativeness of unlabeled examples using one of the query selection strategies
      • The most informative example(s) is/are selected
      • The labels are obtained (by the labeling oracle, costly)
      • The training data will be updated including new labeled data
      • The model is re-trained using the new training data
      • Repeat as long as we have budget left for getting labels
    • Query Selection Strategies (assess informativeness of unlabeled examples)
      • Uncertainty Sampling

        • Select examples which the current model is the most uncertain about
        • Ways to measure uncertainty:
          • Distance to hyperplane (closer, more uncertain)
          • label probability (≈ 0.5)
      • Query By Committee (QBC)
        • QBC uses a committee of models
        • All models trained using the currently available labeled data L
        • All models vote their predictions on the unlabeled pool
        • The example(s) with maximum disagreement is/are chosen for labeling
        • Each model in the committee is retrained after including the new example(s)
      • However, these two strategies have a drawback: both may wrongly treat an outlier as an informative example, because outliers always look uncertain. Therefore, other, more robust query selection methods exist to deal with outliers. Instead of using the confidence of the model on an example, they look at how labeling an example would affect the model itself. The example(s) that affect the model the most are probably the most informative
      • Expected Model Change (using the law of Total Expectation)
        • Select the example whose inclusion brings about the maximum change in the model (e.g. the gradient of the loss function with regard to the parameters)
        • Using the law of Total Expectation: E[ΔJ]
      • Expected Error Reduction
        • Select example that reduces the expected generalization error the most
        • Using the law of Total Expectation: E[Δe]
  • Passive vs. Active
    • For the same number of instance queries, active learning achieves a better accuracy rate than passive learning. As the number of instance queries goes to infinity, the two come closer and closer
  • From Homework:
    • Find the unlabeled data point that is the farthest to the decision boundary of the SVM. Let the SVM label it (ignore its true label), and add it to the labeled data, and retrain the SVM
while len(unlabel_X_norm) > 0:
    svm_linear = LinearSVC(penalty='l1', C=best_c, dual=False, random_state=42, max_iter=10000)
    svm_linear.fit(label_X_norm, label_y)
    # decision_function gives confidence scores (signed distances to the hyperplane)
    distance = np.abs(svm_linear.decision_function(unlabel_X_norm))
    max_distance = np.argmax(distance)                      # farthest unlabeled point
    X_far = unlabel_X_norm[max_distance]
    y_far = svm_linear.predict(X_far.reshape(1, -1))        # let the SVM label it
    label_X_norm = np.vstack((label_X_norm, X_far))         # add it to the labeled data
    label_y = np.hstack((label_y, y_far))
    unlabel_X_norm = np.delete(unlabel_X_norm, max_distance, axis=0)  # remove it from the unlabeled data
  • Passive Learning: Train a SVM with a pool of 10 randomly selected data points from the training set using linear kernel and L1 penalty. Repeat this process by adding 10 other randomly selected data points to the pool, until you use all the 900 points. Calculate the test error for each SVM.
import random
from sklearn.metrics import accuracy_score

def passive(x_tr, y_tr, x_te, y_te):
    te_error = []
    pool_index = []
    unpool_index = train.index.tolist()
    while len(unpool_index) > 0:
        pool = random.sample(unpool_index, 10)      # 10 randomly selected points
        for i in pool:
            pool_index.append(i)
        result = svm_l1(x_tr[pool_index], y_tr[pool_index])
        best_c = result[0]['C']                     # best_params_ returned by svm_l1
        svm_linear = LinearSVC(C=best_c, penalty='l1', dual=False, random_state=42, max_iter=10000)
        model = svm_linear.fit(x_tr[pool_index], y_tr[pool_index])
        y_pred = model.predict(x_te)
        error = 1 - accuracy_score(y_te, y_pred)
        te_error.append(error)
        unpool_index = list(set(unpool_index) - set(pool_index))
    return te_error
  • Active Learning: Train a SVM with a pool of 10 randomly selected data points from the training set using linear kernel and L1 penalty. Choose the 10 closest data points in the training set to the hyperplane of the SVM and add them to the pool. Train a new SVM using the pool until using all 900 points. Calculate the test error for each SVM.
def active(x_tr, y_tr, x_te, y_te):
    te_error_ac = []
    tr_size = x_tr.shape[0]
    # random initial pool of 10 points
    indices = np.random.choice(list(range(0, 900)), size=900, replace=False,
                               p=[1 / tr_size for i in range(tr_size)])
    X_train_new = x_tr[indices[0:10]]
    y_train_new = y_tr[indices[0:10]]
    rest_train = x_tr[indices][10:]
    rest_y = y_tr[indices][10:]
    i = 0
    while i <= 90:
        result = svm_l1(X_train_new, y_train_new)
        best_c = result[0]['C']
        svm_linear = LinearSVC(C=best_c, penalty='l1', dual=False,
                               random_state=42, max_iter=10000, tol=0.01)
        model = svm_linear.fit(X_train_new, y_train_new)
        y_pred = model.predict(x_te)
        error = 1 - accuracy_score(y_te, y_pred)
        te_error_ac.append(error)
        if len(rest_train) == 0:
            break
        # the 10 points closest to the hyperplane are the most informative
        distance = np.abs(model.decision_function(rest_train))
        best_distance = np.argsort(distance, axis=0)
        X_train_new = np.append(X_train_new, rest_train[best_distance[0:10], :], axis=0)
        y_train_new = np.append(y_train_new, rest_y[best_distance[0:10]], axis=0)
        rest_train = rest_train[best_distance[10:]]
        rest_y = rest_y[best_distance[10:]]
        i = i + 1
    return te_error_ac

Neural Network and Deep Learning

  • Perceptron

    • Perceptron is an algorithm for binary classification that uses a linear prediction function:

      • This is called a step function
      • weights: β = (β1,β2,…,βp)
      • By convention, ties are broken in favor of the positive class. If βᵀx + β0 is exactly 0, output +1 instead of -1
      • In the same way that linear regression learns the slope parameters to best fit the data points, perceptron learns the parameters to best separate the instances.

from Wikipedia: In mathematics, a function on the real numbers is called a step function (or staircase function) if it can be written as a finite linear combination of indicator functions of intervals. Informally speaking, a step function is a piecewise constant function having only finitely many pieces.

  • How to learn the weights β?

    1. Initialize all weights β to 0 or randomly
    2. Iterate through the training data. For each training instance, classify the instance
      a) If the prediction is correct, keep it!
      b) If the prediction is wrong, modify the weights by update rule
    3. Repeat step 2 some number of times until the prediction is correct
  • Update Rule: Update the β to get the correct prediction
    • β(i+1) = β (i) + 0.5[ y(i)- f(x(i))] x(i) ; y(i) and f(x(i)) can only be 1 or -1

      • If y(i) = f(x(i)), no change
      • If y(i) = 1, f(x(i)) = -1, the weights will move towards x(i)
      • If y(i) = -1, f(x(i)) = 1, the weights will move away from x(i)
    • For the bias term β0: β0(i+1)= β0(i)+.5[e(i)]
      • If e(i) is positive, increase the bias term
      • If e(i) is negative, decrease the bias term
    • Move the decision boundary up or down to correct the classifier
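    • A minimal numpy sketch of the update rule above for binary labels in {-1, +1} (X and y are small placeholder arrays; alpha = 0.5 matches the 0.5 factor in the rule):
import numpy as np

def perceptron_train(X, y, alpha=0.5, epochs=10):
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            f = 1 if (w @ x_i + b) >= 0 else -1   # ties broken in favor of +1
            e = y_i - f                           # error: 0, +2, or -2
            w = w + alpha * e * x_i               # move the weights toward / away from x_i
            b = b + alpha * e                     # shift the decision boundary
    return w, b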
  • In Neural Networks, we prefer to use w to replace β as weights, b to replace β0 as bias term
  • Perceptron vs. SVC (both linear classifiers)
    • SVC’s optimization needs all data to yield a classifier, so it is a batch algorithm. It needs to maximize the margin between classes
    • Perceptron adapts its weights online, based on the data points that are presented to it. Thus it is an online algorithm. It just needs to separate the classes.
  • Multi-class Perceptron
    • Using S hyperplanes, one can partition the space Rp into up to 2^S regions. Each region can represent a class

      • The weights associated with each hyperplane can be represented as the columns of a weight matrix W: W = [w1|w2|…|wS]
      • The biases can be augmented in a vector b: b = [b1 b2…bS]T
      • e(i) = y(i) - f(Wᵀx(i) + b), where y(i) is the target output vector
    • Update Rule for Multi-class:
      • W(i+1) = W (i) + 0.5x(i)eT(i)
      • bT(i+1) = bT(i) + 0.5eT(i)
      • eT(i) = [e1(i) | e2(i) |…| eS(i)]
      • Example
  • Learning Rate
    • Let’s make a modification to the update rule: replace 0.5 with α, learning rate

      • W(i+1) = W (i) + α x(i)eT(i)
      • bT(i+1) = bT(i) + α eT(i)
    • How to choose the α?
      • α can't be too small: the algorithm will be slow because the updates won't make much progress
      • α can't be too large: the algorithm will be slow to converge because the updates will "overshoot" and may cause previously correct classifications to become incorrect
      • One choice: use a large learning rate in the first iterations and then reduce it, e.g. α(i) = α0/(i + c)
  • Perceptron as A Neural Architecture
    • A Single “Neuron”
    • Multiple “Neurons”
  • The Perceptron Learning Rule acts like a simple neural system: it adapts its weights based on the error that is a result of mismatch between its output and the desired output. This general process is the basis of supervised learning with a large class of neural networks
  • HOWEVER, if the training set is not linearly separable, the Perceptron rule never converges. The general solution is to represent the data in a feature space where they are linearly separable. One idea is to use layers of perceptrons and adjust their weights to learn the feature representations – Multi-Layer Perceptron

Multi-Layer Perceptron

  • MLPs are also called Feedforward Neural Networks. Especially when the number of layers is large (e.g. M >10), they are called Deep Feedforward Neural Networks
  • MLPs can essentially be represented as nested functions: g(x) = g(M)(g(M-1)(…(g(3)(g(2)(g(1)(x))))…))
    • e.g. a three layer perceptron can be represented as: g(x) = g(3)(g(2)(g(1)(x)))
  • Each g(i) is a layer of neurons with its weight matrix W(i), bias vector b(i), and activation function f(i)
  • This means the output of the ith layer, a(i) is: a(i) =f(i) (W(i)T a(i-1) + b(i)) = g(i) (a(i-1))
    • n(i) = W(i)T a(i-1) + b(i)
    • a(i) = f(i) (n(i))
    • a(0) = x
  • MLP consists of multiple layers of Perceptrons
  • Each layer may have a different number of neurons
  • Different types of activation functions can be used, not just the sign (threshold) function
    • Types of Activation Function

      • sigmoid functions (value between 0 and 1)
      • tanh sigmoid functions (value between -1 and 1)
      • ReLU (Rectified Linear Unit), popular for deep learning: all negative values are set to 0
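  • A small numpy sketch of the layer recursion a(i) = f(i)(W(i)ᵀ a(i-1) + b(i)) using the activations listed above (the weight matrices, bias vectors, and input x are placeholder assumptions):
import numpy as np

def sigmoid(n):
    return 1.0 / (1.0 + np.exp(-n))

def relu(n):
    return np.maximum(0, n)

def mlp_forward(x, weights, biases, activations):
    """weights: list of W(i); biases: list of b(i); activations: list of f(i)."""
    a = x                                   # a(0) = x
    for W, b, f in zip(weights, biases, activations):
        a = f(W.T @ a + b)                  # n(i) = W(i)^T a(i-1) + b(i);  a(i) = f(i)(n(i))
    return a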
  • Architectural Consideration
    • Choice of depth (number of layers) of network
    • Choice of width (number of neurons) of each layer
    • Deeper networks have
      • Far fewer units in each layer
      • Often generalize well to the test set
      • BUT They are often more difficult to train
        • Ideal network architecture must be found via experimentation guided by validation set error
  • How to train MLP?
    • MLPs are trained using optimization algorithms, which require the calculation of gradients to be solved numerically
    • Back-propagation Gradient calculation
      • We would like to minimize some objective function J, the expected sum of square errors by finding “suitable” weight matrices for each layer
      • BUT, since we don't know the population distribution, we can only use the training data. Alternatively, we can use a subset of L pairs (a mini-batch), or even a single pair, for J
      • Then run (approximate) gradient descent on this objective (this is what we actually train)
        • Back-propagation is the procedure used to compute the gradient term needed by gradient descent
        • This gradient descent algorithm cannot guarantee the global minimum, only a local minimum
    • We present each (randomly selected) training sample (or batch of samples) and use the forward and backward paths to calculate sensitivities and update the weights. This is called an iteration.
      • Forward Propagation -> Error Estimation -> Backward Propagation -> S(M)
    • Each time we finish presenting the whole training set to the network, we complete an epoch.
      • An epoch includes multiple iterations.
      • Training the network requires multiple epochs.
  • Regularization Methods
    • One can add L1 and L2 regularizers to the objective function in back-propagation

      • L2 regularization is equivalent to weight decay

        • W(m)(k+1) = (1-ηα)W(m)(k) - α a(m-1) s(m)T
        • η: decay rate/ forgetting factor
    • Empirical regularization is also very popular for Neural Networks
      • Add noise and transformed versions of the training data to the training set in order to find better W(m) that can still separate the classes under small perturbations. BUT too much noise (or too aggressive transformations) is harmful
      • Early Stopping
        • use a validation set and validation error to decide when to stop the algorithm. When the validation error starts increasing, overfitting is occurring
      • Dropout
        • Randomly drop units (and their associated weights) during each update
        • More precisely, in each update step:
          • Randomly sample a different binary mask (0 or 1) to all the input and hidden units
          • Multiply the mask bits with the units and do the update as usual (the units multiplied by 0 are dropped out)
          • Typical dropout probability: 0.2 for input and 0.5 for hidden units, because input units are more related to the training data
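        • A tiny numpy sketch of a dropout mask applied during training ("inverted" dropout, which also rescales so the expected activation is unchanged; `a` is a placeholder activation vector):
import numpy as np

def dropout(a, drop_prob=0.5, training=True):
    if not training:
        return a                                   # no dropout at test time
    mask = (np.random.rand(*a.shape) > drop_prob)  # random binary mask (0 = dropped unit)
    return a * mask / (1.0 - drop_prob)            # rescale to keep the expected activation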

Convolutional Neural Network

  • CNN works for image classification
  • Image data are very large, so we need to downsample the images to have fewer weights
  • Key concept: images carry more information than their vectorized versions
  • Adjacent pixels carry some information about the image
  • The information carried by adjacent pixels can be summarized by convolution operator. The Kernel (or Filter) in the convolution operator is a small matrix which encodes a way of extracting an interesting feature of an image (feature extraction)
    • Image x Kernel = feature map
    • The size of the Feature Map(Convolved Feature) is controlled by three parameters
      • Depth: the number of filters used for convolution; one filter produces one feature map
      • Stride (width): the number of pixels by which we slide the filter matrix over the input matrix. A larger stride yields smaller feature maps
      • Zero-Padding: Sometimes, we pad the input matrix with zeros around the border to apply the filter to bordering elements of the input image matrix. Zero padding controls the size of the feature maps
        • Adding zero-padding: wide convolution
        • Not using zero-padding : narrow convolution
      • NOTE: Feature map is not always smaller. If we pad the original image, then we can retain the size of the original image. Padding can be all zeros or repeating the border rows/cols all along outside of image.
  • It is evident that different Kernels will produce different feature maps for the same input image. So, which kernel to use? By CNN. CNN learns the values of these kernels
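  • A numpy sketch of the convolution operator producing a feature map (strictly speaking a cross-correlation, which is what CNN layers compute), including the usual output-size relation; `image` and `kernel` are placeholder arrays:
import numpy as np

def conv2d(image, kernel, stride=1, pad=0):
    if pad > 0:
        image = np.pad(image, pad, mode='constant')      # zero-padding ("wide" convolution)
    H, W = image.shape
    kH, kW = kernel.shape
    out_h = (H - kH) // stride + 1                        # output size = (N - F + 2P)/S + 1
    out_w = (W - kW) // stride + 1
    fmap = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i*stride:i*stride+kH, j*stride:j*stride+kW]
            fmap[i, j] = np.sum(patch * kernel)           # elementwise product, then sum
    return fmap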
  • Nonlinearity in CNN
    • ReLU introduces nonlinearity in CNN to deal with inherent nonlinearity of classification problems
    • ReLU is applied element-wise (per pixel) and replaces all negative pixel values in the feature map by zero (non-negative)
    • ReLU is applied to feature maps yielding Rectified Feature Maps
  • Pooling (Down-Sampling)
    • Spatial pooling (also called subsampling or downsampling) reduces the dimensionality of each feature map but retains the most important information
    • Max Spatial Pooling
      • define a spatial neighborhood (e.g., a 2×2 window) and take the largest element within that window
      • Instead of the largest element, the average (average pooling) or the sum of all elements could be taken, but in practice max pooling performs better
      • Slide the 2 x 2 window by 2 cells (also called the "stride") and take the maximum value in each region to reduce the dimensionality of the feature map.
      • Properties of Pooling:
        • Reduces the dimensions of the input
        • Reduces the number of parameters and, therefore, controls overfitting
        • Is invariant to small transformations, distortions and translations in the input image (rather robust)
          • A small distortion in input will not change the output of Pooling too much – since we take the maximum /average value in a local neighborhood
          • Yields an almost scale invariant representation of our image. This is very powerful since we can detect objects in an image no matter where they are located
        • In essence, convolutions extract features from the image and pooling makes the features invariant to transformations
        • The convolution-ReLU-Pooling operation can be repeated many times in many convolutional “layers” with different depths and strides
          • Convolutional Layers play the role of feature extractors for a Deep Neural Network
          • The output layer of the MLP must be Softmax (each output is between 0 and 1, and sum is 1)
          • Adding a fully-connected layer is also a (usually) cheap way of learning non-linear combinations of these features, which is even better for classification task
          • NOTE: Not necessary to have a Pooling layer after every Convolutional Layer
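  • A hedged Keras sketch of the conv–ReLU–pool stack followed by a fully-connected layer and a softmax output described above (layer sizes and the 28×28×1 input shape are arbitrary placeholders):
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),  # conv + ReLU
    layers.MaxPooling2D((2, 2)),                                            # down-sampling
    layers.Conv2D(64, (3, 3), activation='relu'),                           # repeat conv-ReLU
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),                                                       # feed into the MLP
    layers.Dense(64, activation='relu'),                                    # fully-connected layer
    layers.Dense(10, activation='softmax'),                                 # outputs sum to 1
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])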

  • Evaluation of CNN by Calculating the total error

    • Total Error = ∑ (target probability – output probability) ^2
    • BUT, The squared error measure has some drawbacks:
      • If the difference between the target probability and the output probability is huge, there is almost no gradient for a sigmoid unit to fix up the error (during MLP training)
      • Sometimes, we cannot get outputs that sum to 1 using the total squared error
    • SO, we can instead force the outputs to represent a probability distribution across discrete alternatives, using cross-entropy: the right cost function to use with softmax (see the sketch below)
      • This basically makes MLPs with a softmax output layer, a nonlinear version of logistic regression, where logits are highly nonlinear functions of the input that are learned from data
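    • A numpy sketch of softmax outputs and the cross-entropy cost discussed above (`logits` and the one-hot `target` are placeholder values):
import numpy as np

logits = np.array([2.0, 1.0, 0.1])                 # nonlinear functions of the input, learned from data
target = np.array([1.0, 0.0, 0.0])                 # one-hot target probabilities

exp_z = np.exp(logits - np.max(logits))            # subtract the max for numerical stability
probs = exp_z / exp_z.sum()                        # softmax: each output in (0, 1), sum is 1
cross_entropy = -np.sum(target * np.log(probs))    # cost used with the softmax output layer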

Model Comparison

Each model below is summarized by: type, target, assumptions, linear / non-linear, loss function, evaluation metric, advantages, and disadvantages.

  • Linear Regression: Supervised, Regression. Assumptions: 1. independent features; 2. the dependence of Y on X1, X2, …, Xp is linear; 3. error terms are uncorrelated; 4. error terms are independent of X; 5. E[ε] = 0; 6. error terms have constant variance. Linear. Loss function: RSS. Evaluation metric: R^2; MSE. Advantage: simple approach to supervised learning. Disadvantage: strong assumptions.
  • Logistic Regression: Supervised, Classification. Assumption: each sample is assigned to one and only one label. Linear. Loss function: negative log-likelihood. Evaluation metric: confusion matrix, precision, recall, etc. Advantages: 1. not sensitive to observations far from the decision boundary; 2. good for k = 2 (Bernoulli) classification problems. Disadvantage: not stable for "well-separated" data.
  • Linear Discriminant Analysis: Supervised, Classification. Assumptions: 1. normal (Gaussian) distribution for each class; 2. same covariance matrix Σ in each class. Linear. Evaluation metric: confusion matrix, precision, recall, etc. Advantages: 1. stable for "well-separated" data; 2. good for k > 2, provides low-dimensional views of the data; 3. good for n << p. Disadvantage: sensitive to observations far from the decision boundary.
  • Naive Bayes: Supervised, Classification. Assumption: conditional independence of the features within each class. Linear. Evaluation metric: confusion matrix, precision, recall, etc. Advantages: 1. good for computation (from joint probability to conditional probability); 2. rather robust to isolated noise samples, since we average over large samples; 3. handles missing values by ignoring them (only the missing feature is disregarded, not the whole record); 4. rather robust to irrelevant attributes; 5. useful when p is very large. Disadvantages: 1. strong assumption; 2. not robust to redundant (correlated) attributes, because they break the conditional independence assumption.
  • KNN: Supervised, Classification. Assumption: similar things are always in close proximity. Non-linear. Evaluation metric: confusion matrix, precision, recall, etc. Advantages: 1. no training needed, just measure distances; 2. simple to implement; 3. few tuning parameters: just K and the distance metric; 4. flexible: classes do not need to be linearly separable. Disadvantages: 1. cannot tell which predictor is more important; 2. computationally expensive: we need the distance from a new observation to all samples; 3. sensitive to imbalanced datasets: may get poor results for infrequent classes; 4. sensitive to irrelevant inputs, which make distances less meaningful for identifying similar neighbors.
  • CART: Supervised, both regression and classification. Assumption: doesn't matter. Loss function: RSS for regression; misclassification rate / Gini index for classification. Advantages: 1. easy interpretation; 2. can be displayed graphically; 3. easily handles qualitative predictors without creating dummy variables. Disadvantages: 1. poor prediction accuracy; 2. very non-robust: a small change in the data can cause a large change in the final estimated tree.
  • Bagging: Supervised, both. Assumption: doesn't matter. Evaluation metric: out-of-bag error estimation. Advantages: 1. improved ensemble method using bootstrap aggregation; 2. better prediction accuracy; 3. reduces variance and avoids overfitting. Disadvantage: it is no longer clear which variables are most important to the procedure, so not easy to interpret.
  • Random Forest: Supervised, both. Assumption: doesn't matter. Advantages: 1. improves bagged trees with a small tweak that de-correlates the trees; 2. this reduces the variance when we average the trees. Disadvantage: not easy to interpret.
  • Boosting: Supervised, both. Assumption: doesn't matter. Advantages: 1. remarkably resistant to overfitting, fast and simple; 2. improves the performance of many kinds of machine learning algorithms, not only decision trees. Disadvantages: 1. very hard to interpret; 2. susceptible to noisy data.
  • SVM: Supervised, Classification. Assumption: doesn't matter. Loss function: hinge loss / squared hinge loss. Evaluation metric: Hamming loss. Advantages: 1. good for "well-separated" data; 2. popular in high-dimensional classification problems with p >> n; 3. kernel SVMs are popular for nonlinear boundaries; 4. rather robust, since the model depends only on a small set of observations (the support vectors). Disadvantage: results are not probabilities.
  • K-Means Clustering: Unsupervised, Clustering. Assumption: doesn't matter. Loss function: within-cluster variation. Advantage: easy to implement. Disadvantages: 1. need to find K, and it is hard to get a perfect one; 2. sensitive to outliers; 3. not very robust to changes in the data.
  • Hierarchical Clustering: Unsupervised, Clustering. Assumption: doesn't matter. Advantage: does not need to choose K. Disadvantages: 1. sometimes yields worse (less accurate) results than K-means clustering for a given number of clusters; 2. need to decide at which height to cut the dendrogram; 3. sensitive to outliers; 4. not very robust to changes in the data.
