Course Outline
Introduction
This section provides a general introduction to when to use machine learning, what should be considered, and what it all means, including the pros and cons. Topics covered: data types (structured/unstructured/static/streamed), data validity and volume, data-driven vs. user-driven analytics, statistical models vs. machine learning models, the challenges of unsupervised learning, the bias-variance trade-off, iteration and evaluation, cross-validation approaches, and supervised/unsupervised/reinforcement learning.
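As a taste of the iteration/evaluation and bias-variance material, a minimal cross-validation sketch, assuming scikit-learn and a synthetic dataset (neither is named in the outline):

```python
# Compare a high-bias model (shallow tree) against a high-variance model
# (unpruned tree) with 5-fold cross-validation on synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

shallow = DecisionTreeClassifier(max_depth=1, random_state=0)   # high bias
deep = DecisionTreeClassifier(max_depth=None, random_state=0)   # high variance

for name, model in [("shallow tree", shallow), ("deep tree", deep)]:
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validation
    print(f"{name}: mean accuracy = {scores.mean():.3f} (+/- {scores.std():.3f})")
```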
MAJOR TOPICS
1. Understanding naive Bayes
- Basic concepts of Bayesian methods
- Probability
- Joint probability
- Conditional probability with Bayes' theorem
- The naive Bayes algorithm
- Classification with naive Bayes
- The Laplace estimator
- Using numeric features with naive Bayes
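A minimal naive Bayes sketch, assuming scikit-learn and a made-up corpus (the outline does not prescribe a library); alpha=1.0 plays the role of the Laplace estimator, and Gaussian naive Bayes illustrates numeric features:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB, GaussianNB
import numpy as np

# Tiny made-up corpus: spam (1) vs. ham (0)
texts = ["win cash now", "meeting at noon", "win a free prize", "lunch at noon"]
labels = [1, 0, 1, 0]

vec = CountVectorizer()
counts = vec.fit_transform(texts)                   # word counts = joint evidence
nb = MultinomialNB(alpha=1.0).fit(counts, labels)   # alpha=1.0 -> Laplace estimator
print(nb.predict(vec.transform(["free cash prize"])))  # -> [1] (spam)

# Numeric features: Gaussian naive Bayes models each feature as a normal distribution.
X_num = np.array([[170.0, 65.0], [180.0, 80.0], [160.0, 55.0], [175.0, 75.0]])
y_num = [0, 1, 0, 1]
print(GaussianNB().fit(X_num, y_num).predict([[178.0, 78.0]]))
```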
2. Understanding decision trees
- Divide and conquer
- The C5.0 decision tree algorithm
- Choosing the best split
- Pruning the decision tree
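A minimal decision-tree sketch; the C5.0 algorithm named above is an R package, so a CART-style tree from scikit-learn (an assumption) stands in here, with an entropy split criterion and cost-complexity pruning:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(
    criterion="entropy",   # choose the best split by information gain
    ccp_alpha=0.01,        # cost-complexity pruning to limit overfitting
    random_state=0,
).fit(X_train, y_train)

print(export_text(tree))                     # the divide-and-conquer splits as text
print("test accuracy:", tree.score(X_test, y_test))
```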
3. Understanding neural networks
- From biological to artificial neurons
- Activation functions
- Network topology
- The number of layers
- The direction of information travel
- The number of nodes in each layer
- Training neural networks with backpropagation
- Deep Learning
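A minimal feed-forward network sketch, assuming scikit-learn; hidden_layer_sizes, activation, and the SGD solver correspond to the topology, activation-function, and backpropagation topics above:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_moons(n_samples=400, noise=0.2, random_state=0)
X = StandardScaler().fit_transform(X)        # networks train better on scaled inputs
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

net = MLPClassifier(
    hidden_layer_sizes=(8, 4),   # two hidden layers with 8 and 4 nodes
    activation="relu",           # activation function applied at each node
    solver="sgd",                # weights learned by backpropagation + gradient descent
    learning_rate_init=0.1,
    max_iter=2000,
    random_state=0,
).fit(X_train, y_train)

print("test accuracy:", net.score(X_test, y_test))
```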
4. Understanding Support Vector Machines
- Classification with hyperplanes
- Finding the maximum margin
- The case of linearly separable data
- The case of non-linearly separable data
- Using kernels for non-linear spaces
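A minimal SVM sketch, assuming scikit-learn and a synthetic dataset that is not linearly separable, comparing a linear kernel against an RBF kernel:

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Concentric circles: no separating hyperplane exists in the original feature space.
X, y = make_circles(n_samples=300, factor=0.3, noise=0.1, random_state=0)

linear = SVC(kernel="linear", C=1.0)   # maximum-margin hyperplane in the input space
rbf = SVC(kernel="rbf", C=1.0)         # kernel trick maps the data to a non-linear space

for name, model in [("linear kernel", linear), ("RBF kernel", rbf)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```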
5. Understanding clustering
- Clustering as a machine learning task
- The k-means algorithm for clustering
- Using distance to assign and update clusters
- Choosing the appropriate number of clusters
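A minimal k-means sketch, assuming scikit-learn and synthetic blobs; the loop over k shows one common way (the elbow heuristic on the within-cluster sum of squares) to choose the number of clusters:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=0)

# Distance to each centroid assigns points to clusters; centroids are then updated.
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(f"k={k}: within-cluster sum of squares = {km.inertia_:.1f}")
```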
6. Measuring performance for classification
- Working with classification prediction data
- A closer look at confusion matrices
- Using confusion matrices to measure performance
- Beyond accuracy – other measures of performance
- The kappa statistic
- Sensitivity and specificity
- Precision and recall
- The F-measure
- Visualizing performance tradeoffs
- ROC curves
- Estimating future performance
- The holdout method
- Cross-validation
- Bootstrap sampling
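A minimal sketch of the measures above, assuming scikit-learn, a built-in dataset, and logistic regression purely as a convenient classifier to score:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (cohen_kappa_score, confusion_matrix, f1_score,
                             precision_score, roc_auc_score)
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)  # holdout method

clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
pred = clf.predict(X_test)
prob = clf.predict_proba(X_test)[:, 1]

tn, fp, fn, tp = confusion_matrix(y_test, pred).ravel()   # the four confusion-matrix cells
print("sensitivity (recall):", tp / (tp + fn))
print("specificity:", tn / (tn + fp))
print("precision:", precision_score(y_test, pred))
print("F-measure:", f1_score(y_test, pred))
print("kappa:", cohen_kappa_score(y_test, pred))
print("area under the ROC curve:", roc_auc_score(y_test, prob))
```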
7. Tuning stock models for better performance
- Using caret for automated parameter tuning
- Creating a simple tuned model
- Customizing the tuning process
- Improving model performance with meta-learning
- Understanding ensembles
- Bagging
- Boosting
- Random forests
- Training random forests
- Evaluating random forest performance
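caret is an R package; as a stand-in, a minimal scikit-learn sketch (an assumption) of automated parameter tuning with a grid search, alongside the bagging, boosting, and random forest ensembles listed above:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Automated parameter tuning: search the random forest's max_features by cross-validation.
grid = GridSearchCV(
    RandomForestClassifier(n_estimators=200, random_state=0),
    param_grid={"max_features": ["sqrt", "log2", None]},
    cv=5,
)
grid.fit(X, y)
print("best parameters:", grid.best_params_, "CV accuracy:", round(grid.best_score_, 3))

# Other meta-learning ensembles for comparison.
for name, model in [("bagging", BaggingClassifier(random_state=0)),
                    ("boosting", GradientBoostingClassifier(random_state=0))]:
    print(name, "CV accuracy:", round(cross_val_score(model, X, y, cv=5).mean(), 3))
```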
MINOR TOPICS
8. Understanding classification using nearest neighbors
- The kNN algorithm
- Calculating distance
- Choosing an appropriate k
- Preparing data for use with kNN
- Why is the kNN algorithm lazy?
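A minimal kNN sketch, assuming scikit-learn and a built-in dataset; features are rescaled first because distance calculations are sensitive to feature ranges, and several values of k are compared:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_wine(return_X_y=True)

# kNN is "lazy": fitting only stores the training examples; all work happens at prediction.
for k in (1, 5, 11, 21):
    knn = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=k))
    print(f"k={k}: accuracy = {cross_val_score(knn, X, y, cv=5).mean():.3f}")
```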
9. Understanding classification rules
- Separate and conquer
- The One Rule algorithm
- The RIPPER algorithm
- Rules from decision trees
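A minimal pure-Python sketch of the One Rule (1R) algorithm on a made-up weather table; for each feature it maps every value to its most common class and keeps the single feature with the fewest training errors (RIPPER, which grows and prunes whole rule sets, is not shown):

```python
from collections import Counter, defaultdict

# Toy data: (outlook, windy, play?)
rows = [("sunny", "no", "yes"), ("sunny", "yes", "no"), ("rainy", "yes", "no"),
        ("rainy", "no", "yes"), ("overcast", "no", "yes"), ("overcast", "yes", "yes")]
features = ["outlook", "windy"]

def one_rule(rows, n_features):
    best = None
    for i in range(n_features):
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[i]][row[-1]] += 1          # class counts per feature value
        rule = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        errors = sum(rule[row[i]] != row[-1] for row in rows)
        if best is None or errors < best[2]:
            best = (i, rule, errors)
    return best

index, rule, errors = one_rule(rows, len(features))
print(f"1R chose '{features[index]}' with {errors} errors: {rule}")
```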
10. Understanding regression
- Simple linear regression
- Ordinary least squares estimation
- Correlations
- Multiple linear regression
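A minimal regression sketch, assuming NumPy and scikit-learn, on made-up data whose true coefficients are known:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x1 = rng.uniform(0, 10, 100)
x2 = rng.uniform(0, 5, 100)
y = 2.0 * x1 - 3.0 * x2 + 5.0 + rng.normal(0, 1, 100)   # known true coefficients

print("correlation(x1, y):", np.corrcoef(x1, y)[0, 1])

simple = LinearRegression().fit(x1.reshape(-1, 1), y)           # simple linear regression (OLS)
print("slope, intercept:", simple.coef_[0], simple.intercept_)

multiple = LinearRegression().fit(np.column_stack([x1, x2]), y)  # multiple linear regression
print("coefficients:", multiple.coef_, "intercept:", multiple.intercept_)
```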
11. Understanding regression trees and model trees
- Adding regression to trees
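A minimal regression-tree sketch, assuming scikit-learn; each leaf predicts the mean of its training examples, whereas a model tree (not shown) would fit a small linear model per leaf instead:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.linspace(0, 10, 200).reshape(-1, 1)
y = np.sin(X).ravel() + np.random.default_rng(0).normal(0, 0.1, 200)  # noisy sine curve

tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
print("prediction at x=2.5:", tree.predict([[2.5]])[0])
```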
12. Understanding association rules
- The Apriori algorithm for association rule learning
- Measuring rule interest – support and confidence
- Building a set of rules with the Apriori principle
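A minimal pure-Python sketch of the support and confidence measures on made-up transactions; the Apriori principle (every subset of a frequent itemset must itself be frequent) is what lets the full algorithm prune its search for rules:

```python
transactions = [
    {"bread", "milk"}, {"bread", "butter"}, {"milk", "butter"},
    {"bread", "milk", "butter"}, {"bread", "milk"},
]

def support(itemset):
    # Fraction of transactions containing every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    # How often the rule lhs -> rhs holds among transactions containing lhs.
    return support(lhs | rhs) / support(lhs)

print("support({bread, milk}):", support({"bread", "milk"}))        # 0.6
print("confidence(bread -> milk):", confidence({"bread"}, {"milk"}))  # 0.75
```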
Extras
- Spark/PySpark/MLlib and Multi-armed bandits
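A minimal epsilon-greedy multi-armed bandit sketch with simulated payout rates (pure Python; the Spark/MLlib material would use the pyspark.ml API instead and is not shown here):

```python
import random

random.seed(0)
true_rates = [0.2, 0.5, 0.7]          # unknown payout probability of each arm
counts = [0] * len(true_rates)
values = [0.0] * len(true_rates)      # running mean reward per arm
epsilon = 0.1

for _ in range(5000):
    if random.random() < epsilon:
        arm = random.randrange(len(true_rates))      # explore a random arm
    else:
        arm = values.index(max(values))              # exploit the best arm so far
    reward = 1.0 if random.random() < true_rates[arm] else 0.0
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]   # incremental mean update

print("pulls per arm:", counts)
print("estimated rates:", [round(v, 2) for v in values])
```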