Some Data Science/Machine Learning References

Miscellaneous ML Notes

Covariance and correlation are closely related concepts: the correlation between two variables is equal to their covariance divided by the product of their standard deviations, as explained at http://mccormickml.com/2014/07/22/mahalanobis-distance/
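
As a minimal sketch of that relationship (the Pearson correlation is the covariance scaled by the product of the two standard deviations), the snippet below uses only the Python standard library. The function names and the height/weight numbers are made up for illustration.

```python
import math

def covariance(xs, ys):
    """Sample covariance of two equal-length sequences (divides by n - 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    std_x = math.sqrt(covariance(xs, xs))  # cov(x, x) is the variance of x
    std_y = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (std_x * std_y)

# Illustrative (made-up) measurements.
heights_m = [1.60, 1.70, 1.80, 1.90]
weights_kg = [55.0, 68.0, 74.0, 90.0]
heights_cm = [h * 100 for h in heights_m]

# Covariance changes with the units; correlation does not.
print(covariance(heights_m, weights_kg), covariance(heights_cm, weights_kg))
print(correlation(heights_m, weights_kg), correlation(heights_cm, weights_kg))
```

Changing the height unit from meters to centimeters multiplies the covariance by 100 but leaves the correlation unchanged, which is why correlation is easier to compare across variables.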

We can use the Mahalanobis distance to find outliers in multivariate data. It measures how far a point is from a distribution (or how well separated two groups of objects are), taking the covariance of the variables into account. A nice intuitive explanation is here: https://www.theinformationlab.co.uk/2017/05/26/mahalanobis-distance/ The covariance matrix provides the covariance between each pair of variables (covariance is used in order to capture the effect of two or more variables together).

The Mahalanobis distance is primarily used in classification and clustering problems where there is a need to establish correlation between different groups/clusters of data. Euclidean distance only makes sense when all the dimensions have the same units (like meters), since it involves summing the squared differences along each dimension.

When you are dealing with probabilities, the features often have different units. For example: we might have a model for men and a model for women, where both models are based on weight [kg] and height [m]. We also know the mean and covariance for each model. Now if we get a new measurement vector, an ordered pair composed of weight and height, we have to decide if it's a man or a woman. We can use the Mahalanobis distance from the models of both men and women to decide which is closer, meaning which is more probable. The Mahalanobis distance is equivalent to transforming the random vector into a zero-mean vector with an identity matrix for covariance. In that space, the Euclidean distance can safely be applied.
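
The men/women example can be sketched in a few lines of standard-library Python. This is only an illustration: the function name `mahalanobis_2d`, the means, and the covariance matrices are all made-up assumptions, and the closed-form inverse only works for the two-feature case.

```python
import math

def mahalanobis_2d(x, mean, cov):
    """Mahalanobis distance of a 2-D point x from a model (mean, cov)."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    # Closed-form inverse of a 2x2 covariance matrix.
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx = [x[0] - mean[0], x[1] - mean[1]]
    # Quadratic form dx^T * inv * dx.
    q = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
         + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    return math.sqrt(q)

# Hypothetical models over (weight in kg, height in m); numbers are invented.
men = {"mean": (80.0, 1.78), "cov": [[100.0, 0.40], [0.40, 0.0049]]}
women = {"mean": (65.0, 1.64), "cov": [[81.0, 0.30], [0.30, 0.0036]]}

measurement = (78.0, 1.80)
d_men = mahalanobis_2d(measurement, men["mean"], men["cov"])
d_women = mahalanobis_2d(measurement, women["mean"], women["cov"])

# Pick the model whose distribution the measurement is closer to.
label = "man" if d_men < d_women else "woman"
print(d_men, d_women, label)
```

With an identity covariance matrix the function reduces to the ordinary Euclidean distance, which matches the description of the transformed space. Treating the smaller distance as the more probable class also assumes the two models are otherwise comparable (for example, similar priors).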

Linear Discriminant Analysis (LDA) is a classification technique for more than two classes that also acts as a dimensionality reduction technique, like Principal Component Analysis (PCA). For two classes, you can just use Logistic Regression. For each input variable, you need to calculate the mean value of that variable for each class as well as the variance of that variable calculated across all classes. "Predictions are made by calculating a discriminate value for each class and making a prediction for the class with the largest value." https://www.kdnuggets.com/2018/02/tour-top-10-algorithms-machine-learning-newbies.html
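
A rough sketch of that prediction rule for a single input variable is shown below. It assumes Gaussian classes that share one pooled variance (the usual LDA assumption), and the function names and data are invented for illustration rather than taken from the linked article.

```python
import math

def fit_lda_1d(xs, labels):
    """Estimate per-class means, class priors, and one pooled variance."""
    classes = sorted(set(labels))
    n = len(xs)
    means, priors = {}, {}
    squared_error = 0.0
    for k in classes:
        values = [x for x, y in zip(xs, labels) if y == k]
        means[k] = sum(values) / len(values)
        priors[k] = len(values) / n
        squared_error += sum((v - means[k]) ** 2 for v in values)
    # Pooled (within-class) variance, shared by every class.
    variance = squared_error / (n - len(classes))
    return means, priors, variance

def predict_lda_1d(x, means, priors, variance):
    """Return the class with the largest discriminant value for x."""
    def discriminant(k):
        return (x * means[k] / variance
                - means[k] ** 2 / (2 * variance)
                + math.log(priors[k]))
    return max(means, key=discriminant)

# Made-up one-feature data set with three classes.
xs = [1.0, 1.2, 0.8, 3.0, 3.3, 2.9, 5.1, 4.8, 5.0]
labels = ["a", "a", "a", "b", "b", "b", "c", "c", "c"]
model = fit_lda_1d(xs, labels)
print(predict_lda_1d(3.1, *model))
```

Each discriminant value is linear in x, which is where the "linear" in LDA comes from; with equal priors the rule simply picks the class whose mean is nearest.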

The need for dimensionality reduction follows from the Curse of Dimensionality: as we add more and more dimensions to our feature vector, we need more computational power and data to effectively train the model. If you add in more features, you need more data, as seen here: https://towardsdatascience.com/curse-of-dimensionality-2092410f3d27
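
One simple way to see why more features demand more data is to count grid cells: if each feature is split into a fixed number of bins, the number of cells grows exponentially with the number of features, so a fixed data set covers the space more and more thinly. The sketch below is only a back-of-the-envelope illustration; the function name and the choice of 10 bins are assumptions.

```python
def samples_per_cell(n_samples, n_features, bins_per_feature=10):
    """Average samples per grid cell when every feature is split into bins."""
    n_cells = bins_per_feature ** n_features
    return n_samples / n_cells

# The same one million samples spread over more and more features.
for n_features in (1, 2, 3, 6, 9):
    print(n_features, samples_per_cell(1_000_000, n_features))
```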

Thus, the goal of LDA is to reduce the dimension of the feature vectors while preserving as much class-discriminating information as possible and maximizing class separability; discrimination here means coming up with a rule that accurately assigns a new measurement/vector to one of several classes. http://www.cs.uml.edu/~ycao/teaching/fall_2013/downloads/05_MC_2_LDA.pdf

The rule is a discriminant function, a linear equation of the X variables that provides the best separation between the categories of the Y variable. The analysis checks whether there are significant differences between the groups in terms of the X variables. It also identifies the X variables that contribute most to the inter-group separation. https://www.datascience.com/blog/predicting-customer-churn-with-a-discriminant-analysis
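
For two classes and two X variables, one common way to obtain such a linear rule is Fisher's discriminant, where the weight vector is the inverse within-class scatter matrix applied to the difference of the class means. The sketch below assumes that formulation and a midpoint threshold; the function names and data points are invented, and it is not necessarily the procedure used in the linked article.

```python
def mean_vector(points):
    """Mean of a list of 2-D points."""
    n = len(points)
    return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]

def scatter_matrix(points, mean):
    """Sum of outer products of deviations from the class mean (2x2)."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - mean[0], p[1] - mean[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

def fisher_direction(class0, class1):
    """Return weights w = Sw^-1 (m1 - m0) and a midpoint threshold."""
    m0, m1 = mean_vector(class0), mean_vector(class1)
    s0, s1 = scatter_matrix(class0, m0), scatter_matrix(class1, m1)
    sw = [[s0[i][j] + s1[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    diff = [m1[0] - m0[0], m1[1] - m0[1]]
    # Apply the closed-form 2x2 inverse of Sw to the mean difference.
    w = [(sw[1][1] * diff[0] - sw[0][1] * diff[1]) / det,
         (-sw[1][0] * diff[0] + sw[0][0] * diff[1]) / det]
    # Threshold halfway between the two projected class means.
    threshold = 0.5 * (w[0] * (m0[0] + m1[0]) + w[1] * (m0[1] + m1[1]))
    return w, threshold

def classify(x, w, threshold):
    """Assign class 1 if the linear score exceeds the threshold, else 0."""
    return 1 if w[0] * x[0] + w[1] * x[1] > threshold else 0

# Made-up training points for two groups.
class0 = [(1.0, 2.0), (2.0, 3.0), (3.0, 3.0), (2.0, 1.0)]
class1 = [(6.0, 5.0), (7.0, 8.0), (8.0, 6.0), (7.0, 7.0)]
w, threshold = fisher_direction(class0, class1)
print(w, threshold, classify((2.0, 2.0), w, threshold))
```

The relative sizes of the weights in `w` give a rough hint about which X variable contributes most to the separation, although they are only comparable when the variables are on similar scales.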

Ricky J. Sethi, PhD <rickys@sethi.org>
Last updated: Sunday, March 31 2019
(www.sethi.org/tutorials/references_data_science.shtml)