Machine Learning/Research Question Overview

Machine Learning Problem Formulation Flowchart

 

Difference between Qualitative and Quantitative Research Questions

The first step, before applying any machine learning, would be to develop your research question, which would depend on what kind of research you're going to do, usually either a qualitative or quantitative research design.

Converting a Business Problem to a Machine Learning Research Question

The research question should be falsifiable, that is, scientifically testable, whether it is for a class project or a PhD dissertation. It should be formulated as one of the 8 common questions machine learning answers, as seen below:

Machine Learning Questions

It should also identify the predictor and response variables. So, in the end, you should formulate a falsifiable hypothesis that can be tested with quantitative data.

For example, a vague research question can be answered without committing to a category label or a number. When thinking about a good research question, people often imagine they've found a mischievous genie who will truthfully answer all questions but will also try to make the answer vague and confusing. So they'll try to pin the genie down with a question so airtight that the genie has no choice but to answer insightfully.

So if you asked a vague question like, "Is there a correlation between stock prices and time?", the genie might answer, "Yes, the price will change with time." Although correct, this answer's not very helpful. But if you formulate a precise, quantitative question like, "Will my stock hit $50 by next Wednesday?", the genie has no choice but to give a specific answer and predict the stock's price.

Thus, as you start to formulate your machine learning research question, you'll likely increasingly sharpen your initial ideas by going through these steps:

  1. It helps to start by first stating the business problem. Business, as always in these contexts, doesn't mean a commercial enterprise but any enterprise, whether it's academic, industrial, or governmental.

    1. At this stage, the specific, measurable business outcome(s) should be clearly identified, as well as any quantifiable, actionable Key Performance Indicators (KPIs) that are relevant to the business problem.

      E.g., some traditional problems might be questions like: What factors are related to employee churn? Can we predict the number of employees that might leave the organization in the next year?

  2. Then, convert that business problem into an initial problem statement by making it precise and quantitative and something you can use with a machine learning model.

    1. In general, the initial problem statement should be precise and quantitative and is often expressed as some independent variable(s) (the predictor(s)) having a presumed correlation with some dependent variable (the response variable or class label).

      E.g., is there a correlation between the predictor variables (age, length of service, and business unit) and the response variable (terminated status) of an employee?

  3. Once you have a precise, quantifiable question, you will have to finally fine-tune it to match one or more of these questions that machine learning usually answers, as shown above. Once you decide which of those questions it fits, you'll formulate your initial machine learning research question which will consider the independent variables, the dependent variables, and the metrics for evaluating the performance.

    1. E.g., you might start by asking: how likely is it that someone under 50, with more than 20 years of service, working in the IT Department will leave within a year? This would be an example of a regression task, where you predict a number (a probability), but you can also convert it to a classification task, where you predict a category label: whether that person belongs in category A or category B (active employee or inactive employee).

      The final version of this classification machine learning research question could then be something like:

      Can the independent variables (number of projects, recently promoted, job satisfaction, number of filed complaints, and average monthly hours worked) predict, with over 90% precision and 90% recall, the state of the dependent variable (active employee status) for staff in the IT department?
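To make the precision and recall targets in that research question concrete, here is a minimal sketch in plain Python. The labels and predicted probabilities are hypothetical values invented for illustration (1 = active employee, 0 = inactive), and the 0.5 cut-off for turning a probability into a category label is an assumption, not something the question itself fixes.

```python
def to_labels(probabilities, threshold=0.5):
    """Turn predicted probabilities (regression-style output) into 0/1 class labels."""
    return [1 if p >= threshold else 0 for p in probabilities]


def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall for the chosen positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical data: true status and a model's predicted probability of "active".
y_true = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
predicted_probabilities = [0.92, 0.81, 0.67, 0.30, 0.55, 0.74, 0.12, 0.88, 0.41, 0.05]

y_pred = to_labels(predicted_probabilities)
precision, recall = precision_recall(y_true, y_pred)

# The research question sets the bar at over 90% for both metrics.
TARGET = 0.90
meets_target = precision > TARGET and recall > TARGET

print(f"precision={precision:.3f} recall={recall:.3f} meets_target={meets_target}")
```

On this toy data the model falls short of both targets, so the answer to the research question would be "no" for this particular model and feature set.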

It can sometimes be a little confusing which machine learning algorithm to pick; here is a general flowchart of how to pick a machine learning algorithm that might be helpful:

Selecting a ML Algorithm

Converting a Research Question into a Hypothesis

It’s generally good practice to place the independent variable(s) first in the sentence, followed by the predictive effect and then the dependent variable(s), as this ordering also reflects the hypothesised direction of the effect. Here are some examples of good and bad research questions.

Here is a better example of a Research Question from the dissertation of Charles Courchaine:

"To what extent, if any, does document encoding affect the recall, precision, F1, and recall-at-effort metrics of a Fuzzy ARTMAP-based TAR algorithm?"

And here is its corresponding hypothesis couplet, which is discussed further below:

H0: Different document encodings (tf-idf, GloVe, Word2Vec, SBERT) will not change the recall, precision, F1, and recall-at-effort metrics of a Fuzzy ARTMAP-based TAR algorithm.

Ha: Different document encodings (tf-idf, GloVe, Word2Vec, SBERT) will change the recall, precision, F1, and recall-at-effort metrics of a Fuzzy ARTMAP-based TAR algorithm. Corpus specific document encodings (e.g., tf-idf) will improve performance metrics, while non-corpus specific encodings perform the same or worse.

Overall, you should try to:

Statistical Significance

Once you have your null and alternative hypotheses in hand, it's time to decide whether to reject the null hypothesis. How do you go about deciding? In the null hypothesis significance testing (NHST) framework, you start by assuming the null hypothesis (that there's no correlation or effect between the variables in the original population) is true. Next, you gauge the relationship between the variables in the sample taken from that population by computing some test statistic on that sample.

If that value of the test statistic would be extremely unlikely under the null hypothesis, you reject the null hypothesis (otherwise, you fail to reject it and so retain it). How unlikely the sample result is, assuming the null hypothesis is true, is measured by the p-value: the probability of obtaining a result at least as extreme as the one observed in the sample if the null hypothesis is true.

If the p-value is lower than some pre-determined threshold, α (which is usually set to 0.01 or 0.05), you reject the null hypothesis; otherwise, you retain it. As such, the p-value is not the probability that the null hypothesis is true but, instead, the probability of getting the sample result if the null hypothesis is true.
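The reject-or-retain rule can be sketched with an exact permutation test, which is one simple way (chosen here for illustration, not prescribed by this page) to get a p-value without distributional assumptions. The two groups below are hypothetical monthly-hours figures invented for the example, and α = 0.05 is an assumed threshold.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation test on the difference in group means.

    Under the null hypothesis the group labels are interchangeable, so every
    way of splitting the pooled data into two groups is equally likely.
    The p-value is the share of splits whose difference in means is at least
    as extreme as the one actually observed.
    """
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    observed = abs(mean(group_a) - mean(group_b))

    extreme = 0
    total = 0
    for chosen in combinations(range(len(pooled)), n_a):
        chosen_set = set(chosen)
        a = [pooled[i] for i in chosen_set]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen_set]
        total += 1
        # Small tolerance guards against floating-point round-off.
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            extreme += 1
    return extreme / total


# Hypothetical samples: average monthly hours for staff who stayed vs. left.
stayed = [150, 155, 160, 165]
left = [190, 200, 210, 220]

ALPHA = 0.05  # assumed significance threshold
p_value = permutation_p_value(stayed, left)
reject_null = p_value < ALPHA

print(f"p-value={p_value:.4f} reject_null={reject_null}")
```

Here only 2 of the 70 possible splits are as extreme as the observed one, so the p-value is 2/70 (about 0.029) and the null hypothesis is rejected at α = 0.05; at α = 0.01 it would be retained.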

Thus, if the probability of getting that test statistic value is very low (lower than alpha), you reject the null hypothesis. Here is an excellent page that shows examples of how to craft null hypotheses and alternative hypotheses as well as which statistical tests (like t-tests, z-tests, linear regressions, etc.) are associated with the most popular studies.

Statistical Testing Options

This is one of the best references for Choosing the Right Statistical Test; it details both the different types of tests and examples of when they're most appropriate. In addition, it includes several tables that detail the statistical assumptions for these tests, the data types of the independent and dependent variables (whether categorical or quantitative), which influence which test is most appropriate for your study, and the difference between parametric and non-parametric tests. The parametric statistical tests cover regression (linear and logistic regression tests), comparison (t-tests, ANOVA, and MANOVA tests), or correlation (Pearson's r) tests, as summarized in their diagram below.

Statistical Test Flowchart

However, when you can't make the assumptions about the data that are required for the parametric statistical tests, you can instead use the non-parametric equivalents of most of those parametric tests by using tests like Spearman’s r (instead of Pearson's r), Chi-square (categorical-categorical), Sign test or Wilcoxon (instead of t-test), Kruskal–Wallis or ANOSIM (instead of ANOVA or MANOVA), etc.
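As a small illustration of one parametric/non-parametric pair, the sketch below computes Pearson's r and Spearman's rank correlation in plain Python. Spearman's coefficient is computed here as Pearson's r applied to the ranks of the data, with ties given their average rank. The data are hypothetical: a relationship that is perfectly monotonic but not linear.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation: strength of the linear relationship."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def spearman_r(x, y):
    """Spearman correlation: Pearson's r computed on the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


# Hypothetical data: y grows with x, but quadratically rather than linearly.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]

pearson = pearson_r(x, y)
spearman = spearman_r(x, y)
print(f"Pearson r={pearson:.4f} Spearman r={spearman:.4f}")
```

Because Spearman's coefficient only looks at the ordering of the values, it reports a perfect monotonic relationship here, while Pearson's r is slightly below 1 because the relationship is not a straight line.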

Scientific Method/Falsifiability?

Machine Learning vs. Data Science

Prediction | Decision
What video the learner wants to watch next. | Show those videos in the recommendation bar.
Probability someone will click on a search result. | If P(click) > 0.12, prefetch the web page.
What fraction of a video ad the user will watch. | If a small fraction, don’t show the user the ad.
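The split between a prediction and the decision built on it can be sketched with the search-result row above. The 0.12 threshold comes from that row; the result names and click probabilities are hypothetical values standing in for a model's output.

```python
PREFETCH_THRESHOLD = 0.12  # threshold taken from the P(click) row above


def should_prefetch(p_click, threshold=PREFETCH_THRESHOLD):
    """Decision step: act only when the predicted probability exceeds the threshold."""
    return p_click > threshold


# Hypothetical predictions: the model's estimated click probability per result.
predicted_click_probability = {
    "result_a": 0.30,
    "result_b": 0.12,
    "result_c": 0.05,
}

to_prefetch = [
    name
    for name, p_click in predicted_click_probability.items()
    if should_prefetch(p_click)
]
print(to_prefetch)
```

The model only supplies the probabilities; the threshold and the action taken are separate choices layered on top of the prediction.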