Milestone 1: Select Dataset and Formulate Research Question

Description

For Milestone One, please choose a dataset from the web. It can be a dataset you explored in your lab, one you liked from the standard list of datasets, or a completely different data source you want to explore. Then, write a data analysis research question, which should be framed as a discrete set of choices to be analyzed, perhaps using decision trees or one of the other machine learning algorithms you've seen in the course or in your own research.

These two initial pieces of the final project are important to determine as early and accurately as possible in order to guide the next phases of your research. But this is, of course, an iterative process, and the research question can be fine-tuned as you go on to the subsequent milestones. For Milestone One, please ensure you have the following sections at a minimum:

Research Question: start with a business problem and convert it to a research problem statement
Dataset description and details: describe the dataset, summarize its statistics, and include links to its source/provenance
Reasoning and Hypothesis: explain what you expect to find and why
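
As an illustration of what "summarize its statistics" might involve, here is a minimal sketch using only the Python standard library. The inline rows and the column names (age, length_of_service, terminated) are made up for the example and stand in for whatever dataset you choose; in practice you would read your own file instead.

```python
import csv
import io
import statistics

# Made-up rows standing in for a real dataset file; column names are hypothetical.
TOY_CSV = """age,length_of_service,terminated
34,5,0
51,22,1
45,,0
29,2,1
"""

def summarize_numeric(rows, column):
    """Return count, missing count, mean, min, and max for one numeric column."""
    values = [float(r[column]) for r in rows if r[column] != ""]
    return {
        "count": len(values),
        "missing": len(rows) - len(values),
        "mean": statistics.mean(values),
        "min": min(values),
        "max": max(values),
    }

rows = list(csv.DictReader(io.StringIO(TOY_CSV)))
summary = {col: summarize_numeric(rows, col) for col in rows[0]}
for col, stats in summary.items():
    print(col, stats)
```

Reporting the number of missing values alongside the mean and range gives an early signal of whether the data is clean enough to use.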

Research Question Details

The research question should be falsifiable, that is, it should be possible to test it scientifically. It should be formulated as one of the common questions machine learning answers, listed below. It should also identify the predictor and response variables. So, in the end, you should formulate a falsifiable hypothesis that can be tested with quantitative data.

In general, the kinds of questions you can test with a machine learning approach are:

What category label should this thing have?
- E.g., is this animal a cat?
What number characterizes this thing?
- E.g., what will be the price of my stock next Wednesday?
What should I do next?
- E.g., in which direction should my robot go next?
Is there some structure in this thing?
- E.g., can you group moviegoers as people who like action movies or comedies?
Is this thing weird compared to other things like it?
- E.g., is this credit card charge abnormal?

Tips

A vague research question does not force an answer in the form of a category label or a number. When thinking about a good research question, people often imagine they've found a mischievous genie who will truthfully answer all questions but will also try to make the answer vague and confusing. So they'll try to pin the genie down with a question so airtight that the genie has no choice but to answer insightfully.

For example, if you asked a vague question like, "Is there a correlation between stock prices and time?", the genie might answer, "Yes, the price will change with time." Although correct, this answer is not very helpful. But if you formulate a precise, quantitative question like, "Will my stock hit $50 by next Wednesday?", the genie has no choice but to commit to a specific prediction about the stock's price.

Thus, as you start to formulate your research question, you'll likely increasingly sharpen your initial ideas by going through these steps:

It helps to start by first stating the business problem. E.g., What factors are related to employee churn? Can we predict future terminations?
At this stage, the specific, measurable business outcome(s) should be clearly identified, as well as any quantifiable, actionable Key Performance Indicators (KPIs) that are relevant to the business problem.
Then, convert that business problem into an initial problem statement by making it precise, quantitative, and usable with a machine learning model. E.g., is an employee's terminated status correlated with their age, length of service, and business unit?
In general, the initial problem statement should be precise and quantitative and is often expressed as some independent variable(s) (the predictor(s)) having a presumed correlation with some dependent variable (the response variable or class label).
Once you have a precise, quantifiable question, you will finally have to fine-tune it to match one or more of the questions that machine learning usually answers, as listed above. Once you decide which of those questions it fits, you'll formulate your initial research question.
E.g., how likely is it that someone under 50, with more than 20 years of service, working in the IT Department will be gone within a year? This would be an example of a regression task where you predict a number (a probability), but you can also convert it to a classification task where you predict a category label, that is, whether that person belongs in category A or category B.
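
The conversion from a predicted number to a category label can be sketched as a simple threshold rule. This is only an illustration: the function name, the label names, the 0.5 cutoff, and the probabilities are all assumptions made for the example, not part of any particular model.

```python
def to_label(probability, threshold=0.5):
    """Map a predicted probability of leaving to a category label."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return "leaves" if probability >= threshold else "stays"

# Made-up model outputs for four hypothetical employees.
predicted = [0.82, 0.35, 0.50, 0.10]
labels = [to_label(p) for p in predicted]
print(labels)
```

The choice of threshold is itself a decision worth justifying, since moving it trades one kind of error for another.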

Converting Research Questions to Research Hypotheses

Once you've formulated your initial research question, you can then further convert it to a specific kind of research question. There are three kinds, or levels, of additional research questions you can address: a Machine Learning (ML) Research Question, a Data Science (DS) Research Question, or a Research Hypothesis.

Formulating a ML Research Question:

You can further convert your research question to a full-fledged Machine Learning (ML) Research Question by:

Confirming the data exists and is relevant to the problem domain
Framing the ML context for the problem
Identifying variables and finding correlations
Deciding which variable(s) will be the response variable(s) using if-then constructions
Deciding upon the performance measure or metric for the ML model

EXAMPLE: Using the Titanic dataset, can we use a logistic regression model to predict a person's probability of survival, with at least 70% classification accuracy, from their gender, age, and cabin location?

You'll often add the reasoning for this initial hypothesis, as well, which might look something like this:

REASONING: Given the positive correlation between survival and each of traveling in a first-class cabin, being a woman, and being a child, we anticipate that young girls in first-class cabins have the highest survival rate on the Titanic. As such, we will utilize the Titanic dataset to test this premise.
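
To make the performance measure concrete, the sketch below checks predictions against a 70% accuracy target and compares them with a majority-class baseline. The labels and predictions are invented for the example (1 = survived, 0 = did not survive) and are not taken from the Titanic dataset; the function names are likewise hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def majority_baseline(y_true):
    """Accuracy obtained by always predicting the most common label."""
    most_common = max(set(y_true), key=y_true.count)
    return y_true.count(most_common) / len(y_true)

# Made-up labels and predictions, for illustration only.
y_true = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 1, 1, 0, 0]

model_accuracy = accuracy(y_true, y_pred)
baseline = majority_baseline(y_true)
meets_target = model_accuracy >= 0.70

print(f"model: {model_accuracy:.2f}, baseline: {baseline:.2f}, meets target: {meets_target}")
```

Comparing against the baseline matters because a target can sometimes be met by always guessing the most common class, in which case the model has learned little.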

Formulating a DS Research Question:

A machine learning problem makes predictions but a data science problem makes decisions, as well. In order to convert the ML Research Question to a Data Science (DS) Research Question, you will have to consider:

How will that prediction help solve the business problem?
What is a business strategy based on that prediction?
How will you align the ML metrics with the business outcome and KPIs?

EXAMPLE: Being able to predict survival based on gender, age, and cabin location will allow us to decide how best to populate the new Titanic voyage. If the survival rate for young girls in first-class cabins is more than 90%, we will re-allocate their population to increase overall survivability by 15%, in alignment with the KPI for the business outcome of increasing survivability by at least 10% for the new Titanic voyage.

Formulating a Research Hypothesis:

Finally, if you're involved in fundamental research, you can convert the ML Research Question to a Research Hypothesis. In order to convert to a Hypothesis Testing framework, you need to:

Establish the theoretical/conceptual framework that will serve as the basis, or context, for your answer to the research question
Formulate your answer as a clear description of the relationship between the variables of interest in your study
Write the predicted answer to your research question as an If-Then formulation
Re-phrase it next as a correlation with the evaluation metric
Convert it to a Hypothesis Testing framework

EXAMPLE: H_0: Null Hypothesis: Gender, age, and cabin location have no effect on survival on the Titanic. H_a: Alternative Hypothesis: Young girls in first-class cabins will have a 90% higher likelihood of survival than old men in the lower decks.
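
One possible way to test a null hypothesis of "no effect" between a two-level grouping and a two-level outcome is a Pearson chi-square test on a 2x2 table. The sketch below is written from the textbook formula without a continuity correction, and the counts are made up for illustration (they are not Titanic figures). The significance level of 0.05 is also an assumption.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic and p-value (1 degree of freedom)
    for the table [[a, b], [c, d]], without continuity correction."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column must have a nonzero total")
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom, P(X > chi2) equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Made-up counts: rows are two groups, columns are (survived, did not survive).
chi2, p_value = chi_square_2x2(30, 10, 10, 30)
alpha = 0.05  # assumed significance level
reject_null = p_value < alpha

print(f"chi2 = {chi2:.2f}, p = {p_value:.2g}, reject H_0: {reject_null}")
```

A small p-value only says the groups differ; whether the difference matches the size claimed in H_a has to be checked separately, for example by comparing the survival rates directly.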

Advice

Look at the datasets you plan on using to make sure they are usable and will work. If you plan on creating a dataset (e.g., by scraping a website) convince the reader this will be feasible (you don't have to have the scraper working perfectly right now but you should before the next milestone).

Assignment

Please read the final project description and then write a project proposal as follows.

You should start by brainstorming a long list of ideas, then narrow it down to a couple that are feasible given your background knowledge, the time constraints, and the datasets you're able to locate. You should answer questions like:

What questions will you try to answer? List 5-10 possible questions.
What datasets will you use? You should have already found and taken a first look at the datasets. Make sure the data is clean enough to reasonably use and actually has the information content to answer your questions.
How will you use the data to try to answer the question? What are some things you will do with the data to get at your questions? For example, what are some plots you might make?
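
As one example of a first look at the data, a quick bar chart of the response variable shows how balanced the classes are. The sketch below draws it as plain text so that it needs no plotting library; the function name and the status values are hypothetical, and a real project would more likely use a plotting package.

```python
from collections import Counter

def text_bar_chart(values, width=20):
    """Return one line per distinct value, with a bar scaled to the largest count."""
    counts = Counter(values)
    if not counts:
        raise ValueError("values must not be empty")
    largest = max(counts.values())
    lines = []
    for value, count in sorted(counts.items()):
        bar = "#" * round(width * count / largest)
        lines.append(f"{value:<12}{bar} {count}")
    return lines

# Made-up response column: 8 active employees and 2 terminated ones.
status = ["active"] * 8 + ["terminated"] * 2
chart = text_bar_chart(status)
print("\n".join(chart))
```

A heavily lopsided chart like this one is a hint that plain accuracy may be a misleading metric for the question.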

You should write the proposal for one of these ideas but please do keep a couple as backups in case the original project doesn't work out for some reason.

The point of this milestone is to think through a reasonable project. You will not be held to doing exactly what you say you will do in this proposal. In fact, you should anticipate adapting your project as you continue to work on it (just ask Robert Burns or Mike Tyson). The more you put into the proposal, however, the better your life will be a few weeks from now as you work on the next few milestones.

Deliverables

Write a one-page initial project proposal which describes your proposed project and discusses:

What research question will you try to answer?
What dataset(s) will you use?
What are some analyses and visualizations you might carry out on that dataset?

On a second page, include a list of three (3) other ideas you brainstormed, with a couple of bullet points of detail for each.