For Milestone One, please choose a dataset from the web. It can be a dataset you explored in your lab, one you liked from the standard list of datasets, or a completely different data source you want to explore. Then, write a data analysis research question, which should be framed as a discrete set of choices to be analyzed, perhaps using decision trees or one of the other machine learning algorithms you've seen in the course or in your own research.
These two initial pieces of the final project are important to determine as early and as accurately as possible, because they guide the next phases of your research. This is, of course, an iterative process, and the research question can be fine-tuned as you go on to the subsequent milestones. For Milestone One, please ensure you have the following sections at a minimum:
The research question should be falsifiable, that is, testable scientifically. It should be formulated as one of the common questions machine learning answers, listed below. It should also identify the predictor and response variables. In the end, you should have a falsifiable hypothesis that can be tested with quantitative data.
In general, the kinds of questions you can test with a machine learning approach are:
What category label should this thing have?
What number characterizes this thing?
What should I do next?
Is there some structure in this thing?
Is this thing weird compared to other things like it?
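As a rough guide, each of the questions above is usually paired with a family of machine learning tasks. The sketch below records that usual pairing in a small lookup table. The names QUESTION_TO_TASK and task_for are made up for illustration, and the pairing is a convention rather than a rule.

```python
# Usual (not mandatory) pairing between question type and ML task family.
QUESTION_TO_TASK = {
    "What category label should this thing have?": "classification",
    "What number characterizes this thing?": "regression",
    "What should I do next?": "reinforcement learning",
    "Is there some structure in this thing?": "clustering or dimensionality reduction",
    "Is this thing weird compared to other things like it?": "anomaly detection",
}


def task_for(question: str) -> str:
    """Return the usual task family, or flag the question as too vague."""
    return QUESTION_TO_TASK.get(question, "unclear: sharpen the question first")


print(task_for("What number characterizes this thing?"))
print(task_for("Is there a correlation between stock prices and time?"))
```

A question that does not match any entry is a hint that it still needs sharpening before it can be handed to a model.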
A vague research question doesn't have to be answered with a category label or a number; a sharp one does. When thinking about a good research question, people often imagine they've found a mischievous genie who will truthfully answer all questions but will also try to make the answer vague and confusing. So they'll try to pin the genie down with a question so airtight that the genie has no choice but to answer insightfully.
For example, if you asked a vague question like, "Is there a correlation between stock prices and time?", the genie might answer, "Yes, the price will change with time." Although correct, this answer's not very helpful. But if you formulate a precise, quantitative question like, "Will my stock hit $50 by next Wednesday?", the genie has no choice but to give a specific answer and predict the stock's price.
Thus, as you start to formulate your research question, you'll likely sharpen your initial ideas step by step:
It helps to start by first stating the business problem. E.g., What factors are related to employee churn? Can we predict future terminations?
At this stage, the specific, measurable business outcome(s) should be clearly identified, as well as any quantifiable, actionable Key Performance Indicators (KPIs) that are relevant to the business problem.
Then, convert that business problem into an initial problem statement by making it precise, quantitative, and usable with a machine learning model. E.g., is there a correlation between an employee's terminated status and their age, length of service, and business unit?
In general, the initial problem statement should be precise and quantitative and is often expressed as some independent variable(s) (the predictor(s)) having a presumed correlation with some dependent variable (the response variable or class label).
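One quick way to check whether a presumed correlation is worth pursuing is to compute a Pearson correlation between each numeric predictor and the response. The sketch below does this with the standard library only. The employee records are made up purely for illustration, and the column choices (age, length of service, terminated as 0 or 1) are assumptions based on the churn example above.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up records for illustration only: (age, length_of_service, terminated)
employees = [
    (25, 1, 1), (31, 3, 1), (38, 8, 0), (44, 12, 0),
    (52, 20, 0), (58, 25, 0), (63, 30, 1), (29, 2, 1),
]

terminated = [row[2] for row in employees]  # response variable (0 or 1)
for name, col in (("age", 0), ("length_of_service", 1)):
    r = pearson([row[col] for row in employees], terminated)
    print(f"{name} vs terminated: r = {r:+.2f}")
```

A categorical predictor such as business unit would need a different measure (for example, comparing termination rates per unit), so it is left out of this sketch.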
Once you have a precise, quantifiable question, fine-tune it to match one or more of the questions that machine learning usually answers, as shown below. Once you decide which of those questions it fits, you'll formulate your initial research question.
E.g., how likely is it that someone under 50, with more than 20 years of service, working in the IT Department will be gone within a year? This would be an example of a regression task, where you predict a number (a probability), but you can also convert it to a classification task, where you predict a category label: whether that person belongs in category A or category B.
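The conversion from a predicted number to a category label usually comes down to choosing a cutoff. Here is a minimal sketch, assuming a model has already produced the probabilities. The employee names, the probabilities, and the 0.5 threshold are all hypothetical.

```python
def to_label(probability: float, threshold: float = 0.5) -> str:
    """Turn a predicted probability of leaving into a category label."""
    return "leaves" if probability >= threshold else "stays"


# Hypothetical model outputs: predicted probability of leaving within a year.
predicted = {"employee_a": 0.82, "employee_b": 0.35, "employee_c": 0.50}

labels = {name: to_label(p) for name, p in predicted.items()}
print(labels)
```

The threshold is a design choice: lowering it flags more people as likely to leave, at the cost of more false alarms.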

Once you've formulated your initial research question, you can then further convert it to a specific kind of research question. There are three kinds, or levels, of additional research questions you can address: a Machine Learning (ML) Research Question, a Data Science (DS) Research Question, or a Research Hypothesis.
You can further convert your research question to a full-fledged Machine Learning (ML) Research Question by:
EXAMPLE: Using the Titanic dataset, can we use a logistic regression model to predict a person's probability of survival with at least 70% accuracy, based on their gender, age, and cabin location?
You'll often add the reasoning for this initial hypothesis, as well, which might look something like this:
REASONING: Given the positive correlation between survival and being in a first-class cabin, being a woman, or being a child, we anticipate that young girls in first-class cabins have the highest survival rate on the Titanic. As such, we will utilize the Titanic dataset to test this premise.
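An ML Research Question with a stated accuracy target can be checked directly once you have predictions for held-out data. The sketch below shows only that check. The passenger rows are made up, cabin class is used as a stand-in for cabin location, and baseline_predict is a hypothetical hand-written rule that takes the place of a fitted model.

```python
TARGET_ACCURACY = 0.70  # the 70% target stated in the example question


def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


def baseline_predict(sex, age, pclass):
    # Stand-in rule, not a trained model: women and children under 13 survive.
    return 1 if sex == "female" or age < 13 else 0


# Made-up held-out passengers for illustration: (sex, age, pclass, survived)
test_rows = [
    ("female", 8, 1, 1), ("female", 30, 1, 1), ("male", 45, 3, 0),
    ("male", 22, 3, 0), ("female", 27, 3, 0), ("male", 35, 1, 1),
    ("male", 60, 2, 0), ("female", 19, 2, 1), ("male", 4, 2, 1),
    ("male", 28, 3, 0),
]

predictions = [baseline_predict(sex, age, pclass) for sex, age, pclass, _ in test_rows]
actual = [row[3] for row in test_rows]

acc = accuracy(predictions, actual)
meets_target = acc >= TARGET_ACCURACY
print(f"accuracy = {acc:.2f}, meets target: {meets_target}")
```

In your own project, the predictions would come from the model you train, and the held-out rows from a split of the real dataset.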
A machine learning problem makes predictions but a data science problem makes decisions, as well. In order to convert the ML Research Question to a Data Science (DS) Research Question, you will have to consider:
EXAMPLE: Being able to predict survival based on gender, age, and cabin location will allow us to decide how best to populate the new Titanic voyage. If the survival rate for young girls in first-class cabins is more than 90%, we will re-allocate their population to increase overall survivability by 15%, in alignment with the KPI for the business outcome of increasing survivability by at least 10% for the new Titanic voyage.
Finally, if you're involved in fundamental research, you can convert the ML Research Question to a Research Hypothesis. In order to convert to a Hypothesis Testing framework, you need to:
EXAMPLE:
H0 (Null Hypothesis): Gender, age, and cabin location have no effect on survival on the Titanic.
Ha (Alternative Hypothesis): Young girls in first-class cabins will have a 90% higher likelihood of survival as compared to old men in the lower decks.
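A null hypothesis of "no effect" can be tested one factor at a time with a chi-square test of independence. The sketch below covers only gender versus survival, which is a simplification of the three-factor hypothesis above. The counts are made up for illustration, and the function name chi_square_2x2 is hypothetical. The critical value 3.841 is the standard chi-square cutoff for 1 degree of freedom at a 0.05 significance level.

```python
def chi_square_2x2(table):
    """Pearson chi-square statistic for a 2x2 table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed_count in enumerate(row):
            # Expected count if the row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed_count - expected) ** 2 / expected
    return stat


# Made-up counts: rows = (female, male), columns = (survived, died)
observed = [[70, 30], [20, 80]]

CRITICAL_VALUE = 3.841  # chi-square, 1 degree of freedom, alpha = 0.05
stat = chi_square_2x2(observed)
reject_null = stat > CRITICAL_VALUE
print(f"chi-square = {stat:.2f}, reject H0: {reject_null}")
```

If the statistic exceeds the critical value, you reject H0 for that factor. Rejecting H0 says only that survival and gender are not independent; a specific claim such as "90% higher likelihood" needs its own effect-size estimate.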
Look at the datasets you plan on using to make sure they are usable and will work. If you plan on creating a dataset (e.g., by scraping a website), convince the reader this will be feasible. You don't need to have the scraper working perfectly right now, but you should before the next milestone.
Please read the final project description and then write a project proposal as follows.
You should start by brainstorming a long list of ideas, then narrow it down to a couple that are feasible given your knowledge background, the time constraints, and the available datasets you're able to locate. You should answer questions like:
You should write the proposal for one of these ideas but please do keep a couple as backups in case the original project doesn't work out for some reason.
The point of this milestone is to think through a reasonable project. You will not be held to doing exactly what you say you will do in this proposal. In fact, you should anticipate adapting your project as you continue to work on it (just ask Robert Burns or Mike Tyson). The more you put into the proposal, however, the better your life will be a few weeks from now as you work on the next few milestones.
Write a one-page initial project proposal which describes your proposed project and discusses:
On a second page, include a list of three (3) other ideas you brainstormed, with a couple of bullet points of detail for each.