{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# TUTORIAL 1: Pandas and MovieLens\n",
    "\n",
    "\n",
    "Let's revisit the Python notebook mentioned in the readings on the MovieLens dataset. We'll start by creating our own notebook in Python to do the following tasks:\n",
    "\n",
    "* Each movie in the database has a genre (comedy, animation, etc.) associated with it. Show the top 20 genres with the highest number of responses from users.\n",
    "* Show the top 20 genres sorted by average ratings.\n",
    "* Show the top 20 movies sorted by descending mean female ratings for a specific genre (say \"Drama\").\n",
    "\n",
    "Let's start by first setting up Pandas!\n",
    "\n",
    "# Pandas\n",
    "\n",
    "The first thing we have to do to start using the Pandas library is import the module with `import pandas as pd`, as shown below. Once you have that set up, run each of the following cells and observe what it does."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Data is from https://grouplens.org/datasets/movielens/\n",
    "# Use \"MovieLens 1M Dataset\" and download it to the local directory first\n",
    "unames = ['user_id', 'gender', 'age', 'occupation', 'zip'] \n",
    "users = pd.read_table('data/ml-1m/users.dat', sep='::', header=None, \n",
    "                      names=unames, engine='python')"
   ]
  },
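  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `.dat` files separate fields with a double colon (`::`). pandas treats a separator longer than one character as a regular expression, which the default C parser does not support, hence `engine='python'` in these calls. A minimal sketch of the same call on an in-memory sample follows; the two rows are invented for illustration, and only the column layout follows `unames`:\n",
    "\n",
    "```python\n",
    "import io\n",
    "import pandas as pd\n",
    "\n",
    "# Two invented rows in the same '::'-delimited layout as users.dat\n",
    "sample = io.StringIO('''1::F::25::10::48067\n",
    "2::M::56::16::70072\n",
    "''')\n",
    "\n",
    "unames = ['user_id', 'gender', 'age', 'occupation', 'zip']\n",
    "users_sample = pd.read_table(sample, sep='::', header=None,\n",
    "                             names=unames, engine='python')\n",
    "print(users_sample)\n",
    "```"
   ]
  },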
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "mnames = ['movie_id', 'title', 'genres']\n",
    "movies = pd.read_table('data/ml-1m/movies.dat', sep='::', header=None,\n",
    "                        names=mnames, engine='python')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "rnames = ['user_id', 'movie_id', 'rating', 'timestamp']\n",
    "ratings = pd.read_table('data/ml-1m/ratings.dat', sep='::', header=None, \n",
    "                        names=rnames, engine='python')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "users[:5]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "movies[:5]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ratings[:5]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "users_ratings = pd.merge(users, ratings)\n",
    "users_ratings[:5]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# create one merged DataFrame\n",
    "movies_ratings = pd.merge(movies, ratings)\n",
    "movies_ratings[:5]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = pd.merge(movies_ratings, users)\n",
    "data[:5]"
   ]
  },
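  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "None of the `pd.merge` calls above names a join key. With no `on=` argument, pandas joins on the columns the two frames have in common (`user_id` for users and ratings, `movie_id` for movies and ratings) and, by default, keeps only keys present in both frames (an inner join). A small sketch with invented toy frames, where only the column names mirror the tutorial:\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Invented toy data; movie 3 has no ratings\n",
    "toy_movies = pd.DataFrame({'movie_id': [1, 2, 3],\n",
    "                           'title': ['A', 'B', 'C']})\n",
    "toy_ratings = pd.DataFrame({'user_id': [10, 10, 11],\n",
    "                            'movie_id': [1, 2, 1],\n",
    "                            'rating': [5, 3, 4]})\n",
    "\n",
    "# No on= given: the shared column 'movie_id' becomes the join key\n",
    "implicit = pd.merge(toy_movies, toy_ratings)\n",
    "\n",
    "# Naming the key explicitly gives the same result and documents intent\n",
    "explicit = pd.merge(toy_movies, toy_ratings, on='movie_id', how='inner')\n",
    "\n",
    "print(implicit)\n",
    "```\n",
    "\n",
    "Movie `C` drops out of the result because the inner join keeps only movies that appear in both frames."
   ]
  },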
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.iloc[2]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.iloc[443889]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you try to get a sense of what all this slicing and dicing of dataframes does, the following image might help:\n",
    "\n",
    "![dataframe-slicing-dicing](dataframe-slicing-dicing.png)"
   ]
  },
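  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough companion to the picture, here is a sketch of the selection styles used in this notebook, applied to an invented toy frame: rows by position, rows by label, a single column by name, and a boolean mask.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Invented toy frame indexed by title\n",
    "toy = pd.DataFrame({'F': [4.5, 3.0, 4.0],\n",
    "                    'M': [4.0, 3.5, 2.5]},\n",
    "                   index=['A', 'B', 'C'])\n",
    "\n",
    "first_two = toy[:2]                  # first two rows, by position\n",
    "third_row = toy.iloc[2]              # one row by position, returned as a Series\n",
    "row_b = toy.loc['B']                 # one row by index label\n",
    "f_column = toy['F']                  # one column by name\n",
    "liked_by_f = toy[toy['F'] >= 4.0]    # rows where the condition holds\n",
    "```"
   ]
  },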
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get average ratings of all movies and separate by gender\n",
    "mean_ratings = data.pivot_table('rating', index=['title'],\n",
    "                    columns='gender', aggfunc='mean')\n",
    "mean_ratings[:5]"
   ]
  },
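  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`pivot_table` groups the rows by the `index` column(s), spreads the distinct values of `columns` across the top, and fills each cell with `aggfunc` applied to the chosen values. A sketch on invented ratings, small enough to check by hand:\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Invented ratings; the column names mirror the merged frame above\n",
    "toy = pd.DataFrame({'title':  ['A', 'A', 'A', 'B', 'B'],\n",
    "                    'gender': ['F', 'F', 'M', 'F', 'M'],\n",
    "                    'rating': [5, 3, 4, 2, 3]})\n",
    "\n",
    "# One row per title, one column per gender, mean rating in each cell\n",
    "toy_means = toy.pivot_table('rating', index='title',\n",
    "                            columns='gender', aggfunc='mean')\n",
    "print(toy_means)\n",
    "```"
   ]
  },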
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "mean_ratings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "mean_ratings2 = data.pivot_table('rating', index=['title', 'genres'],\n",
    "                    columns='gender', aggfunc='mean')\n",
    "mean_ratings2[:5]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "num_ratings = data.groupby('title').size()\n",
    "num_ratings[:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# meaningful ratings are when we have at least 250 people rate a movie\n",
    "meaningful_ratings = num_ratings.index[num_ratings >= 250]\n",
    "meaningful_ratings"
   ]
  },
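  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`groupby('title').size()` returns a Series of row counts indexed by title, so comparing it with a threshold yields a boolean mask that can pick the qualifying titles out of `.index`. A sketch with invented data and a threshold of 2 in place of 250:\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Invented ratings: A is rated three times, B once, C twice\n",
    "toy = pd.DataFrame({'title':  ['A', 'A', 'A', 'B', 'C', 'C'],\n",
    "                    'rating': [5, 3, 4, 2, 3, 4]})\n",
    "\n",
    "counts = toy.groupby('title').size()\n",
    "popular = counts.index[counts >= 2]   # titles with at least 2 ratings\n",
    "\n",
    "# The resulting Index can be passed to .loc on anything indexed by title\n",
    "toy_mean = toy.groupby('title')['rating'].mean()\n",
    "print(toy_mean.loc[popular])\n",
    "```"
   ]
  },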
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "meaningful_mean_ratings = mean_ratings.loc[meaningful_ratings]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "meaningful_mean_ratings[:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "top_female_ratings = meaningful_mean_ratings.sort_values(by='F', ascending=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "top_female_ratings[:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "top_male_ratings = meaningful_mean_ratings.sort_values(by='M', ascending=False)\n",
    "top_male_ratings[:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "meaningful_mean_ratings['diff'] = meaningful_mean_ratings['M'] - meaningful_mean_ratings['F']\n",
    "sorted_by_diff = meaningful_mean_ratings.sort_values(by='diff')\n",
    "sorted_by_diff[:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sorted_by_diff[-10:]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "rating_std = data.groupby('title')['rating'].std()\n",
    "rating_std = rating_std.loc[meaningful_ratings] # filter only meaningful ones\n",
    "rating_std.sort_values(ascending=False)[:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "rating_std.sort_values()[:10]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Great job done getting a handle on the basics! \n",
    "\n",
    "# Extensions\n",
    "\n",
    "Let's see if you can now do the following tasks mentioned earlier:\n",
    "\n",
    "* Each movie in the database has a genre (comedy, animations, etc.) associated with it. Show the top 20 genres with the highest number of responses from users.\n",
    "* Show the top 20 genres sorted by average ratings.\n",
    "* Show the top 20 movies sorted by descending mean female ratings for a specific genre (say \"Drama\")."
   ]
  }
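  ,
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One wrinkle for the genre tasks: in the MovieLens 1M files a movie can list several genres in one string, joined with `|`. A possible starting point, not the only approach, is to split that string and give each genre its own row before grouping. The sketch below uses invented rows, the names `per_genre`, `genre_counts` and `genre_means` are made up for illustration, and `explode` assumes pandas 0.25 or later:\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Invented rows; the column names mirror the merged frame in this notebook\n",
    "toy = pd.DataFrame({'title':  ['A', 'A', 'B'],\n",
    "                    'genres': ['Comedy|Drama', 'Comedy|Drama', 'Drama'],\n",
    "                    'rating': [4, 2, 5]})\n",
    "\n",
    "# Split the genre string into a list, then emit one row per genre\n",
    "per_genre = toy.assign(genre=toy['genres'].str.split('|')).explode('genre')\n",
    "\n",
    "# Number of ratings and mean rating for each individual genre\n",
    "genre_counts = per_genre.groupby('genre').size().sort_values(ascending=False)\n",
    "genre_means = per_genre.groupby('genre')['rating'].mean()\n",
    "print(genre_counts.head(20))\n",
    "```"
   ]
  }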
 ],
 "metadata": {},
 "nbformat": 4,
 "nbformat_minor": 4
}