Kaggle TMDB Box Office Prediction

Building a model to predict the box office figures for movies using the Random Forest algorithm.

The solution notebook (R) is available here.


Background & Problem Description

In a world where movies made an estimated $41.7 billion in 2018, the film industry is more popular than ever. But what movies make the most money at the box office? How much does a director matter? Or the budget?

In this competition hosted by Kaggle, we were given metadata on over 7,000 past films from The Movie Database and asked to predict their overall worldwide box office revenue. Data points provided include cast, crew, plot keywords, budget, posters, release dates, languages, production companies, and countries. Only data available before a movie’s release is used to build the model.

Data Source

The data comes from The Movie Database (TMDB).

Two data files, train.csv and test.csv, were provided as part of this competition.

Additionally, a third file, sample_submission.csv, was available; it shows the structure of the solution file to be submitted.

More details regarding the data can be found here.
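As a rough illustration of the submission structure, the sketch below writes predictions to a two-column CSV. It is written in Python for brevity and is not taken from the R notebook. The column names `id` and `revenue`, the helper `write_submission`, and the example ids and values are assumptions made for illustration; check sample_submission.csv for the actual header.

```python
import csv
import io


def write_submission(predictions, out):
    """Write (id, revenue) pairs as CSV rows to the file-like object `out`."""
    writer = csv.writer(out, lineterminator="\n")
    # Assumed header; confirm against sample_submission.csv.
    writer.writerow(["id", "revenue"])
    for movie_id, revenue in predictions:
        writer.writerow([movie_id, revenue])


# Hypothetical ids and predicted revenues, for illustration only.
example_predictions = [(3001, 1000000.0), (3002, 2500000.0)]

buffer = io.StringIO()
write_submission(example_predictions, buffer)
submission_text = buffer.getvalue()
```

Writing to a file-like object keeps the helper easy to check in memory; passing an open file handle instead would produce the file to upload.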

Analytical Approach

The R notebook implements the end-to-end analytical process used to address this business problem, in the following steps:

  1. Initial Setup & Loading the data
  2. Feature Engineering
  3. Exploratory Data Analysis
  4. Missing Value Treatment
  5. Analytical Dataset creation
  6. Model Building
  7. Prediction on Test dataset

Evaluation

For each movie (id) in the test data set, the worldwide box office revenue had to be predicted. Submissions were evaluated on the Root Mean Squared Logarithmic Error (RMSLE) between the predicted and the actual revenue.
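RMSLE is the square root of the mean squared difference between log(predicted + 1) and log(actual + 1). A minimal sketch of the metric follows, in Python rather than the notebook's R; the function name `rmsle` and its input checks are illustrative choices, not the competition's scoring code.

```python
import math


def rmsle(predicted, actual):
    """Root Mean Squared Logarithmic Error of two equal-length sequences."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not predicted:
        raise ValueError("at least one value is required")
    # log1p(x) computes log(1 + x), which stays defined when x is 0.
    squared_log_errors = [
        (math.log1p(p) - math.log1p(a)) ** 2
        for p, a in zip(predicted, actual)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))
```

Because the metric compares logarithms, it penalizes relative error rather than absolute error: for large revenues, a prediction that is off by a given ratio costs about the same regardless of how much the movie earned. This is one common reason to fit a model on log(revenue + 1) and transform predictions back before submitting.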

I took up this competition to practice and polish my analytical skills. Because the competition had ended long before I started, my submission does not appear on the public leaderboard.