Our client is a real estate company whose niche is purchasing properties within New York City to rent out short-term.
The client has already concluded that two-bedroom properties are the most profitable. However, they want to identify which zip codes are the best to invest in.
The objective of this analysis is to build a data product that assists the client in identifying the zipcodes that are favorable and will generate the most profit on short-term rentals within New York City.
The following assumptions have been made while implementing this data model -
We have used 2 datasets in our analysis -
AirBnb Listings data for New York City - Contains information about listings in New York City, such as location, host information, number of bedrooms, reviews received, etc. _http://data.insideairbnb.com/united-states/ny/new-york-city/2019-07-08/data/listings.csv.gz_
Zillow Property Cost data - Contains the selling cost of 2-bedroom properties in each zipcode for various cities. The cost information is available from April 1996 to June 2017. _https://www.zillow.com/research/data/_
The data dictionary for these two datasets has been provided in the metadata file ("Narsinghani.Meenal_DataChallenge_Metadata.xlsx") available in the zipped folder. Certain columns have been derived from the columns in these datasets; descriptions of those fields are included in the same file.
packages_reqd <- c("ggplot2", "tidyverse", "knitr", "plotly")
new_packages <- packages_reqd[!(packages_reqd %in% installed.packages()[,"Package"])]
if(length(new_packages))
install.packages(new_packages)
library(tidyverse)
library(ggplot2)
library(knitr)
library(plotly)
airbnb_listings_data <- read.csv("listings_airbnb.csv", header = TRUE, stringsAsFactors = FALSE)
head(airbnb_listings_data,20)
dim(airbnb_listings_data)
## [1] 48895 106
zillow_cost_data <- read.csv("Zip_Zhvi_2bedroom.csv", header = TRUE, stringsAsFactors = FALSE)
dim(zillow_cost_data)
## [1] 8946 262
i. Dropping columns
Before cleaning up the dataset, we drop the fields that will not be used in the analysis (considering its nature and scope).
remove_cols <- c(2,3, 5:37, 51, 59, 70:77, 82, 94:105)
airbnb_listings_subset <- airbnb_listings_data[ , -remove_cols]
dim(airbnb_listings_subset)
## [1] 48895 48
#Removing the original dataset
rm(airbnb_listings_data)
ii. Formatting column values
A quick glance at the data shows that certain columns need to be formatted before they can be used in the analysis.
#Strip the leading "$" from price fields and convert to numeric
#(values containing other non-numeric characters, such as thousands separators, become NA)
format_price <- function(var)
{
var <- as.numeric(gsub(pattern="\\$", replacement = "", var))
}
colnames(airbnb_listings_subset)
## [1] "id" "last_scraped"
## [3] "street" "neighbourhood"
## [5] "neighbourhood_cleansed" "neighbourhood_group_cleansed"
## [7] "city" "state"
## [9] "zipcode" "market"
## [11] "smart_location" "country_code"
## [13] "country" "latitude"
## [15] "longitude" "property_type"
## [17] "room_type" "accommodates"
## [19] "bathrooms" "bedrooms"
## [21] "beds" "bed_type"
## [23] "square_feet" "price"
## [25] "weekly_price" "monthly_price"
## [27] "security_deposit" "cleaning_fee"
## [29] "guests_included" "extra_people"
## [31] "minimum_nights" "maximum_nights"
## [33] "availability_30" "availability_60"
## [35] "availability_90" "availability_365"
## [37] "number_of_reviews" "number_of_reviews_ltm"
## [39] "first_review" "last_review"
## [41] "review_scores_rating" "review_scores_accuracy"
## [43] "review_scores_cleanliness" "review_scores_checkin"
## [45] "review_scores_communication" "review_scores_location"
## [47] "review_scores_value" "reviews_per_month"
#Keeping the non-price columns as-is and applying format_price to the price-related columns
airbnb_listings_cleaned <- data.frame(airbnb_listings_subset[c(1:23, 29, 31:48)],
                                      sapply(airbnb_listings_subset[c(24:28, 30)], format_price))
format_date <- function(var)
{
var <- as.Date(var, format="%Y-%m-%d")
}
airbnb_listings_cleaned[c("last_scraped","first_review","last_review")] <-
lapply(airbnb_listings_cleaned[c("last_scraped","first_review","last_review")],format_date)
#Since we will continue cleaning up listings_cleaned dataset, removing listings_subset data
rm(airbnb_listings_subset)
Changing different values that correspond to New York to the same format - “New York City”
#Viewing the city data
unique(airbnb_listings_cleaned$city[grep(pattern="^n|^N", airbnb_listings_cleaned$city)])
## [1] "New York" "NY"
## [3] "NY " "new york"
## [5] "New York City" "New York "
## [7] "NYC" "Nueva York"
## [9] "New-York" "New york"
## [11] "New York city" "NUEVA YORK"
## [13] "New York, US" "New York City "
## [15] "NewYork" "NYC "
## [17] "Newyork" "New York - Astoria"
## [19] "Neponsit" "New York , East Elmhurst "
## [21] "nyc" "New York East Elmhurst "
## [23] "New York,Manhattan " "New York, "
## [25] "Nyc" "New york "
## [27] "New York - Sunnyside " "Ny"
## [29] "Nova York" "NEW YORK"
#Since all these correspond to New York City, converting them to a standard format
airbnb_listings_cleaned$city[grep(pattern="^n|^N", airbnb_listings_cleaned$city)] <- "New York City"
#Verifying if all the NY related city values were changed correctly
unique(airbnb_listings_cleaned$city[grep(pattern="^n|^N", airbnb_listings_cleaned$city)])
## [1] "New York City"
For New York, different formats were found in the state column. Changing all to the same format - “NY”
#Viewing the state data
unique(airbnb_listings_cleaned$state)
## [1] "NY" "" "Ny" "ny" "MP" "CA"
## [7] "NJ" "New York "
# 4 different formats for NY were found - NY, Ny, ny, New York
# Changing them all to NY
airbnb_listings_cleaned$state[airbnb_listings_cleaned$state %in% c("ny", "Ny", "New York ", "NY")] <- "NY"
#Verifying where the changes were made successfully
unique(airbnb_listings_cleaned$state)
## [1] "NY" "" "MP" "CA" "NJ"
airbnb_listings_cleaned$zipcode <- as.numeric(airbnb_listings_cleaned$zipcode)
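Note that as.numeric() silently turns any zipcode that is not a plain number into NA. If some raw entries carry ZIP+4 suffixes or stray characters (an assumption about the raw data, not verified here), a slightly more defensive sketch would extract the leading five digits first:
#Illustrative sketch (hypothetical helper): keep the leading five digits, then convert;
#anything without five leading digits still becomes NA.
clean_zip <- function(zip)
{
  zip <- trimws(as.character(zip))
  as.numeric(ifelse(grepl("^\\d{5}", zip), substr(zip, 1, 5), NA))
}
#airbnb_listings_cleaned$zipcode <- clean_zip(airbnb_listings_cleaned$zipcode)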
iii. Filtering & Subsetting the dataset
length(unique(airbnb_listings_cleaned$zipcode))
## [1] 194
A total of 194 zipcodes are available in the AirBnb listings data
airbnb_listings_cleaned %>%
select(zipcode, bedrooms) %>%
filter(bedrooms == 2) %>%
distinct(zipcode)%>%
count()
Out of 194, only 170 zipcodes have information about 2-bedroom properties, so 24 zipcodes drop out of our analysis.
Since it is already established that 2-bedroom properties are the most profitable,
we filter the listings dataset down to 2-bedroom properties only.
airbnb_listings_2BR <- airbnb_listings_cleaned %>%
filter(bedrooms == 2)
dim(airbnb_listings_2BR)
## [1] 6497 48
#We no longer need the listings_cleaned data, removing it
rm(airbnb_listings_cleaned)
There are multiple columns that provide information on the location of the properties. Some of these (like street) are too specific and others (like country_code) are too generic.
Revisiting these columns after filtering the dataset, to identify columns that do not provide any useful information.
#A quick glance at the dataset suggests that columns such as street and smart_location can likely be dropped
#Looking at these columns to verify
airbnb_listings_2BR %>%
select(street,
neighbourhood,
neighbourhood_cleansed,
neighbourhood_group_cleansed,
city,
state,
market,
smart_location,
country_code,
country) %>%
summarise_all(n_distinct) %>%
gather("Field", "Count",1:10)
The number of distinct values in some of these columns is very large -
There are 5 values for "neighbourhood_group_cleansed", which represent the boroughs (neighbourhoods).
Since "Borough" is the widely used logical grouping for zipcodes, we can drop the other location columns, as they carry no new, interesting information that might be useful in our analysis.
airbnb_listings_2BR <- airbnb_listings_2BR %>%
select(-c("street",
"neighbourhood",
"neighbourhood_cleansed",
"city"))
dim(airbnb_listings_2BR)
## [1] 6497 44
There are observations with “market” as “London” and “Other(Domestic)”. Analyzing those records
airbnb_listings_2BR %>% filter(!market %in% c("New York"))
These 14 observations are for New York only, as is evident from the other location-specific columns. So it is safe to assume that the missing and odd values (London and Other (Domestic)) are nothing more than recording errors.
The columns "state", "market", "smart_location", "country_code" and "country" essentially have only one value. We can drop these columns.
airbnb_listings_2BR <- airbnb_listings_2BR %>%
select(-c("state",
"market",
"smart_location",
"country_code",
"country"))
dim(airbnb_listings_2BR)
## [1] 6497 39
Understanding the variety of information available in the remaining character columns -
airbnb_listings_2BR %>%
select(neighbourhood_group_cleansed,
property_type,
room_type,
bed_type) %>%
summarise_all(n_distinct) %>%
gather("Field", "Count",1:4)
#Converting the categorical character columns to factors
airbnb_listings_2BR[, c("neighbourhood_group_cleansed", "bed_type", "room_type", "property_type")] <-
airbnb_listings_2BR[, c("neighbourhood_group_cleansed", "bed_type", "room_type", "property_type")] %>% mutate_all(as.factor)
iv. Missing values treatment
We will identify which fields have missing values and then find appropriate ways of handling them.
#table(colSums(is.na(airbnb_listings_2BR)))
apply(is.na(airbnb_listings_2BR), 2, sum)
## id last_scraped
## 0 0
## neighbourhood_group_cleansed zipcode
## 0 50
## latitude longitude
## 0 0
## property_type room_type
## 0 0
## accommodates bathrooms
## 0 3
## bedrooms beds
## 0 1
## bed_type square_feet
## 0 6396
## guests_included minimum_nights
## 0 0
## maximum_nights availability_30
## 0 0
## availability_60 availability_90
## 0 0
## availability_365 number_of_reviews
## 0 0
## number_of_reviews_ltm first_review
## 0 1280
## last_review review_scores_rating
## 1280 1383
## review_scores_accuracy review_scores_cleanliness
## 1386 1384
## review_scores_checkin review_scores_communication
## 1388 1384
## review_scores_location review_scores_value
## 1387 1389
## reviews_per_month price
## 1280 59
## weekly_price monthly_price
## 6175 6493
## security_deposit cleaning_fee
## 2190 933
## extra_people
## 0
Of all the numeric fields, square_feet, weekly_price and monthly_price have a considerable share of NAs (more than 90%).
Imputing these columns based on the available data would not be accurate, so it is safe to drop them.
airbnb_listings_2BR <- airbnb_listings_2BR %>% select(-c(square_feet, monthly_price, weekly_price))
Since zipcode is at the heart of our analysis, even though only 50 observations have missing zipcode values, we populate them from other records, making use of the latitude and longitude columns.
These observations are populated from other observations having (almost) the same latitude and longitude values.
#For each listing with a missing zipcode, look for another listing at (almost) the same
#coordinates and copy its zipcode. Start by matching latitude/longitude rounded to 4 decimal
#places, then progressively relax the precision for any rows that are still missing.
for(k in 4:1)
{
index <- which(is.na(airbnb_listings_2BR$zipcode))
for(i in index)
{
for(j in 1:nrow(airbnb_listings_2BR))
{
if(round(airbnb_listings_2BR$latitude[i],k) == round(airbnb_listings_2BR$latitude[j],k) &&
   round(airbnb_listings_2BR$longitude[i],k) == round(airbnb_listings_2BR$longitude[j],k) &&
   i != j && !is.na(airbnb_listings_2BR$zipcode[j]))
{
airbnb_listings_2BR$zipcode[i] <- airbnb_listings_2BR$zipcode[j]
break
}
}
}
}
Noting the count of observations with missing zipcode
sum(is.na(airbnb_listings_2BR$zipcode))
## [1] 0
v. Univariate Analysis and Outlier Treatment
In order to determine the estimated revenue for a listing, there are 2 main metrics to be considered - price/night and minimum_nights.
So we will explore these variables first, and identify and treat outliers (if any).
plot_ly(y = airbnb_listings_2BR$price,
type = "box",
marker = list(color = 'rgb(7,40,89)'),
line = list(color = 'rgb(7,40,89)'))%>%
layout(title = "Outlier identification for Price/Night metric",
xaxis = list(title = ""),
yaxis = list(title = ""))
There are a large number of outliers for price/night. Some values are so large that they do not really make sense.
plot_ly(x = airbnb_listings_2BR$price,
type = "histogram",
marker = list(color = 'rgb(7,40,89)'),
line = list(color = 'rgb(7,40,89)'))%>%
layout(title = "Distribution of Price/Night across all properties",
xaxis = list(title = "Price"),
yaxis = list(title = "Frequency"))
Summary statistics for Price/Night metric
summary(airbnb_listings_2BR$price)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 129.0 185.0 214.1 259.0 999.0 59
Based on the summary statistics, we observe that about 75% of the listings charge up to 260 dollars per night. A price per night as large as 1,000 dollars appears to be an outlier.
To classify observations as outliers (and subsequently remove them) we follow the rule of thumb - "Any observation with a price value more than 3 standard deviations above the mean can be considered an outlier".
airbnb_listings_2BR <- airbnb_listings_2BR %>%
mutate(z_price = scale(price, center = T, scale = T))
length(airbnb_listings_2BR$z_price[airbnb_listings_2BR$z_price > 3])
## [1] 188
airbnb_2BR_treated <- airbnb_listings_2BR %>% filter(z_price <= 3)
dim(airbnb_2BR_treated)
## [1] 6309 37
rm(airbnb_listings_2BR)
sum(is.na(airbnb_2BR_treated$minimum_nights))
## [1] 0
There are no missing values in the "minimum_nights" column.
summary(airbnb_2BR_treated$minimum_nights)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 2.00 3.00 8.04 5.00 999.00
From the summary statistics we see that about 75% of properties require a minimum stay of 5 nights or fewer.
We cannot simply drop the properties with unusually high "minimum_nights", as property types like serviced apartments usually have longer rental durations.
We will look at the number of properties and the summary statistics for "minimum_nights" for each property type.
min_nights_property <- airbnb_2BR_treated %>%
select(property_type, minimum_nights) %>%
group_by(property_type) %>%
summarize(count = n(),
min = min(minimum_nights),
mean = mean(minimum_nights),
max_min_nights = max(minimum_nights)) %>%
arrange(desc(count))
min_nights_property
There are not many listings of types such as Bungalow, Guesthouse, etc. Comparing the overall summary statistics for "minimum_nights" with the summary by "property_type", 75 percent of the properties have minimum nights <= 5.
Therefore, as with price, we treat values of "minimum_nights" more than 3 standard deviations above the mean as outliers and remove those listings.
airbnb_2BR_treated <- airbnb_2BR_treated %>%
mutate(z_min_nights = scale(minimum_nights, center = T, scale = T))
length(airbnb_2BR_treated$z_min_nights[airbnb_2BR_treated$z_min_nights > 3])
## [1] 50
airbnb_2BR_treated <- airbnb_2BR_treated %>% filter(z_min_nights <= 3)
summary(airbnb_2BR_treated$minimum_nights)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 6.646 5.000 70.000
sum(is.na(airbnb_2BR_treated$number_of_reviews))
## [1] 0
sum(is.na(airbnb_2BR_treated$reviews_per_month))
## [1] 1174
vi. Generating Estimated Revenue/Year field
Assuming the occupancy rate for all properties across all zipcodes is 75%, we can estimate the revenue per year per listing using the formula below -
**Revenue/year per listing = 0.75 * 365 * price/night**
airbnb_2BR_treated$revenue_year_75 <- 0.75 * 365 * airbnb_2BR_treated$price
Another, more conservative approach is to treat the number of reviews per month as a reflection of the number of bookings received per month.
Since the column "reviews_per_month" has a good number of missing values, we calculate a new column "reviews_month" from the "number_of_reviews", "last_scraped" and "first_review" columns.
**Number of reviews/month = Number of reviews / (last_scraped - first_review, in months)**
airbnb_2BR_treated <- airbnb_2BR_treated %>%
mutate(reviews_month = 4*number_of_reviews/as.numeric(difftime(last_scraped, first_review, units = "weeks")))
**Revenue/year per listing = 12 * number of reviews / month * price/night * minimum_nights**
airbnb_2BR_treated <- airbnb_2BR_treated %>% mutate(revenue_year_reviews = 12* reviews_month * price * minimum_nights)
As a last step, we drop columns like latitude/longitude, as they are no longer needed.
airbnb_revenue <- airbnb_2BR_treated %>% select(-c(latitude,
longitude,
z_price,
z_min_nights))
The AirBnb Listings dataset has been treated and cleaned.
dim(zillow_cost_data)
## [1] 8946 262
View(unique(zillow_cost_data$City))
i. Dataset filtering
A quick glance at the cost data shows that it covers many cities, not only New York.
Since the analysis needs to be performed only for New York City, we filter the dataset.
#Filtering the data set for City = New York and State = NY
zillow_data_NY <- zillow_cost_data %>%
filter(City == "New York" &
State == "NY")
dim(zillow_data_NY)
## [1] 25 262
#25 262
## Removing the original dataset having cost data for all cities
rm(zillow_cost_data)
ii. Restructuring the data
#Reshaping from wide format (one column per month) to long format with YearMonth and Cost columns
zillow_data_NY <- gather(zillow_data_NY, YearMonth, Cost, -c(1:7))
dim(zillow_data_NY)
## [1] 6375 9
#Splitting YearMonth (e.g. "X1996.04") into Year and Month components
zillow_data_NY <- separate(zillow_data_NY, YearMonth, c("Year", "Month"))
dim(zillow_data_NY)
## [1] 6375 10
#read.csv prefixed the numeric column names with "X"; stripping that prefix from Year
zillow_data_NY$Year <- substr(zillow_data_NY$Year, 2, nchar(zillow_data_NY$Year))
iii. Estimating median cost over years
To estimate the cost of properties for the year 2019, we will use the median cost per year instead of the monthly costs over the years.
Deciding the level at which to compute the median cost for each year -
#Counting the number of distinct values in each column
unique_val <- NULL
for(i in 1:ncol(zillow_data_NY))
{
unique_val[i] <- length(unique(zillow_data_NY[,i]))
}
cbind(colnames(zillow_data_NY), unique_val)
## unique_val
## [1,] "RegionID" "25"
## [2,] "RegionName" "25"
## [3,] "City" "1"
## [4,] "State" "1"
## [5,] "Metro" "1"
## [6,] "CountyName" "4"
## [7,] "SizeRank" "25"
## [8,] "Year" "22"
## [9,] "Month" "12"
## [10,] "Cost" "3712"
We can see that there is only one City (New York), one State (NY) and one Metro (New York). So we can drop these columns from our dataset.
nrow(unique(zillow_data_NY[,c("RegionID", "RegionName")]))
## [1] 25
Also, since RegionName is the zipcode (which will be used to join with the AirBnb dataset) and each RegionID corresponds to a unique RegionName, we can drop RegionID from our analysis.
Calculating the median cost for each year at the RegionName, CountyName, SizeRank, Year level (RegionID, City, State, Metro are dropped)
zillow_medcost_year <- zillow_data_NY %>%
group_by(RegionName, CountyName, SizeRank, Year) %>%
summarize(MedianCost = median(Cost)) %>%
ungroup()
dim(zillow_medcost_year)
## [1] 550 5
## We no longer need this dataset as we have restructured it to suit our analysis
rm(zillow_data_NY)
iv. Missing values treatment
Checking for missing values in each column
year_cost_data <- zillow_medcost_year %>% spread(Year, MedianCost)
colSums(is.na(year_cost_data))
## RegionName CountyName SizeRank 1996 1997 1998 1999
## 0 0 0 17 17 17 16
## 2000 2001 2002 2003 2004 2005 2006
## 16 16 17 17 13 4 1
## 2007 2008 2009 2010 2011 2012 2013
## 1 0 0 0 0 0 0
## 2014 2015 2016 2017
## 0 0 0 0
Years 1996 to 2007 have missing values for the median cost. One alternative would be to compute each year's median from only the months that do have a value (see the sketch below).
However, instead of complicating our analysis, we filter the data to the years 2008 to 2017 to estimate the cost for 2019.
10 years' worth of data should be sufficient to observe the price trend over the years and estimate it for 2019.
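For reference, the month-wise alternative mentioned above would only require ignoring missing months when the yearly median is computed; a minimal sketch (it would replace the summarize() call in step iii., before zillow_data_NY is removed):
#Sketch of the alternative approach (not used further in this analysis)
#zillow_medcost_year <- zillow_data_NY %>%
#  group_by(RegionName, CountyName, SizeRank, Year) %>%
#  summarize(MedianCost = median(Cost, na.rm = TRUE)) %>%
#  ungroup()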
Dropping data from 1996 - 2007
zillow_medcost_year <- zillow_medcost_year[zillow_medcost_year$Year >= 2008,]
dim(zillow_medcost_year)
## [1] 250 5
View(zillow_medcost_year)
v. Cost estimation for 2019
Visualizing the trend of median cost over years within each zipcode
zillow_medcost_year %>%
plot_ly(x = ~Year,
y = ~MedianCost,
group = ~as.factor(RegionName),
type = "scatter",
mode = "lines+markers",
color = ~as.factor(RegionName))%>%
layout(title = "Trend of median cost of property over years within each zipcode",
xaxis = list(title = ""),
yaxis = list(title = "Median Cost"))
zillow_medcost_year %>%
filter(Year >= 2013) %>%
plot_ly(x = ~Year,
y = ~MedianCost,
group = ~as.factor(RegionName),
type = "scatter",
color = ~as.factor(RegionName),
mode = "lines+markers")%>%
layout(title = "Trend of median cost of property over years within each zipcode",
xaxis = list(title = ""),
yaxis = list(title = "Median Cost"))
Almost all the zipcodes have shown a steady increase in prices over the past 5 years.
To estimate the cost of properties in 2019, we take the maximum median cost observed over the last 5 years (2013-2017).
zillow_medcost_region <- zillow_medcost_year %>%
spread(Year, MedianCost)
rm(zillow_medcost_year)
zillow_medcost_2019 <- zillow_medcost_region %>%
group_by(RegionName,CountyName, SizeRank) %>%
rowwise() %>%
mutate(`2019` = max(c(`2013`,
`2014`,
`2015`,
`2016`,
`2017`)))
zillow_medcost_2019 <- gather(zillow_medcost_2019, Year, Cost , -c(1:3))
vi. Filtering the cost information just for year 2019
As the historical cost information was required only to predict the cost for the year 2019, we can filter the historical data out and retain data for the year 2019 only.
zillow_cost <- zillow_medcost_2019 %>% filter(Year == 2019)
View(zillow_cost)
rm(zillow_medcost_region)
The Zillow Cost dataset has been treated and cleaned.
Merging Airbnb Revenue and Zillow Cost datasets
The 2 datasets - airbnb_revenue and zillow_cost - are merged at the zipcode level.
Checking the number of distinct zipcodes in each dataset to determine the percentage of zipcodes that overlap between the two.
length(unique(airbnb_revenue$zipcode))
## [1] 166
length(unique(zillow_cost$RegionName))
## [1] 25
Since zillow_cost has only 25 distinct zipcodes and the merge is an inner join, we can expect to see at most 25 distinct zipcodes in the merged dataset.
Merging the 2 datasets using an inner join
analysis_dataset <- airbnb_revenue %>% inner_join(zillow_cost, by = c("zipcode" = "RegionName"))
dim(analysis_dataset) #1467 41
## [1] 1467 41
Checking the number of distinct zipcodes in the merged dataset
length(unique(analysis_dataset$zipcode))
## [1] 24
24 out of the 25 zipcodes available in the Zillow dataset had a corresponding match in the AirBnb Revenue dataset.
Our final dataset is ready for deep dive analysis
In order to determine the zipcodes to invest in, we need to analyze the following factors -
1. Popularity/High Demand: Based on -
* a. most reviewed properties
* b. number of properties available
2. Price per night
3. Cost of property
4. Estimated revenue
(Some of the steps below may not map directly to one of the above points; they are included to understand the data better.)
glimpse(analysis_dataset)
## Observations: 1,467
## Variables: 41
## $ id <int> 16458, 20853, 23135, 27006, 27659, 323...
## $ last_scraped <date> 2019-07-08, 2019-07-09, 2019-07-09, 2...
## $ neighbourhood_group_cleansed <fct> Brooklyn, Brooklyn, Brooklyn, Manhatta...
## $ zipcode <dbl> 11215, 11217, 11231, 10023, 11215, 112...
## $ property_type <fct> Condominium, Apartment, Apartment, Apa...
## $ room_type <fct> Entire home/apt, Entire home/apt, Enti...
## $ accommodates <int> 4, 4, 3, 4, 9, 3, 6, 2, 4, 4, 4, 4, 6,...
## $ bathrooms <dbl> 2.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.5, NA,...
## $ bedrooms <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,...
## $ beds <int> 1, 3, 2, 3, 6, 1, 3, 2, 3, 3, 2, 2, 3,...
## $ bed_type <fct> Real Bed, Real Bed, Real Bed, Real Bed...
## $ guests_included <int> 1, 4, 1, 3, 1, 2, 2, 1, 2, 2, 11, 2, 1...
## $ minimum_nights <int> 3, 7, 2, 6, 2, 2, 2, 2, 2, 2, 2, 3, 2,...
## $ maximum_nights <int> 14, 120, 730, 30, 30, 5, 730, 365, 30,...
## $ availability_30 <int> 0, 0, 21, 0, 0, 0, 0, 12, 0, 0, 13, 11...
## $ availability_60 <int> 0, 0, 39, 16, 17, 0, 0, 37, 0, 0, 43, ...
## $ availability_90 <int> 0, 0, 69, 16, 17, 0, 0, 67, 6, 0, 73, ...
## $ availability_365 <int> 0, 158, 342, 199, 216, 0, 0, 342, 272,...
## $ number_of_reviews <int> 4, 21, 233, 27, 16, 4, 80, 150, 21, 21...
## $ number_of_reviews_ltm <int> 0, 5, 27, 4, 6, 0, 61, 21, 6, 0, 0, 17...
## $ first_review <date> 2017-07-02, 2013-11-10, 2010-05-20, 2...
## $ last_review <date> 2017-09-24, 2019-05-21, 2019-06-24, 2...
## $ review_scores_rating <int> 100, 99, 93, 92, 95, 90, 94, 92, 97, 9...
## $ review_scores_accuracy <int> 10, 10, 10, 9, 10, 10, 10, 10, 10, 9, ...
## $ review_scores_cleanliness <int> 10, 10, 10, 9, 9, 10, 10, 10, 9, 9, 9,...
## $ review_scores_checkin <int> 10, 10, 10, 10, 10, 10, 10, 9, 9, 9, 9...
## $ review_scores_communication <int> 10, 10, 9, 10, 10, 10, 10, 9, 10, 9, 1...
## $ review_scores_location <int> 10, 10, 9, 10, 10, 10, 10, 9, 10, 9, 1...
## $ review_scores_value <int> 10, 10, 9, 9, 10, 10, 9, 9, 10, 9, 9, ...
## $ reviews_per_month <dbl> 0.16, 0.30, 2.09, 0.27, 0.24, 0.04, 2....
## $ price <dbl> 225, 250, 175, 235, 400, 140, 250, 165...
## $ security_deposit <dbl> NA, 500, 0, 500, 500, NA, 800, NA, NA,...
## $ cleaning_fee <dbl> NA, 250, 65, 120, 120, 60, 165, 65, NA...
## $ extra_people <dbl> 0, 50, 15, 40, 0, 0, 40, 20, 25, 50, 3...
## $ revenue_year_75 <dbl> 61593.75, 68437.50, 47906.25, 64331.25...
## $ reviews_month <dbl> 0.15217391, 0.28447025, 1.95504945, 0....
## $ revenue_year_reviews <dbl> 1232.6087, 5973.8752, 8211.2077, 4273....
## $ CountyName <chr> "Kings", "Kings", "Kings", "New York",...
## $ SizeRank <int> 71, 1555, 1817, 3, 71, 32, 71, 1817, 7...
## $ Year <chr> "2019", "2019", "2019", "2019", "2019"...
## $ Cost <dbl> 1045400, 1231850, 1202550, 1988700, 10...
colSums(is.na(analysis_dataset))
## id last_scraped
## 0 0
## neighbourhood_group_cleansed zipcode
## 0 0
## property_type room_type
## 0 0
## accommodates bathrooms
## 0 3
## bedrooms beds
## 0 0
## bed_type guests_included
## 0 0
## minimum_nights maximum_nights
## 0 0
## availability_30 availability_60
## 0 0
## availability_90 availability_365
## 0 0
## number_of_reviews number_of_reviews_ltm
## 0 0
## first_review last_review
## 324 324
## review_scores_rating review_scores_accuracy
## 341 341
## review_scores_cleanliness review_scores_checkin
## 341 342
## review_scores_communication review_scores_location
## 341 342
## review_scores_value reviews_per_month
## 342 324
## price security_deposit
## 0 549
## cleaning_fee extra_people
## 184 0
## revenue_year_75 reviews_month
## 0 324
## revenue_year_reviews CountyName
## 324 0
## SizeRank Year
## 0 0
## Cost
## 0
Before we begin analyzing properties at the neighbourhood and zipcode level, it is useful to know how many zipcodes are available to us within each neighbourhood.
Split of zipcodes by neighbourhood
neighbourhood_zip <- analysis_dataset %>%
select(neighbourhood_group_cleansed, zipcode) %>%
group_by(neighbourhood_group_cleansed) %>%
distinct(neighbourhood_group_cleansed, zipcode) %>%
arrange(neighbourhood_group_cleansed)
neighbourhood_zip
We see that zipcode “10013” maps to both Manhattan and Brooklyn. This is a data issue.
As a matter of fact, zipcode 10013 belongs to Manhattan. Retrieving the record(s) that have zipcode = '10013' and neighbourhood_group_cleansed = 'Brooklyn'
analysis_dataset %>% filter(zipcode == 10013 & neighbourhood_group_cleansed == 'Brooklyn')
There is only one such observation, and important metrics like number_of_reviews, first_review and last_review are null for it.
So we omit this record from the analysis.
analysis_dataset <- analysis_dataset %>% filter(!(zipcode == 10013 & neighbourhood_group_cleansed == 'Brooklyn'))
Count of zipcodes in each area is -
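The tally itself can be reproduced directly from the merged dataset (a small sketch; the printed counts are omitted here):
#Number of distinct zipcodes available within each neighbourhood
analysis_dataset %>%
  distinct(neighbourhood_group_cleansed, zipcode) %>%
  count(neighbourhood_group_cleansed)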
Analyzing the zipcodes against the metrics described above
1. Popularity
Number of listings available
Visualizing the distribution of listings across neighbourhood
analysis_dataset %>%
group_by(neighbourhood_group_cleansed) %>%
count() %>%
arrange(desc(n)) %>%
plot_ly(x = ~reorder(neighbourhood_group_cleansed, -n),
y = ~n,
type = 'bar',
text = ~n,
textposition = 'outside',
color = ~neighbourhood_group_cleansed) %>%
layout(title = "Distribution of number of listings by neighbourhood",
xaxis = list(title = ""),
yaxis = list(title = "Number of listings"))
Manhattan has the highest number of properties, followed by Brooklyn. This finding is not surprising given the popularity and strategic importance of these boroughs.
We can infer that these 2 boroughs are a popular choice among tourists and offer a wide range of options for guests to choose from.
We will go one level deeper to identify which zipcodes within each of these neighbourhoods offer the most properties.
Visualizing the distribution of listings by zipcode across neighbourhood
analysis_dataset %>%
select(neighbourhood_group_cleansed, zipcode) %>%
group_by(neighbourhood_group_cleansed, zipcode) %>%
count() %>%
arrange(desc(n)) %>%
plot_ly(x = ~reorder(zipcode, -n),
y = ~n,
type = 'bar',
color = ~neighbourhood_group_cleansed) %>%
layout(title = "Distribution of number of properties by zipcode",
xaxis = list(title = ""),
yaxis = list(title = "Number of listings"))
As expected, zipcodes within Manhattan and Brooklyn have the most properties overall.
Below is the list of zipcodes selected from each neighbourhood that have a high number of listings -
Queens: 11434 (There is only 1 zipcode present)
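The shortlist above was read off the bar chart; as a sketch, a similar ranking could also be pulled programmatically, e.g. the top three zipcodes per neighbourhood by listing count (the cut-off of three is an assumption for illustration):
#Sketch: top 3 zipcodes by number of listings within each neighbourhood
analysis_dataset %>%
  count(neighbourhood_group_cleansed, zipcode) %>%
  group_by(neighbourhood_group_cleansed) %>%
  top_n(3, n) %>%
  arrange(neighbourhood_group_cleansed, desc(n))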
Next, we will identify the most reviewed properties.
Number of reviews received
Analyzing the pattern of reviews received by properties across zipcodes indirectly indicates which zipcodes are in high demand.
Visualizing the distribution of the number of reviews received over time
zipcode_popularity <-
analysis_dataset %>%
mutate(time_prd = round(difftime(last_scraped, first_review, units = "weeks"),0)) %>%
group_by(neighbourhood_group_cleansed,zipcode) %>% select(neighbourhood_group_cleansed, zipcode, number_of_reviews, time_prd) %>%
distinct()%>%
drop_na()
plot_ly(zipcode_popularity,
x = ~as.numeric(time_prd),
y = ~number_of_reviews,
type = 'scatter',
color = ~as.factor(zipcode),
colors = "Set2")%>%
layout(title = "Number of reviews recieved over time",
xaxis = list(title = "Time (in weeks)"),
yaxis = list(title = "Number of reviews"))
Zipcodes that have been able to gain traction in a short time frame are -
There are also some properties that have existed for a longer time period and have consistently attracted guests. Below are such popular zipcodes -
Based on the above “Popularity” analysis, top winners are -
* 10003 (Manhattan)
* 11215 (Brooklyn)
* 10014 (Manhattan)
* 10036 (Manhattan)
* 11201 (Brooklyn)
* 11231 (Brooklyn)
* 10013 (Manhattan)
* 11434 (Queens)
* 10011 (Manhattan)
* 10128 (Manhattan)
The next factor we need to analyze is price/night.
2. Price Per Night
Identifying zipcodes with higher price/night figures gives an indication of how much guests are willing to pay to stay in a particular neighbourhood (or zipcode).
analysis_dataset %>%
plot_ly(x = ~neighbourhood_group_cleansed,
y = ~price,
type = "box",
jitter = 0.7,
color = ~neighbourhood_group_cleansed) %>%
layout( title = "Variation in Price per Night by Neighbourhood",
yaxis = list(title = "Price per Night"),
xaxis = list(title = "Neighbourhood"))
Given their popularity and demand, properties in Manhattan and Brooklyn charge guests more per night than those in Queens and Staten Island.
The median price/night charged in Manhattan equals the maximum price charged in Queens.
We will now drill down to the zipcode level to determine the prime areas that charge guests the most.
analysis_dataset %>%
plot_ly(x = ~as.factor(zipcode),
y = ~price,
type = "box",
jitter = 0.7,
color = ~neighbourhood_group_cleansed) %>%
layout(title = "Variation in Price per Night by Zipcode",
yaxis = list(title = "Price per Night"),
xaxis = list(title = "Zipcode"))
Our favorable zipcodes based on Price/night metric are -
* 10011 (Manhattan)
* 10036 (Manhattan)
* 10022 (Manhattan)
* 10013 (Manhattan)
* 10014 (Manhattan)
* 10023 (Manhattan)
* 10028 (Manhattan)
* 11201 (Brooklyn)
A property that charges a high price per night might not always be profitable, as it is likely that the property itself costs more to buy.
3. Cost of Investment
The next factor to be considered in the analysis is the cost of buying a property. Favorable zipcodes are the ones that do not demand a high investment.
Understanding the distribution of Cost of Property across neighbourhoods
analysis_dataset %>%
plot_ly(x = ~neighbourhood_group_cleansed,
y = ~Cost,
type = "box",
jitter = 0.7,
color = ~neighbourhood_group_cleansed) %>%
layout( title = "Variation in Property Cost by Neighbourhood",
yaxis = list(title = "Cost"),
xaxis = list(title = "Neighbourhood"))
Undoubtedly, Manhattan is the costliest borough.
The variation in property cost across the boroughs is too large to plot on one single graph.
So, to analyze cost figures within zipcodes, we plot the zipcodes of Brooklyn and Manhattan together and those of Queens and Staten Island separately.
Comparing the cost of properties across zipcodes within each neighbourhood
analysis_data_cost <- analysis_dataset %>%
group_by(zipcode) %>%
mutate(median_cost = median(Cost)) %>%
select(neighbourhood_group_cleansed, zipcode, median_cost)
analysis_data_cost <- analysis_data_cost %>% distinct()
View(analysis_data_cost)
analysis_data_cost %>%
filter(neighbourhood_group_cleansed %in% c("Manhattan", "Brooklyn"))%>%
plot_ly(x = ~as.factor(zipcode),
y = ~median_cost,
color = ~neighbourhood_group_cleansed,
type = "bar") %>%
layout( title = "Cost of properties by Zipcode (Manhattan & Brooklyn)",
yaxis = list(title = "Cost"),
xaxis = list(title = "Zipcode"))
Zipcodes that have cheaper costs for properties are -
* 11234 (Brooklyn)
* 10021 (Manhattan)
* 10128 (Manhattan)
* 11201 (Brooklyn)
* 11231 (Brooklyn)
* 10022 (Manhattan)
* 10028 (Manhattan)
* 11217 (Brooklyn)
* 10023 (Manhattan)
* 10025 (Manhattan)
analysis_data_cost %>%
filter(neighbourhood_group_cleansed %in% c("Queens", "Staten Island")) %>%
plot_ly(x = ~as.factor(zipcode),
y = ~median_cost,
color = ~neighbourhood_group_cleansed,
type = "bar")%>%
layout( title = "Cost of properties by Zipcode (Staten Island & Queens)",
yaxis = list(title = "Cost"),
xaxis = list(title = "Zipcode"))
From the plot it appears that zipcodes 10305 and 11434 are the costliest. This is true when considering only Queens and Staten Island; compared to Manhattan and Brooklyn, however, all of these zipcodes have lower costs.
So we pick all the zipcodes from these 2 neighbourhoods as favorable -
* 10309 (Staten Island)
* 10306 (Staten Island)
* 10314 (Staten Island)
* 10308 (Staten Island)
* 10304 (Staten Island)
* 10303 (Staten Island)
* 10305 (Staten Island)
* 11434 (Queens)
A final factor to be analyzed is the "Estimated Revenue".
4. Estimated Revenue
Price/night and cost analysis alone cannot identify the profitable zipcodes. We need to understand the tradeoff and also analyze estimated revenue figures.
For each zipcode we will look at the following metrics -
In addition, we will derive an index, i.e. (Revenue Earned / Cost Invested), to better identify the profitable zipcodes.
zipcode_profit <-
analysis_dataset %>%
select("neighbourhood" = neighbourhood_group_cleansed, zipcode, price, Cost, revenue_year_75, revenue_year_reviews) %>%
group_by(neighbourhood, zipcode, Cost) %>%
summarize(Revenue_75 = median(revenue_year_75)) %>%
mutate(RevenuePerCost = Revenue_75/Cost) %>%
arrange(desc(RevenuePerCost),desc(Revenue_75),Cost)
print(zipcode_profit)
## # A tibble: 24 x 5
## # Groups: neighbourhood, zipcode [24]
## neighbourhood zipcode Cost Revenue_75 RevenuePerCost
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Staten Island 10306 341250 32166. 0.0943
## 2 Staten Island 10303 312450 29428. 0.0942
## 3 Queens 11434 373950 34766. 0.0930
## 4 Staten Island 10304 311050 26006. 0.0836
## 5 Staten Island 10308 392100 29976. 0.0764
## 6 Staten Island 10305 407900 30386. 0.0745
## 7 Brooklyn 11234 473300 34219. 0.0723
## 8 Staten Island 10309 374150 23269. 0.0622
## 9 Staten Island 10314 345950 19984. 0.0578
## 10 Manhattan 10036 1743050 81851. 0.0470
## # ... with 14 more rows
From the RevenuePerCost figures, we observe that although Manhattan is the most popular borough and fetches the highest price per night, it is not very profitable, as it demands a huge investment.
Overall, properties within Staten Island appear to be the most profitable.
However, there is a shortcoming in this approach: we are overlooking one of the most important factors, i.e. "Popularity".
We observed that Staten Island is not a popular neighbourhood among guests, so calculating revenue on the assumption that occupancy across all areas is 75% is not really accurate.
Based on the number of reviews, we should adjust the occupancy percentage of different regions when estimating revenue.
We will use the "revenue_year_reviews" field, which is the estimated revenue based on the number of reviews received, price/night and length of minimum stay.
zipcode_profit_reviews <-
analysis_dataset %>%
select("neighbourhood" = neighbourhood_group_cleansed, zipcode, price, Cost, revenue_year_reviews) %>%
group_by(neighbourhood, zipcode, Cost) %>%
summarize(Revenue_Reviews = mean(revenue_year_reviews,na.rm = T)) %>%
mutate(RevenuePerCost = Revenue_Reviews/Cost) %>%
arrange(desc(RevenuePerCost),desc(Revenue_Reviews),Cost)
print(zipcode_profit_reviews)
## # A tibble: 24 x 5
## # Groups: neighbourhood, zipcode [24]
## neighbourhood zipcode Cost Revenue_Reviews RevenuePerCost
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Staten Island 10304 311050 19393. 0.0623
## 2 Queens 11434 373950 17563. 0.0470
## 3 Staten Island 10309 374150 6436. 0.0172
## 4 Manhattan 10023 1988700 30911. 0.0155
## 5 Brooklyn 11201 1400200 21297. 0.0152
## 6 Brooklyn 11217 1231850 16470. 0.0134
## 7 Brooklyn 11234 473300 5852. 0.0124
## 8 Staten Island 10305 407900 5032. 0.0123
## 9 Staten Island 10306 341250 4151. 0.0122
## 10 Manhattan 10011 2354000 28416. 0.0121
## # ... with 14 more rows
We observe that the zipcodes that appear profitable when the number of reviews is used to estimate revenue are quite different from those obtained using the 75% occupancy assumption.
We therefore use the revenue estimated from the number of reviews received to determine the profitable zipcodes.
Revenue per Cost Analysis for zipcodes
zipcode_profit_reviews %>%
plot_ly(x = ~RevenuePerCost,
y = ~reorder(as.factor(zipcode), RevenuePerCost),
color = ~neighbourhood,
type = 'bar',
text = ~round(RevenuePerCost,2),
textposition = 'auto') %>%
layout(title = "Revenue per Cost Analysis across zipcodes",
xaxis = list(title = "Revenue/Cost Index"),
yaxis = list(title = "Zipcode"))
Zipcodes that are more profitable compared to others are -
* 10304 (Staten Island)
* 11434 (Queens)
* 10309 (Staten Island)
* 10023 (Manhattan)
* 11201 (Brooklyn)
* 11217 (Brooklyn)
* 11234 (Brooklyn)
* 10305 (Staten Island)
* 10306 (Staten Island)
* 10011 (Manhattan)
Considering all the factors analyzed above, we select the zipcodes that satisfy the maximum number of conditions (a sketch of such a tally is shown below).
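As an illustration, the tally could be computed by transcribing the four factor-wise shortlists above into vectors and counting how often each zipcode appears (a sketch only; it is not how the original selection was recorded):
#Factor-wise shortlists transcribed from the sections above
popularity_zips <- c(10003, 11215, 10014, 10036, 11201, 11231, 10013, 11434, 10011, 10128)
price_zips      <- c(10011, 10036, 10022, 10013, 10014, 10023, 10028, 11201)
cost_zips       <- c(11234, 10021, 10128, 11201, 11231, 10022, 10028, 11217, 10023, 10025,
                     10309, 10306, 10314, 10308, 10304, 10303, 10305, 11434)
revenue_zips    <- c(10304, 11434, 10309, 10023, 11201, 11217, 11234, 10305, 10306, 10011)
#Number of shortlists each zipcode appears in, most frequent first
sort(table(c(popularity_zips, price_zips, cost_zips, revenue_zips)), decreasing = TRUE)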
Here is the final list of zipcodes that should be considered for investment -
1. Additional data procurement
i. Including more zipcodes into analysis
We have AirBnb listings data for 166 zipcodes (with information on 2-bedroom properties available). However, due to the non-availability of Zillow cost data, we were able to carry out the analysis for only 24 zipcodes (only about 15% of them). Procuring this data could help make better recommendations about profitable zipcodes across New York City.
ii. Better approach to estimating revenue figures
The 2 approaches used here to estimate the revenue/year/listing have their own shortcomings.
If we could obtain actual booking information for the listings, the revenue estimate would be more accurate. Alternatively, we could build a model that predicts the occupancy rate based on key influential factors (a sketch follows).
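As a very rough starting point for such a model, one could regress the review rate (used here as a crude proxy for bookings) on listing attributes already present in the merged data; the choice of predictors below is an assumption for illustration, not a recommendation:
#Illustrative sketch only: a first-cut occupancy-proxy model
occupancy_proxy_fit <- lm(reviews_month ~ price + availability_365 +
                            review_scores_rating + neighbourhood_group_cleansed,
                          data = analysis_dataset)
summary(occupancy_proxy_fit)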
2. Better outlier treatment for price
Considering the nature and scope of this analysis, we treated any price/night figure more than 3 standard deviations above the mean as an outlier; this may have dropped some genuinely luxurious properties.
Price depends on multiple factors - property type, room type, amenities, etc. While deciding whether a price/night value for a listing is a data recording error, we need to consider all such factors.
A model that clusters similar properties together based on these price-influencing factors and gives an estimate of a reasonable price for each group would let us identify erroneous price information (a sketch follows).
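A minimal sketch of that idea, assuming k-means on a few numeric attributes and k = 5 purely for illustration - cluster the listings, then look at the typical price within each cluster:
#Illustrative sketch: cluster listings on price-influencing attributes,
#then summarize the observed price within each cluster
cluster_input <- analysis_dataset %>%
  select(accommodates, bathrooms, beds, availability_365, price) %>%
  drop_na()
set.seed(42)
km <- kmeans(scale(cluster_input %>% select(-price)), centers = 5)
cluster_input %>%
  mutate(cluster = km$cluster) %>%
  group_by(cluster) %>%
  summarize(median_price = median(price), listings = n())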
3. Customer Review and Review Scores analysis
Currently, we have considered only the number of reviews as a metric to identify popular zipcodes. Analyzing customer review text and making use of the review scores could also be included when selecting favorable zipcodes (a sketch follows).
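For instance, a score-weighted review count per zipcode could serve as a first approximation of this idea (the weighting scheme below is an assumption for illustration):
#Illustrative sketch: weight each listing's review count by its review score
analysis_dataset %>%
  group_by(neighbourhood_group_cleansed, zipcode) %>%
  summarize(weighted_reviews = sum(number_of_reviews * review_scores_rating / 100, na.rm = TRUE)) %>%
  arrange(desc(weighted_reviews))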
4. Property Cost prediction
We have used the maximum median cost observed over the last 5 years as the cost estimate for 2019. However, we could build a time-series model to better predict costs for coming years (a sketch follows).
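As a baseline before a full time-series model, a simple linear trend per zipcode could be fitted on the yearly medians already computed (a sketch under that assumption; seasonality and market shocks are ignored):
#Illustrative sketch: naive linear-trend extrapolation of median cost to 2019
hist_cost <- zillow_medcost_2019 %>%
  ungroup() %>%
  filter(Year != "2019") %>%
  mutate(Year = as.numeric(Year))
trend_2019 <- sapply(split(hist_cost, hist_cost$RegionName),
                     function(d) predict(lm(Cost ~ Year, data = d),
                                         newdata = data.frame(Year = 2019)))
head(sort(trend_2019, decreasing = TRUE))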