Our client is a real estate company whose niche is purchasing properties within New York City to rent out short-term.
The client has already concluded that two-bedroom properties are the most profitable. However, they want to identify which zip codes are the best to invest in.
The objective of this analysis is to build a data product that assists the client in identifying the zipcodes that are favorable and will generate the most profit on short-term rentals within New York City.
The following assumptions have been made while implementing this data model -
We have used 2 datasets in our analysis -
AirBnb Listings data for New York City - Contains information about listings in New York City, such as location, host information, number of bedrooms, reviews received, etc. _http://data.insideairbnb.com/united-states/ny/new-york-city/2019-07-08/data/listings.csv.gz_
Zillow Property Cost data - Contains the selling cost of 2-bedroom properties in each zipcode for various cities. The cost information is available from April 1996 to June 2017. _https://www.zillow.com/research/data/_
The data dictionary for these two datasets has been provided in the metadata file ("Narsinghani.Meenal_DataChallenge_Metadata.xlsx") available in the zipped folder. Certain columns have been derived from the columns in these datasets; descriptions of those fields are included in the same file.
packages_reqd <- c("ggplot2", "tidyverse", "knitr", "plotly")
new_packages <- packages_reqd[!(packages_reqd %in% installed.packages()[,"Package"])]
if(length(new_packages))
install.packages(new_packages)
library(tidyverse)
library(ggplot2)
library(knitr)
library(plotly)
airbnb_listings_data <- read.csv("listings_airbnb.csv", header = TRUE, stringsAsFactors = FALSE)
head(airbnb_listings_data,20)
dim(airbnb_listings_data)
## [1] 48895 106
zillow_cost_data <- read.csv("Zip_Zhvi_2bedroom.csv", header = TRUE, stringsAsFactors = FALSE)
dim(zillow_cost_data)
## [1] 8946 262
i. Dropping columns
Before cleaning up the dataset, we drop the fields that will not be used in the analysis (considering its nature and scope).
remove_cols <- c(2,3, 5:37, 51, 59, 70:77, 82, 94:105)
airbnb_listings_subset <- airbnb_listings_data[ , -remove_cols]
dim(airbnb_listings_subset)
## [1] 48895 48
#Removing the original dataset
rm(airbnb_listings_data)
ii. Formatting column values
A quick glance at the data shows that certain columns need to be formatted before they can be used in the analysis.
#Strip the leading "$" from price fields and convert to numeric
#(values containing other non-numeric characters, such as thousands separators, become NA)
format_price <- function(var)
{
var <- as.numeric(gsub(pattern="\\$", replacement = "", var))
}
colnames(airbnb_listings_subset)
## [1] "id" "last_scraped"
## [3] "street" "neighbourhood"
## [5] "neighbourhood_cleansed" "neighbourhood_group_cleansed"
## [7] "city" "state"
## [9] "zipcode" "market"
## [11] "smart_location" "country_code"
## [13] "country" "latitude"
## [15] "longitude" "property_type"
## [17] "room_type" "accommodates"
## [19] "bathrooms" "bedrooms"
## [21] "beds" "bed_type"
## [23] "square_feet" "price"
## [25] "weekly_price" "monthly_price"
## [27] "security_deposit" "cleaning_fee"
## [29] "guests_included" "extra_people"
## [31] "minimum_nights" "maximum_nights"
## [33] "availability_30" "availability_60"
## [35] "availability_90" "availability_365"
## [37] "number_of_reviews" "number_of_reviews_ltm"
## [39] "first_review" "last_review"
## [41] "review_scores_rating" "review_scores_accuracy"
## [43] "review_scores_cleanliness" "review_scores_checkin"
## [45] "review_scores_communication" "review_scores_location"
## [47] "review_scores_value" "reviews_per_month"
#Keeping the non-price columns as-is and applying format_price to the price-related columns
airbnb_listings_cleaned <- data.frame(airbnb_listings_subset[c(1:23, 29, 31:48)],
                                      sapply(airbnb_listings_subset[c(24:28, 30)], format_price))
format_date <- function(var)
{
var <- as.Date(var, format="%Y-%m-%d")
}
airbnb_listings_cleaned[c("last_scraped","first_review","last_review")] <-
lapply(airbnb_listings_cleaned[c("last_scraped","first_review","last_review")],format_date)
#Since we will continue cleaning up listings_cleaned dataset, removing listings_subset data
rm(airbnb_listings_subset)
Changing different values that correspond to New York to the same format - “New York City”
#Viewing the city data
unique(airbnb_listings_cleaned$city[grep(pattern="^n|^N", airbnb_listings_cleaned$city)])
## [1] "New York" "NY"
## [3] "NY " "new york"
## [5] "New York City" "New York "
## [7] "NYC" "Nueva York"
## [9] "New-York" "New york"
## [11] "New York city" "NUEVA YORK"
## [13] "New York, US" "New York City "
## [15] "NewYork" "NYC "
## [17] "Newyork" "New York - Astoria"
## [19] "Neponsit" "New York , East Elmhurst "
## [21] "nyc" "New York East Elmhurst "
## [23] "New York,Manhattan " "New York, "
## [25] "Nyc" "New york "
## [27] "New York - Sunnyside " "Ny"
## [29] "Nova York" "NEW YORK"
#Since all these correspond to New York City, converting them to a standard format
airbnb_listings_cleaned$city[grep(pattern="^n|^N", airbnb_listings_cleaned$city)] <- "New York City"
#Verifying if all the NY related city values were changed correctly
unique(airbnb_listings_cleaned$city[grep(pattern="^n|^N", airbnb_listings_cleaned$city)])
## [1] "New York City"
For New York, different formats were found in the state column. Changing all to the same format - “NY”
#Viewing the state data
unique(airbnb_listings_cleaned$state)
## [1] "NY" "" "Ny" "ny" "MP" "CA"
## [7] "NJ" "New York "
# 4 different formats for NY were found - NY, Ny, ny, New York
# Changing them all to NY
airbnb_listings_cleaned$state[airbnb_listings_cleaned$state %in% c("ny", "Ny", "New York ", "NY")] <- "NY"
#Verifying where the changes were made successfully
unique(airbnb_listings_cleaned$state)
## [1] "NY" "" "MP" "CA" "NJ"
airbnb_listings_cleaned$zipcode <- as.numeric(airbnb_listings_cleaned$zipcode)
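Note that as.numeric() silently turns any zipcode that is not a plain number into NA. If some raw entries carry ZIP+4 suffixes or stray characters (an assumption about the raw data, not verified here), a slightly more defensive sketch would extract the leading five digits first:
#Illustrative sketch (hypothetical helper): keep the leading five digits, then convert;
#anything without five leading digits still becomes NA.
clean_zip <- function(zip)
{
  zip <- trimws(as.character(zip))
  as.numeric(ifelse(grepl("^\\d{5}", zip), substr(zip, 1, 5), NA))
}
#airbnb_listings_cleaned$zipcode <- clean_zip(airbnb_listings_cleaned$zipcode)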
iii. Filtering & Subsetting the dataset
length(unique(airbnb_listings_cleaned$zipcode))
## [1] 194
A total of 194 zipcodes are available in the AirBnb listings data
airbnb_listings_cleaned %>%
select(zipcode, bedrooms) %>%
filter(bedrooms == 2) %>%
distinct(zipcode)%>%
count()
Out of 194, only 170 zipcodes have information about 2-bedroom properties, so 24 zipcodes drop out of our analysis.
Since it is already established that 2-bedroom properties are the most profitable,
we filter the listings dataset down to 2-bedroom properties only.
airbnb_listings_2BR <- airbnb_listings_cleaned %>%
filter(bedrooms == 2)
dim(airbnb_listings_2BR)
## [1] 6497 48
#We no longer need the listings_cleaned data, removing it
rm(airbnb_listings_cleaned)
There are multiple columns that provide information on the location of the properties. Some of these (like street) are too specific and others (like country_code) are too generic.
Revisiting these columns after filtering the dataset, to identify columns that do not provide any useful information.
#A quick glance at the dataset suggests that columns such as street and smart_location can likely be dropped
#Looking at these columns to verify
airbnb_listings_2BR %>%
select(street,
neighbourhood,
neighbourhood_cleansed,
neighbourhood_group_cleansed,
city,
state,
market,
smart_location,
country_code,
country) %>%
summarise_all(n_distinct) %>%
gather("Field", "Count",1:10)
The number of distinct values in some of these columns is very large -
There are 5 values for "neighbourhood_group_cleansed", which represent the boroughs (neighbourhoods).
Since "Borough" is the widely used logical grouping for zipcodes, we can drop the other location columns, as they carry no new, interesting information that might be useful in our analysis.
airbnb_listings_2BR <- airbnb_listings_2BR %>%
select(-c("street",
"neighbourhood",
"neighbourhood_cleansed",
"city"))
dim(airbnb_listings_2BR)
## [1] 6497 44
There are observations with “market” as “London” and “Other(Domestic)”. Analyzing those records
airbnb_listings_2BR %>% filter(!market %in% c("New York"))
These 14 observations are for New York only, as is evident from the other location-specific columns. So it is safe to assume that the missing and odd values (London and Other (Domestic)) are nothing more than recording errors.
The columns "state", "market", "smart_location", "country_code" and "country" essentially have only one value. We can drop these columns.
airbnb_listings_2BR <- airbnb_listings_2BR %>%
select(-c("state",
"market",
"smart_location",
"country_code",
"country"))
dim(airbnb_listings_2BR)
## [1] 6497 39
Understanding the variety of information available in the remaining character columns -
airbnb_listings_2BR %>%
select(neighbourhood_group_cleansed,
property_type,
room_type,
bed_type) %>%
summarise_all(n_distinct) %>%
gather("Field", "Count",1:4)
#Converting the categorical character columns to factors
airbnb_listings_2BR[, c("neighbourhood_group_cleansed", "bed_type", "room_type", "property_type")] <-
airbnb_listings_2BR[, c("neighbourhood_group_cleansed", "bed_type", "room_type", "property_type")] %>% mutate_all(as.factor)
iv. Missing values treatment
We will identify which fields have missing values and then find appropriate ways of handling them.
#table(colSums(is.na(airbnb_listings_2BR)))
apply(is.na(airbnb_listings_2BR), 2, sum)
## id last_scraped
## 0 0
## neighbourhood_group_cleansed zipcode
## 0 50
## latitude longitude
## 0 0
## property_type room_type
## 0 0
## accommodates bathrooms
## 0 3
## bedrooms beds
## 0 1
## bed_type square_feet
## 0 6396
## guests_included minimum_nights
## 0 0
## maximum_nights availability_30
## 0 0
## availability_60 availability_90
## 0 0
## availability_365 number_of_reviews
## 0 0
## number_of_reviews_ltm first_review
## 0 1280
## last_review review_scores_rating
## 1280 1383
## review_scores_accuracy review_scores_cleanliness
## 1386 1384
## review_scores_checkin review_scores_communication
## 1388 1384
## review_scores_location review_scores_value
## 1387 1389
## reviews_per_month price
## 1280 59
## weekly_price monthly_price
## 6175 6493
## security_deposit cleaning_fee
## 2190 933
## extra_people
## 0
Of all the numeric fields, square_feet, weekly_price and monthly_price have a considerable share of NAs (more than 90%).
Imputing these columns based on the available data would not be accurate, so it is safe to drop them.
airbnb_listings_2BR <- airbnb_listings_2BR %>% select(-c(square_feet, monthly_price, weekly_price))
Since zipcode is at the heart of our analysis, even though only 50 observations have missing zipcode values, we populate them from other records, making use of the latitude and longitude columns.
These observations are populated from other observations having (almost) the same latitude and longitude values.
#For each listing with a missing zipcode, look for another listing at (almost) the same
#coordinates and copy its zipcode. Start by matching latitude/longitude rounded to 4 decimal
#places, then progressively relax the precision for any rows that are still missing.
for(k in 4:1)
{
index <- which(is.na(airbnb_listings_2BR$zipcode))
for(i in index)
{
for(j in 1:nrow(airbnb_listings_2BR))
{
if(round(airbnb_listings_2BR$latitude[i],k) == round(airbnb_listings_2BR$latitude[j],k) &&
   round(airbnb_listings_2BR$longitude[i],k) == round(airbnb_listings_2BR$longitude[j],k) &&
   i != j && !is.na(airbnb_listings_2BR$zipcode[j]))
{
airbnb_listings_2BR$zipcode[i] <- airbnb_listings_2BR$zipcode[j]
break
}
}
}
}
Noting the count of observations with missing zipcode
sum(is.na(airbnb_listings_2BR$zipcode))
## [1] 0
v. Univariate Analysis and Outlier Treatment
In order to determine the estimated revenue for a listing, there are 2 main metrics to be considered - price/night and minimum_nights.
So we will explore these variables first, and identify and treat outliers (if any).
plot_ly(y = airbnb_listings_2BR$price,
type = "box",
marker = list(color = 'rgb(7,40,89)'),
line = list(color = 'rgb(7,40,89)'))%>%
layout(title = "Outlier identification for Price/Night metric",
xaxis = list(title = ""),
yaxis = list(title = ""))
There are a large number of outliers for price/night. Some values are so large that they do not really make sense.
plot_ly(x = airbnb_listings_2BR$price,
type = "histogram",
marker = list(color = 'rgb(7,40,89)'),
line = list(color = 'rgb(7,40,89)'))%>%
layout(title = "Distribution of Price/Night across all properties",
xaxis = list(title = "Price"),
yaxis = list(title = "Frequency"))
Summary statistics for Price/Night metric
summary(airbnb_listings_2BR$price)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 129.0 185.0 214.1 259.0 999.0 59
Based on the summary statistics, we observe that about 75% of the listings charge up to 260 dollars per night. A price per night as large as 1,000 dollars appears to be an outlier.
To classify observations as outliers (and subsequently remove them) we follow the rule of thumb - "Any observation with a price value more than 3 standard deviations above the mean can be considered an outlier".
airbnb_listings_2BR <- airbnb_listings_2BR %>%
mutate(z_price = scale(price, center = T, scale = T))
length(airbnb_listings_2BR$z_price[airbnb_listings_2BR$z_price > 3])
## [1] 188
airbnb_2BR_treated <- airbnb_listings_2BR %>% filter(z_price <= 3)
dim(airbnb_2BR_treated)
## [1] 6309 37
rm(airbnb_listings_2BR)
sum(is.na(airbnb_2BR_treated$minimum_nights))
## [1] 0
There are no missing values in the "minimum_nights" column.
summary(airbnb_2BR_treated$minimum_nights)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 2.00 3.00 8.04 5.00 999.00
From the summary statistics we see that about 75% of properties require a minimum stay of 5 nights or fewer.
We cannot simply drop the properties with unusually high "minimum_nights", as property types like serviced apartments usually have longer rental durations.
We will look at the number of properties and the summary statistics for "minimum_nights" for each property type.
min_nights_property <- airbnb_2BR_treated %>%
select(property_type, minimum_nights) %>%
group_by(property_type) %>%
summarize(count = n(),
min = min(minimum_nights),
mean = mean(minimum_nights),
max_min_nights = max(minimum_nights)) %>%
arrange(desc(count))
min_nights_property
There are not many listings of types such as Bungalow, Guesthouse, etc. Comparing the overall summary statistics for "minimum_nights" with the summary by "property_type", 75 percent of the properties have minimum nights <= 5.
Therefore, as with price, we treat values of "minimum_nights" more than 3 standard deviations above the mean as outliers and remove those listings.
airbnb_2BR_treated <- airbnb_2BR_treated %>%
mutate(z_min_nights = scale(minimum_nights, center = T, scale = T))
length(airbnb_2BR_treated$z_min_nights[airbnb_2BR_treated$z_min_nights > 3])
## [1] 50
airbnb_2BR_treated <- airbnb_2BR_treated %>% filter(z_min_nights <= 3)
summary(airbnb_2BR_treated$minimum_nights)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 6.646 5.000 70.000
sum(is.na(airbnb_2BR_treated$number_of_reviews))
## [1] 0
sum(is.na(airbnb_2BR_treated$reviews_per_month))
## [1] 1174
vi. Generating Estimated Revenue/Year field
Assuming the occupancy rate for all properties across all zipcodes is 75%, we can estimate the revenue per year per listing using the formula below -
**Revenue/year per listing = 0.75 * 365 * price/night**
airbnb_2BR_treated$revenue_year_75 <- 0.75 * 365 * airbnb_2BR_treated$price
Another, more conservative approach is to treat the number of reviews per month as a reflection of the number of bookings received per month.
Since the column "reviews_per_month" has a good number of missing values, we calculate a new column "reviews_month" from the "number_of_reviews", "last_scraped" and "first_review" columns.
**Number of reviews/month = Number of reviews / (last_scraped - first_review, in months)**
airbnb_2BR_treated <- airbnb_2BR_treated %>%
mutate(reviews_month = 4*number_of_reviews/as.numeric(difftime(last_scraped, first_review, units = "weeks")))
**Revenue/year per listing = 12 * number of reviews / month * price/night * minimum_nights**
airbnb_2BR_treated <- airbnb_2BR_treated %>% mutate(revenue_year_reviews = 12* reviews_month * price * minimum_nights)
As a last step, we drop columns like latitude/longitude, as they are no longer needed.
airbnb_revenue <- airbnb_2BR_treated %>% select(-c(latitude,
longitude,
z_price,
z_min_nights))
The AirBnb Listings dataset has been treated and cleaned.
dim(zillow_cost_data)
## [1] 8946 262
View(unique(zillow_cost_data$City))
i. Dataset filtering
A quick glance at the cost data shows that it covers many cities, not only New York.
Since the analysis needs to be performed only for New York City, we filter the dataset.
#Filtering the data set for City = New York and State = NY
zillow_data_NY <- zillow_cost_data %>%
filter(City == "New York" &
State == "NY")
dim(zillow_data_NY)
## [1] 25 262
#25 262
## Removing the original dataset having cost data for all cities
rm(zillow_cost_data)
ii. Restructuring the data
#Reshaping from wide format (one column per month) to long format with YearMonth and Cost columns
zillow_data_NY <- gather(zillow_data_NY, YearMonth, Cost, -c(1:7))
dim(zillow_data_NY)
## [1] 6375 9
#Splitting YearMonth (e.g. "X1996.04") into Year and Month components
zillow_data_NY <- separate(zillow_data_NY, YearMonth, c("Year", "Month"))
dim(zillow_data_NY)
## [1] 6375 10
#read.csv prefixed the numeric column names with "X"; stripping that prefix from Year
zillow_data_NY$Year <- substr(zillow_data_NY$Year, 2, nchar(zillow_data_NY$Year))
iii. Estimating median cost over years
To estimate the cost of properties for the year 2019, we will use the median cost per year instead of the monthly costs over the years.
Deciding the level at which to compute the median cost for each year -
#Counting the number of distinct values in each column
unique_val <- NULL
for(i in 1:ncol(zillow_data_NY))
{
unique_val[i] <- length(unique(zillow_data_NY[,i]))
}
cbind(colnames(zillow_data_NY), unique_val)
## unique_val
## [1,] "RegionID" "25"
## [2,] "RegionName" "25"
## [3,] "City" "1"
## [4,] "State" "1"
## [5,] "Metro" "1"
## [6,] "CountyName" "4"
## [7,] "SizeRank" "25"
## [8,] "Year" "22"
## [9,] "Month" "12"
## [10,] "Cost" "3712"
We can see that there is only one City (New York), one State (NY) and one Metro (New York). So we can drop these columns from our dataset.
nrow(unique(zillow_data_NY[,c("RegionID", "RegionName")]))
## [1] 25
Also, since RegionName is the zipcode (which will be used to join with the AirBnb dataset) and each RegionID corresponds to a unique RegionName, we can drop RegionID from our analysis.
Calculating the median cost for each year at the RegionName, CountyName, SizeRank, Year level (RegionID, City, State, Metro are dropped)
zillow_medcost_year <- zillow_data_NY %>%
group_by(RegionName, CountyName, SizeRank, Year) %>%
summarize(MedianCost = median(Cost)) %>%
ungroup()
dim(zillow_medcost_year)
## [1] 550 5
## We no longer need this dataset as we have restructured it to suit our analysis
rm(zillow_data_NY)
iv. Missing values treatment
Checking for missing values in each column
year_cost_data <- zillow_medcost_year %>% spread(Year, MedianCost)
colSums(is.na(year_cost_data))
## RegionName CountyName SizeRank 1996 1997 1998 1999
## 0 0 0 17 17 17 16
## 2000 2001 2002 2003 2004 2005 2006
## 16 16 17 17 13 4 1
## 2007 2008 2009 2010 2011 2012 2013
## 1 0 0 0 0 0 0
## 2014 2015 2016 2017
## 0 0 0 0
Years 1996 to 2007 have missing values for the median cost. One alternative would be to compute each year's median from only the months that do have a value (see the sketch below).
However, instead of complicating our analysis, we filter the data to the years 2008 to 2017 to estimate the cost for 2019.
10 years' worth of data should be sufficient to observe the price trend over the years and estimate it for 2019.
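For reference, the month-wise alternative mentioned above would only require ignoring missing months when the yearly median is computed; a minimal sketch (it would replace the summarize() call in step iii., before zillow_data_NY is removed):
#Sketch of the alternative approach (not used further in this analysis)
#zillow_medcost_year <- zillow_data_NY %>%
#  group_by(RegionName, CountyName, SizeRank, Year) %>%
#  summarize(MedianCost = median(Cost, na.rm = TRUE)) %>%
#  ungroup()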
Dropping data from 1996 - 2007
zillow_medcost_year <- zillow_medcost_year[zillow_medcost_year$Year >= 2008,]
dim(zillow_medcost_year)
## [1] 250 5
View(zillow_medcost_year)
v. Cost estimation for 2019
Visualizing the trend of median cost over years within each zipcode
zillow_medcost_year %>%
plot_ly(x = ~Year,
y = ~MedianCost,
group = ~as.factor(RegionName),
type = "scatter",
mode = "lines+markers",
color = ~as.factor(RegionName))%>%
layout(title = "Trend of median cost of property over years within each zipcode",
xaxis = list(title = ""),
yaxis = list(title = "Median Cost"))
zillow_medcost_year %>%
filter(Year >= 2013) %>%
plot_ly(x = ~Year,
y = ~MedianCost,
group = ~as.factor(RegionName),
type = "scatter",
color = ~as.factor(RegionName),
mode = "lines+markers")%>%
layout(title = "Trend of median cost of property over years within each zipcode",
xaxis = list(title = ""),
yaxis = list(title = "Median Cost"))
Almost all the zipcodes have shown a steady increase in prices over the past 5 years.
To estimate the cost of properties in 2019, we take the maximum median cost observed over the last 5 years (2013-2017).
zillow_medcost_region <- zillow_medcost_year %>%
spread(Year, MedianCost)
rm(zillow_medcost_year)
zillow_medcost_2019 <- zillow_medcost_region %>%
group_by(RegionName,CountyName, SizeRank) %>%
rowwise() %>%
mutate(`2019` = max(c(`2013`,
`2014`,
`2015`,
`2016`,
`2017`)))
zillow_medcost_2019 <- gather(zillow_medcost_2019, Year, Cost , -c(1:3))
vi. Filtering the cost information just for year 2019
As the historical cost information was required only to predict the cost for the year 2019, we can filter the historical data out and retain data for the year 2019 only.
zillow_cost <- zillow_medcost_2019 %>% filter(Year == 2019)
View(zillow_cost)
rm(zillow_medcost_region)
The Zillow Cost dataset has been treated and cleaned.
Merging Airbnb Revenue and Zillow Cost datasets
The 2 datasets - airbnb_revenue and zillow_cost - are merged at the zipcode level.
Checking the number of distinct zipcodes in each dataset to determine the percentage of zipcodes that overlap between the two.
length(unique(airbnb_revenue$zipcode))
## [1] 166
length(unique(zillow_cost$RegionName))
## [1] 25
Since zillow_cost has only 25 distinct zipcodes and the merge is an inner join, we can expect to see at most 25 distinct zipcodes in the merged dataset.
Merging the 2 datasets using an inner join
analysis_dataset <- airbnb_revenue %>% inner_join(zillow_cost, by = c("zipcode" = "RegionName"))
dim(analysis_dataset) #1467 41
## [1] 1467 41
Checking the number of distinct zipcodes in the merged dataset
length(unique(analysis_dataset$zipcode))
## [1] 24
24 out of the 25 zipcodes available in the Zillow dataset had a corresponding match in the AirBnb Revenue dataset.
Our final dataset is ready for deep dive analysis
In order to determine the zipcodes to invest in, we need to analyze the following factors -
1. Popularity/High Demand: Based on -
* a. most reviewed properties
* b. number of properties available
2. Price per night
3. Cost of property
4. Estimated revenue
(Some of the steps below may not map directly to one of the above points; they are included to understand the data better.)
glimpse(analysis_dataset)
## Observations: 1,467
## Variables: 41
## $ id <int> 16458, 20853, 23135, 27006, 27659, 323...
## $ last_scraped <date> 2019-07-08, 2019-07-09, 2019-07-09, 2...
## $ neighbourhood_group_cleansed <fct> Brooklyn, Brooklyn, Brooklyn, Manhatta...
## $ zipcode <dbl> 11215, 11217, 11231, 10023, 11215, 112...
## $ property_type <fct> Condominium, Apartment, Apartment, Apa...
## $ room_type <fct> Entire home/apt, Entire home/apt, Enti...
## $ accommodates <int> 4, 4, 3, 4, 9, 3, 6, 2, 4, 4, 4, 4, 6,...
## $ bathrooms <dbl> 2.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.5, NA,...
## $ bedrooms <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,...
## $ beds <int> 1, 3, 2, 3, 6, 1, 3, 2, 3, 3, 2, 2, 3,...
## $ bed_type <fct> Real Bed, Real Bed, Real Bed, Real Bed...
## $ guests_included <int> 1, 4, 1, 3, 1, 2, 2, 1, 2, 2, 11, 2, 1...
## $ minimum_nights <int> 3, 7, 2, 6, 2, 2, 2, 2, 2, 2, 2, 3, 2,...
## $ maximum_nights <int> 14, 120, 730, 30, 30, 5, 730, 365, 30,...
## $ availability_30 <int> 0, 0, 21, 0, 0, 0, 0, 12, 0, 0, 13, 11...
## $ availability_60 <int> 0, 0, 39, 16, 17, 0, 0, 37, 0, 0, 43, ...
## $ availability_90 <int> 0, 0, 69, 16, 17, 0, 0, 67, 6, 0, 73, ...
## $ availability_365 <int> 0, 158, 342, 199, 216, 0, 0, 342, 272,...
## $ number_of_reviews <int> 4, 21, 233, 27, 16, 4, 80, 150, 21, 21...
## $ number_of_reviews_ltm <int> 0, 5, 27, 4, 6, 0, 61, 21, 6, 0, 0, 17...
## $ first_review <date> 2017-07-02, 2013-11-10, 2010-05-20, 2...
## $ last_review <date> 2017-09-24, 2019-05-21, 2019-06-24, 2...
## $ review_scores_rating <int> 100, 99, 93, 92, 95, 90, 94, 92, 97, 9...
## $ review_scores_accuracy <int> 10, 10, 10, 9, 10, 10, 10, 10, 10, 9, ...
## $ review_scores_cleanliness <int> 10, 10, 10, 9, 9, 10, 10, 10, 9, 9, 9,...
## $ review_scores_checkin <int> 10, 10, 10, 10, 10, 10, 10, 9, 9, 9, 9...
## $ review_scores_communication <int> 10, 10, 9, 10, 10, 10, 10, 9, 10, 9, 1...
## $ review_scores_location <int> 10, 10, 9, 10, 10, 10, 10, 9, 10, 9, 1...
## $ review_scores_value <int> 10, 10, 9, 9, 10, 10, 9, 9, 10, 9, 9, ...
## $ reviews_per_month <dbl> 0.16, 0.30, 2.09, 0.27, 0.24, 0.04, 2....
## $ price <dbl> 225, 250, 175, 235, 400, 140, 250, 165...
## $ security_deposit <dbl> NA, 500, 0, 500, 500, NA, 800, NA, NA,...
## $ cleaning_fee <dbl> NA, 250, 65, 120, 120, 60, 165, 65, NA...
## $ extra_people <dbl> 0, 50, 15, 40, 0, 0, 40, 20, 25, 50, 3...
## $ revenue_year_75 <dbl> 61593.75, 68437.50, 47906.25, 64331.25...
## $ reviews_month <dbl> 0.15217391, 0.28447025, 1.95504945, 0....
## $ revenue_year_reviews <dbl> 1232.6087, 5973.8752, 8211.2077, 4273....
## $ CountyName <chr> "Kings", "Kings", "Kings", "New York",...
## $ SizeRank <int> 71, 1555, 1817, 3, 71, 32, 71, 1817, 7...
## $ Year <chr> "2019", "2019", "2019", "2019", "2019"...
## $ Cost <dbl> 1045400, 1231850, 1202550, 1988700, 10...
colSums(is.na(analysis_dataset))
## id last_scraped
## 0 0
## neighbourhood_group_cleansed zipcode
## 0 0
## property_type room_type
## 0 0
## accommodates bathrooms
## 0 3
## bedrooms beds
## 0 0
## bed_type guests_included
## 0 0
## minimum_nights maximum_nights
## 0 0
## availability_30 availability_60
## 0 0
## availability_90 availability_365
## 0 0
## number_of_reviews number_of_reviews_ltm
## 0 0
## first_review last_review
## 324 324
## review_scores_rating review_scores_accuracy
## 341 341
## review_scores_cleanliness review_scores_checkin
## 341 342
## review_scores_communication review_scores_location
## 341 342
## review_scores_value reviews_per_month
## 342 324
## price security_deposit
## 0 549
## cleaning_fee extra_people
## 184 0
## revenue_year_75 reviews_month
## 0 324
## revenue_year_reviews CountyName
## 324 0
## SizeRank Year
## 0 0
## Cost
## 0
Before we begin analyzing properties at the neighbourhood and zipcode level, it is useful to know how many zipcodes are available to us within each neighbourhood.
Split of zipcodes by neighbourhood
neighbourhood_zip <- analysis_dataset %>%
select(neighbourhood_group_cleansed, zipcode) %>%
group_by(neighbourhood_group_cleansed) %>%
distinct(neighbourhood_group_cleansed, zipcode) %>%
arrange(neighbourhood_group_cleansed)
neighbourhood_zip
We see that zipcode “10013” maps to both Manhattan and Brooklyn. This is a data issue.
As a matter of fact, zipcode 10013 belongs to Manhattan. Retrieving the record(s) that have zipcode = '10013' and neighbourhood_group_cleansed = 'Brooklyn'
analysis_dataset %>% filter(zipcode == 10013 & neighbourhood_group_cleansed == 'Brooklyn')
There is only one such observation, and important metrics like number_of_reviews, first_review and last_review are null for it.
So we omit this record from the analysis.
analysis_dataset <- analysis_dataset %>% filter(!(zipcode == 10013 & neighbourhood_group_cleansed == 'Brooklyn'))
Count of zipcodes in each area is -
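The tally itself can be reproduced directly from the merged dataset (a small sketch; the printed counts are omitted here):
#Number of distinct zipcodes available within each neighbourhood
analysis_dataset %>%
  distinct(neighbourhood_group_cleansed, zipcode) %>%
  count(neighbourhood_group_cleansed)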
Analyzing the zipcodes against the metrics described above
1. Popularity
Number of listings available
Visualizing the distribution of listings across neighbourhood
analysis_dataset %>%
group_by(neighbourhood_group_cleansed) %>%
count() %>%
arrange(desc(n)) %>%
plot_ly(x = ~reorder(neighbourhood_group_cleansed, -n),
y = ~n,
type = 'bar',
text = ~n,
textposition = 'outside',
color = ~neighbourhood_group_cleansed) %>%
layout(title = "Distribution of number of listings by neighbourhood",
xaxis = list(title = ""),
yaxis = list(title = "Number of listings"))
Manhattan has the highest number of properties, followed by Brooklyn. This finding is not surprising given the popularity and strategic importance of these boroughs.
We can infer that these 2 boroughs are a popular choice among tourists and offer a wide range of options for guests to choose from.
We will go one level deeper to identify which zipcodes within each of these neighbourhoods offer the most properties.
Visualizing the distribution of listings by zipcode across neighbourhood
analysis_dataset %>%
select(neighbourhood_group_cleansed, zipcode) %>%
group_by(neighbourhood_group_cleansed, zipcode) %>%
count() %>%
arrange(desc(n)) %>%
plot_ly(x = ~reorder(zipcode, -n),
y = ~n,
type = 'bar',
color = ~neighbourhood_group_cleansed) %>%
layout(title = "Distribution of number of properties by zipcode",
xaxis = list(title = ""),
yaxis = list(title = "Number of listings"))
As expected, zipcodes within Manhattan and Brooklyn have the most properties overall.
Below is the list of zipcodes selected from each neighbourhood that have a high number of listings -
Queens: 11434 (There is only 1 zipcode present)
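The shortlist above was read off the bar chart; as a sketch, a similar ranking could also be pulled programmatically, e.g. the top three zipcodes per neighbourhood by listing count (the cut-off of three is an assumption for illustration):
#Sketch: top 3 zipcodes by number of listings within each neighbourhood
analysis_dataset %>%
  count(neighbourhood_group_cleansed, zipcode) %>%
  group_by(neighbourhood_group_cleansed) %>%
  top_n(3, n) %>%
  arrange(neighbourhood_group_cleansed, desc(n))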
Next, we will identify the most reviewed properties.
Number of reviews received
Analyzing the pattern of reviews received by properties across zipcodes indirectly indicates which zipcodes are in high demand.
Visualizing the distribution of the number of reviews received over time
zipcode_popularity <-
analysis_dataset %>%
mutate(time_prd = round(difftime(last_scraped, first_review, units = "weeks"),0)) %>%
group_by(neighbourhood_group_cleansed,zipcode) %>% select(neighbourhood_group_cleansed, zipcode, number_of_reviews, time_prd) %>%
distinct()%>%
drop_na()
plot_ly(zipcode_popularity,
x = ~as.numeric(time_prd),
y = ~number_of_reviews,
type = 'scatter',
color = ~as.factor(zipcode),
colors = "Set2")%>%
layout(title = "Number of reviews recieved over time",
xaxis = list(title = "Time (in weeks)"),
yaxis = list(title = "Number of reviews"))
Zipcodes that have been able to gain traction in a short time frame are -
There are also some properties that have existed for a longer time period and have consistently attracted guests. Below are such popular zipcodes -
Based on the above “Popularity” analysis, top winners are -
* 10003 (Manhattan)
* 11215 (Brooklyn)
* 10014 (Manhattan)
* 10036 (Manhattan)
* 11201 (Brooklyn)
* 11231 (Brooklyn)
* 10013 (Manhattan)
* 11434 (Queens)
* 10011 (Manhattan)
* 10128 (Manhattan)
The next factor we need to analyze is price/night.
2. Price Per Night
Identifying zipcodes with higher price/night figures gives an indication of how much guests are willing to pay to stay in a particular neighbourhood (or zipcode).
analysis_dataset %>%
plot_ly(x = ~neighbourhood_group_cleansed,
y = ~price,
type = "box",
jitter = 0.7,
color = ~neighbourhood_group_cleansed) %>%
layout( title = "Variation in Price per Night by Neighbourhood",
yaxis = list(title = "Price per Night"),
xaxis = list(title = "Neighbourhood"))
Given their popularity and demand, properties in Manhattan and Brooklyn charge guests more per night than those in Queens and Staten Island.
The median price/night charged in Manhattan equals the maximum price charged in Queens.
We will now drill down to the zipcode level to determine the prime areas that charge guests the most.
analysis_dataset %>%
plot_ly(x = ~as.factor(zipcode),
y = ~price,
type = "box",
jitter = 0.7,
color = ~neighbourhood_group_cleansed) %>%
layout(title = "Variation in Price per Night by Zipcode",
yaxis = list(title = "Price per Night"),
xaxis = list(title = "Zipcode"))
Our favorable zipcodes based on Price/night metric are -
* 10011 (Manhattan)
* 10036 (Manhattan)
* 10022 (Manhattan)
* 10013 (Manhattan)
* 10014 (Manhattan)
* 10023 (Manhattan)
* 10028 (Manhattan)
* 11201 (Brooklyn)
A property that charges a high price per night might not always be profitable, as it is likely that the property itself costs more to buy.
3. Cost of Investment
The next factor to be considered in the analysis is the cost of buying a property. Favorable zipcodes are the ones that do not demand a high investment.
Understanding the distribution of Cost of Property across neighbourhoods
analysis_dataset %>%
plot_ly(x = ~neighbourhood_group_cleansed,
y = ~Cost,
type = "box",
jitter = 0.7,
color = ~neighbourhood_group_cleansed) %>%
layout( title = "Variation in Property Cost by Neighbourhood",
yaxis = list(title = "Cost"),
xaxis = list(title = "Neighbourhood"))
Undoubtedly, Manhattan is the costliest borough.
The variation in property cost across the boroughs is too large to plot on one single graph.
So, to analyze cost figures within zipcodes, we plot the zipcodes of Brooklyn and Manhattan together and those of Queens and Staten Island separately.
Comparing the cost of properties across zipcodes within each neighbourhood
analysis_data_cost <- analysis_dataset %>%
group_by(zipcode) %>%
mutate(median_cost = median(Cost)) %>%
select(neighbourhood_group_cleansed, zipcode, median_cost)
analysis_data_cost <- analysis_data_cost %>% distinct()
View(analysis_data_cost)
analysis_data_cost %>%
filter(neighbourhood_group_cleansed %in% c("Manhattan", "Brooklyn"))%>%
plot_ly(x = ~as.factor(zipcode),
y = ~median_cost,
color = ~neighbourhood_group_cleansed,
type = "bar") %>%
layout( title = "Cost of properties by Zipcode (Manhattan & Brooklyn)",
yaxis = list(title = "Cost"),
xaxis = list(title = "Zipcode"))
Zipcodes that have cheaper costs for properties are -
* 11234 (Brooklyn)
* 10021 (Manhattan)
* 10128 (Manhattan)
* 11201 (Brooklyn)
* 11231 (Brooklyn)
* 10022 (Manhattan)
* 10028 (Manhattan)
* 11217 (Brooklyn)
* 10023 (Manhattan)
* 10025 (Manhattan)
analysis_data_cost %>%
filter(neighbourhood_group_cleansed %in% c("Queens", "Staten Island")) %>%
plot_ly(x = ~as.factor(zipcode),
y = ~median_cost,
color = ~neighbourhood_group_cleansed,
type = "bar")%>%
layout( title = "Cost of properties by Zipcode (Staten Island & Queens)",
yaxis = list(title = "Cost"),
xaxis = list(title = "Zipcode"))
From the plot it appears that zipcodes 10305 and 11434 are the costliest. This is true when considering only Queens and Staten Island; compared to Manhattan and Brooklyn, however, all of these zipcodes have lower costs.
So we pick all the zipcodes from these 2 neighbourhoods as favorable -
* 10309 (Staten Island)
* 10306 (Staten Island)
* 10314 (Staten Island)
* 10308 (Staten Island)
* 10304 (Staten Island)
* 10303 (Staten Island)
* 10305 (Staten Island)
* 11434 (Queens)
A final factor to be analyzed is the "Estimated Revenue".
4. Estimated Revenue
Price/night and cost analysis alone cannot identify the profitable zipcodes. We need to understand the tradeoff and also analyze estimated revenue figures.
For each zipcode we will look at the following metrics -
In addition, we will derive an index, i.e. (Revenue Earned / Cost Invested), to better identify the profitable zipcodes.
zipcode_profit <-
analysis_dataset %>%
select("neighbourhood" = neighbourhood_group_cleansed, zipcode, price, Cost, revenue_year_75, revenue_year_reviews) %>%
group_by(neighbourhood, zipcode, Cost) %>%
summarize(Revenue_75 = median(revenue_year_75)) %>%
mutate(RevenuePerCost = Revenue_75/Cost) %>%
arrange(desc(RevenuePerCost),desc(Revenue_75),Cost)
print(zipcode_profit)
## # A tibble: 24 x 5
## # Groups: neighbourhood, zipcode [24]
## neighbourhood zipcode Cost Revenue_75 RevenuePerCost
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Staten Island 10306 341250 32166. 0.0943
## 2 Staten Island 10303 312450 29428. 0.0942
## 3 Queens 11434 373950 34766. 0.0930
## 4 Staten Island 10304 311050 26006. 0.0836
## 5 Staten Island 10308 392100 29976. 0.0764
## 6 Staten Island 10305 407900 30386. 0.0745
## 7 Brooklyn 11234 473300 34219. 0.0723
## 8 Staten Island 10309 374150 23269. 0.0622
## 9 Staten Island 10314 345950 19984. 0.0578
## 10 Manhattan 10036 1743050 81851. 0.0470
## # ... with 14 more rows
From the RevenuePerCost figures, we observe that although Manhattan is the most popular borough and fetches the highest price per night, it is not very profitable, as it demands a huge investment.
Overall, properties within Staten Island appear to be the most profitable.
However, there is a shortcoming in this approach: we are overlooking one of the most important factors, i.e. "Popularity".
We observed that Staten Island is not a popular neighbourhood among guests, so calculating revenue on the assumption that occupancy across all areas is 75% is not really accurate.
Based on the number of reviews, we should adjust the occupancy percentage of different regions when estimating revenue.
We will use the "revenue_year_reviews" field, which is the estimated revenue based on the number of reviews received, price/night and length of minimum stay.
zipcode_profit_reviews <-
analysis_dataset %>%
select("neighbourhood" = neighbourhood_group_cleansed, zipcode, price, Cost, revenue_year_reviews) %>%
group_by(neighbourhood, zipcode, Cost) %>%
summarize(Revenue_Reviews = mean(revenue_year_reviews,na.rm = T)) %>%
mutate(RevenuePerCost = Revenue_Reviews/Cost) %>%
arrange(desc(RevenuePerCost),desc(Revenue_Reviews),Cost)
print(zipcode_profit_reviews)
## # A tibble: 24 x 5
## # Groups: neighbourhood, zipcode [24]
## neighbourhood zipcode Cost Revenue_Reviews RevenuePerCost
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Staten Island 10304 311050 19393. 0.0623
## 2 Queens 11434 373950 17563. 0.0470
## 3 Staten Island 10309 374150 6436. 0.0172
## 4 Manhattan 10023 1988700 30911. 0.0155
## 5 Brooklyn 11201 1400200 21297. 0.0152
## 6 Brooklyn 11217 1231850 16470. 0.0134
## 7 Brooklyn 11234 473300 5852. 0.0124
## 8 Staten Island 10305 407900 5032. 0.0123
## 9 Staten Island 10306 341250 4151. 0.0122
## 10 Manhattan 10011 2354000 28416. 0.0121
## # ... with 14 more rows
We observe that the zipcodes that appear profitable when the number of reviews is used to estimate revenue are quite different from those obtained using the 75% occupancy assumption.
We therefore use the revenue estimated from the number of reviews received to determine the profitable zipcodes.
Revenue per Cost Analysis for zipcodes
zipcode_profit_reviews %>%
plot_ly(x = ~RevenuePerCost,
y = ~reorder(as.factor(zipcode), RevenuePerCost),
color = ~neighbourhood,
type = 'bar',
text = ~round(RevenuePerCost,2),
textposition = 'auto') %>%
layout(title = "Revenue per Cost Analysis across zipcodes",
xaxis = list(title = "Revenue/Cost Index"),
yaxis = list(title = "Zipcode"))
Zipcodes that are more profitable compared to others are -
* 10304 (Staten Island)
* 11434 (Queens)
* 10309 (Staten Island)
* 10023 (Manhattan)
* 11201 (Brooklyn)
* 11217 (Brooklyn)
* 11234 (Brooklyn)
* 10305 (Staten Island)
* 10306 (Staten Island)
* 10011 (Manhattan)
Considering all the factors analyzed above, we select the zipcodes that satisfy the maximum number of conditions (a sketch of such a tally is shown below).
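As an illustration, the tally could be computed by transcribing the four factor-wise shortlists above into vectors and counting how often each zipcode appears (a sketch only; it is not how the original selection was recorded):
#Factor-wise shortlists transcribed from the sections above
popularity_zips <- c(10003, 11215, 10014, 10036, 11201, 11231, 10013, 11434, 10011, 10128)
price_zips      <- c(10011, 10036, 10022, 10013, 10014, 10023, 10028, 11201)
cost_zips       <- c(11234, 10021, 10128, 11201, 11231, 10022, 10028, 11217, 10023, 10025,
                     10309, 10306, 10314, 10308, 10304, 10303, 10305, 11434)
revenue_zips    <- c(10304, 11434, 10309, 10023, 11201, 11217, 11234, 10305, 10306, 10011)
#Number of shortlists each zipcode appears in, most frequent first
sort(table(c(popularity_zips, price_zips, cost_zips, revenue_zips)), decreasing = TRUE)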
Here is the final list of zipcodes that should be considered for investment -
1. Additional data procurement
i. Including more zipcodes into analysis
We have AirBnb listings data for 166 zipcodes (with information on 2-bedroom properties available). However, due to the non-availability of Zillow cost data, we were able to carry out the analysis for only 24 zipcodes (only about 15% of them). Procuring this data could help make better recommendations about profitable zipcodes across New York City.
ii. Better approach to estimating revenue figures
The 2 approaches used here to estimate the revenue/year/listing have their own shortcomings.
If we could obtain actual booking information for the listings, the revenue estimate would be more accurate. Alternatively, we could build a model that predicts the occupancy rate based on key influential factors (a sketch follows).
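As a very rough starting point for such a model, one could regress the review rate (used here as a crude proxy for bookings) on listing attributes already present in the merged data; the choice of predictors below is an assumption for illustration, not a recommendation:
#Illustrative sketch only: a first-cut occupancy-proxy model
occupancy_proxy_fit <- lm(reviews_month ~ price + availability_365 +
                            review_scores_rating + neighbourhood_group_cleansed,
                          data = analysis_dataset)
summary(occupancy_proxy_fit)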
2. Better outlier treatment for price
Considering the nature and scope of this analysis, we treated any price/night figure more than 3 standard deviations above the mean as an outlier; this may have dropped some genuinely luxurious properties.
Price depends on multiple factors - property type, room type, amenities, etc. While deciding whether a price/night value for a listing is a data recording error, we need to consider all such factors.
A model that clusters similar properties together based on these price-influencing factors and gives an estimate of a reasonable price for each group would let us identify erroneous price information (a sketch follows).
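A minimal sketch of that idea, assuming k-means on a few numeric attributes and k = 5 purely for illustration - cluster the listings, then look at the typical price within each cluster:
#Illustrative sketch: cluster listings on price-influencing attributes,
#then summarize the observed price within each cluster
cluster_input <- analysis_dataset %>%
  select(accommodates, bathrooms, beds, availability_365, price) %>%
  drop_na()
set.seed(42)
km <- kmeans(scale(cluster_input %>% select(-price)), centers = 5)
cluster_input %>%
  mutate(cluster = km$cluster) %>%
  group_by(cluster) %>%
  summarize(median_price = median(price), listings = n())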
3. Customer Review and Review Scores analysis
Currently, we have considered only the number of reviews as a metric to identify popular zipcodes. Analyzing customer review text and making use of the review scores could also be included when selecting favorable zipcodes (a sketch follows).
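For instance, a score-weighted review count per zipcode could serve as a first approximation of this idea (the weighting scheme below is an assumption for illustration):
#Illustrative sketch: weight each listing's review count by its review score
analysis_dataset %>%
  group_by(neighbourhood_group_cleansed, zipcode) %>%
  summarize(weighted_reviews = sum(number_of_reviews * review_scores_rating / 100, na.rm = TRUE)) %>%
  arrange(desc(weighted_reviews))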
4. Property Cost prediction
We have used the maximum median cost observed over the last 5 years as the cost estimate for 2019. However, we could build a time-series model to better predict costs for coming years (a sketch follows).
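As a baseline before a full time-series model, a simple linear trend per zipcode could be fitted on the yearly medians already computed (a sketch under that assumption; seasonality and market shocks are ignored):
#Illustrative sketch: naive linear-trend extrapolation of median cost to 2019
hist_cost <- zillow_medcost_2019 %>%
  ungroup() %>%
  filter(Year != "2019") %>%
  mutate(Year = as.numeric(Year))
trend_2019 <- sapply(split(hist_cost, hist_cost$RegionName),
                     function(d) predict(lm(Cost ~ Year, data = d),
                                         newdata = data.frame(Year = 2019)))
head(sort(trend_2019, decreasing = TRUE))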