# show code but not console output in report
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
options(digits = 3)

Introduction

My task is to help a company offering small and mid-size apartments find the right price for their new AirBnb units in Vienna. I do this by building a model for predicting Airbnb prices with publicly available data from InsideAirbnb. We can run the model on the client’s apartments to figure out what an appropriate price is for each unit.

The Vienna AirBnb data has approximately 10.000 observations after cleaning and preparation. The mean price is about €68 and the average number of reviews is 33. However, 17% of the apartments in the data have 0 reviews, just like the client’s properties.

# set data directories (RMDs always set their own dir as the working dir)
raw_data_dir <- "../data/raw/"
clean_data_dir <- "../data/clean/"

# load libraries
library(knitr)
library(kableExtra)
library(tidyverse)
library(caret)
library(ranger)
library(glmnet)
library(e1071)
library(ggpubr)

# load data
df <- read_csv(paste0(raw_data_dir,'listings.csv.gz'))

Cleaning and Filtering

In the cleaning and filtering stage, I need to accomplish two goals:

  • adjusting the sample to be more representative of the properties of the properties we want to model
  • preparing the data for machine learning

Adjusting the Sample

The data contains 61 different property types. But our client only has one property type: apartment. We’ll trim the sample down to apartment-style properties.

# 'Entire apartment' + 'Private room in apartment' are 87% of obs, 
# there are 59 other types, and each is <= 2.3% of total obs

df %>% 
  group_by(property_type) %>% 
  summarize(count = n()) %>% 
  arrange(desc(count)) %>% 
  kable() %>% 
  kable_material(c("striped", "hover")) %>% 
  scroll_box(height = "350px")
property_type count
Entire apartment 7898
Private room in apartment 2584
Entire condominium 277
Entire serviced apartment 274
Entire loft 167
Entire house 118
Private room in house 107
Room in boutique hotel 84
Shared room in apartment 74
Private room in condominium 72
Room in serviced apartment 53
Private room in bed and breakfast 41
Private room in hostel 38
Private room in loft 29
Room in hotel 26
Room in aparthotel 25
Entire guest suite 22
Private room in guesthouse 16
Private room in townhouse 16
Entire townhouse 14
Entire villa 12
Shared room in hostel 11
Entire bungalow 9
Entire guesthouse 8
Entire place 8
Room in bed and breakfast 7
Private room in guest suite 5
Private room in serviced apartment 5
Private room in villa 4
Camper/RV 3
Entire cottage 3
Private room 3
Shared room in condominium 3
Entire cabin 2
Lighthouse 2
Private room in castle 2
Private room in cave 2
Private room in earth house 2
Shared room in house 2
Shared room in loft 2
Tiny house 2
Casa particular 1
Castle 1
Dome house 1
Entire chalet 1
Entire hostel 1
Hut 1
Private room in bus 1
Private room in camper/rv 1
Private room in dome house 1
Private room in farm stay 1
Private room in lighthouse 1
Private room in nature lodge 1
Private room in tiny house 1
Private room in treehouse 1
Room in hostel 1
Shared room in bed and breakfast 1
Shared room in hotel 1
Shared room in serviced apartment 1
Shared room in tent 1
Shared room in tiny house 1

Fortunately, 87% of the data are Entire apartment or private room in apartment, plus there are a few more serviced apartments. We still have a large sample size. I keep any property which is an apartment of some kind, and I filter out anything that is not.

# identify apartments with string detection on property_type column and recode any other value to 'Other'
df <-
  df %>% 
  mutate(property_type = ifelse(str_detect(property_type, 'apartment|Apartment'), 'Apartment', 'Other'))

# keep only apartments
df <- 
  df %>%
  filter(property_type == 'Apartment')

A quick look at the room_type column reveal something odd. If our data has been filtered to apartments, why do we have the Hotel room room type?

# look at the room_type column... why do we have hotel rooms?
df %>% 
  group_by(room_type) %>% 
  summarize(count = n()) %>% 
  arrange(desc(count)) %>% 
  kable() %>% 
  kable_material(c("striped", "hover"))
room_type count
Entire home/apt 8172
Private room 2589
Shared room 75
Hotel room 53

I clean this up by dropping Hotel room observations and renaming Entire home/apt to Entire apartment to be more accurate.

# discard hotel rooms with a negated filter
df <-
  df %>% 
  filter(!room_type == 'Hotel room')

# rename 'Entire home/apt' to 'Entire apartment'
df <-
  df %>% 
  mutate(room_type = ifelse(room_type == 'Entire home/apt', 'Entire apartment', room_type))

The client’s apartments accommodate two to six guests. Therefore, I filter out any apartments which are not in this range.

# filter to include apartments which meet our criteria: accomodates 2-6 people
df <- 
  df %>% 
  filter(accommodates %in% 2:6)

The client’s apartments all have at least one bathroom. We can filter out the small number of apartments which have zero bathrooms or just a half-bath (dude, who is renting these…) and also do some basic cleaning on the bathrooms_text column while we are at it. Currently, this column is an ugly string with values such as 1 bath or 2.5 shared baths. I convert these values to a numeric stored in bathrooms and add a flag for apartments which have shared bath facilities.

# extract number of bathrooms from string column, store in numeric column
df <-
  df %>% 
  mutate(bathrooms = as.numeric(str_extract(bathrooms_text, '\\d\\.*\\d*'))) 


# add flag for shared bathrooms
df <- 
  df %>% 
  mutate(shared_bathroom = ifelse(str_detect(bathrooms_text, 'shared'), 1, 0))

# keep only apartments with one more more bathrooms
df <- 
  df %>% filter(bathrooms >= 1)

I also filter out apartments which have extreme values for price. The client does not have super expensive luxury apartments, and having luxury apartments in the sample for the pricing model is not helpful. These values (whether they are real values for luxury apartments or data entry mistakes) are going to hurt the accuracy of the model. I set the cutoff at €600 per night - which comes out to €100 - €300 per person for apartments which accommodate 2 - 6 guests.

# strip dollar signs from values in price column by taking substring starting at second position ("$100" --> "100")
df <-
  df %>% 
  mutate(price = str_sub(price, start = 2))

# convert to numeric
df$price <- as.numeric(df$price)

# check for NAs, it turns out there are 19. small number, let's drop them
# sum(is.na(df$price))

# drop price NAs
df <-
  df %>% 
  filter(!is.na(price))

# visualize price histogram with annotated cutoff point
df %>% 
    ggplot(aes(x = price)) + 
    geom_histogram() +
    annotate("segment", x = 600, xend = 600, y = 0, yend = 2000, size = 1.5, color = "red") +
    annotate("text", x = 600, y = 2200, label = "Luxury Price Cutoff", size = 6, color = "red")

# we should drop extreme values to improve our predictions
# but also because luxury apartments are not relevant comparisons for our business case
df <-
  df %>% 
  filter(price <= 600)

Prep data for machine learning

The data needs to be in a clean format before it can be passed to a model. The key tasks are:

  • dropping unneeded variables
  • engineering useful features from an ugly string column
  • dealing with NAs

I drop unneeded variables to make the data easier to work with. Using domain knowledge or common sense, I drop anything I am confident will not help to predict price. For the client’s case, I also drop review scores columns. Typically, reviews would be highly useful features for modeling price. But in the client’s case, we are trying to predict the appropriate price for properties which don’t have any reviews because they are new to the market. We can’t confidently impute values for the missing review scores: we don’t know how much guests will enjoy these particular properties, how accurate they will find the listing, how well the client will perform the host duties overall, etc. I decide to keep the number_of_reviews variable because we do have this information (it’s zero).

# drop a bunch of variables we won't use to make wrangling amenities a little easier
# do this by selecting only the variables we are interested in
df <- 
  df %>% 
  select(id, 
         name, 
         neighbourhood_cleansed, 
         room_type, 
         accommodates,
         bathrooms,
         bedrooms,
         beds,
         amenities,
         price,
         instant_bookable,
         number_of_reviews,
         shared_bathroom)

I do a small amount of feature engineering. Every row in the amenities column contains a list of the property’s amenities (appliances, sound systems, patios, etc.) as one long string. A machine learning tool won’t be able to use these values as predictive features.

I split the string and make a list of all unique values in amenities. Then, I make binary variables for every unique value. To get the most out of this information, I consolidate amenities which are extremely similar, such as refrigerators of different brands and different varieties of stoves. Many variables are renamed to replace spaces with underscores.

# first create unique list of all amenities in the data
# the regex pattern captures amenities of any number of words
amenities_unique <-
  unique(
    unlist(
      df$amenities %>% str_extract_all('(?<=")(\\w+\\s*)*(?=")')
    )
  )

# then iterate over the list and create a new column for each unique amenity and fill values with this logic:
# in the new column, if the amenity is detected as a string in the original "amenities" at this row, set value to 1
# else set value to 0
for (i in amenities_unique){
  df <- 
    df %>% 
    mutate(!! i := ifelse(str_detect(amenities, !! i), 1, 0))
}

# mutate columns to aggregate amenities - refrigerators, ovens, stoves of different types
# this checks for any type of refrigerator, oven or stove unit, etc. and sets value to 1 if present
# example: Refrigerator = 1 if it is a refrigerator, Siemens refrigerator, Gorenje refrigerator, Miele refrigerator, etc...
df <-
  df %>% 
  mutate(Refrigerator = ifelse(str_detect(amenities, "Refrigerator|refrigerator"), 1, 0)) %>% 
  mutate(Oven = ifelse(str_detect(amenities, "Oven|oven"), 1, 0)) %>% 
  mutate(Stove = ifelse(str_detect(amenities, "Stove|stove"), 1, 0)) %>% 
  mutate(`Body Soap` = ifelse(str_detect(amenities, "Body soap|body soap|Shower gel"), 1, 0)) %>% 
  mutate(Shampoo = ifelse(str_detect(amenities, "Shampoo|shampoo"), 1, 0)) %>% 
  mutate(Conditioner = ifelse(str_detect(amenities, "Conditioner|conditioner"), 1, 0)) %>% 
  mutate(Netflix = ifelse(str_detect(amenities, "Netflix|netflix"), 1, 0)) %>% 
  mutate(`Amazon Prime Video` = ifelse(str_detect(amenities, "Amazon Prime Video"), 1, 0)) %>% 
  mutate(`Air conditioning` = ifelse(str_detect(amenities, "Air conditioning|air conditioning|Window AC"), 1, 0)) %>% 
  mutate(TV = ifelse(str_detect(amenities, "TV"), 1, 0)) %>% 
  mutate(`Hot tub` = ifelse(str_detect(amenities, "Hot tub|hot tub"), 1, 0)) %>% 
  mutate(`Sound system` = ifelse(str_detect(amenities, "Sound system|sound system"), 1, 0)) %>% 
  mutate(`Free parking` = ifelse(str_detect(amenities, "Parking|parking") & str_detect(amenities, "Free|free"), 1, 0)) %>% 
  mutate(`Paid parking` = ifelse(str_detect(amenities, "Parking|parking") & str_detect(amenities, "Paid|paid"), 1, 0))


# vector of vars to drop: all the hyper-specific appliances which have been aggregated into more general amenities
vars_to_drop <- c('Cable TV', 
                  'Shower gel', 
                  'Zanussi oven', 
                  'Zanussi electric stove', 
                  'Zanussi refrigerator',
                  'Window AC unit',
                  'Electric stove',
                  'HDTV with Netflix',
                  'Central air conditioning',
                  'Nivea body soap',
                  'Bodyshop body soap',
                  'Stainless steel oven',
                  'TV with Netflix',
                  'HDTV',
                  'Induction stove',
                  'SIEMENS oven',
                  'AEG refrigerator',
                  'Duschgel body soap',
                  'Private hot tub',
                  'Gorenje induction stove',
                  'Gorenje refrigerator',
                  'Portable air conditioning',
                  'Cerankochfeld electric stove',
                  'Siemens oven',
                  'Siemens refrigerator',
                  'Oranier induction stove',
                  'Rituals body soap',
                  'Rituals shampoo',
                  'variouse body soap',
                  'various conditioner',
                  'Bauknecht refrigerator',
                  'TV with Chromecast',
                  'Miele refrigerator',
                  'Paid parking off premises',                           
                  'Free parking on premises',                                                     
                  'Paid parking on premises',                           
                  'TV with standard cable',                              
                  'Paid parking garage off premises',                   
                  'Paid street parking off premises',                    
                  'HDTV with standard cable',                           
                  'Grundig Ovation sound system with aux',               
                  'Stainless steel electric stove',                     
                  'Paid parking lot on premises',                        
                  'Sound system with aux',                              
                  'Free parking garage on premises',                                            
                  'Paid parking garage on premises',                     
                  'Sound system with Bluetooth and aux',                
                  'Samsung TV with integrated sound system sound system',
                  'Philip Starck Parrot Bluetooth sound system',        
                  'SAVON PUR VEGETAL ROSE DE MAI body soap',             
                  'Stainless steel gas stove',                          
                  'Gorenje stainless steel oven',                        
                  'Paid parking lot off premises',                      
                  'LG sound system with Bluetooth and aux',              
                  'Yamaha Bluetooth sound system',                      
                  'Gorenje stainless steel electric stove',              
                  'HDTV with Amazon Prime Video',                       
                  'Stainless steel induction stove',                     
                  'Miele stainless steel oven',                         
                  'Miele stainless steel induction stove',
                  'Free street parking',        
                  'Bluetooth sound system',                
                  'amenities') # we don't need the original amenities column anymore

# drop the vars using the vector 
df <- 
    df %>% 
    select(-all_of(vars_to_drop))

# rename columns: replace all spaces with underscores
names(df) <- str_replace_all(names(df), '\\s', '_')

Now that I’m done adjusting the sample and features, I take one last check for NAs.

# count NAs for all columns except "id" and "name", which are just labels, not predictive features
na_info <- 
  df %>% 
  select(-c('id', 'name')) %>% 
  summarise_all(funs(sum(is.na(.))))

# show NAs
na_info[,colSums(na_info>0)>0]  %>% 
  kable() %>% 
  kable_material(c("striped", "hover"))
bedrooms beds
1111 69

There are two columns with NAs: beds and bedrooms. The number of NAs in beds is small - I can just drop them without any significant impact on our sample. But the number of NAs in bedrooms is a bigger portion of our sample. I don’t think there is an easy way to impute this. You can’t just assume that there is one bedroom for every two beds. There could be a systemic reason for missing data here, which would make simply imputing median values dangerous and misleading. (For example, hosts might tend to omit information about the number of bedrooms when the number is small…) I want to take a conservative approach: leave the observations in the data and set NA as factor level when doing prediction.

# there are 70 NAs in beds - this is small number in our 10k obs, so we can afford to drop these
df <-
  df %>% 
  drop_na(beds)

# set bedrooms to factor, including NA as a level
df$bedrooms <- 
  df$bedrooms %>% 
  factor(exclude = NULL)

I save the clean data to a separate CSV file.

# write out a clean CSV
write_csv(df, paste0(clean_data_dir,'airbnb_vienna_midsize_clean.csv'))

Functional Form and Interactions

I do not make any adjustments to the functional form of the numeric features. I examine scatterplots of numeric x variables and the y variable, price, with loess smoothing, and I don’t find any evidence to suggest that a transformation (such a log, square, or cubic) would better accommodate the pattern of association between the variables. There are two numeric variables with skewed distributions, number_of_reviews and price, but log transformations of these variables don’t seem to help fit any meaningful patterns.

I decide to interact some variables with neighbourhood. The logic here is that property features be extremely important in some areas and less important in others. One crystal clear example is free parking. In some areas of the city, parking spots are precious. The graph below mean Airbnb prices, with and without free parking, by neighbourhood:

# define a function to create plots which help identify potential interactions
# this function was written by Gabor Bekes and  Gabor Kezdi and modified slightly
price_diff_by_variables <- function(df, factor_var, dummy_var){

  factor_var <- as.name(factor_var)
  dummy_var <- as.name(dummy_var)
  
  stats <- df %>%
    group_by(!!factor_var, !!dummy_var) %>%
    dplyr::summarize(Mean = mean(price, na.rm=TRUE),
                     se = sd(price)/sqrt(n()))
  
  stats[,2] <- lapply(stats[,2], factor)
  
  ggplot(stats, aes_string(colnames(stats)[1], colnames(stats)[3], fill = colnames(stats)[2]))+
    geom_bar(stat='identity', position = position_dodge(width=0.9))+
    geom_errorbar(aes(ymin=Mean-(1.96*se),ymax=Mean+(1.96*se)),
                  position=position_dodge(width = 0.9), width = 0.25)+
    ylab('Mean Price')+
    theme_bw()+
    theme(panel.grid.major=element_blank(),
          panel.grid.minor=element_blank(),
          panel.border=element_blank(),
          axis.line=element_line(),
          axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) + 
    labs(x = 'Neighbourhood') + 
    scale_fill_grey()
}

# call the function which returns a ggplot
price_diff_by_variables(df, "neighbourhood_cleansed", "Free_parking")

When I set the model formulas, I will include a number of interaction terms which involve the neighbourhood variable. These interaction terms have the potential to improve predictive performance, and methods like LASSO can select the interaction terms which are most useful.

Model Building

Training and Test Sets

The data needs to be split into training and test sets. I’ll use 80% of the data for training. The 20% remaining in the test set will only be used for model evaluation.

# set the seed for reproducibility
set.seed(2020)

# create an index for splitting the data. let's do a classic 80/20 split
# use as.integer() to coerce the matrix into a vector
index <- as.integer(
  createDataPartition(df$price, times = 1, p = 0.8, list = FALSE)
)

# create sets with index
train_set <- df[index,]
test_set <- df[-index,]

Model Formulas

I put the variables into three different groups and create model formulas with increasing complexity:

  • basic property characteristics
  • basic property characteristics + amenities
  • basic property characteristics + amenities + neighborhood interactions
# put variables in groups to make model-building easier

# basic characteristics of the listing
basic_vars <- c('neighbourhood_cleansed', 
                'room_type', 
                'accommodates', 
                'bathrooms', 
                'bedrooms', 
                'beds', 
                'instant_bookable', 
                'shared_bathroom',
                'number_of_reviews')


# dummy variables for the amenities (fridges, parking, etc.) in columns 15 - 123
amenities <- colnames(df[15:121])

# interactions with neighbourhood
# I can't explore all possible interactions, so I'm only using basic property stats + parking
neighbourhood_interactions <- c('room_type*neighbourhood_cleansed', 
                                'accommodates*neighbourhood_cleansed', 
                                'bathrooms*neighbourhood_cleansed', 
                                'bedrooms*neighbourhood_cleansed', 
                                'beds*neighbourhood_cleansed', 
                                'instant_bookable*neighbourhood_cleansed', 
                                'shared_bathroom*neighbourhood_cleansed',
                                'number_of_reviews*neighbourhood_cleansed',
                                'Free_parking*neighbourhood_cleansed',
                                'Paid_parking*neighbourhood_cleansed')


# simplify model formulas with these groups
predictors_1 <- basic_vars
predictors_2 <- c(basic_vars, amenities)
predictors_3 <- c(basic_vars, amenities, neighbourhood_interactions)


# set formulas

# formula 1: basic variables
formula_1 <- 
  as.formula(
    paste0('price ~ ', 
           paste0(predictors_1,
                  collapse = ' + ')))

# formula 2: basic variables + amenities
formula_2 <- as.formula(
  paste0('price ~ ', 
         paste0(predictors_2,
                collapse = ' + ')))

# formula 3: basic variables + amenities + neighbourhood interactions
formula_3 <- as.formula(
  paste0('price ~ ', 
         paste0(predictors_3,
                collapse = ' + ')))

Tuning Parameters

I’ll be training models with the caret package. I’ll use five-fold cross-validation to select the best configuration from a range of tuning parameters for each model.

For an OLS model, there is no need to specify tuning parameters.
For the LASSO model, caret will find the optimal penalty term (λ) for reducing model complexity.
For the random forest models, caret will find the optimal number of variables (m) to consider at each split.

# set trainControl to do 5 folds of cross validation
train_control <- trainControl(method = "cv", number = 5)

# tuneGrid for LASSO
# try lambdas all the way up 1, we need to allow LASSO to basically zero things out if appropriate
tunegrid_lasso <-  expand.grid("alpha" = 1, "lambda" = seq(0.05, 1, by = 0.01))

# set tunegrid for random forest: simple model
# for the simple model, we have 9 features. we may consider sqrt(9) = 3 as a potential value for m
# let's at least give it two options: 3 and 4. if we go any higher, the trees may get too correlated
tune_grid_rf_simple <- expand.grid(
  .mtry = c(3, 4),
  .splitrule = "variance",
  .min.node.size = c(5, 10)
)

# set tunegrid for random forest: full model
# for the full model, we have 121 features, we may consider sqrt(121) = 11 as a possible value for m
# but let's have it try a few values
tune_grid_rf_full <- expand.grid(
  .mtry = c(9, 11, 13),
  .splitrule = "variance",
  .min.node.size = c(5, 10)
)

# setting train control for RF - the same as previous train control but you should set verboseIter = TRUE
# if you want to see the progress running in the console
train_control_rf <- trainControl(method = "cv", number = 5, verboseIter = FALSE)

Model Training

I’ll build a few different types of models with different levels of complexity:

  • OLS with basic property characteristics
  • OLS with basic property characteristics + amenities
  • OLS with basic property characteristics + amenities + neighborhood interactions
  • LASSO basic property characteristics + amenities + neighborhood interactions
  • Random forest basic property characteristics
  • Random forest basic property characteristics + amenities

The OLS models are unlikely to be the best performers, but they provide a benchmark against we can compare the LASSO and random forest models. The LASSO model is only trained using the most complex formula, because the penalty term will choose the optimal number of coefficients anyway. For the random forest model, I’ll train a basic version with just the property characteristics and a more complex version which includes amenities. However, I won’t pass interaction terms to the random forest model; the splitting process uncovers uncovers interactions automatically.

# OLS

# linear model 1: basic predictors
lm_1 <- caret::train(formula_1,
                    data = train_set,
                    method = 'lm',
                    preProcess = c('center', 'scale'),
                    trControl = train_control)

# linear model 2: basic predictors + amenities
lm_2 <- caret::train(formula_2,
                    data = train_set,
                    method = 'lm',
                    preProcess = c('center', 'scale'),
                    trControl = train_control)

# linear model 3: basic predictors + amenities + neighbourhood interactions
lm_3 <- caret::train(formula_3,
                    data = train_set,
                    method = 'lm',
                    preProcess = c('center', 'scale'),
                    trControl = train_control)

# LASSO

# give LASSO all potential x variables (including interactions) and let the algorithm determine which features are important
lasso_1 <- caret::train(formula_3,
                       data = train_set,
                       method = 'glmnet',
                       preProcess = c('center', 'scale'),
                       tuneGrid = tunegrid_lasso,
                       trControl = train_control
)

# Random Forest

# first train a simple model only using basic property stats


set.seed(2020)

rf_1 <- train(
  formula_1,
  data = train_set,
  method = "ranger",
  trControl = train_control_rf,
  tuneGrid = tune_grid_rf_simple,
  importance = "impurity"
)

# second train a complex model using all features (but not interactions, as decision trees uncover these independently)

set.seed(2020)

rf_2 <- train(
  formula_2,
  data = train_set,
  method = "ranger",
  trControl = train_control_rf,
  tuneGrid = tune_grid_rf_full,
  importance = "impurity"
)

Model Evaluation

First, I’ll take a look at the model performance in cross validation. The chart shows the RMSE of each model across five folds. I consider the overall mean RMSE the most meaningful metric, but it’s worth scanning the other statistics. For example, the LASSO model was the second best model overall in terms of mean RMSE, but observe the max value: it also had the worst single-fold performance! (We could examine the nature of the observations in the individual folds to see why LASSO performed so poorly on one of them, but it would make this report a lot longer…)

# put all models into a list
final_models <-
  list("OLS_1" = lm_1,
       "OLS_2" = lm_2,
       "OLS_3" = lm_3,
       "LASSO" = lasso_1,
       "Random_forest_basic" = rf_1,
       "Random forest_full" = rf_2)

# summarize model stats
results <- resamples(final_models) %>% summary()

# examine cross-validated RMSE
results$statistics$RMSE %>% 
  kable() %>% 
  kable_material(c("striped", "hover"))
Min. 1st Qu. Median Mean 3rd Qu. Max. NA’s
OLS_1 40.6 42.0 42.2 42.2 43.0 43.2 0
OLS_2 37.7 39.1 41.2 40.8 42.0 44.2 0
OLS_3 39.3 39.5 40.8 41.3 43.2 43.5 0
LASSO 38.9 39.9 40.2 40.3 41.1 41.6 0
Random_forest_basic 38.0 40.8 42.2 41.3 42.6 42.8 0
Random forest_full 35.4 38.8 39.9 39.1 40.2 41.0 0

Next, I’ll try the models on the test set data. There are no radical changes in performance in cross validation VS the test set - which is a good sign that the models were not overfit. The random forest (full) model with all available features had the lowest RMSE.

# make predictions  on the **test set**
predictions_lm_1 <- predict(lm_1, test_set)
predictions_lm_2 <- predict(lm_2, test_set)
predictions_lm_3 <- predict(lm_3, test_set)
predictions_lasso_1 <- predict(lasso_1, test_set)
predictions_rf1 <- predict(rf_1, test_set)
predictions_rf2 <- predict(rf_2, test_set)

# calculate RMSE for test set predictions
test_set_rmse <- data.frame('OLS_1' = RMSE(predictions_lm_1, test_set$price), 
                            'OLS_2' = RMSE(predictions_lm_2, test_set$price),
                            'OLS_3' = RMSE(predictions_lm_3, test_set$price),
                            'LASSO' = RMSE(predictions_lasso_1, test_set$price),
                            'Random_forest_basic' = RMSE(predictions_rf1, test_set$price),
                            'Random_forest_full' = RMSE(predictions_rf2, test_set$price))

# examine test set RMSE
test_set_rmse %>% 
  kable() %>% 
  kable_material(c("striped", "hover"))
OLS_1 OLS_2 OLS_3 LASSO Random_forest_basic Random_forest_full
39.6 38.2 38.7 37.6 38.3 35.1

Model Diagnostics

I do some model diagnostics as a sanity check. I examine the random forest (full) model because it was the best performer on the test set and therefore the best candidate for use in the business case. I also take a look at the LASSO model - the second-best performer - as a point of of comparison.

Y vs Y-hat Plots

The Y/Y-hat plots shows how predicted values differ from actual values. If we could create a perfect model, all points would lie on the dotted 45 degree line. We see that expensive units are an area of weakness for both models.

# LASSO - scatterplot of predicted VS actual values
lasso_pred_vs_actual <- 
  test_set %>% 
  mutate(predictions = predictions_lasso_1) %>% 
  ggplot(aes(x = predictions, y = price)) +
  geom_point(color = 'blue', size = 1, shape = 16, alpha = 0.7, show.legend=FALSE, na.rm=TRUE) +
  geom_segment(aes(x = 0, y = 0, xend = 350, yend =350), size=0.5, color = 'red', linetype=2) +
  coord_cartesian(xlim = c(0, 350), ylim = c(0, 350)) +
  scale_x_continuous(expand = c(0.01,0.01),limits=c(0, 350), breaks=seq(0, 350, by=50)) +
  scale_y_continuous(expand = c(0.01,0.01),limits=c(0, 350), breaks=seq(0, 350, by=50)) +
  labs(title = 'LASSO', y = "Price (EUR)", x = "Predicted price  (EUR)") +
  theme(plot.title = element_text(hjust = 0.5))

# RANDOM FOREST (full) - scatterplot of predicted VS actual values
rf_2_pred_vs_actual <-
  test_set %>% 
  mutate(predictions = predictions_rf2) %>% 
  ggplot(aes(x = predictions, y = price)) +
  geom_point(color = 'blue', size = 1, shape = 16, alpha = 0.7, show.legend=FALSE, na.rm=TRUE) +
  geom_segment(aes(x = 0, y = 0, xend = 350, yend =350), size=0.5, color = 'red', linetype=2) +
  coord_cartesian(xlim = c(0, 350), ylim = c(0, 350)) +
  scale_x_continuous(expand = c(0.01,0.01),limits=c(0, 350), breaks=seq(0, 350, by=50)) +
  scale_y_continuous(expand = c(0.01,0.01),limits=c(0, 350), breaks=seq(0, 350, by=50)) +
  labs(title = 'Random Forest (full model)', y = "Price (EUR)", x = "Predicted price  (EUR)") +
  theme(plot.title = element_text(hjust = 0.5))


# show LASSO scatter
lasso_pred_vs_actual

# show RF scatter
rf_2_pred_vs_actual

In fact, the random forest model is clearly a bit conservative in the valuation of high-end AirBnB units. The points in the upper left quadrant represent observations where the true price (y axis) was fairly high, but the price predicted by our model (x axis) was significantly lower. But beyond €150 on the x axis, there are only a few points under the dotted line - meaning fairly few errors where the model overpriced a unit. So when it comes to the high-end segment, we may suspect that our best model has a negative bias.

Top Coefficients in the LASSO Model

To get an idea of which predictors were most impactful in the LASSO model, I look at the top 10 largest coefficient values. The interaction terms including neighbourhood were very important.

# get the coefficients from the LASSO model
lasso_coeffs <- 
  coef(lasso_1$finalModel, lasso_1$bestTune$lambda) %>%
  as.matrix() %>%
  as.data.frame() %>%
  rownames_to_column(var = "variable") %>%
  rename(coefficient = `s1`)  # the column has a name "s1", to be renamed

# get the top 10 LASSO coefficients with clean names
lasso_top10_coef <-
  lasso_coeffs %>% 
  arrange(desc(coefficient)) %>% 
  mutate(variable = gsub('neighbourhood_cleansedInnere Stadt:bathrooms', 'neighbourhood: Innere Stadt * bathrooms', variable)) %>% 
  mutate(variable = gsub('neighbourhood_cleansedInnere Stadt:bedrooms3', 'neighbourhood: Innere Stadt * bedrooms3', variable)) %>% 
  mutate(variable = gsub('neighbourhood_cleansedInnere Stadt:bedrooms2', 'neighbourhood: Innere Stadt * bedrooms2', variable)) %>% 
  head(., n = 10)

# view the top 10 LASSO coefficients
lasso_top10_coef %>% 
  kable() %>% 
  kable_material(c("striped", "hover"))
variable coefficient
(Intercept) 68.63
neighbourhood: Innere Stadt * bathrooms 8.37
accommodates 7.85
bedrooms2 6.26
bedrooms3 5.11
TV 4.84
Air_conditioning 3.88
bathrooms 3.32
neighbourhood: Innere Stadt * bedrooms3 3.25
neighbourhood: Innere Stadt * bedrooms2 2.64

Variable Importance for Random Forest

To get an idea of which predictors were most important to the random forest model, I examine the top ten features in terms of variable importance - a metric which describes how important a variable was to reducing RMSE.

# calculate random forest variable importance as a percentage
rf_2_var_imp <- importance(rf_2$finalModel)/1000
rf_2_var_imp_df <-
  data.frame(varname = names(rf_2_var_imp), imp = rf_2_var_imp) %>%
  mutate(varname = gsub("room_typePrivate room", "room type: private room", varname) ) %>%
  mutate(varname = gsub("neighbourhood_cleansedInnere Stadt", "neighbourhood: Innere Stadt", varname) ) %>%
  arrange(desc(imp)) %>%
  mutate(imp_percentage = imp/sum(imp))

# make plot of top 10 varimp vars
rf_2_var_imp_plot <- ggplot(rf_2_var_imp_df[1:10,], aes(x=reorder(varname, imp), y=imp_percentage)) +
  geom_point(color= 'blue', size=1) +
  geom_segment(aes(x=varname,xend=varname,y=0,yend=imp_percentage), color= 'blue', size=0.75) +
  ylab("Importance (Percent)") +
  xlab("Variable Name") +
  labs(title = "Top 10 Variables (VarImp)") +
  coord_flip() +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  theme(axis.text.x = element_text(size=8), axis.text.y = element_text(size=8),
        axis.title.x = element_text(size=8), axis.title.y = element_text(size=8),
        plot.title = element_text(hjust = 0.5))

# show plot
rf_2_var_imp_plot

Partial Dependence Plots for Random Forest

A partial dependence plot offers some insight into the relationship between a single x and the random forest’s predicted y value. You can see that there isn’t much of a difference between having the capacity to accommodate two people VS three, but that capacity for each additional person past three is associated with significantly higher price. The PDP plot for bathrooms has a different shape, but possibly points to a similar principle: facilities that can serve groups, or possibly just larger apartments, may be associated with higher prices.

# partial dependence plot for number of people it accommodates
pdp_n_acc <- pdp::partial(rf_2, pred.var = "accommodates", pred.grid = distinct_(test_set, "accommodates"), train = train_set)
pdp_n_acc_plot <- pdp_n_acc %>%
  autoplot( ) +
  geom_point(color= 'blue', size=2) +
  geom_line(color= 'red', size=1) +
  ylab("Predicted price") +
  xlab("Accommodates (persons)") +
  labs(title = "Partial Dependence Plot - Accommodates") +
  scale_x_continuous(limit=c(2,6), breaks=seq(2,6,1)) +
  theme(plot.title = element_text(hjust = 0.5))

# show the plot
pdp_n_acc_plot

# partial dependence plot for number of bathrooms
pdp_bathrooms <- pdp::partial(rf_2, pred.var = "bathrooms", pred.grid = distinct_(test_set, "bathrooms"), train = train_set)
pdp_bathrooms_plot <- pdp_bathrooms %>%
  autoplot( ) +
  geom_point(color= 'blue', size = 2) +
  geom_line(color= 'red', size = 1) +
  ylab("Predicted price") +
  xlab("Number of Bathrooms") +
  scale_x_continuous(limit=c(0.5,4), breaks=seq(0.5,4,0.5)) +
  labs(title = "Partial Dependence Plot - Bathrooms")+
  theme(plot.title = element_text(hjust = 0.5)) 

# show the plot
pdp_bathrooms_plot

Comparison of Important Variables

Take a look at the top LASSO coefficients and the top features in terms of variable importance from the random forest model. The two methods have some similarities in terms of which x variables are most useful in predicting nightly price. In both models, the Innere Stadt neighbourhood seems to be associated with higher prices. (Innere Stadt means inner city in German, so this is fairly intuitive.)

# make a table for comparing most important variables in the top two models
top10_compare <-
  data.frame('LASSO' = lasso_top10_coef$variable, 
             'Random Forest' = rf_2_var_imp_df$varname[1:10])

# show the table
top10_compare %>% 
  kable() %>% 
  kable_material(c("striped", "hover"))
LASSO Random.Forest
(Intercept) accommodates
neighbourhood: Innere Stadt * bathrooms number_of_reviews
accommodates neighbourhood: Innere Stadt
bedrooms2 bathrooms
bedrooms3 beds
TV bedrooms2
Air_conditioning room type: private room
bathrooms TV
neighbourhood: Innere Stadt * bedrooms3 shared_bathroom
neighbourhood: Innere Stadt * bedrooms2 Air_conditioning

Conclusion

The model I built is decent, but not perfect. The best model achieved an RMSE of 35.058 and a mean average error of 22.346.

It may have been possible to improve my model’s performance further by including information from the various review scores variables, but I am highly skeptical of the idea of imputing these values for the client’s apartments. The model would have appeared better on paper, but it may have resulted in worse predictions for the client if we incorrectly guess what the client’s review scores should be.

As for the client’s use of the model: I would propose using the model as a starting point for setting prices. It may be a valuable complement to the client’s domain expertise, but it is only a decent performer on average, and it would be wise to consider individual price adjustments if the prediction appears unreasonable. This is particularly true of more expensive units.