
Exploring New York City Airbnb Data#

We are going to carry out an exploratory data analysis of Airbnb data available on Kaggle. The analysis is divided into sections that will allow us to understand how the data is composed, correct some problems in it, and then move on to statistical analysis and visualization. You can visit this project and more in my GitHub repository Machine-Learning-Class-2022.

# These are all the modules used in this notebook. However, later we are going to re-import some specific modules to show you when they are used.
from folium import plugins
import folium
import geopy.distance 
import matplotlib.pyplot as plt
import pandas as pd 
import seaborn as sns 
abnb_data = pd.read_csv('https://raw.githubusercontent.com/BautistaDavid/Machine-Learning-Class-2022/main/data/AB_NYC_2019.csv')

A Quick Look!#

Let’s check the first ten rows …

abnb_data.head(10) 
id name host_id host_name neighbourhood_group neighbourhood latitude longitude room_type price minimum_nights number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365
0 2539 Clean & quiet apt home by the park 2787 John Brooklyn Kensington 40.64749 -73.97237 Private room 149 1 9 2018-10-19 0.21 6 365
1 2595 Skylit Midtown Castle 2845 Jennifer Manhattan Midtown 40.75362 -73.98377 Entire home/apt 225 1 45 2019-05-21 0.38 2 355
2 3647 THE VILLAGE OF HARLEM....NEW YORK ! 4632 Elisabeth Manhattan Harlem 40.80902 -73.94190 Private room 150 3 0 NaN NaN 1 365
3 3831 Cozy Entire Floor of Brownstone 4869 LisaRoxanne Brooklyn Clinton Hill 40.68514 -73.95976 Entire home/apt 89 1 270 2019-07-05 4.64 1 194
4 5022 Entire Apt: Spacious Studio/Loft by central park 7192 Laura Manhattan East Harlem 40.79851 -73.94399 Entire home/apt 80 10 9 2018-11-19 0.10 1 0
5 5099 Large Cozy 1 BR Apartment In Midtown East 7322 Chris Manhattan Murray Hill 40.74767 -73.97500 Entire home/apt 200 3 74 2019-06-22 0.59 1 129
6 5121 BlissArtsSpace! 7356 Garon Brooklyn Bedford-Stuyvesant 40.68688 -73.95596 Private room 60 45 49 2017-10-05 0.40 1 0
7 5178 Large Furnished Room Near B'way 8967 Shunichi Manhattan Hell's Kitchen 40.76489 -73.98493 Private room 79 2 430 2019-06-24 3.47 1 220
8 5203 Cozy Clean Guest Room - Family Apt 7490 MaryEllen Manhattan Upper West Side 40.80178 -73.96723 Private room 79 2 118 2017-07-21 0.99 1 0
9 5238 Cute & Cozy Lower East Side 1 bdrm 7549 Ben Manhattan Chinatown 40.71344 -73.99037 Entire home/apt 150 1 160 2019-06-09 1.33 4 188

Now the last ten …

abnb_data.tail(10) 
id name host_id host_name neighbourhood_group neighbourhood latitude longitude room_type price minimum_nights number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365
48885 36482809 Stunning Bedroom NYC! Walking to Central Park!! 131529729 Kendall Manhattan East Harlem 40.79633 -73.93605 Private room 75 2 0 NaN NaN 2 353
48886 36483010 Comfy 1 Bedroom in Midtown East 274311461 Scott Manhattan Midtown 40.75561 -73.96723 Entire home/apt 200 6 0 NaN NaN 1 176
48887 36483152 Garden Jewel Apartment in Williamsburg New York 208514239 Melki Brooklyn Williamsburg 40.71232 -73.94220 Entire home/apt 170 1 0 NaN NaN 3 365
48888 36484087 Spacious Room w/ Private Rooftop, Central loca... 274321313 Kat Manhattan Hell's Kitchen 40.76392 -73.99183 Private room 125 4 0 NaN NaN 1 31
48889 36484363 QUIT PRIVATE HOUSE 107716952 Michael Queens Jamaica 40.69137 -73.80844 Private room 65 1 0 NaN NaN 2 163
48890 36484665 Charming one bedroom - newly renovated rowhouse 8232441 Sabrina Brooklyn Bedford-Stuyvesant 40.67853 -73.94995 Private room 70 2 0 NaN NaN 2 9
48891 36485057 Affordable room in Bushwick/East Williamsburg 6570630 Marisol Brooklyn Bushwick 40.70184 -73.93317 Private room 40 4 0 NaN NaN 2 36
48892 36485431 Sunny Studio at Historical Neighborhood 23492952 Ilgar & Aysel Manhattan Harlem 40.81475 -73.94867 Entire home/apt 115 10 0 NaN NaN 1 27
48893 36485609 43rd St. Time Square-cozy single bed 30985759 Taz Manhattan Hell's Kitchen 40.75751 -73.99112 Shared room 55 1 0 NaN NaN 6 2
48894 36487245 Trendy duplex in the very heart of Hell's Kitchen 68119814 Christophe Manhattan Hell's Kitchen 40.76404 -73.98933 Private room 90 7 0 NaN NaN 1 23

And what about the shape of the data?

abnb_data.shape   
(48895, 16)

We can also verify that all the column names are in snake_case.

abnb_data.columns    
Index(['id', 'name', 'host_id', 'host_name', 'neighbourhood_group',
       'neighbourhood', 'latitude', 'longitude', 'room_type', 'price',
       'minimum_nights', 'number_of_reviews', 'last_review',
       'reviews_per_month', 'calculated_host_listings_count',
       'availability_365'],
      dtype='object')

Now we can check what kind of data each column holds.

abnb_data.dtypes   
id                                  int64
name                               object
host_id                             int64
host_name                          object
neighbourhood_group                object
neighbourhood                      object
latitude                          float64
longitude                         float64
room_type                          object
price                               int64
minimum_nights                      int64
number_of_reviews                   int64
last_review                        object
reviews_per_month                 float64
calculated_host_listings_count      int64
availability_365                    int64
dtype: object

Is it necessary to change the column dtypes?

for col in ['neighbourhood_group','room_type','neighbourhood']:  # We can change the dtype of some columns from 'object' to 'category'
  abnb_data[col] = abnb_data[col].astype('category')

abnb_data['host_id'] = abnb_data['host_id'].astype('object')

abnb_data.drop(columns = ['host_name','name','id','last_review'], inplace = True) # Now we can drop some variables that are not useful
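One practical reason for the category conversion is memory: a category column stores each distinct label once, plus small integer codes. A quick sketch to see the effect on a single column:

# Compare the memory footprint of one column as plain objects vs. as a category.
print(abnb_data['neighbourhood'].astype('object').memory_usage(deep=True))
print(abnb_data['neighbourhood'].memory_usage(deep=True))  # the category version is far smaller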

Finally… Info about the data

abnb_data.info()  # Checking the info about the dataset, we find that the 'reviews_per_month' variable has missing values
# So now let's talk about this ... 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48895 entries, 0 to 48894
Data columns (total 12 columns):
 #   Column                          Non-Null Count  Dtype   
---  ------                          --------------  -----   
 0   host_id                         48895 non-null  object  
 1   neighbourhood_group             48895 non-null  category
 2   neighbourhood                   48895 non-null  category
 3   latitude                        48895 non-null  float64 
 4   longitude                       48895 non-null  float64 
 5   room_type                       48895 non-null  category
 6   price                           48895 non-null  int64   
 7   minimum_nights                  48895 non-null  int64   
 8   number_of_reviews               48895 non-null  int64   
 9   reviews_per_month               38843 non-null  float64 
 10  calculated_host_listings_count  48895 non-null  int64   
 11  availability_365                48895 non-null  int64   
dtypes: category(3), float64(3), int64(5), object(1)
memory usage: 3.6+ MB

Now We Have Problems… Missing Values :(#

Let's locate the missing values.

missing_info = pd.DataFrame()
cols = []
freq = []
for var in abnb_data.columns:
  cols.append(var)
  freq.append(abnb_data[var].isna().sum()) 

missing_info['column'] = cols
missing_info['missing_values'] = freq 
missing_info['percentage'] = missing_info['missing_values'] / len(abnb_data)

missing_info.set_index('column')
missing_values percentage
column
host_id 0 0.000000
neighbourhood_group 0 0.000000
neighbourhood 0 0.000000
latitude 0 0.000000
longitude 0 0.000000
room_type 0 0.000000
price 0 0.000000
minimum_nights 0 0.000000
number_of_reviews 0 0.000000
reviews_per_month 10052 0.205583
calculated_host_listings_count 0 0.000000
availability_365 0 0.000000

Check it out: all the missing values are in just one column, reviews_per_month, so every row with missing values has exactly one.
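As an aside, pandas can build the same summary in a single expression, in case you prefer it over the loop above:

# Equivalent one-liner: missing-value count and share per column.
pd.concat([abnb_data.isna().sum(), abnb_data.isna().mean()], axis=1,
          keys=['missing_values', 'percentage'])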

We can use a heatmap to take a look at the missing values more easily.

plt.figure(figsize=(25, 10))
sns.heatmap(abnb_data.isnull(), cbar=True, cmap='gray')  # .isnull() gives a DataFrame of True/False values... and remember that True = 1, False = 0
plt.xlabel("Column_Name", size=14, weight="bold")
plt.title("Places of missing values in column",fontweight="bold",size=14)
plt.show()
../_images/airbnb_new_york_23_0.png

So, how do we solve that?

Note that whenever there is a null value in the reviews_per_month column, the value of the number_of_reviews column is 0. So we can solve this problem by replacing the null values with 0: a value of 0 in reviews_per_month then means that the listing has 0 reviews per month.
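We can verify that claim directly before touching the data; a quick sanity check:

# Every row with a missing reviews_per_month should have number_of_reviews == 0.
missing_mask = abnb_data['reviews_per_month'].isna()
print((abnb_data.loc[missing_mask, 'number_of_reviews'] == 0).all())  # expect True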

abnb_data['reviews_per_month'].fillna(0.0, inplace = True) # Replacing null values with zeros.

See how the rows that were changed look now.

abnb_data[abnb_data['number_of_reviews'] == 0].head()  
host_id neighbourhood_group neighbourhood latitude longitude room_type price minimum_nights number_of_reviews reviews_per_month calculated_host_listings_count availability_365
2 4632 Manhattan Harlem 40.80902 -73.94190 Private room 150 3 0 0.0 1 365
19 17985 Manhattan East Harlem 40.79685 -73.94872 Entire home/apt 190 7 0 0.0 2 249
26 26394 Manhattan Inwood 40.86754 -73.92639 Private room 80 4 0 0.0 1 0
36 7355 Brooklyn Bedford-Stuyvesant 40.68876 -73.94312 Private room 35 60 0 0.0 1 365
38 45445 Brooklyn Flatbush 40.63702 -73.96327 Private room 150 1 0 0.0 1 365

What About The Outliers?#

First of all … Univariate Analysis

We can start by trying to identify univariate outliers in numerical columns

abnb_numerical = abnb_data.select_dtypes(exclude = ['category','object']).drop(columns = ['calculated_host_listings_count', 'latitude','longitude'])   # Let's build a DataFrame with just the numerical data.
# calculated_host_listings_count, latitude and longitude are numeric columns, but it is not logical to look for outliers in them given what they represent

We can look at descriptive statistics for the variables before trying to identify outliers.

abnb_numerical.describe()   
price minimum_nights number_of_reviews reviews_per_month availability_365
count 48895.000000 48895.000000 48895.000000 48895.000000 48895.000000
mean 152.720687 7.029962 23.274466 1.090910 112.781327
std 240.154170 20.510550 44.550582 1.597283 131.622289
min 0.000000 1.000000 0.000000 0.000000 0.000000
25% 69.000000 1.000000 1.000000 0.040000 0.000000
50% 106.000000 3.000000 5.000000 0.370000 45.000000
75% 175.000000 5.000000 24.000000 1.580000 227.000000
max 10000.000000 1250.000000 629.000000 58.500000 365.000000

We can calculate an upper and a lower bound to determine outliers, using the 1.5 × IQR (Tukey) rule.

def limits(serie):   # returns the [upper, lower] Tukey fences of the series
  iqr = serie.quantile(q = 0.75) - serie.quantile(q = 0.25)
  return [serie.quantile(q = 0.75) + 1.5 * iqr, serie.quantile(q = 0.25) - 1.5 * iqr]

df_limites = pd.DataFrame()
for col in abnb_numerical.columns:
  df_limites[col] = limits(abnb_numerical[col])

df_limites['Limits'] = ['upper','lower']  
df_limites.set_index('Limits', inplace = True )
df_limites
price minimum_nights number_of_reviews reviews_per_month availability_365
Limits
upper 334.0 11.0 58.5 3.89 567.5
lower -90.0 -5.0 -33.5 -2.27 -340.5
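As a quick check of the rule on price, using the quartiles from describe() above: IQR = 175 − 69 = 106, so the upper fence is 175 + 1.5 · 106 = 334 and the lower fence is 69 − 1.5 · 106 = −90, matching the table.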

Let's look at some histograms after applying the limit constraints to the variables.

import matplotlib.pyplot as plt

price = abnb_numerical[(abnb_numerical['price'] < 334.0)]['price']
minimum_nights = abnb_numerical[(abnb_numerical['minimum_nights'] < 11.0)]['minimum_nights']
number_of_reviews = abnb_numerical[(abnb_numerical['number_of_reviews'] < 58.5)]['number_of_reviews']
reviews_per_month = abnb_numerical[(abnb_numerical['reviews_per_month'] < 4.765)]['reviews_per_month']

fig, ax = plt.subplots(2,2,figsize=(15,10))
sns.histplot(price, color = 'red', ax = ax[0,0])
sns.countplot(x = minimum_nights, color = 'red', ax = ax[0,1])  # keyword arg avoids the seaborn FutureWarning
sns.histplot(number_of_reviews, color = 'red', ax = ax[1,0])
sns.histplot(reviews_per_month, color = 'red', ax = ax[1,1])

ax[0,0].set_title('Price',fontsize = 14)
ax[0,1].set_title('Minimum_Nights', fontsize = 14)
ax[1,0].set_title('Number_of_Reviews', fontsize = 14)
ax[1,1].set_title('Reviews_per_month', fontsize = 14)

fig.tight_layout()
plt.show()
../_images/airbnb_new_york_37_1.png

However, if we want to apply any modification to the data, it is better to analyze the columns in a multivariate way.

from scipy.spatial.distance import mahalanobis
from scipy.stats import chi2
import numpy as np 

The Mahalanobis distance measures how far a point lies from the center of a multivariate distribution, taking the covariance between variables into account. It is often used to find outliers in statistical analyses that involve several variables.
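Formally, for an observation $x$, with $\mu$ the sample mean vector and $\Sigma$ the covariance matrix of the numerical columns, the distance is

$$D_M(x) = \sqrt{(x - \mu)^{\top} \Sigma^{-1} (x - \mu)}$$

which is exactly what scipy's mahalanobis computes below when given the inverse covariance matrix.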

# Invert the covariance matrix once, then measure every row's distance from the mean vector
inv_cov = np.linalg.inv(np.cov(abnb_numerical.values.T))
center = abnb_numerical.mean()

mahal_distances = [mahalanobis(row, center, inv_cov) for row in abnb_numerical.to_numpy()]

k = abnb_data.shape[0]

abnb_data['mahal_distances'] = mahal_distances
abnb_data['p_value'] = 1 - chi2.cdf(abnb_data['mahal_distances'], k-1)
abnb_data = abnb_data[abnb_data['p_value'] > 0.05]  # keep rows not flagged as multivariate outliers

After applying the Mahalanobis distance to the data, the conclusion is that we cannot drop any rows. Perhaps the most interesting outliers were in the price column, but we can discuss this later in relation to variables such as location, room type and others.

Now… Let's Analyze Duplicates#

There are two kinds of duplicates. The first is explicit duplicates, which happens when rows are exactly equal, so we can use drop_duplicates() from pandas to remove them.

print(f'Data shape before use drop_duplicates() : {abnb_data.shape}')
abnb_data.drop_duplicates(inplace = True)
print(f'Data Shape after drop_duplicates() : {abnb_data.shape}')
Data shape before use drop_duplicates() : (48895, 14)
Data Shape after drop_duplicates() : (48895, 14)

The second type is implicit duplicates: values that are written differently but represent the same data. The columns where this could happen are price, minimum_nights and number_of_reviews; however, since they are stored as int64, we don't have that problem here.

abnb_data.select_dtypes(include = 'int64').columns
Index(['price', 'minimum_nights', 'number_of_reviews',
       'calculated_host_listings_count', 'availability_365'],
      dtype='object')
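For string columns the risk would be real; a hypothetical illustration (the values here are invented, not taken from this dataset) of how such duplicates could be normalized before comparing:

# Hypothetical example: these three labels are implicit duplicates of each other.
labels = pd.Series(['Manhattan', 'manhattan ', 'MANHATTAN'])
print(labels.str.strip().str.lower().nunique())  # 1 distinct value after normalizing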

Let’s change the point of view#

There is a problem with this specific data: the column calculated_host_listings_count indicates the number of rows a single host_id has in the data, which means the same Airbnb listing may appear more than once.

All this becomes a problem because the main idea of our data visualization is to understand information about the total number of Airbnbs without counting them repeatedly.

Keeping a record of several uses of the same listing is understandably useful on some occasions, but in this case it is not desirable. Since the latitudes and longitudes are not exactly the same, there is no exact way of knowing that two rows describe the same listing, so we have to think of a solution that separates listings by a minimum distance.

So we are going to propose a solution: a function that removes all Airbnb listings that are within a shorter distance than the one indicated. After that we will develop the data analysis.
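The distance computation itself will rely on geopy; a minimal example of the helper we use (the coordinates are sample points, chosen only for illustration):

import geopy.distance

coords_1 = (40.64749, -73.97237)  # two nearby sample points
coords_2 = (40.64755, -73.97240)
print(geopy.distance.geodesic(coords_1, coords_2).m)  # geodesic distance in meters (a few meters here)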

Let’s rename the data, just to make the code easier :)

data = abnb_data.drop(columns = ['mahal_distances','p_value','host_id'])
import geopy.distance

def droping_duplicates_abnbs(distance):
  # Only hosts with more than one listing can produce these near-duplicate rows
  nbhds = list(data[data['calculated_host_listings_count'] != 1]['neighbourhood'].value_counts().index)
  final_data = pd.DataFrame(columns = data.columns)
  print('Loading... It could take a few minutes')
  for nbh in nbhds:
    idx_lst = []  # indices flagged as near-duplicates within this neighbourhood
    dinamic_data = data[data['neighbourhood'] == nbh]
    locs = [[lat, long] for (lat, long) in zip(dinamic_data['latitude'], dinamic_data['longitude'])]
    for idx, i in zip(list(dinamic_data.index), locs):
      for idx_, j in zip(list(dinamic_data.index), locs):
        if i == j:
          continue
        vector_dst = geopy.distance.geodesic(i, j).m  # geodesic distance in meters (vincenty was removed in geopy 2.x)
        if vector_dst < distance and idx_lst.count(idx) == 0:
          # flag the neighbour only if the current listing is not itself flagged,
          # so that one listing of each close pair survives
          idx_lst.append(idx_)
    final_data = pd.concat([final_data, dinamic_data.drop(idx_lst)])
  print('Now the data is ready, enjoy it :)')
  return pd.concat([final_data,abnb_data[abnb_data['calculated_host_listings_count'] == 1]])
# data = droping_duplicates_abnbs(100)
# data.reset_index(inplace = True)
# data.to_csv('abnb_data.csv')

Time For Data Visualization#

Next we are going to structure a data visualization divided into univariate, bivariate and multivariate views. Before that, however, we need to reapply the Mahalanobis distance and the univariate outlier analysis to our data, since the data changed with the modification we just made.

data = pd.read_csv('https://raw.githubusercontent.com/BautistaDavid/Proyectos_ClaseML/main/data/abnb_data.csv')  # we can import the new data 
data.drop(columns = ['Unnamed: 0','index'], inplace = True)

# Let's analyze the Mahalanobis distances again
data_numerical = data[['price', 'minimum_nights', 'number_of_reviews', 'reviews_per_month']]

# Invert the covariance matrix once, then measure every row's distance from the mean vector
inv_cov = np.linalg.inv(np.cov(data_numerical.values.T))
center = data_numerical.mean()
mahal_distances = [mahalanobis(row, center, inv_cov) for row in data_numerical.to_numpy()]

k = abnb_data.shape[0]

data['mahal_distances'] = mahal_distances
data['p_value'] = 1 - chi2.cdf(data['mahal_distances'], k-1)
data = data[data['p_value'] > 0.05]
# We can also analyze the univariate outliers by reusing limits()
df_limites_ = pd.DataFrame()
for col in data_numerical.columns:
  df_limites_[col] = limits(data_numerical[col])

df_limites_['Limits'] = ['upper','lower']  
df_limites_.set_index('Limits',inplace = True )
df_limites_

# The mahalanobis distance and univariate outliers info are ready
price minimum_nights number_of_reviews reviews_per_month
Limits
upper 332.5 8.5 61.0 3.44
lower -87.5 -3.5 -35.0 -2.00

The Mahalanobis distance and univariate outlier info are ready, so we can continue.

Univariate visualization#

neighbourhood_group

We can plot a bar graph to identify the neighbourhood groups with the most Airbnbs in New York. But keep in mind that these neighbourhood groups have different sizes, so we can also create a metric that tells us the number of Airbnbs per square kilometer in each group.

from matplotlib import rcParams
plt.figure(figsize = (8,5))   # Just to configure the figure size
sns.set(style='whitegrid', rc={"grid.linewidth": 0.1})  # Seaborn style
sns.set_context("notebook", font_scale=1.25)            # More Seaborn style
rcParams['axes.titlepad'] = 20                          # Set a space between the title and the figure
plt.xticks(rotation=45)
neigh_plot = sns.countplot(x = 'neighbourhood_group' , data = data).set_title('Airbnbs per neighborhood group\n in New York',fontsize = 16) 
../_images/airbnb_new_york_62_0.png

Note that the most common Airbnb districts are Brooklyn and Manhattan. Also note that the gap between the second and third districts is apparently large. This could be because Brooklyn and Manhattan have more tourist attractions than the others.

freqs = data['neighbourhood_group'].value_counts()
index_abnb_per_km = pd.DataFrame({'neig_group':freqs.index, 'freq_abnbs':freqs,'km_2':[59,183,283,109,151]})
index_abnb_per_km['abnbs/km_2'] = index_abnb_per_km['freq_abnbs'] / index_abnb_per_km['km_2']
index_abnb_per_km
neig_group freq_abnbs km_2 abnbs/km_2
Brooklyn Brooklyn 17298 59 293.186441
Manhattan Manhattan 17109 183 93.491803
Queens Queens 5638 283 19.922261
Bronx Bronx 1358 109 12.458716
Staten Island Staten Island 459 151 3.039735
plt.figure(figsize = (8,5))   # Just to configure the figure size
sns.set(style='whitegrid', rc={"grid.linewidth": 0.1})  # Seaborn style
sns.set_context("notebook", font_scale=1.25)            # More Seaborn style
rcParams['axes.titlepad'] = 20                          # Set a space between the title and the figure
plt.xticks(rotation=45)
neigh_km_plot = sns.barplot(data = index_abnb_per_km, x = 'neig_group', y = 'abnbs/km_2').set_title(r'Airbnbs per $km^2$ in neighborhoods groups of New York',fontsize = 16) 
../_images/airbnb_new_york_65_0.png

The Airbnbs per square kilometer indicator shows that the frequency order of the districts stays the same. However, it is interesting that Brooklyn leads by a wide margin despite being the smallest district.

neighbourhood

Now we can plot a heat map over a map of New York to identify the main neighbourhoods, those that have more Airbnb listings than the others. If you are on GitHub and can't see the map, follow this link.

import folium
from folium import plugins
import matplotlib.pyplot as plt

stationArr = data[['latitude', 'longitude']].to_numpy()

fol = folium.Map(location = [40.727, -74.097],zoom_start = 11 )
fol.add_child(plugins.HeatMap(stationArr, radius=14)) # plotting the heatmap

fol

The heat map lets us verify that most Airbnb listings are concentrated in Brooklyn and Manhattan.

room_type

Now we can see a countplot of the room type categories in the data.

plt.figure(figsize = (8,5))   # Just to configure the figure size
sns.set(style='whitegrid', rc={"grid.linewidth": 0.1})  # Seaborn style
sns.set_context("notebook", font_scale=1.25)            # More Seaborn style
rcParams['axes.titlepad'] = 20                          # Set a space between the title and the figure
plt.xticks(rotation=45)
neigh_plot = sns.countplot(x = 'room_type' , data = data).set_title('Airbnbs room type in New York',fontsize = 16) 
../_images/airbnb_new_york_71_0.png

Here we can see that the most frequent room type is the entire home or apartment.


Below you can find distribution diagrams for the numerical variables in the data. The interesting thing is that we draw two graphs per variable: the first is the distribution after applying the univariate limits, and the second is the distribution of the total data. You can also see descriptive statistics for both.

price

price_data_univariate = data[data['price'] < df_limites_.loc['upper','price']]
fig, ax = plt.subplots(1,2,figsize = (12,4))
plt.subplots_adjust(top=0.75) 
fig.suptitle('Price Violinplots with and without\n univariate outlier restrictions', fontsize=16)
sns.set_context("notebook", font_scale=1.25) 

with_ = sns.violinplot(data = price_data_univariate, x = 'price',ax = ax[0],palette="Set2").set_title('With')
without_ = sns.violinplot(data = data, x = 'price',ax = ax[1],palette="Set2").set_title('Without')
../_images/airbnb_new_york_75_0.png
inf = pd.concat([price_data_univariate['price'].describe(),data['price'].describe()], axis = 1)
inf.columns = ['Price with univariate limits','Price Total Data']
inf
Price with univariate limits Price Total Data
count 39449.000000 41862.000000
mean 120.987249 153.248173
std 66.684526 249.940511
min 0.000000 0.000000
25% 70.000000 70.000000
50% 100.000000 110.000000
75% 159.000000 175.000000
max 332.000000 10000.000000

minimum_nights

minimum_nights_data_univariate = data[data['minimum_nights'] < df_limites_.loc['upper','minimum_nights']]
fig, ax = plt.subplots(1,2,figsize = (12,5))
plt.subplots_adjust(top=0.7, wspace = 0.3) 
fig.suptitle('Minimum_nights Boxplots with and without \nunivariate outlier restrictions', fontsize=16)
sns.set_context("notebook", font_scale=1.25) 

with_ = sns.boxplot(data = minimum_nights_data_univariate, x = 'minimum_nights',ax = ax[0],palette="Set2").set_title('With')
without_ = sns.boxplot(data = data, x = 'minimum_nights',ax = ax[1],palette="Set2").set_title('Without')
../_images/airbnb_new_york_78_0.png
inf = pd.concat([minimum_nights_data_univariate['minimum_nights'].describe(),data['minimum_nights'].describe()], axis = 1)
inf.columns = ['minimum_nights with univariate limits','minimum_nights Total Data']
inf
minimum_nights with univariate limits minimum_nights Total Data
count 37472.000000 41862.000000
mean 2.689982 6.092112
std 1.684817 21.981921
min 1.000000 1.000000
25% 1.000000 1.000000
50% 2.000000 2.000000
75% 3.000000 4.000000
max 8.000000 1250.000000

number_of_reviews

number_of_reviews_data_univariate = data[data['number_of_reviews'] < df_limites_.loc['upper','number_of_reviews']]
fig, ax = plt.subplots(1,2,figsize = (12,5))
plt.subplots_adjust(top=0.75,wspace=0.3) 
fig.suptitle('number_of_reviews Violinplots with and without\n univariate outlier restrictions', fontsize=16)
sns.set_context("notebook", font_scale=1.25) 

with_ = sns.violinplot(data = number_of_reviews_data_univariate, x = 'number_of_reviews',ax = ax[0],palette="Set2").set_title('With')
without_ = sns.violinplot(data = data, x = 'number_of_reviews',ax = ax[1],palette="Set2").set_title('Without')
../_images/airbnb_new_york_81_0.png
inf = pd.concat([number_of_reviews_data_univariate['number_of_reviews'].describe(),data['number_of_reviews'].describe()], axis = 1)
inf.columns = ['number_of_reviews with univariate limits','number_of_reviews Total Data']
inf
number_of_reviews with univariate limits number_of_reviews Total Data
count 36732.000000 41862.000000
mean 10.062289 24.183293
std 13.960237 46.078980
min 0.000000 0.000000
25% 1.000000 1.000000
50% 4.000000 5.000000
75% 14.000000 25.000000
max 60.000000 629.000000

reviews_per_month

reviews_per_month_data_univariate = data[data['reviews_per_month'] < df_limites_.loc['upper','reviews_per_month']]
fig, ax = plt.subplots(1,2,figsize = (12,5))
plt.subplots_adjust(top=0.7, wspace = 0.2) 
fig.suptitle('reviews_per_month Violinplots with and without \nunivariate outlier restrictions', fontsize=16)
sns.set_context("notebook", font_scale=1.25) 

with_ = sns.violinplot(data = reviews_per_month_data_univariate, x = 'reviews_per_month',ax = ax[0],palette="Set2").set_title('With')
without_ = sns.violinplot(data = data, x = 'reviews_per_month',ax = ax[1],palette="Set2").set_title('Without')
../_images/airbnb_new_york_84_0.png
inf = pd.concat([reviews_per_month_data_univariate['reviews_per_month'].describe(),data['reviews_per_month'].describe()], axis = 1)
inf.columns = ['reviews_per_month with univariate limits','reviews_per_month Total Data']
inf
reviews_per_month with univariate limits reviews_per_month Total Data
count 38557.000000 41862.000000
mean 0.670479 1.010094
std 0.877857 1.509071
min 0.000000 0.000000
25% 0.030000 0.040000
50% 0.240000 0.320000
75% 1.000000 1.400000
max 3.430000 20.940000

availability_365

plt.figure(figsize = (8,5))   # Just to configure the figure size
sns.set(style='whitegrid', rc={"grid.linewidth": 0.1})  # Seaborn style
sns.set_context("notebook", font_scale=1.25)
rcParams['axes.titlepad'] = 20
# sns.displot(data['availability_365'])
aval = sns.histplot(data = data, x = 'availability_365',palette="Set2",kde = True).set_title('availability_365 Histogram')
../_images/airbnb_new_york_87_0.png

Bivariate visualization#

neighbourhood_group vs room_type

Below we can see a count chart that combines the information on districts and room types. Note that in the districts with more Airbnbs, entire homes/apartments are the majority, while in the rest of the districts private rooms are the most common type.

plt.figure(figsize = (8,5))   # Just to configure the figure size
sns.set(style='whitegrid', rc={"grid.linewidth": 0.1})  # Seaborn style
sns.set_context("notebook", font_scale=1.25)            # More Seaborn style
rcParams['axes.titlepad'] = 20                          # Set a space between the title and the figure
plt.xticks(rotation=45)
neigh_plot = sns.countplot(x = 'neighbourhood_group' , hue = 'room_type',data = data).set_title('Airbnbs per neighborhood group\n and room type in New York',fontsize = 16) 
../_images/airbnb_new_york_91_0.png

Below you can see distribution graphs (boxplots and violin plots) of the variables price, reviews_per_month and minimum_nights, categorized by the variable neighbourhood_group. This is done both on the sample with the univariate outlier limits applied and on the total data.

neighbourhood_group vs price

price_data_univariate = data[data['price'] < df_limites_.loc['upper','price']]
fig, ax = plt.subplots(1,2,figsize = (16,5))
plt.subplots_adjust(top=0.75,wspace = 0.3) 
ax[0].set_xticklabels(list(data['neighbourhood_group'].value_counts().index), rotation = 30)
ax[1].set_xticklabels(list(data['neighbourhood_group'].value_counts().index), rotation = 30)

fig.suptitle('Price Violinplots per neighbourhood_group with and without\n univariate outlier restrictions', fontsize=16)
sns.set_context("notebook", font_scale=1.25) 

with_ = sns.violinplot(data = price_data_univariate, x = 'neighbourhood_group', y = 'price' ,ax = ax[0]).set_title('With')
without_ = sns.violinplot(data = data, x = 'neighbourhood_group', y = 'price', ax = ax[1]).set_title('Without')
../_images/airbnb_new_york_94_0.png

neighbourhood_group vs minimum_nights

minimum_nights_data_univariate = data[data['minimum_nights'] < df_limites_.loc['upper','minimum_nights']]
fig, ax = plt.subplots(1,2,figsize = (16,5))
plt.subplots_adjust(top=0.75,wspace = 0.3) 
ax[0].set_xticklabels(list(data['neighbourhood_group'].value_counts().index), rotation = 30)
ax[1].set_xticklabels(list(data['neighbourhood_group'].value_counts().index), rotation = 30)

fig.suptitle('minimum_nights Boxplots per neighbourhood_group with and without\n univariate outlier restrictions', fontsize=16)
sns.set_context("notebook", font_scale=1.25) 


with_ = sns.boxplot(data = minimum_nights_data_univariate, x = 'neighbourhood_group', y = 'minimum_nights', ax = ax[0]).set_title('With')
without_ = sns.boxplot(data = data, x = 'neighbourhood_group', y = 'minimum_nights', ax = ax[1]).set_title('Without')
../_images/airbnb_new_york_96_0.png

neighbourhood_group vs reviews_per_month

reviews_per_month_data_univariate = data[data['reviews_per_month'] < df_limites_.loc['upper','reviews_per_month']]
fig, ax = plt.subplots(1,2,figsize = (16,5))
plt.subplots_adjust(top=0.75,wspace = 0.3) 
ax[0].set_xticklabels(list(data['neighbourhood_group'].value_counts().index), rotation = 30)
ax[1].set_xticklabels(list(data['neighbourhood_group'].value_counts().index), rotation = 30)

fig.suptitle('reviews_per_month Violinplots per neighbourhood_group with and without\n univariate outlier restrictions', fontsize=16)
sns.set_context("notebook", font_scale=1.25) 


with_ = sns.violinplot(data = reviews_per_month_data_univariate, x = 'neighbourhood_group', y = 'reviews_per_month', ax = ax[0]).set_title('With')
without_ = sns.violinplot(data = data, x = 'neighbourhood_group', y = 'reviews_per_month', ax = ax[1]).set_title('Without')
../_images/airbnb_new_york_98_0.png

Multivariate Analysis#

Next, we are going to plot heat maps of the correlations between the variables using different methods. We can expect some variables to be highly related, for example number_of_reviews and reviews_per_month; it would also be interesting to observe the degree of correlation between price and the number of reviews, or the minimum number of nights.
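As a quick reminder of the difference between the two methods, here is a toy illustration (invented series, not from the Airbnb data): Spearman measures monotonic association, so a monotonic but nonlinear relation gets a perfect Spearman score while Pearson understates it.

# Toy example: x vs x**3 is monotonic but nonlinear.
x = pd.Series(np.arange(1, 101, dtype=float))
y = x ** 3
print(x.corr(y))                       # Pearson: high, but below 1
print(x.corr(y, method='spearman'))    # Spearman: exactly 1.0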

plt.figure(figsize = (15,7))
plt.title('Correlation of variables Pearson method (linear correlations)')
heat1 = sns.heatmap(data_numerical.corr(), annot=True,square = True, linewidths=1)
../_images/airbnb_new_york_100_0.png
plt.figure(figsize = (15,7))
plt.title('Correlation of variables Spearman method (monotonic correlations)')
heat2 = sns.heatmap(data_numerical.corr(method = 'spearman'), annot=True,square = True, linewidths=1)
../_images/airbnb_new_york_101_0.png

Next we are going to generate subplots comparing price against two other variables at the same time, to study the price distribution more specifically.

By the way, we are going to build a Python function to avoid repeating lines of code when creating the diagrams.

def multi_boxplots(data, variable, title):
  plt.figure(figsize = (15,5))
  sns.set_context("notebook", font_scale=1.25)
  sns.boxplot(data = data, y = variable ,x = 'neighbourhood_group',hue = 'room_type').set_title(title)
  plt.legend(bbox_to_anchor=(1.02, 1), loc='upper left', borderaxespad=0)

multi_boxplots(price_data_univariate, 'price', 'Price Boxplots per neighbourhood_group and room_type\n with univariate outlier restrictions' )
../_images/airbnb_new_york_104_0.png
multi_boxplots(data, 'price', 'Price Boxplots per neighbourhood_group and room_type\n without univariate outlier restrictions' )
../_images/airbnb_new_york_105_0.png
multi_boxplots(reviews_per_month_data_univariate, 'reviews_per_month', 'reviews_per_month Boxplots per neighbourhood_group and room_type\n with univariate outlier restrictions' )
../_images/airbnb_new_york_106_0.png
multi_boxplots(data, 'reviews_per_month', 'reviews_per_month Boxplots per neighbourhood_group and room_type\n without univariate outlier restrictions' )
../_images/airbnb_new_york_107_0.png
data.pivot_table(index = ['neighbourhood_group'],values = 'price', aggfunc= 'mean')
# mean of the prices by neighbourhood_group
price
neighbourhood_group
Bronx 89.509573
Brooklyn 129.074228
Manhattan 198.666959
Queens 106.293189
Staten Island 136.647059
import numpy as np
data.pivot_table(index = ['neighbourhood_group'],values = 'price',aggfunc=np.std)
# standard deviation of the prices by neighbourhood_group
price
neighbourhood_group
Bronx 100.414386
Brooklyn 187.441384
Manhattan 307.191291
Queens 215.105580
Staten Island 346.356282
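Both summaries can also be produced in a single call by passing a list of aggregation functions, an equivalent and more compact form:

# Mean and standard deviation of price per neighbourhood_group in one pivot table
data.pivot_table(index = 'neighbourhood_group', values = 'price', aggfunc = ['mean', 'std'])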

Recommendations For Airbnb Data Generators#

It would be advisable to create a system that associates a single latitude/longitude location with each Airbnb, in order to perform analyses of the total number of listings from a different perspective.

On the other hand, this analysis and visualization could support a deeper study, for example a supervised learning model that predicts the price of a listing.
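As a closing illustration of that idea, here is a minimal baseline sketch, assuming scikit-learn is available; the feature choice is arbitrary and only meant to show the shape of such a model, not a tuned result:

# A hypothetical baseline: linear regression of price on a few listing features.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

features = pd.get_dummies(data[['neighbourhood_group', 'room_type', 'minimum_nights',
                                'number_of_reviews', 'availability_365']], drop_first = True)
target = data['price']
X_train, X_test, y_train, y_test = train_test_split(features, target, random_state = 0)
model = LinearRegression().fit(X_train, y_train)
print(model.score(X_test, y_test))  # R^2 on held-out data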

Some Interesting Conclusions#

As might be expected, the analysis carried out allows us to observe that Manhattan is the district where the median price tends to be higher. Likewise, reviewing the pivot tables, we can see that this district has the highest average price; however, it also has a standard deviation above the other districts.

A large part of the high prices in Manhattan may be due to the characteristics of tourism there. So, with the variables present in the Airbnb records, it may only be possible to measure the effect on price of the characteristics of the listing itself, without taking into account other variables that could be interesting.