About Me
Hello! My name is Dante Navaza, and I am a passionate computer science student at Pontifícia Universidade Católica do Rio de Janeiro (PUC Rio). I am fascinated by technology, programming, and machine learning, and I am excited about using these tools to create innovative solutions that can positively impact the world.
My Interests
I have a strong interest in machine learning, game development, and data analysis. I enjoy exploring different algorithms and frameworks to understand how they work and how they can be applied to solve real-world problems. I also have experience and interest in other areas of computer science, such as software development and web development.
My Skills
- Programming Languages: Proficient in Python and C#, with experience in Java, C, and JavaScript
- Game Development: Proficient with the Unity engine and 3D modeling tools such as Blender and ZBrush
- Data Analysis Tools: Proficient with Pandas, NumPy, scikit-learn, and graphing tools such as Matplotlib and Seaborn
- Software Development: Knowledge of software engineering principles and version control systems like Git
- Deployment Platforms: Experience with various deployment platforms and methods, such as Streamlit and Render
- Languages: Fluent in English, Spanish, and Portuguese
My Experience
I have worked on several projects during my studies. Some include:
- Automation projects in which an algorithm generates a report and automatically emails it to a desired user
- Machine learning projects in which I developed a predictive model to forecast a company's future sales
- Web development: I'm currently building my own portfolio and e-commerce websites
- Game development: I have developed 5 complete games in Unity and Construct 3
My Goals
As I continue my studies, I aim to further develop my skills, especially in machine learning and related fields, which is why I decided to develop this project. Ultimately, my goal is to contribute to projects that have a positive social or environmental impact, such as this one, which aims to help Airbnb renters and property owners.
Contact Me
If you'd like to learn more about my work or discuss potential collaborations or business inquiries, feel free to connect with me on LinkedIn or reach out via email at dantenavaza2005@gmail.com. Alternatively, you can also check out my other projects on my GitHub portfolio!
Introduction
Wiki guide
The documentation for this project was created using mdBook (the documentation for mdBook can be found on its official GitHub page). The development process is documented in chapters, which you can navigate using either the table of contents or the left and right arrow keys on your keyboard.
The toolbar at the top of the page contains four different buttons:
- Toggle the table of contents
- Change the page's color theme
- Search for a specific word throughout the document (Shortcut: key 's')
- Print the entire book
Note: the full source for this wiki is also located inside the project's GitHub repository.
Focus of the Project
This project focuses on the field of machine learning, specifically in the context of Airbnb. The objective was to create a model that can predict the daily rental price for Airbnb properties.
We explored various types of machine learning algorithms, considering supervised, unsupervised, and reinforcement learning. For this project, we chose to implement supervised learning with a focus on regression as it was most suitable for our dataset and objectives.
Machine Learning Considerations
During the development process, we took care to ensure the model was robust and generalizable. One of the main challenges in machine learning is overfitting, where a model becomes too tailored to the training data, resulting in poor performance on new, unseen data. To mitigate this risk, we employed several data treatment techniques that will be seen in the following chapters.
By incorporating these precautions, we aimed to create a model that not only performed well on the training set but also demonstrated strong generalization capabilities when applied to real-world data.
Context and Objective
Airbnb allows anyone with a spare room or property of any type (apartment, house, chalet, inn, etc.) to list their property for rent on a daily basis.
As a host, you create your profile and list your property. In this listing, hosts should provide a comprehensive description of the property to assist renters/travelers in choosing the best accommodation and to make their listing more appealing.
There are numerous customizations available in the listing, ranging from minimum stay requirements, pricing, number of rooms, to cancellation policies, extra guest fees, identity verification requirements for renters, etc.
Our Objective
To build a price prediction model that enables property owners to determine the appropriate daily rate for their property. Additionally, to assist renters in evaluating whether a listed property offers a competitive price compared to similar properties with similar characteristics.
Available Resources, Inspirations, and Credits
The datasets were sourced from Kaggle: https://www.kaggle.com/datasets/allanbruno/airbnb-rio-de-janeiro. Data spans from April 2018 to May 2020, with the exception of June 2018, which lacks data.
Given GitHub's 50 MB per-file size limit, the datasets used in this project can be downloaded from this link: https://drive.google.com/file/d/1_RtxDTXtF3CGvioFi1_ophzLNmyguEHl/view?usp=sharing. Alternatively, you can download the datasets directly from Kaggle; however, note that results may differ if the datasets have been updated since the project was created.
- File names are in Brazilian Portuguese
- The datasets contain property prices and their respective characteristics for each month.
- Prices are listed in Brazilian Real (R$).
Initial Expectations
- Seasonality is expected to be a significant factor, as months like December tend to have higher prices in Rio de Janeiro.
- Property location is likely to have a substantial impact on pricing, given that location can drastically alter the characteristics of a place (e.g., safety, natural beauty, proximity to tourist attractions).
- Additional amenities may have a significant impact, considering the prevalence of older buildings and houses in Rio de Janeiro.
We aim to explore the extent to which these factors influence pricing and identify any less intuitive yet crucial factors.
Installing libraries
Download the requirements.txt file and, in the command prompt, run:
pip install -r requirements.txt
Importing libraries
#? for path management
import os
import pathlib
import joblib
#? for data manipulation
import pandas as pd
import numpy as np
#? for data visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import webbrowser
#? for machine learning
from sklearn.metrics import r2_score, root_mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.model_selection import train_test_split
Importing and consolidating the database
The dataset comprises 25 databases stored in separate tables. To simplify processing and analysis, a consolidated database named "main_dataframe" was created by integrating all of them. Using the Python pandas library, each database, stored as a CSV file, was imported individually and appended to an initially empty list named "df_list". The pandas function pd.concat() was then used to concatenate all the individual dataframes into a single cohesive dataset.
Furthermore, each dataframe contains a column representing dates, but these dates are inconsistently formatted strings. To address this, the month and year were instead extracted from the file names and stored in corresponding columns of the main_dataframe, ensuring consistency and facilitating further analytical operations.
dataset_folder = pathlib.Path('dataset')
months = {'jan': 1, 'fev': 2, 'mar': 3, 'abr': 4, 'mai': 5, 'jun': 6, 'jul': 7, 'ago': 8, 'set': 9, 'out': 10, 'nov': 11, 'dez': 12}
df_list = []
#* concatenating all the databases into one while adding the month and year columns
for file in dataset_folder.iterdir() :
month = months[file.name[:3]]
year = int(file.name[-8:].replace('.csv', ''))
df = pd.read_csv(file)
df['month'] = month
df['year'] = year
df_list.append(df)
main_dataframe = pd.concat(df_list)
Steps taken in treating the data
This data science project follows an eight-step process:
- Problem Definition : Clearly articulate the challenge or business question to address with data-driven insights.
- Domain Understanding : Acquire in-depth knowledge of the business area or industry context, focusing on key stakeholders and processes.
- Data Acquisition : Extract the relevant dataset from available sources, ensuring it meets the project requirements.
- Data Cleaning and Preprocessing : Perform data transformation, handling missing values, and correcting inconsistencies to prepare for analysis.
- Exploratory Data Analysis (EDA) : Conduct comprehensive EDA to understand data distributions, identify patterns, and detect anomalies.
- Model Development : Construct and train machine learning models using appropriate algorithms and techniques to meet the project objectives.
- Result Interpretation : Analyze the model outputs, validate the results, and extract actionable insights.
- Deployment and Production : Implement the final model in a production environment, ensuring scalability, reliability, and monitoring capabilities.
Reducing the excessive number of columns
As there are a lot of columns, we are going to reduce their number to improve the algorithm's efficiency.
Furthermore, a quick analysis of the data shows a significant number of columns that are redundant for the prediction model, leading to the removal of the following:
- Any column with data that won't influence the final price, such as images, verification methods, etc.
- ID columns: these values carry no predictive information and could interfere with the final results if they were included in the calculations.
- Repeated columns: the data contains many columns that are repeated or very similar to each other, such as the date, state, and country.
- Any column that contains hyperlinks or free-form text: besides not containing relevant data for the desired result, it could interfere with the prediction model.
- Any column with over 30% of its data missing.
After this analysis we were left with the following columns:
filtered_columns = ['host_response_time','host_response_rate','host_is_superhost','host_listings_count','latitude','longitude','property_type','room_type','accommodates','bathrooms','bedrooms','beds','bed_type','amenities','price','security_deposit','cleaning_fee','guests_included','extra_people','minimum_nights','maximum_nights','number_of_reviews','review_scores_rating','review_scores_accuracy','review_scores_cleanliness','review_scores_checkin','review_scores_communication','review_scores_location','review_scores_value','instant_bookable','is_business_travel_ready','cancellation_policy','year','month']
main_dataframe = main_dataframe.loc[:, filtered_columns]
A spreadsheet file with the first 900 rows (a semicolon-separated CSV, which opens directly in Excel) was generated in order to do a quick analysis of the data.
main_dataframe.head(900).to_csv('first_900_rows.csv', sep=';', index=False)
Treating missing values
Utilizing the command:
print(main_dataframe.isnull().sum())
We can see the amount of null values in each column:
Upon inspecting the dataset, it became apparent that certain columns contain a substantial number of null values. As stated above, columns with over 30% of their data missing will be discarded entirely, while those with fewer null values will have the affected rows removed to ensure data integrity.
row_30_percent = main_dataframe.shape[0] * 0.3
for collumn in main_dataframe :
if main_dataframe[collumn].isnull().sum() >= row_30_percent :
main_dataframe = main_dataframe.drop(collumn, axis=1)
After removing said columns we were left with these:
print(main_dataframe.isnull().sum())
As there isn't a significant number of null values left, we will simply drop the affected rows.
main_dataframe = main_dataframe.dropna()
In the end, none of the remaining columns will have any null values, leaving the dataset ready for further analysis and maximizing the accuracy and reliability of our results.
Verifying the data types of each column
After inspecting the data types of each column, it is evident that most conform to their intended types. However, the 'price' and 'extra_people' columns are represented as objects (strings) rather than numbers, so they need to be converted to an appropriate numeric data type.
To inspect the data types, we used the following commands:
print(main_dataframe.dtypes)
print('-'*60)
print(main_dataframe.iloc[0])
Additionally, all data types of float64 and int64 will be converted to their 32-bit variants to optimize memory usage.
for collumn in main_dataframe :
if main_dataframe[collumn].dtype == 'float64' :
main_dataframe[collumn] = main_dataframe[collumn].astype(np.float32)
elif main_dataframe[collumn].dtype == 'int64' :
main_dataframe[collumn] = main_dataframe[collumn].astype(np.int32)
for collumn in ['price', 'extra_people'] :
main_dataframe[collumn] = main_dataframe[collumn].str.replace(',', '').str.replace('$', '').astype(np.float32, copy=False) #used np.float32 to reduce memory usage
After the conversions, the 'price' and 'extra_people' columns were effectively converted to float32, alongside all the other columns that were float64 or int64, which were converted to their 32-bit variants.
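As an optional sanity check (not part of the original code), pandas can report the resulting dtypes and memory footprint, which should confirm the savings from the 32-bit conversion:
#? optional check: report per-column dtypes and total memory usage after the conversion
main_dataframe.info(memory_usage='deep')
print(f"Total memory: {main_dataframe.memory_usage(deep=True).sum() / 1e6:.1f} MB")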
Exploratory analysis and treatment of outliers for numerical values
We will essentially examine each feature to:
- Conduct a correlation analysis among the features to ascertain their interrelationships and determine whether to retain all of them. Features so strongly correlated that they provide redundant information to the model will be removed.
- Eliminate outliers (using the rule that values below Q1 - 1.5 * IQR and above Q3 + 1.5 * IQR will be excluded, where the Interquartile Range (IQR) = Q3 - Q1).
- Verify whether all features are relevant to our model or whether any of them will not contribute and should be removed.
The analysis will proceed as follows:
- We will begin by creating graphs to facilitate parts of our analysis.
- Then, we will analyze the 'price' column (the ultimate target variable) and 'extra_people' (also a monetary value). These are continuous numerical values.
- Next, we will analyze the columns of discrete numerical values (accommodates, bedrooms, guests_included, etc.).
- Finally, we will evaluate the text columns and determine which categories make sense to retain or discard.
NOTE: If the x axis of a graph is slightly cropped off the screen due to the size of the graph, click the 'configure subplots' button (second to last button at the bottom left of the window) and adjust the 'bottom' slider until the x axis appears in your desired position.
Analyzing heatmap
In order to perform a correlation analysis among the features we will create a heatmap of the correlation coefficient of each feature:
#* Making a heatmap from the correlation coefficient
plt.figure(figsize=(15,10))
plt.subplots_adjust(bottom=0.264)
sns.heatmap(main_dataframe.corr(numeric_only=True), annot=True)
plt.show()
None of the correlation coefficients observed among the features was strong enough to indicate redundancy for the prediction model (excluding the coefficients of 1 along the diagonal, where each feature is compared with itself).
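If you prefer a programmatic check (a small optional sketch, not in the original code), the strongest absolute correlations can be listed directly from the same correlation matrix:
#? optional: list the strongest absolute pairwise correlations to double-check the heatmap reading
corr = main_dataframe.corr(numeric_only=True).abs()
upper_triangle = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))  # keep each pair only once
print(upper_triangle.stack().sort_values(ascending=False).head(10))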
Calculating limits
We will create a function called 'calculate_limits' that receives a column of the main_dataframe as a parameter and returns the lower and upper limits (Q1 - 1.5 * IQR and Q3 + 1.5 * IQR) used to identify outliers, as described in the second step above.
def calculate_limits(collumn) :
q1 = collumn.quantile(0.25)
q3 = collumn.quantile(0.75)
iqr = q3 - q1
return q1 - 1.5 * iqr, q3 + 1.5 * iqr
# print(calculate_limits(main_dataframe['price']))
Creating graphs
Now we will create four functions, each visualizing a different type of graph, and test them with the 'guests_included' column (except the bar graph for string data, which is tested with 'bed_type'):
- Bar graph for string data
def bar_graph_string(main_dataframe_collumn_name) :
    plt.figure(figsize=(15, 5))
    plt.subplots_adjust(bottom=0.3)
    bar_graph = sns.countplot(x=main_dataframe_collumn_name, data=main_dataframe)
    bar_graph.tick_params(axis='x', rotation=90)
    plt.show()
bar_graph_string('bed_type')
Now, the graphs below will only work properly for NUMERICAL data, as they use limits determined by quartiles. Also, notice that it is necessary to pass the whole column as a parameter (e.g. main_dataframe['guests_included']) instead of only the column's name as a string, as done in the bar graph for strings, which received just the name string 'bed_type'.
- Box plot
def box_plot(main_dataframe_collumn) :
plt.figure(figsize=(15,5))
plt.subplots_adjust(bottom=0.3)
sns.boxplot(x = main_dataframe_collumn)
plt.show()
box_plot(main_dataframe['guests_included'])
- Histogram
def histogram(main_dataframe_collumn) :
plt.figure(figsize=(15, 5))
plt.subplots_adjust(bottom=0.3)
sns.histplot(x = main_dataframe_collumn, kde=True)
plt.show()
histogram(main_dataframe['guests_included'])
- Bar graph
def bar_graph(main_dataframe_collumn) :
plt.figure(figsize=(15, 5))
plt.subplots_adjust(bottom=0.3)
ax = sns.barplot(x = main_dataframe_collumn.value_counts().index, y = main_dataframe_collumn.value_counts())
ax.set_xlim(calculate_limits(main_dataframe_collumn))
plt.show()
bar_graph(main_dataframe['guests_included'])
Removing unnecessary columns
After analyzing the graphs, we decided to remove the following columns from the main_dataframe:
- 'guests_included'
- 'number_of_reviews'
- 'maximum_nights'
Removing the 'guests_included' column
The 'guests_included' column will be excluded from the model's analysis due to its significant skew towards a single value, 1, meaning the vast majority of listings declare only one included guest.
Removing the entire column is preferable here, as excluding only the outliers could significantly distort the final result.
Moreover, this concentration likely stems from data entry errors (or hosts leaving the default value), as typical accommodations should include more than one guest.
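As a quick optional check (not shown in the original text), the skew can be confirmed before dropping the column:
print(main_dataframe['guests_included'].value_counts().head())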
main_dataframe = main_dataframe.drop('guests_included', axis = 1)
Removing the 'number_of_reviews' column
The 'number_of_reviews' column will be excluded from the model's analysis because the model targets ordinary users and new landlords, whose listings in most cases have no reviews, or only a few. Also, simpler models are faster and less prone to overfitting.
main_dataframe = main_dataframe.drop('number_of_reviews', axis = 1)
Removing the 'maximum_nights' column
The 'maximum_nights' feature will be excluded from the model's analysis due to its negligible contribution to price variance. Furthermore, the data looks essentially random, with wildly different values across entries, suggesting it is filled in hastily by users, which warrants its omission.
print(main_dataframe['maximum_nights'].value_counts())
main_dataframe = main_dataframe.drop('maximum_nights', axis = 1)
Outlier Function
Now we will create the 'exclude_outliers' function, which receives the main dataframe and the column from which the outliers will be removed. Usually, outliers above the upper limit represent luxury properties, which are not considered by the prediction model, whose objective is to analyze common apartments, and therefore they should be removed.
def exclude_outliers(main_df, collumn) :
amount_lines = main_df.shape[0]
lower_limit, upper_limit = calculate_limits(main_df[collumn])
main_df = main_df.loc[(main_df[collumn] >= lower_limit) & (main_df[collumn] <= upper_limit), :]
return main_df, amount_lines - main_df.shape[0]
In this code, the use of the .loc function with the condition (main_df[collumn] >= lower_limit) & (main_df[collumn] <= upper_limit) constructs a boolean mask that selects the rows where the values in the specified column are greater than or equal to lower_limit AND less than or equal to upper_limit. Essentially, this line filters the DataFrame main_df to include only the rows whose values in that column fall within the range defined by lower_limit and upper_limit.
Removing outliers from the 'price' and 'extra_people' columns
The first outliers to be removed will be those of the 'price' and 'extra_people' columns, as they have the highest relevance in calculating the final price and are continuous numerical values (they can be measured).
for collumn in ['price', 'extra_people'] :
main_dataframe, amount_removed_lines = exclude_outliers(main_dataframe, collumn)
print(f"{amount_removed_lines} lines were removed from {collumn}")
###histogram(main_dataframe[collumn])
###box_plot(main_dataframe[collumn])
Nearly 10% of the lines were removed from the 'price' column
Note: when the price is a whole number, the count of apartments spikes, because landlords usually set their price as a round value.
Removing outliers of discrete numerical columns
Now we will undergo the same procedure removing the outliers of the discrete numerical columns:
for collumn in ['host_listings_count','accommodates', 'bathrooms', 'beds','bedrooms', 'minimum_nights'] :
main_dataframe, amount_removed_lines = exclude_outliers(main_dataframe, collumn)
print(f"{amount_removed_lines} lines were removed from {collumn}")
###box_plot(main_dataframe[collumn])
###bar_graph(main_dataframe[collumn])
Treating non-numerical values
We will now analyze the categorical text columns, i.e. those that contain neither true/false values nor lists:
- 'property_type'
- 'bed_type'
- 'room_type'
- 'cancellation_policy'
Group_categories function
First, we will create a function to group specific categories of a given column when that column contains a significant number of categories with few entries (the threshold being determined by the 'amount' parameter).
def group_categories(collumn, grouped_category_name, amount) :
series_category = main_dataframe[collumn].value_counts()
for category_type in series_category.index :
if series_category[category_type] < amount :
main_dataframe.loc[main_dataframe[collumn] == category_type, collumn] = grouped_category_name
The main filtering happens in the for loop, which iterates through each distinct category of the column; if the number of entries in a category is smaller than the specified amount passed as a parameter, those entries are all grouped into a new category whose name, 'grouped_category_name', is also passed as a parameter of the function.
Group 'property_type'
Starting with the 'property_type' column, we will group all categories with fewer than 2,000 entries into an 'Other' category, since the column has a significant number of categories whose low counts would make the model more complex and less efficient.
Before grouping
bar_graph_string('property_type')
After grouping all categories with less than 2000 entries into a single 'Other' category:
group_categories('property_type', 'Other', 2000)
bar_graph_string('property_type')
Group 'bed_type'
Next, the 'bed_type' column only has 5 categories, and only the 'real_bed' category has a significant count, while the others have very small counts. Therefore, we will group those entries into an 'Other beds' category. (The threshold of 10,000 was chosen simply to capture all the remaining categories.)
Before grouping:
bar_graph_string('bed_type')
After grouping all categories with less than 10,000 values:
group_categories('bed_type', 'Other beds', 10000)
bar_graph_string('bed_type')
Group 'cancellation_policy'
Next, the 'cancellation_policy' column has 3 categories (strict, super_strict_60, super_strict_30) with small counts compared to the rest. Therefore, we will group these entries into the 'strict' category.
Before:
After grouping said categories into the 'strict' category
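The text above does not show the grouping call itself; following the same pattern used for the other columns, it would look like the snippet below (the 10,000 threshold is an assumption, chosen only to isolate the three rare categories):
group_categories('cancellation_policy', 'strict', 10000)
bar_graph_string('cancellation_policy')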
Group room_type
Finally, the 'room_type' column only has 4 categories without any major value discrepancies in the data (only two categories have significantly smaller counts), so no grouping of categories is needed.
bar_graph_string('room_type')
Treating the 'amenities' column
Now we will treat the 'amenities' column. Since analyzing each individual amenity would require excessive complexity and computing power, we will instead analyze the length of each amenities list (the assumption being that the more amenities a listing has, the higher its final price).
The code below creates a new column in the main dataframe containing the length of each amenities list and removes the original column containing the lists.
main_dataframe["Amount amenities"] = main_dataframe['amenities'].str.split(',').apply(len)
main_dataframe = main_dataframe.drop('amenities', axis = 1)
Now we will remove the outliers:
#* removing outliers
main_dataframe, amount_removed_lines = exclude_outliers(main_dataframe, 'Amount amenities')
print(f"{amount_removed_lines} lines were removed from Amount amenities")
Reducing visualized data
To enable comparisons and analyses that leverage the longitude and latitude data in the dataframe, we will use plotly.express to build a Mapbox visualization. It will depict each property at its geographical coordinates, along with its corresponding price.
However, to avoid crashes and slowdowns, only a random sample of 71,000 rows will be visualized.
data = main_dataframe.sample(71000)
center = {'lat':data.latitude.mean(), 'lon':data.longitude.mean()}
map_graph = px.density_mapbox(data, lat='latitude', lon='longitude',z='price', radius=2.5, center=center, zoom=10, mapbox_style='open-street-map')
Loading the map into the browser
In order to load all of the data into the browser, we will save the map as an html file, then open it in the browser.
with open('map.html', 'w', encoding = 'utf-8') as f :
f.write(map_graph.to_html())
webbrowser.open(os.path.realpath('map.html'))
Encoding explanation
We need to adjust some non-numerical columns so that the machine learning model can analyze them, as it can only work with numbers.
There are two types of data in these columns: categories and true/false values.
Booleans will become 0 (false) and 1 (true).
Categories will be encoded using dummy encoding (a column is created for each category, holding a binary value that indicates whether the row belongs to that category or not). This process can be visualized in the image below:
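As a minimal illustration of what dummy encoding does (using a made-up toy column, not the project data):
#? toy illustration of dummy encoding on a made-up column
toy = pd.DataFrame({'room_type': ['Entire home/apt', 'Private room', 'Entire home/apt']})
print(pd.get_dummies(toy, columns=['room_type']))  # each category becomes its own 1/0 (True/False) column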
Encoding booleans
First, we will create a copy of the dataframe so that we don't accidentally alter the original:
main_dataframe_coded = main_dataframe.copy()
The columns 'host_is_superhost', 'instant_bookable', and 'is_business_travel_ready' already contain only 'f' and 't' values, so we will use the .map function to substitute those values with 0 and 1, respectively.
#* Encoding booleans
for collumn in ['host_is_superhost', 'instant_bookable', 'is_business_travel_ready'] :
main_dataframe_coded[collumn] = main_dataframe[collumn].map({'f':0, 't':1})
Result before and after:
Encoding text columns (dummy encoding)
After having encoded the boolean columns, all that is left is to convert the text columns via dummy encoding.
#* Encoding text columns with dummy encoding
main_dataframe_coded = pd.get_dummies(data = main_dataframe_coded, columns = ['property_type', 'room_type', 'bed_type', 'cancellation_policy'])
7 steps to build a prediction model
Firstly, it is necessary to understand the details of machine learning and its predictive modeling mechanics. As outlined in the introduction, we will use a supervised learning model. Within this domain there are many machine learning models, each distinguished by different techniques and methodologies for different tasks, such as regression and classification.
Regression models are used in tasks that require the prediction of a continuous numerical value, for example, the price of a house. Each model uses a particular regression technique; Linear Regression, for instance, works by fitting a straight line to a set of data points in such a way that the errors between the observed and predicted values are minimized.
It uses the equation Y = a * x1 + b * x2 + c * x3 + d * x4 + ... + z, where:
- Y is the predicted price of the house.
- x1,x2,x3… are the independent variables (number of rooms, number of bathrooms, size of the lot, etc.).
- a,b,c,d.. are the coefficients representing the impact of each independent variable on the predicted price.
- z is the intercept term.
During the training process, the model learns the values of the coefficients (a,b,c,d...) that minimize the errors between the predicted and actual prices in the training data. These coefficients capture the relationships between the independent variables and the dependent variable Y, allowing you to analyze the impact of various factors on the price of the house.
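A minimal sketch of this idea, using synthetic data rather than the project dataset, shows how the coefficients and intercept are recovered during fitting (the toy values a=3, b=2, z=5 are made up for the example):
#? minimal sketch of the equation above using synthetic data (not the project dataset)
rng = np.random.default_rng(0)
x_toy = rng.random((100, 2))                    # two features: x1 and x2
y_toy = 3 * x_toy[:, 0] + 2 * x_toy[:, 1] + 5   # true coefficients a=3, b=2 and intercept z=5
toy_model = LinearRegression().fit(x_toy, y_toy)
print(toy_model.coef_)       # learned a and b -> approximately [3. 2.]
print(toy_model.intercept_)  # learned z       -> approximately 5.0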
This linear regression model is one of the three models analyzed, the other two being the Random Forest Regressor and Extra Trees (these three were specifically selected due to their wide use and effectiveness in most cases), which will be explained in the third section (model selection).
Then there are other models built around classification problems, which require the prediction of a discrete category or class label. An example would be a model that predicts whether an email is SPAM or not. Such models use algorithms to categorize input data into one or more discrete classes or categories.
The construction of a prediction model is separated into seven steps that we will follow:
- Define if it is a classification or regression problem
- Choose the metrics to evaluate the model
- Choose which models we are going to use
- Train the models and test
- Compare the results of the models and choose the best one
- Analyse the best model
- Adjust and improve the best model
Defining if it is a classification or regression problem
Taking into consideration the definitions explained in the previous section:
- Regression models predict a continuous numerical value.
- Classification models predict a discrete category or class label.
Given that our objective is to predict the final price of a property according to its attributes, such as location, number of amenities, etc., we can conclude that our goal is to solve a regression problem.
Choosing the metrics to evaluate the model
We will utilize two statistical metrics to evaluate the accuracy of the model:
- R² - ranges from 0 to 1 and measures how much of the variation in the data the model can explain
- Root Mean Squared Error (RMSE) - indicates, in the same unit as the price, how far off the model's predictions are on average
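For reference, here is a hand-rolled sketch of both metrics, equivalent to scikit-learn's r2_score and root_mean_squared_error used later on:
#? the two metrics written out by hand (equivalent to sklearn's r2_score and root_mean_squared_error)
def r2_by_hand(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)           # residual sum of squares
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # total sum of squares
    return 1 - ss_res / ss_tot
def rmse_by_hand(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))   # average error, in the same unit as the price (R$)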
Choosing which models we are going to use
As mentioned previously, we will use and analyze three different models:
- Linear Regression - fits a straight line that minimizes the errors; values closer to the line are better, and it is not efficient when correlations are weak or absent
- Random Forest Regressor - based on decision trees, which split the data into different groups by asking questions; the random forest regressor builds multiple decision trees on random subsets of the data and averages their results to reach the final answer
- Extra Trees - same idea as the random forest regressor; however, while the random forest chooses the best question (the one that filters the most data), extra trees asks a random question (which can work better depending on the question)
Example: the Random Forest and Extra Trees models rely on decision trees, which are widely used in other scenarios, such as the guessing game Akinator (the picture below displays a machine learning model's decision tree branching after each question asked).
Training and testing the models
In order to train and test our models we will have to randomly separate our data into two sets: training and testing data
- Training data will be used for the model to know what values to analyze (x variable in the equation will be the properties' features) and what it wants to calculate (y variable in the equation will be the price of the property)
- Testing data will be used for the model to analyze new data and check its accuracy after it is fully trained
Note: To address overfitting, an 80-20 data split strategy is employed, where 80% of the dataset is allocated for training and 20% for testing. This approach ensures rigorous evaluation of the model's generalization ability by assessing its performance on unseen data. By keeping the training and testing datasets separate, the model's tendency to overly adapt to training data nuances is mitigated, thereby enhancing its efficacy in handling new data. The image below demonstrates the three types of fitting present in the training of machine learning models:
Comparing the results of the models and choosing the best one
Taking into consideration the evaluation metrics mentioned in the second section we will:
- First, choose one main metric, such as R²; the model with the highest R² will be considered the best model
- RMSE will be used as a tiebreaker when models have very similar R² values
- Time and complexity will also be taken into consideration (less time and less information needed is preferred)
Analyzing the best model
After having chosen the best model, we will analyze the importance of each of the property features the model used to calculate the final price.
- If a feature is not relevant, we can remove it and observe the changes in the result
- We will perform changes with the goal of improving the R²/RMSE, speed, and simplicity of the model
Adjusting and improving the best model
The iterative process of optimizing a machine learning model involves continuous refinements aimed at peak efficiency and accuracy. This entails methodically fine-tuning hyperparameters and exploring a wide range of parameter combinations.
Further adjustments involve refining the training process to achieve superior performance. The ultimate goal is to reach a point where additional iterations yield diminishing returns, signaling that the model has achieved its optimal balance between complexity and predictive power.
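The project itself tunes the model by pruning features (shown later); purely as an illustration of what hyperparameter tuning could look like, a randomized search over a few ExtraTreesRegressor parameters is sketched below (the parameter grid and settings are assumptions, not values used in the project):
#? sketch of hyperparameter tuning (not performed in this project) using a randomized search
from sklearn.model_selection import RandomizedSearchCV
param_distributions = {
    'n_estimators': [100, 200, 400],
    'max_depth': [None, 10, 20, 40],
    'min_samples_split': [2, 5, 10],
}
search = RandomizedSearchCV(
    ExtraTreesRegressor(random_state=20),
    param_distributions=param_distributions,
    n_iter=10,      # number of random combinations to try
    cv=3,           # 3-fold cross-validation on the training split
    scoring='r2',
    n_jobs=-1,
)
# search.fit(x_train, y_train)                     # expensive on the full dataset, so left commented out
# print(search.best_params_, search.best_score_)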
'analyze_model' function
Now we will implement all the steps mentioned previously on our code.
First we will create a function to evaluate each model. The parameter 'prediction' is the prediction made by the model and 'y_test' is the true value of the price used to compare the results.
def analyze_model(prediction, y_test) :
r2 = r2_score(y_test, prediction)
rmse = root_mean_squared_error(y_test, prediction)
print(f"The model has an R² of {r2} and an RMSE of {rmse}\n")
Splitting the data
Now, we will randomly split the data: 80% for training and 20% for testing
- the y variable will be the 'price' column
- the x variables will be the features of each property
#? setting up the y and x variables
y, x = (main_dataframe_coded['price'], main_dataframe_coded.drop('price', axis=1))
#? Splitting the data into training and testing data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=20)
Testing the three models
Now we will train and test each of the three models:
- Linear Regression
- Random Forest Regressor
- Extra Trees
#? Now we will test the three models mentioned previously
linear_model, random_forest_model, extra_trees_model = (LinearRegression(), RandomForestRegressor(), ExtraTreesRegressor())
models_list = ["Linear Regression", "Random Forest Regressor", "Extra Trees"]
for i, model in enumerate([linear_model, random_forest_model , extra_trees_model]):
model.fit(x_train, y_train)
prediction = model.predict(x_test)
print(models_list[i])
analyze_model(prediction, y_test)
The results of each model are shown below:
Choosing the best model
Here are the results once more:
Following the evaluation of all model outputs, the Extra Trees model was selected as the most effective among the tested algorithms.
Although the linear regression model exhibited the fastest computation time, it recorded a notably low R² value of 32%, indicating a weak fit to the data.
Additionally, the Extra Trees and Random Forest models demonstrated similar R² values, suggesting comparable levels of predictive accuracy. However, the Random Forest model incurred a higher Root Mean Square Error (RMSE) and required substantially more computational time compared to the Extra Trees model, further supporting the selection of the latter for optimal performance.
Analyzing the best model
In order to analyze how the model works, we need to observe the importance of each feature the model uses when calculating the final price.
feature_importance_dict = dict(zip(x_train.columns, extra_trees_model.feature_importances_ ))
sorted_feature_importance_dict = sorted(feature_importance_dict.items(), key=lambda x:x[1])
feature_importance_dict = dict(sorted_feature_importance_dict)
print(feature_importance_dict)
A dictionary in ascending order with the importance (in percentage) of each feature is printed as a result:
Observing the influence of each feature, we can see the relevance of location, as longitude and latitude together account for about 20% of the importance used in calculating the price. The number of amenities and of bedrooms are other crucial factors in defining the price, as better-equipped houses become more attractive to customers, which increases their prices.
However, other features, such as 'is_business_travel_ready', have no relevance when calculating the final price, so we will experiment with removing them to test whether the model becomes more efficient.
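To make the printed dictionary easier to read, the importances can also be plotted (an optional sketch, not part of the original code, reusing the feature_importance_dict built above):
#? optional: visualize the feature importances as a horizontal bar chart
plt.figure(figsize=(15, 10))
plt.barh(list(feature_importance_dict.keys()), list(feature_importance_dict.values()))
plt.xlabel('feature importance')
plt.tight_layout()
plt.show()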
Adjusting and improving the best model
Due to the large number of columns in the model's analysis, we will remove all of the features with an importance below 0.007 (0.7%). This removes a large amount of redundant data from the model's analysis, making it considerably more efficient.
Also, after the removal there would only be one 'room_type' column left ('room_type_Entire home/apt'). So, to give the user more flexibility when choosing a room type, we will keep the 'room_type_Private room' column, which had the second highest importance.
room_type_private_room = main_dataframe_coded['room_type_Private room']
for column in feature_importance_dict :
if feature_importance_dict[column] < 0.007 :
print(f"Removed {column}")
main_dataframe_coded = main_dataframe_coded.drop(column, axis=1)
main_dataframe_coded['room_type_Private room'] = room_type_private_room
#? Separating the new data
y, x = (main_dataframe_coded['price'], main_dataframe_coded.drop('price', axis=1))
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=20)
We will now create a new final Extra Trees model with the new data and observe its performance.
final_model = ExtraTreesRegressor()
final_model.fit(x_train, y_train)
prediction = final_model.predict(x_test)
analyze_model(prediction, y_test)
#? creating a new dictionary for the updated features
feature_importance_dict_final = dict(zip(x_train.columns, final_model.feature_importances_))
sorted_feature_importance_dict = sorted(feature_importance_dict_final.items(), key=lambda x:x[1])
feature_importance_dict_final = dict(sorted_feature_importance_dict)
print(feature_importance_dict_final)
The results are shown below:
After removing the 'is_business_travel_ready' column, the R² increased slightly while the RMSE decreased, signifying an improvement in the model's accuracy.
After removing the 'property_type' and 'bed_type' columns, the model's accuracy decreased slightly; however, its simplicity and efficiency improved significantly.
Lastly, the model's accuracy barely changed after removing all columns with an importance below 0.007 (0.7%), while its efficiency and simplicity improved significantly (from taking approximately 5 minutes to test down to less than 2 minutes).
Applying the final changes
Finally, after implementing all necessary feature modifications, proceed with training the model to evaluate its performance in terms of efficiency and accuracy. If you wish to conduct comparative analysis with an alternative model to assess whether it outperforms the Extra Trees algorithm, you may consider using a Random Forest Regressor for this purpose, but feel free to choose whichever model you prefer!
After all the changes are done, we can move on to the final deployment phase.
Deployment forms
There are various ways to deploy a machine learning project such as:
- Host it on a website using Django/Flask
- Turn it into an app with Tkinter
- Convert the project into a .exe file
- Deploying it with Streamlit
In this project, we're opting to deploy the machine learning model through a Streamlit-based web application, which will be accessible via a standalone executable (.exe) file. This deployment method was selected because the model's size is substantial, leading to hosting constraints on many platforms. Given the file size limitations imposed by various hosting providers, deploying through an executable Streamlit application offers a practical solution for accommodating large-scale machine learning models.
Exporting model as joblib
First, we will export the model itself using the joblib library, saving it to a .joblib file. We will also add the 'price' column back to the feature dataframe (it had been separated out during the training process) and export the full dataset, the x values together with the y values, to a CSV file for later use in the deployment.
x["Price"] = y
x.to_csv(r"deploy\final_data.csv")
joblib.dump(final_model, "final_model.joblib", compress = 3)
Note: we set a compression level of 3 in the joblib.dump parameters in order to reduce the file size from approximately 2 GB down to 400 MB; however, loading the model gets slower as the compression level increases.
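A quick optional check (not in the original code) is to reload the compressed file and confirm it still predicts as before, reusing the analyze_model function and the test split from earlier:
#? optional sanity check: reload the compressed model and confirm its metrics are unchanged
reloaded_model = joblib.load("final_model.joblib")
analyze_model(reloaded_model.predict(x_test), y_test)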
airbnb_deploy file
Create a folder called deploy and inside it a python file called airbnb_deploy.py. This file will be responsible for deploying the project.
Importing streamlit and setting dictionaries
Importing all necessary libraries:
import pandas as pd
import streamlit as st
import joblib
import sys
from streamlit.web import cli as stcli
import os
Setting up the directories and making sure all requirements are installed:
script_directory = os.path.dirname(os.path.abspath(__file__))
os.chdir(script_directory)
sys.argv = ["pip", "install", "-r", "requirements.txt"]
Now, we will write the code below so that the Streamlit website automatically runs locally every time the file is executed:
try :
if __name__ == '__main__':
sys.argv = ["streamlit", "run", "airbnb_deploy.py"]
sys.exit(stcli.main())
except RuntimeError as e:
if str(e) == "Runtime instance already exists!":
pass
Note: this code should always be kept at the bottom of the file, as it will immediately open the website with only the features defined in the code above it!
Now, in order to make the buttons for each feature used in the prediction model's analysis, we will first create three dictionaries, one for each different data type (numerical, boolean, and lists).
Then, we will create a button for each dictionary and update the values of each column depending on the user's input.
Creating the dictionaries:
x_numerical = {'latitude': 0, 'longitude': 0, 'accommodates': 0, 'bathrooms': 0, 'bedrooms': 0, 'beds': 0, 'extra_people': 0, 'minimum_nights': 0, 'year': 0, 'Amount amenities': 0, 'host_listings_count': 0}
x_boolean = {'host_is_superhost': 0, 'instant_bookable': 0}
x_lists = {'property_type': ['Apartment', 'House'], 'room_type': ['Entire home/apt', 'Private room'], 'cancellation_policy': ['flexible', 'moderate', 'strict_14_with_grace_period']}
Now we will create another dictionary containing only the lists created by the dummy variables. This is so we can store the values the user enters.
list_values = {'property_type_Apartment' : 0, 'property_type_House' : 0, 'room_type_Entire home/apt' : 0, 'room_type_Private room' : 0, 'cancellation_policy_flexible' : 0, 'cancellation_policy_moderate' : 0, 'cancellation_policy_strict_14_with_grace_period' : 0}
Setting up page and config
Setting up the page title and icon:
st.set_page_config(page_title="Airbnb Deployment", page_icon=":shark:")
st.title("Airbnb Machine Learning Model Deployment")
We will add a link to a Google Drive file containing the prediction model's joblib file for download, in case the user doesn't already have it on their computer. The following link was used: https://drive.google.com/file/d/1VMhrCh5l2neipciZF15Y1lBDS02lgeN5/view?usp=sharing
st.write("If you don't have the file of the model, download it [here](https://drive.google.com/file/d/1VMhrCh5l2neipciZF15Y1lBDS02lgeN5/view?usp=sharing)")
Note: on Streamlit, use [text](link) to attach a specific link to a piece of text.
Next, we will add an upload box for the user to upload the joblib file:
model_file = st.file_uploader("Upload your model file", accept_multiple_files=False)
However, the default upload limit in Streamlit is 200 MB and our file is over 400 MB. To fix this, go to the .streamlit folder and, inside the config.toml file, write the following:
[server]
maxUploadSize = 600
maxMessageSize = 600
Note: if said folder and file are not present in your main project folder, create them with those exact names.
The streamlit page looks like this now:
Creating the buttons
Now we will create a button for each dictionary and update the values based on the user's input.
Some buttons, such as latitude and longitude, will take float values and others integer values; we will adjust their step and format as needed.
for item in x_numerical :
if item == 'latitude' or item == 'longitude' :
value = st.number_input(f'{item}', step=0.000001, value=0.0, format="%.6f")
elif item == 'extra_people' :
value = st.number_input(f'{item}', step=0.01, value = 0.0) #? default decimal places for floats are already two, so no format is needed
else :
value = st.number_input(f'{item}', step = 1, value = 0)
x_numerical[item] = value
for item in x_boolean :
value = st.selectbox(f'{item}', ('Yes', 'No'))
if value == 'Yes' :
x_boolean[item] = 1
else :
x_boolean[item] = 0
To build the select boxes, we will iterate over the x_lists dictionary and store the user's choice in the list_values dictionary, as it is the one that holds the values the user enters.
for item in x_lists :
value = st.selectbox(f'{item}', x_lists[item])
list_values[f'{item}_{value}'] = 1
The Streamlit website now contains the buttons:
Creating the preview value button
Lastly, we will create a button for the user to see the predicted value:
preview_button = st.button("View the predicted value")
Once the user clicks the button, we will merge the list_values, x_numerical, and x_boolean dictionaries updated by the user's input and create a new DataFrame from them for the machine learning model to use.
if preview_button :
list_values.update(x_numerical)
list_values.update(x_boolean)
x_value_dataframe = pd.DataFrame(list_values, index = [0])
As the data has to be in the same order as the data the model was trained on, we will create a list containing the columns in that order (excluding the first column, which is the index, and the last column ('Price'), which was added after training). This column order list will be used to rearrange the columns of the dataframe:
data = pd.read_csv("final_data.csv")
column_order_list = list(data.columns)[1:-1]
x_value_dataframe = x_value_dataframe[column_order_list]
#? Loading the model and making the prediction
model = joblib.load(model_file)
prediction = model.predict(x_value_dataframe)
st.write(f"The predicted value is R$ {prediction[0]:.2f}")
Note: all of the code above is located within the preview_button condition!
Done! We can now calculate the estimated price of an Airbnb property given the feature values we choose on the website!
Finalizing
Lastly, we need to configure our project to run Streamlit via a .exe file and finalize the deployment phase.
- First, inside the deploy folder, create a run.py file and copy this code for the .exe to run the application:
import streamlit
import joblib
import scipy.special._cdflib
from sklearn.metrics import r2_score, root_mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.model_selection import train_test_split
import streamlit.web.cli as stcli
import os, sys
def resolve_path(path):
resolved_path = os.path.abspath(os.path.join(os.getcwd(), path))
return resolved_path
if __name__ == "__main__":
sys.argv = [
"streamlit",
"run",
resolve_path("airbnb_deploy.py"),
"--global.developmentMode=false",
]
sys.exit(stcli.main())
- Open the command prompt and make sure you are always inside the deploy folder directory; if you are not, run the command
cd deploy
- Next, go to the command prompt inside the deploy folder directory and create a requirements.txt file via the command
pip freeze > requirements.txt
and move the file into the deploy folder.
- Now create a folder called 'hooks' inside the deploy folder and, inside it, create a hook-streamlit.py file with the following code:
from PyInstaller.utils.hooks import copy_metadata
datas = copy_metadata("streamlit")
- Also make a copy of the airbnb_deploy.py file and move it into the hooks folder
- In the command prompt, inside the deploy folder directory, run the command
pyinstaller --onefile --additional-hooks-dir=./hooks run.py --clean
- This will generate `build` and `dist` folders and a `run.spec` file. Edit the `run.spec` file to ensure the paths are set properly, as below:
# -*- mode: python ; coding: utf-8 -*-
from PyInstaller.utils.hooks import collect_data_files
from PyInstaller.utils.hooks import copy_metadata
datas = [(".venv/Lib/site-packages/streamlit/runtime", "./streamlit/runtime")]
datas += collect_data_files("streamlit")
datas += copy_metadata("streamlit")
block_cipher = None
a = Analysis(
["run.py"],
pathex=["."],
binaries=[],
datas=datas,
hiddenimports=[],
hookspath=[],
hooksconfig={},
runtime_hooks=[],
excludes=[],
win_no_prefer_redirects=False,
win_private_assemblies=False,
cipher=block_cipher,
noarchive=False,
)
pyz = PYZ(a.pure)
exe = EXE(
pyz,
a.scripts,
a.binaries,
a.datas,
[],
name='run',
debug=False,
bootloader_ignore_signals=False,
strip=False,
upx=True,
upx_exclude=[],
runtime_tmpdir=None,
console=True,
disable_windowed_traceback=False,
argv_emulation=False,
target_arch=None,
codesign_identity=None,
entitlements_file=None,
)
Note: the .venv in the path passed to the datas variable at the beginning of the code is the name of the virtual environment folder on this computer; it might change depending on the user.
- On the cmd prompt, inside the deploy folder directory, run
pyinstaller run.spec --clean
- If faced with bugs related to paths not being found, re-check whether there is a Lib directory and whether the name of your virtual environment is written correctly
- Also make sure that you are inside the right directory inside the cmd prompt
- Copy and paste the .streamlit folder with the config.toml file into the dist folder, located inside the deploy folder. This will keep the modifications we made regarding the maximum limit of upload size.
All done! The .exe is now located inside the dist folder, and when it is executed the Streamlit website appears, where we can upload the prediction model and use its features. We can now send this project to anyone, and they can use it regardless of whether they have Python installed or not. You will just need to send the .exe file along with the model's .joblib file so it can be uploaded on the website.
Note: if this is your first time running a Streamlit/PyInstaller application, it might ask for your email in the command prompt before executing; you can provide it without any issue, as they don't send spam.
Thank You
Thank you for taking the time to read through my project. If you'd like to discuss this project further or have any questions, I'm open to connecting. Once more, feel free to reach out via email at dantenavaza2005@gmail.com or LinkedIn. I appreciate your interest and look forward to potential collaborations.
Future plans
Having finished this large machine learning project, I will now refocus on other projects I had in progress:
- E-commerce Website: I'm currently working on an e-commerce platform designed to offer a seamless shopping experience. This project aims to provide a user-friendly environment where people can find and purchase unique products, with full integration with banking systems.
- Portfolio Website : In addition to this project, I'm developing my portfolio website to showcase my work in technology, machine learning, and other creative endeavors. This platform will serve as a central hub for my professional achievements and personal projects.
- New Game Development: I'm currently working on a photorealistic third-person action game portraying an alien invasion scenario, using a large variety of technologies such as Unity, Blender, Character Creator 4, and ZBrush.