September 24, 2023


Photo by Luis Droege on Unsplash

In this blog post, we introduce an example use case of the forester package. We present a usage scenario with a real-life story in the background, including code examples, an analysis of the results, and comments. The package itself is described in more detail in a previous blog post that introduces it.

Let's imagine that we are a young homeowner moving from Lisbon to Warsaw. Because we lack savings, we decide to sell our apartment in order to buy a new one in Poland. The decision was made in haste, because next month we start a new job as a researcher at MI².AI, and we know nothing about the real estate market in Portugal. Luckily, as skilled data scientists, we managed to scrape information about real estate properties and built a Lisbon dataset. It's not much, but it will do for our case. The first observations are presented below.


First overview of the Lisbon dataset.

First, we want to know what is inside our scraped dataset and whether it is sufficient for our analysis. Usually we would start by writing a tedious exploratory data analysis script. However, the forester package offers a function that reports basic information about the dataset. We therefore import the package, load the dataset, and create a check_data() report by passing the dataset and the name of our target column. The required code and the result are shown below.


library(forester)
data(lisbon)
check

Data check report for the Lisbon dataset.


The report above tells us that the dataset is not clean: there are several issues with the column values.


First, we identify the static columns (every observation has the same single value), which are country, district, and municipality. We also find that AreaNet is highly correlated with AreaGross, and that the same holds for the propertytype and propertysubtype pair. This means that the columns in each pair provide nearly the same information to our future model. Finally, we find that the column named ID cannot provide any information, because it is an index column.
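To make these three checks concrete, here is a minimal base R sketch of how static, highly correlated, and index-like columns can be detected. It is not the forester implementation; the toy data frame, its values, and the 0.9 correlation threshold are assumptions made purely for illustration.

```r
# Toy data frame with hypothetical values (not the real Lisbon dataset)
toy <- data.frame(
  Id        = 1:6,
  Country   = rep("Portugal", 6),
  AreaNet   = c(50, 62, 75, 88, 100, 120),
  AreaGross = c(55, 68, 80, 95, 108, 131),
  Bathrooms = c(1, 2, 1, 3, 1, 2)
)

# Static columns: every observation shares one single value
is_static   <- vapply(toy, function(col) length(unique(col)) == 1, logical(1))
static_cols <- names(toy)[is_static]

# Highly correlated numeric pairs: |Pearson r| above an assumed 0.9 threshold
num  <- toy[, c("AreaNet", "AreaGross", "Bathrooms")]
cors <- cor(num)
high <- which(abs(cors) > 0.9 & upper.tri(cors), arr.ind = TRUE)
correlated_pairs <- data.frame(
  first  = rownames(cors)[high[, "row"]],
  second = colnames(cors)[high[, "col"]]
)

# Index-like columns: numeric values that simply count 1, 2, ..., n
is_index   <- vapply(
  toy,
  function(col) is.numeric(col) && all(col == seq_len(nrow(toy))),
  logical(1)
)
index_cols <- names(toy)[is_index]

print(static_cols)
print(correlated_pairs)
print(index_cols)
```

On this toy data the sketch flags Country as static, the AreaNet and AreaGross pair as highly correlated, and Id as an index, which mirrors the kinds of issues described above.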

To address these main problems, we drop the columns listed above. The data check report also notes duplicate columns, missing values, outliers, and an abnormal distribution of the value column. These are acceptable in our case, so we ignore them.

lisbon

At this point we already know something about our dataset, so it's time to build the first model. To do this we use the train() function, which wraps the entire AutoML pipeline. Usually we would provide only the two required parameters, but because we want baseline results fast and have already run check_data(), we decide to skip some modules (we turn off the random search algorithm, Bayesian optimization, and printed messages).

output_1

The output of the train() function is complex, but we'll focus on the ranked list. The table below lists all the trained models along with several metrics calculated on the test subset. The first model scored 0.77 on the R2 metric, which is relatively high. It is also the best model in terms of MSE (mean squared error) and MAE (mean absolute error). We could already use that model to estimate our house price, but let's see if we can do better with different parameters.
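For readers less familiar with these three metrics, the following base R sketch computes them by hand. The actual and predicted prices are made-up numbers for illustration only, and this is not how forester computes its ranked list internally.

```r
# Hypothetical actual and predicted prices (in thousands of euros)
actual    <- c(200, 150, 320, 410, 275)
predicted <- c(210, 140, 300, 400, 290)

residuals <- actual - predicted

# Mean squared error: average of the squared residuals
mse <- mean(residuals^2)

# Mean absolute error: average of the absolute residuals
mae <- mean(abs(residuals))

# R2: share of the variance in the target explained by the predictions
r2 <- 1 - sum(residuals^2) / sum((actual - mean(actual))^2)

print(c(MSE = mse, MAE = mae, R2 = r2))
```

Lower MSE and MAE are better, while R2 is better the closer it gets to 1, which is why a single model can rank first on all three at once.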


Original training result rank_list.

We want to improve the models by tuning their hyperparameters. Doing it manually would require a lot of effort and expertise. Thankfully, the train() function can do this automatically. We set both bayes_iter and random_evals to 20, which runs Bayesian optimization and random search, respectively, during training.

output_2

Tuned training result rank_list.

With Bayesian optimization, we improved the R2 metric for the best model from 0.77 to 0.91. MSE has also improved a lot. This model is xgboost, trained with Bayesian optimization. This looks very promising, but to make sure it’s believable, let’s explain how these results were obtained.

Fortunately, the forester package provides an interface for easy integration with the DALEX package, a well-known explainable artificial intelligence (XAI) solution. With just a few steps we can create an explainer and a feature importance plot that shows which columns were most important to the model.

library('DALEX')
ex

The five most important columns for the Lisbon dataset.

From the plot above we can see that the most important factors for the xgboost model were the area of the apartment, the number of bathrooms, the latitude (which translates to distance from the city center), the condition of the apartment, and the price per square meter. All of these factors seem very reasonable to us, so we conclude that the model behaves sensibly and we trust it.
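A feature importance plot of this kind is typically based on permutation importance: shuffle one column, and measure how much the model's error grows. The base R sketch below illustrates that idea on simulated data with a plain linear model. The data, the model, and the helper names are all hypothetical, and we only assume (rather than show) that the plot above was produced in a similar way.

```r
set.seed(42)

# Simulated data: price depends on area, while noise is irrelevant by design
n   <- 200
toy <- data.frame(area = runif(n, 30, 150), noise = rnorm(n))
toy$price <- 2000 * toy$area + rnorm(n, sd = 5000)

# A simple linear model stands in for the tuned xgboost model
model <- lm(price ~ area + noise, data = toy)

# Root mean squared error of the model on a given data frame
rmse <- function(model, data) {
  sqrt(mean((data$price - predict(model, newdata = data))^2))
}

baseline <- rmse(model, toy)

# Importance of a feature: average increase in RMSE after shuffling it
permutation_importance <- function(feature, n_rep = 10) {
  losses <- replicate(n_rep, {
    shuffled <- toy
    shuffled[[feature]] <- sample(shuffled[[feature]])
    rmse(model, shuffled)
  })
  mean(losses) - baseline
}

importance <- sapply(c("area", "noise"), permutation_importance)
print(sort(importance, decreasing = TRUE))
```

Shuffling area destroys the relationship the model relies on, so its error rises sharply, whereas shuffling noise barely changes anything. The same reading applies to the plot: the longer the bar, the more the model depends on that column.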

At this point, we cross-checked our data, trained several models, and interpreted the best model. But we want all this information in one place! To do this we can create a report with the report() function. It produces a PDF or HTML file that presents data and information about the model in a formal and clear way. The report will be covered in detail in another blog.


report(output_2)

Example of a report for the regression task on the Lisbon dataset.

Now that we have a model, we can estimate the value of our home. We create an observation with all the necessary information about our apartment. We take the best model produced by the forester package and make a price prediction for our observation, which comes out at 214 156 euros. Now we can save the model for the future and add the estimated price to our ad!

x

forester: Predicting House Price Use Cases was originally published in Responsible ML on Medium, where people are continuing the conversation by highlighting and responding to this story.
