Monday, August 14, 2017

Use Machine Learning to Help with a House Purchase

Buying a house is a tedious process that involves numerous factors, e.g., location, year built, number of bathrooms and bedrooms, etc. Many websites, such as Trulia, provide detailed information about the houses on the market, including a status that shows whether a house is still available or under contract. We can therefore use the house status as the label and see which other factors influence the decision to buy a house.

To start with, we need the data. Here is a Python script that can help you extract information about the houses in your area from Trulia with Beautiful Soup.
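As a rough illustration of the extraction step, here is a minimal sketch that pulls a few fields out of listing markup. It is not the script mentioned above: it uses Python's built-in html.parser instead of Beautiful Soup so that it runs without extra installs, and the HTML snippet, class names, and field names are made up for the example. Real Trulia pages are structured differently.

```python
from html.parser import HTMLParser

# Hypothetical listing markup, invented for this example.
SAMPLE_HTML = """
<div class="listing" data-status="for sale">
  <span class="price">$250,000</span>
  <span class="beds">3</span>
  <span class="baths">2</span>
</div>
<div class="listing" data-status="pending">
  <span class="price">$310,500</span>
  <span class="beds">4</span>
  <span class="baths">3</span>
</div>
"""


class ListingParser(HTMLParser):
    """Collects one dict per <div class="listing"> block."""

    FIELDS = {"price", "beds", "baths"}  # assumed span classes

    def __init__(self):
        super().__init__()
        self.listings = []
        self._field = None  # name of the span we are currently inside

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("class") == "listing":
            # Start a new record and keep the status as the future label.
            self.listings.append({"status": attrs.get("data-status")})
        elif tag == "span" and attrs.get("class") in self.FIELDS:
            self._field = attrs["class"]

    def handle_data(self, data):
        # Only keep text that sits inside one of the known spans.
        if self._field and self.listings:
            self.listings[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def parse_listings(html):
    parser = ListingParser()
    parser.feed(html)
    parser.close()
    return parser.listings


def to_number(text):
    """Turn a string such as '$250,000' into the integer 250000."""
    return int(text.replace("$", "").replace(",", ""))


houses = parse_listings(SAMPLE_HTML)
for house in houses:
    print(house["status"], to_number(house["price"]), house["beds"], house["baths"])
```

With Beautiful Soup the same idea is usually shorter, but the selectors would still have to be adapted to the actual page.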

This is an example of what data you can get from Trulia:
The collected data in Access

The next step is to prepare the data for training. Because we are using RapidMiner, we have to import the collected data from the Access table into RapidMiner.


Prepare the data

In the data preparation step, we filtered out unnecessary information, such as the MLS ID, the URL of each house, etc. We then changed the status "for sale by owner" to "for sale" so that there are only two statuses: for sale and pending (i.e., under contract). We set the house status as the label, namely the target we want to predict, and excluded some house features that are not common, e.g., has a lawn or has a pond.
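We did this step in RapidMiner, but the same preparation can be sketched in a few lines of plain Python. The column names and sample records below are assumptions made for illustration; the real table has more columns and its own names.

```python
# Hypothetical column names; adjust to match the real table.
DROP_COLUMNS = {"mls_id", "url", "has_lawn", "has_pond"}
LABEL = "status"


def prepare(record):
    """Drop unneeded columns and reduce the status to two classes."""
    row = {key: value for key, value in record.items() if key not in DROP_COLUMNS}
    status = row[LABEL].strip().lower()
    if status == "for sale by owner":
        status = "for sale"  # merge into the regular "for sale" class
    row[LABEL] = status
    return row


# Made-up records, only to show the shape of the data.
raw = [
    {"mls_id": "A1", "url": "http://example.com/1", "price": 250000,
     "beds": 3, "baths": 2, "has_pond": False, "status": "For Sale By Owner"},
    {"mls_id": "A2", "url": "http://example.com/2", "price": 310500,
     "beds": 4, "baths": 3, "has_pond": True, "status": "Pending"},
    {"mls_id": "A3", "url": "http://example.com/3", "price": 199900,
     "beds": 2, "baths": 1, "has_pond": False, "status": "For Sale"},
]

prepared = [prepare(record) for record in raw]
labels = sorted({row[LABEL] for row in prepared})
print(labels)
```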

The next question is: which model to use? MOD is a very useful website that can recommend a set of models based on the data type and size of the data. In our situation, Decision Tree, Naive Bayes, k-NN, Random Forest, Deep Learning and Generalized Linear Model are recommended. We can then compare the accuracy of those models using the Compare ROCs operator in RapidMiner.
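To give an idea of what comparing ROCs boils down to, here is a small sketch that computes the area under the ROC curve (AUC) for two sets of confidence scores. This is not how the RapidMiner operator is implemented; it uses the pairwise definition of AUC (the share of pending/for sale pairs in which the pending house gets the higher score), and the labels and scores are invented.

```python
def roc_auc(labels, scores, positive="pending"):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == positive]
    neg = [s for y, s in zip(labels, scores) if y != positive]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up true statuses and "pending" confidences from two hypothetical models.
labels = ["pending", "pending", "for sale", "for sale", "for sale"]
model_a = [0.9, 0.6, 0.7, 0.2, 0.1]
model_b = [0.9, 0.8, 0.3, 0.2, 0.1]

auc_a = roc_auc(labels, model_a)
auc_b = roc_auc(labels, model_b)
print(f"model A: {auc_a:.3f}, model B: {auc_b:.3f}")
```

An AUC of 0.5 corresponds to random guessing and 1.0 to a perfect ranking, so a model that scores close to 1.0 on real data deserves a second look.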


The comparison of different models

The results of Decision Tree and Random Forest are unrealistically accurate, followed by Rule Induction and Logistic Regression models.

Here is the distribution of the predicted results from the Deep Learning model:
The result from Deep Learning model
Most pending houses are predicted as pending with high confidence, and some for sale houses are also predicted as pending with relatively low confidence. Those for sale --> pending houses may have potential and be worth a closer look. Most of the for sale houses are also predicted as for sale.
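If the predictions are exported from RapidMiner, the for sale --> pending houses can be picked out with a simple filter. The sketch below assumes each prediction is a record with the actual status, the predicted status, and the confidence for pending; these field names, the threshold, and the sample values are hypothetical.

```python
def flag_candidates(predictions, min_confidence=0.5):
    """Return for sale houses that the model predicts as pending.

    The result is sorted so that the most confident predictions come first.
    """
    flagged = [
        p for p in predictions
        if p["actual"] == "for sale"
        and p["predicted"] == "pending"
        and p["confidence_pending"] >= min_confidence
    ]
    return sorted(flagged, key=lambda p: p["confidence_pending"], reverse=True)


# Made-up predictions, only to show the idea.
predictions = [
    {"id": "h1", "actual": "pending", "predicted": "pending", "confidence_pending": 0.93},
    {"id": "h2", "actual": "for sale", "predicted": "pending", "confidence_pending": 0.58},
    {"id": "h3", "actual": "for sale", "predicted": "for sale", "confidence_pending": 0.21},
    {"id": "h4", "actual": "for sale", "predicted": "pending", "confidence_pending": 0.66},
]

candidates = flag_candidates(predictions)
print([p["id"] for p in candidates])
```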

Finally, we can apply our models by feeding them a specific house that we might be interested in. We also created a set of records for the same house with different prices to see whether a price drop can influence the purchase decision.
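Creating the price variants is easy to script. The following sketch copies one house record and lowers the price by a few percentage steps; the record, its field names, and the discount steps are assumptions for illustration, not the values we actually used.

```python
def price_variants(house, discounts_percent=(0, 5, 10, 15)):
    """Return copies of the house that differ only in price.

    Integer arithmetic keeps the prices free of floating point noise.
    """
    return [
        {**house, "price": house["price"] * (100 - pct) // 100}
        for pct in discounts_percent
    ]


# Hypothetical house of interest.
house = {"price": 300000, "beds": 3, "baths": 2, "year_built": 1995}

variants = price_variants(house)
for variant in variants:
    print(variant["price"], variant["beds"], variant["baths"])
```

Each variant can then be fed to the trained models as a separate record, so any change in the prediction comes from the price alone.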

Apply the models

All the models predicted that this house, even at the reduced prices, would remain for sale, namely that it should not be considered. Specifically, the Decision Tree model gives less than 28% confidence that this house will be sold, and that confidence is not sensitive to the price.

The Decision Tree model prediction
The Deep Learning model and the Generalized Linear Model also don't think the house will be sold, but their confidence in that prediction decreases as the price goes down. Therefore, we can continue offering a lower price to see to what extent the computer thinks the deal is reasonable :)

The Deep Learning model prediction




The Generalized Linear model prediction