To start with, we need the data. Here is a Python script that extracts information about houses in your area from Trulia using Beautiful Soup.
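As a rough idea of what such a scraper does, here is a minimal sketch that parses listing fields out of HTML with Beautiful Soup. The markup, the class names (`listing`, `price`, `beds`, `status`) and the `parse_listings` helper are all assumptions made up for illustration; Trulia's real pages are structured differently, so the selectors would need to be adapted.

```python
from bs4 import BeautifulSoup

# Hypothetical listing markup for illustration only; Trulia's real pages
# use different tags and class names.
SAMPLE_HTML = """
<ul>
  <li class="listing">
    <span class="price">$350,000</span>
    <span class="beds">3</span>
    <span class="status">For Sale</span>
  </li>
  <li class="listing">
    <span class="price">$289,900</span>
    <span class="beds">2</span>
    <span class="status">Sold</span>
  </li>
</ul>
"""


def parse_listings(html):
    """Return one dict per listing card found in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.find_all("li", class_="listing"):
        price_text = card.find("span", class_="price").get_text(strip=True)
        rows.append({
            # Strip the currency symbol and thousands separators.
            "price": int(price_text.replace("$", "").replace(",", "")),
            "beds": int(card.find("span", class_="beds").get_text(strip=True)),
            "status": card.find("span", class_="status").get_text(strip=True),
        })
    return rows


listings = parse_listings(SAMPLE_HTML)
for row in listings:
    print(row)
```

In a real run the HTML would come from a downloaded page rather than a string, and the rows would be written to a table (Access in this article) instead of printed.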
This is an example of what data you can get from Trulia:
|The collected data in Access|
The next step is to prepare the data for training. Because we are using RapidMiner, we have to import the collected data from the Access table into RapidMiner.
|Prepare the data|
The next question is: which model to use? MOD is a very useful website that recommends a set of models based on the type and size of the data. In our situation, Decision Tree, Naive Bayes, k-NN, Random Forest, Deep Learning and Generalized Linear Model are recommended. We can then compare the accuracy of those models using the Compare ROCs operator in RapidMiner.
|The comparison of different models|
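To make the ROC comparison a bit more concrete, here is a small sketch of one common way to summarize an ROC curve in a single number: the area under the curve (AUC), computed as the chance that a randomly chosen positive example gets a higher score than a randomly chosen negative one. This is not RapidMiner's implementation, and the labels, scores and model names below are made up for illustration.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random
    negative; ties count as half a win."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up data: 1 = sold, 0 = still for sale, plus each model's
# confidence that the house is sold.
labels = [1, 0, 1, 0, 1, 0]
model_scores = {
    "model_a": [0.9, 0.2, 0.8, 0.4, 0.7, 0.1],
    "model_b": [0.6, 0.5, 0.4, 0.7, 0.8, 0.3],
}

auc_by_model = {name: roc_auc(labels, s) for name, s in model_scores.items()}
ranking = sorted(auc_by_model, key=auc_by_model.get, reverse=True)
for name in ranking:
    print(name, round(auc_by_model[name], 3))
```

An AUC of 1.0 means the model ranks every sold house above every unsold one, while 0.5 is no better than guessing.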
The results of the Decision Tree and Random Forest models are unrealistically accurate, followed by the Rule Induction and Logistic Regression models.
Here is the distribution of the predicted results from the Deep Learning model:
|The result from Deep Learning model|
Finally, we can apply our models to a specific house that we might be interested in. We also created a set of records for the same house at different prices to see whether a price decline would influence the purchase decision.
|Apply the models|
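The set of records for the same house at different prices can be generated rather than typed by hand. Here is a minimal sketch; the field names, the starting price and the discount steps are assumptions for illustration, not the actual record used here.

```python
def price_variants(house, discount_percents):
    """Copy a house record once per discount, changing only the price."""
    variants = []
    for pct in discount_percents:
        record = dict(house)  # shallow copy so the original stays untouched
        record["price"] = house["price"] * (100 - pct) // 100
        variants.append(record)
    return variants


# Hypothetical house record; only the price differs between variants.
house = {"price": 400000, "beds": 3, "baths": 2, "sqft": 1800}
variants = price_variants(house, [0, 5, 10, 15])

for v in variants:
    print(v)
```

Because every other attribute is held constant, any change in a model's prediction across these rows can be attributed to the price alone.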
All the models predicted that this house, even at reduced prices, would remain for sale, which means it should not be considered. Specifically, the Decision Tree model gives less than 28% confidence that this house will be sold, and that confidence is not sensitive to price.
|The Decision Tree model prediction|
The Deep Learning model and the Generalized Linear Model also don't think the house will be sold, but their confidence decreases as the price goes down. Therefore, we can keep offering a lower price to see to what extent the computer thinks the deal is reasonable :)
|The Deep Learning model prediction|
|The Generalized Linear model prediction|
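The idea of lowering the offer until a model changes its mind can be sketched as a simple loop. The scorer below is a toy logistic function with made-up coefficients standing in for a trained model; it is not one of the models from this article, and the prices, step and threshold are assumptions as well.

```python
import math


def toy_sold_confidence(price):
    """Stand-in scorer with made-up coefficients (NOT a trained model):
    the confidence that the house sells rises as the price falls."""
    return 1.0 / (1.0 + math.exp((price - 300000) / 20000))


def first_acceptable_price(start_price, step, floor, scorer, threshold=0.5):
    """Lower the price step by step and return the first price whose
    'sold' confidence reaches the threshold, or None if none does."""
    price = start_price
    while price >= floor:
        if scorer(price) >= threshold:
            return price
        price -= step
    return None


offer = first_acceptable_price(400000, 10000, 200000, toy_sold_confidence)
print(offer)
```

With a real model, `scorer` would be replaced by a call that scores the house record at the given price and returns the confidence for the "sold" class.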