To start with, we need the data. Here is a Python script that uses Beautiful Soup to extract information about houses in your area from Trulia.

This is an example of what data you can get from Trulia:

The collected data in Access

The next step is to prepare the data for training. Because we are using RapidMiner, we have to import the collected data from the Access table into RapidMiner.

Prepare the data

The status *for sale by owner* is changed to *for sale* so that there are only two statuses: *for sale* and *pending* (i.e., under contract). We set the house status as the label, namely the target we want to predict, and exclude some house features that are not common, e.g., *has a lawn* or *has a pond*.
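The same preparation can also be sketched outside RapidMiner. The snippet below is only an illustration: the field names (`status`, `has_lawn`, `has_pond`, `price`, `beds`) and the sample records are assumptions, not the actual columns of the Access table.

```python
# Hypothetical field names; the real Access table may use different columns.
RARE_FEATURES = {"has_lawn", "has_pond"}


def prepare(record):
    """Return a cleaned copy: rare features dropped, status merged into two labels."""
    cleaned = {k: v for k, v in record.items() if k not in RARE_FEATURES}
    if cleaned.get("status") == "for sale by owner":
        cleaned["status"] = "for sale"
    return cleaned


# Made-up sample records, for illustration only.
houses = [
    {"price": 250000, "beds": 3, "status": "for sale by owner", "has_pond": True},
    {"price": 310000, "beds": 4, "status": "pending", "has_lawn": True},
    {"price": 199000, "beds": 2, "status": "for sale"},
]

prepared = [prepare(h) for h in houses]
labels = sorted({h["status"] for h in prepared})
print(labels)
```

Because `prepare` returns a copy, the raw records stay untouched and the recoding can be rerun if the label definition changes.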

The next question is: which model to use? MOD is a very useful website that can recommend a set of models based on the data type and size of the data. In our situation, *Decision Tree*, *Naive Bayes*, *k-NN*, *Random Forest*, *Deep Learning* and *Generalized Linear Model* are recommended. We can then compare the accuracy of those models using the *Compare ROCs* operator in RapidMiner.
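RapidMiner draws the ROC curves for us, but it can help to see what such a comparison measures. One common single-number summary of a ROC curve is the area under it (AUC), which equals the probability that a randomly chosen *pending* house receives a higher score than a randomly chosen *for sale* house. The sketch below computes that quantity directly; the model names and scores are made up for illustration and are not results from this post.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise ranking (ties count as half a win)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# 1 = pending, 0 = for sale; scores are hypothetical confidences for "pending".
y_true = [1, 1, 0, 0, 0]
model_scores = {
    "model_a": [0.9, 0.6, 0.7, 0.2, 0.1],
    "model_b": [0.8, 0.7, 0.3, 0.2, 0.1],
}

aucs = {name: roc_auc(y_true, s) for name, s in model_scores.items()}
best = max(aucs, key=aucs.get)
print(aucs, best)
```

An AUC of 0.5 corresponds to random guessing, while values very close to 1.0 on real data are usually a reason to check for leakage or overfitting.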

The comparison of different models

The results of *Decision Tree* and *Random Forest* are unrealistically accurate, followed by the *Rule Induction* and *Logistic Regression* models.

Here is the distribution of the predicted results from the *Deep Learning* model:

The result from the Deep Learning model

The *pending* houses are predicted as *pending* with higher confidence, and some *for sale* houses are also predicted as *pending* with relatively low confidence. Those *for sale* --> *pending* houses may have potential and are worth special investigation. Also, most of the *for sale* houses are predicted as *for sale*.
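Once the predictions are exported, picking out the houses worth a closer look is a simple filter. The sketch below assumes each prediction row carries the actual status, the predicted status and the model's confidence for *pending*; the field names and values are hypothetical.

```python
# Hypothetical exported predictions; field names and values are assumptions.
predictions = [
    {"id": "A", "actual": "pending", "predicted": "pending", "confidence_pending": 0.91},
    {"id": "B", "actual": "for sale", "predicted": "pending", "confidence_pending": 0.58},
    {"id": "C", "actual": "for sale", "predicted": "for sale", "confidence_pending": 0.12},
    {"id": "D", "actual": "for sale", "predicted": "pending", "confidence_pending": 0.66},
]


def candidates(rows):
    """Houses still for sale that the model labels pending, most confident first."""
    flagged = [
        r for r in rows
        if r["actual"] == "for sale" and r["predicted"] == "pending"
    ]
    return sorted(flagged, key=lambda r: r["confidence_pending"], reverse=True)


shortlist = [r["id"] for r in candidates(predictions)]
print(shortlist)
```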

Finally, we can apply our models to a specific house that we might be interested in. We also created a set of records for the same house at different prices to see whether a price decline influences the purchase decision.
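Such a set of records can be generated by copying the house and changing only the price. This is a minimal sketch: the attributes of the house and the discount steps are made up, and the real records would need every feature the models were trained on.

```python
def price_variants(house, discount_percents):
    """Copy the house once per discount, changing only the price (integer math)."""
    return [
        {**house, "price": house["price"] * (100 - pct) // 100}
        for pct in discount_percents
    ]


# Hypothetical house; only the price differs between the generated records.
house = {"address": "hypothetical", "beds": 3, "baths": 2, "price": 300000}
variants = price_variants(house, [0, 5, 10, 15])
prices = [v["price"] for v in variants]
print(prices)
```

Keeping every other feature fixed means any change in the predicted confidence can be attributed to the price alone.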

Apply the models

All the models predicted that this house, even at reduced prices, should always be *for sale*, namely that it should not be considered. Specifically, the *Decision Tree* model gives less than 28% confidence that this house will be sold, and the confidence is not sensitive to price.

The Decision Tree model prediction

The *Deep Learning* model and the *Generalized Linear Model* also don't think the house will be sold, but their confidence in that prediction decreases as the price goes down. Therefore, we can continue offering a lower price to see to what extent the computer thinks the deal is reasonable :)

The Deep Learning model prediction

The Generalized Linear Model prediction