Monday, August 14, 2017

Use Machine Learning to Help House Purchase

Buying a house is a tedious process that involves numerous factors, e.g., location, year built, number of bedrooms and bathrooms, etc. Many websites, such as Trulia, provide detailed information about the houses on the market, including the status of each house: whether it is still available or already under contract. We can therefore use the house status as the label and see which other factors influence the purchase decision.

To start with, we need the data. Here is a Python script that can help you extract information about the houses in your area from Trulia with Beautiful Soup.

This is an example of what data you can get from Trulia:
The collected data in Access

The next step is to prepare the data for training. Because we are using RapidMiner, we have to import the collected data from the Access table into RapidMiner.

Prepare the data

In the data preparation step, we filtered out unnecessary information, such as the MLS ID and the URL of each house. We then changed the status "for sale by owner" to "for sale" so that only two statuses remain: for sale and pending (i.e., under contract). We set the house status as the label, namely the target we want to predict, and excluded house features that are not common, e.g., whether the house has a lawn or a pond.
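
The same preparation can be sketched outside RapidMiner. The snippet below is only an illustration in plain Python: the column names (`mls_id`, `url`, `status`) and the sample records are assumptions, not the actual fields of the collected table.

```python
# Columns treated as identifiers rather than predictive features (assumed names).
DROP_COLUMNS = {"mls_id", "url"}

# Map every raw status onto one of the two labels used for training.
STATUS_MAP = {
    "for sale": "for sale",
    "for sale by owner": "for sale",
    "pending": "pending",
}


def prepare_record(record):
    """Return a cleaned copy of one house record, or None if the status is unknown."""
    status = STATUS_MAP.get(record.get("status", "").strip().lower())
    if status is None:
        return None
    cleaned = {k: v for k, v in record.items() if k not in DROP_COLUMNS}
    cleaned["status"] = status
    return cleaned


# Made-up sample rows standing in for the collected table.
raw = [
    {"mls_id": "A1", "url": "http://example.com/1", "price": 250000,
     "bedrooms": 3, "status": "For Sale by Owner"},
    {"mls_id": "A2", "url": "http://example.com/2", "price": 310000,
     "bedrooms": 4, "status": "Pending"},
]
prepared = [r for r in (prepare_record(x) for x in raw) if r is not None]
print(prepared)
```

Records with a status outside the mapping are dropped here; whether that is the right choice depends on what other statuses appear in the real data.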

The next question is: which model should we use? MOD is a very useful website that recommends a set of models based on the data type and the size of the data. In our situation, Decision Tree, Naive Bayes, k-NN, Random Forest, Deep Learning, and Generalized Linear Model are recommended. We can then compare the accuracy of those models using the Compare ROCs operator in RapidMiner.
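
To make the idea of comparing models by their ROC curves concrete, here is a minimal sketch that computes the area under the ROC curve (AUC) by hand and ranks two models by it. It is not what the Compare ROCs operator does internally; the model names and confidence scores are invented for illustration.

```python
def roc_auc(labels, scores, positive="pending"):
    """AUC as the probability that a random positive example outranks a random negative one."""
    pos = [s for l, s in zip(labels, scores) if l == positive]
    neg = [s for l, s in zip(labels, scores) if l != positive]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# True labels and each model's confidence that a house is "pending" (made-up values).
labels = ["pending", "for sale", "pending", "for sale"]
model_scores = {
    "model_a": [0.9, 0.2, 0.7, 0.4],
    "model_b": [0.6, 0.7, 0.8, 0.1],
}

# Rank the models from highest to lowest AUC.
ranking = sorted(model_scores, key=lambda m: roc_auc(labels, model_scores[m]), reverse=True)
for name in ranking:
    print(name, roc_auc(labels, model_scores[name]))
```

An AUC close to 1.0 on the training data, as with the tree-based models discussed next, is usually a hint to check for overfitting on held-out data.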

The comparison of different models

The results of the Decision Tree and Random Forest models are unrealistically accurate, followed by the Rule Induction and Logistic Regression models.

Here is the distribution of the predicted results from the Deep Learning model:
The result from Deep Learning model
Most pending houses are predicted as pending with high confidence, and some for-sale houses are also predicted as pending with relatively low confidence. Those for sale --> pending houses may have potential and be worth a closer look. Most of the for-sale houses are predicted as for sale.

Finally, we can apply our models by feeding them a specific house that we might be interested in. We also created a set of records for the same house with different prices to see whether a price drop can influence the purchase decision.
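
Creating the price variants is straightforward to script. The sketch below copies one house record and lowers the price step by step; the field names, the starting price, and the discount steps are assumptions chosen for illustration.

```python
def price_variants(house, discounts=(0.0, 0.05, 0.10, 0.15)):
    """Return copies of one house record, each with the price reduced by a given fraction."""
    base = house["price"]
    variants = []
    for discount in discounts:
        record = dict(house)  # copy so the original record stays untouched
        record["price"] = round(base * (1 - discount))
        variants.append(record)
    return variants


# Hypothetical house of interest.
house = {"price": 300000, "bedrooms": 3, "bathrooms": 2}
variants = price_variants(house)
for v in variants:
    print(v)
```

Each variant can then be scored by the trained models to see how the predicted status and confidence change with price.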

Apply the models

All the models predicted that this house, even at reduced prices, would remain for sale, namely that it should not be considered. Specifically, the Decision Tree model gives less than 28% confidence that this house will be sold, and the confidence is not sensitive to price.

The Decision Tree model prediction
The Deep Learning model and the Generalized Linear Model also don't think the house will be sold, but their confidence decreases as the price goes down. Therefore, we can continue offering a lower price to see to what extent the computer thinks that the deal is reasonable :)

The Deep Learning model prediction

The Generalized Linear model prediction

Wednesday, November 11, 2015

Visualize Twitter Online Interactions in Social Network

Twitter users can mention or reply to other Twitter users in a tweet. Visualizing those online interactions on Twitter provides a better understanding of human online behavior, such as identifying the most important information sources and how information is transmitted online.

To visualize Twitter users' online interactions, we have to extract tweets and separate the mentioned/replied users from the tweet text. This Python script uses the Twitter REST API to extract the mentioned/replied users from tweets and store the results in a CSV file. To make the most of the REST API, the since_id and max_id parameters are used to retrieve more tweets from a single timeline.
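
The paging logic behind since_id and max_id can be sketched without touching the real API. In the snippet below, `fetch_page` is a hypothetical callable standing in for a timeline request, and the page size of 200 and limit of 16 pages are assumptions about the API limits; the fake timeline exists only so the example runs on its own.

```python
def collect_timeline(fetch_page, since_id=None, page_size=200, max_pages=16):
    """Walk a timeline backwards, page by page.

    fetch_page(count, max_id, since_id) is a hypothetical callable that returns
    tweets (dicts with an integer "id"), newest first.
    """
    tweets = []
    max_id = None
    for _ in range(max_pages):
        page = fetch_page(count=page_size, max_id=max_id, since_id=since_id)
        if not page:
            break
        tweets.extend(page)
        # Ask the next request for tweets strictly older than the oldest one seen.
        max_id = min(t["id"] for t in page) - 1
    return tweets


# Fake timeline with ids 1000 down to 1, used instead of a real API call.
FAKE_TIMELINE = [{"id": i} for i in range(1000, 0, -1)]


def fake_fetch(count, max_id=None, since_id=None):
    eligible = [
        t for t in FAKE_TIMELINE
        if (max_id is None or t["id"] <= max_id)
        and (since_id is None or t["id"] > since_id)
    ]
    return eligible[:count]


everything = collect_timeline(fake_fetch)
only_new = collect_timeline(fake_fetch, since_id=900)
print(len(everything), len(only_new))
```

Subtracting one from the smallest id avoids fetching the oldest tweet of each page twice, and since_id lets a later run pick up only tweets newer than the last one already stored.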

Once the mentioned/replied users are separated, we can use NetworkX to create a network graph in which the mentioned/replied users and the author of each tweet are connected. The resulting network can be exported to Gephi for further visualization and analysis. A simple script is available here.
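
As a rough, library-free illustration of the same idea, the sketch below pulls mentions out of tweet text with a regular expression and writes a weighted edge list in the Source/Target/Weight CSV layout that Gephi can import. It uses only the standard library rather than NetworkX, and the sample tweets and user names are made up.

```python
import csv
import io
import re
from collections import Counter

# "@" followed by up to 15 word characters, not preceded by a word character
# (so e-mail addresses are skipped).
MENTION_RE = re.compile(r"(?<![A-Za-z0-9_])@([A-Za-z0-9_]{1,15})")


def mention_edges(tweets):
    """Count directed author -> mentioned-user edges from (author, text) pairs."""
    edges = Counter()
    for author, text in tweets:
        for mentioned in MENTION_RE.findall(text):
            if mentioned.lower() != author.lower():  # ignore self-mentions
                edges[(author.lower(), mentioned.lower())] += 1
    return edges


def edges_to_csv(edges):
    """Serialize the edges as a Source,Target,Weight edge list."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    for (source, target), weight in sorted(edges.items()):
        writer.writerow([source, target, weight])
    return buffer.getvalue()


tweets = [
    ("alice", "@bob have you seen what @carol posted?"),
    ("bob", "@alice yes, thanks! cc @carol"),
    ("alice", "Thanks @bob. Mail me at alice@example.com"),
]
edges = mention_edges(tweets)
print(edges_to_csv(edges))
```

The edge weight is simply the number of times one user mentioned another, which Gephi can use for edge thickness or weighted measures.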

Saturday, October 10, 2015

Extract Historic Tweets

Extracting historic Twitter data is always problematic, because the Twitter REST API can only retrieve 3,200 of a user’s most recent tweets. This means that if you want to find what YouTube tweeted between 2013 and 2014, the task is almost impossible using only the REST or Streaming API.

However, Twitter Advanced Search provides historic tweets based on a user-defined query, including the time a tweet was posted. For example, this link from Twitter Advanced Search gives you the full list of tweets that YouTube posted between 2013 and 2014.

Since the parameters in the request URL of Twitter Advanced Search can be customized, it is possible to extract historic tweet information by sending requests to Twitter Advanced Search and extracting the tweet information from the returned web pages.
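
A customized request URL might be built along these lines. This is a sketch under assumptions: the base URL and the `q` parameter are how the search page was commonly addressed, the `from:`, `since:`, and `until:` operators follow Twitter's search syntax, and the dates are only an example.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed base address of the search page; adjust if the site uses a different one.
SEARCH_BASE = "https://twitter.com/search"


def build_search_url(screen_name, since, until):
    """Build a search URL for one account's tweets within a date range (YYYY-MM-DD)."""
    query = "from:{} since:{} until:{}".format(screen_name, since, until)
    return SEARCH_BASE + "?" + urlencode({"q": query})


url = build_search_url("YouTube", "2013-01-01", "2014-01-01")
print(url)

# Decode the query again to check that it round-trips.
decoded = parse_qs(urlsplit(url).query)["q"][0]
print(decoded)
```

Keeping the date range in function arguments makes it easy to split a long period into shorter windows and request each one separately.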

Here is a simple Python script that can extract the historic tweet IDs for specific Twitter users. The logic is straightforward:

  1. Customize the URL to request the historic tweets of specific Twitter users during a defined time period;
  2. Get the response page from Twitter Advanced Search;
  3. If the returned results fit on one page, use the BeautifulSoup package to extract the tweet IDs from the web page;
  4. If the returned results are longer than one page, use Selenium to scroll down the web page to get all the tweet IDs;
  5. Use the Twitter Search API to extract the tweet contents from the collected tweet IDs.
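
The ID-extraction step (steps 3 and 4) can be illustrated with a small parser. To stay self-contained, this sketch uses the standard library's `html.parser` as a stand-in for BeautifulSoup, and it assumes each tweet element carries a `data-tweet-id` attribute; both the attribute name and the sample HTML are assumptions, so inspect the real page before relying on them.

```python
from html.parser import HTMLParser


class TweetIdParser(HTMLParser):
    """Collect the values of one attribute (assumed to hold the tweet ID)."""

    def __init__(self, attribute="data-tweet-id"):
        super().__init__()
        self.attribute = attribute
        self.tweet_ids = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # Skip IDs already seen, since scrolling can repeat elements.
            if name == self.attribute and value and value not in self.tweet_ids:
                self.tweet_ids.append(value)


def extract_tweet_ids(html):
    parser = TweetIdParser()
    parser.feed(html)
    parser.close()
    return parser.tweet_ids


# Made-up page fragment standing in for the returned search results.
sample = """
<ol>
  <li data-tweet-id="111"><p>first</p></li>
  <li data-tweet-id="222"><p>second</p></li>
  <li data-tweet-id="111"><p>duplicate after scrolling</p></li>
</ol>
"""
ids = extract_tweet_ids(sample)
print(ids)
```

The same function works on the page source obtained after scrolling, because it only needs the final HTML as a string.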

Thursday, March 26, 2015

Twitter Data Acquisition in Python

 Install Python at

 Create a Twitter Application
1)      Register a Twitter Application at
2)      After you have successfully created a Twitter Application, write down your CONSUMER_KEY, CONSUMER_SECRET, Access_TOKEN, and Access_TOKEN_SECRET.
 Install necessary Python libraries
1)      Go to, download the;
2)      Add Installation Folder/Python27/Scripts to the Path Variable in My Computer/ Properties/ Advanced system settings/ System Environment Variables
3)      Right click on the downloaded , choose Edit with IDLE, Run … Run Module (F5)
4)      Go to Windows/Start, in the Search programs and files type cmd
5)      In the pop-up window, type pip install twitter, the twitter library will be installed automatically.

6)      Type pip install dbf, the dbf library will be installed automatically
7)    Download the and at

Create dbf table: 

 Right click on the file, choose Edit with IDLE, Run … Run Module (F5), a dbf table named Tweet will be created

Customize Python script:
    Open the file with IDLE (right click on the file, choose Open with IDLE)
    Fill in your CONSUMER_KEY, CONSUMER_SECRET, OAUTH(Access)_TOKEN, and OAUTH(Access)_TOKEN_SECRET in the OAUTH section

In the define query section, modify the following parameters:
1)      q: the text that the tweets returned by the REST API must contain
2)      count: the maximum number of tweets returned by the REST API
3)      lang: the language of the tweets returned by the REST API
4)      geocode: the latitude, longitude, and radius of the area from which the REST API collects tweets
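
As an illustration of how these four parameters fit together, the sketch below assembles them into a dictionary. The helper names, the search term, and the coordinates are made up; the `latitude,longitude,radius` string with a `mi` or `km` unit and the upper bound of 100 for count are assumptions about the search endpoint's expected format.

```python
def make_geocode(latitude, longitude, radius, unit="mi"):
    """Format a search area as "latitude,longitude,radius" with a mi or km unit."""
    if unit not in ("mi", "km"):
        raise ValueError("unit must be 'mi' or 'km'")
    if not -90 <= latitude <= 90 or not -180 <= longitude <= 180:
        raise ValueError("coordinates out of range")
    return "{},{},{}{}".format(latitude, longitude, radius, unit)


def build_query(q, count=100, lang="en", geocode=None):
    """Collect the search parameters in one dictionary."""
    if not 1 <= count <= 100:  # assumed per-request limit of the search endpoint
        raise ValueError("count must be between 1 and 100")
    params = {"q": q, "count": count, "lang": lang}
    if geocode is not None:
        params["geocode"] = geocode
    return params


# Example: English tweets containing "flu" within one mile of a made-up point.
query = build_query("flu", count=100, lang="en",
                    geocode=make_geocode(37.781157, -122.39872, 1))
print(query)
```

With the twitter library, a dictionary like this would typically be passed as keyword arguments to the search call on an authenticated client.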

Collect Tweets:
Right click on the customized, choose Edit with IDLE, Run … Run Module (F5)
    Open the Tweet.dbf in Excel to view the collected tweets.

Wednesday, February 4, 2015

Calculate Spatial Importance of Road Network in ArcGIS

A recent study found that the Random Walk algorithm can be used to rank the spatial importance of road networks. The basic idea is that, by simulating a person walking randomly through a road network, the road segments or intersections that are walked through many times are considered spatially important. This spatial importance is evidenced by its close correlation with socio-economic characteristics of the surrounding urban areas structured by the road network, e.g., population density, job density, or even house prices. More details can be found in this article: The Random Walk Value for Ranking Spatial Characteristics in Road Networks.
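
The basic idea can be illustrated with a few lines of Python. This is a simplified sketch, not the algorithm from the article or the ArcGIS tool: it has no loop threshold, the toy network and its weights are invented, and the walker simply picks the next road in proportion to its weight.

```python
import random
from collections import Counter

# Toy undirected road network: node -> {neighbour: edge weight}.
# Node names and weights are made up for illustration.
ROADS = {
    "A": {"B": 1.0},
    "B": {"A": 1.0, "C": 2.0, "D": 1.0},
    "C": {"B": 2.0, "D": 1.0},
    "D": {"B": 1.0, "C": 1.0, "E": 1.0},
    "E": {"D": 1.0},
}


def simulate_walks(roads, walks=1000, steps=20, seed=0):
    """Run weighted random walks and count how often each edge and node is visited."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    nodes = sorted(roads)
    edge_counts = Counter()
    node_counts = Counter()
    for _ in range(walks):
        current = rng.choice(nodes)  # each walk starts at a random node
        for _ in range(steps):
            neighbours = list(roads[current])
            if not neighbours:
                break
            weights = [roads[current][n] for n in neighbours]
            nxt = rng.choices(neighbours, weights=weights)[0]
            edge_counts[frozenset((current, nxt))] += 1
            node_counts[nxt] += 1
            current = nxt
    return edge_counts, node_counts


edge_counts, node_counts = simulate_walks(ROADS)
for edge, visits in edge_counts.most_common():
    print(sorted(edge), visits)
```

The visit counts play the role of the spatial importance score: segments and junctions that the walker crosses more often rank higher.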

An ArcGIS tool has been devised to implement this Random Walk simulation.

Four functions are provided in this ArcGIS Tool:
  1. Construct graph object

    Open a road network shapefile in ArcGIS. Open the Construct Graph Network tool in the ArcGIS toolbox and select the road shapefile as the edge layer. The weight field can be any numerical attribute of the road network, such as width or design speed. In the node layer field, select the nodes that will be included in the random walk simulation, such as bus stops. Select the X and Y coordinate fields of the node layer, and define the output folder and network name. This tool will create a graph object.
    If you don't have specific road nodes, or you want to include all the road junctions in the random walk simulation, you can create road nodes by adding a network dataset of the road shapefile in ArcGIS.
  2. Simulate random walk

    Open the Calculate Random Walking Value tool in the ArcGIS toolbox and select the created graph in the Network File field. Define the output folder, field name, threshold of loop value, weight, and simulation method of the random walk simulation. The definitions of those parameters can be found in The Random Walk Value for Ranking Spatial Characteristics in Road Networks.

    The random walk simulation may take several minutes. After the calculation is done, you can import the calculated edge and node shapefiles into ArcGIS. A wlk file recording the walking paths is also created in the defined folder.

  3. Visualize random walk paths

    You can visualize the simulated random walk paths in ArcGIS by using the Check Random Walking Paths tool. In the Check Random Walking Paths tool, select the wlk file created in step 2, define how many walking paths you want to check, and define the output folder if you want to visualize those walking paths as shapefiles.
  4. Calculate other network measures of road networks (using Networkx)

    This tool can also calculate other network measures such as PageRank, betweenness, and closeness. To do that, open the Calculate PageRank Value tool in the ArcGIS toolbox, select the network graph created in step 1, and define the output folder for the network calculation. The network measures will be saved in a table in the defined folder.
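
For readers curious about what a measure like PageRank computes, here is a minimal power-iteration sketch in plain Python. It is not the tool's implementation, which relies on NetworkX; the toy network, the damping factor of 0.85, and the convergence tolerance are assumptions chosen for illustration.

```python
def pagerank(adjacency, damping=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank; adjacency maps each node to a list of neighbours."""
    nodes = sorted(adjacency)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Rank held by nodes without neighbours is spread evenly over all nodes.
        dangling = sum(rank[v] for v in nodes if not adjacency[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            neighbours = adjacency[v]
            if neighbours:
                share = damping * rank[v] / len(neighbours)
                for w in neighbours:
                    new_rank[w] += share
        change = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if change < tol:
            break
    return rank


# Toy undirected road network with made-up junction names.
junctions = {
    "A": ["B"],
    "B": ["A", "C", "D"],
    "C": ["B", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}
ranks = pagerank(junctions)
for node in sorted(ranks, key=ranks.get, reverse=True):
    print(node, round(ranks[node], 4))
```

With NetworkX installed, the equivalent result for a graph object `G` would normally come from `networkx.pagerank(G)`.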