Recalling the PhD experience

Woke up at 5am this Saturday morning for some research and work before the family wakes up. While doing some digital cleanup of my old Evernote notes, I came across my PhD work at Walden University, which I completed in 2015. At the time I created a Google Sites workspace, phd.hoteit.net, that I used to put together all my work, share progress with my supervisors, store drafts and to-do lists, and keep notes about the code I was working on. When I opened up the notes, everything about the PhD experience flashed back: the stress of writing under the thought of rejection by advisors, the obstacles you face when you believe the data you need for your dissertation is within reach when in reality it is not, so you have to work harder to get to it, or the realization that your original intuition about the research did not align with the outcome the research actually produced.

But then you come to learn more about yourself, your thoughts, your discoveries, and even your own beliefs, regardless of what the research outcome ends up being. Developing and completing a dissertation is a personal feat that leaves you a better intellectual human being, with a little more wisdom than you had before the journey. I strongly recommend the journey, not just for the sake of contributing to the scholarly world but for the sake of personal human development, now that we are racing not only against ourselves as a human race but also against our own artificial intelligence creations.

My dissertation: Effects of Investor Sentiment Using Social Media on Corporate Financial Distress

GitHub: hoteit/finSentiment

Financial Sentiment Analysis

As part of my PhD dissertation at Walden University, I developed an application that analyzed the sentiment of tweets mentioning the stock symbols of publicly held firms in the United States and correlated the results with the financial data of those firms during the research period. At the time of the coding, between the fourth quarter of 2014 and the first quarter of 2015, I could not find ready-made tools that I could use to conduct the data analysis. Some companies offered solutions at very high cost, while other tools had limited capabilities. So I stitched together various tools and coding techniques for my own research. The key steps and scripting tools were as follows:

  • Used the Twitter APIs and the Tweepy library in my Python code to extract the relevant tweets from Twitter on a streaming basis (a minimal sketch follows after this list)
  • Leveraged the Yahoo Developer Network from my Python code to extract the financial data of each publicly held firm in the United States
  • Extracted the stock symbols of all publicly held companies in the United States by pulling the listing data from nasdaq.com with Pandas in IPython
  • Developed a portal in Python Django to help me train the machine learning system to recognize negative and positive sentiment
  • Used the Stanford Core NLP Java modules, together with the trained model, to analyze the sentiment of all the tweets
  • Used Pandas in an IPython Notebook to conduct the data analysis
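For the streaming step, the shape of the code was roughly the following. This is a minimal sketch rather than the exact dissertation code, assuming the pre-v4 Tweepy StreamListener API that was current around 2014-2015; the credentials are placeholders, and the real symbol list came from the NASDAQ extract rather than the three hard-coded tickers shown here.

import tweepy

CONSUMER_KEY, CONSUMER_SECRET = "...", "..."
ACCESS_TOKEN, ACCESS_SECRET = "...", "..."

class StockTweetListener(tweepy.StreamListener):
    def on_status(self, status):
        # Each status is a tweet matching one of the tracked stock symbols;
        # the dissertation code stored these in a MySQL/Django database.
        print(status.id_str, status.text)

    def on_error(self, status_code):
        # Returning False disconnects the stream, e.g. on rate limiting.
        return False

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)

stream = tweepy.Stream(auth=auth, listener=StockTweetListener())
stream.filter(track=["$AAPL", "$MSFT", "$GOOG"])  # placeholder tickers

Note that Tweepy 4.x later folded StreamListener into tweepy.Stream, so the snippet above would need adjusting for current versions of the library.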

The source code is available in finSentiment on GitHub. In subsequent posts I will explain what the scripts do, which are relevant and which are not, and how the scripts can be used in other projects.

Stanford Core NLP Sentiment Analysis

There is no shortage of natural language processing (NLP) algorithms on the Internet, and open source NLP libraries are available in nearly every major programming language. A few of the major NLP frameworks are:

Stanford Core NLP

Stanford Core NLP includes a set of libraries for natural language analysis. These include a part-of-speech (POS) tagger, a named entity recognizer (NER), a statistical parser, a coreference resolution system, a sentiment analyzer, a pattern-based information extraction system, and other tools. Stanford Core NLP is licensed under the GNU General Public License. The Stanford NLP Group offers a number of other software packages that you can check out at Stanford Core NLP Software.

I personally used Stanford Core NLP sentiment analysis in my dissertation. The original code runs in Java and requires a training dataset. You can use the Stanford Sentiment Treebank, or you can train the framework on a treebank that you develop yourself as the training model for your specific research domain. To see the Stanford sentiment analyzer in action, check the Stanford Core NLP Live Demo. Source code and instructions on how to retrain the model using your own dataset are available in the Stanford Core NLP Sentiment Analysis code. To run the already trained Stanford model (built on movie reviews):

java -cp "*" -mx5g edu.stanford.nlp.sentiment.SentimentPipeline -file foo.txt  //foo.txt is some text file)
 java -cp "*" -mx5g edu.stanford.nlp.sentiment.SentimentPipeline -stdin  //as command line input)

Training the Stanford CoreNLP sentiment analysis model requires a dataset in Penn Treebank (PTB) format:

java -mx8g edu.stanford.nlp.sentiment.SentimentTraining -numHid 25 -trainPath train.txt -devPath dev.txt -train -model model.ser.gz

Here 'train.txt' and 'dev.txt' are the standard training and development subsets of your dataset, used for supervised training. The text must be in PTB format, such as:

(4 (4 (2 A) (4 (3 (3 warm) (2 ,)) (3 funny))) (3 (2 ,) (3 (4 (4 engaging) (2 film)) (2 .))))

where the numbers are the sentiment annotations for each node (each word and phrase) of the parse tree, on a scale from 0 (very negative) to 4 (very positive). The Stanford Core NLP Java class PTBTokenizer can help you tokenize the text.
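For example, PTBTokenizer can be invoked directly from the command line (a minimal invocation; note that it only tokenizes, so you still need to parse and label the sentences to produce the annotated trees shown above):

java -cp "*" edu.stanford.nlp.process.PTBTokenizer foo.txt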

Training the model with your dataset takes a good amount of time. I highly recommend not running the model training on a cloud VM instance; instead, run it on a local machine. Training produces a model.ser.gz file that can then be used to perform sentiment analysis on new, unseen text, for example:

java -cp "*" -mx5g edu.stanford.nlp.sentiment.SentimentPipeline -sentimentModel model.ser.gz -file foo.txt

In my application, I ran the model training on a local machine and then used Python to execute the Java class edu.stanford.nlp.sentiment.SentimentPipeline through a command pipeline for each input that I had previously captured with a separate Python script and stored in my MySQL/Django database. I will not go into the details of the results of my implementation here; I will leave that for another blog post.
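The glue between Python and the Java pipeline looked roughly like the following. This is a minimal sketch rather than the exact dissertation code: the CoreNLP directory, the model path, and the helper name are illustrative assumptions.

import subprocess

def classify_sentiment(text, corenlp_dir="stanford-corenlp", model="model.ser.gz"):
    # Pipe one piece of text into the Java SentimentPipeline over stdin and
    # return whatever the pipeline prints (the sentence-level sentiment labels).
    cmd = [
        "java", "-cp", corenlp_dir + "/*", "-mx5g",
        "edu.stanford.nlp.sentiment.SentimentPipeline",
        "-sentimentModel", model,  # the model trained earlier with SentimentTraining
        "-stdin",
    ]
    result = subprocess.run(cmd, input=text, capture_output=True, text=True)
    return result.stdout.strip()

print(classify_sentiment("$AAPL earnings beat expectations, great quarter"))

Spawning one JVM per tweet is slow, so for large batches it is better to write the captured tweets to a file and run the pipeline once with -file instead of -stdin.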

Stanford CoreNLP is really cool, and you will gain a lot of insight into natural language processing by leveraging such a model.