As part of my Ph.D. dissertation at Walden University, I developed an application to analyze the sentiment of tweets associated with publicly held firms in the United States and correlate the results with the financial status of those companies at the time of the study. At the time of coding, between the fourth quarter of 2014 and the first quarter of 2015, I could not find ready-made tools to conduct the data analysis. Some companies offered costly solutions, while other tools had limited capabilities. So I stitched together various tools and coding techniques for my research. The key steps and scripting tools were as follows:

  • Used the Twitter APIs and the Tweepy library in my Python code to extract the relevant tweets from Twitter on a streaming basis (a minimal streaming sketch follows this list).
  • Leveraged the Yahoo Developer Network to extract the financial data of every publicly held firm in the United States.
  • Extracted the stock symbols of all publicly held companies in the United States from nasdaq.com using IPython and the pandas library (see the pandas sketch after this list).
  • Developed a portal using Python Django to help me train the machine learning system to recognize negative and positive sentiments.
  • Used the Stanford Core NLP Java modules, together with the trained data, to analyze the sentiment of all the tweets.
  • Used pandas in an IPython Notebook to conduct the data analysis.
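
For illustration, here is a minimal sketch of the streaming-collection step using Tweepy's older StreamListener interface (Tweepy 3.x, roughly the era of this project; newer Tweepy versions changed this API). The credentials and tracked ticker symbols are placeholders, and this is an approximation of the approach rather than the dissertation code itself.

import tweepy

# Placeholder credentials; substitute your own Twitter API keys.
CONSUMER_KEY, CONSUMER_SECRET = "xxx", "xxx"
ACCESS_TOKEN, ACCESS_SECRET = "xxx", "xxx"

class TweetCollector(tweepy.StreamListener):
    """Collect tweets from the streaming API as they arrive."""

    def on_status(self, status):
        # In the real application each tweet would be stored for later sentiment analysis.
        print(status.text)

    def on_error(self, status_code):
        # Returning False disconnects the stream (e.g. when rate limited).
        return False

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)

stream = tweepy.Stream(auth=auth, listener=TweetCollector())
# Track cashtags of a few example companies.
stream.filter(track=["$AAPL", "$MSFT", "$GOOG"])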

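The stock-symbol step can be sketched just as briefly with pandas. The file name and the Symbol column below refer to the company-list CSV that nasdaq.com offered for download at the time; treat them as placeholders for whatever source you use.

import pandas as pd

# Placeholder: "companylist.csv" stands for the downloadable company list from nasdaq.com.
companies = pd.read_csv("companylist.csv")

# Keep only clean, unique ticker symbols.
symbols = (
    companies["Symbol"]
    .dropna()
    .str.strip()
    .drop_duplicates()
    .tolist()
)
print(len(symbols), symbols[:10])
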
The source code is available on finSentiment, and the dissertation is available at Effects of Investor Sentiment Using Social Media on Corporate Financial Distress.

Stanford Core NLP

Stanford Core NLP includes a set of libraries for natural language analysis. It consists of a part-of-speech (POS) tagger, a named entity recognizer (NER), a statistical parser, a coreference resolution system, a sentiment analyzer, a pattern-based information extraction system, and other tools. Stanford Core NLP is licensed under the GNU General Public License. The Stanford NLP Group also offers additional software on its Stanford Core NLP Software website.
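
As a rough illustration of how these annotators fit together, the sketch below shells out to the standard StanfordCoreNLP command-line pipeline from Python and reads back its JSON output. The classpath, memory setting, and sample sentence are placeholders, and the exact JSON structure depends on the CoreNLP version you run.

import json
import os
import subprocess
import tempfile

def annotate(text, classpath="*"):
    """Run the CoreNLP command-line pipeline on a piece of text and return its JSON output."""
    out_dir = tempfile.mkdtemp()
    input_path = os.path.join(out_dir, "input.txt")
    with open(input_path, "w") as f:
        f.write(text)
    subprocess.run(
        [
            "java", "-cp", classpath, "-Xmx4g",
            "edu.stanford.nlp.pipeline.StanfordCoreNLP",
            "-annotators", "tokenize,ssplit,pos,lemma,ner,parse,sentiment",
            "-outputFormat", "json",
            "-outputDirectory", out_dir,
            "-file", input_path,
        ],
        check=True,
    )
    # CoreNLP appends ".json" to the input file name when writing its output.
    with open(input_path + ".json") as out:
        return json.load(out)

print(annotate("The quarterly results were a pleasant surprise."))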

My dissertation used the Stanford Core NLP sentiment analysis component. The original code runs in Java and requires a training dataset. You can use the Stanford Sentiment Treebank, or you can train the framework on a treebank that you develop yourself for your specific research domain.

To see Stanford's sentiment analyzer in action, check the Stanford Core NLP Live Demo. The source code and instructions on how to retrain the model using your own dataset are available on the Stanford Core NLP Sentiment Analysis code page. A pretrained model, trained on movie reviews, is available on the site.

 java -cp "*" -mx5g edu.stanford.nlp.sentiment.SentimentPipeline -file foo.txt   # analyze a text file (foo.txt)
 java -cp "*" -mx5g edu.stanford.nlp.sentiment.SentimentPipeline -stdin          # read text from standard input

Training the Stanford CoreNLP sentiment analysis model requires a dataset in Penn Treebank (PTB) format:

java -mx8g edu.stanford.nlp.sentiment.SentimentTraining -numHid 25 -trainPath train.txt -devPath dev.txt -train -model model.ser.gz

Here, 'train.txt' is the training subset of your dataset and 'dev.txt' is a held-out development subset used to validate the supervised training. Both files must be in PTB format, such as:

(4 (4 (2 A) (4 (3 (3 warm) (2 ,)) (3 funny))) (3 (2 ,) (3 (4 (4 engaging) (2 film)) (2 .))))

where the numbers are sentiment labels (from 0 for very negative to 4 for very positive) attached to each node of the parse tree. The Stanford Core NLP Java class PTBTokenizer can help you tokenize the text.
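
As a quick sanity check on such a file, the short sketch below reads one tree per line and tallies the root (sentence-level) sentiment labels; the file name train.txt is simply the example used above.

from collections import Counter

def root_label(tree_line):
    """Return the root sentiment label of one PTB-style tree, e.g. '(4 (4 ...' -> '4'."""
    stripped = tree_line.strip()
    # The root label is the token immediately after the first opening parenthesis.
    return stripped[1:].split(None, 1)[0].rstrip(")")

def label_distribution(path):
    """Count sentence-level sentiment labels in a PTB-format file (one tree per line)."""
    counts = Counter()
    with open(path) as f:
        for line in f:
            if line.strip():
                counts[root_label(line)] += 1
    return counts

print(label_distribution("train.txt"))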

Training the model with your dataset takes a good amount of time. I highly recommend not running the model training on a cloud VM instance; run it on a local machine instead. Training produces model.ser.gz, which is then used to perform sentiment analysis on new, unseen text.

In my application, I trained the model on a local machine and then ran the Java class edu.stanford.nlp.sentiment.SentimentPipeline over the data previously captured by scraping Twitter, storing the results in a MySQL/Django database.
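
As an illustration of how a Python application can drive the Java side, here is a minimal sketch that pipes tweet texts to SentimentPipeline through its -stdin mode, as shown earlier, and captures whatever the pipeline prints. The tweet list and classpath are placeholders, and this is only an approximation of the approach described above, not the code used in the dissertation; parsing the output and writing it to the MySQL/Django store is left out.

import subprocess

def analyze_sentiments(texts, classpath="*"):
    """Feed one tweet per line to Stanford CoreNLP's SentimentPipeline via stdin.

    Returns the raw stdout of the pipeline; the exact output format depends on
    the pipeline's command-line options, so parse it to match your configuration.
    """
    cmd = [
        "java", "-cp", classpath, "-mx5g",
        "edu.stanford.nlp.sentiment.SentimentPipeline", "-stdin",
    ]
    result = subprocess.run(
        cmd,
        input="\n".join(texts),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout

# Hypothetical tweets; in the real application these came from the captured Twitter data.
tweets = ["$AAPL keeps climbing, great quarter!", "Not impressed with the latest earnings."]
print(analyze_sentiments(tweets))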

I will not go into detail here about the results of my implementation; I will leave that for another blog post.

Stanford CoreNLP is cool, and you will gain many insights into natural language processing by working with it.