Tarek Hoteit

Tarek's news and notes on computers and innovation

17 November 2016

Stanford Core NLP Sentiment Analysis

There is no shortage of natural language processing (NLP) algorithms on the Internet. Open source NLP libraries are available in nearly every major programming language. One of the major NLP frameworks is:

Stanford Core NLP

Stanford Core NLP includes a set of libraries for natural language analysis. These include a part-of-speech (POS) tagger, a named entity recognizer (NER), a statistical parser, a coreference detection system, a sentiment analyzer, a pattern-based extraction system, and other tools. Stanford Core NLP is licensed under the GNU General Public License. The Stanford NLP Group offers a number of other software packages, which you can find at Stanford Core NLP Software.

I personally used Stanford Core NLP sentiment analysis in my dissertation. The original code runs in Java and requires a training dataset. You can use the Stanford Sentiment Treebank, or you can train the framework on a treebank that you develop yourself as the training dataset for a specific research domain. To see the Stanford sentiment analyzer in action, check the Stanford Core NLP Live Demo. Instructions on how to retrain the model using your own dataset are available at Stanford Core NLP Sentiment Analysis code. To run the model that Stanford already trained on movie reviews:

 java -cp "*" -mx5g edu.stanford.nlp.sentiment.SentimentPipeline -file foo.txt   # foo.txt is some text file
 java -cp "*" -mx5g edu.stanford.nlp.sentiment.SentimentPipeline -stdin          # read text from command line input

Training the Stanford CoreNLP sentiment analysis model requires a dataset in Penn Treebank (PTB) format:

java -mx8g edu.stanford.nlp.sentiment.SentimentTraining -numHid 25 -trainPath train.txt -devPath dev.txt -train -model model.ser.gz

‘train.txt’ and, preferably, ‘dev.txt’ would be the standard training and development subsets of your dataset, for better supervised training. However, the text should be in PTB format, such as

(4 (4 (2 A) (4 (3 (3 warm) (2 ,)) (3 funny))) (3 (2 ,) (3 (4 (4 engaging) (2 film)) (2 .))))

where the number at the start of each parenthesized node is the sentiment annotation for that word or phrase, on a scale from 0 (very negative) to 4 (very positive). The Stanford Core NLP Java class PTBTokenizer can help you with tokenizing the text.
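To make the format concrete, here is a minimal Python sketch that reads one labeled tree and lists each word with its annotation. It is only an illustration, not part of Stanford Core NLP: the function names (tokenize, parse, leaves) are my own, and it assumes well-formed input with an integer label at the start of every node.

```python
import re

# The example tree from the text above.
EXAMPLE = ("(4 (4 (2 A) (4 (3 (3 warm) (2 ,)) (3 funny))) "
           "(3 (2 ,) (3 (4 (4 engaging) (2 film)) (2 .))))")


def tokenize(tree_text):
    """Split a PTB-style string into '(', ')' and atom tokens."""
    return re.findall(r"\(|\)|[^\s()]+", tree_text)


def parse(tokens):
    """Parse tokens into nested (label, value) tuples.

    value is a word (str) for a leaf, or a list of child nodes otherwise.
    Assumes well-formed input; this is a sketch, not a validator.
    """
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "(", "expected '('"
        pos += 1
        label = int(tokens[pos])  # sentiment annotation of this node
        pos += 1
        if tokens[pos] == "(":
            value = []
            while tokens[pos] == "(":
                value.append(node())
        else:
            value = tokens[pos]  # a single word
            pos += 1
        assert tokens[pos] == ")", "expected ')'"
        pos += 1
        return (label, value)

    return node()


def leaves(tree):
    """Return (word, label) pairs in sentence order."""
    label, value = tree
    if isinstance(value, str):
        return [(value, label)]
    pairs = []
    for child in value:
        pairs.extend(leaves(child))
    return pairs


tree = parse(tokenize(EXAMPLE))
print("sentence label:", tree[0])
for word, label in leaves(tree):
    print(label, word)
```

The outermost number is the label for the whole sentence, and the inner numbers label each phrase and word, which is why a treebank of your own needs an annotation at every level of the tree.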

Training the model with your dataset takes a good amount of time. I highly recommend not running the model training on a cloud VM instance. Instead, run it on a local machine. Training the model results in model.ser.gz, which will be used to perform sentiment analysis on new, unseen text.

In my application, I ran the model training on a local machine and then used Python to execute the Java class edu.stanford.nlp.sentiment.SentimentPipeline via a command pipeline for each input, which I had previously captured using a different Python algorithm and stored in my MySQL/Django database. I will not go into detail about the results of my implementation. I will leave that to another blog post.
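The Python-to-Java hand-off described above could look roughly like the sketch below. This is not my original code, only a minimal illustration: the helper names (build_command, run_sentiment) and the corenlp_dir parameter are hypothetical, it assumes Java is installed and that corenlp_dir is the folder holding the Stanford Core NLP jars, and it returns the pipeline's output lines as-is rather than assuming a particular output format.

```python
import subprocess

PIPELINE_CLASS = "edu.stanford.nlp.sentiment.SentimentPipeline"


def build_command(max_heap="5g"):
    """Build the same command line shown earlier, reading text from stdin.

    The "*" classpath is passed literally (no shell involved); Java itself
    expands it to the jars in the working directory.
    """
    return ["java", "-cp", "*", f"-mx{max_heap}", PIPELINE_CLASS, "-stdin"]


def run_sentiment(text, corenlp_dir, max_heap="5g"):
    """Send text to SentimentPipeline and return its non-empty output lines.

    corenlp_dir is a hypothetical parameter: the directory that contains
    the Stanford Core NLP jar files.
    """
    result = subprocess.run(
        build_command(max_heap),
        input=text,            # written to the Java process's stdin
        capture_output=True,   # collect stdout and stderr
        text=True,             # work with str instead of bytes
        cwd=corenlp_dir,       # so that "*" matches the Core NLP jars
        check=True,            # raise if Java exits with an error
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


# Example usage (needs Java and the Core NLP jars, so it is left commented out):
# for line in run_sentiment("A warm, funny, engaging film.", "/path/to/corenlp"):
#     print(line)
```

Starting a new Java process for each input reloads the models every time, so for a large number of database rows it is probably worth sending many inputs through a single process instead.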

Stanford CoreNLP is really cool. You will gain a lot of insight into natural language processing by leveraging such a model.

tags: phd