AI/ML supporting hometowns of international students: what, how and why?

Nearly everyone by now has heard of or used artificial intelligence (AI) and machine learning (ML) in some form or fashion. Some students are already publishing papers in the field while other students are applying various AI techniques in their research, internships, or just for fun. Professionals in industry have either incorporated some form of AI/ML into their products or services or are currently considering it. Either way, AI and ML have a lot to offer, but not without a good amount of data, significant processing power, the right skill sets, and a lot of patience with the design and execution of such projects. For that problem, AutoML is a promising new technique in the field that allows researchers and professionals to make use of pre-trained models and cloud-based services to roll out AI solutions much more rapidly than building machine learning models from scratch. AutoML provides the methods and processes to apply, integrate, deploy, and scale machine learning intelligence without requiring expert knowledge. Major AI platforms, starting with Google and followed by Microsoft, H2O.ai, and others, are positioning AutoML as the next evolutionary frontier in artificial intelligence, so that humans can spend zero time recreating machine learning models from scratch and instead focus on applying the models while letting machines take care of building them.
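
As a minimal illustration of the idea, here is a sketch of an AutoML run using H2O's open source Python API (one of the platforms mentioned above); the file name, target column, and settings below are placeholders, not a recommendation.

import h2o
from h2o.automl import H2OAutoML

h2o.init()

# Placeholder dataset and target column (illustrative only)
train = h2o.import_file("training_data.csv")
target = "label"
train[target] = train[target].asfactor()  # treat the target as categorical
features = [col for col in train.columns if col != target]

# Let AutoML search over candidate algorithms and ensembles automatically
aml = H2OAutoML(max_models=10, seed=1)
aml.train(x=features, y=target, training_frame=train)

# The leaderboard ranks the trained models; aml.leader is the best of them
print(aml.leaderboard.head())

The point is less the specific library than the workflow: you hand over the data and a target, and the platform handles model selection, tuning, and ensembling.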

References:

  • PwC (2017). “PwC’s Global Artificial Intelligence Study: Exploiting the AI Revolution.” https://www.pwc.com/gx/en/issues/data-and-analytics/publications/artificial-intelligence-study.html
  • National Foundation for American Policy (October 2017). “The Importance of International Students to American Science and Engineering.” http://nfap.com/wp-content/uploads/2017/10/The-Importance-of-International-Students.NFAP-Policy-Brief.October-20171.pdf

New Arabic NLP findings – An Arabic speech corpus and an Arabic Root Finder neural net

I published a new blog post on Arabic.Computer about two projects I found on the Internet that are useful for Arabic NLP initiatives. The first is the Arabic Speech Corpus, recorded in a Damascene accent, provided by Nawar Halabi and offered under a non-commercial license. The other is Arabic Root Finder, a useful Keras/Scikit-Learn neural network for finding Arabic word roots, offered by Tyler Boyd under a GPL-3 license.

Links and more information are available at the blog post.

Arabic-related Open Source Survey – Arabic-speaking developers requested to complete a survey

I am an Arab American software development professional who, besides his day-to-day job as Director of Technology at Thomson Reuters, is currently working on an academic paper about the opportunities and challenges of Arabic-related open source development initiatives. I am actively looking for Arabic-speaking developers to complete a short survey. The results will be part of an upcoming self-published paper that will be shared as an open-access scholarly publication and will be freely available to our whole community. If you speak Arabic and are involved in developing software or are associated with the technology industry in general, your help completing the survey is truly appreciated. Please access the survey at http://survey.tarek.computer

Arabic-related Open Source Survey

I am working on an academic paper about the Arab software developer community and open source. The purpose is to understand the level of interest and the challenges/opportunities encountered by Arab software developers in contributing to open source projects associated with the Arabic language. If you are an Arab and are involved with computer software, I would really appreciate your help in completing this short survey. Your response will be kept anonymous.

Link to the survey using Google Forms: https://goo.gl/forms/8AKzEAujvLe4UyOr1 

Thank you.

Friday web findings on machine learning

Friday Reads

Machine Learning in General

Technology in General

Recalling the PhD experience

I woke up at 5am this Saturday morning to do some research and work before the family wakes up. While doing some digital cleanup of my old Evernote notes, I came across my PhD work at Walden University, which I completed in 2015. At the time I created a Google Sites workspace, phd.hoteit.net, that I used to put together all my work, share progress with my supervisors, store drafts and to-do lists, and keep notes about the code I was working on. When I opened up the notes, everything about the PhD experience flashed back: the stress of writing with the thought of rejection by advisors, the obstacles faced when you thought the data you need for your dissertation was within reach when in reality it was not, so you had to work harder to get to it, or the realization that your original intuition about the research prior to writing the dissertation did not align with the outcome of the research.

But then you come to learn more about yourself, your thoughts, your discoveries, and even your own beliefs, regardless of what the research outcome ended up being. Developing and completing a dissertation is a personal feat that makes you a better person as an intellectual human being, with a little more wisdom than you had before the journey. I strongly recommend the journey, not just for the sake of contributing to the scholarly world but for the sake of personal human development, now that we are racing not only with ourselves as a human race but also against our own artificial intelligence creations.

My dissertation: Effects of Investor Sentiment Using Social Media on Corporate Financial Distress

GitHub: hoteit/finSentiment

Financial Sentiment Analysis

As part of my PhD dissertation at Walden University, I developed an application that analyzes the sentiment of tweets that include the stock symbols of publicly held firms in the United States and correlates the results with the financial data of those firms during the period of the research. At the time of the coding, between the fourth quarter of 2014 and the first quarter of 2015, I could not find ready-made tools that I could use for conducting the data analysis. Some companies offered solutions at very high cost, while other tools had limited capabilities. So I went ahead and stitched together various tools and coding techniques for my own research. The key steps and scripting tools used were as follows:

  • Used the Twitter APIs and the Tweepy library in my Python code to extract the relevant tweets from Twitter on a streaming basis (a minimal sketch follows this list).
  • Leveraged the Yahoo Developer Network from my Python code to extract the financial data of each publicly held firm in the United States.
  • Extracted the stock symbols of all publicly held companies in the United States from nasdaq.com using the pandas library in IPython.
  • Developed a portal using Python Django to help me train the machine learning system to recognize negative and positive sentiments.
  • Used the Stanford Core NLP Java modules, with the help of the trained data, to analyze the sentiments of all the tweets.
  • Used pandas in an IPython Notebook to conduct the data analysis.
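
Below is a minimal sketch of that streaming step, assuming the Tweepy 3.x StreamListener API that was current at the time; the credentials, cashtags, and handler body are placeholders, not the dissertation code.

import tweepy

# Placeholder credentials (illustrative only)
CONSUMER_KEY, CONSUMER_SECRET = "xxx", "xxx"
ACCESS_TOKEN, ACCESS_SECRET = "xxx", "xxx"

class StockTweetListener(tweepy.StreamListener):
    """Collect tweets that mention the tracked stock symbols."""

    def on_status(self, status):
        # In the real project each tweet was stored for later sentiment scoring.
        print(status.created_at, status.text)

    def on_error(self, status_code):
        # Returning False disconnects the stream on errors such as rate limiting.
        return False

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)

stream = tweepy.Stream(auth=auth, listener=StockTweetListener())
# Track a few example cashtags; the dissertation tracked all US-listed symbols.
stream.filter(track=["$AAPL", "$MSFT", "$GOOG"])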

The source code is available in the finSentiment repository on GitHub. In subsequent posts I will explain what the scripts do, which parts are relevant and which are not, and how the scripts can be used in other projects.

Stanford Core NLP Sentiment Analysis

There is no shortage of natural language processing (NLP) algorithms on the Internet, and open source NLP libraries are available in nearly every major programming language. A few of the major NLP frameworks are:

Stanford Core NLP

Stanford Core NLP includes a set of libraries for natural language analysis. These include a part-of-speech (POS) tagger, a named entity recognizer (NER), a statistical parser, a coreference resolution system, a sentiment analyzer, a pattern-based extraction system, and other tools. Stanford Core NLP is licensed under the GNU General Public License. The Stanford NLP Group offers a number of other software packages, which you can check out at Stanford Core NLP Software.
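
For Python users, a thin third-party wrapper such as pycorenlp can talk to a running CoreNLP server; the sketch below is only meant to show the shape of the call and assumes the server has already been started locally on port 9000.

from pycorenlp import StanfordCoreNLP

# Assumes a CoreNLP server is already running, e.g.:
#   java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000
nlp = StanfordCoreNLP("http://localhost:9000")

text = "The market reaction to the earnings report was surprisingly positive."
annotation = nlp.annotate(text, properties={
    "annotators": "tokenize,ssplit,pos,parse,sentiment",
    "outputFormat": "json",
})

# Each sentence carries a sentiment class and a numeric value on a 0-4 scale
for sentence in annotation["sentences"]:
    print(sentence["sentiment"], sentence["sentimentValue"])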

I personally used Stanford Core NLP sentiment analysis in my dissertation. The original code runs in Java and requires a training dataset. You can use the Stanford Sentiment Treebank, or you can have the framework train on a treebank that you develop as the training dataset for your specific research domain. To see the Stanford sentiment analyzer in action, check the Stanford Core NLP Live Demo. Source code and instructions on how to retrain the model using your own data are available in the Stanford Core NLP Sentiment Analysis code. To run the already trained Stanford model on movie reviews:

java -cp "*" -mx5g edu.stanford.nlp.sentiment.SentimentPipeline -file foo.txt  // foo.txt is some text file
java -cp "*" -mx5g edu.stanford.nlp.sentiment.SentimentPipeline -stdin  // text provided as command-line input

Training the Stanford CoreNLP sentiment analysis model requires a Penn Treebank (PTB) formatted dataset:

java -mx8g edu.stanford.nlp.sentiment.SentimentTraining -numHid 25 -trainPath train.txt -devPath dev.txt -train -model model.ser.gz

Here, ‘train.txt’ and ‘dev.txt’ (and preferably a separate test set as well) would be your standard subsets of the dataset for supervised training. However, the text must be in PTB format, such as

(4 (4 (2 A) (4 (3 (3 warm) (2 ,)) (3 funny))) (3 (2 ,) (3 (4 (4 engaging) (2 film)) (2 .))))

where the numbers are the sentiment annotations for each node of the tree (each word and each phrase), on a scale from 0 (very negative) to 4 (very positive). The Stanford Core NLP Java class PTBTokenizer can help you with tokenizing the text.
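
As a small illustration of the format, the sketch below pulls those per-node labels out of a PTB-formatted line with a regular expression; the helper function is my own, not part of CoreNLP.

import re

def node_labels(ptb_tree):
    """Return the sentiment label of every node in a PTB-format tree, root first."""
    # Every "(<digit>" opens a node whose annotation is that digit (0-4).
    return [int(label) for label in re.findall(r"\((\d)", ptb_tree)]

example = "(4 (4 (2 A) (4 (3 (3 warm) (2 ,)) (3 funny))) (3 (2 ,) (3 (4 (4 engaging) (2 film)) (2 .))))"
print(node_labels(example)[0])  # the root of this example is labelled 4 (positive)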

Training the model with your dataset takes a good amount of time. I highly recommend not running the model training on a cloud VM instance; instead, run it on a local machine. Training the model produces model.ser.gz, which is then used to perform sentiment analysis on new, untrained text.

In my application, I ran the model training on a local machine and then used Python to execute the Java class edu.stanford.nlp.sentiment.SentimentPipeline via a command pipeline for each input that I had previously captured with a separate Python script and stored in my MySQL/Django database. I will not go into the details of the results of my implementation here; I will leave that to another blog post.
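
A minimal sketch of that hand-off is below, assuming the CoreNLP jars sit in the working directory and reusing the same -stdin invocation shown above; the function name and the example text are placeholders.

import subprocess

def score_sentiment(text):
    """Pipe one captured tweet through the Stanford SentimentPipeline via stdin."""
    result = subprocess.run(
        ["java", "-cp", "*", "-mx5g",
         "edu.stanford.nlp.sentiment.SentimentPipeline", "-stdin"],
        input=text,
        capture_output=True,
        text=True,
        check=True,
    )
    # The pipeline writes its sentiment output for the input text to stdout.
    return result.stdout.strip()

print(score_sentiment("The quarterly results were much better than expected."))

Spawning a JVM per tweet is slow, but it mirrors the per-input command pipeline described above; a long-running CoreNLP server would be a faster alternative.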

Stanford CoreNLP is really cool. You will gain a lot of insight into natural language processing by leveraging such a model.