Twitter Sentiment Analysis using machine learning




Twitter Sentiment Analysis


1 Problem Statement

Twitter is a popular social networking website where members create and interact with messages known as “tweets”. This serves as a means for individuals to express their thoughts or feelings about different subjects. Various parties, such as consumers and marketers, have performed sentiment analysis on such tweets to gather insights into products or to conduct market analysis. Furthermore, recent advances in machine learning algorithms make it possible to improve the accuracy of sentiment analysis predictions.

In this report, we attempt to conduct sentiment analysis on tweets using various machine learning algorithms. We classify the polarity of each tweet as either positive or negative. If a tweet has both positive and negative elements, the more dominant sentiment should be picked as the final label.
We use a dataset from Kaggle which was crawled and labeled positive/negative. The data provided comes with emoticons, usernames and hashtags, which need to be processed and converted into a standard form. We also need to extract useful features from the text, such as unigrams and bigrams, which serve as a representation of the tweet. We then use various machine learning algorithms to conduct sentiment analysis using the extracted features.
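To make the idea of unigram and bigram features concrete, here is a minimal sketch in Python. It assumes simple lowercasing and whitespace tokenization, which the report does not specify; the function names `extract_ngrams` and `featurize` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def extract_ngrams(tweet, n):
    """Return the word n-grams of a tweet as a list of tuples.

    Assumption: tokens are obtained by lowercasing and splitting on whitespace.
    """
    tokens = tweet.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def featurize(tweet):
    """Count unigrams and bigrams together as a sparse bag-of-n-grams."""
    features = Counter(extract_ngrams(tweet, 1))
    features.update(extract_ngrams(tweet, 2))
    return features


example = "not a good day"
unigrams = extract_ngrams(example, 1)
bigrams = extract_ngrams(example, 2)
features = featurize(example)
```

For the example tweet, the bigram list contains pairs such as ("not", "a") and ("good", "day"). Bigrams keep some of the local word order that unigrams alone discard.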
However, relying on individual models alone did not give high accuracy, so we pick the top few models to build a model ensemble. Ensembling is a meta-learning technique in which we combine different classifiers in order to improve prediction accuracy. Finally, we report our experimental results and findings at the end.
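One common way to combine classifiers is majority voting. The sketch below illustrates it under stated assumptions: the report does not say which combination rule was used, so majority voting is only one possibility, and the function name `majority_vote`, the toy predictions, and the tie-breaking rule are all hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model 0/1 label lists into one label per tweet.

    predictions: a list with one inner list of labels per model,
    all inner lists having the same length.
    """
    combined = []
    for votes in zip(*predictions):
        counts = Counter(votes)
        # Assumption: ties (possible with an even number of models)
        # are broken arbitrarily in favour of the positive label (1).
        combined.append(1 if counts[1] >= counts[0] else 0)
    return combined


# Toy predictions from three hypothetical models on four tweets.
model_a = [1, 0, 1, 1]
model_b = [1, 1, 0, 1]
model_c = [0, 0, 1, 1]
ensemble = majority_vote([model_a, model_b, model_c])
```

Using an odd number of models avoids ties altogether, which is one reason to pick, for example, the top three or five classifiers.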

2 Data Description

The data is given in the form of comma-separated values (CSV) files containing tweets and their corresponding sentiments. The training dataset is a CSV file of type tweet_id,sentiment,tweet where tweet_id is a unique integer identifying the tweet, sentiment is either 1 (positive) or 0 (negative), and tweet is the tweet text enclosed in "". Similarly, the test dataset is a CSV file of type tweet_id,tweet. Words and emoticons contribute to predicting the sentiment, but URLs and references to people don’t. Therefore, URLs and references can be ignored. The words are also a mixture of misspelled words, extra punctuation, and words with many repeated letters. The tweets therefore have to be preprocessed to standardize the dataset.
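The following Python sketch shows one possible way to read rows in the training format and standardize the tweet text. The specific rules are assumptions, not the report's stated procedure: URLs and @-mentions are dropped, letters repeated three or more times are collapsed to two, and repeated punctuation is collapsed to a single character. The sample rows are made up, the function name `preprocess` is hypothetical, and the sketch assumes the file has no header row.

```python
import csv
import io
import re


def preprocess(tweet):
    """Standardize a raw tweet (the rules below are illustrative assumptions)."""
    tweet = tweet.lower()
    tweet = re.sub(r"(https?://\S+|www\.\S+)", " ", tweet)  # drop URLs
    tweet = re.sub(r"@\w+", " ", tweet)                     # drop references to people
    tweet = re.sub(r"(\w)\1{2,}", r"\1\1", tweet)           # "looooove" -> "loove"
    tweet = re.sub(r"([!?.,])\1+", r"\1", tweet)            # "!!!" -> "!"
    return re.sub(r"\s+", " ", tweet).strip()               # tidy whitespace


# Made-up rows in the tweet_id,sentiment,tweet format described above.
sample = io.StringIO(
    '1,1,"@bob I looooove this!!! http://t.co/xyz :)"\n'
    '2,0,"sooo tired,, www.example.com"\n'
)
rows = [
    (int(tweet_id), int(sentiment), preprocess(text))
    for tweet_id, sentiment, text in csv.reader(sample)
]
```

Emoticons such as ":)" are deliberately left in place here, since they carry sentiment. Mapping them to a standard token would be a natural next step.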



