Twitter Sentiment Analysis
1 Problem Statement
Twitter is a popular social networking website where members
create and interact with messages known as “tweets”. It serves as a means for
individuals to express their thoughts or feelings about different subjects.
Parties such as consumers and marketers have performed sentiment
analysis on such tweets to gather insights into products or to conduct market
analysis. Furthermore, recent advancements in machine learning algorithms
allow us to improve the accuracy of our sentiment analysis predictions.
In this report, we attempt to
conduct sentiment analysis on tweets using several machine learning
algorithms. We classify the polarity of each tweet as either
positive or negative. If a tweet has
both positive and negative elements, the more dominant sentiment should be
picked as the final label.
We use a dataset from Kaggle which was crawled and labeled
positive/negative. The data provided comes
with emoticons, usernames and hashtags, which need to be processed and converted
into a standard form. We also need to extract useful features from the text,
such as unigrams and bigrams, which serve as a representation of the tweet. We
use various machine learning algorithms to conduct sentiment analysis using the
extracted features.
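As a rough illustration of the unigram and bigram features mentioned above, the following minimal sketch counts both for a single tweet. It assumes simple whitespace tokenization and lowercasing; the function names `extract_ngrams` and `tweet_features` are hypothetical and are not taken from the report.

```python
from collections import Counter


def extract_ngrams(tokens, n):
    """Return all n-grams (as tuples) found in a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tweet_features(tweet):
    """Count unigrams and bigrams in a tweet.

    Assumption: tokens are obtained by lowercasing and splitting on whitespace.
    """
    tokens = tweet.lower().split()
    features = Counter(extract_ngrams(tokens, 1))   # unigrams
    features.update(extract_ngrams(tokens, 2))      # bigrams
    return features


features = tweet_features("I love this phone")
# Unigram keys look like ("love",), bigram keys like ("love", "this").
```

Each distinct n-gram can then be mapped to a column of a feature vector that a classifier consumes.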
However, relying on individual
models alone did not give high accuracy, so we pick the top few models to generate a
model ensemble. Ensembling is a meta-learning technique in which
we combine different classifiers in order to improve prediction accuracy.
Finally, we report our experimental results and findings at the end.
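One common way to combine classifiers is majority voting. The sketch below is only an illustration of that idea, since this section does not state which ensembling scheme is used. The function name `majority_vote` and the sample predictions are hypothetical, and an odd number of models is assumed so that binary votes cannot tie.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model predictions into one label per tweet.

    predictions: list of lists, one inner list of 0/1 labels per model,
    all of the same length. Assumes an odd number of models (no ties).
    """
    combined = []
    for votes in zip(*predictions):
        # most_common(1) returns [(label, count)] for the most frequent label
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined


# Hypothetical predictions from three models on four tweets
model_a = [1, 0, 1, 1]
model_b = [1, 1, 0, 1]
model_c = [0, 0, 1, 1]
ensemble = majority_vote([model_a, model_b, model_c])
```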
2 Data Description
The data given is in the form of comma-separated values (CSV) files with tweets and
their corresponding sentiments. The
training dataset is a CSV file of the form tweet_id,sentiment,tweet where tweet_id
is a unique integer identifying the tweet, sentiment is either 1 (positive) or
0 (negative), and tweet is the tweet text enclosed in "". Similarly, the
test dataset is a CSV file of the form
tweet_id,tweet.
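A file in the training format described above could be read as in the following sketch, which uses Python's standard csv module. The sample rows, the presence of a header row, and the helper name `load_training_rows` are assumptions made for illustration.

```python
import csv
import io

# Hypothetical sample in the training format: tweet_id,sentiment,tweet
sample = io.StringIO(
    'tweet_id,sentiment,tweet\n'
    '1,1,"I love this phone, really"\n'
    '2,0,"worst service ever"\n'
)


def load_training_rows(handle, has_header=True):
    """Parse rows into (tweet_id, sentiment, tweet) tuples.

    Assumption: the file starts with a header row unless has_header is False.
    """
    reader = csv.reader(handle)
    if has_header:
        next(reader)  # skip the header line
    return [(int(tweet_id), int(label), text) for tweet_id, label, text in reader]


rows = load_training_rows(sample)
```

Because the tweet text is quoted, commas inside a tweet do not break the column structure.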
Words and emoticons contribute to predicting the sentiment, but URLs and references
to people do not.
Therefore, URLs and references can be ignored. The
words are also a mixture of misspelled words, extra punctuation, and words
with many repeated letters. The tweets, therefore, have to be preprocessed to
standardize the dataset.
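The kind of preprocessing described above can be sketched with a few regular expressions. This is a minimal example under stated assumptions, not the exact pipeline used in this report: the rules (dropping URLs and @-mentions, collapsing letters repeated three or more times to two, collapsing repeated punctuation) and the name `preprocess_tweet` are illustrative choices.

```python
import re


def preprocess_tweet(tweet):
    """Normalize a raw tweet into a more standard form."""
    tweet = tweet.lower()
    # Drop URLs, which are assumed not to carry sentiment
    tweet = re.sub(r"https?://\S+|www\.\S+", "", tweet)
    # Drop references to people (@username)
    tweet = re.sub(r"@\w+", "", tweet)
    # Collapse letters repeated 3+ times down to two ("looooove" -> "loove")
    tweet = re.sub(r"([a-z])\1{2,}", r"\1\1", tweet)
    # Collapse repeated punctuation ("!!!" -> "!")
    tweet = re.sub(r"([!?.,])\1+", r"\1", tweet)
    # Normalize whitespace left behind by the removals
    return " ".join(tweet.split())


cleaned = preprocess_tweet("@bob I looooove this!!! http://t.co/abc")
```

Collapsing to two letters rather than one keeps legitimate double letters such as the "oo" in "good" intact.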