
NLP: Twitter Sentiment Analysis (1): Import Machine-Learning Libraries and Datasets




In this series, NLP: Twitter Sentiment Analysis, we're going to train a Naive Bayes classifier to predict sentiment from thousands of tweets.


This project could be used in practice by any company with a social media presence to automatically predict customer sentiment (i.e. whether their customers are happy or not).


The whole process can run automatically, without humans having to manually review thousands of tweets and customer reviews.


Let's go.



1. Download and Install the latest version of Python




2. Install and run Jupyter Notebook


  • Open Terminal and install Jupyter Notebook, pandas, numpy, seaborn, matplotlib and jupyterthemes with:


pip install notebook
pip install pandas
pip install numpy
pip install seaborn
pip install matplotlib
pip install jupyterthemes
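
If you prefer, the same packages can be installed with a single command (assuming pip points at the Python version you just installed):


pip install notebook pandas numpy seaborn matplotlib jupyterthemes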

  • To run the notebook:


jupyter notebook


3. Import Machine-Learning Libraries


  • In Jupyter Notebook, press the A key (in command mode) to add a new cell and enter the following code. After that, press Shift + Enter to run it:


import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from jupyterthemes import jtplot

jtplot.style(theme='monokai', context='notebook', ticks=True, grid=False) 


▲ This imports the necessary libraries (pandas, numpy, seaborn, matplotlib.pyplot and jupyterthemes) and sets a short alias for each of them.


jtplot.style sets the plotting style of the notebook to the monokai theme.


ticks=True shows the ticks/labels on the x and y axes.


grid=False hides the gridlines so the content is easier to see.
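
As a quick sanity check that the imports and the theme are working, you can draw a small test plot in a new cell (the numbers here are random, purely for illustration):


# the aliases np, sns and plt come from the imports in the cell above
x = np.arange(10)               # dummy x values: 0, 1, ..., 9
y = np.random.rand(10)          # dummy y values: 10 random numbers
sns.lineplot(x=x, y=y)          # a simple seaborn line plot
plt.title('Theme check')        # the figure should render with the monokai style
plt.show()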



4. Import Datasets for Analysis


  • Load the data (e.g. dataset twitter.csv):


tweets_df = pd.read_csv('twitter.csv')


▲ You can give the loaded dataset whatever variable name you like.


As this project analyses sentiment from tweets, I simply call it tweets_df (a DataFrame of tweets).


pd.read_csv is used to load the dataset twitter.csv.
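
If you only want a quick preview rather than the full table, head() returns the first few rows (this assumes twitter.csv sits in the same folder as the notebook; otherwise pass the full path to pd.read_csv):


# show the first 5 rows of the loaded dataset
tweets_df.head()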



5. Manipulate the Data


  • View all data:


tweets_df


▲ Call tweets_df to view the loaded dataset twitter.csv.


id is the running number of the tweet, while label indicates whether the tweet has a positive/neutral or negative tone: 0 for positive/neutral, 1 for negative.


tweet holds the text content of each tweet.



(Result)
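
Because label is the target we want to predict, it is also worth checking how the two classes are balanced. A minimal sketch using pandas' value_counts:


# number of tweets per class: 0 = positive/neutral, 1 = negative
tweets_df['label'].value_counts()

# the same counts expressed as fractions of the whole dataset
tweets_df['label'].value_counts(normalize=True)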



  • Check data info:


tweets_df.info()


▲ You can see 3 columns inside the dataset, each containing 31962 non-null entries, which suggests that no data is missing.


The id and label columns contain 64-bit integer (int64) data, while the tweet column contains object (string) data.



(Result)
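
If you would rather confirm this directly than read it off info(), you can count the missing values per column:


# number of missing (NaN) values in each column; all zeros means nothing is missing
tweets_df.isnull().sum()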



  • Generate descriptive statistics:


tweets_df.describe()


(Result)



  • View specific column (e.g. tweet column):


tweets_df['tweet']


(Result)



  • Drop/delete specific column (e.g. id column):


tweets_df = tweets_df.drop(['id'], axis=1)

▲ If you then run tweets_df again, you should see the following result with the id column dropped.


(Result)
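
As a side note, drop also accepts a columns= keyword, which some people find easier to read than axis=1; both forms do the same thing here:


# equivalent to tweets_df.drop(['id'], axis=1)
tweets_df = tweets_df.drop(columns=['id'])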








Congratulations on completing this tutorial!


You've taken a big step towards mastering the subject at hand.


See you in NLP: Twitter Sentiment Analysis (2)!








© 2023 Harmony Pang. All rights reserved.

