This project involves two sentiment classifiers that, given a tweet, decide its vaccination stance (for, against, or indifferent towards vaccination) and its category (vaccination news, celebrities, government plans, ...). The category classification model won second place 🥈, and the project as a whole was awarded bonuses in the corresponding competition, held in fulfillment of the classwork requirements of the NLP course taught to computer engineering juniors at Cairo University.
Training was done on 7000 Arabic COVID-related tweets scraped from Twitter. The dataset suffered from severe class imbalance, with very small subsets for some classes.
We considered three preprocessing approaches (hand-made, NLTK-based & Snowball-based). Each of them involves tweet cleaning, Arabic letter normalization, stop-word removal, stemming & tokenization.
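As a minimal sketch of what such a pipeline can look like, using NLTK's Arabic stop-word list and ISRI stemmer (the cleaning rules and normalization map below are illustrative, not necessarily the ones used in the notebooks):

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.isri import ISRIStemmer

nltk.download("stopwords", quiet=True)

ARABIC_STOPWORDS = set(stopwords.words("arabic"))
stemmer = ISRIStemmer()

def normalize(text):
    # Unify common Arabic letter variants (alef forms, teh marbuta, alef maqsura)
    text = re.sub("[إأآا]", "ا", text)
    text = re.sub("ة", "ه", text)
    text = re.sub("ى", "ي", text)
    return text

def preprocess(tweet):
    # Tweet cleaning: drop URLs, mentions, hashtag signs and non-Arabic characters
    tweet = re.sub(r"https?://\S+|@\w+|#", " ", tweet)
    tweet = re.sub(r"[^\u0600-\u06FF\s]", " ", tweet)
    tweet = normalize(tweet)
    # Tokenization, stop-word removal and stemming
    return [stemmer.stem(t) for t in tweet.split() if t not in ARABIC_STOPWORDS]

print(preprocess("اللقاح ضد كورونا متاح الآن في مصر https://example.com"))
```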
For features, we considered BoW, TF-IDF (n-gram), one-hot encoding, AraVec, FastText and AraBERT contextual embeddings.
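For instance, contextual embeddings can be pulled from a pretrained AraBERT checkpoint with Hugging Face transformers. A minimal sketch, assuming the `aubmindlab/bert-base-arabertv02` checkpoint and mean pooling over the last hidden states (both are assumptions, not necessarily what the notebooks do):

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Assumed public AraBERT checkpoint; the notebooks may use a different one
MODEL_NAME = "aubmindlab/bert-base-arabertv02"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)

tweets = ["اللقاح متاح الآن", "الحكومة تعلن خطة التطعيم"]  # illustrative tweets
enc = tokenizer(tweets, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    out = model(**enc)

# Mean-pool token vectors into one fixed-size embedding per tweet
embeddings = out.last_hidden_state.mean(dim=1)
print(embeddings.shape)  # e.g. torch.Size([2, 768])
```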
To illustrate the difficulty of the task at hand, here is an example of feature visualization for the stance task (TF-IDF projected with PCA):
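A projection like that can be produced along these lines (a sketch on stand-in data; PCA is applied to a densified TF-IDF matrix here, which is fine for a small sample, while the real notebooks may work differently):

```python
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA

# Illustrative stand-ins for preprocessed tweets and their stance labels
tweets = ["لقاح جيد", "لقاح خطر", "اخبار لقاح", "ضد لقاح", "مع لقاح", "لا رأي"]
labels = [1, -1, 0, -1, 1, 0]

tfidf = TfidfVectorizer(ngram_range=(1, 2))     # word uni- and bi-grams
X = tfidf.fit_transform(tweets).toarray()       # densify for PCA

coords = PCA(n_components=2).fit_transform(X)   # project to 2-D for plotting
plt.scatter(coords[:, 0], coords[:, 1], c=labels)
plt.title("TF-IDF features after PCA (stance)")
plt.show()
```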
For each of the two tasks, we used two sets of models: a classical set comprising Naive Bayes, SVM and Gradient Boosting, and a deep learning set comprising an FFNN, LSTM, GRU and AraBERT.
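As an illustration of the deep learning side, a minimal Keras GRU classifier over padded token-id sequences might look like this (sizes, data and hyperparameters are placeholders, not the tuned values from the notebooks):

```python
import numpy as np
import tensorflow as tf

vocab_size, seq_len, n_classes = 5000, 40, 3               # hypothetical sizes
X = np.random.randint(1, vocab_size, size=(64, seq_len))   # stand-in for tokenized, padded tweets
y = np.random.randint(0, n_classes, size=(64,))            # stand-in stance labels

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 128),
    tf.keras.layers.GRU(64),
    tf.keras.layers.Dense(n_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X, y, epochs=1, verbose=0)
```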
The metric set for the competition was the macro F1 score, which is very sensitive to the imbalance issues present: for category classification over the dev set, 545 errors on the 2nd class are equivalent to 2 errors on the 7th class. Nonetheless, after a lot of fine-tuning, as evidenced by the (automated) logs, our best models for each task were:
Task | Stance Classification | Category Classification |
---|---|---|
Best Model | AraBERT with Class Weighting | Naive Bayes with Class Weighting |
Macro F1 Score | 0.61 | 0.59 |
These figures drop severely if class weighting is ignored. For more details, check Report.pdf.
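Macro F1 averages the per-class F1 scores with equal weight, so a handful of errors on a tiny class can cost as much as hundreds on a large one. One way to apply class weighting to Naive Bayes in scikit-learn is to pass balanced per-sample weights to `fit`; a sketch on hypothetical data (the real pipeline is in the NaiveBayes notebooks):

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.utils.class_weight import compute_sample_weight
from sklearn.metrics import f1_score

# Hypothetical preprocessed tweets with imbalanced category labels
tweets = ["لقاح اخبار", "لقاح خطه حكومه", "لقاح اخبار جديد", "لقاح مشاهير", "لقاح اخبار عاجل"]
y = ["news", "plans", "news", "celebs", "news"]

X = CountVectorizer().fit_transform(tweets)

# 'balanced' weights give rare classes more influence during training
weights = compute_sample_weight(class_weight="balanced", y=y)
clf = MultinomialNB().fit(X, y, sample_weight=weights)

pred = clf.predict(X)
print(f1_score(y, pred, average="macro"))  # macro F1: unweighted mean of per-class F1
```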
If you are a developer, you will know how to navigate to the corresponding model/feature-extractor/preprocessing module and run it (everything is a notebook). Understanding the directory structure below should help:
1-Preprocessing
|-- DataInsight
| |-- DataInsight.ipynb
|-- Preprocessing
| |-- Preprocess.ipynb
2-FeatureExtraction
|-- BagOfWords
| |-- BagOfWords.ipynb
|-- BowTF
| |-- BowTF.ipynb
|-- ContextualEmbeddings
| |-- ContextualEmbeddings.ipynb
|-- OneHot
| |-- OneHot.ipynb
|-- TF-IDF
| |-- TF-IDF.ipynb
|-- TweetFeatures
| |-- TweetFeatures.ipynb
|-- WordEmbeddings
| |-- FastText.ipynb
| |-- WordEmbeddings.ipynb
| |-- data.txt
3-Models
|-- Arabert
| |-- Arabert-1.ipynb
| |-- Arabert-2.ipynb
| |-- runs.csv
|-- GRU
| |-- GRU-1.ipynb
| |-- GRU-2.ipynb
| |-- data.txt
| |-- runs.csv
|-- GradientBoost
| |-- GB-1.ipynb
| |-- GB-2.ipynb
| |-- runs.csv
|-- LSTM
| |-- LSTM-1.ipynb
| |-- LSTM-2.ipynb
| |-- runs.csv
|-- NaiveBayes
| |-- NB-1.ipynb
| |-- NB-2.csv
| |-- NB-2.ipynb
| |-- model.pkl
| |-- model2.pkl
| |-- runs copy.csv
| |-- runs.csv
|-- SVM
| |-- SVM-1.ipynb
| |-- SVM-2.ipynb
| |-- runs.csv
Dataset
|-- SavedFeatures
| |-- CONTEXT
| | |-- X_test.npy
| | |-- x.npy
| |-- ONEHOT
| | |-- Preprocessing.npy
...
Essam | Ziad Atef | MUHAMMAD SAAD | Abeer Hussein
---|---|---|---