Labwork No.4. Sentiment Analysis of IMDb Reviews.

MSIAM 1st year / ENSIMAG 2nd year.

Spring 2021.

Abstract

Each team has to upload a report on Teide before May 21 at 20:00. All results should be commented on and interpreted. Answer the questions in order and refer to the question number in your report. Computations and graphics have to be performed with R. The report should be written in the R Markdown format. From your .Rmd file, generate an .html file and upload both files on Teide. In the .html file, limit the displayed R code to the most important instructions.

In this tutorial, we consider a dataset of movie reviews taken from the IMDb website. This is a binary classification task: the goal is to predict whether a movie review is positive or negative.

The original text data has already been vectorized using two different bag-of-words representations: word count and tf-idf. To read the training and test sets in both representations, run the following lines of code:

word_count_train <- read.csv("datasets/word_count_train.csv")
word_count_test <- read.csv("datasets/word_count_test.csv")
tfidf_train <- read.csv("datasets/tfidf_train.csv")
tfidf_test <- read.csv("datasets/tfidf_test.csv")

In each dataset, the first column contains the response \(y\), where 1 corresponds to a positive review and 0 to a negative review. The other columns correspond to words and their weights in each document. Attention: the datasets are large, so either work interactively in the R environment first, or set cache=TRUE for the loading chunk if you recompile the .Rmd file every time.

Question 1:

The data are stored as dense matrices. What are the disadvantages of this storage for a bag-of-words representation? Is there another way to store large-scale data?
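
As a hint, the following minimal sketch compares the memory footprint of a dense matrix with that of a sparse one. It is not run on the lab data: the toy matrix, its dimensions, and its share of zeros are assumptions made for illustration, and the Matrix package is only one possible choice of sparse storage.

```r
library(Matrix)  # recommended package shipped with standard R installations

# Toy document-term matrix (assumed sizes): 200 documents, 1000 words, 1% non-zero counts.
set.seed(1)
n_docs <- 200
n_words <- 1000
dense <- matrix(0, nrow = n_docs, ncol = n_words)
nonzero <- sample(length(dense), size = 2000)
dense[nonzero] <- rpois(2000, lambda = 1) + 1   # counts are at least 1

# Compressed sparse column storage keeps only the non-zero entries.
sparse <- Matrix(dense, sparse = TRUE)

dense_bytes <- as.numeric(object.size(dense))
sparse_bytes <- as.numeric(object.size(sparse))
sparsity <- mean(dense == 0)

cat("share of zeros:", sparsity, "\n")
cat("dense size (bytes):", dense_bytes, "\n")
cat("sparse size (bytes):", sparse_bytes, "\n")
```

The same comparison can be made on the lab data after converting the word columns with as.matrix.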

Question 2:

At first, we consider reviews in the word count representation.

What is the vocabulary size?
For review No. 7 of the training set, what is the most frequent word? What is its frequency? What are the practical consequences of this?
How many times does “aaaand” appear in the training reviews? Which is more reasonable: removing this word from the vocabulary or keeping it?
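
The kind of indexing needed for these questions can be sketched on a toy data frame. The data frame `toy` and its words are hypothetical; only the layout (response in the first column, one column per word) follows the description of the lab files.

```r
# Hypothetical toy data with the same layout as the lab files:
# first column = response y, remaining columns = word counts.
toy <- data.frame(
  y     = c(1, 0, 1),
  movie = c(2, 1, 0),
  great = c(3, 0, 1),
  bad   = c(0, 4, 0)
)

counts <- as.matrix(toy[, -1])            # drop the response column

vocab_size <- ncol(counts)                # number of distinct words
doc <- counts[2, ]                        # word counts of review No. 2
top_word <- names(doc)[which.max(doc)]    # most frequent word in that review
top_count <- max(doc)                     # its count
total_great <- sum(counts[, "great"])     # occurrences of one word over all reviews

cat("vocabulary size:", vocab_size, "\n")
cat("most frequent word in review 2:", top_word, "with count", top_count, "\n")
cat("total occurrences of 'great':", total_great, "\n")
```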

Question 3:

Now, we want to classify the data using the multinomial naive Bayes classifier, which is available in the naivebayes package (multinomial_naive_bayes function). Attention: in old versions of the package, multinomial_naive_bayes is not available! If you cannot install the latest version, use the fnb.multinomial function from the fastNaiveBayes package instead.

What is the purpose of Laplace smoothing?
Train the multinomial naive Bayes classifier on the training set in the word count representation. Set the Laplace smoothing parameter to 1.
Predict labels for both the training and the test observations.
Compute the misclassification error on both the training and the test sets. Comment on the observed results.
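
To fix ideas, here is a minimal base-R sketch of the two computations involved: add-alpha (Laplace) smoothing of the per-class word probabilities, and the misclassification error. It does not call the naivebayes package; the toy counts and the helper names `smooth_probs` and `misclassification_error` are assumptions introduced for illustration.

```r
# Toy word counts aggregated per class (rows = classes, columns = words).
class_counts <- rbind(
  negative = c(movie = 3, great = 0, bad = 5),
  positive = c(movie = 4, great = 6, bad = 0)
)

# Smoothed estimate of P(word | class): (count + alpha) / (class total + alpha * V),
# where V is the vocabulary size.
smooth_probs <- function(counts, alpha) {
  (counts + alpha) / (rowSums(counts) + alpha * ncol(counts))
}

p_raw <- smooth_probs(class_counts, alpha = 0)     # contains exact zeros
p_smooth <- smooth_probs(class_counts, alpha = 1)  # all probabilities are positive

# Misclassification error: share of observations whose predicted label is wrong.
misclassification_error <- function(y_true, y_pred) {
  mean(y_true != y_pred)
}

err <- misclassification_error(c(1, 0, 1, 1), c(1, 1, 1, 0))

print(p_raw)
print(p_smooth)
cat("toy misclassification error:", err, "\n")
```

With alpha = 0, a word never seen in a class gets probability zero, which cancels the whole product of word probabilities for that class; with alpha = 1 every probability stays strictly positive.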

Question 4:

Now, we move to the tf-idf representation.

How many positive and negative reviews are there in the training set? In the test set?
Look at the tf-idf representation of the training set. What are the three most important words for negative reviews? Comment on this result.

Question 5:

For classification, we are going to use linear discriminant analysis (LDA). We continue to work with the data in the tf-idf representation. Before performing LDA, we reduce the dimension of the data by principal component analysis (PCA).

Perform PCA (prcomp) on the training set. Transform the training and test observations using the fitted model. Keep the first 700 principal components.
Perform LDA on the PCA-projected training set.
Predict labels for both the training and the test observations.
Compute the misclassification error on both the training and the test sets. Comment on the observed results and compare them with those of the multinomial naive Bayes classifier.
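
The overall pipeline can be sketched on simulated data as follows. This is only an illustration under stated assumptions: the toy data, the helper `make_data`, and the choice of 5 components are hypothetical (the lab asks for 700 components on the real data), and lda from the MASS package is one possible LDA implementation.

```r
library(MASS)  # provides lda(); recommended package shipped with R

set.seed(42)
# Hypothetical toy data: 2 classes, 20 features, class 1 shifted by 1 in every feature.
make_data <- function(n) {
  y <- rep(c(0, 1), each = n / 2)
  x <- matrix(rnorm(n * 20), nrow = n) + y   # y is recycled down each column
  list(x = x, y = factor(y))
}
train <- make_data(200)
test <- make_data(100)

# Fit PCA on the training set only, then project both sets with the same rotation.
pca <- prcomp(train$x, center = TRUE, scale. = FALSE)
k <- 5   # number of components kept in this toy example
train_pc <- predict(pca, newdata = train$x)[, 1:k]
test_pc <- predict(pca, newdata = test$x)[, 1:k]

# LDA on the projected training data.
fit <- lda(x = train_pc, grouping = train$y)
pred_train <- predict(fit, train_pc)$class
pred_test <- predict(fit, test_pc)$class

err_train <- mean(pred_train != train$y)
err_test <- mean(pred_test != test$y)
cat("training error:", err_train, "\n")
cat("test error:", err_test, "\n")
```

The point of the sketch is that the test observations are projected with the centering and rotation estimated on the training set, rather than with a PCA refitted on the test set.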

Bonus Question:

Feel free to propose any other technique from the course (word filtering, another classifier, etc.) to improve the classification performance.