In this tutorial, we consider a dataset of movie reviews taken from the IMDb website. This is a binary classification task: the goal is to predict whether a movie review is positive or negative.

The original text data has already been vectorized using two different bag-of-words representations: word count and tf-idf. To load the training and test sets in both representations, run the following lines of code:

word_count_train <- read.csv("datasets/word_count_train.csv")
word_count_test <- read.csv("datasets/word_count_test.csv")
tfidf_train <- read.csv("datasets/tfidf_train.csv")
tfidf_test <- read.csv("datasets/tfidf_test.csv")

In each dataset, the first column contains the response \(y\), where 1 corresponds to a positive review and 0 to a negative one. The other columns correspond to words and their weights in each document. Attention: the datasets are large, so work in your R session first, or set cache=TRUE on the loading chunk if you recompile the .Rmd file every time.
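As a minimal sketch of the layout described above, the response and the word weights can be separated as follows. The data frame `toy` and its column names are hypothetical stand-ins for the real files; only the "response in the first column" convention comes from the text.

```r
# Hypothetical toy data with the same layout as the tutorial files:
# first column = response (1 = positive, 0 = negative), other columns = word weights.
toy <- data.frame(
  y    = c(1, 0, 1),
  good = c(2, 0, 1),
  bad  = c(0, 3, 0),
  plot = c(1, 1, 0)
)

# Response as a factor with explicit levels, predictors as a numeric matrix.
y <- factor(toy[, 1], levels = c(0, 1))
x <- as.matrix(toy[, -1])

table(y)  # class balance
dim(x)    # documents x words
```

The same two lines should apply to, e.g., `word_count_train`, assuming it follows the layout stated above.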

Question 1:

The data are stored as dense matrices. What are the disadvantages of this (given the bag-of-words representation)? Is there another way to store large-scale data?
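To make the storage question concrete, here is a small sketch comparing a dense matrix with a sparse one from the Matrix package. The matrix sizes and the simulated counts are assumptions chosen for illustration, not properties of the IMDb data.

```r
library(Matrix)

set.seed(1)

# Simulated document-term matrix (hypothetical sizes): 1000 documents, 500 words,
# with only 5000 non-zero counts, i.e. 1% of the entries.
dense <- matrix(0, nrow = 1000, ncol = 500)
idx <- sample(length(dense), 5000)
dense[idx] <- rpois(5000, lambda = 1) + 1

# Sparse (compressed column) storage keeps only the non-zero entries.
sparse <- Matrix(dense, sparse = TRUE)

print(object.size(dense), units = "Kb")
print(object.size(sparse), units = "Kb")
```

The dense version stores every zero explicitly, whereas the sparse version stores only the non-zero values and their positions, so its memory footprint grows with the number of non-zero entries rather than with the full matrix size.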

Question 2:

At first, we consider reviews in the word count representation.

Question 3:

Now, we want to classify the data using the multinomial naive Bayes classifier, which is available in the naivebayes package (multinomial_naive_bayes function). Attention: in old versions of the package, multinomial_naive_bayes is not available! If you cannot install the latest version, use the fnb.multinomial function from the fastNaiveBayes package instead.
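A minimal sketch of the call pattern for multinomial_naive_bayes is given below. It runs on a tiny hypothetical word-count matrix rather than on the IMDb files, and the Laplace smoothing value of 1 is an assumption made for the example, not a recommended setting.

```r
library(naivebayes)

# Hypothetical toy word counts: 6 documents, 4 words.
# The matrix needs column names (one per word).
x_train <- matrix(
  c(3, 0, 1, 0,
    2, 1, 0, 0,
    4, 0, 2, 0,
    0, 3, 0, 2,
    0, 2, 1, 3,
    1, 4, 0, 2),
  ncol = 4, byrow = TRUE,
  dimnames = list(NULL, c("great", "boring", "fun", "awful"))
)
y_train <- factor(c(1, 1, 1, 0, 0, 0))

# Fit the classifier with Laplace (add-one) smoothing.
model <- multinomial_naive_bayes(x = x_train, y = y_train, laplace = 1)

# Two new documents with the same columns as the training matrix.
x_test <- matrix(
  c(2, 0, 1, 0,
    0, 2, 0, 1),
  ncol = 4, byrow = TRUE,
  dimnames = list(NULL, colnames(x_train))
)

pred <- predict(model, newdata = x_test, type = "class")
pred
```

For the tutorial data, the same pattern would presumably use `as.matrix(word_count_train[, -1])` as `x` and `factor(word_count_train[, 1])` as `y`, given that the response is stored in the first column.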

Question 4:

Now, we move to the tf-idf representation.

Question 5:

For classification, we are going to use linear discriminant analysis (LDA). We continue to work with the data in the tf-idf representation. Before performing LDA, we will reduce the dimension of the data using principal component analysis (PCA).
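The PCA-then-LDA pipeline can be sketched as follows, using prcomp from base R and lda from the MASS package. The simulated data, the train/test split, and the number of retained components `k` are all assumptions made for illustration; on the real data, `k` would need to be chosen, for example from the explained variance.

```r
library(MASS)

set.seed(42)

# Simulated two-class data (hypothetical): 100 observations, 20 features,
# with the class means shifted along the first three features.
n <- 100
p <- 20
y <- factor(rep(c(0, 1), each = n / 2))
x <- matrix(rnorm(n * p), nrow = n)
x[y == 1, 1:3] <- x[y == 1, 1:3] + 2

# Alternate observations between the training and test sets.
train <- seq(1, n, by = 2)
test  <- seq(2, n, by = 2)

# Fit PCA on the training set only, then project both sets on the first k components.
pca <- prcomp(x[train, ], center = TRUE, scale. = FALSE)
k <- 5
z_train <- pca$x[, 1:k]
z_test  <- predict(pca, newdata = x[test, ])[, 1:k]

# LDA on the reduced training data, evaluated on the projected test data.
fit  <- lda(x = z_train, grouping = y[train])
pred <- predict(fit, z_test)$class
accuracy <- mean(pred == y[test])
accuracy
```

Fitting the PCA on the training set alone and reusing its rotation for the test set keeps test information out of the dimension reduction step.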

Bonus Question:

Feel free to propose any other technique from the course (word filtering, another classifier, etc.) to improve the classification performance.