Abstract
Each team must upload a report on Teide before May 21 at 20:00. All results should be commented on and interpreted. Answer the questions in order and refer to the question number in your report. Computations and graphics must be performed with R. The report should be written in R Markdown format. From your .Rmd file, generate an .html file and upload both files on Teide. In the .html file, limit the displayed R code to the most important instructions.
In this tutorial, we consider a dataset of movie reviews taken from the IMDb website. This is a binary classification task: the goal is to predict whether a movie review is positive or negative.
The original text data was already vectorized using two different bag-of-words representations: word count and tf-idf. To read training/test sets in both representations, run the following lines of code:
word_count_train <- read.csv("datasets/word_count_train.csv")
word_count_test <- read.csv("datasets/word_count_test.csv")
tfidf_train <- read.csv("datasets/tfidf_train.csv")
tfidf_test <- read.csv("datasets/tfidf_test.csv")
In each dataset, the first column contains the response \(y\): 1 corresponds to a positive review, while 0 indicates a negative review. The other columns correspond to words and their weights in each document. Attention: the datasets are heavy, so work in your environment first, or use cache=TRUE for the loading chunk if you compile the .Rmd file every time.
The data are stored as dense matrices. What are the disadvantages of this (given a bag-of-words representation)? Is there another way to store large-scale data?
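One common alternative is a sparse representation, which stores only the non-zero entries. The sketch below illustrates the memory difference on a toy matrix with the Matrix package; the sizes shown are illustrative, not computed from the tutorial data.

```r
# Sketch: sparse storage of a bag-of-words-like matrix with the Matrix package.
# The toy matrix below stands in for the tutorial data.
library(Matrix)

dense <- matrix(0, nrow = 1000, ncol = 5000)
dense[sample(length(dense), 2000)] <- 1   # only ~0.04% of entries are non-zero

sparse <- Matrix(dense, sparse = TRUE)    # dgCMatrix: stores non-zeros and their indices only

object.size(dense)   # every entry stored, zeros included
object.size(sparse)  # far smaller for such a sparse matrix
```

Since a bag-of-words matrix is overwhelmingly zeros (each document uses only a tiny fraction of the vocabulary), the sparse format typically reduces memory use by orders of magnitude.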
At first, we consider reviews in the word count representation.
Now, we want to classify the data using the multinomial naive Bayes classifier, which is available in the naivebayes package (multinomial_naive_bayes function). Attention: in old versions of the package, multinomial_naive_bayes is not available! If you cannot install the latest version, use the fnb.multinomial function from the fastNaiveBayes package instead.
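A minimal sketch of fitting this classifier, assuming the word_count_train and word_count_test data frames from the loading chunk (response in the first column, word counts in the remaining columns):

```r
# Sketch: multinomial naive Bayes on the word count representation.
# Assumes word_count_train / word_count_test loaded as in the tutorial.
library(naivebayes)

x_train <- as.matrix(word_count_train[, -1])
y_train <- factor(word_count_train[, 1])

mnb <- multinomial_naive_bayes(x = x_train, y = y_train)

x_test <- as.matrix(word_count_test[, -1])
pred   <- predict(mnb, newdata = x_test)

mean(pred == factor(word_count_test[, 1]))  # test accuracy
```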
Now, we move to the tf-idf representation.
For classification we are going to use linear discriminant analysis (LDA). We continue to work with the data in the tf-idf representation. Before performing LDA, we will reduce the dimension of the data by principal component analysis (PCA).
Perform PCA (prcomp function) on the training set. Transform the train and test observations using the obtained model. Keep the first 700 principal components. Feel free to propose any other technique from the course (word filtering, another classifier, etc.) to improve the classification performance.
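A possible sketch of this pipeline, assuming the tfidf_train and tfidf_test data frames from the loading chunk (response in the first column):

```r
# Sketch: PCA on the tf-idf training set, then LDA on the first 700
# principal components. Assumes tfidf_train / tfidf_test as loaded above.
library(MASS)

x_train <- as.matrix(tfidf_train[, -1])
pca <- prcomp(x_train)                        # centers the data by default

k <- 700
z_train <- pca$x[, 1:k]                       # training-set scores
z_test  <- predict(pca, as.matrix(tfidf_test[, -1]))[, 1:k]  # project test set with the same rotation

fit  <- lda(z_train, grouping = factor(tfidf_train[, 1]))
pred <- predict(fit, z_test)$class

mean(pred == factor(tfidf_test[, 1]))         # test accuracy
```

Note that the test observations are projected with the rotation learned on the training set, never refit on the test data.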