We load the necessary packages:
# knn
library(class)
# lda
library(MASS)
# featurePlot
library(caret)
set.seed(100)
We give an example based on the Iris data set.
df <- iris
To get familiar with the data set, we print some general information about it.
head(df)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
tail(df)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 145 6.7 3.3 5.7 2.5 virginica
## 146 6.7 3.0 5.2 2.3 virginica
## 147 6.3 2.5 5.0 1.9 virginica
## 148 6.5 3.0 5.2 2.0 virginica
## 149 6.2 3.4 5.4 2.3 virginica
## 150 5.9 3.0 5.1 1.8 virginica
summary(df)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
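The balance of the classes can also be verified directly with base R's table() rather than read off the summary:

```r
# Tabulate the class label column: each of the three species appears 50 times
counts <- table(iris$Species)
counts
```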
We can see that \(n=150\) and \(d=4\), and that the class distribution is balanced: each category contains 50 observations. Let’s take a look at how the observations are distributed across the different features with respect to the class label. For this, we plot a scatterplot matrix:
featurePlot(x = df[,-5],
y = df[,5],
plot = "pairs",
auto.key = list(columns = 3))
We observe that the class setosa is well separated from the others; the petal measurements seem more valuable than the sepal measurements for separating versicolor from virginica.
We create a function to split data into two parts, train and test, in a certain ratio:
train_test_split <- function(X, y, prop_train = 0.8) {
  # Draw a random subset of row indices for the training set
  sample <- sample.int(n = nrow(X), size = floor(prop_train * nrow(X)), replace = FALSE)
  train_x <- X[sample, ]
  train_y <- y[sample]
  test_x <- X[-sample, ]
  test_y <- y[-sample]
  return(list(train_x = train_x,
              train_y = train_y,
              test_x = test_x,
              test_y = test_y))
}
Then, we split our data set in a ratio of \(80:20\):
splitted <- train_test_split(iris[,-5], iris[,5], prop_train = 0.8)
train_x <- splitted$train_x
train_y <- splitted$train_y
test_x <- splitted$test_x
test_y <- splitted$test_y
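As a quick arithmetic sanity check (a standalone sketch that mirrors the sampling scheme above rather than reusing its objects): with \(n=150\) rows and prop_train = 0.8, the split should produce \(\lfloor 0.8 \cdot 150 \rfloor = 120\) training and 30 test observations:

```r
# Same sampling scheme as train_test_split: 80% of row indices, drawn at random
n <- nrow(iris)
idx <- sample.int(n = n, size = floor(0.8 * n))
stopifnot(length(idx) == 120, n - length(idx) == 30)
```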
We train an LDA model:
fit <- lda(train_x, train_y)
Then, we predict labels for test_x:
test_y_pred <- predict(fit, test_x)$class
Note that we need the $class component, since predict.lda returns a list containing, among other things, the predicted classes and the posterior probabilities. Finally, we obtain the accuracy:
print(sum(test_y == test_y_pred) / length(test_y))
## [1] 1
We can also display the confusion matrix:
conf_mat <- table(test_y, test_y_pred)
conf_mat
## test_y_pred
## test_y setosa versicolor virginica
## setosa 14 0 0
## versicolor 0 10 0
## virginica 0 0 6
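Beyond overall accuracy, per-class recall can be read off the confusion matrix as the diagonal divided by the row sums. The sketch below rebuilds the matrix from the output above so that it stands alone:

```r
# Confusion matrix copied from the LDA output above (true labels in rows)
conf_mat <- matrix(c(14, 0, 0,
                     0, 10, 0,
                     0, 0, 6),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(test_y = c("setosa", "versicolor", "virginica"),
                                   test_y_pred = c("setosa", "versicolor", "virginica")))
recall <- diag(conf_mat) / rowSums(conf_mat)  # correct / total, per true class
recall  # recall is 1 for every class in this split
```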
We can see that the class distribution in the train/test sets differs from the initial one. This is due to the train/test split function we implemented: it samples rows uniformly at random, ignoring the labels. It would be more appropriate to split the data in a stratified way, i.e. taking the class distribution into account while splitting.
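A stratified split can be sketched by sampling the same proportion of rows within each class separately; the helper stratified_split below is not part of the original code, just one way to do it with base R:

```r
stratified_split <- function(X, y, prop_train = 0.8) {
  # Split the row indices by class, then sample prop_train of each class
  idx <- unlist(lapply(split(seq_along(y), y), function(rows) {
    sample(rows, size = floor(prop_train * length(rows)))
  }))
  list(train_x = X[idx, ], train_y = y[idx],
       test_x = X[-idx, ], test_y = y[-idx])
}

s <- stratified_split(iris[, -5], iris[, 5])
table(s$train_y)  # each class contributes exactly 40 of its 50 rows
```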
Next, we apply kNN with 2 nearest neighbours. As kNN is an instance-based algorithm, there is no separate training phase: it directly predicts labels:
test_y_pred <- knn(train_x, test_x, train_y, k=2)
The accuracy score is:
print(sum(test_y == test_y_pred) / length(test_y))
## [1] 0.9
The confusion matrix is:
conf_mat <- table(test_y, test_y_pred)
conf_mat
## test_y_pred
## test_y setosa versicolor virginica
## setosa 14 0 0
## versicolor 0 9 1
## virginica 0 2 4
We see that setosa was perfectly classified, as expected. Versicolor and virginica show some overlap; probably, 2 nearest neighbours were not enough to distinguish between these two classes as well as LDA did.
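The choice \(k=2\) was arbitrary. A simple way to probe its effect, sketched below as a standalone snippet that redoes the split, is to compare test accuracy over a small grid of k values (note that knn breaks ties at random, so the exact numbers vary between runs):

```r
library(class)  # for knn()

set.seed(100)
idx <- sample.int(nrow(iris), floor(0.8 * nrow(iris)))
train_x <- iris[idx, -5];  train_y <- iris[idx, 5]
test_x  <- iris[-idx, -5]; test_y  <- iris[-idx, 5]

# Test accuracy for each neighbourhood size k = 1, ..., 10
accs <- sapply(1:10, function(k) {
  pred <- knn(train_x, test_x, train_y, k = k)
  mean(pred == test_y)
})
names(accs) <- 1:10
accs
```

On a data set this small, a single train/test split is a noisy estimate; cross-validation over k would give a more reliable choice.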