We load the necessary packages:
# knn
library(class)
# lda
library(MASS)
# featurePlot
library(caret)
set.seed(100)
We give an example based on the Iris data set.
df <- iris
To get familiar with the data set, we print some general information about it.
head(df)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
tail(df)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 145 6.7 3.3 5.7 2.5 virginica
## 146 6.7 3.0 5.2 2.3 virginica
## 147 6.3 2.5 5.0 1.9 virginica
## 148 6.5 3.0 5.2 2.0 virginica
## 149 6.2 3.4 5.4 2.3 virginica
## 150 5.9 3.0 5.1 1.8 virginica
summary(df)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
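The balance of the classes can also be verified directly with base R's table() rather than read off the summary:

```r
# Tabulate the class label column: each of the three species appears 50 times
counts <- table(iris$Species)
counts
```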
We can see that \(n=150\) and \(d=4\), and that the class distribution is balanced: each category contains 50 observations. Let’s take a look at how the observations are distributed across the different features with respect to the class label. For this, we plot a scatterplot matrix:
featurePlot(x = df[,-5],
y = df[,5],
plot = "pairs",
auto.key = list(columns = 3))
We observe that the class setosa is well separated from the others; the petal measurements seem more valuable than the sepal measurements for separating versicolor from virginica.
We create a function to split data into two parts, train and test, in a certain ratio:
train_test_split <- function(X, y, prop_train = 0.8) {
  # Draw a random subset of row indices for the training set
  sample <- sample.int(n = nrow(X), size = floor(prop_train * nrow(X)), replace = FALSE)
  train_x <- X[sample, ]
  train_y <- y[sample]
  test_x <- X[-sample, ]
  test_y <- y[-sample]
  return(list(train_x = train_x,
              train_y = train_y,
              test_x = test_x,
              test_y = test_y))
}
Then, we split our data set in a ratio of \(80:20\):
splitted <- train_test_split(iris[,-5], iris[,5], prop_train = 0.8)
train_x <- splitted$train_x
train_y <- splitted$train_y
test_x <- splitted$test_x
test_y <- splitted$test_y
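As a quick arithmetic sanity check (a standalone sketch that mirrors the sampling scheme above rather than reusing its objects): with \(n=150\) rows and prop_train = 0.8, the split should produce \(\lfloor 0.8 \cdot 150 \rfloor = 120\) training and 30 test observations:

```r
# Same sampling scheme as train_test_split: 80% of row indices, drawn at random
n <- nrow(iris)
idx <- sample.int(n = n, size = floor(0.8 * n))
stopifnot(length(idx) == 120, n - length(idx) == 30)
```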
We train an LDA model:
fit <- lda(train_x, train_y)
Then, we predict labels for test_x:
test_y_pred <- predict(fit, test_x)$class
Note that we need the $class component, since predict.lda returns a list containing, among other things, the predicted classes and the posterior probabilities. Finally, we obtain the accuracy:
print(sum(test_y == test_y_pred) / length(test_y))
## [1] 1
We can also display the confusion matrix:
conf_mat <- table(test_y, test_y_pred)
conf_mat
## test_y_pred
## test_y setosa versicolor virginica
## setosa 14 0 0
## versicolor 0 10 0
## virginica 0 0 6
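Beyond overall accuracy, per-class recall can be read off the confusion matrix as the diagonal divided by the row sums. The sketch below rebuilds the matrix from the output above so that it stands alone:

```r
# Confusion matrix copied from the LDA output above (true labels in rows)
conf_mat <- matrix(c(14, 0, 0,
                     0, 10, 0,
                     0, 0, 6),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(test_y = c("setosa", "versicolor", "virginica"),
                                   test_y_pred = c("setosa", "versicolor", "virginica")))
recall <- diag(conf_mat) / rowSums(conf_mat)  # correct / total, per true class
recall  # recall is 1 for every class in this split
```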
We can see that the class distribution in the train/test sets differs from the initial one. This is due to the train/test split function we implemented: it samples rows uniformly at random, ignoring the labels. It would be more appropriate to split the data in a stratified way, i.e. taking the class distribution into account while splitting.
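A stratified split can be sketched by sampling the same proportion of rows within each class separately; the helper stratified_split below is not part of the original code, just one way to do it with base R:

```r
stratified_split <- function(X, y, prop_train = 0.8) {
  # Split the row indices by class, then sample prop_train of each class
  idx <- unlist(lapply(split(seq_along(y), y), function(rows) {
    sample(rows, size = floor(prop_train * length(rows)))
  }))
  list(train_x = X[idx, ], train_y = y[idx],
       test_x = X[-idx, ], test_y = y[-idx])
}

s <- stratified_split(iris[, -5], iris[, 5])
table(s$train_y)  # each class contributes exactly 40 of its 50 rows
```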
Next, we apply kNN with 2 nearest neighbours. As kNN is an instance-based algorithm, there is no separate training phase: it directly predicts labels:
test_y_pred <- knn(train_x, test_x, train_y, k=2)
The accuracy score is:
print(sum(test_y == test_y_pred) / length(test_y))
## [1] 0.9
The confusion matrix is:
conf_mat <- table(test_y, test_y_pred)
conf_mat
## test_y_pred
## test_y setosa versicolor virginica
## setosa 14 0 0
## versicolor 0 9 1
## virginica 0 2 4
We see that setosa was perfectly classified, as expected. Versicolor and virginica show some overlap; probably, 2 nearest neighbours were not enough to distinguish between these two classes as well as LDA did.
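The choice \(k=2\) was arbitrary. A simple way to probe its effect, sketched below as a standalone snippet that redoes the split, is to compare test accuracy over a small grid of k values (note that knn breaks ties at random, so the exact numbers vary between runs):

```r
library(class)  # for knn()

set.seed(100)
idx <- sample.int(nrow(iris), floor(0.8 * nrow(iris)))
train_x <- iris[idx, -5];  train_y <- iris[idx, 5]
test_x  <- iris[-idx, -5]; test_y  <- iris[-idx, 5]

# Test accuracy for each neighbourhood size k = 1, ..., 10
accs <- sapply(1:10, function(k) {
  pred <- knn(train_x, test_x, train_y, k = k)
  mean(pred == test_y)
})
names(accs) <- 1:10
accs
```

On a data set this small, a single train/test split is a noisy estimate; cross-validation over k would give a more reliable choice.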