Unsupervised Methods:


1. Principal Component Analysis


We will use the USArrests data to illustrate Principal Component Analysis (PCA).

library(dplyr)  # provides glimpse()
glimpse(USArrests)
Observations: 50
Variables: 4
$ Murder   <dbl> 13.2, 10.0, 8.1, 8.8, 9.0, 7.9, 3.3, 5.9, 15.4, 17.4, 5.3, 2.6, 10.4, 7.2, 2....
$ Assault  <int> 236, 263, 294, 190, 276, 204, 110, 238, 335, 211, 46, 120, 249, 113, 56, 115,...
$ UrbanPop <int> 58, 48, 80, 50, 91, 78, 77, 72, 80, 60, 83, 54, 83, 65, 57, 66, 52, 66, 51, 6...
$ Rape     <dbl> 21.2, 44.5, 31.0, 19.5, 40.6, 38.7, 11.1, 15.8, 31.9, 25.8, 20.2, 14.2, 24.0,...

We will now check the variances of the variables to decide whether standardization is required.

apply(USArrests, 2, var)
    Murder    Assault   UrbanPop       Rape 
  18.97047 6945.16571  209.51878   87.72916 

We notice that the variances differ by orders of magnitude: Assault has a variance of about 6945, while Murder has a variance of about 19. Since PCA looks for the linear combinations of the variables that maximize variance, the unscaled result would be dominated by the variable with the greatest variance (here, Assault).
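To make this dominance concrete, here is a minimal sketch (not part of the original analysis) that runs prcomp() without scaling and inspects the first loading vector. The name pca_unscaled is an illustrative choice.

```r
# PCA on the raw (unscaled) variables; pca_unscaled is an illustrative name
pca_unscaled <- prcomp(USArrests, scale. = FALSE)

# Loadings of the first principal component, rounded for readability
round(pca_unscaled$rotation[, "PC1"], 3)

# Squared loading on Assault: the share of PC1's unit-length loading vector
# that falls on this one variable
pca_unscaled$rotation["Assault", "PC1"]^2
```

The Assault loading should come out close to 1 in absolute value, with the other three close to 0.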

So, we will standardize the variables (i.e. scale each variable to unit variance) before applying PCA.

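This can be achieved by setting the scale. argument of the prcomp() function to TRUE. (The argument name ends with a dot; scale = TRUE also works through R's partial argument matching.)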
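A minimal sketch of the call, assuming the fitted object is stored under the name pca_res used in the output that follows:

```r
# Centre each variable and scale it to unit variance before extracting components
pca_res <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Print the standard deviations and the rotation (loading) matrix
pca_res
```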

pca_res
Standard deviations (1, .., p=4):
[1] 1.5748783 0.9948694 0.5971291 0.4164494

Rotation (n x k) = (4 x 4):
                PC1        PC2        PC3         PC4
Murder   -0.5358995  0.4181809 -0.3412327  0.64922780
Assault  -0.5831836  0.1879856 -0.2681484 -0.74340748
UrbanPop -0.2781909 -0.8728062 -0.3780158  0.13387773
Rape     -0.5434321 -0.1673186  0.8177779  0.08902432

The standard deviations displayed in the result are the standard deviations of the 4 principal components. (Remember that the total number of principal components for a dataset is min(n - 1, p); here min(49, 4) = 4.)

Notice that the standard deviations are always in decreasing order: each successive component explains no more variance than the one before it.
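The standard deviations can be turned into the proportion of variance explained (PVE) by each component. The sketch below assumes pca_res was created with prcomp() as described above; pc_var and pve are illustrative names.

```r
pca_res <- prcomp(USArrests, scale. = TRUE)

# The variance of each component is the square of its standard deviation
pc_var <- pca_res$sdev^2

# Proportion of variance explained by each component
pve <- pc_var / sum(pc_var)
round(pve, 3)

# Cumulative proportion, useful when deciding how many components to keep
round(cumsum(pve), 3)
```

With the standard deviations shown above, the first two components together account for roughly 87% of the total variance (1.5749^2 + 0.9949^2 is about 3.47 out of 4).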

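The Rotation matrix in the above output contains the loadings: each column is the loading vector of one principal component.

The first principal component is loaded roughly equally on the 3 kinds of crime (Murder, Assault and Rape), and it has a much smaller loading on UrbanPop.

So the first principal component essentially measures the average of the 3 crimes in a state, i.e. its overall crime rate.

The second principal component is heavily loaded on UrbanPop (-0.873), so it mainly reflects the level of urbanization.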

Visualizing the Principal Components
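One common way to visualize the first two components is a biplot, which shows the state scores and the loading vectors together. This is a sketch of one possible plot, not necessarily the figure originally shown here.

```r
pca_res <- prcomp(USArrests, scale. = TRUE)

# scale = 0 draws the arrows so that they represent the loadings directly
biplot(pca_res, scale = 0, cex = 0.6)

# Scores of a few states on the first two components
pca_res$x[c("Michigan", "Nevada", "California", "New Jersey", "Hawaii"), 1:2]
```

Keep in mind that the sign of a principal component is arbitrary, so a different platform could flip a whole axis without changing the interpretation.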

Interpretation

Since all the loadings of the first principal component are negative, states with large negative PC1 scores have high crime rates (like Michigan, Nevada, California).

Similarly, the second principal component has a large negative loading on UrbanPop. Hence states with large negative PC2 scores, like New Jersey and Hawaii, have a high percentage of urban population.



2. K-means clustering


We will work with simulated 2-dimensional data to illustrate the k-means clustering method.

The vector cluster_assign stores the true cluster number of each data point.
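The simulation code itself is not shown, so the following is only a hypothetical stand-in: 100 points drawn around 4 centres. The seed, the centres (loosely based on the cluster means reported later) and the unit standard deviation are all assumptions, so the numbers it produces will not match the outputs in this section exactly.

```r
# Hypothetical stand-in for the simulated data (seed and centres are assumptions)
set.seed(1)
n <- 100

# True cluster label (1 to 4) of each point
cluster_assign <- sample(1:4, n, replace = TRUE)

# One row per cluster centre: (x1, x2)
centres <- rbind(c(-1, -3.5), c(3, 1), c(-8.5, -3), c(-2.5, 1.5))

# Standard normal noise shifted by the centre of each point's cluster
x <- matrix(rnorm(n * 2), ncol = 2) + centres[cluster_assign, ]

# Quick look at the data, coloured by true cluster
plot(x, col = cluster_assign, pch = 19, xlab = "x1", ylab = "x2")
```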

We will now run the k-means algorithm on this dataset. The true cluster assignment will be hidden from the algorithm.

Determining the optimal value of k: Elbow Curve Method
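The elbow curve plots the total within-cluster sum of squares against k; the "elbow" is the value of k after which the curve flattens. A minimal sketch follows, using hypothetical stand-in data because the original simulation is not shown. The range 1 to 10 and nstart = 20 are illustrative choices.

```r
# Hypothetical stand-in for the simulated data (seed and centres are assumptions)
set.seed(1)
cluster_assign <- sample(1:4, 100, replace = TRUE)
centres <- rbind(c(-1, -3.5), c(3, 1), c(-8.5, -3), c(-2.5, 1.5))
x <- matrix(rnorm(200), ncol = 2) + centres[cluster_assign, ]

# Total within-cluster sum of squares for k = 1, ..., 10
k_values <- 1:10
wss <- sapply(k_values, function(k) {
  kmeans(x, centers = k, nstart = 20)$tot.withinss
})

# The elbow curve
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```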

Based on the plot above, we will select k = 4
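A sketch of the fit, assuming the result is stored as kmeans_out (the name used in the output below) and using hypothetical stand-in data. nstart = 20 is an illustrative choice that reruns the algorithm from 20 random starts and keeps the best solution.

```r
# Hypothetical stand-in for the simulated data (seed and centres are assumptions)
set.seed(1)
cluster_assign <- sample(1:4, 100, replace = TRUE)
centres <- rbind(c(-1, -3.5), c(3, 1), c(-8.5, -3), c(-2.5, 1.5))
x <- matrix(rnorm(200), ncol = 2) + centres[cluster_assign, ]

# Fit k-means with k = 4; only x is passed, never the true labels
kmeans_out <- kmeans(x, centers = 4, nstart = 20)
kmeans_out
```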

kmeans_out
K-means clustering with 4 clusters of sizes 29, 22, 28, 21

Cluster means:
       [,1]      [,2]
1 -1.208942 -3.512880
2  3.062712  1.015205
3 -8.447148 -3.005280
4 -2.368405  1.643897

Clustering vector:
  [1] 1 4 2 3 3 1 4 4 3 4 3 2 3 2 3 2 2 4 1 1 2 3 3 1 3 3 2 3 1 1 4 1 3 3 2 1 2 3 1 2 3 1 1 2 2 4
 [47] 1 2 1 4 4 4 2 1 3 1 3 1 4 4 1 4 3 1 3 2 1 4 2 3 1 1 3 1 2 4 4 2 3 3 1 1 3 1 1 3 1 2 3 4 2 2
 [93] 2 1 4 3 4 4 4 3

Within cluster sum of squares by cluster:
[1] 40.73619 51.11144 72.60169 41.22388
 (between_SS / total_SS =  91.6 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss" "betweenss"   
[7] "size"         "iter"         "ifault"      

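The output of k-means reports several quantities: the cluster sizes, the cluster means (centers), the cluster assignment of each point and the within-cluster sum of squares of each cluster. Here the between-cluster sum of squares accounts for 91.6 % of the total sum of squares, which indicates compact, well-separated clusters.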
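The components listed above can be accessed with the $ operator. The sketch below refits the model on hypothetical stand-in data so that it runs on its own; the values will differ from the output shown above.

```r
# Hypothetical stand-in for the simulated data (seed and centres are assumptions)
set.seed(1)
cluster_assign <- sample(1:4, 100, replace = TRUE)
centres <- rbind(c(-1, -3.5), c(3, 1), c(-8.5, -3), c(-2.5, 1.5))
x <- matrix(rnorm(200), ncol = 2) + centres[cluster_assign, ]
kmeans_out <- kmeans(x, centers = 4, nstart = 20)

kmeans_out$size          # number of points in each cluster
kmeans_out$centers       # cluster means
kmeans_out$withinss      # within-cluster sum of squares, one value per cluster
kmeans_out$tot.withinss  # sum of the values above

# Share of the total sum of squares explained by the clustering
kmeans_out$betweenss / kmeans_out$totss
```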

Visualizing the output of k-means

We will compare the cluster results from k-means with the true cluster assignments.
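One way to make this comparison is to draw two scatter plots side by side and to cross-tabulate the two sets of labels. This sketch uses hypothetical stand-in data; tab is an illustrative name.

```r
# Hypothetical stand-in for the simulated data (seed and centres are assumptions)
set.seed(1)
cluster_assign <- sample(1:4, 100, replace = TRUE)
centres <- rbind(c(-1, -3.5), c(3, 1), c(-8.5, -3), c(-2.5, 1.5))
x <- matrix(rnorm(200), ncol = 2) + centres[cluster_assign, ]
kmeans_out <- kmeans(x, centers = 4, nstart = 20)

# Side-by-side scatter plots coloured by k-means label and by true label
old_par <- par(mfrow = c(1, 2))
plot(x, col = kmeans_out$cluster, pch = 19,
     xlab = "x1", ylab = "x2", main = "k-means clusters")
plot(x, col = cluster_assign, pch = 19,
     xlab = "x1", ylab = "x2", main = "True clusters")
par(old_par)

# Cluster numbers are arbitrary, so look for one large count per row
tab <- table(kmeans = kmeans_out$cluster, truth = cluster_assign)
tab
```

Because the numbering of k-means clusters is arbitrary, the colours in the two panels need not match; what matters is that the same groups of points share a colour.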

We can see that k-means did a pretty good job in correctly assigning points to the clusters.



3. Hierarchical Clustering


We will use the same simulated dataset that we used for k-means clustering.

We use the function hclust(), which takes 2 main arguments: a distance matrix (as produced by dist()) and the linkage method.
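A minimal sketch of the complete-linkage fit and its dendrogram, using hypothetical stand-in data. The name hclust_complete is an assumption, chosen to match hclust_complete_cut, which appears later in this section.

```r
# Hypothetical stand-in for the simulated data (seed and centres are assumptions)
set.seed(1)
cluster_assign <- sample(1:4, 100, replace = TRUE)
centres <- rbind(c(-1, -3.5), c(3, 1), c(-8.5, -3), c(-2.5, 1.5))
x <- matrix(rnorm(200), ncol = 2) + centres[cluster_assign, ]

# Pairwise Euclidean distances, then agglomerative clustering with complete linkage
hclust_complete <- hclust(dist(x), method = "complete")

# Dendrogram; leaf labels are suppressed because there are 100 of them
plot(hclust_complete, labels = FALSE,
     main = "Complete linkage", xlab = "", sub = "")
```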

We know that there are 4 clusters in the data, and the dendrogram above does in fact show 4 major clusters (if we cut it at a height between 5 and 10).

Recall that complete linkage uses the maximum pairwise-distance between points in 2 clusters.
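A small worked example may help here. Take two tiny clusters on a line and compute the distance between them under complete linkage, and for comparison under single and average linkage. The points are made up for illustration.

```r
# Two toy clusters (made-up points)
a <- rbind(c(0, 0), c(1, 0))
b <- rbind(c(4, 0), c(6, 0))

# Distances between every point of a (rows) and every point of b (columns)
d <- as.matrix(dist(rbind(a, b)))[1:2, 3:4]
d

max(d)   # complete linkage: largest pairwise distance, 6
min(d)   # single linkage: smallest pairwise distance, 3
mean(d)  # average linkage: mean of 4, 6, 3, 5, i.e. 4.5
```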

We will now use other linkage methods:

As expected, single linkage produced long, stringy trees. The 4 clusters are not really prominent in the above dendrogram.

Average linkage, like the complete method, produces balanced trees. The 4 clusters are quite visible in the above dendrogram.

Comparing the result from the complete linkage method with the true cluster assignments
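The single- and average-linkage trees discussed above could be produced along these lines. This is a sketch on hypothetical stand-in data; hclust_single and hclust_average are illustrative names.

```r
# Hypothetical stand-in for the simulated data (seed and centres are assumptions)
set.seed(1)
cluster_assign <- sample(1:4, 100, replace = TRUE)
centres <- rbind(c(-1, -3.5), c(3, 1), c(-8.5, -3), c(-2.5, 1.5))
x <- matrix(rnorm(200), ncol = 2) + centres[cluster_assign, ]

# Compute the distance matrix once and reuse it for both linkages
d <- dist(x)
hclust_single  <- hclust(d, method = "single")
hclust_average <- hclust(d, method = "average")

# Draw the two dendrograms next to each other
old_par <- par(mfrow = c(1, 2))
plot(hclust_single, labels = FALSE, main = "Single linkage", xlab = "", sub = "")
plot(hclust_average, labels = FALSE, main = "Average linkage", xlab = "", sub = "")
par(old_par)
```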

table(hclust_complete_cut, cluster_assign)
                   cluster_assign
hclust_complete_cut  1  2  3  4
                  1 29  0  0  0
                  2  0 20  0  0
                  3  0  0 22  0
                  4  1  0  0 28

The table above shows that only 1 observation has been assigned to a wrong cluster: a point from true cluster 1 that was placed in hierarchical cluster 4.

Comparing the result from the complete linkage method with the k-means cluster assignments
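The cut and the comparison above can be sketched as follows, assuming hclust_complete_cut is obtained with cutree(). The data are a hypothetical stand-in, so the counts will differ from the table shown.

```r
# Hypothetical stand-in for the simulated data (seed and centres are assumptions)
set.seed(1)
cluster_assign <- sample(1:4, 100, replace = TRUE)
centres <- rbind(c(-1, -3.5), c(3, 1), c(-8.5, -3), c(-2.5, 1.5))
x <- matrix(rnorm(200), ncol = 2) + centres[cluster_assign, ]
hclust_complete <- hclust(dist(x), method = "complete")

# Cut the tree into 4 groups; cutree() also accepts a cut height via h =
hclust_complete_cut <- cutree(hclust_complete, k = 4)

# Rows: hierarchical clusters, columns: true clusters
table(hclust_complete_cut, cluster_assign)
```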

table(hclust_complete_cut, kmeans_out$cluster)
                   
hclust_complete_cut  1  2  3  4
                  1 29  0  0  0
                  2  0  0  0 20
                  3  0 22  0  0
                  4  0  0 28  1
