K-Means Clustering for Imagery Analysis

In this project, we will use the K-means algorithm to perform image classification. Clustering isn't limited to consumer information and population sciences; it can be used for imagery analysis as well. Leveraging scikit-learn and the MNIST dataset, we will investigate the use of K-means clustering for computer vision.

In this project, we will learn how to:

  • Preprocess images for clustering
  • Deploy K-means clustering algorithms
  • Use common metrics to evaluate cluster performance
  • Visualize high-dimensional cluster centroids

For this project, we will be using the MNIST dataset. It is available through Keras, a deep learning library we have used in previous tutorials. Although we won't be using other features of Keras today, importing MNIST from this library will save us time. The dataset is also available through the TensorFlow library or for download at http://yann.lecun.com/exdb/mnist/.

Importing the libraries

In [6]:
import sys
import sklearn
import matplotlib
import numpy as np

print('Python: {}'.format(sys.version))
print('Sklearn: {}'.format(sklearn.__version__))
print('Matplotlib: {}'.format(matplotlib.__version__))
print('NumPy: {}'.format(np.__version__))

import matplotlib.pyplot as plt

# python magic function
%matplotlib inline
Python: 3.6.7 |Anaconda custom (64-bit)| (default, Oct 23 2018, 14:01:38) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]
Sklearn: 0.20.2
Matplotlib: 3.0.2
NumPy: 1.15.4

Import the MNIST dataset

In [4]:
from keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()

print('Training Data: {}'.format(x_train.shape))
print('Training Labels: {}'.format(y_train.shape))
Training Data: (60000, 28, 28)
Training Labels: (60000,)
In [5]:
print('Testing Data: {}'.format(x_test.shape))
print('Testing Labels: {}'.format(y_test.shape))
Testing Data: (10000, 28, 28)
Testing Labels: (10000,)

Create figure with 3x3 subplots using matplotlib.pyplot

In [8]:
fig, axs = plt.subplots(3, 3, figsize = (12, 12))

# loop through subplots and add mnist images
for i, ax in enumerate(axs.flat):
    ax.imshow(x_train[i], cmap='gray')
    ax.axis('off')
    ax.set_title('Number {}'.format(y_train[i]))

# display the figure
plt.show()

Preprocessing the MNIST images

Each MNIST image is stored as a 2-dimensional NumPy array. However, the K-means implementation provided by scikit-learn expects each sample to be a 1-dimensional array of features; as a result, we will need to reshape each image.

Clustering algorithms almost always represent each data point as a 1-dimensional array. For example, if you were clustering a set of X, Y coordinates, each point would be passed to the clustering algorithm as a 1-dimensional array with a length of two (example: [2, 4] or [-1, 4]). If you were using 3-dimensional points, the array would have a length of 3 (example: [2, 4, 1] or [-1, 4, 5]).

MNIST contains images that are 28 by 28 pixels; as a result, they will have a length of 784 once we reshape them into a 1-dimensional array.
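
As a small illustration of the reshaping described above, the sketch below flattens a toy batch of blank 28 by 28 arrays standing in for MNIST images. The names toy_images and toy_flat are placeholders for this example only and are not part of the project code.

```python
import numpy as np

# toy stand-in for a batch of three 28 x 28 grayscale images (hypothetical data)
toy_images = np.zeros((3, 28, 28), dtype=np.uint8)

# keep the first axis (one row per image) and flatten the remaining axes
toy_flat = toy_images.reshape(len(toy_images), -1)

print(toy_images.shape)  # (3, 28, 28)
print(toy_flat.shape)    # (3, 784)
```

Passing -1 lets NumPy work out the flattened length (28 * 28 = 784) on its own.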

Convert each image to a 1-dimensional array

In [9]:
X = x_train.reshape(len(x_train),-1)
Y = y_train

# normalize the data to 0 - 1
X = X.astype(float) / 255.

print(X.shape)
(60000, 784)

K-Means Clustering

Time to start clustering! Due to the size of the MNIST dataset, we will use the mini-batch implementation of k-means clustering provided by scikit-learn. This will dramatically reduce the amount of time it takes to fit the algorithm to the data.

The MNIST dataset contains images of the integers 0 to 9. Because of this, let's start by setting the number of clusters to 10, one for each digit.
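
To get a feel for the mini-batch estimator before running it on MNIST, here is a minimal sketch on synthetic 2-dimensional data. The blob data, the toy_model name, and the parameter values (batch_size=50, n_init=3, random_state=0) are assumptions chosen for illustration, not settings used in this project.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

# synthetic 2-D data drawn around three centers (not MNIST)
points, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

# the model updates its centroids from small random batches rather than the full dataset
toy_model = MiniBatchKMeans(n_clusters=3, batch_size=50, n_init=3, random_state=0)
toy_model.fit(points)

# one 2-D centroid per cluster, and one cluster index per input point
print(toy_model.cluster_centers_.shape)
print(toy_model.labels_.shape)
```

The same fit and labels_ interface applies when the inputs are 784-element image vectors.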

In [10]:
from sklearn.cluster import MiniBatchKMeans

n_digits = len(np.unique(y_test))

kmeans = MiniBatchKMeans(n_clusters = n_digits)

Fit the model to the training data

In [11]:
kmeans.fit(X)
MiniBatchKMeans(batch_size=100, compute_labels=True, init='k-means++',
        init_size=None, max_iter=100, max_no_improvement=10, n_clusters=10,
        n_init=3, random_state=None, reassignment_ratio=0.01, tol=0.0,
        verbose=0)
In [12]:
kmeans.labels_
array([0, 7, 5, ..., 0, 2, 9], dtype=int32)

Assigning Cluster Labels

K-means clustering is an unsupervised machine learning method; consequently, the labels assigned by our KMeans algorithm refer to the cluster each array was assigned to, not the actual target integer. To fix this, let's define a few functions that will predict which integer corresponds to each cluster.
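
The underlying idea is a majority vote: each cluster takes the digit that occurs most often among its members. The toy sketch below shows that vote on made-up data; the arrays and the toy_mapping name are hypothetical, and for simplicity it maps each cluster to a digit rather than each digit to a list of clusters.

```python
import numpy as np

# hypothetical cluster assignments and true digits for eight samples
toy_clusters = np.array([0, 0, 0, 1, 1, 2, 2, 2])
toy_digits = np.array([7, 7, 1, 3, 3, 7, 7, 7])

toy_mapping = {}
for cluster in np.unique(toy_clusters):
    # count how often each digit occurs among the members of this cluster
    counts = np.bincount(toy_digits[toy_clusters == cluster])
    # the most frequent digit becomes the label for the whole cluster
    toy_mapping[int(cluster)] = int(np.argmax(counts))

print(toy_mapping)  # {0: 7, 1: 3, 2: 7}
```

Note that several clusters can end up with the same digit, which is why the mapping has to allow more than one cluster per label.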

In [13]:
def infer_cluster_labels(kmeans, actual_labels):
    """
    Associates the most probable label with each cluster in the KMeans model.
    returns: dictionary mapping each label to the list of clusters assigned to it
    """
    inferred_labels = {}

    for i in range(kmeans.n_clusters):

        # collect the actual labels of the points assigned to cluster i
        labels = actual_labels[kmeans.labels_ == i]

        # skip clusters that received no points
        if len(labels) == 0:
            continue

        # determine the most common label in the cluster
        label = int(np.argmax(np.bincount(labels)))

        # add the cluster to the list of clusters for that label,
        # creating the list first if this label has not been seen yet
        inferred_labels.setdefault(label, []).append(i)

        #print('Cluster: {}, label: {}'.format(i, label))

    return inferred_labels

def infer_data_labels(X_labels, cluster_labels):
    """
    Determines the label for each array, depending on the cluster it has been assigned to.
    returns: predicted labels for each array
    """
    # empty array of len(X)
    predicted_labels = np.zeros(len(X_labels)).astype(np.uint8)

    for i, cluster in enumerate(X_labels):
        for key, value in cluster_labels.items():
            if cluster in value:
                predicted_labels[i] = key

    return predicted_labels
In [15]:
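cluster_labels = infer_cluster_labels(kmeans, Y)
X_clusters = kmeans.predict(X)
predicted_labels = infer_data_labels(X_clusters, cluster_labels)

# compare the first 20 predicted labels with the actual labels
print(predicted_labels[:20])
print(Y[:20])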
[8 0 4 1 7 2 1 8 1 5 3 1 3 6 1 7 2 5 6 5]
[5 0 4 1 9 2 1 3 1 4 3 5 3 6 1 7 2 8 6 9]

Optimizing and Evaluating the Clustering Algorithm

With the functions defined above, we can now determine the accuracy of our algorithms. Since we are using this clustering algorithm for classification, accuracy is ultimately the most important metric; however, there are other metrics out there that can be applied directly to the clusters themselves, regardless of the associated labels. Two of these metrics that we will use are inertia and homogeneity.
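
For reference, inertia is the sum of squared distances from each sample to the center of its assigned cluster, and homogeneity measures how close each cluster comes to containing members of a single class. The sketch below computes both on a tiny hand-made example; all of the toy_ arrays are hypothetical values chosen so the arithmetic is easy to follow.

```python
import numpy as np
from sklearn import metrics

# hypothetical 1-D samples, their cluster assignments, and the cluster centers
toy_points = np.array([[0.0], [2.0], [10.0], [12.0]])
toy_assignments = np.array([0, 0, 1, 1])
toy_centers = np.array([[1.0], [11.0]])

# inertia: sum of squared distances from each sample to its assigned center
toy_inertia = np.sum((toy_points - toy_centers[toy_assignments]) ** 2)
print(toy_inertia)  # 4.0

# homogeneity: compares true classes with cluster assignments;
# here every cluster contains a single class, which is the best possible case
toy_classes = np.array([5, 5, 8, 8])
toy_homogeneity = metrics.homogeneity_score(toy_classes, toy_assignments)
print(toy_homogeneity)
```

Lower inertia means tighter clusters, while a homogeneity at or very near 1 means the clusters are pure with respect to the true classes. Inertia needs no labels at all, whereas homogeneity does.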

Furthermore, earlier we made the assumption that K = 10 was the appropriate number of clusters; however, this might not be the case. Let's fit the K-means clustering algorithm with several different values of K, then evaluate the performance using our metrics.

In [16]:
from sklearn import metrics

def calculate_metrics(estimator, data, labels):

    # Calculate and print metrics
    print('Number of Clusters: {}'.format(estimator.n_clusters))
    print('Inertia: {}'.format(estimator.inertia_))
    print('Homogeneity: {}'.format(metrics.homogeneity_score(labels, estimator.labels_)))
In [17]:
clusters = [10, 16, 36, 64, 144, 256]

# test different numbers of clusters
for n_clusters in clusters:
    estimator = MiniBatchKMeans(n_clusters = n_clusters)
    estimator.fit(X)

    # print cluster metrics
    calculate_metrics(estimator, X, Y)
    # determine predicted labels
    cluster_labels = infer_cluster_labels(estimator, Y)
    predicted_Y = infer_data_labels(estimator.labels_, cluster_labels)
    # calculate and print accuracy
    print('Accuracy: {}\n'.format(metrics.accuracy_score(Y, predicted_Y)))
Number of Clusters: 10
Inertia: 2437731.123809265
Homogeneity: 0.4096475793567839
Accuracy: 0.51555

Number of Clusters: 16
Inertia: 2204403.987161855
Homogeneity: 0.5567829328787175
Accuracy: 0.6493333333333333

Number of Clusters: 36
Inertia: 1957254.3585803441
Homogeneity: 0.6772397869350846
Accuracy: 0.74705

Number of Clusters: 64
Inertia: 1817112.314420748
Homogeneity: 0.738635205141766
Accuracy: 0.80595

Number of Clusters: 144
Inertia: 1636789.8157812618
Homogeneity: 0.8002855930418575
Accuracy: 0.8628333333333333

Number of Clusters: 256
Inertia: 1518446.3819365252
Homogeneity: 0.8404293057825444
Accuracy: 0.8927166666666667

Test kmeans algorithm on testing dataset

In [18]:
# convert each image to 1 dimensional array
X_test = x_test.reshape(len(x_test),-1)

# normalize the data to 0 - 1
X_test = X_test.astype(float) / 255.

# initialize and fit KMeans algorithm on training data
kmeans = MiniBatchKMeans(n_clusters = 256)
kmeans.fit(X)
cluster_labels = infer_cluster_labels(kmeans, Y)

# predict labels for testing data
test_clusters = kmeans.predict(X_test)
predicted_labels = infer_data_labels(test_clusters, cluster_labels)

# calculate and print accuracy
print('Accuracy: {}\n'.format(metrics.accuracy_score(y_test, predicted_labels)))
Accuracy: 0.897

Visualizing Cluster Centroids

The centroid of a cluster is the mean of the points assigned to it, which makes it the most representative point within that cluster. If we were dealing with X, Y points, the centroid would simply be a point on the graph. However, since we are using arrays of length 784, our centroid is also going to be an array of length 784. We can reshape this array back into a 28 by 28 pixel image and plot it.

These graphs will display the most representative image for each cluster.
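
As a rough sketch of that idea, the snippet below averages a handful of random 784-element vectors and reshapes the result into an image-sized array. The random data and the toy_ names are placeholders for illustration; real centroids come from the fitted model.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical cluster of five flattened 28 x 28 images with pixel values in [0, 1)
toy_cluster = rng.random((5, 784))

# the centroid is the per-pixel mean of the cluster members
toy_centroid = toy_cluster.mean(axis=0)

# reshape the 784-element vector back into a 28 x 28 image and rescale to 0 - 255
toy_image = (toy_centroid.reshape(28, 28) * 255).astype(np.uint8)

print(toy_centroid.shape)  # (784,)
print(toy_image.shape)     # (28, 28)
```

Because a centroid is an average, the resulting image tends to look like a blurred version of the digits in its cluster.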

In [19]:
# initialize and fit KMeans algorithm
kmeans = MiniBatchKMeans(n_clusters = 36)
kmeans.fit(X)

# record centroid values
centroids = kmeans.cluster_centers_

# reshape centroids into images and rescale to 0 - 255
# (the multiplication makes a copy, so the fitted centroids are left unchanged)
images = (centroids.reshape(36, 28, 28) * 255).astype(np.uint8)

# determine cluster labels
cluster_labels = infer_cluster_labels(kmeans, Y)

# create figure with subplots using matplotlib.pyplot
fig, axs = plt.subplots(6, 6, figsize = (20, 20))

# loop through subplots and add centroid images
for i, ax in enumerate(axs.flat):

    # determine inferred label using cluster_labels dictionary
    for key, value in cluster_labels.items():
        if i in value:
            ax.set_title('Inferred Label: {}'.format(key))

    # add image to subplot
    ax.imshow(images[i], cmap='gray')
    ax.axis('off')

# display the figure
plt.show()
