This project focuses on mapping high-dimensional data to a lower-dimensional space, a necessary step for projects that involve data compression or data visualization. As the ethical discussions surrounding AI continue to grow, scientists and businesses alike are using visualizations of high-dimensional data to explain their results.
During this project, we will perform K-Means clustering on the well-known Iris dataset, which contains 3 classes of 50 instances each, where each class refers to a type of iris plant. To visualize the clusters, we will use principal component analysis (PCA) to reduce the number of features in the dataset.
Let's dive right in!
First things first, let's ensure that we have all of the necessary libraries installed.
import sys
import pandas as pd
import numpy as np
import sklearn
import matplotlib
print('Python: {}'.format(sys.version))
print('Pandas: {}'.format(pd.__version__))
print('NumPy: {}'.format(np.__version__))
print('Scikit-learn: {}'.format(sklearn.__version__))
print('Matplotlib: {}'.format(matplotlib.__version__))
from sklearn import datasets
# Jupyter magic command to display matplotlib plots inline
%matplotlib inline
iris = datasets.load_iris()
features = iris.data
target = iris.target
df = pd.DataFrame(features)
df.columns = iris.feature_names
print(df.shape)
print(df.head(20))
print(df.describe())
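Before going further, a quick completeness check doesn't hurt. This isn't part of the original walkthrough, but it's a typical pre-clustering step; the Iris dataset has no missing values, so every count below should be zero.
# count missing values in each feature column of the DataFrame defined above
print(df.isnull().sum())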
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
scatter_matrix(df)
plt.show()
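As an optional extra (not in the original code), the same scatter matrix can be colored by species using the target array we loaded above, which gives a rough feel for how separable the three classes already are in the raw feature space.
# optional: color the scatter matrix points by species label (0, 1, 2)
scatter_matrix(df, c = target, figsize = (10, 10))
plt.show()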
Since we already know that only three species of flowers are represented in this dataset, it's easy to predict the number of clusters we will need. However, what would we do if we didn't know this information? Since K-Means clustering is an unsupervised learning method, we often won't know the appropriate number of clusters beforehand. Fortunately, the elbow method is commonly used to determine a reasonable number of clusters for a dataset. Let's use this method to confirm that three clusters is indeed a good value for this dataset.
from sklearn.cluster import KMeans
# empty x and y data lists
X = []
Y = []
for i in range(1, 31):
    # initialize and fit the k-means model with i clusters
    kmeans = KMeans(n_clusters = i)
    kmeans.fit(df)
    # append number of clusters to x data list
    X.append(i)
    # append average within-cluster sum of squares to y data list
    awcss = kmeans.inertia_ / df.shape[0]
    Y.append(awcss)
import matplotlib.pyplot as plt
plt.plot(X,Y, 'bo-')
plt.xlim((1, 30))
plt.xlabel('Number of Clusters')
plt.ylabel('Average Within-Cluster Sum of Squares')
plt.title('K-Means Clustering Elbow Method')
plt.show()
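The elbow method is a visual heuristic, so the "right" number of clusters can be ambiguous to read off the curve. As an optional cross-check (not part of the original analysis), here is a small sketch using scikit-learn's silhouette score on the same DataFrame, where higher values indicate better-separated clusters; keep in mind that these heuristics don't always agree with the known class structure.
from sklearn.metrics import silhouette_score
# silhouette score is only defined for 2 or more clusters
for k in range(2, 11):
    labels = KMeans(n_clusters = k).fit_predict(df)
    print('k = {}: silhouette score = {:.3f}'.format(k, silhouette_score(df, labels)))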
From Wikipedia: "Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components."
from sklearn.decomposition import PCA
from sklearn import preprocessing
pca = PCA(n_components=2)
pc = pca.fit_transform(df)
# print new dimensions
print(pc.shape)
print(pc[:10])
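It's worth checking how much of the original variance those two components actually retain. This check isn't in the original code, but explained_variance_ratio_ is a standard attribute of a fitted scikit-learn PCA object.
# fraction of the total variance captured by each principal component
print(pca.explained_variance_ratio_)
print('Total variance retained: {:.2%}'.format(pca.explained_variance_ratio_.sum()))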
kmeans = KMeans(n_clusters = 3)
kmeans.fit(pc)
Now that our data has been compressed, we can easily visualize it using matplotlib.pyplot. Furthermore, since the data now has only two components, we can predict the appropriate cluster for every (x, y) point in the plot and produce a color-coded meshgrid that displays the regions assigned to each cluster by our K-Means algorithm.
h = 0.02  # step size of the mesh over [x_min, x_max] x [y_min, y_max]; smaller values give a finer (but slower) plot
x_min, x_max = pc[:, 0].min() - 1, pc[:, 0].max() + 1
y_min, y_max = pc[:, 1].min() - 1, pc[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.figure(figsize = (12, 12))
plt.clf()
plt.imshow(Z, interpolation = 'nearest',
           extent = (xx.min(), xx.max(), yy.min(), yy.max()),
           cmap = plt.cm.tab20c,
           aspect = 'auto', origin = 'lower')
for i, point in enumerate(pc):
    if target[i] == 0:
        plt.plot(point[0], point[1], 'g.', markersize = 10)
    if target[i] == 1:
        plt.plot(point[0], point[1], 'r.', markersize = 10)
    if target[i] == 2:
        plt.plot(point[0], point[1], 'b.', markersize = 10)
centroids = kmeans.cluster_centers_
plt.scatter(centroids[:, 0], centroids[:, 1], marker = 'x', s = 250, linewidth = 4,
            color = 'w', zorder = 10)
plt.title('K-Means Clustering on PCA-Reduced Iris Data Set')
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.xticks(())
plt.yticks(())
plt.show()
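One caveat: the centroids above live in PCA space, so their coordinates aren't directly interpretable as sepal or petal measurements. As an optional sketch (not in the original post), PCA's inverse_transform can map them back into the original four feature units; the mapping is only approximate because PCA is lossy.
# map the 2-D cluster centers back into the original 4-D feature space (approximate)
approx_centroids = pca.inverse_transform(kmeans.cluster_centers_)
print(pd.DataFrame(approx_centroids, columns = iris.feature_names))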
It looks good! But did the PCA reduction impact the performance of our K-Means clustering algorithm? Let's investigate using some common clustering metrics, such as homogeneity, completeness, and V-measure.
from sklearn import metrics
# K-Means clustering on the non-reduced data
kmeans1 = KMeans(n_clusters = 3)
kmeans1.fit(features)
# K-Means clustering on the PCA-reduced data
kmeans2 = KMeans(n_clusters = 3)
kmeans2.fit(pc)
print('Non-Reduced Data')
print('Homogeneity: {}'.format(metrics.homogeneity_score(target, kmeans1.labels_)))
print('Completeness: {}'.format(metrics.completeness_score(target, kmeans1.labels_)))
print('V-measure: {}'.format(metrics.v_measure_score(target, kmeans1.labels_)))
print('Reduced Data')
print('Homogeneity: {}'.format(metrics.homogeneity_score(target, kmeans2.labels_)))
print('Completeness: {}'.format(metrics.completeness_score(target, kmeans2.labels_)))
print('V-measure: {}'.format(metrics.v_measure_score(target, kmeans2.labels_)))
print(kmeans1.labels_)
print(kmeans2.labels_)
print(target)
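Homogeneity, completeness, and V-measure aren't the only options. As a small supplementary check (not part of the original write-up), the adjusted Rand index compares each clustering against the true species labels in a way that is invariant to how the cluster IDs are permuted.
# adjusted Rand index: 1.0 means perfect agreement with the true labels, values near 0.0 mean random labeling
print('ARI (non-reduced): {:.3f}'.format(metrics.adjusted_rand_score(target, kmeans1.labels_)))
print('ARI (PCA-reduced): {:.3f}'.format(metrics.adjusted_rand_score(target, kmeans2.labels_)))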