How to apply KMeans to get the centroid using a dataframe with multiple features

I am following this detailed KMeans tutorial: https://github.com/python-engineer/MLfromscratch/blob/master/mlfromscratch/kmeans.py, which uses a dataset with 2 features.

However, my dataframe has 5 features (columns), so instead of using the tutorial's euclidean_distance(x1, x2) function, I compute the Euclidean distance as below:

import numpy as np

def euclidean_distance(df):
    n = df.shape[1]
    distance_matrix = np.zeros((n,n))
    for i in range(n):
        for j in range(n):
            distance_matrix[i,j] = np.sqrt(np.sum((df.iloc[:,i] - df.iloc[:,j])**2))
    return distance_matrix
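
One thing worth noting: because the loops run over df.shape[1], this function compares columns (features) with each other and returns a 5 x 5 matrix, whereas KMeans needs distances between rows (samples). If a sample-to-sample distance matrix is what was intended, a minimal sketch could look like the following. The name pairwise_row_distances is hypothetical and not part of the tutorial.

```python
import numpy as np

def pairwise_row_distances(X):
    """Return an (n_samples, n_samples) matrix of Euclidean distances between rows."""
    X = np.asarray(X, dtype=float)
    # Broadcasting: diff[i, j, :] holds row i minus row j, feature by feature
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

# First three rows of the sample data, each with 5 features
X = np.array([
    [0.54, 0.68, 0.46, 0.98, -2.14],
    [0.52, 0.44, 0.19, 0.29, 30.44],
    [1.27, 1.15, 1.32, 0.60, -161.63],
])
D = pairwise_row_distances(X)  # shape (3, 3), zeros on the diagonal
```

With a dataframe, df.to_numpy() can be passed in as X.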

Next, I want to implement the part of the tutorial that finds the closest centroid for a sample, as below:

def _closest_centroid(self, sample, centroids):
    distances = [euclidean_distance(sample, point) for point in centroids]

Since my euclidean_distance(df) function takes only one argument, df, how can I best adapt it in order to get the centroid?
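
For illustration, here is a sketch of one possible approach. It assumes the two-argument distance is the usual square root of the summed squared differences, which works for vectors of any length, so a 5-feature row can be compared to a 5-feature centroid without changes. The standalone closest_centroid function and the two example centroids are assumptions made for the sketch, not code from the tutorial.

```python
import numpy as np

def euclidean_distance(x1, x2):
    # Works for any number of features, as long as x1 and x2 have the same length
    return np.sqrt(np.sum((np.asarray(x1) - np.asarray(x2)) ** 2))

def closest_centroid(sample, centroids):
    # Distance from this one sample to every centroid, then pick the smallest
    distances = [euclidean_distance(sample, point) for point in centroids]
    return int(np.argmin(distances))

# One row of the sample data and two hypothetical centroids (5 features each)
sample = np.array([0.54, 0.68, 0.46, 0.98, -2.14])
centroids = np.array([
    [0.52, 0.44, 0.19, 0.29, 30.44],
    [0.88, 0.79, 0.63, 0.58, -49.52],
])
idx = closest_centroid(sample, centroids)  # index of the nearest centroid
```

Each row of the dataframe can be passed as sample, for example via df.to_numpy()[i].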

My sample dataset, df, is as below:

col1,col2,col3,col4,col5
0.54,0.68,0.46,0.98,-2.14
0.52,0.44,0.19,0.29,30.44
1.27,1.15,1.32,0.60,-161.63
0.88,0.79,0.63,0.58,-49.52
1.39,1.15,1.32,0.41,-188.52
0.86,0.80,0.65,0.65,-45.27
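
To make the setup concrete, the sketch below loads this sample into a dataframe and runs a basic KMeans loop on its rows. Several choices here are assumptions for illustration only: k=2, using the first k rows as starting centroids (the tutorial may initialise differently), and the function name kmeans.

```python
import io
import numpy as np
import pandas as pd

csv_text = """col1,col2,col3,col4,col5
0.54,0.68,0.46,0.98,-2.14
0.52,0.44,0.19,0.29,30.44
1.27,1.15,1.32,0.60,-161.63
0.88,0.79,0.63,0.58,-49.52
1.39,1.15,1.32,0.41,-188.52
0.86,0.80,0.65,0.65,-45.27
"""
df = pd.read_csv(io.StringIO(csv_text))
X = df.to_numpy(dtype=float)  # rows are samples, columns are features

def kmeans(X, k, max_iters=100):
    # Assumption: start from the first k rows so the result is reproducible
    centroids = X[:k].copy()
    for _ in range(max_iters):
        # Distance from every sample to every centroid, shape (n_samples, k)
        dists = np.sqrt(((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2))
        labels = dists.argmin(axis=1)
        # New centroid = mean of assigned samples; keep the old one if a cluster is empty
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

centroids, labels = kmeans(X, k=2)
```

Since col5 spans a far larger range than the other columns, it dominates the distances here; scaling the features before clustering may be worth considering.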


Read more here: https://stackoverflow.com/questions/64805951/how-to-apply-kmeans-to-get-the-centroid-using-dataframe-with-multiple-features

Content Attribution

This content was originally published by Gee at Recent Questions - Stack Overflow, and is syndicated here via their RSS feed. You can read the original post over there.
