I am following this detailed KMeans tutorial: https://github.com/python-engineer/MLfromscratch/blob/master/mlfromscratch/kmeans.py, which uses a dataset with 2 features.
However, I have a dataframe with 5 features (columns), so instead of using the euclidean_distance(x1, x2) function from the tutorial, I compute the Euclidean distance as below:

    def euclidean_distance(df):
        n = df.shape
        distance_matrix = np.zeros((n,n))
        for i in range(n):
            for j in range(n):
                distance_matrix[i,j] = np.sqrt(np.sum((df.iloc[:,i] - df.iloc[:,j])**2))
        return distance_matrix

Next, I want to implement the part of the tutorial that finds the closest centroid for a sample, as below:

    def _closest_centroid(self, sample, centroids):
        distances = [euclidean_distance(sample, point) for point in centroids]

My euclidean_distance(df) function takes only one argument, df, whereas the tutorial's version is called with two (sample, point). How can I best implement this step in order to get the closest centroid?
My sample dataset, df, is as below:

    col1,col2,col3,col4,col5
    0.54,0.68,0.46,0.98,-2.14
    0.52,0.44,0.19,0.29,30.44
    1.27,1.15,1.32,0.60,-161.63
    0.88,0.79,0.63,0.58,-49.52
    1.39,1.15,1.32,0.41,-188.52
    0.86,0.80,0.65,0.65,-45.27
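
For reference, here is a minimal sketch of one possible way to approach this, not a definitive implementation. It assumes that each row of df is one sample. A two-argument distance written with NumPy operations is not tied to 2 features: it works on vectors of any length, so the 5-column dataframe can be converted with `df.to_numpy()` and passed row by row. The standalone `closest_centroid` function and the choice of rows 0 and 2 as starting centroids are hypothetical, picked only for illustration.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame(
    {
        "col1": [0.54, 0.52, 1.27, 0.88, 1.39, 0.86],
        "col2": [0.68, 0.44, 1.15, 0.79, 1.15, 0.80],
        "col3": [0.46, 0.19, 1.32, 0.63, 1.32, 0.65],
        "col4": [0.98, 0.29, 0.60, 0.58, 0.41, 0.65],
        "col5": [-2.14, 30.44, -161.63, -49.52, -188.52, -45.27],
    }
)


def euclidean_distance(x1, x2):
    """Distance between two 1-D vectors of equal length (any number of features)."""
    return np.sqrt(np.sum((x1 - x2) ** 2))


def closest_centroid(sample, centroids):
    """Index of the centroid nearest to `sample` (hypothetical standalone helper)."""
    distances = [euclidean_distance(sample, point) for point in centroids]
    return int(np.argmin(distances))


X = df.to_numpy()      # shape (6, 5): one row per sample, one column per feature
centroids = X[[0, 2]]  # assumption: rows 0 and 2 as starting centroids, for illustration only

labels = [closest_centroid(sample, centroids) for sample in X]
print(labels)  # [0, 0, 1, 0, 1, 0]
```

Two observations on this sketch. The one-argument function in the question indexes with `df.iloc[:, i]`, which selects columns, so it compares features with each other rather than samples; in addition, `df.shape` is a tuple, so `range(n)` would raise an error as written. Also, col5 is on a much larger scale than the other four columns, so it dominates the distances here; scaling the features before clustering may be worth considering.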