I'm doing network analysis. I ran spectral clustering on a graph and would like to evaluate the clustering with the silhouette score, but I'm getting an error that makes me question whether I'm going in the right direction. First, I define a custom distance function between two nodes of my graph (I'm using networkx for this):
def compute_node_distance(nodeA, nodeB, graph):
    try:
        return nx.shortest_path_length(graph, source=nodeA['ID'], target=nodeB['ID'])
    except nx.exception.NetworkXNoPath as noPath:
        # Return a big value if there are no paths between nodeA and nodeB
        return 1000000
Then I try to compute the silhouette score using this custom function as my distance metric, like this:
silhouette_score(
    labeled_nodes_df,
    labeled_nodes_df['cluster'],
    metric=lambda a, b: compute_node_distance(a, b, G))
Here G is the graph I'm performing spectral clustering on, and labeled_nodes_df is a pandas DataFrame with only two columns, 'ID' and 'cluster'. My custom function only needs the nodes' IDs to work, so I believe this should be enough.
However, when I run this code, silhouette_score raises an error saying that it couldn't convert one of my node IDs to float:
ValueError: could not convert string to float: '030714318X'
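For what it's worth, the message looks identical to what plain Python raises when a string is passed to float(). The small sketch below reproduces it without scikit-learn or networkx; try_float is just a hypothetical helper name. My assumption (not verified against the scikit-learn source) is that the input is validated as a numeric array before my metric is ever called.

```python
def try_float(value):
    """Attempt a float conversion and return (result, error_message)."""
    try:
        return float(value), None
    except ValueError as exc:
        return None, str(exc)


# The node ID from my traceback; it is not a valid float literal.
result, message = try_float('030714318X')
print(message)
```

If that assumption holds, the failure would have nothing to do with compute_node_distance itself.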
The thing is I do NOT want this value to be converted to a float, because I have my own metric function that is perfectly able to compute the distance between two nodes given their IDs as strings.
I know I could pre-compute the distance matrix between my nodes and use metric='precomputed', but the graph I'm working on is really big (65000+ nodes) and storing such a matrix is really inefficient, so I'm trying to compute the distances as they are needed.
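To give a rough sense of scale, here is a back-of-the-envelope estimate of the matrix size. It assumes a full dense n x n matrix with 8 bytes per entry (float64), which is an assumption on my part; dense_matrix_bytes is just a hypothetical helper for the arithmetic.

```python
def dense_matrix_bytes(n_nodes, bytes_per_entry=8):
    """Size in bytes of a dense n_nodes x n_nodes matrix.

    bytes_per_entry=8 assumes float64 entries (an assumption).
    """
    return n_nodes * n_nodes * bytes_per_entry


n_nodes = 65_000  # lower bound for my graph
size_bytes = dense_matrix_bytes(n_nodes)
print(f"{size_bytes / 1e9:.1f} GB")  # prints "33.8 GB"
```

A smaller dtype would shrink this, but the size still grows quadratically with the number of nodes.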
Can someone please help me to figure out what I'm doing wrong and why silhouette_score is trying to convert my IDs to floats?