Create columns based on word matches from separate dataframe – works but very slow

I have two dataframes:

df = list of words:

Words
xxx
yyy
www
eee
xxx yyyy

tag_matrix_df = matrix of columns, labels and search strings

Columns  Search    Labels
Col1     xxx       label1
Col1     yyy       label2
Col2     www       label3
Col2     eee       label3
Col2     xxx yyyy  label4

I need to add columns to df based on tag_matrix_df when words are matched, for example: df =

Words     Col1    Col2
xxx       label1
yyy       label2
www               label3
eee               label3
xxx yyyy          label4

I have the following code, which seems to be working OK, but it's very slow (tag_matrix_df has > 5000 rows).

# df = dataframe of words
# tag_matrix_df = dataframe of columns, labels and search strings
....

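def assign_label(kw, matrix_df):
    # Return the label of the first search string that equals one of the
    # whitespace-separated words in kw (None if kw is missing or nothing matches).
    if not isinstance(kw, str):
        return None
    words = kw.lower().strip().split()

    for ii, tag in matrix_df.iterrows():
        find_tag = tag['Find'].lower().strip()
        if find_tag in words:
            return tag['Label']
    return None

# Build one new column per distinct flag name
flag_cols = tag_matrix_df['Flag Name'].unique()
for flag in flag_cols:
    filtered_matrix_df = tag_matrix_df.loc[tag_matrix_df['Flag Name'] == flag]
    df[flag] = df.apply(lambda row: assign_label(row.iloc[0], filtered_matrix_df), axis=1)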

I'm also not confident that I'm matching the words correctly, as I suspect it's really only looking at the strings.
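
For what it's worth, the check `find_tag in kw.split()` is a membership test against a list of whitespace-separated tokens, not a substring test. A small experiment shows the consequence: single words match only whole tokens, and a multi-word search string such as "xxx yyyy" can never match, because no single token contains a space. The helper `contains_phrase` below is a hypothetical name, not part of the original code; it is only one possible way to match whole words or phrases, sketched with the standard `re` module.

```python
import re

kw = "xxx yyyy"
tokens = kw.split()  # ['xxx', 'yyyy']

print("xxx" in tokens)       # True: equals a whole token
print("xxx yyyy" in tokens)  # False: no single token contains a space
print("yyy" in tokens)       # False: 'yyyy' is not equal to 'yyy'
print("yyy" in kw)           # True: plain substring test on the string


def contains_phrase(text, phrase):
    """Hypothetical helper: True if phrase occurs in text as whole word(s)."""
    # The lookarounds require that the match is not glued to another
    # word character on either side.
    pattern = r"(?<!\w)" + re.escape(phrase.lower().strip()) + r"(?!\w)"
    return re.search(pattern, text.lower()) is not None


print(contains_phrase("xxx yyyy", "xxx yyyy"))  # True
print(contains_phrase("xxx yyyy", "yyy"))       # False
```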

Any suggestions on a smarter implementation of this?
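
One possible direction, as a rough sketch rather than a definitive answer: instead of iterating over the matrix for every row, build one regular expression and one lookup dictionary per flag, so each word string is scanned once per flag. The column names 'Flag Name', 'Find' and 'Label' follow the code above; the 'Words' column name and the helper names `build_matcher` and `first_label` are assumptions made for illustration. Note the semantics differ slightly from the loop: this returns the match that occurs first in the text (trying longer search strings first at each position), not the first matching row of the matrix.

```python
import re
import pandas as pd

# Sample data from the question ('Words' is an assumed column name).
df = pd.DataFrame({"Words": ["xxx", "yyy", "www", "eee", "xxx yyyy"]})
tag_matrix_df = pd.DataFrame({
    "Flag Name": ["Col1", "Col1", "Col2", "Col2", "Col2"],
    "Find": ["xxx", "yyy", "www", "eee", "xxx yyyy"],
    "Label": ["label1", "label2", "label3", "label3", "label4"],
})


def build_matcher(group):
    """Hypothetical helper: compiled regex and search-string -> label dict."""
    lookup = {}
    for find, label in zip(group["Find"], group["Label"]):
        # Keep the first label if a search string appears more than once.
        lookup.setdefault(find.lower().strip(), label)
    # Longest alternatives first, so multi-word phrases win over their prefixes.
    alternatives = sorted(lookup, key=len, reverse=True)
    pattern = r"(?<!\w)(" + "|".join(map(re.escape, alternatives)) + r")(?!\w)"
    return re.compile(pattern), lookup


def first_label(text, regex, lookup):
    """Hypothetical helper: label of the first whole-word match, else None."""
    match = regex.search(text)
    return lookup[match.group(1)] if match else None


# Normalise the words once instead of once per matrix row.
words = df["Words"].fillna("").str.lower().str.strip()

for flag, group in tag_matrix_df.groupby("Flag Name", sort=False):
    regex, lookup = build_matcher(group)
    df[flag] = [first_label(w, regex, lookup) for w in words]

print(df)
```

With this sketch, "xxx yyyy" gets label4 in Col2, but it also gets label1 in Col1, because "xxx" appears in it as a whole word. If that row should stay empty in Col1, the matching rule would need to be stricter, for example by requiring the whole string to equal the search string.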



Read more here: https://stackoverflow.com/questions/68463126/create-columns-based-on-word-matches-from-seprate-dataframe-works-but-very-slo

Content Attribution

This content was originally published by Dimo at Recent Questions - Stack Overflow, and is syndicated here via their RSS feed. You can read the original post over there.
