How to remove common words from list of lists in Python?

I have a large number of "groups" of words. If any word from a group appears in both column A and column B of a row, I want to remove that group's words from both columns. How do I loop over all the groups (i.e. over the sublists in the list)?

The flawed code below only removes the common words for the last group, not for all three groups (lists) in stuff. [For each group, I first create an indicator for whether one of the group's words is in the string, and then a second indicator for whether both strings contain a word from the group. Only for the pairs of A and B where both contain a word from the group do I remove that group's words.]

How do I correctly specify the loop?

# Input data:

import re
import pandas as pd

data = {'A': ['summer time third grey abc', 'yellow sky hello table', 'fourth autumn wind'],
        'B': ['defg autumn times fourth table', 'no red skies second garnet', 'first blue chair winter']}
df = pd.DataFrame(data, columns=['A', 'B'])

                            A                               B
0  summer time third grey abc  defg autumn times fourth table
1      yellow sky hello table      no red skies second garnet
2          fourth autumn wind         first blue chair winter

# Groups of words to be removed:

colors = ['red skies', 'red sky', 'yellow sky', 'yellow skies', 'red', 'blue', 'black', 'yellow', 'green', 'grey']
seasons = ['spring times', 'spring time', 'autumn times', 'autumn time', 'spring', 'summer', 'winter', 'autumn']
numbers = ['first', 'second', 'third', 'fourth']

stuff = [colors, seasons, numbers]
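
For the matching step, one option is to compile each group into a single pattern instead of testing word by word. The sketch below is a suggestion, not part of the original code, and `group_pattern` is a hypothetical helper name. It sorts the phrases longest first so that 'red skies' is tried before 'red', and wraps the alternation in `\b` so that 'red' does not match inside a longer word such as 'credit'.

```python
import re

colors = ['red skies', 'red sky', 'yellow sky', 'yellow skies', 'red', 'blue',
          'black', 'yellow', 'green', 'grey']


def group_pattern(words):
    """Compile one whole-word regex for a group (hypothetical helper)."""
    # Longest phrases first, so multi-word entries win over their prefixes.
    ordered = sorted(words, key=len, reverse=True)
    alternation = '|'.join(re.escape(w) for w in ordered)
    return re.compile(r'\b(?:' + alternation + r')\b')


color_re = group_pattern(colors)

# The two-word phrase is matched as a unit, not as 'red' plus leftover 'skies'.
matches = color_re.findall('no red skies second garnet')

# 'red' inside 'credit' is not a whole word, so there is no match.
no_match = color_re.search('credit score')
```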

# Code below only removes the last list in stuff (numbers):

def fA(S, y):
    for word in listed:
        if re.search(r'\b' + re.escape(word) + r'\b', S):
            y = 1
    return y

def fB(T, y):
    for word in listed:
        if re.search(r'\b' + re.escape(word) + r'\b', T):
            y = 1
    return y

def fARemove(S):
    for word in listed:
        if re.search(r'\b' + re.escape(word) + r'\b', S):
            S = re.sub(r'\b{}\b'.format(re.escape(word)), ' ', S)
    return S

def fBRemove(T):
    for word in listed:
        if re.search(r'\b' + re.escape(word) + r'\b', T):
            T = re.sub(r'\b{}\b'.format(re.escape(word)), ' ', T)
    return T

for listed in stuff:

    df['A_Ind'] = 0
    df['B_Ind'] = 0

    df['A_Ind'] = df.apply(lambda x: fA(x.A, x.A_Ind), axis=1)
    df['B_Ind'] = df.apply(lambda x: fB(x.B, x.B_Ind), axis=1)

    df['inboth'] = 0
    df.loc[((df.A_Ind == 1) & (df.B_Ind == 1)), 'inboth'] = 1

    df['A_new'] = df['A']
    df['B_new'] = df['B']

    df.loc[df.inboth == 1, 'A_new'] = df.apply(lambda x: fARemove(x.A), axis=1)
    df.loc[df.inboth == 1, 'B_new'] = df.apply(lambda x: fBRemove(x.B), axis=1)

    del df['inboth']
    del df['A_Ind']
    del df['B_Ind']
    df['A_new'] = df['A_new'].str.replace(r'\s{2,}', ' ', regex=True)
    df['A_new'] = df['A_new'].str.strip()
    df['B_new'] = df['B_new'].str.replace(r'\s{2,}', ' ', regex=True)
    df['B_new'] = df['B_new'].str.strip()

Expected output is:

         A_new             B_new
0     grey abc        defg table
1  hello table  no second garnet
2         wind        blue chair
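
One way the loop could be restructured, as a minimal sketch rather than a definitive answer: the loop in the question resets A_new and B_new from A and B on every pass, so only the last group's removals survive. The version below keeps one running result per row and tests each group against the original strings. `group_pattern` and `remove_shared_groups` are hypothetical names, the columns are held as plain lists to keep the sketch dependency-free, and row 1 of B is assumed to read 'no red skies' as in the printed frame.

```python
import re

col_a = ['summer time third grey abc', 'yellow sky hello table', 'fourth autumn wind']
col_b = ['defg autumn times fourth table', 'no red skies second garnet',
         'first blue chair winter']

colors = ['red skies', 'red sky', 'yellow sky', 'yellow skies', 'red', 'blue',
          'black', 'yellow', 'green', 'grey']
seasons = ['spring times', 'spring time', 'autumn times', 'autumn time',
           'spring', 'summer', 'winter', 'autumn']
numbers = ['first', 'second', 'third', 'fourth']
stuff = [colors, seasons, numbers]


def group_pattern(words):
    """Compile one whole-word regex per group, longest phrases first."""
    ordered = sorted(words, key=len, reverse=True)
    return re.compile(r'\b(?:' + '|'.join(re.escape(w) for w in ordered) + r')\b')


def remove_shared_groups(a, b, groups):
    """Remove a group's words from both strings if both contain one of them."""
    new_a, new_b = a, b
    for words in groups:
        pattern = group_pattern(words)
        # Test the ORIGINAL strings so each group is judged independently,
        # but accumulate removals in new_a / new_b across groups.
        if pattern.search(a) and pattern.search(b):
            new_a = pattern.sub(' ', new_a)
            new_b = pattern.sub(' ', new_b)
    # Collapse the whitespace left behind by the removals.
    return ' '.join(new_a.split()), ' '.join(new_b.split())


cleaned = [remove_shared_groups(a, b, stuff) for a, b in zip(col_a, col_b)]
a_new = [a for a, _ in cleaned]
b_new = [b for _, b in cleaned]
```

With the lists exactly as given, row 0 of A comes out as 'time grey abc' rather than 'grey abc', because 'summer time' is not an entry in seasons; the other values match the expected output. With the DataFrame, the same function could be applied as `df['A_new'], df['B_new'] = zip(*(remove_shared_groups(a, b, stuff) for a, b in zip(df['A'], df['B'])))`.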


Content Attribution

This content was originally published by pandini at Recent Questions - Stack Overflow, and is syndicated here via their RSS feed. You can read the original post over there.
