I need to get multiple rows from a huge DataFrame based on the unique values in one of its columns

I have a DataFrame with around 100k records. From it I need to select the rows that belong to each unique value of one column ('message_id'). For each of those groups of rows I also need the count of every distinct value in another column ('event'). Finally, I need to build a single DataFrame from the per-id results and insert it into the database, subject to some conditions.

I want to do this without a loop: if there are 700 unique ids, I need to get all the rows related to those 700 ids and create one DataFrame from them, without iterating over the ids one by one. So far I have tried the code below, but it takes too long to create the single DataFrame.

message_id = df.message_id.unique()
count = 0
for rec in message_id:
    count += 1
    _logger.info(count)
    df1 = df.iloc[np.where(df.message_id.isin([rec]))]
    df2 = {
        'message_id': df1['message_id'].any(),
        'open': len(df1[df1['event'] == 'open']),
        'delivered': len(df1[df1['event'] == 'delivered']),
        'click': len(df1[df1['event'] == 'click']),
        'processed': len(df1[df1['event'] == 'processed']),
        'deferred': len(df1[df1['event'] == 'deferred']),
        'bounce': len(df1[df1['event'] == 'bounce']),
        'drop': len(df1[df1['event'] == 'drop']),
        'subject': df1['subject'].any(),
        'from': df1['from'].any(),
        'to': df1['email'].any(),
        'api_key_id': df1['api_key_id'].any(),
        'credential_id': df1['credential_id'].any(),
        'asm_group_id': df1['asm_group_id'].any(),
        'template_id': df1['template_id'].any(),
        'originating_ip': df1['originating_ip'].any(),
        'reason': df1['reason'].any(),
        'outbound_ip': df1['outbound_ip'].any(),
        'outbound_ip_type': df1['outbound_ip_type'].any(),
        'mx': df1['mx'].any(),
        'attempt': df1['attempt'].any(),
        'url': df1['url'].any(),
        'user_agent': df1['user_agent'].any(),
        'type': df1['type'].any(),
        'is_unique': df1['is_unique'].any(),
        'username': df1['username'].any(),
        'categories': df1['categories'].any(),
        'marketing_campaign_id': df1['marketing_campaign_id'].any(),
        'marketing_campaign_name': df1['marketing_campaign_name'].any(),
        'marketing_campaign_split_id': df1['marketing_campaign_split_id'].any(),
        'marketing_campaign_version': df1['marketing_campaign_version'].any(),
        'unique_args': df1['unique_args'].any(),
        'recv_message_id': df1['recv_message_id'].any(),
    }
    if not df1.loc[df1['event'] == 'processed', 'processed'].empty:
        df2.update({
            'processed_date': df1.loc[df1['event'] == 'processed', 'processed'].iloc[0]
        })
    if not df1.loc[df1['event'] == 'delivered', 'processed'].empty:
        df2.update({
            'delivered_date': df1.loc[df1['event'] == 'delivered', 'processed'].iloc[0]
        })
    if not df1.loc[df1['event'] == 'bounce', 'processed'].empty:
        df2.update({
            'bounce_date': df1.loc[df1['event'] == 'bounce', 'processed'].iloc[0]
        })
    if not df1.loc[df1['event'] == 'drop', 'processed'].empty:
        df2.update({
            'drop_date': df1.loc[df1['event'] == 'drop', 'processed'].iloc[0]
        })
    final = pd.DataFrame(df2, index=[0])
    final.to_csv('output.csv')
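
For comparison, here is a minimal sketch of how the same per-id summary might be built without the loop, using `pd.crosstab` for the event counts and `groupby` for the remaining columns. The function name `summarize` and the `EVENTS` / `DATED_EVENTS` lists are illustrative names, not part of the code above. It assumes the column names used in the loop ('message_id', 'event', 'processed', 'email') and that taking the first non-null value per id is acceptable for the descriptive columns.

```python
import pandas as pd

# Event names taken from the loop above; adjust if the data has others.
EVENTS = ["open", "delivered", "click", "processed", "deferred", "bounce", "drop"]
DATED_EVENTS = ["processed", "delivered", "bounce", "drop"]


def summarize(df: pd.DataFrame) -> pd.DataFrame:
    """Build one summary row per message_id without a Python-level loop."""
    # Count every event per message_id in one pass. reindex guarantees a
    # column for each expected event, even one that never occurs.
    counts = (
        pd.crosstab(df["message_id"], df["event"])
        .reindex(columns=EVENTS, fill_value=0)
    )

    # First non-null value of every other column per message_id.
    # Assumption: "first non-null" is an acceptable representative value.
    other_cols = df.columns.difference(["message_id", "event", "processed"])
    firsts = (
        df.groupby("message_id")[list(other_cols)]
        .first()
        .rename(columns={"email": "to"})
    )

    # First non-null 'processed' value per (message_id, event), pivoted so
    # each dated event becomes its own "<event>_date" column.
    dates = (
        df[df["event"].isin(DATED_EVENTS)]
        .groupby(["message_id", "event"])["processed"]
        .first()
        .unstack("event")
        .reindex(columns=DATED_EVENTS)
        .add_suffix("_date")
    )

    return counts.join(firsts).join(dates).reset_index()
```

Calling `summarize(df).to_csv('output.csv', index=False)` once would then write all ids to the file, rather than overwriting it on every iteration.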

CSV File

Output File

After getting the desired output, I need to insert the data into a PostgreSQL database. I need to insert the entire DataFrame, with all of its records, at once, and one of the inserted values depends on a condition.
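
As a rough sketch of inserting a whole DataFrame in one call, `DataFrame.to_sql` can append all rows to a table. The helper name `bulk_insert` and the table name `message_summary` are hypothetical. The connection is left as a parameter: for PostgreSQL it would typically be a SQLAlchemy engine (for example one created with `sqlalchemy.create_engine` and a `postgresql://` URL), while an in-memory SQLite connection is enough to try the sketch locally.

```python
import sqlite3

import pandas as pd


def bulk_insert(summary: pd.DataFrame, con, table: str = "message_summary",
                chunksize: int = 500) -> int:
    """Append every row of `summary` to `table` and return the row count.

    `table` is a hypothetical name; `con` may be a SQLAlchemy engine
    (e.g. for PostgreSQL) or a sqlite3 connection.
    """
    summary.to_sql(
        table,
        con,
        if_exists="append",   # keep existing rows, add the new ones
        index=False,          # do not write the DataFrame index as a column
        method="multi",       # several rows per INSERT statement
        chunksize=chunksize,  # rows sent per batch
    )
    return len(summary)
```

With a stand-in connection, `bulk_insert(summary, sqlite3.connect(":memory:"))` writes all rows without iterating over them in Python.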

For example, if the delivered count is 1, the status field in the database should be 'delivered' rather than 'processed', 'bounce', or 'drop'.
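
One way such a condition could be expressed without a loop is `numpy.select`, which picks the first matching choice per row. The sketch below assumes the summary already has numeric 'delivered', 'bounce', 'drop', and 'processed' count columns. Only the priority of 'delivered' comes from the requirement above; the order of the remaining statuses and the 'unknown' fallback are assumptions, and `add_status` is an illustrative name.

```python
import numpy as np
import pandas as pd


def add_status(summary: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `summary` with a derived 'status' column."""
    # Conditions are checked in order; the first True one wins per row.
    # 'delivered' first is from the requirement; the rest is assumed.
    conditions = [
        summary["delivered"] > 0,
        summary["bounce"] > 0,
        summary["drop"] > 0,
        summary["processed"] > 0,
    ]
    choices = ["delivered", "bounce", "drop", "processed"]

    out = summary.copy()
    # 'unknown' is an assumed fallback for rows matching no condition.
    out["status"] = np.select(conditions, choices, default="unknown")
    return out
```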



Read more here: https://stackoverflow.com/questions/67000463/i-need-to-get-multiple-rows-from-huge-dataframe-based-on-unique-values-from-one

Content Attribution

This content was originally published by aasshhuu420 at Recent Questions - Stack Overflow, and is syndicated here via their RSS feed. You can read the original post over there.
