Pandas Fix a Dataframe

I have an input_df with more than 20k cases, which gets tested and converted into output_df.

Some of the cases from input_df fail, and then we get a FAIL in output_df. When that happens we need to fix input_df (old) and generate a new one, input_df (new), which is the final result we seek. The fix is simply to change the date in the input_date column for the cases that failed. We know which cases failed because the output_result column of output_df contains either FAIL or SUCCESS.

Sample example below (the actual DataFrames have more than 20k rows):

output_df

case_id        output_date        carBrand     output_result
1                 01/20/21             001          FAIL
2                 02/21/21             002          SUCCESS  
3                 02/08/20             003          FAIL 
4                 01/07/20             001          FAIL
5                 09/05/20             002          SUCCESS

input_df (old)

case_id    input_date         carBrand 
    1          01/20/21             001  
    2          02/21/21             002 
    3          02/08/20             003
    4          01/07/20             001
    5          09/05/20             002

new expected result after the fix (changing dates) =>

input_df (new)

case_id   input_date              carBrand 
    **1          01/13/21             001**  
    2          02/21/21             002 
    **3          02/22/20             003**
    **4          01/28/20             001**
    5          09/05/20             002

input_df and output_df have a matching column, case_id. We need to do the following:

  1. get the case_id of the rows in output_df with FAIL status

  2. match those rows with the rows in input_df using case_id. For instance, above we see that case_id 1, 3 and 4 have failed, so we go to input_df and look for rows 1, 3 and 4. The failure is caused by a wrong date, so we need to select a new date that differs from the old one by a multiple of 7 days, in either direction. So if the old date was 01/20/2020 in the original input_df, the new test date could be 01/27/2020, or any other date offset by a multiple of +/- 7 days, i.e. 14, 21, 28, 35...

  3. Now we need to do a groupby on the car brand column (input_carId in the code below); each input_carId is a group of cases. We do this because the fix must not introduce duplicate dates at the group level. Think of this as a group of Mercedes, a group of Ford, McLaren, etc. So when we select a new date for a FAIL case, we need to check against all the cases in its group that the date has not already been used. In short: we group by car brand with groupby() and then fix the failed cases within each group.

When we have gone through all the FAIL cases within each group, we now have a new input_df (new).
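Steps 1 and 2 above can be sketched roughly as follows. This is only an illustration under assumptions: the helper `candidate_dates` and its `max_weeks` parameter are hypothetical names that do not come from the original code, and the dates are parsed with `pd.to_datetime` so that day offsets can be added to them.

```python
import pandas as pd

output_df = pd.DataFrame({
    "case_id": [1, 2, 3, 4, 5],
    "output_result": ["FAIL", "SUCCESS", "FAIL", "FAIL", "SUCCESS"],
})
input_df = pd.DataFrame({
    "case_id": [1, 2, 3, 4, 5],
    "input_date": pd.to_datetime(
        ["2021-01-20", "2021-02-21", "2020-02-08", "2020-01-07", "2020-09-05"]
    ),
})

# Step 1: case_ids whose test failed.
failed_ids = output_df.loc[output_df["output_result"] == "FAIL", "case_id"]

# Step 2: rows of input_df with those case_ids
# (matched on the case_id values, not on the row index).
failed_mask = input_df["case_id"].isin(failed_ids)


def candidate_dates(old_date, max_weeks=5):
    """Hypothetical helper: old_date shifted by +7, -7, +14, -14, ... days."""
    candidates = []
    for k in range(1, max_weeks + 1):
        candidates.append(old_date + pd.Timedelta(days=7 * k))
        candidates.append(old_date - pd.Timedelta(days=7 * k))
    return candidates


print(input_df.loc[failed_mask])
print(candidate_dates(pd.Timestamp("2021-01-20"), max_weeks=2))
```

Matching with `isin` on case_id keeps the lookup correct even if the two DataFrames are sorted differently or do not share the same row index.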

The fix is about changing the dates in input_df (old) and thereby getting input_df (new). This is what I have so far, which is not working: I do not get unique dates in input_df, and input_df gets printed 3 times for some reason.

    import pandas as pd
    import numpy as np
    import datetime
    
    output_df = pd.DataFrame(
        {
            "case_id": [1, 2, 3, 4, 5],
            "output_date": ["2021-01-20", "2021-02-21", "2020-02-08", "2020-01-07", "2020-09-05"],
            "output_carId": ["001", "001", "003", "001", "002"],
            "output_result": ["FAIL", "SUCCESS", "FAIL", "FAIL", "SUCCESS"],
        },
        columns=["case_id", "output_date", "output_carId", "output_result"],
    )
    input_df = pd.DataFrame(
        {
            "case_id": [1, 2, 3, 4, 5],
            "input_date": ["2021-01-20", "2021-02-21", "2020-02-08", "2020-01-07", "2020-09-05"],
            "input_carId": ["001", "001", "003", "001", "002"],
        },
        columns=["case_id", "input_date", "input_carId"],
    )


    def func(x, old_input):
        # print(old_input)
        mask = x['output_result'] == 'FAIL'
        count = mask.sum()
        indexes = x.loc[mask]
        # print(indexes.index)
        arr = np.arange(1, count + 1) * 7
        np.random.shuffle(arr)
        td = pd.to_timedelta(arr, unit='d')
        old_input.loc[indexes.index, 'input_date'] = pd.to_datetime(old_input.loc[indexes.index, 'input_date']) + td
        old_input.loc[indexes.index, 'input_date'] = pd.to_datetime(old_input.loc[indexes.index, 'input_date']).dt.date
        return old_input


    new_input = output_df.groupby('output_carId').apply(lambda x: func(x, input_df))
    print(new_input)
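A note on the two symptoms, followed by one possible sketch. `groupby(...).apply` calls `func` once per group and concatenates whatever each call returns; since `func` returns the whole `input_df` every time and there are three `output_carId` groups, the result contains three copies of it. The dates are not guaranteed unique because the shuffled offsets are only distinct among the failed cases of one group and are never compared with the dates already present in that group. The sketch below is not the original method: it swaps the random offsets for a deterministic rule (take the nearest free date, trying +7, -7, +14, -14, ... days), and the names `first_free_date`, `fix_failed_dates` and `max_weeks` are hypothetical.

```python
import pandas as pd

output_df = pd.DataFrame({
    "case_id": [1, 2, 3, 4, 5],
    "output_carId": ["001", "001", "003", "001", "002"],
    "output_result": ["FAIL", "SUCCESS", "FAIL", "FAIL", "SUCCESS"],
})
input_df = pd.DataFrame({
    "case_id": [1, 2, 3, 4, 5],
    "input_date": ["2021-01-20", "2021-02-21", "2020-02-08", "2020-01-07", "2020-09-05"],
    "input_carId": ["001", "001", "003", "001", "002"],
})


def first_free_date(old, used, max_weeks):
    """Hypothetical helper: nearest date a multiple of 7 days from `old` not in `used`."""
    for k in range(1, max_weeks + 1):
        for sign in (1, -1):
            candidate = old + pd.Timedelta(days=7 * k * sign)
            if candidate not in used:
                return candidate
    raise ValueError(f"no free date within {max_weeks} weeks of {old}")


def fix_failed_dates(input_df, output_df, max_weeks=52):
    """Return a copy of input_df with new dates for the FAIL cases."""
    new_df = input_df.copy()  # leave the old input_df untouched
    new_df["input_date"] = pd.to_datetime(new_df["input_date"])

    failed_ids = output_df.loc[output_df["output_result"] == "FAIL", "case_id"]
    is_failed = new_df["case_id"].isin(failed_ids)

    # .groups maps each car id to the row labels that belong to it.
    for _, row_labels in new_df.groupby("input_carId").groups.items():
        used = set(new_df.loc[row_labels, "input_date"])  # dates taken in this group
        for label in row_labels:
            if not is_failed.at[label]:
                continue
            new_date = first_free_date(new_df.at[label, "input_date"], used, max_weeks)
            new_df.at[label, "input_date"] = new_date
            used.add(new_date)  # later fixes in the group must avoid it too
    return new_df


new_input = fix_failed_dates(input_df, output_df)
print(new_input)
```

With the sample data, cases 1, 3 and 4 move to 2021-01-27, 2020-02-15 and 2020-01-14, and the other rows are unchanged. If random offsets are required, the candidate order inside `first_free_date` could be shuffled instead; the explicit Python loop may also need vectorizing if it turns out to be too slow on the full data.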

 


Read more here: https://stackoverflow.com/questions/66324184/pandas-fix-a-dataframe

Content Attribution

This content was originally published by uniXVanXcel at Recent Questions - Stack Overflow, and is syndicated here via their RSS feed. You can read the original post over there.
