Create different dataframe using pandas based on the unique/duplicated with replicating SAS’ first and last functionality

Here is the data that create new variables with replicating the first and last function in python (LINK).

My sample data:

df = pd.DataFrame({"col": ['D1986','D1986','H1946','H1946','I1978','I1978','S1987','S1987', 'D1974','L1977'], 
                   "ANOTHER_COL1":['DD','DD','HH','HH','II','II','SS','SS','D','L'],
                   "COL_WITH_NaN": ['TT','TT','WW','WW',np.nan,np.nan,'ZZ','ZZ',np.nan, np.nan]})   

    col     ANOTHER_COL1    COL_WITH_NaN
0   D1986   DD              TT
1   D1986   DD              TT
2   H1946   HH              WW
3   H1946   HH              WW
4   I1978   II              NaN
5   I1978   II              NaN
6   S1987   SS              ZZ
7   S1987   SS              ZZ
8   D1974   D               NaN
9   L1977   L               NaN

In SAS, I can use first and last function to get the values D1974 and L1977, which are also unique value.

I want to output 2 dataframes.

Desired output:

# df_A
    col     ANOTHER_COL1    COL_WITH_NaN
0   D1986   DD              TT
1   D1986   DD              TT
2   H1946   HH              WW
3   H1946   HH              WW
4   S1987   SS              ZZ
5   S1987   SS              ZZ
6   D1974   D               NaN
7   L1977   L               NaN


# df_B
    col     ANOTHER_COL1    COL_WITH_NaN
0   I1978   II              NaN
1   I1978   II              NaN

The above is just the simple example. The complete SAS code is:

DATA df_A df_B;                                                            
   SET DROP4;                                                                     
   BY ANOTHER_COL1 col DESCENDING COL_WITH_NaN;                                      
   IF FIRST.col AND LAST.col THEN OUTPUT df_A;                         
   ELSE IF COL_WITH_NaN = ' ' THEN OUTPUT df_B;                                 
   ELSE OUTPUT df_A;
RUN;

I don think the ANOTHER_COL1 and COL_WITH_NaN are issue. The logic of this is to search unique col record and output to df_A first, and then if there is any missing in COL_WITH_NaN then output to df_B. At last, remainings output to df_A.

Logic of first and last function in sas: After group by ANOTHER_COL1, col, COL_WITH_NaN, SAS automatically creates two variables, FIRST. and LAST. Variables in SAS, that are temporary. SAS uses the value of the FIRST. and LAST. Variables to identify the first and last observations in a group.



Read more here: https://stackoverflow.com/questions/68492471/create-different-dataframe-using-pandas-based-on-the-unique-duplicated-with-repl

Content Attribution

This content was originally published by Peter Chen at Recent Questions - Stack Overflow, and is syndicated here via their RSS feed. You can read the original post over there.

%d bloggers like this: