I am trying to implement a bootstrap-based Lasso algorithm in Python. I first draw B bootstrap samples from my data, each with a random subset of the features, and fit Lasso on each one. At the end I want to average the Lasso coefficients, and I want to keep track of the number of times each feature is selected across the bootstraps, so that instead of a general mean I take a feature-specific one based on the number of times that feature was selected. However, I cannot wrap my head around how to do this in Python:

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LassoCV
    from tqdm import tqdm

    X = pd.DataFrame(X)  # convert X to a DataFrame for easier bootstrapping
    y = pd.DataFrame(y)
    # -----------------------------------------
    n, p = X.shape  # get dimensions for the beta matrix
    # beta matrix with B rows and P columns, filled with NaN
    # (np.empty would leave arbitrary values that np.nanmean cannot skip)
    beta = pd.DataFrame(np.full((self.bootstraps, p), np.nan))
    for i in tqdm(range(self.bootstraps)):  # first bootstrapping loop
        features = np.random.choice(range(0, p), self.q1, replace=False)  # random column indices of X
        samples = np.random.choice(range(0, n), n, replace=True)  # random row indices, shared by X and y
        X1 = X.iloc[samples, features]  # bootstrapped X
        Y1 = y.iloc[samples]  # bootstrapped y
        # X1, Y1 = preprocessing.StandardScaler().fit(X1, Y1)
        lasso_cv = LassoCV(n_jobs=-1, **self.options)
        lasso_cv.fit(X1, Y1)
        beta.iloc[i, features] = lasso_cv.coef_  # save the coefficients of this bootstrap iteration
    beta = np.array(beta)
    probs = np.nanmean(np.abs(beta), axis=0)  # slight deviation from Random Lasso, abs taken inside

Inside the for loop I want to create an array K that keeps track of the number of times each feature is selected, which I would then use in the last line to compute the average instead of np.nanmean.
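
One possible way to keep such a counter is sketched below. It is not taken from the original code: the function name `bootstrap_lasso_importance`, the `seed` argument, and the fixed `cv=3` are assumptions made for illustration, and plain NumPy arrays are used instead of DataFrames. Here K counts how often each feature was drawn into the random subset; a commented line shows how it could instead count non-zero Lasso coefficients.

```python
import numpy as np
from sklearn.linear_model import LassoCV


def bootstrap_lasso_importance(X, y, n_bootstraps=20, q1=3, seed=0):
    """Return (probs, K): mean |coef| per feature and per-feature draw counts."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).ravel()
    n, p = X.shape

    abs_sum = np.zeros(p)       # running sum of |coef| per feature
    K = np.zeros(p, dtype=int)  # times each feature was drawn into a subset

    for _ in range(n_bootstraps):
        features = rng.choice(p, size=q1, replace=False)  # feature subset
        samples = rng.choice(n, size=n, replace=True)     # bootstrap rows

        model = LassoCV(cv=3).fit(X[np.ix_(samples, features)], y[samples])

        # features has no repeats, so fancy-indexed += is safe here
        abs_sum[features] += np.abs(model.coef_)
        K[features] += 1
        # To count only non-zero Lasso coefficients instead, one could use:
        # K[features[model.coef_ != 0]] += 1

    # Divide only where K > 0; features that were never drawn stay NaN.
    probs = np.full(p, np.nan)
    np.divide(abs_sum, K, out=probs, where=K > 0)
    return probs, K
```

With K counting draws, `abs_sum / K` should give the same values as `np.nanmean(np.abs(beta), axis=0)` on a NaN-initialised beta matrix, so the explicit counter mainly matters if K is meant to count something else, such as non-zero coefficients.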