I am trying to implement a bootstrap-based Lasso algorithm in Python. I first draw B bootstrap samples from my data, each with a random subset of the features, and fit a Lasso on each one. In the end I want to average the Lasso coefficients, and I want to keep track of the number of times each feature is selected across the bootstraps, so that instead of a general mean I take a feature-specific one, based on the number of times that feature was selected. However, I cannot wrap my head around how to do this in Python:
# imports this snippet relies on
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV
from tqdm import tqdm

X = pd.DataFrame(X)  # convert X to a DataFrame for easier bootstrapping
y = pd.DataFrame(y)  # same for y
n, p = X.shape  # dimensions for the beta matrix
# beta matrix with B rows and p columns, filled with NaN: np.empty would leave
# arbitrary values in entries that are never assigned, and np.nanmean would not skip them
beta = pd.DataFrame(np.full((self.bootstraps, p), np.nan))
for i in tqdm(range(self.bootstraps)):  # first bootstrapping loop
    features = np.random.choice(range(0, p), self.q1, replace=False)  # random column (feature) indices
    samples = np.random.choice(range(0, n), n, replace=True)  # random row (observation) indices
    X1 = X.iloc[samples, features]  # bootstrapped X
    Y1 = y.iloc[samples]  # bootstrapped y
    # X1, Y1 = preprocessing.StandardScaler().fit(X1, Y1)
    lasso_cv = LassoCV(n_jobs=-1, **self.options)
    lasso_cv.fit(X1, Y1)
    beta.iloc[i, features] = lasso_cv.coef_  # save the coefficients of this bootstrap iteration
beta = np.array(beta)
probs = np.nanmean(np.abs(beta), axis=0)  # slight deviation from Random Lasso, abs taken inside
Inside the for loop I want to create an array K that keeps track of the number of times each feature is selected, which I would then use in the last line, instead of np.nanmean, to take the average.
Thank you!
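One possible way to keep the counter K is sketched below. It assumes "selected" means that Lasso gave the feature a nonzero coefficient in that bootstrap, which the question does not spell out. The helper name `selection_weighted_mean` and the hand-made `fits` list are hypothetical; `fits` stands in for the (features, lasso_cv.coef_) pairs produced inside the loop, so the sketch runs without fitting anything.

```python
import numpy as np

def selection_weighted_mean(fits, p):
    """Accumulate selection counts and mean |coefficient| per feature.

    fits: iterable of (features, coef) pairs, one per bootstrap, where
          `features` holds the sampled column indices and `coef` the
          matching Lasso coefficients (hypothetical stand-in for the loop).
    p:    total number of features.
    """
    K = np.zeros(p, dtype=int)   # times each feature got a nonzero coefficient
    abs_sum = np.zeros(p)        # running sum of |coefficient| per feature
    for features, coef in fits:
        # features are drawn without replacement, so the indices are unique
        # and the fancy-indexed += updates each entry exactly once
        K[features] += (coef != 0)
        abs_sum[features] += np.abs(coef)
    # divide by K only where K > 0; never-selected features stay at 0
    probs = np.divide(abs_sum, K, out=np.zeros(p), where=K > 0)
    return K, probs

# Hand-made example: 3 bootstraps, 3 features, 2 features sampled each time
fits = [
    (np.array([0, 2]), np.array([0.5, 0.0])),
    (np.array([0, 1]), np.array([-1.5, 2.0])),
    (np.array([1, 2]), np.array([0.0, 0.0])),
]
K, probs = selection_weighted_mean(fits, p=3)
print(K)      # selection count per feature
print(probs)  # mean |coefficient| over the bootstraps that selected it
```

In the original loop, the two update lines would go right after `lasso_cv.fit(...)`, with `lasso_cv.coef_` in place of `coef`. If "selected" should instead mean "sampled into the bootstrap", replacing `(coef != 0)` with `1` would count that instead.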
Read more here: https://stackoverflow.com/questions/66330805/keeping-track-of-bootstraping-in-python
Content Attribution
This content was originally published by user13201583 at Recent Questions - Stack Overflow, and is syndicated here via their RSS feed. You can read the original post over there.