I would like to include multiple features in a classifier to improve its performance. I have a dataset similar to this one:
    Text                              is_it_capital?  is_it_upper?  contains_num?  Label
    an example of text                0               0             0              0
    ANOTHER example of text           1               1             0              1
    What's happening?Let's talk at 5  1               0             1              1
I am applying different pre-processing algorithms to Text (BoW, TF-IDF, ...). It was 'easy' to use only the Text column in my classifier by selecting X['Text'] and applying the pre-processing algorithm. However, I would now also like to include is_it_capital? and the other variables (except Label) as features, as I found them potentially useful for my classifier. What I tried was the following:
    X = df[['Text', 'is_it_capital?', 'is_it_upper?', 'contains_num?']]
    y = df['Label']

    # Need to use DenseTransformer to properly concatenate results
    # from CountVectorizer and other transformer steps
    from sklearn.base import TransformerMixin

    class DenseTransformer(TransformerMixin):
        def fit(self, X, y=None, **fit_params):
            return self

        def transform(self, X, y=None, **fit_params):
            return X.todense()

    from sklearn.pipeline import Pipeline

    pipeline = Pipeline([
        ('vectorizer', CountVectorizer()),
        ('to_dense', DenseTransformer()),
    ])

    transformer = ColumnTransformer([('text', pipeline, 'Text')], remainder='passthrough')

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=40)

    X_train = transformer.fit_transform(X_train)
    X_test = transformer.transform(X_test)

    df_train = pd.concat([X_train, y_train], axis=1)
    df_test = pd.concat([X_test, y_test], axis=1)
    # Logistic regression
    logR_pipeline = Pipeline([
        ('LogRCV', countV),
        ('LogR_clf', LogisticRegression())
    ])

    logR_pipeline.fit(df_train['Text'], df_train['Label'])
    predicted_LogR = logR_pipeline.predict(df_test['Text'])
    np.mean(predicted_LogR == df_test['Label'])

However, I got the error:
TypeError: cannot concatenate object of type '<class 'scipy.sparse.csr.csr_matrix'>'; only Series and DataFrame objs are valid
Has anyone dealt with a similar problem? How could I fix it? My goal is to include all the features in my classifier.
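
For reference, here is a minimal sketch of the setup I am aiming for. The error message indicates that transformer.fit_transform returned a SciPy sparse matrix, which pd.concat does not accept, so one possible direction (an assumption on my side, not a confirmed fix) is to skip the concatenation and put the ColumnTransformer directly in front of the classifier in a single Pipeline. The toy DataFrame below is made up from the example rows above, repeated so that a train/test split is possible; the step names "features" and "logreg" are arbitrary.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Toy data built from the example rows (repeated; purely illustrative).
df = pd.DataFrame({
    "Text": ["an example of text",
             "ANOTHER example of text",
             "What's happening?Let's talk at 5"] * 4,
    "is_it_capital?": [0, 1, 1] * 4,
    "is_it_upper?": [0, 1, 0] * 4,
    "contains_num?": [0, 0, 1] * 4,
    "Label": [0, 1, 1] * 4,
})

X = df[["Text", "is_it_capital?", "is_it_upper?", "contains_num?"]]
y = df["Label"]

# Vectorize only the Text column; pass the remaining columns through unchanged.
features = ColumnTransformer(
    [("text", CountVectorizer(), "Text")],
    remainder="passthrough",
)

# The classifier receives the combined feature matrix directly,
# so no pd.concat on the transformed output is needed.
clf = Pipeline([
    ("features", features),
    ("logreg", LogisticRegression()),
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=40, stratify=y
)

clf.fit(X_train, y_train)
predicted = clf.predict(X_test)
accuracy = clf.score(X_test, y_test)
```

With this layout the labels stay in y_train and y_test the whole time, and the whole DataFrame X (not just X['Text']) is passed to fit and predict.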