Predictions in PySpark using pickled MLFlow model and pandas_udf

I have a LightGBM model found with randomized search that is saved to a .pkl file using MLFlow. The goal is to load that pickled model into Pyspark and make predictions there. Is that possible at all with simple unpickling:

with open(path, 'rb') as f:
    model = pickle.load(f)

and then applying pandas_udf:

import pyspark.sql.functions as F

@F.pandas_udf(returnType=DoubleType())
def predict_udf(*cols):
    df = pd.concat(cols, axis=1)
    return pd.Series(model.predict(df))

cols = columns_list

y_pred = X_pred.select(F.col('id'), predict_udf3(*cols).alias('prediction'))

And if I try to show, count or save the output, it returns:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 18.0 failed 1 times, most recent failure: Lost task 1.0 in stage 18.0 (TID 217, localhost, executor driver): org.apache.spark.SparkException: Python worker exited unexpectedly (crashed) at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$3.applyOrElse(PythonRunner.scala:486)

The model.predict() part itself works fine in python. If I replace model.predict() in the pandas_udf function above with something simpler, e.g. setting a constant, then it works fine, so I assume that the problem should not be in the versions, as some people reported a year or two ago that there is a specific working combination between pyarrow, numpy and pandas, mainly caused by changes in pyarrow (0.15+). I'm currently using:

pandas==0.25.1 (downgraded from 1.2.4)
numpy==1.17.2 (downgraded from 1.20.3)
pyarrow==0.14.0 (tried all versions between 0.14.0 and 4.0.0)

It was reported previously that old numpy (e.g 0.14) can fix the issue, but this is not working for me now, because unpickling the model fails if numpy is too old. So, a few questions:

  1. Is it possible at all to unpickle a MLFlow model into Spark and then do model.predict()?
  2. Is the pandas_udf abive done correctly?
  3. Is actually pandas_udf the most efficient way to do this? It's expected to make millions of predictions and the runtime will be critical.

Thanks in advance!



Read more here: https://stackoverflow.com/questions/67940234/predictions-in-pyspark-using-pickled-mlflow-model-and-pandas-udf

Content Attribution

This content was originally published by Svilen Stefanov at Recent Questions - Stack Overflow, and is syndicated here via their RSS feed. You can read the original post over there.

%d bloggers like this: