Cloud Stack Ninja

This question is about implementing a concept similar to the one in this post. Essentially, I want to create an Enhanced Decision Tree Regressor by training a Decision Tree Regressor (DTR) and using its leaf nodes as input to a Linear Regression model. Using a pipeline and cross-validation, I can create and train a Decision Tree Regressor, but I am not sure how to use the DTR leaf nodes as input to a Linear Regression model, as seen below. Any assistance would be greatly appreciated.

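from pyspark.ml import Pipeline
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import DecisionTreeRegressor, LinearRegression
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Let's split our data into training data and testing data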
trainTest = data2.randomSplit([0.8, 0.2])
trainingDF = trainTest[0]
testDF = trainTest[1]

assembler = VectorAssembler(
    inputCols = ['passenger_count','store_and_fwd_flag','pickup_day','dropoff_day','pickup_month','dropoff_month','pickup_hour','dropoff_hour','distance'],
    outputCol = 'features'
)

# Now create our decision tree and linear regression models
dtr = DecisionTreeRegressor(featuresCol="features", labelCol="trip_duration", predictionCol="prediction1")
lir = LinearRegression(featuresCol='features', labelCol='prediction1', predictionCol='prediction2')

# Create Evaluator
prmse = RegressionEvaluator(labelCol="trip_duration", predictionCol="prediction2", metricName="rmse")

pipeline = Pipeline(stages=[assembler, dtr, lir])

paramGrid = ParamGridBuilder() \
.addGrid(dtr.maxDepth, [10]) \
.addGrid(lir.maxIter, [10]) \
.addGrid(lir.regParam, [.01, .1, 1]) \
.addGrid(lir.elasticNetParam, [.5, .75, 1]) \
.build()

cv = CrossValidator(estimator=pipeline, estimatorParamMaps=paramGrid, evaluator=prmse, numFolds=10)

# Apply cross validation to the training data and generate a model

model = cv.fit(trainingDF)
predictions = model.transform(testDF).cache()
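
For illustration only, here is a minimal plain-Python sketch of the idea being asked about: map each row to the leaf it lands in, one-hot encode that leaf index, and fit a linear model on the leaf indicators together with the raw feature. Everything in it is an assumption made for the example, not part of the pipeline above: the `THRESHOLDS` stand in for a fitted tree, the data is a toy set, and the helper names (`leaf_index`, `design_row`, `fit_least_squares`, `predict`) are hypothetical. In Spark 3.0 and later, `DecisionTreeRegressor` has a `leafCol` parameter that writes each row's leaf index to a column, so one possible route in a pipeline would be to one-hot encode that column and assemble it with the original features before `LinearRegression`; that route is not shown or tested here.

```python
# Hypothetical stand-in for a fitted tree: split points on a single feature.
# Two thresholds give three leaves (indices 0, 1, 2).
THRESHOLDS = [2.0, 5.0]


def leaf_index(x):
    """Return the index of the leaf that x falls into."""
    return sum(x >= t for t in THRESHOLDS)


def one_hot(index, size):
    """Encode a leaf index as a 0/1 indicator vector."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def design_row(x):
    """Leaf indicators followed by the raw feature value."""
    return one_hot(leaf_index(x), len(THRESHOLDS) + 1) + [x]


def solve(a, b):
    """Solve a @ w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * w[c] for c in range(r + 1, n))
        w[r] = (m[r][n] - tail) / m[r][r]
    return w


def fit_least_squares(rows, ys):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve(xtx, xty)


def predict(w, x):
    """Apply the fitted weights to the leaf-augmented row for x."""
    return sum(wi * vi for wi, vi in zip(w, design_row(x)))


# Toy data: each leaf has its own offset, plus a shared slope of 2 on x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
offsets = [10.0, 20.0, 30.0]
ys = [offsets[leaf_index(x)] + 2.0 * x for x in xs]

weights = fit_least_squares([design_row(x) for x in xs], ys)
```

On this toy data the fitted `weights` come out close to the three per-leaf offsets followed by the shared slope. Note that a linear model fitted on the leaf indicators alone, with no other features, would only reproduce one constant per leaf, much like the tree itself, which is why the raw feature is kept alongside the indicators in this sketch.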

Content Attribution

This content was originally published by Andrew at Recent Questions - Stack Overflow, and is syndicated here via their RSS feed. You can read the original post over there.
