Machine Learning on Databricks — part 2: modeling experiments

Dr. Marco Berta
2 min read · Dec 15, 2021

The code discussed in this post is available at:

https://github.com/opsabarsec/titanic_on_databricks

A small note beforehand, for Data Science newbies or simply for people who think Deep Learning = Magic: better data and smarter feature engineering will beat a better algorithm 9–0. Data Science is about studying data, not models.

That said, experimenting with several models and keeping track of the results is a typical task for a modern Data Scientist. After building a data pipeline on Databricks, as shown in the previous post, it is time to feed everything to a machine learning (ML) algorithm. A small recap: the data come from the famous Titanic dataset, for many the entrance door to artificial intelligence (AI). The problem is typically solved with classical ML algorithms such as logistic regression. When working with Spark, an important step is to wrap all of this into a modelling pipeline.

Stage 1: one hot encoding of categorical columns, done with the following code:

from pyspark.ml.feature import OneHotEncoder, StringIndexer

# Every string column in the training data is treated as categorical
categoricalCols = [field for (field, dataType) in trainDF.dtypes if dataType == "string"]
indexOutputCols = [x + "Index" for x in categoricalCols]
oheOutputCols = [x + "OHE" for x in categoricalCols]

# First map each category to a numeric index, then one-hot encode the indices
stringIndexer = StringIndexer(inputCols=categoricalCols, outputCols=indexOutputCols, handleInvalid="skip")
oheEncoder = OneHotEncoder(inputCols=indexOutputCols, outputCols=oheOutputCols)

Slightly more involved than in Pandas, since we use two transformers (indexing + encoding) instead of a single function.

Stage 2: Assemble predictive features

The next step is one that differs from the Pandas workflow: at this point you need to invoke VectorAssembler. Quoting from [1]: “VectorAssembler is a transformer that combines a given list of columns into a single vector column. It is useful for combining raw features and features generated by different feature transformers into a single feature vector, in order to train ML models like logistic regression and decision trees.”
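A minimal sketch of this stage is below; the column selection (all non-string columns except the “Survived” label, plus the one-hot-encoded outputs from stage 1) is my assumption, not necessarily what the repository does:

from pyspark.ml.feature import VectorAssembler

# Assumption: every non-string column except the label "Survived" is a raw numeric feature
numericCols = [field for (field, dataType) in trainDF.dtypes
               if dataType != "string" and field != "Survived"]

# Combine one-hot-encoded categoricals and numeric columns into a single vector column
vecAssembler = VectorAssembler(inputCols=oheOutputCols + numericCols, outputCol="features")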

Stage 3: invoke the algorithm you want to use for the model

In this case we use Logistic Regression, from the pyspark.ml.classification module.
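A minimal sketch of this stage, assuming the label column is “Survived” and the assembled features column is called “features” as above:

from pyspark.ml.classification import LogisticRegression

# Binary classifier: survived (1) vs. did not survive (0)
lr = LogisticRegression(featuresCol="features", labelCol="Survived")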

Finally, let’s build our model pipeline using the Pipeline class provided by Spark:

from pyspark.ml import Pipeline

# Chain the stages and fit them all at once on the training data
stages = [stringIndexer, oheEncoder, vecAssembler, lr]
pipeline = Pipeline(stages=stages)
pipelineModel = pipeline.fit(trainDF)

The fitted pipeline model can be exported

pipelineModel.write().overwrite().save(pipelinePath)

and reloaded when necessary, without having to retrain the algorithm:

from pyspark.ml import PipelineModel

savedPipelineModel = PipelineModel.load(pipelinePath)

We can use it to make predictions on our test set, or on new data that may flow in over time.
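For example (testDF here is an assumed held-out split with the same schema as trainDF):

# The saved pipeline applies indexing, encoding, assembling and the model in one call
predDF = savedPipelineModel.transform(testDF)
predDF.select("Survived", "prediction", "probability").show(5)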

Poor prediction? Well, in my case R2 was a shameful 0.2. No need to panic: this article shows how to better extract information from the data to improve the predictions [2] (again, NOT by changing the algorithm).
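For reference, a score like that can be obtained with Spark’s built-in evaluators; this is a sketch on the predDF from the snippet above, not necessarily the exact code behind the number quoted here:

from pyspark.ml.evaluation import RegressionEvaluator, MulticlassClassificationEvaluator

# R2 computed on the 0/1 predictions, as quoted above
r2 = RegressionEvaluator(labelCol="Survived", predictionCol="prediction",
                         metricName="r2").evaluate(predDF)

# A more conventional classification metric, for comparison
accuracy = MulticlassClassificationEvaluator(labelCol="Survived", predictionCol="prediction",
                                             metricName="accuracy").evaluate(predDF)
print(f"R2: {r2:.2f}, accuracy: {accuracy:.2f}")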

Next, it would be good to track the experiments using MLflow. That deserves a separate post; the basics of MLflow are illustrated in the Databricks documentation [3].
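As a teaser, the basic pattern looks roughly like this (the run name, parameters and metrics are placeholders of my choosing):

import mlflow
import mlflow.spark

with mlflow.start_run(run_name="titanic-logistic-regression"):
    mlflow.log_param("model", "LogisticRegression")
    mlflow.log_metric("accuracy", accuracy)
    # Log the fitted Spark pipeline as a model artifact
    mlflow.spark.log_model(pipelineModel, "model")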

[1] https://george-jen.gitbook.io/data-science-and-apache-spark/vectorassembler

[2] Titanic Survival Prediction | Your first Data Science Project (analyticsvidhya.com)

[3] MLflow guide | Databricks on AWS


Dr. Marco Berta

Senior Data Scientist @ ZF Wind Power, Ph.D. in Materials Science from Manchester University