Machine Learning on Databricks — part 1: the data pipeline

Dr. Marco Berta
Nov 27, 2021

The code discussed in this post is available at:

https://github.com/opsabarsec/titanic_on_databricks

Data Science is getting more and more automated. Tuning coefficients or adding layers to a neural network are tasks that will soon be regarded as the work of artisans who bind books by hand. Extract, Transform, Load (ETL) and simple data cleaning, on the other hand, are tasks that don’t require deep knowledge of math and algorithms but that are crucial when you plan to build ML/AI models. Imagine that you have impressed your stakeholders with a brilliant proof of concept (POC), conveniently run on your beloved laptop; soon it will be time to scale the pipeline up to gigabytes of data. This was traditionally achieved with parallel processing on Hadoop and, more recently, with Spark/Scala. Databricks provides a versatile and intuitive platform for the latter, currently running Spark 3.1.2. Here I show how to use this technology to build a simple data pipeline, from raw data to data engineered for ML/AI models. The source data in this example is the basic Titanic dataset, chosen for the sake of simplicity, but the concepts apply just as well to larger and more complicated datasets.

Data transformation on Databricks typically follows the multi-hop schema [2] illustrated below, which uses terminology probably borrowed from alchemy.

Figure 1. Multi-hop data refining, a common schema for ETL pipelines on Databricks.

Data are ingested from the data lake source, then processed and recorded at each step into tables that correspond to different quality levels in the pipeline:

  • data ingestion (“Bronze” tables): set the correct schema
  • data cleaning and augmenting (“Silver” tables)
  • transformation/feature engineering to make the data ready for ML/AI (“Gold” tables)

Having separate tables provides checkpoints, which makes later modifications to the code easier. Let’s look at each pipeline step in more detail.

Step 0: Define data paths

Not mandatory, but good practice. It saves a lot of time that would otherwise be spent manually tracking down bugs when, for some reason, you need to modify the data sources.
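As an illustration, the configuration notebook can be as simple as a handful of path variables, one per quality level. The sketch below is only an assumption about the folder layout and names, not the exact contents of the repo.

# configuration notebook: one place for every path used in the pipeline
# (base folder and names below are illustrative assumptions)
base_path = "dbfs:/FileStore/titanic/"

raw_path = base_path + "raw/"        # landing zone for the source CSV
bronze_path = base_path + "bronze/"  # ingested data, schema enforced
silver_path = base_path + "silver/"  # cleaned and imputed data
gold_path = base_path + "gold/"      # feature-engineered data, ready for ML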

Step 1: Raw to Bronze: data ingestion

Time to start working with the data. First of all, you simply load the configuration notebook with the command:

%run ./configuration

then create the dataframe from the raw data, a CSV file in this example, making sure the header and column types are correct. The latter can be simplified with Spark’s inferSchema functionality; the automatically detected schema can also be overridden by setting a user-defined one. Finally, the dataframe is exported to a Delta table in the bronze folder:

df.write.format("delta").mode("overwrite").save(bronze_path + 'bronze_measurements')
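For completeness, the read that produces df above could look like the sketch below; the file name and the partial user-defined schema are assumptions, not the exact ones from the repo.

from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

# option 1: let Spark infer the column types from the CSV
df = (spark.read
      .format("csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load(raw_path + "titanic.csv"))

# option 2: enforce a user-defined schema instead (only a few columns shown)
titanic_schema = StructType([
    StructField("PassengerId", IntegerType(), True),
    StructField("Survived", IntegerType(), True),
    StructField("Pclass", IntegerType(), True),
    StructField("Name", StringType(), True),
    StructField("Sex", StringType(), True),
    StructField("Age", DoubleType(), True),
])
# df = spark.read.format("csv").option("header", "true").schema(titanic_schema).load(raw_path + "titanic.csv")

Inferring the schema is convenient for exploration, while an explicit schema makes the bronze table reproducible if the source file changes.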

Step 2: Bronze to Silver: data cleaning

Several data cleaning operations can be applied to the dataset: for example, verifying that there are no negative values, detecting outliers, checking for duplicate IDs, and so on.
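As a rough sketch, such sanity checks could be written as below; the columns checked (PassengerId, Fare, Age) are simply examples from the Titanic schema.

from pyspark.sql import functions as F

# duplicate IDs: does any PassengerId appear more than once?
bronzeDF.groupBy("PassengerId").count().filter(F.col("count") > 1).show()

# negative values where they make no sense, e.g. the ticket fare
print(bronzeDF.filter(F.col("Fare") < 0).count())

# quick outlier check: summary statistics of the numeric columns
bronzeDF.select("Age", "Fare").summary("min", "25%", "75%", "max").show()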

The main cleaning step here was the removal or imputation of empty cells. The feature most affected was the age, which was imputed with the median value for the corresponding passenger cabin class.

bronzeDF.groupBy("Pclass").agg(F.percentile_approx("Age", 0.5).alias("median")).show()
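One way to apply this per-class imputation, sketched under the assumption that the medians are joined back onto the data (the variable and path names are mine, not necessarily those of the repo):

from pyspark.sql import functions as F

# per-class median age, kept as a dataframe rather than just displayed
age_medians = (bronzeDF.groupBy("Pclass")
               .agg(F.percentile_approx("Age", 0.5).alias("median_age")))

# join the medians back and fill missing ages with the class median
silverDF = (bronzeDF.join(age_medians, on="Pclass", how="left")
            .withColumn("Age", F.coalesce(F.col("Age"), F.col("median_age")))
            .drop("median_age"))

# persist the cleaned data as the silver table
silverDF.write.format("delta").mode("overwrite").save(silver_path + "silver_measurements")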

Step 3: Silver to Gold: data engineering

I looked for groups of passengers sharing the same ticket, but this case was rather rare and would not have added much information. I then worked out the family size with a simple operation:

titanic_df = silverDF.withColumn("Family_Size", F.col('SibSp') + F.col('Parch') + 1)

The family size is the number of siblings/spouses (SibSp) plus the number of parents/children (Parch) aboard the Titanic, plus the passenger themselves (+1).
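The engineered dataframe can then be persisted as the gold table, mirroring the earlier writes; the path and table name below are assumptions.

# save the feature-engineered data to the gold folder, ready for ML/AI models
titanic_df.write.format("delta").mode("overwrite").save(gold_path + "gold_titanic")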

Each step presents only very basic operations, since this is intended as a simple introduction to data engineering on Databricks rather than a complete treatment aimed at the best possible transformations, as in a Kaggle competition. To build this pipeline, it is advisable to have some basic coding skills in Spark and SQL and to be familiar with Databricks basics [3].

References

[1] https://medium.datadriveninvestor.com/why-cant-we-find-enough-data-engineers-dbc0d053208b

[2] https://databricks.com/blog/2019/08/14/productionizing-machine-learning-with-delta-lake.html

[3] https://towardsdatascience.com/spark-databricks-important-lessons-from-my-first-six-months-d9b26847f45d


Dr. Marco Berta

Senior Data Scientist @ ZF Wind Power, Ph.D. in Materials Science from Manchester University