Machine Learning on Databricks — part 1: the data pipeline

Figure 1. Multi-hop data refining, a common scheme for ETL pipelines on Databricks:
  • Data ingestion (“Bronze” tables): load the raw data and set the correct schema.
  • Data cleaning and augmentation (“Silver” tables).
  • Transformation and feature engineering to make the data ready for ML/AI (“Gold” tables).
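
The three hops above can be illustrated with a minimal sketch. It uses plain Python lists of dictionaries as a stand-in for Spark DataFrames so that it runs anywhere; on Databricks each stage would typically be persisted as its own table. The sensor records, the column names and the functions `to_bronze`, `to_silver` and `to_gold` are all hypothetical and chosen only for illustration; they are not taken from the original pipeline.

```python
from statistics import mean

# Hypothetical raw records, as they might arrive from a CSV export:
# every value is still a string, some are missing, some are duplicated.
RAW_ROWS = [
    {"sensor_id": "a", "temperature": "21.5", "timestamp": "2021-01-01"},
    {"sensor_id": "a", "temperature": "", "timestamp": "2021-01-02"},
    {"sensor_id": "b", "temperature": "19.0", "timestamp": "2021-01-01"},
    {"sensor_id": "b", "temperature": "19.0", "timestamp": "2021-01-01"},
    {"sensor_id": "a", "temperature": "22.5", "timestamp": "2021-01-03"},
]


def to_bronze(raw_rows):
    """Ingest raw rows and enforce a schema (temperature becomes float or None)."""
    bronze = []
    for row in raw_rows:
        temp = row["temperature"]
        bronze.append({
            "sensor_id": str(row["sensor_id"]),
            "temperature": float(temp) if temp else None,
            "timestamp": row["timestamp"],
        })
    return bronze


def to_silver(bronze_rows):
    """Clean the data: drop rows with missing values and exact duplicates."""
    seen = set()
    silver = []
    for row in bronze_rows:
        key = (row["sensor_id"], row["temperature"], row["timestamp"])
        if row["temperature"] is None or key in seen:
            continue
        seen.add(key)
        silver.append(row)
    return silver


def to_gold(silver_rows):
    """Engineer features: one row per sensor with simple aggregates."""
    by_sensor = {}
    for row in silver_rows:
        by_sensor.setdefault(row["sensor_id"], []).append(row["temperature"])
    return [
        {"sensor_id": sid, "mean_temperature": mean(temps), "n_readings": len(temps)}
        for sid, temps in sorted(by_sensor.items())
    ]


bronze = to_bronze(RAW_ROWS)   # typed, but otherwise untouched
silver = to_silver(bronze)     # cleaned
gold = to_gold(silver)         # model-ready features
```

Keeping each hop as a separate function (or table) means a change in the cleaning rules only requires rebuilding Silver and Gold, while the ingested Bronze data stays as the untouched reference copy.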

Step 0: Define data paths

Step 1: Raw to Bronze: data ingestion

Step 2: Bronze to Silver: data cleaning

Step 3: Silver to Gold: data engineering


Data Scientist and R&D Engineer at Barry Callebaut; Ph.D. in Materials Science from Manchester University

Dr. Marco Berta
