Big Data Time Series resampling using the Apache Spark library Tempo on Databricks

Dr. Marco Berta
3 min read · Apr 2, 2024
Time series — Spark dataframe

Big data volumes are a common challenge when working with time series coming from IoT sensors. In most industrial machines, one or more controllers are connected to a gateway running HMI/supervisory control and data acquisition (SCADA) software. The HMI/SCADA gateway is in turn connected to historian software, which can retrieve the data using SQL queries [1]. An example of mock data, as it could be imported into a table using Aveva Historian, is shown below:

Table 1. Extract from Historian data
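For context on that retrieval step, the sketch below shows one way such data could be pulled into a Spark dataframe over JDBC. It is only illustrative: the server address, database name, credentials and query are hypothetical placeholders and will differ per installation.

# Hypothetical JDBC read of historian data into Spark (host, database, credentials and query are placeholders)
df_hist = (spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://historian-host:1433;databaseName=Runtime")
    .option("query", "SELECT DateTime, TagName, Value FROM History WHERE DateTime >= '2024-01-01'")
    .option("user", "reader")
    .option("password", "<password>")
    .load())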

With the advent of cloud computing and AI, the next step has been to transfer these data as faithfully as possible from the local database to cloud storage (an AWS S3 bucket, Azure Blob Storage, …), typically as parquet files [2]. These can in turn be imported into dataframes using Apache Spark, and Databricks provides an excellent interface for this purpose.
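As a rough sketch of that landing step, the dataframe from the previous snippet could be written out as parquet; the storage path below is a hypothetical placeholder.

# Hypothetical example: write the historian extract to cloud storage as parquet
df_hist.write.format("parquet").mode("append").save("abfss://sensors@myaccount.dfs.core.windows.net/historian/mock_data")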

At this point we have a big, bulky table that often contains fine-grained data, too large to be exported to a Pandas dataframe. On Databricks, one solution is to use Koalas [3] to translate Pandas logic to Spark easily. But often the exact function or Python library needed for your machine learning logic is not available there, and you would prefer to work directly on Pandas dataframes. Moreover, you may need sensor data points at minute or even hour intervals, not (milli)seconds. Resampling Spark dataframes has proven challenging in the past due to inherent limitations of distributed computing [4]. Today the tempo package has reached a good level of maturity and this process has been significantly simplified: it can be done in Databricks with a few lines of code. The tempo library can be conveniently installed using pip:

%pip install dbl-tempo

Then it can be imported together with other Apache Spark functions:

import pyspark.sql.functions as F

from tempo.tsdf import TSDF

A Spark dataframe can be quickly obtained from a parquet file:

df = spark.read.format("parquet").load("dbfs:/FileStore/shared_uploads/sensor/mock_data/part_00000_tid_1519639363878959916_76cbb055_c467_423b_a040_8ed072868ec3_3704_1_c000_snappy.parquet")

and it can be encoded into a time series dataframe using tempo's TSDF class:

df_tsdf = TSDF(df, ts_col="DateTime", partition_cols=["TagName"])

Notice that the time series column and the partition (label) columns should be specified; for our example these correspond to "DateTime" and "TagName" respectively. At this point the time series dataframe can be conveniently resampled using a built-in function. Allowable resample frequencies are microsecond (musec), millisecond (ms), second (sec), minute (min), hour (hr), and day. Assuming that we want the average value for every hour, the resampled dataframe can be obtained as:

df_resampled = df_tsdf.resample(freq="hr", func="mean").df
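Other frequency and aggregation combinations follow the same pattern. For instance, a minute-level downsample keeping the largest reading in each minute would look like the line below; this assumes "max" is among tempo's supported aggregation functions, alongside "mean".

# Illustrative variant: minute-level resampling with a different aggregation
df_per_minute = df_tsdf.resample(freq="min", func="max").df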

The hourly resampled dataframe is much smaller than the original imported from the parquet file, so it can be conveniently converted to Pandas:

df_pandas = df_resampled.toPandas()

At this point we are ready for feature engineering and machine learning analysis.
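As a small illustrative next step (the "Value" column name below is an assumption; adjust it to match your sensor schema), the tags can be pivoted into columns and simple rolling features derived:

# Hypothetical feature-engineering sketch on the hourly Pandas dataframe
wide = df_pandas.pivot_table(index="DateTime", columns="TagName", values="Value")

features = wide.rolling(window=24, min_periods=1).mean().add_suffix("_rolling_24h")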

References

  1. Historian Retrieval Guide (logic-control.com)
  2. Parquet — Practical Data Science
  3. Interoperability between Koalas and Apache Spark — Databricks
  4. Resampling and interpolation of bigdata timeseries in python, pyspark | delaware (medium.com)


Dr. Marco Berta

Senior Data Scientist @ ZF Wind Power, Ph.D. in Materials Science from the University of Manchester