Time series - general notes

Temporal data is a little bit particular. It can be generated very frequently, as for instance every 500 ms or less. It's then important to store it efficiently and to allow quick and flexible reads. It's also important to know the specificities of time-series as a popular case of temporal data.

4-day workshop · In-person or online

What would it take for you to trust your Databricks pipelines in production?

A 3-day bug hunt on a 3-person team costs up to €7,200 in lost engineering time. This workshop teaches you to prevent that — unit tests, data tests, and integration tests for PySpark and Databricks Lakeflow, including Spark Declarative Pipelines.

Unit, data & integration tests
Medallion architecture & Lakeflow SDP
Max 10 participants · production-ready templates
See the full curriculum → €7,000 flat fee · cohort of up to 10
Bartosz Konieczny
Bartosz
Konieczny

This post is called "general notes" since it presents the common points of time series from the bird's eye view. The first section explains the general points about time series. The next one focuses on the data characteristics of the time series while the last one tries to explain the technical solutions helping to put them in place.

General information

Time series belongs to the family of the temporal data, i.e. the values somehow related to the time concept (either a point in time or an interval). The main ordering domain of this data category is unsurprisingly the time (e.g. event time).

The time series is the subset of the temporal data. It's represented as a sequence of data points happened over a time interval. The happen-moment can occur either at regular or irregular interval. This regularity has a big influence on the amount of registered data. The regular registering every 1 second will generate much more data than an irregular one saving for instance ATM transactions.

The registering regularity brings the first points describing time series - the granularity. It defines the scale of the stored data and thus determines the level of information we can gather through it. The information contained in time series is quite important. It lets pretty easily to observe the trends (e.g. website real-time audience) and correlate the measurements with potentially impacting events.

Among the operations we can made with the time series we can find a lot of common points with classical RDBMS querying, as:

Data characteristics

The time series data can be described by the following concepts:

To have a more precise context our temperature measurement example can look like in the table below:

The time series data character is also specific. It can be summarized by the following points:

The architectures for time series

In "Comparison of Time Series Databases" Andreas Bader classifies the time series databases according to the criteria emphasizing the technologies behind the databases used to store time series. Among them we can distinguish 3 technical approaches:

The "Comparison of Time Series Databases" gives also some insight on the evaluation of the architecture well suited for time series data:

This post summarized the basic information about the time series. As we could see in the first part, they're a subset of temporal data, i.e. data sorted by the event time. This kind type of data is characterized by regular or irregular generation. The database storing time series must be resilient and easily scalable since the volume of time series can grow very quickly. Moreover, it should support the requirement for long-term storage by supporting data removal and aggregation. As shown in the last section, different approaches exist to deal with that. The first ones are based on 3rd part NoSQL storages as Cassandra or HBase. The other ones use the classical RDBMS while the last category doesn't require any specific storage.

Data Engineering Design Patterns

Looking for a book that defines and solves most common data engineering problems? I wrote one on that topic! You can read it online on the O'Reilly platform, or get a print copy on Amazon.

I also help solve your data engineering problems contact@waitingforcode.com đź“©