Between 2020 and 2022, the total data volume is expected to grow from approx. 1 PB to 2.02 PB, an average annual growth of about 42.2%. The more data there is, the more work data specialists and engineers have to do. Applying the functional programming paradigm to data engineering can bring clarity to the process: it makes the work of data teams easier and helps solve recurring problems. Today's article focuses on the functional data engineering approach and its best practices.
What is functional data engineering?
Functional programming focuses on building programs out of functions that operate on immutable values. One of its key principles is to write programs using only pure functions. A pure function has clearly declared inputs and outputs and no side effects. Accordingly, the functional data engineering paradigm defines data processes in terms of pure tasks. Thanks to these properties, functional programming is well suited to demanding workloads such as data analysis and machine learning.
Four best practices
Maxime Beauchemin wrote an article on the idea of functional data engineering. The practices below are based on his considerations.
ENSURING REPRODUCIBILITY
The first practice to mention in functional data engineering is reproducibility. It matters in both data engineering and data processing.
First, reproducibility is a fundamental principle of science. For example, when you publish an article that contains outstanding results, other people must be able to reproduce your claims and findings.
Second, reproducibility is critical from both a legal and a sanity perspective. Suppose you publish a report, provide some data, and then make a decision based on that data; you must be able to explain the basis on which you made that decision. The sanity standpoint matters too: if you are a data engineer who does the same job every day yet gets different results each time, it can wear you down. What guarantees reproducibility is the functional data engineering approach.
HAVING PURE TASKS
The first step toward successful batch processing is to avoid any side effects unrelated to the task at hand. Pure tasks are deterministic: given the same source partitions, they produce the same results. They are also idempotent, meaning you get the same outcome every time you restart a task. In addition, pure tasks have no side effects, read only immutable sources, and usually target a single partition, which makes them easy to reason about.
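The properties above can be shown in a minimal Python sketch. The function and field names (`transform_orders`, `amount_cents`, `ds`) are illustrative, not taken from any specific system:

```python
from datetime import date

def transform_orders(rows: list, partition_day: date) -> list:
    """A pure task: its output depends only on its inputs.

    No global state is read or written, so rerunning it on the same
    source partition is deterministic and idempotent.
    """
    return [
        {
            "order_id": r["order_id"],
            "amount_usd": round(r["amount_cents"] / 100, 2),
            "ds": partition_day.isoformat(),  # target partition key
        }
        for r in rows
        if r["status"] == "completed"
    ]

rows = [
    {"order_id": 1, "amount_cents": 1999, "status": "completed"},
    {"order_id": 2, "amount_cents": 500, "status": "cancelled"},
]
out1 = transform_orders(rows, date(2022, 1, 1))
out2 = transform_orders(rows, date(2022, 1, 1))
assert out1 == out2  # same input, same output on every rerun
```

Because the task touches nothing outside its arguments, a scheduler can safely retry it any number of times.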
If you want a purely functional data warehouse, avoid mutations (update, upsert, append, delete). Instead, use an insert-overwrite that replaces a whole partition. Another good practice is to design tasks that limit the number of source partitions they scan.
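As a rough sketch of the insert-overwrite idea, the snippet below models a warehouse as a mapping from partition key to rows and swaps a partition wholesale instead of mutating rows in place (the `overwrite_partition` helper is hypothetical, standing in for an engine-level `INSERT OVERWRITE`):

```python
def overwrite_partition(table: dict, partition_key: str, rows: list) -> dict:
    """Replace an entire partition atomically.

    Rows inside a partition are never updated, upserted, appended to,
    or deleted individually; the partition is rewritten as a whole.
    """
    new_table = dict(table)          # shallow copy: other partitions untouched
    new_table[partition_key] = rows  # the partition is swapped wholesale
    return new_table

warehouse = {"2022-01-01": [{"id": 1}]}
# "Fixing" yesterday's data means recomputing and overwriting the partition:
warehouse = overwrite_partition(warehouse, "2022-01-01", [{"id": 1}, {"id": 2}])
```

Because the overwrite is the only write operation, rerunning a backfill leaves the warehouse in the same state no matter how many times it runs.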
RELYING ON IMMUTABLE PARTITIONS
Using immutable objects is an effective way to enforce the functional programming paradigm. This design pattern fixes an object's state at creation, and that state cannot be changed afterward. As a result, any change to its state means creating a new object with the modified value. Immutability is central to functional programming, mostly because it guarantees that a function does not alter other variables.
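In Python, this pattern can be expressed with a frozen dataclass; the `Order` type here is just an illustration:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Order:
    """An immutable record: attempting to assign to a field raises
    dataclasses.FrozenInstanceError."""
    order_id: int
    amount_usd: float

o = Order(order_id=1, amount_usd=19.99)
# o.amount_usd = 25.0  # would raise FrozenInstanceError

# "Changing" the state means building a new object with the modified value:
o2 = replace(o, amount_usd=25.0)
assert o.amount_usd == 19.99  # the original is untouched
```

The same discipline applied to partitions, rather than single objects, gives the warehouse its functional guarantees.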
To keep the pipeline functional, treat partitions as immutable blocks of data and replace them systematically. Partitions then play the role of immutable objects and become the foundation of your data warehouse.
Implementing functional data engineering means partitioning all your tables systematically: you never modify existing partitions, you only append new ones.
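A small sketch of this append-only discipline, again with a dictionary standing in for a partitioned table (the `append_partition` helper and the write-once check are assumptions, not a real warehouse API):

```python
from datetime import date

def append_partition(table: dict, ds: date, rows: list) -> dict:
    """Add a new daily partition; existing partitions are never modified."""
    key = ds.isoformat()
    if key in table:
        raise ValueError("partitions are write-once; overwrite, don't edit")
    # store rows as a tuple so the partition itself is immutable
    return {**table, key: tuple(rows)}

table: dict = {}
table = append_partition(table, date(2022, 1, 1), [{"id": 1}])
table = append_partition(table, date(2022, 1, 2), [{"id": 2}])
```

Each daily run adds exactly one new partition, so the table's history is a log of immutable blocks.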
USING A PERSISTENT STAGING AREA
Another practice worth adopting in functional data engineering is a persistent staging area. The staging area is where raw data from external systems lands in your data warehouse. Typically, the staging area is not transformed at all, so the raw ingredients from external sources arrive mostly unchanged.
As long as the staging areas are immutable and the tasks are pure, the entire warehouse can, in theory, be recomputed from scratch.
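That full-rebuild property can be sketched as follows; `rebuild_warehouse` and `to_usd` are illustrative names, assuming the staging area is a mapping from partition key to raw rows:

```python
def rebuild_warehouse(staging: dict, transform) -> dict:
    """Recompute every warehouse partition from the persistent staging area.

    Because the staging data never changes and the transform is pure,
    a rebuild from scratch reproduces the warehouse exactly.
    """
    return {ds: transform(rows) for ds, rows in staging.items()}

staging = {"2022-01-01": [{"amount_cents": 1999}]}

def to_usd(rows):
    return [{"amount_usd": r["amount_cents"] / 100} for r in rows]

first = rebuild_warehouse(staging, to_usd)
second = rebuild_warehouse(staging, to_usd)
assert first == second  # the full rebuild is reproducible
```

This is the payoff of the previous practices: immutable raw inputs plus pure tasks make the whole warehouse a derived, recomputable artifact.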
Functional data engineering rests on transferring ideas from functional programming to the data engineering world. We have covered a few of the most important practices, including:
- Reproducibility – given the same data, you can reproduce a report even a year later.
- Pure tasks – functions without side effects (same input, same output).
- Immutable partitions – a function produces an output (data set, object) that cannot change once produced.
- Persistent staging areas – contain data that has been imported from the source systems.
Find out more about data engineering services.