Written by Juan Carlos Olamendy Turruellas
In this article, I want to discuss the evolution of data pipeline architectures: from the traditional centralized, batch-oriented, report-only data warehouse towards modern architectures based on distributed data stores, distributed computing, near-real-time processing, and the use of machine learning and analytics to support decision making in today's fast-changing business environment.
Big data has matured and become one of the pillars of any business strategy today, particularly in sales and digital marketing, where it is used to increase revenue, customer loyalty, and customer satisfaction. In a highly competitive and regulated environment, businesses must make decisions based on data rather than intuition. Making good decisions requires processing huge amounts of data efficiently (with as few computing resources and as little processing latency as possible), adding new data sources (structured, semi-structured, and unstructured ones such as UI activity, logs, performance events, sensor data, emails, documents, social media, etc.), and supporting those decisions with machine learning algorithms and visualization techniques.
Some companies, such as Netflix, publicly declare themselves data-driven because their core business, products, and services are built on insights derived from data analysis.
So, let's see how we can transform a traditional data warehouse architecture into a modern one that meets the challenges of big data and high-performance computing.
A traditional data warehouse architecture comprises the following elements:
This kind of architecture can be illustrated in the following figure.
This architecture has some drawbacks as shown below:
So, in order to overcome the limitations of the previous architecture, we need to think using new paradigms.
The modern pipeline architecture is an evolution of the previous one, integrating new data sources and new computing paradigms along with artificial intelligence, machine learning algorithms, and cognitive computing.
In this new approach, we have a pipeline engine with the following features:
Although there is a huge number of technologies (as big as big data itself) related to big data and analytics, I'll show a reference architecture for a modern data pipeline. I'll illustrate the functional aspect of every layer using particular technologies so you can research them further and learn more (see figure 02).
From this reference architecture, we can derive specific use cases for your business.
Data warehouses have been in use at major companies for many years. With the exponential growth of data, DWHs are reaching their capacity limits, and batch windows keep growing, putting SLAs at risk. One approach is to migrate heavy ETL and calculation workloads into Hadoop to achieve faster processing times, lower cost per unit of stored data, and free up DWH capacity for other workloads.
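To make the offload idea concrete, here is a minimal sketch of the kind of cleanse-and-aggregate workload that might move out of the warehouse into a distributed engine. It is plain Python rather than actual Spark code, and the record layout, field names, and revenue-per-region rollup are illustrative assumptions, not a prescribed schema.

```python
from collections import defaultdict

# Hypothetical raw records, as they might land from an operational source.
raw_events = [
    {"region": "EMEA", "amount": 120.0, "status": "ok"},
    {"region": "EMEA", "amount": 80.0,  "status": "error"},
    {"region": "APAC", "amount": 200.0, "status": "ok"},
]

def transform(events):
    """Cleansing + aggregation: drop invalid records, sum revenue per region.

    In an offloaded pipeline this filter/aggregate step would run as a
    distributed job (e.g. on Hadoop/Spark) instead of inside the DWH.
    """
    totals = defaultdict(float)
    for e in events:
        if e["status"] == "ok":          # cleansing step
            totals[e["region"]] += e["amount"]
    return dict(totals)

print(transform(raw_events))  # {'EMEA': 120.0, 'APAC': 200.0}
```

The warehouse then only loads the small, pre-aggregated result, which is what frees up its batch window.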
Here we have two major options:
We can visualize this scenario as shown in the figure 03.
In this scenario, we have Flume agents installed on every data source to ingest data into the pipeline. We can also use Kafka as a streaming data source (the front-end for real-time analytics) to store the incoming data in the form of events/messages. As part of the ingestion process, the data is stored directly on the Hadoop file system or in a scalable, fault-tolerant, distributed database such as HBase or Cassandra. The data is then processed, and predictive models are built, using Spark, Scala, and MLlib. The results are stored in Elasticsearch to improve the platform's search capabilities; predictive models can be stored in the Hadoop file system, and calculation results can be stored in Cassandra. The data can be consumed by traditional tools as well as by web and mobile applications via REST APIs.
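The ingest → compute → store → serve flow above can be sketched end to end in a few lines. This is a deliberately tiny in-memory simulation, not real Kafka or Cassandra code: the deque stands in for a Kafka topic, the dict for a serving store such as Cassandra or Elasticsearch, and all event fields and function names are illustrative assumptions.

```python
from collections import deque

event_log = deque()          # stand-in for a Kafka topic: ordered, append-only events
serving_store = {}           # stand-in for Cassandra/Elasticsearch: precomputed results

def ingest(event):
    """Ingestion layer: append an incoming event to the log."""
    event_log.append(event)

def compute():
    """Processing layer: count page views per user and write to the store."""
    while event_log:
        event = event_log.popleft()
        key = event["user"]
        serving_store[key] = serving_store.get(key, 0) + 1

def serve(user):
    """Serving layer ('REST endpoint'): read the precomputed result."""
    return serving_store.get(user, 0)

ingest({"user": "alice", "page": "/home"})
ingest({"user": "alice", "page": "/pricing"})
ingest({"user": "bob",   "page": "/home"})
compute()
print(serve("alice"))  # 2
```

The point of the pattern is the separation: writes hit only the log, heavy computation runs asynchronously, and reads hit only a store that was prepared ahead of time.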
This architectural design has the following benefits:
We can visualize this scenario as shown in the figure 04.
In this post, I've talked about the evolution of data pipeline architectures towards modern ones. Using the architectural patterns and strategies explained above, you can adapt your data pipeline architecture to be more scalable and resilient, and to help you make better decisions in today's changing world.