The inconvenient truth about operational data pipelines


The world is filled with situations where one size doesn’t fit everyone – shoes, healthcare, the amount of sprinkles you want on a fudge sundae, just to name a few. You can add data pipelines to the list.

Traditionally, a data pipeline provides connectivity to business applications, controls the requests and flow of data into new data environments, and then manages the steps required to cleanse, organize, and present a refined data product to consumers inside or outside the corporate walls. These results have become indispensable in helping decision-makers move their businesses forward.

Lessons from Big Data

Everyone knows the Big Data success stories: how companies like Netflix build pipelines that manage more than a petabyte of data every day, or how Meta analyzes over 300 petabytes of clickstream data within its analytics platforms. At this scale, it is easy to assume that all the difficult problems have already been solved.

Unfortunately, it is not that simple. Just ask anyone who works with operational data pipelines – they’ll be the first to tell you that one size definitely doesn’t fit all.


For operational data – the data that underlies core business areas such as finance, supply chain, and HR – organizations routinely fail to deliver value from analytics pipelines. That’s true even when those pipelines are designed to resemble Big Data environments.

Why? Because they’re trying to solve a fundamentally different data challenge with essentially the same approach, and it doesn’t work.

The problem is not the size of the data, but how complex it is.

Leading social or digital streaming platforms often store large datasets as a series of simple, ordered events. One row of data records a user watching a TV show; another records every “Like” button clicked on a social media profile. All of this data is processed through data pipelines at tremendous speed and scale using cloud technology.

The datasets themselves are large, but that is manageable because the underlying data is well organized from the start. The highly regular structure of clickstream data means that billions upon billions of records can be analyzed in no time.
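To make that regularity concrete, here is a minimal sketch (with invented event names and fields, not any platform’s actual schema) of why uniform, self-contained event rows aggregate so easily – every record has the same shape, so analysis is a single pass with no lookups into other tables:

```python
from collections import Counter

# Hypothetical clickstream events: every record has the same simple,
# self-contained shape, appended in time order.
events = [
    {"ts": 1, "user": "u1", "action": "play",  "item": "show_42"},
    {"ts": 2, "user": "u2", "action": "like",  "item": "post_7"},
    {"ts": 3, "user": "u1", "action": "pause", "item": "show_42"},
    {"ts": 4, "user": "u3", "action": "like",  "item": "post_7"},
]

# No joins, no cross-references: one linear scan answers the question.
# This is what makes billions of such records tractable at scale.
likes = Counter(e["item"] for e in events if e["action"] == "like")
print(likes["post_7"])  # → 2
```

The same one-pass pattern holds whether the list has four rows or four billion, which is why this workload parallelizes so well in the cloud.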

Data Pipelines and ERP Platforms

In contrast, operational systems, such as the enterprise resource planning (ERP) platforms that most organizations use to run their essential day-to-day processes, present a very different data landscape.

Since their introduction in the 1970s, ERP systems have evolved to optimize every shred of performance for capturing raw transactions from the business environment. Every sales order, financial ledger entry, and supply chain inventory item must be captured and processed as quickly as possible.

To achieve this feat, ERP systems have evolved to manage tens of thousands of individual database tables that track business data elements and even more relationships between those objects. This data architecture is effective at ensuring that a customer or supplier’s data is consistent over time.

But it turns out that what’s great for transaction speed within that business process is usually not so great for analytics performance. Instead of the clean, clear, and well-organized tables that modern online applications create, there is a spaghetti-like mess of data scattered across a complex, real-time, mission-critical application.

For example, analyzing a single financial transaction posted to a company’s books may require data from more than 50 different tables in the backend ERP database, often involving multiple lookups and calculations.
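A heavily simplified sketch shows why even one transaction fans out across tables. The schema below is entirely hypothetical – a real ERP would spread the same question across dozens of tables, not three – but even this toy version already needs two joins and a currency calculation to answer “what is this ledger entry worth in USD, and who booked it?”:

```python
import sqlite3

# Hypothetical, drastically simplified ERP-style schema (assumed names;
# a real system would use many more, far wider tables).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, region TEXT);
CREATE TABLE ledger_entries (id INTEGER PRIMARY KEY, customer_id INTEGER,
                             currency TEXT, amount REAL);
CREATE TABLE fx_rates (currency TEXT PRIMARY KEY, usd_rate REAL);

INSERT INTO customers VALUES (1, 'Acme', 'EMEA');
INSERT INTO ledger_entries VALUES (100, 1, 'EUR', 250.0);
INSERT INTO fx_rates VALUES ('EUR', 1.10);
""")

# Two joins plus a calculation, just to interpret a single transaction.
row = con.execute("""
    SELECT c.name, c.region,
           ROUND(l.amount * f.usd_rate, 2) AS usd_amount
    FROM ledger_entries AS l
    JOIN customers AS c ON c.id = l.customer_id
    JOIN fx_rates  AS f ON f.currency = l.currency
    WHERE l.id = 100
""").fetchone()
print(row)  # → ('Acme', 'EMEA', 275.0)
```

Scale the three joins here up to fifty-plus tables per transaction and the query-planning and tuning burden described next follows directly.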

To answer questions spanning hundreds of tables and relationships, business analysts must write increasingly complex queries that often take hours to return. Too often, these queries fail to provide timely answers and leave the company blind at a critical point in its decision-making process.

To solve this, organizations try to evolve the design of their data pipelines, distilling data into progressively simpler business views that reduce query complexity and make queries easier to execute.

This could work in theory, but it comes at the cost of oversimplifying the data itself. Rather than allowing analysts to ask and answer every question with data, this approach often summarizes or reshapes the data to improve performance. It means analysts can get quick answers to predefined questions and wait longer for everything else.
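The tradeoff can be sketched in a few lines (the records and column names below are invented for illustration): the pipeline pre-aggregates raw rows into one simplified view, so the predefined question is instant – but any column that was aggregated away is simply gone, and a new question means a trip back to the source system:

```python
# Hypothetical raw operational records (in a real pipeline these would
# stream in from the ERP source system).
ledger = [
    {"region": "EMEA", "currency": "EUR", "amount": 250.0},
    {"region": "EMEA", "currency": "GBP", "amount": 100.0},
    {"region": "AMER", "currency": "USD", "amount": 400.0},
]

# The pipeline reshapes the data into one simplified "business view":
# total amount per region. Only this shape survives downstream.
summary = {}
for row in ledger:
    summary[row["region"]] = summary.get(row["region"], 0.0) + row["amount"]

# The predefined question is answered instantly from the summary...
print(summary["EMEA"])  # → 350.0

# ...but a new question ("totals per currency?") cannot be answered
# from the summary at all: the currency column was aggregated away,
# so the analyst must go back to the source system and wait.
```

The summary buys speed for one known question at the price of every question nobody thought to pre-compute.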

With inflexible data pipelines, asking new questions means going back to the source system, which becomes time-consuming and quickly expensive. If something changes within the ERP application, the pipeline breaks completely.

Rather than applying a static pipeline model that cannot respond effectively to highly interconnected data, it is important to design for that level of connectivity from the start.

Instead of paring pipelines down ever further to solve the problem, the design should embrace those connections. In practice, this means addressing the fundamental reason the pipeline exists: making data accessible to users without the time and expense of costly analytical queries.

Each connected table in a complex analysis puts additional pressure on both the underlying platform and the people charged with maintaining performance by tuning and optimizing these queries. Rethinking the approach means optimizing everything as the data is loaded – and, crucially, before any queries are run. This is commonly referred to as query acceleration, and it provides a handy shortcut.

This approach to query acceleration delivers many multiples of the performance of traditional data analytics, without requiring the data to be prepared or modeled into summary views beforehand. By scanning and preparing the entire dataset before queries are run, there are fewer restrictions on how questions can be answered. It also improves usability by keeping the full scope of the raw business data available for exploration.
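One way to picture the load-time idea – a sketch of the general technique under my own assumptions, not any vendor’s implementation – is to resolve the joins once when data lands, so every later question is a simple scan of one analysis-ready table with all raw columns intact:

```python
import sqlite3

# Hypothetical normalized source tables (assumed names, as before).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, region TEXT);
CREATE TABLE ledger_entries (id INTEGER PRIMARY KEY,
                             customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'Acme', 'EMEA'), (2, 'Globex', 'AMER');
INSERT INTO ledger_entries VALUES (100, 1, 250.0), (101, 2, 400.0),
                                  (102, 1, 50.0);
""")

# Load-time step: materialize the joined, analysis-ready table ONCE.
con.execute("""
    CREATE TABLE flat_ledger AS
    SELECT l.id, c.name, c.region, l.amount
    FROM ledger_entries l JOIN customers c ON c.id = l.customer_id
""")

# Query-time: any question is now a plain scan of a single table --
# no join planning, no per-query tuning, and every raw column remains
# available for questions nobody anticipated.
totals = dict(con.execute(
    "SELECT region, SUM(amount) FROM flat_ledger GROUP BY region"
).fetchall())
print(totals)  # totals == {'AMER': 400.0, 'EMEA': 300.0}
```

The work of connecting fifty tables does not disappear; it is paid once at load time instead of on every analyst’s query.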

By questioning the fundamental assumptions about how we acquire, process and analyze our operational data, it is possible to simplify and streamline the steps required to move from expensive, vulnerable data pipelines to faster business decisions. Remember: one size doesn’t fit all.

Nick Jewell is the senior director of product marketing at Incorta.
