Data warehouses and lakes will merge

My first prediction concerns the foundation of modern data systems: the storage layer. For decades, data warehouses and lakes have enabled companies to store (and sometimes process) large amounts of operational and analytical data. While a warehouse stores data in a structured state, organized into schemas and tables, a lake primarily stores raw, unstructured data.

However, as these technologies mature and vendors try to “win” the data storage wars, companies like AWS, Snowflake, Google and Databricks are developing solutions that combine the best of both worlds, blurring the lines between data warehouse and data lake architectures. In addition, more and more companies are using both warehouses and lakes, either as one solution or as a patchwork of several.

Primarily to keep up with the competition, major warehouse and lake vendors are developing new functionality that brings the two solutions closer together. As data warehouse software expands to cover data science and machine learning use cases, more companies are building tools to help data teams get more from raw data.

But what does this mean for data quality? In our view, this convergence of technologies is ultimately good news. Kind of.


On the one hand, operationalizing data with fewer tools means there are, in theory at least, fewer opportunities for data to break in production. The lakehouse requires more standardization of how data platforms work and therefore opens the door to a more centralized approach to data quality and observability. Frameworks like ACID (atomicity, consistency, isolation, durability) transactions and Delta Lake make data contracts and change management much more manageable at scale.
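
To make this more concrete, here is a minimal sketch of the kind of schema enforcement a lakehouse table format provides. It assumes a Spark session with the delta-spark package installed, and the table path and columns are illustrative; the point is simply that a write which violates the table’s agreed-upon schema fails loudly at write time instead of silently corrupting downstream consumers.

```python
# Minimal sketch: Delta Lake schema enforcement as a lightweight "data contract".
# Assumes delta-spark is installed; the table path and columns are illustrative.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("schema-enforcement-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
    .getOrCreate()
)

orders_path = "/lake/tables/orders"  # hypothetical Delta table location

# A new batch of records, declared with an explicit schema.
new_batch = spark.createDataFrame(
    [(1001, "2023-01-15", 49.99)],
    schema="order_id INT, order_date STRING, amount DOUBLE",
)

# The append succeeds only if new_batch matches the existing table schema;
# a renamed column or changed type raises an AnalysisException at write time
# rather than breaking dashboards and models further downstream.
new_batch.write.format("delta").mode("append").save(orders_path)
```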

We predict that this convergence will be good for data consumers (both financially and in terms of resource management), but it is also likely to add complexity to your data pipelines.

Emergence of new roles in the data team

In 2012, the Harvard Business Review called “data scientist” the sexiest job of the 21st century. Shortly after, in 2015, DJ Patil, a PhD and former data science lead at LinkedIn, was appointed the first-ever Chief Data Scientist of the United States. And in 2017, Apache Airflow creator Maxime Beauchemin predicted the downfall of the data engineer in a canonical blog post.

The days of siloed database administrators and analysts are long gone. Data is emerging as its own enterprise-wide organization, with dedicated roles such as data scientists, analysts, and engineers. We predict that even more specializations will emerge in the coming years to address the ingestion, cleansing, transformation, translation, analysis, productization and reliability of data.

This wave of specialization is, of course, not unique to data. Specialization is common in almost every industry and signals a market maturity that points to the need for scale, improved speed and increased performance.

The roles we predict will dominate the data organization over the next decade include:

  • Data product manager: The data product manager is responsible for managing the lifecycle of a particular data product and is often responsible for managing cross-functional stakeholders, product roadmaps and other strategic tasks.
  • Analytics engineer: The analytics engineer, a term popularized by dbt Labs, sits between data engineers and analysts and is responsible for transforming and modeling data so that stakeholders are empowered to trust and use it. Analytics engineers are specialists and generalists at the same time, often owning multiple tools in the stack and juggling technical and less technical tasks.
  • Data reliability engineer: The data reliability engineer is dedicated to building more resilient data stacks, primarily through data observability, testing, and other common approaches. Data reliability engineers often have DevOps skills and experience that can be applied directly to their new roles.
  • Data designer: A data designer works closely with analysts to help them tell stories with data through business intelligence visualizations or other frameworks. Data designers are more common in larger organizations and often come from product design backgrounds. Data designers should not be confused with database designers, an even more specialized role that actually models and structures data for storage and production.

So, how will the emergence of specialized data roles – and larger data teams – affect data quality?

As the data team diversifies and use cases increase, so will the number of stakeholders. Larger data teams and more stakeholders mean more eyeballs are looking at the data. As one of my colleagues says, “The more people look at something, the more likely they are to complain about it.”

Rise of automation

Just ask a data engineer: more automation is generally positive.

Automation reduces manual work, scales up repetitive processes, and makes large-scale systems more fault-tolerant. When it comes to improving data quality, there are many opportunities for automation to fill the gaps where testing, cataloging, and other more manual processes fail.

We foresee that in the coming years, automation will increasingly be applied in several areas of data engineering that impact data quality and governance:

  • Hardcoded data pipelines: Automated ingestion solutions make it easy – and fast – to ingest data and send it to your warehouse or lake for storage and processing. In our opinion, there’s no reason engineers should spend their time hand-coding pipelines to move raw data from a CSV file into your data warehouse.
  • Unit tests and orchestration checks: Unit testing is a classic scaling problem, and most organizations can’t possibly cover all of their pipelines end to end – or even have a test ready for every possible way data can go bad. One company had critical pipelines feeding some of its strategic customers directly. It monitored data quality closely and ran more than 90 rules on each pipeline. Something broke and suddenly 500,000 rows were missing – all without triggering any of those tests. In the future, we expect teams to rely on more automated mechanisms for testing their data and for orchestrating circuit breakers on broken pipelines (see the sketch after this list).
  • Root cause analysis: Often when data breaks, the first step many teams take is to ping the most knowledgeable data engineer in the organization and hope they’ve seen this type of problem before. The second step is to manually check thousands of tables. Both are painful. We hope for a future where data teams can automatically perform root cause analysis as part of the data reliability workflow, with a data observability platform or other type of DataOps tooling.
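
As a rough illustration of the circuit-breaker idea mentioned above, here is a minimal Apache Airflow sketch using the ShortCircuitOperator. The table name, row-count threshold and fetch_row_count helper are hypothetical stand-ins for a real warehouse query; the point is that when the volume check fails, downstream publishing tasks are skipped automatically instead of shipping bad data.

```python
# Minimal sketch of a pipeline circuit breaker with Apache Airflow.
# The threshold and fetch_row_count helper are illustrative stand-ins
# for a real freshness/volume check against your warehouse.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator, ShortCircuitOperator


def fetch_row_count(table: str) -> int:
    # Stub standing in for a warehouse query (assumption, not a real API).
    return 500_000


def orders_volume_looks_healthy() -> bool:
    # If this returns False, the ShortCircuitOperator skips all downstream tasks.
    return fetch_row_count("analytics.orders") > 400_000  # assumed daily baseline


def publish_to_dashboards() -> None:
    print("Refreshing BI dashboards with the latest orders data...")


with DAG(
    dag_id="orders_pipeline_with_circuit_breaker",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    circuit_breaker = ShortCircuitOperator(
        task_id="check_orders_volume",
        python_callable=orders_volume_looks_healthy,
    )
    publish = PythonOperator(
        task_id="publish_to_dashboards",
        python_callable=publish_to_dashboards,
    )
    # Publishing only runs if the volume check passes.
    circuit_breaker >> publish
```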

While this list only scratches the surface of areas where automation can benefit our quest for better data quality, I think it’s a good start.

More distributed environments and the rise of data domains

Distributed data paradigms such as the data mesh make it easier and more accessible for functional groups across the enterprise to use data for specific use cases. The potential of domain-based ownership applied to data management is great (faster data access, more democratization of data, better informed stakeholders), but so are the potential complications.

Data teams need look no further than the microservice architecture to get a taste of what’s to come after the data mesh mania calms down and teams get serious about their implementations. Such distributed approaches require more discipline at both technical and cultural levels when it comes to enforcing data governance.

In general, siloing technical components can increase data quality issues. For example, a schema change in one domain could trigger a data fire drill in another part of the company, or duplicating a critical table that is regularly updated or backfilled for one part of the company could cause pandemonium if it is used by another part. Without proactively raising awareness and creating context about how to work with the data, scaling the data mesh approach can be challenging.
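
As a rough sketch of what that discipline can look like in practice, the snippet below shows a lightweight schema check a consuming domain might run before using a table owned by another domain. The column names and types are invented for the example; in a mature setup a data contract or observability tool would perform this comparison automatically.

```python
# Minimal sketch: detecting a breaking schema change on a cross-domain table.
# The expected schema and the "observed" schema below are invented for illustration.
EXPECTED_SCHEMA = {
    "customer_id": "INTEGER",
    "signup_date": "DATE",
    "lifetime_value": "NUMERIC",
}


def detect_schema_drift(observed_schema: dict) -> list:
    """Return human-readable descriptions of breaking changes, if any."""
    issues = []
    for column, expected_type in EXPECTED_SCHEMA.items():
        if column not in observed_schema:
            issues.append(f"missing column: {column}")
        elif observed_schema[column] != expected_type:
            issues.append(
                f"type change on {column}: {expected_type} -> {observed_schema[column]}"
            )
    return issues


# Example: the owning domain renamed lifetime_value to ltv without warning.
observed = {"customer_id": "INTEGER", "signup_date": "DATE", "ltv": "NUMERIC"}
for issue in detect_schema_drift(observed):
    print("ALERT:", issue)  # in practice, notify the owning domain instead
```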

So, where do we go from here?

I predict that achieving data quality will become both easier and more difficult for organizations across industries in the coming years, and it’s up to data leaders to help their organizations meet these challenges while driving their business strategies forward.

Increasingly complicated systems and larger amounts of data create new challenges; innovations and improvements in data engineering technologies mean greater automation and an improved ability to “cover our bases” when it comes to preventing broken pipelines and products. But however you break it down, striving for some measure of data reliability will become an important priority for even the most novice data teams.

I expect that data leaders will begin to measure data quality as a vector of data maturity (if they haven’t already), and work towards building more reliable systems in the process.

Until then, we wish you no data downtime.

Barr Moses is the CEO and co-founder of Monte Carlo.
