Solve the problem of unstructured data with machine learning


We are in the midst of a data revolution. Over the next five years, twice as much digital data will be created as has been produced in total so far, and unstructured data will define this new era of digital experiences.

Unstructured data — information that does not conform to conventional models or does not fit into structured database formats — represents more than 80% of all new company data. To prepare for this shift, companies are finding innovative ways to manage, analyze and maximize the use of data in everything from business analytics to artificial intelligence (AI). But decision-makers also run into an age-old problem: how do you maintain and improve the quality of huge, cumbersome data sets?

The answer is machine learning (ML). Advances in ML technology now enable organizations to efficiently process unstructured data and improve quality assurance efforts. With a data revolution happening all around us, where does your business fall? Are you burdened with valuable but unwieldy data sets, or are you using data to propel your business forward?

Unstructured data takes more than copy and paste

The value of accurate, timely and consistent data for modern enterprises is undisputed: it’s as essential as cloud computing and digital apps. Despite this, poor data quality still costs businesses an average of $13 million a year.

To navigate data problems, you can apply statistical methods to measure data shapes, enabling your data teams to track variability, remove outliers, and rein in data drift. Metrics-based controls remain valuable for assessing data quality and determining how and when to rely on datasets before making critical decisions. Although effective, this statistical approach is generally reserved for structured datasets, which lend themselves to objective, quantitative measurements.
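As a rough illustration of what such statistical controls can look like for a structured numeric column, here is a minimal sketch using only Python's standard library. The function names, the 1.5 × IQR outlier rule and the "mean shift in baseline standard deviations" drift measure are assumptions chosen for illustration, not a method prescribed by this article.

```python
import statistics


def remove_outliers_iqr(values, k=1.5):
    """Drop values outside [Q1 - k*IQR, Q3 + k*IQR].

    Assumes `values` holds at least two numbers (required by
    statistics.quantiles). The factor k=1.5 is a common convention.
    """
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


def drift_score(baseline, current):
    """Shift of the current mean from the baseline mean,
    expressed in baseline standard deviations (illustrative metric)."""
    base_std = statistics.stdev(baseline)
    shift = abs(statistics.mean(current) - statistics.mean(baseline))
    if base_std == 0:
        # A constant baseline: any change at all counts as unbounded drift.
        return 0.0 if shift == 0 else float("inf")
    return shift / base_std
```

A data team might, for example, clean a reference batch with `remove_outliers_iqr` and then alert when `drift_score` for a new batch exceeds a threshold they have agreed on. The threshold itself is a judgment call that depends on the data.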

But what about data that doesn’t fit neatly into Microsoft Excel or Google Sheets, including:

  • Internet of things (IoT): sensor data, ticker data and log data
  • Multimedia: photos, audio and videos
  • Rich media: geospatial data, satellite imagery, weather data and surveillance data
  • Documents: word processing documents, spreadsheets, presentations, emails and communication data

When this kind of unstructured data is in play, incomplete or inaccurate information can easily slip into models. When errors go undetected, data problems pile up and wreak havoc on everything from quarterly reports to forecasting projections. Simply copying an approach designed for structured data and pasting it onto unstructured data isn’t enough, and it can actually make things much worse for your business.

The common saying, “garbage in, garbage out,” is highly applicable to unstructured data sets. Maybe it’s time to scrap your current data approach.

The dos and don’ts of applying ML to data quality assurance

When considering solutions for unstructured data, ML should be at the top of your list. That’s because ML can analyze huge data sets and quickly find patterns among the clutter – and with the right training, ML models can learn to interpret, organize, and classify unstructured data types in any number of forms.

For example, an ML model can learn to recommend rules for data profiling, cleansing, and standardization, making efforts more efficient and accurate in industries such as healthcare and insurance. Similarly, ML programs can identify and classify text data by subject or sentiment in unstructured feeds, such as those on social media or in email records.
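To make the text classification idea concrete, the following is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing, written with only the standard library. The class name, the whitespace tokenizer and the sentiment labels are hypothetical choices for illustration. A production system would more likely rely on an established ML library and far more training data.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Very simple tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.label_counts = Counter(labels)      # documents per label
        self.word_counts = defaultdict(Counter)  # word frequencies per label
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.label_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                count = self.word_counts[label][word] + 1
                score += math.log(count / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Trained on a handful of labeled messages, a model like this can assign a sentiment or subject label to new text. How well it does so depends entirely on how representative the training examples are.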

As you improve your data quality efforts through ML, keep in mind some key dos and don’ts:

  • Do automate: Manual data operations such as data decoupling and correction are tedious and time-consuming. They’re also increasingly obsolete, given that today’s automation capabilities can take on mundane, routine operations and free up your data team to focus on more important, more productive efforts. Include automation as part of your data pipeline, but make sure you have standardized operating procedures and governance models in place to encourage streamlined and predictable processes around automated operations.
  • Don’t ignore human oversight: Data, whether structured or unstructured, is intricate and always requires a level of expertise and context that only humans can provide. While ML and other digital solutions certainly help your data team, don’t rely on technology alone. Instead, empower your team to leverage technology while regularly monitoring individual data processes. This balance catches data errors that get past your technological measures. From there, you can retrain your models based on those discrepancies.
  • Do detect root causes: When anomalies or other data errors pop up, they are often not isolated events. Ignoring deeper issues in data collection and analysis puts your business at risk of pervasive quality problems across your entire data pipeline. Even the best ML programs cannot resolve errors generated upstream. Again, selective human intervention supports your overall data processes and prevents major errors.
  • Don’t assume quality: To analyze data quality over the long term, you need to find a way to qualitatively measure unstructured data instead of making assumptions about data shapes. You can create and test “what-if” scenarios to develop your own measurement approach, intended results and parameters. Running experiments on your data provides a definitive way to calculate its quality and performance, and you can automate the measurement of your data quality itself. This step ensures that quality controls are always active and act as a fundamental feature of your data ingestion pipeline, never an afterthought.
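The points above about automation, human oversight and always-on quality controls can be sketched as a small quality gate at the ingestion step. This is only an illustration under stated assumptions: the `quality_gate` function, the `GateResult` type, the "all required fields are non-empty" completeness check and the 0.9 default threshold are all hypothetical, and a real pipeline would define its own measurements.

```python
from dataclasses import dataclass


@dataclass
class GateResult:
    accepted: list       # records that passed the automated check
    needs_review: list   # records routed to a human reviewer
    pass_rate: float     # share of records that passed


def is_complete(record, required_fields):
    """True if every required field is present and non-blank."""
    for name in required_fields:
        value = record.get(name)
        if value is None or str(value).strip() == "":
            return False
    return True


def quality_gate(records, required_fields, min_pass_rate=0.9):
    """Measure quality on every batch; block the batch if it is too low."""
    accepted = [r for r in records if is_complete(r, required_fields)]
    needs_review = [r for r in records if not is_complete(r, required_fields)]
    pass_rate = len(accepted) / len(records) if records else 1.0
    if pass_rate < min_pass_rate:
        # A low pass rate hints at an upstream problem worth investigating.
        raise ValueError(f"batch rejected: pass rate {pass_rate:.2f}")
    return GateResult(accepted, needs_review, pass_rate)
```

Records in `needs_review` go to a person rather than being silently dropped, and a rejected batch is a prompt to look for the root cause upstream instead of patching individual records.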

Your unstructured data is a treasure trove of new opportunities and insights. But only 18% of organizations are currently taking advantage of their unstructured data – and data quality is one of the main factors holding more companies back.

As unstructured data becomes more prevalent and relevant to day-to-day business decisions and activities, ML-based quality controls provide much-needed assurance that your data is relevant, accurate, and useful. And when you’re not held back by data quality problems, you can focus on using data to drive your business forward.

Just think of the opportunities that arise when you take control of your data – or better yet, let ML do the work for you.

Edgar Honing is senior solution architect at FORWARD.
