Labeling data is one of the most fundamental aspects of machine learning. It’s also often an area that organizations struggle with, both to accurately categorize data and to reduce potential bias.
With data labeling, a data set used to train a machine learning model is first analyzed and given labels that provide a category and a definition of what the data is actually about. While data labeling is a critical part of the machine learning process, multiple recent studies have shown it to be highly inconsistent. The need for accurate data labeling has led to a bustling market of data labeling vendors.
One of the most popular data labeling technologies is the open-source Label Studio, which is backed by San Francisco-based startup Heartex. The Label Studio 1.6 update, released today, gives users new features to better analyze and label data in videos.
According to Michael Malyuk, co-founder and CEO of Heartex, the challenge most companies face with artificial intelligence (AI) is getting good data to work with.
“We view labeling as a broader category of dataset development and Label Studio is a solution that ultimately allows you to do any kind of dataset development,” said Malyuk.
Defining categories for data labels is challenging
While a video player is the main new feature in Label Studio 1.6, Malyuk stressed that the technology is useful for any type of data, including text, audio, time series, and video.
One of the biggest problems with labeling all types of data is defining the categories used for data labels.
“Some people may name things one way, some people may name things another way, but essentially they mean the same thing,” Malyuk said.
He explained that Label Studio provides taxonomies for labels that users can choose from to describe a piece of data, be it a text, audio, or image file. If two or more people in the same organization label the same data differently, the Label Studio system will identify the conflict so that it can be analyzed and resolved. Label Studio offers both a manual conflict resolution system and an automated approach.
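The conflict detection described above can be illustrated with a small sketch. This is not Label Studio's API or its actual resolution logic; the function names, the annotator names, and the majority-vote rule are assumptions chosen for illustration. Ties are returned as unresolved so that a person can review them manually.

```python
from collections import Counter

def find_conflicts(annotations):
    """Return the items whose annotators did not all choose the same label.

    annotations maps item id -> {annotator name: label} (hypothetical layout).
    """
    return {
        item_id: labels
        for item_id, labels in annotations.items()
        if len(set(labels.values())) > 1
    }

def resolve_by_majority(labels):
    """Pick the most common label, or None on a tie (left for manual review)."""
    counts = Counter(labels.values()).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]

# Made-up example data: three annotators labeling three video clips.
annotations = {
    "clip_001": {"ana": "car", "ben": "car", "chen": "truck"},
    "clip_002": {"ana": "bike", "ben": "bike", "chen": "bike"},
    "clip_003": {"ana": "car", "ben": "truck"},
}

conflicts = find_conflicts(annotations)
resolved = {
    item_id: resolve_by_majority(labels)
    for item_id, labels in conflicts.items()
}
print(resolved)  # {'clip_001': 'car', 'clip_003': None}
```

Here clip_001 is settled automatically by a two-to-one vote, while clip_003 is an even split and stays flagged for a human decision.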
Vector Database vs. Data Labels?
The process of labeling data can often involve manual work, where people assign a label or validate that a label is correct.
There are a number of approaches to automating the process. Startup Lightly AI, for example, uses a self-supervised machine learning model that integrates with Label Studio. Then there are vendors that use a vector database to convert data into mathematical representations (vectors), rather than using data labels to identify data and its relationships.
Malyuk said vector databases have their uses and can be effective for performing tasks such as matching. The problem, he said, is that the vector approach is not as effective with unstructured data types such as audio and video. He noted that a vector database can use common object identification types.
“Once you start moving away from that general knowledge to something a little bit different, it gets really complicated without manual labeling,” Malyuk said.
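As a rough illustration of the matching task that vector approaches handle well, the sketch below compares items by the cosine similarity of their vectors and returns the closest one. It does not represent any particular vector database product; the three-dimensional vectors and item names are invented for the example, and real systems use embeddings with hundreds of dimensions produced by a trained model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query, index):
    """Return the name of the stored vector most similar to the query."""
    return max(index, key=lambda name: cosine_similarity(query, index[name]))

# Hypothetical embeddings; in practice a model would produce these.
index = {
    "cat_photo": [0.9, 0.1, 0.0],
    "dog_photo": [0.1, 0.9, 0.0],
    "car_photo": [0.0, 0.1, 0.9],
}
query = [0.8, 0.2, 0.1]

best = nearest(query, index)
print(best)  # cat_photo
```

No label is ever assigned in this flow. The match depends entirely on how well the embedding captures what matters about the data, which is the limitation Malyuk points to for more specialized content.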
How data labels can identify and reduce AI bias
Bias in AI is an ongoing challenge that many in the industry are trying to combat. The foundation of machine learning is the actual data, and the way data is labeled can also potentially lead to bias. Bias can be intentional, and it can also be indirect.
“If you label a very subjective data set in the morning before coffee and then again after coffee, you might get very different answers,” Malyuk said.
While it’s not always possible to ensure that data labeling is performed only by fully caffeinated people, there are processes that can help. Malyuk said that on the software side, Label Studio provides a way to build a process in which everyone contributes labels individually. The system then builds matrices that compare how different people label the same items. It’s an approach that Malyuk says can potentially identify bias for a specific label.
The open-source Label Studio technology is intended for use by individuals and small groups, while the commercial product provides business features for larger teams in terms of security, collaboration, and scalability.
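One way to picture the comparison of annotators is a pairwise agreement table plus a per-label disagreement rate. The sketch below is an assumption about how such a comparison could be computed, not a description of Label Studio's internals; the function names, annotators, and sentiment labels are all hypothetical.

```python
from collections import Counter
from itertools import combinations

def agreement_matrix(annotations):
    """Map each annotator pair to the share of shared items they labeled alike."""
    annotators = sorted({name for labels in annotations.values() for name in labels})
    matrix = {}
    for a, b in combinations(annotators, 2):
        shared = [labels for labels in annotations.values() if a in labels and b in labels]
        if shared:
            agree = sum(1 for labels in shared if labels[a] == labels[b])
            matrix[(a, b)] = agree / len(shared)
    return matrix

def per_label_disagreement(annotations):
    """For each label, the share of items using it where annotators disagreed."""
    used, contested = Counter(), Counter()
    for labels in annotations.values():
        distinct = set(labels.values())
        for label in distinct:
            used[label] += 1
            if len(distinct) > 1:
                contested[label] += 1
    return {label: contested[label] / used[label] for label in used}

# Made-up sentiment labels from three annotators on four texts.
annotations = {
    "t1": {"ana": "positive", "ben": "positive", "chen": "positive"},
    "t2": {"ana": "neutral", "ben": "negative", "chen": "neutral"},
    "t3": {"ana": "negative", "ben": "negative", "chen": "negative"},
    "t4": {"ana": "neutral", "ben": "positive", "chen": "neutral"},
}

matrix = agreement_matrix(annotations)
rates = per_label_disagreement(annotations)
print(matrix[("ana", "ben")], matrix[("ana", "chen")])  # 0.5 1.0
print(rates["neutral"])  # 1.0
```

In this toy data, every item that anyone marked "neutral" was contested, and one annotator agrees with the others only half the time. A pattern like that would prompt a closer look at either the label definition or that annotator's interpretation of it.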
“With open source we focus on the user and we try to make the life of the individual user as easy as possible from a labeling perspective,” said Malyuk. “With the company, we focus on the organization and whatever the business needs are.”
VentureBeat’s mission is to be a digital city square where tech decision makers can learn about transformative business technology and transact.
Janice has been with businesskinda for 5 years, writing copy for client websites, blog posts, EDMs and other mediums to engage readers and encourage action. By collaborating with clients, our SEO manager and the wider businesskinda team, Janice seeks to understand an audience before creating memorable, persuasive copy.