The billion-dollar potential of synthetic data

Synthetic data is poised to become a major industry within five to ten years. Gartner, for example, estimates that by 2024, 60% of the data used for AI applications will be synthetically generated. This type of data, and the tools used to create it, hold significant untapped investment potential. Here is why.

Synthetic data could feed data-hungry AI/ML

We are effectively on the brink of a revolution in how machine learning (ML) and artificial intelligence (AI) can grow and have even more applications across different sectors and industries.

We live in an era of skyrocketing demand for ML algorithms in every aspect of our lives, from playful face-masking applications such as Instagram or Snapchat filters to highly useful applications designed to improve how we work and live, such as diagnosing diseases or recommending treatments. Key opportunities include emotion and engagement recognition, better homeland security features, and better anomaly detection in industrial settings.

At the same time, people and businesses are hungry for ML/AI-based products, while the algorithms themselves are hungry for data to train on. All of that means we will inevitably see more and more diverse data needs, and fully fabricated, or synthetic, data is key to meeting them.

From Grand Theft Auto to Google

Heard about self-driving cars learning traffic rules by studying virtual traffic in games like Grand Theft Auto V? That was an early use of ML trained on synthetic data. Likewise, many in tech may have come across synthetic “scanned documents,” which have been used to train text recognition and data extraction models.

Banking and finance is an industry that already relies heavily on synthetic data for certain processes, while tech giants like Google and Facebook are also using it, attracted by the extraordinary efficiency it can bring to the work of project managers and data scientists.

In fact, we expect the number of synthetic images and data points to increase tenfold in the coming year, and several hundredfold in the years that follow.

Limitations of real-world data

Those at the forefront of ML are increasingly turning to synthetic data to get around the numerous limitations of original, real-world data. For example, the company Synthesis AI provides a cloud-based generation platform that delivers millions of perfectly labeled and diverse images of artificial people. In doing so, it has overcome many challenges associated with the messy realities of original data. For starters, the company makes the data cheaper: it can be prohibitively expensive for an organization to collect the required amount and diversity of real data.

For example, could you take pictures of someone from every angle imaginable, wearing every possible combination of clothes in all possible lighting conditions? It would be an incredible amount of work to do that in real life, but synthetic data can be designed to account for endless variations.

That also makes labeling far easier. Imagine trying to determine the light source, its brightness, and the distance to an object in real photos in order to train a shadow-estimation algorithm: it would be next to impossible. With synthetic data, those labels exist by default, because the images are generated from those very parameters.
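To make the "labels for free" idea concrete, here is a minimal, hypothetical sketch in Python. The class and function names are illustrative (not from any real synthetic-data product): it simply enumerates every combination of scene parameters, and shows that the ground-truth annotation for each rendered image would be the generation parameters themselves.

```python
import itertools
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class SceneParams:
    light_angle_deg: float
    brightness: float
    distance_m: float

def generate_scenes(angles, brightnesses, distances):
    """Enumerate every combination of scene parameters.

    In a real pipeline each SceneParams would drive a renderer;
    here we just return the parameter sets."""
    return [SceneParams(a, b, d)
            for a, b, d in itertools.product(angles, brightnesses, distances)]

def labeled_record(scene: SceneParams) -> dict:
    # The ground-truth annotation comes for free: it is the
    # generation parameters themselves, with no human labeling step.
    return {"image_id": hash(scene), "labels": asdict(scene)}

scenes = generate_scenes(angles=[0, 45, 90],
                         brightnesses=[0.5, 1.0],
                         distances=[1.0, 2.0])
print(len(scenes))  # 3 * 2 * 2 = 12 parameter combinations
```

The same pattern scales to any number of variation axes (clothing, pose, camera angle), which is why synthetic pipelines can cover combinations that would be impractical to photograph.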

In addition, companies also face severe restrictions on the use of real-world data. In the past, companies have shared data without the layers of cybersecurity that are now expected. GDPR and other data regulations make it complex and challenging, and sometimes illegal, for companies to share real-world data with partners and suppliers.

In other cases, it may not even be possible or safe to generate the data. The real-time 3D engine producer Unigine counts among its customers Daedalean, which is working on urban air mobility. Daedalean has started training its autonomous flying cars in Unigine's virtual worlds. This makes perfect sense: it doesn't yet have a safe, real-world environment in which to extensively test its products and generate the deep datasets it needs. A similar case is the CarMaker software by IPG Automotive, whose 10.0 release introduced enhanced 3D visualization powered by UNIGINE 2 Sim, with physically based rendering and real-world camera parameters.

Synthetic people and synthetic objects have been used much more frequently by tech giants lately. Amazon has used synthetic data to train Alexa, Facebook acquired the synthetic data generator AI.Reverie, and Nvidia released NVIDIA Omniverse Replicator, a powerful synthetic data generation engine that produces physically simulated synthetic data for training deep neural networks.

Fighting bias in data

The challenges of real-world data don’t stop there. In some areas, huge historical biases pollute data sets. This is how we end up with global tech giants getting into hot water because their algorithms don’t properly recognize black faces. Even now, with ML technology experts well aware of the bias problem, it can be challenging to collect a real-world dataset that is completely free of bias.
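One way synthetic generation can counter such imbalance is by topping up under-represented groups until the dataset is balanced. The sketch below is a hypothetical, simplified illustration (in practice, `synthesize` would call a generative model rather than a stub; all names here are invented for the example):

```python
from collections import Counter

def balance_with_synthetic(samples, group_key, synthesize):
    """Top up under-represented groups with synthetic samples.

    samples    : list of dicts, each tagged with a demographic group
    group_key  : dict key identifying the group
    synthesize : callable producing a new synthetic sample for a group
                 (stands in for a real generative model)"""
    counts = Counter(s[group_key] for s in samples)
    target = max(counts.values())  # match the largest group
    out = list(samples)
    for group, n in counts.items():
        out.extend(synthesize(group) for _ in range(target - n))
    return out

# A toy dataset where group "B" is badly under-represented.
data = [{"group": "A"}] * 8 + [{"group": "B"}] * 2
balanced = balance_with_synthetic(
    data, "group", lambda g: {"group": g, "synthetic": True})
print(len(balanced))  # 16: group B topped up from 2 to 8 samples
```

The hard part in reality is that the synthesized samples must be both realistic and diverse, which is exactly what the generation platforms described above compete on.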

Even if a real-world dataset can solve all of the above challenges, which in reality is hard to imagine, data models must be constantly improved and adapted to remain unbiased and avoid degradation over time. That means a constant need for fresh data.

Understanding the opportunity

Synthetic data is still in the relatively early stages of growth, and it is not a panacea for every use case. It faces technical challenges and limitations, and its tooling and standards have yet to mature.

Nevertheless, synthetic data is definitely an accelerator for ML/AI-based products as they continue to expand into every industry and sector, and we are sure to see many new companies and deals in the area. For anyone wanting to dive deeper into the topic, the Open Synthetic Data Community is a hub for synthetic datasets, papers, code, and the people pioneering their use in machine learning.

Sergey Toporov is a partner at Leta Capital.

DataDecisionMakers

Welcome to the VentureBeat Community!

DataDecisionMakers is where experts, including the technical people who do data work, can share data-related insights and innovation.

If you want to read about the latest ideas and up-to-date information, best practices and the future of data and data technology, join us at DataDecisionMakers.

You might even consider contributing an article yourself!

Read more from DataDecisionMakers