DefinedCrowd’s next-gen platform solves the AI data acquisition problem

With all the hype surrounding artificial intelligence, you would be forgiven for thinking that developing the algorithms powering deep learning is where the industry’s toughest challenges lie. The actual challenge for most algorithms, though, is not their mathematics but their inputs: collating high-quality, well-labeled data that allows these models to be trained as quickly and efficiently as possible.

That’s where DefinedCrowd comes in. The company, which is based in Seattle and Portugal, was founded in 2015 by Daniela Braga, a data scientist and natural language processing expert, and Amy Du, who has since moved on from the company to start a global entrepreneurship network.

We covered the company back when it participated in Microsoft’s startup accelerator, and again when it was featured in the Battlefield at TechCrunch Disrupt New York this past year.

Today, the company is publicly unveiling its next-generation SaaS platform for data scientists. Through either a UI or an API, users can search for and select appropriate datasets for their applications. The company focuses on three horizontal areas: voice recognition, natural language processing, and computational imagery.

“Our value proposition is around quality of data, speed and scale,” Braga explained. “There is a challenge that if you use bad data, then you get garbage out of the [AI] model. We are solving these pain points, which were my own pain points when I started my career as a data scientist.”

For Braga, the mission is personal. She has spent almost 17 years doing data analysis in the corporate world, only to keep running up against the challenge of finding the data she needed to tune her models. “I had all the money in the world to gather data back in those days, but I still couldn’t spend money to get data because the scale I needed just wasn’t available in the market,” she explained. “You try to do this in-house, but that doesn’t scale well.”

Over the past two years, DefinedCrowd has built out an engine for churning out high-quality data at scale. The company started by constructing what is essentially a crowdsourced Mechanical Turk for labeling words and sentences in multiple languages. Today, more than 20,000 people on the platform are working on building better datasets, and experts in that network span 46 different languages.

After collecting data from the market, DefinedCrowd’s goal is to help users with common AI tasks find exactly the data they need. As part of the SaaS offering, the product offers workflows and templates to get users moving quickly. Users can also take existing datasets off the shelf to move rapidly toward training their AI models.

The key value of the platform, though, is customization. Using the workers on its platform, DefinedCrowd can augment existing datasets with new, specialized data that helps tune models for specific applications.

Take, for instance, building a chatbot for the airline industry. While there are general solutions available on the market that can interpret text or voice, it is hard to create a customized dataset for the specific needs of the travel industry while also supporting multiple languages. With DefinedCrowd, an airline could generate multiple customized datasets for the product and feed them into its NLP model, increasing the product’s effectiveness.

The company has developed special pools of workers who can work in specialized and highly regulated spaces like finance and healthcare, areas where translations have to be precise.

The company straddles the Seattle and Portugal tech ecosystems, and continues to grow, with 10 employees in Seattle and another 20 employees in Portugal. The platform is available in monthly and annual plans, with pricing depending on the customer.