The field of artificial intelligence has advanced dramatically over the past decade. The massive volumes of data collected by consumer internet providers have been utilized to develop robust machine learning algorithms. Many open-source and proprietary machine learning algorithms are now accessible for use in a variety of commercial settings.
Andrew Ng, SM ’98, a pioneer in artificial intelligence, founder of the Google Brain research lab and Coursera, and former chief scientist at Baidu, believes the time has come to pay more attention to the data that powers these systems.
At the recent EmTech Digital conference presented by MIT Technology Review, Ng said that “all that progress in algorithms means it’s actually time to spend more time on the data” for artificial intelligence systems.
According to Ng, a focus on high-quality, consistently labeled data could unlock the potential of AI in fields like healthcare, government technology, and manufacturing.
“I don’t see widespread adoption of AI anywhere I look, whether it be in the healthcare system or in a manufacturing organization,” Ng said. Ng, who is also the founder and CEO of Landing AI, attributes this to a haphazard approach to data engineering that often depends on the good fortune or expertise of individual data scientists.
According to Ng, the concept of data-centric AI is catching on; a data-centric AI workshop he organized in December 2021 was one place where it was debated further. He highlighted three issues he typically encounters while working with data:
Differences in labeling. In industries like manufacturing and pharmaceuticals, AI algorithms are being trained to detect flaws in products. However, even well-trained people can disagree on whether a pill is “chipped” or “scratched,” which confuses the AI system. Similarly, hospitals have no standard way to code their electronic medical records. This presents a challenge because AI systems learn best from consistent input.
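One way to surface this kind of labeling disagreement before training is to measure inter-annotator agreement. A minimal sketch in Python (the annotators and defect labels below are illustrative, not from Ng's examples):

```python
# Sketch: measure how often two annotators agree on defect labels.
# The label lists below are illustrative sample data.

annotator_a = ["chipped", "scratched", "chipped", "ok", "scratched", "chipped"]
annotator_b = ["scratched", "scratched", "chipped", "ok", "chipped", "chipped"]

# Fraction of items where both annotators chose the same label.
agreements = sum(a == b for a, b in zip(annotator_a, annotator_b))
agreement_rate = agreements / len(annotator_a)
print(f"Agreement rate: {agreement_rate:.0%}")  # → Agreement rate: 67%

# Items the annotators disagree on are candidates for a labeling-standard review.
disputed = [i for i, (a, b) in enumerate(zip(annotator_a, annotator_b)) if a != b]
print("Disputed items:", disputed)  # → Disputed items: [0, 4]
```

Items flagged as disputed can then be resolved by agreeing on a single labeling convention, which is exactly the kind of early standardization Ng advocates.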
An emphasis on big data. The idea that more data is always better is widely held. However, Ng argues that smaller volumes of high-quality data can be sufficient for many applications, such as manufacturing and healthcare. There may be few X-rays of a particular medical condition because few patients have it, or a factory may have produced only 50 defective cell phones.
“Being able to get things to work with small data, with good data, rather than just a giant dataset,” Ng said, “would be key to making these algorithms work” in industries that don’t have access to massive amounts of data.
Ad hoc data curation. Messy, inaccurate data is a common occurrence, and for decades people have tried to identify and resolve issues on their own. “It’s often been the cleverness of an individual’s skill, or luck with an individual engineer, that determines whether it gets done well,” Ng said. Making this “more systematic through principles and [the use of] tools will help a lot of teams build more AI systems.”
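The systematic approach Ng describes can start as simply as codifying cleaning rules into explicit, reusable checks rather than relying on one engineer's judgment. A hedged sketch, with hypothetical field names and rules:

```python
# Sketch: replace ad hoc cleaning with a small set of explicit, reusable checks.
# Field names, allowed labels, and records are illustrative assumptions.

def validate_record(record):
    """Return a list of data-quality issues found in one record."""
    issues = []
    if record.get("label") not in {"ok", "defect"}:
        issues.append("unknown label")
    if record.get("image_width", 0) <= 0:
        issues.append("missing or invalid image size")
    return issues

dataset = [
    {"label": "ok", "image_width": 224},
    {"label": "cracked", "image_width": 224},  # label outside the agreed set
    {"label": "defect", "image_width": 0},     # unusable image
]

for i, rec in enumerate(dataset):
    for issue in validate_record(rec):
        print(f"record {i}: {issue}")
```

Because the rules live in code rather than in one person's head, any engineer on the team can run, review, and extend them.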
Opening up the potential of AI
Some of these issues arise naturally from variation between businesses. No single AI system will work for everyone, Ng said, because organizations code data in different ways and factories produce different products.
Ng said that many industries can’t follow the recipe for AI adoption used by consumer software internet businesses, due to limited data sets and the need for customization.
Ng argued that every hospital and health system needs “a custom AI system trained on their data.” The same goes for manufacturing: visual defect inspection differs from one manufacturer to the next, so every factory may need its own AI model trained on its own images.
However, research and development efforts have thus far concentrated on general-purpose AI systems that can generate enormous economic benefits.
There are “tens of thousands” of projects with budgets between $1 million and $5 million, but “no one is really able to execute them successfully,” as Ng put it. A company in that position might say, “I can’t afford to hire 10,000 machine learning engineers to create 10,000 separate machine learning systems.”
Ng said that data-centric AI is an important part of the solution, since it can give people the tools to engineer data and build tailor-made AI systems. That, he said, is “the only recipe I know of” that could unlock much of AI’s potential elsewhere.
How data-centric AI can help
Data-centric AI is still in the “ideas and principles” stage, but Ng believes the keys will be tools and education, such as:
Tools for spotting inconsistencies. Tools could zero in on a specific “slice” of problematic data, allowing developers to fix inconsistencies there. Reasonable people may label things differently, but Ng argues that disagreements can be reduced if they are identified early and a standardized approach to labeling is adopted.
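A slice-finding tool of the kind Ng describes could, in its simplest form, rank subsets of the data by label disagreement so that relabeling effort goes where it matters most. A sketch under assumed record and field names:

```python
# Sketch: find the data "slice" where annotator labels are most inconsistent.
# Records, slice key ("defect_type"), and labels are illustrative assumptions.
from collections import defaultdict

records = [
    {"defect_type": "chip", "label_a": "defect", "label_b": "defect"},
    {"defect_type": "chip", "label_a": "defect", "label_b": "ok"},
    {"defect_type": "scratch", "label_a": "ok", "label_b": "ok"},
    {"defect_type": "scratch", "label_a": "defect", "label_b": "ok"},
    {"defect_type": "scratch", "label_a": "ok", "label_b": "defect"},
]

disagreement = defaultdict(lambda: [0, 0])  # slice -> [disagreements, total]
for r in records:
    stats = disagreement[r["defect_type"]]
    stats[0] += r["label_a"] != r["label_b"]
    stats[1] += 1

# Rank slices by disagreement rate to direct relabeling effort.
for slice_name, (bad, total) in sorted(
    disagreement.items(), key=lambda kv: kv[1][0] / kv[1][1], reverse=True
):
    print(f"{slice_name}: {bad}/{total} labels disputed")
```

Here the "scratch" slice surfaces first (2 of 3 labels disputed), signaling that the team's labeling standard for scratches needs clarification before any model is retrained.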
Empowering domain experts. Highly specialized fields call for expert help. Cell biologists, for instance, know cells far better than data engineers do, so they should be enlisted to help train AI systems to recognize cellular features. This lets more “domain experts” “express their knowledge through the form of data,” as Ng put it.
Ng noted that while a shift toward standardization is important, physical infrastructure can be a limiting factor. There is no practical way to ensure that every hospital uses X-ray machines of the same age, and even a seven-year-old machine produces different images than a brand-new one. A factory that makes auto parts differs from one that makes candy in similar ways.
He further explained that this baseline variability in the data is caused by the underlying physical environment, which cannot be changed. “Tailor-made AI systems are required for these varied datasets,” he said.