AWS just announced Amazon SageMaker Ground Truth to help companies create training data sets for machine learning. This is a powerful new service for folks who have access to lots of data that hasn’t been consistently annotated. In the past, humans would have to label a massive corpus of images or frames within video to train a computer vision model. Ground Truth combines machine learning with human labelers to automatically annotate training data sets.

This is one example of a theme that has emerged over the past year or so: machine learning for machine learning. Machine-learning data catalogs (MLDCs), probabilistic or fuzzy matching, automated training data annotation, and synthetic data creation all use machine learning to produce or prepare data for other machine learning downstream, often solving problems of data scarcity or dispersion. This is all well and good until we remember that machine learning itself relies on inductive reasoning and is therefore probabilistic.

Let’s consider how this may play out in the real world: A healthcare provider would like to use computer vision to diagnose a rare disease. Because data is sparse, an automated annotator is used to create more training data (more labeled images). The developer sets a 90% propensity threshold, meaning only records with a 90% probability of being accurately classified will be used as training data. Once the model is trained and deployed, it is used on patients whose data is linked together from multiple databases using fuzzy matching on text fields; entities from disparate data sets with a 90% chance of being the same are matched. Finally, the model flags images with a 90% or greater likelihood of depicting the disease for diagnosis.
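To make those three thresholds concrete, here is a minimal Python sketch of how a 90% cutoff might gate each probabilistic stage. The record structures, scores, and field names are all hypothetical, purely for illustration:

```python
THRESHOLD = 0.90  # the 90% cutoff applied at each probabilistic stage

# Stage 1: automated annotation -- keep only auto-labeled images whose
# predicted label carries at least 90% confidence.
auto_labeled = [
    {"image_id": "img-001", "label": "disease", "confidence": 0.97},
    {"image_id": "img-002", "label": "healthy", "confidence": 0.82},
]
training_data = [r for r in auto_labeled if r["confidence"] >= THRESHOLD]

# Stage 2: fuzzy matching -- link patient records across databases only
# when the match score is at least 90%.
candidate_matches = [
    {"record_a": "db1:123", "record_b": "db2:456", "match_score": 0.93},
    {"record_a": "db1:124", "record_b": "db2:789", "match_score": 0.71},
]
linked = [m for m in candidate_matches if m["match_score"] >= THRESHOLD]

# Stage 3: inference -- flag patients the trained model scores at 90%+
# likelihood of depicting the disease.
predictions = [
    {"patient": "p-01", "disease_probability": 0.94},
    {"patient": "p-02", "disease_probability": 0.55},
]
flagged = [p for p in predictions if p["disease_probability"] >= THRESHOLD]

print(f"{len(training_data)} training records, "
      f"{len(linked)} linked pairs, {len(flagged)} flagged patients")
```

Each stage, viewed in isolation, looks conservative. The trouble starts when we read the last stage’s 90% as if it were the confidence of the whole chain.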

The problem is that, traditionally, data scientists and machine-learning experts focus only on that final propensity score as a representation of the overall accuracy of the prediction. That has worked well in a world where the data preparation leading up to training was deductive and deterministic. But when you layer probabilities on top of probabilities, the final propensity score no longer tells the whole story. In the case above, if we treat the three stages as independent, the probability of an accurate diagnosis drops from 90% to roughly 73% (90% × 90% × 90% = 72.9%), which is not ideal in a life-and-death situation.
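As a back-of-the-envelope check (and it is only that, since it assumes the stages fail independently), the compounding is just the product of the stage confidences:

```python
import math

# Confidence at each probabilistic stage of the hypothetical pipeline:
# automated annotation, fuzzy record matching, and model inference.
stage_confidences = [0.90, 0.90, 0.90]

# If the stages fail independently, the chance that every stage is
# correct is the product of the individual probabilities.
compound = math.prod(stage_confidences)

print(f"Final propensity score: {stage_confidences[-1]:.0%}")
print(f"Compound confidence:    {compound:.1%}")  # 72.9%, not 90%
```

In practice the stages are unlikely to be perfectly independent, so the true compound figure could land higher or lower; the point is simply that the last score in the chain overstates how much we actually know.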

As the emphasis on explainability in AI grows, we need a new framework for analytics governance, one that accounts for every probability introduced across the machine-learning process, from data creation to data prep to training to inference. Without it, erroneously inflated propensity scores will misdiagnose patients, mistreat customers, and mislead businesses and governments as they make critical decisions.

Next week, my colleague Kjell Carlsson is doing a deep-dive session titled “Drive Business Value Today: A Practical Approach To AI” at Forrester’s inaugural Data Strategy & Insights Forum in Orlando. Please join us next Tuesday and Wednesday, December 4 and 5, to discuss this topic and to learn best practices for turning data into insights into actions driving measurable business results.