Why ML Testing Could Be The Future of Data Science Careers


Developers and testers don't explicitly write a machine learning system's logic—and expanding testing and quality assurance into their own subset of data science careers could be the solution.

This article discusses testing as a distinct career option in data science and machine learning (ML). It gives a brief overview of testing workflows and processes, and it describes the expertise and top-level skills a tester needs in order to test an ML application.

Testing in Data Science: Opportunity for Expansion

There is a significant opportunity to explore and expand the possibilities of testing and quality assurance into the field of data science and machine learning (ML).

Playing around with training data, algorithms and modeling in data science may be a complex yet interesting activity, but testing these applications is no less so.

A considerable amount of time goes into testing and quality assurance activities. Experts and researchers believe 20 to 30% of the overall development time is spent testing the application, and 40 to 50% of a project’s total cost is spent on testing.

Moreover, data science experts and practitioners often complain about having ready-for-production data science models, established evaluation criteria and set templates for report generation, but no teams to help test them. This opens up testing in data science as a full-fledged career option. (Also read: Post-Pandemic Life in the Tech World Looks Pretty Good.)

Testing can be applied to data science in an entirely new context and with a new approach. For such systems, however, testing consumes even more time, effort and money than it does for legacy systems.


To understand this complexity, we first need to understand the mechanics behind machine learning systems.

How Machine Learning Systems Work

In machine learning (ML), humans supply the desired behavior as examples in the training data set during the training phase, and the model optimization process produces the system’s rationale (or logic). (Also read: Debunking the Top 4 Myths About Machine Learning.)

What is missing is a mechanism to find out whether this optimized rationale will produce the desired behavior consistently.

This is where testing comes in.

Workflow for Testing Machine Learning Systems

A flowchart showing how to test machine learning systems. Image by author.

Typically, in machine learning, an evaluation report is automatically produced for a trained model based on established criteria. The report includes:

  • The model’s performance, based on established metrics on the validation dataset. Accuracy and F1 score are common metrics, although many others are used as well.
  • An array of plots, such as precision-recall curves and AUC-ROC curves. This list is, again, not exhaustive.
  • The hyperparameters used to train the model.
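
As a rough illustration of what such a report might contain, the sketch below computes accuracy, precision, recall and F1 for a binary classifier using only the Python standard library. The function name evaluation_report, the labels and the hyperparameter values are all made up for this example; a real project would more likely rely on an established metrics library and would add the plots as well.

```python
from collections import Counter

def evaluation_report(y_true, y_pred, hyperparameters, positive=1):
    """Summarize a binary classifier's predictions on a validation set."""
    pairs = Counter(zip(y_true, y_pred))  # counts of (true label, predicted label)
    tp = pairs[(positive, positive)]
    fp = sum(n for (t, p), n in pairs.items() if t != positive and p == positive)
    fn = sum(n for (t, p), n in pairs.items() if t == positive and p != positive)
    correct = sum(n for (t, p), n in pairs.items() if t == p)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": correct / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "hyperparameters": dict(hyperparameters),
    }

# Made-up validation labels and hyperparameters, for illustration only.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
report = evaluation_report(y_true, y_pred, {"learning_rate": 0.01, "epochs": 5})
print(report)
```

Keeping the hyperparameters in the same dictionary as the metrics makes it easier to tell later which training run produced which numbers.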

Based on the evaluation report, models that improve on the existing model (or baseline) when run on the same dataset are promoted and considered for final inclusion.

When multiple ML models are reviewed, the metrics and plots that summarize model performance over a validation dataset are inspected. Performance is compared across models to make relative judgments, but this comparison alone cannot immediately characterize whether a model behaves adequately.

Let us take an example to understand.

Case Study: A Hypothetical Data Science Project

Consider a project wherein training data is utilized to develop models. The developed models are tested for performance over a validation dataset and evaluation reports are generated based on accuracy as a metric.

Here are the results:

Table 1

Model   Accuracy (%)
1       85
2       80
3       95.4
4       98.8
5       90.15

So, which model is the best? To determine that, we have to look at model behavior—meaning model testing becomes a priority.

It is recommended to create behavioral tests to evaluate the model on each of its identified capabilities and choose the model which scores highest in terms of these capabilities.

For example, suppose this is a sentiment analysis (NLP) project and the possible capabilities are vocabulary, linguistics, negation, named-entity recognition (NER) and topicalization. That means model performance needs to be tested on each of these capabilities, apart from evaluation metrics and plots/curves.
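
To make the idea of capability-based behavioral tests more concrete, here is a minimal sketch. The toy_sentiment function is a hypothetical keyword-based stand-in for a trained model, and the test sentences are invented; only three of the capabilities listed above are covered. The point is the structure: test cases grouped by capability, with a pass rate reported for each group.

```python
import re

POSITIVE = {"good", "great", "love"}
NEGATIVE = {"bad", "awful", "hate"}
NEGATORS = {"not", "never"}

def toy_sentiment(text):
    """Hypothetical stand-in for a trained model: keyword lookup with naive negation."""
    score, flip = 0, False
    for word in re.findall(r"[a-z']+", text.lower()):
        if word in NEGATORS:
            flip = True  # the next sentiment word will be inverted
            continue
        polarity = (word in POSITIVE) - (word in NEGATIVE)
        if polarity:
            score += -polarity if flip else polarity
            flip = False
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

# Invented test cases, grouped by the capability they exercise.
BEHAVIORAL_TESTS = {
    "vocabulary": [
        ("The food was great", "positive"),
        ("The service was awful", "negative"),
    ],
    "negation": [
        ("The food was not good", "negative"),
        ("I would never say it was bad", "positive"),
        ("The plot was hardly good", "negative"),  # the toy model does not know "hardly"
    ],
    "named-entity recognition": [
        # Swapping names and places should not change the predicted sentiment.
        ("Alice said the hotel in Paris was great", "positive"),
        ("Bob said the hotel in Lima was great", "positive"),
    ],
}

def capability_scores(model, tests):
    """Return the percentage of passing cases for each capability."""
    return {
        capability: 100.0 * sum(model(text) == expected for text, expected in cases) / len(cases)
        for capability, cases in tests.items()
    }

scores = capability_scores(toy_sentiment, BEHAVIORAL_TESTS)
for capability, pct in scores.items():
    print(f"{capability}: {pct:.1f}%")
```

In this sketch the negation group exposes a weakness ("hardly") that an aggregate accuracy figure could easily hide, which is the kind of signal behavioral tests are meant to provide.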

The table below illustrates this.

Table 2

Identified capabilities   Score based on identified capabilities (%)

The table above shows that, even though Model 4 had the highest accuracy (98.8%), it scores lower (80%) in terms of capabilities, which translates to behavioral consistency. Instead, Model 3, which has lower accuracy (95.4%) but a higher capability score (90%), is considered for deployment into production.
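
The selection logic described here can be sketched in a few lines. The figures for Models 3 and 4 come from the text; the rule itself (prefer the higher capability score and use accuracy only as a tie-breaker) is an assumption made for this example, not a prescribed method.

```python
# Accuracy and capability scores for the two models discussed in the text.
candidates = [
    {"model": 3, "accuracy": 95.4, "capability_score": 90.0},
    {"model": 4, "accuracy": 98.8, "capability_score": 80.0},
]

def select_model(models):
    """Assumed rule: highest capability score wins; accuracy breaks ties."""
    return max(models, key=lambda m: (m["capability_score"], m["accuracy"]))

chosen = select_model(candidates)
print(f"Selected Model {chosen['model']}")
```

A team could just as reasonably weight the two scores or set a minimum bar for each; the right rule depends on what the customer expects from the model.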

Again, choosing models for inclusion requires both model evaluation and model testing, run concurrently and coordinated with customers’ expectations and needs.

This is the essence of testing in data science—which helps to decide what model to include in final production and deployment.

So, what next?

An Overview of Machine Learning Testing

The answer is to create a sufficient number of behavioral tests for the model under consideration, enough to provide 100% coverage of the software’s capabilities and its optimized rationale. It is also advisable to group these tests under capability headings so nothing is missed and the approach is easy to trace.

Traditional software testing has metrics such as lines of code (LOC), source lines of code (SLOC) or McCabe complexity. But for the parameters of a machine learning model, it is harder to define coverage metrics.

The only possible solution, in this context, is to track model logits and capabilities, and to quantify the area each test covers around these output layers, for all tests executed. Complete traceability between behavioral test cases and the model’s logits and capabilities has to be captured.
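
One simple way to start capturing this traceability is a mapping from test cases to the capabilities they exercise. The sketch below uses hypothetical test-case IDs and the capabilities from the sentiment analysis example; it tracks capabilities only and does not attempt to measure coverage around model logits, which remains an open problem.

```python
CAPABILITIES = [
    "vocabulary",
    "linguistics",
    "negation",
    "named-entity recognition",
    "topicalization",
]

# Hypothetical test-case IDs mapped to the capabilities each one exercises.
TRACEABILITY = {
    "TC-001": ["vocabulary"],
    "TC-002": ["negation"],
    "TC-003": ["negation", "linguistics"],
    "TC-004": ["named-entity recognition"],
}

def coverage_report(capabilities, traceability):
    """Invert the test-to-capability mapping and report untested capabilities."""
    tests_by_capability = {c: [] for c in capabilities}
    for test_id, covered in traceability.items():
        for capability in covered:
            # Raises KeyError if a test names a capability that was never declared.
            tests_by_capability[capability].append(test_id)
    uncovered = [c for c, tests in tests_by_capability.items() if not tests]
    percent = 100.0 * (len(capabilities) - len(uncovered)) / len(capabilities)
    return tests_by_capability, uncovered, percent

tests_by_capability, uncovered, percent = coverage_report(CAPABILITIES, TRACEABILITY)
print(f"Capability coverage: {percent:.0f}%")
print(f"No tests yet for: {uncovered}")
```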

Still, a well-established industry-wide convention is lacking in this regard. Testing for machine learning systems is in such an immature state that professionals still aren’t taking test coverage seriously.

The Two Main Types of Machine Learning Testing

Considering the above scenarios, we derive two broad categories of testing in machine learning applications.

  1. Model evaluation, which covers the metrics and curves/plots that define model performance on a validation or test dataset.
  2. Model testing, which involves explicit checks for behaviors the model is expected to follow.

For these systems, model evaluation and model testing should be executed in parallel—because both are requisite for building high-quality models.

In practice, most experts are doing a combination of the two—where evaluation metrics are calculated automatically and some level of model “testing” is done manually through the error analysis process (i.e., through failure mode and effect analysis). But this is not sufficient.

Testers who get involved early and develop model tests exhaustively for machine learning systems can offer a systematic approach, not only to error analysis but also to achieving complete coverage and automating the entire process.

Required Competencies for a Data Science Testing Team

A good testing team needs to validate the model’s outcomes to make sure it works as expected. The model will keep changing as customer requirements come in, or changes and implementations are made, but the more the team optimizes the model the better the results will look. This cycle of refinement and modifications continues based on the customer’s needs.

Hence, a data science testing team should possess the following minimum competencies (also read: 5 Crucial Skills That Are Needed For Successful AI Deployments):

  1. Understanding the model inside and out. The team needs to know the data structure, parameters and schemas. This is very important for validating results and model outputs.
  2. Understanding the parameters they’re working with. Parameters tell us what is in the dataset and help us find patterns and trends based on what the customer needs. Building the model is a trial-and-error process across several algorithms, which provide insights and highlight the best results for the dataset.
  3. Understanding how the algorithms work. The core of developing models is algorithms—so understanding them (and under what circumstances they can be used) is crucial.
  4. Collaborating closely. Working together helps a testing team better understand what each of their colleagues is implementing so they can create test cases for each feature. It also helps the team conduct exploratory testing and regression testing on new features without breaking the rest (i.e., breaking baseline results). This is a way to understand how the model’s parameters behave with different datasets, and it provides input for producing test plans.
  5. Knowing if the results are accurate. To validate model results, it is important to set a defined threshold. If values deviate beyond the threshold, the result is inaccurate. Some areas of a model can be random, so a threshold is applied to control this randomness, or the level of deviation. This means the result is not wrong as long as it is within the threshold limit percentage.
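
The threshold check in the last item can be sketched as a small helper. The function name within_threshold, the baseline score and the 2% tolerance are assumptions chosen for illustration; the appropriate threshold depends on the model and the customer’s requirements.

```python
def within_threshold(observed, baseline, threshold_pct):
    """Return True if observed deviates from baseline by at most threshold_pct percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero for a percentage deviation")
    deviation_pct = abs(observed - baseline) / abs(baseline) * 100.0
    return deviation_pct <= threshold_pct

# Made-up numbers: a baseline score of 0.90 and a 2% tolerance.
baseline_score = 0.90
runs = [0.91, 0.895, 0.86]
results = [within_threshold(run, baseline_score, 2.0) for run in runs]

for run, ok in zip(runs, results):
    print(f"score={run}: {'within threshold' if ok else 'outside threshold'}")
```

A check like this lets a regression test tolerate the run-to-run randomness of a model while still flagging results that drift too far from the baseline.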

Top Skills Every Data Science Tester Should Have

While the above competencies are important for a data science testing team overall, each tester should possess a number of individual capabilities. (Also read: Top 5 Highest Paying IT Certifications and How to Get Them.)

Here’s what a data science tester needs to “hit the right spot”:


Machine learning systems are tricky to test because developers and testers are not explicitly writing the system’s logic (it’s generated through optimization).

Testers can tackle this issue, as they already deal with large sets of data and know how to use them optimally. Moreover, testers are experts in looking critically at data and are concerned less with code and more with data and domain knowledge. All this helps testers embrace data science and machine learning with ease; for them, it is just a matter of changing the lever and fine-tuning the engine for a new route in their ongoing journey.


Supriya Ghosh

I am a Data Science practitioner, mentor, and a researcher with more than 16 years of industry experience. During my professional stint, I have been involved in scaling up and leading cross-functional analytics and data science teams. My expertise includes Supervised Learning, Unsupervised Learning, Predictive Modelling, Recommendation systems, Regression, Classification, Clustering, NLP, Statistics, Supply chain analytics, and Marketing analytics wherein I contribute towards turning Business Problems into Solutions, improving operations, Project Management, and Delivery, support corporate strategy and drive business growth in multiple domains. I have a Bachelor of Technology (Electronics and Communications), and a Master of Business Administration (IT and Operations)…