This article looks at testing as a distinct career option in data science and machine learning (ML). It gives a brief overview of testing workflows and processes, and outlines the expertise and top-level skills a tester needs in order to test an ML application.
Testing in Data Science: Opportunity for Expansion
There is a significant opportunity to explore and expand the possibilities of testing and quality assurance into the field of data science and machine learning (ML).
Playing around with training data, algorithms and modeling in data science may be a complex yet interesting activity, but testing these applications is no less so.
A considerable amount of time goes into testing and quality assurance activities. Experts and researchers estimate that 20 to 30% of overall development time goes into testing the application, and that 40 to 50% of a project's total cost is spent on testing.
Moreover, data science experts and practitioners often complain about having ready-for-production data science models, established evaluation criteria and set templates for report generation, but no teams to help them test them. This opens up testing in data science as a full-fledged career option.
Testing in data science calls for an entirely new context and approach. For such systems, this new backdrop consumes even more time, effort and money than legacy systems do.
To understand this complexity, we first need to understand the mechanics behind machine learning systems.
How Machine Learning Systems Work
In machine learning (ML), humans feed desired behavior as examples during the training phase through the training data set, and the model optimization process produces the system's rationale (or logic).
But what is lacking is a mechanism to find out whether this optimized rationale will produce the desired behavior consistently.
This is where testing comes in.
Workflow for Testing Machine Learning Systems
Typically, in machine learning, an evaluation report for a trained model is automatically produced based on established criteria, which include the following (a brief code sketch follows this list):
- The model's performance, based on established metrics computed on the validation dataset. Accuracy and the F1 score are common metrics, although many others are used as well.
- An array of plots, such as precision-recall curves and ROC curves (with the area under the curve, or AUC). This array is, again, not exhaustive.
- The hyperparameters used to train the model.
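As a rough illustration, the sketch below shows how such an evaluation report might be assembled. It assumes scikit-learn and a binary classifier; names such as model, X_val and y_val are placeholders rather than part of any specific project.

```python
# A minimal sketch of an automated evaluation report, assuming scikit-learn
# and a binary classifier. `model`, `X_val` and `y_val` are placeholders for
# your own trained model and validation dataset.
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_recall_curve, roc_auc_score)

def evaluation_report(model, X_val, y_val):
    y_pred = model.predict(X_val)
    y_score = model.predict_proba(X_val)[:, 1]  # probability of the positive class

    return {
        # Metrics on the validation dataset
        "accuracy": accuracy_score(y_val, y_pred),
        "f1": f1_score(y_val, y_pred),
        "roc_auc": roc_auc_score(y_val, y_score),
        # Points for a precision-recall plot
        "pr_curve": precision_recall_curve(y_val, y_score),
        # Hyperparameters used to train the model
        "hyperparameters": model.get_params(),
    }
```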
Based on the evaluation report, models that offer an improvement over the existing model (or baseline) on the same dataset are promoted and considered for final inclusion.
When reviewing multiple ML models, the metrics and plots that summarize model performance over a validation dataset are inspected. Comparing performance across models supports relative judgments, but it does not immediately tell us whether a model's behavior is adequate.
Let us take an example to understand this.
Case Study: A Hypothetical Data Science Project
Consider a project wherein training data is utilized to develop models. The developed models are tested for performance over a validation dataset and evaluation reports are generated based on accuracy as a metric.
Here are the results:
Table 1

| Model | Accuracy (%) |
|-------|--------------|
| 1 | 85 |
| 2 | 80 |
| 3 | 95.4 |
| 4 | 98.8 |
| 5 | 90.15 |
So, which model is the best? To determine that, we have to look at model behavior—meaning model testing becomes a priority.
It is recommended to create behavioral tests to evaluate the model on each of its identified capabilities and choose the model which scores highest in terms of these capabilities.
For example, suppose this is a sentiment analysis (NLP) project and the possible capabilities are vocabulary, linguistics, negation, named-entity recognition (NER) and topicalization. That means model performance needs to be tested on each of these capabilities, apart from evaluation metrics and plots/curves.
The table below illustrates this.
Table 2

Identified capabilities: Vocabulary, Linguistics, Negation, NER

| Model | Score based on identified capabilities (%) |
|-------|--------------------------------------------|
| 1 | 70 |
| 2 | 50 |
| 3 | 90 |
| 4 | 80 |
| 5 | 75 |
The above table shows that, even though Model 4 scored highest in accuracy (98.8%), it scores lower on capabilities (80%), which translate to behavioral consistency. Instead, Model 3, which has lower accuracy (95.4%) but a higher capability score (90%), is considered for further deployment into production.
Again, choosing models for inclusion based on model evaluation and model testing requires both of these categories of testing to run concurrently and to be coordinated with customers' expectations and needs.
This is the essence of testing in data science: it helps decide which model to include in final production and deployment.
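To make that decision concrete, here is a hypothetical snippet that combines the figures from Tables 1 and 2 and prefers the model with the strongest capability (behavioral) score, using accuracy only as a tie-breaker. The selection rule itself is an assumption for illustration, not an industry convention.

```python
# Hypothetical selection logic using the numbers from Tables 1 and 2.
# The capability (behavioral) score is the primary criterion, with
# validation accuracy as the tie-breaker.
models = {
    1: {"accuracy": 85.0,  "capability_score": 70},
    2: {"accuracy": 80.0,  "capability_score": 50},
    3: {"accuracy": 95.4,  "capability_score": 90},
    4: {"accuracy": 98.8,  "capability_score": 80},
    5: {"accuracy": 90.15, "capability_score": 75},
}

best = max(models, key=lambda m: (models[m]["capability_score"],
                                  models[m]["accuracy"]))
print(f"Model {best} is promoted")  # -> Model 3, as discussed above
```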
So, what next?
An Overview of Machine Learning Testing
The remedy is to create a sufficient number of behavioral tests for the model under consideration, aiming for full coverage of the software, its capabilities and its optimized rationale. It is also advisable to group these tests under different capability headings so nothing is missed and you can easily trace your approach.
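For the sentiment-analysis example above, behavioral tests grouped under capability headings might look like the sketch below. It is written with pytest in mind, and predict_sentiment is a hypothetical wrapper around the model under test.

```python
# Behavioral tests grouped under capability headings, assuming pytest and a
# hypothetical predict_sentiment(text) -> "positive" | "negative" wrapper
# around the model under test.
from my_project.model import predict_sentiment  # hypothetical import


class TestVocabulary:
    def test_positive_adjective(self):
        assert predict_sentiment("The service was excellent.") == "positive"


class TestNegation:
    def test_simple_negation_flips_sentiment(self):
        assert predict_sentiment("The food was not good.") == "negative"


class TestNER:
    def test_entity_swap_keeps_sentiment(self):
        # Changing the named entity should not change the predicted sentiment.
        assert (predict_sentiment("I loved my stay in Paris.")
                == predict_sentiment("I loved my stay in Tokyo."))
```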
Traditional software testing has coverage metrics based on measures such as lines of code (LOC), source lines of code (SLOC) or McCabe (cyclomatic) complexity. But for the parameters of a machine learning model, it is much harder to define coverage metrics.
The only possible solution, in this context, is to track model logits and capabilities, and to quantify the area each test covers around these output layers, for all tests executed. Complete traceability between behavioral test cases and the model's logits and capabilities has to be captured.
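There is no standard API for this kind of traceability, but as a rough sketch one could record, for every behavioral test, which capability it exercises and which output logits it touched, and then report a simple per-capability coverage count:

```python
# A rough sketch of behavioral-test coverage tracking. Everything here is an
# assumption: there is no industry-standard way to measure coverage over a
# model's logits and capabilities.
from collections import defaultdict

CAPABILITIES = ["vocabulary", "linguistics", "negation", "ner"]

coverage = defaultdict(list)  # capability -> list of (test_name, logits)

def record(test_name, capability, logits):
    """Log which capability a test exercised and the logits it observed."""
    coverage[capability].append((test_name, logits))

def coverage_report(min_tests_per_capability=5):
    """Print how many tests exercised each capability, flagging gaps."""
    for cap in CAPABILITIES:
        n = len(coverage[cap])
        status = "OK" if n >= min_tests_per_capability else "GAP"
        print(f"{cap:12s} {n:3d} tests  [{status}]")
```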
Still, a well-established convention is lacking industry-wide in this regard. And testing for machine learning systems is in such an immature state that professionals still aren't taking test coverage seriously.
The Two Main Types of Machine Learning Testing
Considering the above scenarios, we derive two broad categories of testing in machine learning applications.
- Model evaluation, which covers the metrics and curves/plots that explicitly quantify model performance on a validation or test dataset.
- Model testing, which involves explicit checks for behaviors the model is expected to follow.
For these systems, model evaluation and model testing should be executed in parallel—because both are requisite for building high-quality models.
In practice, most experts are doing a combination of the two—where evaluation metrics are calculated automatically and some level of model “testing” is done manually through the error analysis process (i.e., through failure mode and effect analysis). But this is not sufficient.
Testers who pitch in early and develop exhaustive model tests for machine learning systems can offer a systematic approach, not only to error analysis but also to achieving complete coverage and automating the entire process.
Required Competencies for a Data Science Testing Team
A good testing team needs to validate the model's outcomes to make sure it works as expected. The model will keep changing as customer requirements come in or as changes are implemented; the more the team optimizes the model, the better the results will look. This cycle of refinement and modification continues based on the customer's needs.
Hence, below are the minimum competencies a data science testing team should possess:
- Understanding the model inside and out. The team needs to know the data structures, parameters and schemas. This is very important for validating results and model outputs.
- Understanding the parameters they're working with. Parameters describe what is in the dataset and help us find patterns and trends based on what the customer needs. Building the model is a hit-and-miss process across several algorithms that provide insights and highlight the dataset's best results.
- Understanding how the algorithms work. The core of developing models is algorithms—so understanding them (and under what circumstances they can be used) is crucial.
- Collaborating closely. Working together helps a testing team better understand what each of their colleagues is implementing, so they can create test cases for each feature. It also helps them conduct exploratory testing and regression testing on new features without breaking the rest (i.e., breaking baseline results). This is a way to understand how the model's parameters behave with different datasets and provides input for producing test plans.
- Knowing whether the results are accurate. For validating model results, it is important to set a defined threshold: if values deviate beyond the threshold, the result is treated as inaccurate. Some areas of a model can be random, and a threshold is applied to control that randomness (the allowed level of deviation). This means a result is not wrong as long as it stays within the threshold percentage; a minimal sketch of such a check follows this list.
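Here is a minimal sketch of such a threshold check. The 2% tolerance is an arbitrary example; the actual value would come from the customer's accuracy requirements.

```python
# A minimal sketch of threshold-based result validation. The 2% tolerance is
# an arbitrary example; the real threshold comes from the customer's needs.
def within_threshold(observed, expected, tolerance_pct=2.0):
    """Return True if the observed metric deviates from the expected
    (baseline) value by no more than tolerance_pct percent."""
    deviation_pct = abs(observed - expected) / expected * 100
    return deviation_pct <= tolerance_pct

assert within_threshold(observed=94.1, expected=95.4)      # within 2%: acceptable
assert not within_threshold(observed=90.0, expected=95.4)  # deviates too far
```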
Top Skills Every Data Science Tester Should Have
While the above competencies are important for a data science testing team overall, each tester should possess a number of individual capabilities.
Here’s what a data science tester needs to “hit the right spot”:
- Probability and statistics
- Any programming language (think Python, R, SQL, Java or MATLAB)
- Data wrangling
- Data visualization
- Machine learning concepts
- Understanding algorithms
Conclusion
Machine learning systems are knotty to test because developers and testers do not explicitly write the system's logic (it is generated through optimization).
Testers can tackle this issue, as they already deal with large sets of data and know how to use them optimally. Moreover, testers are experts at looking critically at data and are less concerned with code than with data and domain knowledge. All this helps testers embrace data science and machine learning with relative ease; for them, it is just a matter of changing the lever and fine-tuning the engine for a new route in their ongoing journey.