Why current testing processes in AI/ML are not enough?

The current notions of quality assurance and testing in AI/ML pipelines is based on the idea of validation using a random set-aside set of data on which the model is tested and metrics computed thereof. Metrics like accuracy on the random set-aside data set termed ambiguously as test data, is the usual rubric for evaluation of the effectiveness of the ML models. But this only gives a partial picture of the quality of the model, which is not sufficient to guarantee good performance on deployment.

Probably it is because of the terminology of “test data” used in the process that the big picture of testing is missed out in the ML life cycle. There are however some additional validation mechanisms also suggested to further boost the evaluation process like

  1. K Fold cross validation
  2. Bootstrap
  3. Leave One Out Cross Validation

However all the above validation approaches including randomized train test split mechanisms are based on the notion that testing the model on randomized unseen data is a good enough validation of the corresponding model. We feel that is an incomplete picture which is not complete to guarantee overall performance of the model in the field.

Here are a set of qualified reasons as to why we need to think beyond current model evaluation and validation approaches to guarantee AI / ML model quality.

  1. The random selection of test set including cross validation based approaches do not guarantee a comprehensive coverage of the input scenarios, especially corner cases which are rare in nature.
  2. Even though cross validation approaches try to cover the overall spectrum via k-fold approach, a systematic approach to understand and debug as to the performance of the model for different variable scenarios of inputs is not possible. Hence detecting what types of input variations are not being sufficiently represented in the model, is impossible in current approaches.
  3. Testing for security, an important non functional IT requirement, is totally absent in current model evaluation approaches. Not to think of application security, now AI models themselves need to be audited for AI specific attacks, hence there is a need for comprehensive security testing of AI/ML models.
  4. In terms of compliance oriented sectors there is an increased push for generation of explanations or rationale for the AI/ML model decisions. So testing for explainability is must for today s AI ML models.
  5. Performance of AI/ML models is to be tested independent of the system in which the AI/ML models are deployed. Because there are specific deployment formats like tinyML which need a comprehensive validation of performance at model level.
  6. Privacy as well as GDPR imposed constraints on data and derived AI models are a huge set of desiderata for AI ML applications. So testing AI ML models for privacy breaches or attacks and leaks forms an important component of the overall requirements to certify and audit AI models.
  7. Testing and assurance of fairness and bias in AI models is an important requirement of AI models to ensure that they do not get recalled or rescinded.
  8. Finally testing of data quality at input level before being fed to the ML process is vital, as a lot of quality issues at model arise due to which we need to ensure testing of quality at input data level before being fed to the AI / ML model.
  9. In several scenarios there is not sufficient data to test the AI models. In those scenarios the data adequacy of the models need to be tested and if need be mechanisms to augment test data, be made available

Overall these desiderata really point us to the requirement of standalone frameworks and processes and products for AI testing which can handle all the abovementioned tests for ML models of all types. To ensure a trustworthy and responsible AI a comprehensive set of tests of all the points above is a mandatory requirement.

— Dr Srinivas Padmanabhuni