Madhusudan Srinivasan
East Carolina Univ.
Abstract
As AI systems are increasingly deployed in high-stakes domains such as healthcare, financial services, and hiring, ensuring their reliability and fairness has become critical. A fundamental obstacle is the oracle problem: the difficulty of defining correct outputs for complex, non-deterministic models. This talk presents metamorphic testing (MT) as an oracle-free framework for testing AI and machine learning systems, in which metamorphic relations (MRs) define expected behavioral relationships between transformed inputs and outputs, enabling systematic fault detection without ground-truth labels.
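To make the idea concrete, here is a minimal sketch of one classic MR for supervised classifiers, permutation invariance: reordering the training data should not change a prediction, so no ground-truth label is needed to check it. The toy 1-nearest-neighbour model and all names below are illustrative, not from the talk.

```python
# Hypothetical metamorphic test sketch (toy model, not from the talk).

def knn_predict(train, query):
    """1-nearest-neighbour prediction over (features, label) pairs."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    features, label = min(train, key=lambda ex: dist(ex[0], query))
    return label

train = [((0.0, 0.0), "neg"), ((1.0, 1.0), "pos"), ((0.9, 0.8), "pos")]
query = (0.2, 0.1)

# MR (permutation): reordering the training set must not change the output.
source_output = knn_predict(train, query)
followup_output = knn_predict(list(reversed(train)), query)
assert source_output == followup_output  # a violation would signal a fault
```

The test compares the model against itself under a transformed input, which is what lets MT sidestep the oracle problem.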
The talk covers two research contributions. First, a fairness test generation approach for LLMs that adapts traditional software test generation techniques to produce diverse test cases targeting fairness faults across sensitive attributes. Evaluated on GPT-4.0 and LLaMA-3.0, it outperforms template-based and grammar-based baselines in both fault detection rate and test diversity. Second, a data-diversity-driven MR prioritization framework that addresses the ineffectiveness of code coverage for ML programs and the high cost of exhaustive MR evaluation. Four diversity metrics are proposed to rank MRs by effectiveness, and experiments across multiple ML models demonstrate up to a 40% improvement in fault detection over random prioritization and a 26% reduction in time to first fault. The talk concludes with future directions, including testing of autonomous AI agents, quantum software, and adaptive MR generation.
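The prioritization idea can be sketched as follows, under loud assumptions: the MRs, the diversity metric (mean pairwise Euclidean distance of follow-up inputs), and all names are illustrative stand-ins, not the four metrics from the talk.

```python
# Illustrative sketch of data-diversity-driven MR prioritization:
# score each MR by how spread out its follow-up (transformed) inputs
# are, then run the highest-scoring MRs first. The metric here is a
# placeholder, not one of the talk's proposed metrics.
import itertools
import math

def mean_pairwise_distance(points):
    pairs = list(itertools.combinations(points, 2))
    if not pairs:
        return 0.0
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)

# Hypothetical MRs expressed as input transformations.
mrs = {
    "shift": lambda x: [v + 1.0 for v in x],
    "scale": lambda x: [v * 3.0 for v in x],
    "negate": lambda x: [-v for v in x],
}

source_inputs = [[0.0, 1.0], [2.0, 0.5], [1.0, 1.0]]

# Rank MRs by the diversity of the follow-up inputs they generate.
scores = {name: mean_pairwise_distance([tuple(mr(x)) for x in source_inputs])
          for name, mr in mrs.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
```

Here "scale" ranks first because it spreads the follow-up inputs most; a cheap ranking like this is what lets a tester skip exhaustive evaluation of every MR.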
About the Speaker
Dr. Madhusudan Srinivasan is a faculty member in the Department of Computer Science at East Carolina University, USA. He is an active researcher in software testing for AI and machine learning systems, with a particular focus on metamorphic testing. His work includes one of the first studies on metamorphic relation prioritization for ML programs, which introduced data diversity as a principled criterion for selecting and ranking metamorphic relations and established a new direction in cost-effective ML testing. His research has been published in reputable venues, including the Software Testing, Verification and Reliability journal, the SANER conference, and the IEEE International Conference on AI Testing (AITEST). He is the recipient of a Best Paper Award for his work on trustworthy AI, and his recent publications span fairness testing in LLMs, blockchain smart contract validation, and execution-profile-driven test prioritization, reflecting a growing and broad research program in reliable AI software engineering.