AI Quality Standards
There are growing calls for AI quality standards. This makes sense: it is sometimes difficult for laypeople to judge AI performance, or to understand why an AI makes certain mistakes.
However, there are currently no official AI quality standards. There are initiatives, such as the LegalBench project in the English-speaking world, which compares large language models across defined categories of legal tasks.
But as a user, there is currently little you can find out about the quality of a vendor's AI.
The Challenge of Measuring AI Results
The more complex the issue, the more difficult it is to evaluate the AI output. For simple tasks, such as recognising certain contract clauses, an AI's performance can be quantified by direct comparison with human results. However, as tasks become more unstructured, such as interpreting legal documents, it becomes more difficult to develop standardised testing methods.
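As a minimal sketch of such a direct comparison, assume a human reviewer and the AI have each marked, clause by clause, whether a clause belongs to the category being searched for. The function name `clause_metrics` and the sample labels are hypothetical and chosen only for illustration; precision, recall and F1 follow their standard definitions.

```python
def clause_metrics(human_labels, ai_labels):
    """Compare AI clause detection against human reference labels.

    Both arguments are equal-length lists of booleans: True means the
    clause was marked as belonging to the target category.
    """
    if len(human_labels) != len(ai_labels):
        raise ValueError("label lists must have the same length")

    pairs = list(zip(human_labels, ai_labels))
    tp = sum(1 for h, a in pairs if h and a)      # marked by human and AI
    fp = sum(1 for h, a in pairs if not h and a)  # marked by the AI only
    fn = sum(1 for h, a in pairs if h and not a)  # missed by the AI

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical sample: 8 clauses reviewed by a human and by the AI.
human = [True, True, True, False, False, True, False, False]
ai = [True, True, False, False, True, True, False, False]
metrics = clause_metrics(human, ai)
```

A comparison like this only works when the human answer is unambiguous. For open-ended interpretation there is often no single correct label to count against, which is why standardised tests are harder to build for such tasks.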
High AI Quality Indicators
Based on discussions with experts, the following indicators point to good AI quality at an AI provider:
- Data protection and security: The provider's data protection regulations and the security standards of its technology infrastructure (for example, ISO certificates) are one indicator of quality.
- In-house test sets: Some AI providers work with their own test sets, which they use to continuously measure and improve the AI output.
- Experts & AI developers working together: The development of an AI for a specific use case is particularly successful when specialists who can evaluate the AI's output work together with the AI developers.
Figure: At Legartis, only AI results that reach an F1 score above 0.92 (92%) are made available to the customer.
How To Evaluate an AI for Your Business Case?
There are more and more providers offering AI solutions for different areas of application. How do you find out whether the AI quality meets your requirements?
1. Narrow down the area of application and define AI quality
The more clearly defined and limited the area of application, the easier it is to say how high the quality of the AI needs to be. Should the AI run a process fully automatically, without subsequent verification by a human (autonomous AI)? Then the error tolerance for the AI output should be close to zero.
Or is the AI used for a process whose results are then briefly checked by a human (AI assistance)? Then the error tolerance can be somewhat higher.
In order to evaluate an AI, it is important that the area of application is narrowed down and the error tolerance is defined.
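One way to make such a definition concrete is to write down a maximum acceptable error rate for each mode of use and check measured results against it. The sketch below is illustrative only: the mode names, the threshold values and the function `meets_tolerance` are assumptions, not figures taken from any standard or vendor.

```python
# Hypothetical maximum error rates per mode of use (assumed values).
MAX_ERROR_RATE = {
    "autonomous": 0.001,  # no human check afterwards: close to zero
    "assisted": 0.08,     # a human briefly reviews the output
}


def meets_tolerance(mode, errors, total):
    """Return True if the observed error rate is within the tolerance."""
    if total <= 0:
        raise ValueError("total must be positive")
    if mode not in MAX_ERROR_RATE:
        raise ValueError(f"unknown mode: {mode}")
    return errors / total <= MAX_ERROR_RATE[mode]
```

With these assumed thresholds, 5 errors in 100 reviewed items would pass for AI assistance but fail for autonomous use. The actual values have to come from your own assessment of the process and its risks.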
2. Data protection regulations and security guidelines of the company
As with other technology applications, the provider's data protection regulations and security guidelines should be in line with those of your own organisation.
3. Perform a plausibility check
Once you know which process you want to use an AI for, you can contact the various providers and ask them directly:
- Do you use your own test sets? What do they look like?
- Are there specialists in your AI development team (in the legal field, for example, lawyers) who check and improve the AI results?
Areas of Application and AI Quality
AI offers significant advantages for repetitive tasks, such as searching through contracts for specific clauses, which can take a person hours. It can perform these tasks more efficiently and accurately than humans, who become more prone to error as they tire.
Examples of good applications for AI:
- Routine tasks with a clear focus: AI works well with routine work that is clearly defined and not too complex. It completes such tasks efficiently and reliably. Example: AI is successfully used for reviewing contracts and analysing clauses. It is very well suited for initial contract reviews.
- Document review and risk analysis: In areas such as due diligence, where large volumes of documents need to be searched for specific questions, AI provides valuable preliminary work by extracting and structuring relevant information.
- Creation of text suggestions: AI provides good support when creating text suggestions. These may be drafts for e-mails or briefs, or wording suggestions in contracts. The suggestions are then checked by a human and adapted if necessary.