Marilyn Monroe famously crooned that diamonds were a “girl’s best friend.” But most people don’t want pressurized carbon that comes at the cost of human life — so-called blood or conflict diamonds. To address these concerns, jewelers offer customers ethical certifications for the provenance of their gems.
AI providers are in a similar position. As machine learning and large language models have become embedded in businesses, the origin of the data used to train these systems, and the ways in which that data is used, are of crucial importance to the organizations adopting them.
Wild-harvested data that flagrantly violates copyright and intellectual property laws is increasingly frowned upon. Broader ethical concerns about how these models operate and utilize the data are also becoming legal and regulatory issues. Liability concerns are ballooning.
Companies that offer AI products are now providing their customers with detailed reports — ethical scorecards — that offer an inventory of where the data their models were trained on comes from, how it was processed, and how it is used. These scorecards help organizations build trust with their customers, who can, in turn, present their offerings to the end user with more confidence.
InformationWeek talked to Cindi Howson, chief data and AI officer at ThoughtSpot, and Jamie Hutton, co-founder and chief technology officer at Quantexa, about how ethical AI scorecards can provide companies with the transparency they need to select the right product — and end users with assurance that they are receiving information that has been properly sourced.
Legal Requirements
The data used to train AI models is subject to a patchwork of inconsistently enforced regulations. The EU’s AI Act is the only comprehensive set of legislation to regulate data use by AI platforms and, like other European technological regulations, will likely serve as a template for other jurisdictions. It overlaps with the mandates of the other major body of legislation passed in the EU, the GDPR.
Ethical scorecards leverage the frameworks laid out in this legislation — as well as in non-binding frameworks such as those issued by the Organisation for Economic Co-operation and Development — to report data sources and utilization to users and regulators in a comprehensible fashion. A variety of criteria developed by ethicists and published in academic journals may also be used.
While these scorecards serve as indicators of ethical behavior in general, they are also compliance documents, demonstrating a company’s adherence to rules on data sourcing, privacy, impartiality, and accountability.
Anticipating the wider enactment of AI legislation is increasingly seen as a necessary protection for users. AI providers such as Anthropic have already faced costly claims over narrower copyright violations. Other regulatory bodies also police the data used in AI.
“The FDA regulates healthcare and medical devices,” Howson said. “There are frameworks for that, but they’re not getting to fine-grained detail.”
In finance, details are key. Howson pointed out that a ZIP code, for example, cannot be used in credit decisions, because it can act as a proxy for race, a form of discrimination known as redlining.
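Howson’s example lends itself to a concrete check. Below is a minimal sketch, with hypothetical column names and toy data, of how a lender might screen a candidate feature such as ZIP code for proxy effects before it reaches a credit model:

```python
# Hypothetical proxy screen: flag features that statistically stand in for
# a protected attribute. Column names and data here are illustrative only.
import pandas as pd
from scipy.stats import chi2_contingency

def proxy_strength(df: pd.DataFrame, feature: str, protected: str) -> float:
    """Cramer's V between a candidate feature and a protected attribute;
    values near 1.0 suggest the feature is a near-perfect proxy."""
    table = pd.crosstab(df[feature], df[protected])
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()
    min_dim = min(table.shape) - 1
    return (chi2 / (n * min_dim)) ** 0.5

applicants = pd.DataFrame({
    "zip_code": ["60601", "60601", "60827", "60827", "60601", "60827"],
    "race":     ["A",     "A",     "B",     "B",     "A",     "B"],
})

# In this toy data, ZIP code perfectly predicts race (V = 1.0), so a credit
# model trained on zip_code would be redlining in all but name.
print(f"Cramer's V: {proxy_strength(applicants, 'zip_code', 'race'):.2f}")
```

Production fairness audits go much further than a single association statistic, but even a screen this simple surfaces the proxy problem Howson describes.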
“It’s not just good practice to have models that are explainable and transparent. It’s a requirement,” Hutton said. “The regulator wants to make sure the models aren’t biased — that they’re not targeting a particular age range, ethnic background, race, or sex.”
If an AI model violates these regulations because its creators did not adequately consider them, both the vendor and the user are exposed to risk. Given the broad geographic reach of many models, a generalized compliance approach is advisable, with attention to industry-specific and local laws. Scorecards can thus help organizations market their products to clients operating under these constraints and serve as a basis for negotiating terms of service.
The volatility of the marketplace, however, complicates the use of scorecards. Not everyone will want the most tightly zipped-up product, Hutton noted. “If you tightly regulate in geography A, but you don’t in geography B, then you’ve got competitive advantage challenges,” he said. “It is something that every government is trying to grapple with at the moment.”
Compiling an Ethical Scorecard
Ethical scorecards are complex documents, highly specific to industries and individual clients. They surface relevant ethical factors from the model cards compiled during a model’s development.
“That documentation will include things like what data it was trained on, what approaches were taken, justifying that a feature is fair,” Hutton said. “It gets collected into a huge document that explains all the things that go into the features that go into the model itself.”
An ethical scorecard extracts information regarding data provenance and organization, explainability of how the data is deployed, limitations of the model, potential biases, protection of privacy rights, and the ability of humans to intervene. It then documents the intersection of these issues with compliance.
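What such a document captures can be sketched in code. The following is a minimal, illustrative Python structure, with assumed category and field names rather than any standard schema, tying each ethical dimension back to the evidence that supports its score:

```python
# Illustrative sketch of how scorecard entries might be structured;
# the categories, fields, and model name below are assumptions.
from dataclasses import dataclass, field

@dataclass
class ScorecardEntry:
    category: str        # e.g., "data provenance", "explainability"
    score: int           # 1 (poor) through 5 (excellent)
    evidence: str        # pointer to the model card section that supports it
    comments: str = ""   # rationale or remediation notes

@dataclass
class EthicalScorecard:
    model_name: str
    entries: list[ScorecardEntry] = field(default_factory=list)

    def overall(self) -> float:
        """Unweighted mean; a real scorecard might weight by regulatory risk."""
        return sum(e.score for e in self.entries) / len(self.entries)

card = EthicalScorecard("example-model", [
    ScorecardEntry("data provenance", 4, "model card sec. 2", "licensed corpora"),
    ScorecardEntry("explainability", 2, "model card sec. 5", "black-box LLM core"),
])
print(f"{card.model_name}: {card.overall():.1f} / 5")
```

The design point is the `evidence` field: a scorecard is only as defensible as its traceability back to the underlying model documentation.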
But the scoring process is also complicated. Standardized, objective metrics for these factors have yet to be widely adopted. And while the underlying information is relatively easy to obtain for some machine learning applications, LLMs and other components of agentic AI are more opaque. They operate in ways that are not fully understood even by their creators, making them challenging to score accurately.
“They are simply more black box than they have been,” Hutton cautioned, referring to advanced AI systems. “What does that mean for explainability? I don’t have a good answer on that yet, but I think it’s going to be a trend that everyone needs to get their heads around.”

Howson also sounded the alarm on LLMs. “Originally, LLMs were just tested for accuracy,” she said. How well they could generate correct responses was the primary evaluation metric. That focus on performance often came at the expense of transparency and ethical considerations.
“For the most part, LLMs are not transparent. We do not know the full body of data that GPT models were trained on,” she said, underscoring the need for companies to adopt “ethics by design,” the practice of embedding ethical principles — transparency, accountability, fairness — into the development process from the beginning.
Benchmarks such as Stanford’s Holistic Evaluation of Language Models offer guidance on scoring safety and bias, which may prove valuable to organizations and clients that rely on these qualities to protect their reputations.
In the interim, even crudely fashioned ethical scorecards will likely be an asset to vendors and organizations alike as they navigate AI implementation and its consequences.
Ethical Scorecard for AI Systems: Evaluation Criteria
Scoring System
1. Poor performance: Significant improvements needed.
2. Below average: Some criteria met, but major gaps remain.
3. Average: Meets minimum ethical standards.
4. Good: Exceeds basic ethical requirements in most areas.
5. Excellent: Fully aligns with ethical principles and best practices.
Instructions for Use
- Evaluate each category by answering the key questions and assigning a score from 1 to 5.
- Provide comments to explain the rationale behind each score or highlight areas for improvement.
- Use the scorecard to identify strengths and weaknesses in the AI system and prioritize ethical improvements.
SOURCE: The sample scorecard template was generated by Informa TechTarget’s in-house large language model, based on established ethical AI guidelines and frameworks from sources including the European Commission’s Ethics Guidelines for Trustworthy AI, the IEEE Global Initiative on Ethics of Autonomous and Intelligent Systems, and Stanford’s Holistic Evaluation of Language Models.
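To make the template concrete, here is a minimal sketch that encodes the 1-to-5 rubric above and follows the final instruction by surfacing the weakest categories first; the category names and scores are hypothetical:

```python
# The label text mirrors the Scoring System list above; the scores are
# placeholder values that would come from an actual review.
RUBRIC = {
    1: "Poor performance: significant improvements needed",
    2: "Below average: some criteria met, but major gaps remain",
    3: "Average: meets minimum ethical standards",
    4: "Good: exceeds basic ethical requirements in most areas",
    5: "Excellent: fully aligns with ethical principles and best practices",
}

scores = {"data provenance": 4, "explainability": 2, "privacy": 3}

# Sort ascending so the weakest categories, the ones to prioritize for
# ethical improvement, appear first in the report.
for category, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{category:15} {score}/5 -> {RUBRIC[score]}")
```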