
Granite Guardian
IBM built Granite Guardian, a model-and-framework combination that acts as a protective filter for common failure modes in AI pipelines. First, the model scans prompts for content that is undesirable in itself or likely to elicit undesirable answers (hate, violence, profanity, and so on). Second, it watches for jailbreak attempts that try to trick the LLM into evading its guardrails. Third, it flags poor or irrelevant documents retrieved from any RAG database that's part of the pipeline. Finally, if the system is working agentically, it evaluates the risks and benefits of an agent's function invocations. For each of these checks, the model generates a risk score and a confidence level. The tool itself is open source, but it integrates with IBM's frameworks for AI governance tasks like auditing.
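The multi-stage filtering idea can be sketched as follows. This is a minimal illustration of the pattern, not Granite Guardian's actual API: the check names, keyword heuristics, and threshold are placeholders standing in for calls to the real model in its different detection modes.

```python
from dataclasses import dataclass

@dataclass
class RiskResult:
    check: str
    risk: float        # 0.0 (safe) to 1.0 (high risk)
    confidence: float  # how certain the scorer is about this risk value

def score_pipeline(prompt: str, retrieved_doc: str) -> list[RiskResult]:
    """Run each guardrail stage and collect risk/confidence scores.

    Each stage below stands in for a call to a guardrail model with a
    different detection mode; the keyword checks are toy placeholders.
    """
    results = []
    # Stage 1: undesirable content in the prompt (hate, violence, etc.).
    harmful = any(w in prompt.lower() for w in ("attack", "weapon"))
    results.append(RiskResult("harmful_prompt", 0.9 if harmful else 0.05, 0.8))
    # Stage 2: jailbreak attempts that try to trick the model past its rules.
    jailbreak = "ignore previous instructions" in prompt.lower()
    results.append(RiskResult("jailbreak", 0.95 if jailbreak else 0.02, 0.7))
    # Stage 3: relevance of the retrieved RAG document to the prompt.
    overlap = set(prompt.lower().split()) & set(retrieved_doc.lower().split())
    results.append(RiskResult("irrelevant_context", 0.1 if overlap else 0.8, 0.6))
    return results

def blocked(results: list[RiskResult], threshold: float = 0.5) -> bool:
    """Block the request if any single check exceeds the risk threshold."""
    return any(r.risk > threshold for r in results)
```

In a real deployment, each stage's score would come from the guardian model itself, and the pipeline could branch differently per check, for example re-retrieving documents on an irrelevance flag rather than blocking outright.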
Claude
As Anthropic built successive editions of Claude, it maintained a guiding list of ethical principles and constraints that it came to call a constitution. The latest version was mainly written by Claude itself as it reflected on how to enforce those rules when answering prompts. The rules include strict prohibitions on dangerous acts, such as helping to build bioweapons or taking part in cyberattacks, as well as more philosophical guidelines like being honest, helpful, and safe. When Claude engages with users, it tries to stay within the boundaries defined by the constitution it helped to create.
WildGuard
The developers of the Allen Institute for AI's WildGuard started with Mistral-7B-v0.3 and fine-tuned it on a combination of synthetic and real-world data to defend against harm. WildGuard is a lightweight moderation tool that scans LLM interactions for potential problems. Its three functions are to identify malicious intent in user prompts, detect safety risks in model responses, and measure the refusal rate, that is, how often a model declines to answer. That last metric is useful for tuning a model to be as helpful as possible while remaining within safe bounds.
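To show how those three per-interaction labels turn into tuning metrics, here is a small sketch that aggregates them over a batch. The dictionary fields mirror WildGuard's three outputs conceptually, but this format and the derived metrics are illustrative, not the model's actual output schema.

```python
def moderation_stats(labels: list[dict]) -> dict:
    """Aggregate per-interaction moderation labels into batch-level metrics.

    Each entry carries three booleans: whether the prompt was harmful,
    whether the response was harmful, and whether the model refused.
    """
    n = len(labels)
    refusals = sum(x["refusal"] for x in labels)
    # Refusals on benign prompts suggest the model is over-cautious,
    # which is exactly what refusal-rate tuning tries to reduce.
    benign = [x for x in labels if not x["harmful_prompt"]]
    over_refusals = sum(x["refusal"] for x in benign)
    return {
        "refusal_rate": refusals / n,
        "over_refusal_rate": over_refusals / max(len(benign), 1),
        "harmful_response_rate": sum(x["harmful_response"] for x in labels) / n,
    }
```

The trade-off the paragraph describes shows up directly in these numbers: pushing `harmful_response_rate` down while keeping `over_refusal_rate` low is the balance between safety and helpfulness.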

