AI Unraveled
AI Unraveled: Latest AI News & Trends, ChatGPT, Gemini, Gen AI, LLMs, Prompting, AI Ethics & Bias
🛡️The Future of AI Safety Testing with Bret Kinsella, GM of Fuel iX™ at TELUS Digital
0:00
-15:13

🛡️The Future of AI Safety Testing with Bret Kinsella, GM of Fuel iX™ at TELUS Digital

How a New Method is Red Teaming LLMs

Speaker: Bret Kinsella, GM of Fuel iX at TELUS Digital

Host: Etienne Noumen, P.Eng Creator of AI Unraveled

Watch Full Original Interview video at:

1. Executive Summary

This show explores the evolution of AI safety testing, particularly concerning large language models (LLMs). It highlights the limitations of traditional "pass/fail" red teaming and introduces a novel approach called Optimization by PROmpting (OPRO), which enables an LLM to effectively "red team itself." This new methodology focuses on evaluating the Attack Success Rate (ASR) as a distribution, offering more nuanced insights into an AI model's security. The discussion also touches upon the real-world implications for enterprises, especially in regulated industries like finance, energy and healthcare, and how OPRO can aid in demonstrating regulatory compliance and fostering accountability. Ultimately, the guest looks towards the future of AI safety, identifying upcoming challenges and areas for focused research and development.

2. Bret Kinsella's Journey and the Genesis of Fuel iX™

Bret Kinsella's 30-year career in technology, spanning the internet, RFID, and mobile, has consistently focused on "drivers of adoption and barriers to adoption." For the past 12 years, he has been deeply involved in AI, particularly conversational AI and more recently, generative AI. His work, including founding companies and a research business (Voicebot.ai), led him to TELUS Digital about 18 months prior to the interview.

TELUS Digital, a leading global technology company specializing in digital customer experiences with more than 78,000 employees globally, sought to "harden and extend" its internally developed AI applications and explore external market opportunities for these technologies. Kinsella was brought in to guide this process, leading to the development of Fuel iX, the company’s proprietary generative AI platform and suite of products that help enterprises advance their GenAI pilots to working prototypes and production at scale, quickly, securely and responsibly across multiple environments, applications and clouds.

A key focus for Kinsella at Fuel iX became AI "safety and security," which he distinguishes as separate but equally vital. This focus was driven by the recognition that generative AI, with its "unbounded inputs and outputs systems," introduces significant risks, including "reputational risk," "legal risk," "regulatory risk," and "competitive risk," which could act as a "barrier to adoption."

Fuel iX solutions, such as "Fuel iX Copilots," are general-purpose tools rolled out to "tens of thousands of people internally across our organizations plus some other customers." These tools are used across various functional areas like "finance, HR, marketing, IT, in the contact centre," demonstrating the pervasive integration of generative AI within TELUS Digital's operations. Kinsella stresses the importance of user-led customization and grounding in proprietary data to maximize the efficacy of these tools, empowering frontline workers to "find the efficiency for the task."

3. The Flaws of Traditional Red Teaming for LLMs

Red teaming, a long-standing security practice, involves experts attempting to compromise systems in order to identify vulnerabilities in a safe, controlled environment. The goal of red teaming is to expose weaknesses so that they can be addressed adequately by the “blue team.”

However, Kinsella identifies fundamental flaws when applying traditional red teaming to LLMs:

  • Unbounded Nature of Generative AI: Unlike traditional programmatic systems with a limited number of possible inputs and outputs, generative AI is probabilistic and unbounded on both the input and output sides. This means inputs are by definition variable and outputs can vary across runs, making exhaustive pre-approval or evaluation practically impossible.

  • Over-reliance on Guardrails: Existing safety measures focus heavily on guardrails (intervention technologies like input scanners, output filters, or system prompts) that are reactive and potentially probabilistic. They mitigate some risks and have an important part to play in any LLM security ecosystem, but do not fully prevent vulnerabilities from arising and are more of a stopgap measure.

  • Scalability Mismatch: Co-pilots, bots, and AI assistants are capable of higher volume and scale than human red teamers. Artisanal attacks take time and effort that is better spent on refining novel attack methods than producing broad coverage. This mismatch necessitates automated approaches for vulnerability discovery.

  • Inadequacy of Existing Security Tools: Traditional tools were designed for deterministic, programmatic systems. They are ill-suited for unbounded systems where both inputs and outputs are given in natural languages such as English.

  • Probabilistic Nature of LLM Vulnerabilities: A critical finding from TELUS Digital's research (pre-published on arXiv) shows that repeating the same attack prompt against an LLM application can yield different outcomes. Since LLMs are probabilistic in nature, the same attack may succeed or fail depending on the attempt. This yields a probability of success given an attack against the target system which is stable and discoverable over repeated trials. Since individual attacks have statistical properties, their proper evaluation requires statistical treatment. This probability of attack success serves as an estimate of attack quality as well, as it represents how discoverable the associated vulnerability happens to be.

  • Limited Human Creativity and Maliciousness: Human red teamers, while creative, are bounded by individual imagination. Discomfort with certain malicious scenarios or other internal biases will hold people back from testing a full range of attack options. Attackers in the wild, however, have no such qualms or concerns. Luckily for us, neither do automated systems once calibrated for this purpose.

4. Applying Our Measure of Attack Quality to Optimization by PROmpting (OPRO)

To address these limitations, Kinsella points to “Optimization by PROmpting (OPRO)”, a method introduced by Yang et al. (2024) that treats LLMs as general-purpose optimizers. OPRO is not itself an attack-generation method, it is used in conjunction with our new measurement of attack quality to optimize our automated red teamer. In successive iterations, the technique is capable of optimizing our attacker to produce a higher proportion of high quality attacks given a specific target in question.

Key aspects of our application of OPRO:

  • AI as a Self-Optimizer: OPRO allows us to use the LLM itself as an optimizer for improving our attack generator. This mimics fine-tuning except at the prompt level, gradually locking onto specific vulnerabilities in a given target.

  • Feedback Loop via Contrastive Attack Pairs: Our contribution, called “ASR-delta pair mining”, is used to produce example pairs for our optimizer. We select pairs of the most semantically similar attacks that have the largest difference in evaluated quality. So if two attacks appear to have the same exact technique, objective, overall meaning and one has 90% success with the other sitting at 10%, we use this as an instructive example. What caused one to succeed 90% of the time with the other failing at the same rate? This is what our optimizer is capable of figuring out, adjusting our attacker to isolate and emulate the specific factors driving attack success.

  • Scale and Novelty: Using this method, our generator can be iteratively improved at scale. Unlike manual prompt tweaking, this process systematically makes use of statistical evidence from repeated trials.

  • Blueprint for Mitigation: The output is an optimized, improved automated red team agent that exposes vulnerabilities at a much higher rate. Organizations can then use this information to adjust system prompts, strengthen guard rails, and build layered defenses.

  • Prevention over Reaction: By focusing on improving the generator proactively, our approach helps discover vulnerabilities before deployment. This shifts emphasis from reaction to prevention.

5. Measuring Risk with Attack Success Rate (ASR) as a Distribution

Instead of evaluating attacks by whether they succeed or not on a single attempt, Kinsella’s team evaluates them by probability of success. This changes our evaluation of the automated red teamer from a point-estimate (its attack success rate) to a probability distribution (capturing all of the individual attacks’ success rates). This reflects the probabilistic nature of LLMs and helps surface the discoverability of vulnerabilities across an automated red teamer’s observed output.

  • Multiple Trials per Attack: Each attack is executed repeatedly against a seeded target. The proportion of successes yields an ASR score for that individual attack.

  • Building the Distribution: Collecting ASR scores across many unique attacks produces an ASR distribution, which contains far more information than a single aggregate rate.

  • Higher Fidelity Risk Assessment: The ASR distribution reveals clusters of consistently successful attacks, differences between near-identical attacks, and other exploitable patterns. This allows for more accurate assessments of vulnerability likelihood than traditional approaches to generator evaluation.

  • Guidance for Optimization: Because the ASR distribution helps us identify high versus low performing attacks, it provides the statistical foundation for our ASR-delta pair mining approach. This makes it central to optimizing the red team agent, and ultimately, to a better understanding of risk.

6. Real-World Impact: A New Standard for Enterprise

For "high-stakes industries like finance or healthcare," Kinsella advises a shift in safety testing practices based on three pillars: "comprehensiveness, repetition, and creativity."

  • Comprehensiveness: Go "beyond what you think you need to do." Start with frameworks like "OASP.10" and "MITER attack models" but recognize their limitations as checklists. TELUS Digital has developed "139 attack objectives" categorized into "15 different vulnerable segments." Tailoring is crucial, as "finance, healthcare, energy have different types of specific vulnerability considerations." Organizations can integrate their "code of conduct" or "enter in your own" specific vulnerabilities.

  • Repetition: Conduct tests "multiple times over and over again just to make sure that your first, second, third attempts are representative of what this is likely to be in the field."

  • Creativity (via Automation): Leverage "automation for comprehensiveness, repetition, and ingenuity" to overcome the limitations of human red teamers.

Kinsella also stresses the importance of frequency in testing:

  • Organizations often test "when they launch a product," but fail to re-test when "the model's updated in seven months," "swap out an orchestration tool," or to check for "regression or novelty."

  • Automation allows for "good hygiene," enabling testing "more frequently." A product or project manager can run tests "at any given time" or "schedule it," providing "data at your fingertips" for application owners and security teams. This allows for "proactivity as opposed to reactivity with guardrails" to "close off or mitigate those risks."

7. The Regulatory Landscape: From Policy to Practice

Kinsella acknowledges that current regulations, such as “America’s AI Action Plan and what's going on in Europe," are often "ambiguous" and "vague," making compliance challenging. However, he advises organizations to:

  • Interpret Minimum Requirements: "Guess what these vague regulations mean at a minimum."

  • Anticipate Increased Specificity: Recognize that regulations "are only going to get more specific over time."

  • Proactive Layered Defense: Proactively implement a "layered defense" strategy for both "AI security" and "AI safety." Regulators are increasingly focused on "AI safety issues that might be a reputation hit to you" or "could lead to fines from regulatory bodies."

  • Demonstrate Fiduciary Responsibility: Organizations must "set a standard that you're comfortable with as an organization that you're doing your fiduciary responsibility." OPRO, by providing a detailed vulnerability blueprint, assists companies in "demonstrat[ing] compliance and accountability to regulators."

8. The Future of AI Safety: The Next Frontier

Looking ahead, Kinsella identifies three key areas for focus in AI safety testing:

  • Sophisticated Vulnerability Testing: This is "at the forefront today" because current efforts are "fairly limited." Vulnerability testing will become "much more sophisticated overall so that organizations can proactively close off risk."

  • Supervisor Agents: These "agentic AI system[s]" go "beyond traditional guardrails" by "reviewing all the information that's all the conversations" and looking for "specific things." Kinsella expects them to be "much more common and prevalent" as another layer of defense.

  • Root Cause Identification: Currently lacking focus, understanding the "root cause, why does this come up at the model level, at the data level within your system?" is crucial. This will allow organizations to go "backwards into the model into the data and therefore close off some more of those risks," moving beyond just identifying and protecting against vulnerabilities.

9. The Final Takeaway: Building with Innovation and Responsibility

Kinsella offers practical advice for staying ahead in AI safety, focusing on policy, technology, and process:

  • Policy: Organizations must define "what is important and not important to them." This involves setting clear "governance" particularly around AI safety and security, aligning with "regulation" and acting as a "good corporate citizen" doing "right by your customers."

  • Technology: "Narrow the scope of your instruction to your models and use multiple models to perform different tasks." Avoid overloading single system prompts, as "tokens get lost" and models might "do it" if a "don't" instruction is missed. By using different models for different tasks (e.g., one for "what you're supposed to do" and others for "what you don't do"), you can achieve a broader solution scope while maintaining control.

  • Process: "Everybody really should be testing their systems on a regular basis." Manual red teaming and even technically automated testing "are not going to catch everything." Regular testing, "at least monthly," and after "any type of significant release system upgrade," is essential for "regression testing" and identifying "novelty."

Kinsella concludes by emphasizing the dual challenge and opportunity of AI: "these systems are really extraordinary in many ways but introduce novel risks." Organizations must proactively address "security and safety risk" as "barriers to adoption," ensuring "you set aside that time to do the work to reduce some of these barriers and these harms that could be lurking inside your models."

Learn More:

Discussion about this episode

User's avatar