Banks are deploying large language models to handle compliance review, fraud triage, customer onboarding, and document analysis — but the regulatory scaffolding to govern them has a gap that every technology and compliance leader needs to understand right now.
In April 2026, the OCC, Federal Reserve, and FDIC issued their first major overhaul of model risk management guidance since SR 11-7 in 2011, and they explicitly carved generative AI and LLMs out of the revised guidance's scope, promising a separate request for information to follow. That carve-out is not a green light. It is a warning that the agencies consider LLM risk distinct enough to warrant its own treatment, and that institutions operating LLMs today are doing so ahead of formal guidance.
Why LLM Compliance Risk Is Different
Traditional SR 11-7 model risk management was built for statistical models with defined inputs, reproducible outputs, and traceable logic. An underwriting model that scores a loan application can be challenged, back-tested, and documented with a clear methodology. A large language model generating a compliance summary or AML triage narrative cannot be challenged or back-tested in the same way. Its outputs emerge from billions of parameters encoding statistical associations, not from an auditable rulebook.
This distinction creates three categories of LLM-specific risk that traditional model governance frameworks do not fully address.
1. Hallucination and Factual Reliability
LLM hallucination, the generation of confident but fabricated output, is the most discussed risk, and for good reason. In banking contexts, a hallucinated regulatory interpretation, a fabricated transaction pattern in an AML narrative, or an invented product term in a customer disclosure is not merely a quality problem. It is a compliance event. Because a model answering from its training data alone cannot point to a source document for any given claim, the institution bears the entire audit risk when that output is acted upon.
Regulators are already paying attention. The FINOS AI Governance Framework for financial services identifies hallucination and inaccurate outputs as a primary risk category, requiring institutions to implement output validation layers that verify AI-generated content against authoritative data sources before it reaches customers or informs decisions. The February 2026 FS AI RMF, published by the Cyber Risk Institute with input from 108 financial institutions, codifies this further: grounding requirements mandate that LLM outputs connect to verified, current source material — actual product documentation, current rate sheets, verified regulatory text — not the model's training memory.
2. Audit Trail Deficits
The CFPB's adverse action notice requirements under ECOA and the Fair Credit Reporting Act require institutions to explain the specific reasons for credit decisions. An LLM assisting in underwriting — even in a co-pilot capacity — creates an attribution problem. Supervisors examining a denial decision need to trace the logic. When that logic passed through a language model, the audit trail breaks unless the institution has deliberately engineered it back in.
Best-practice LLM deployment logs every query, every response, every confidence indicator, and every escalation decision with a timestamp and unique identifier. This infrastructure is not default behavior in most commercial LLM platforms; it must be deliberately built and maintained. Institutions that have not yet instrumented their LLM deployments at this level are carrying a compliance exposure that their existing model risk teams may not be able to see.
3. Regulatory Classification Uncertainty
The April 2026 revised interagency guidance applies to banking organizations with over $30 billion in total assets, though it is explicitly designed to scale based on model risk exposure rather than asset size. For community banks and mid-size institutions, the OCC issued a separate clarification in late 2025 confirming that SR 11-7 principles still apply to their AI deployments even where the revised guidance does not formally bind them.
The question of whether a specific LLM deployment constitutes a "model" under the guidance's definition — one that maps inputs to quantitative outputs used to inform business decisions — is not always straightforward for generative applications. An LLM summarizing regulatory filings for a compliance analyst may not meet the traditional model definition. The same LLM generating AML risk narratives that a compliance officer approves without independent review almost certainly does. Institutions need a documented classification process for each LLM use case, not a blanket determination.
A Governance Framework for LLM Deployments
While formal LLM-specific guidance from banking regulators is pending, institutions can apply a structured governance approach using the FS AI RMF and existing SR 11-7 principles. The following framework covers the key control points.
Use Case Classification and Risk Tiering
Start by cataloguing every LLM deployment — production, pilot, and embedded in third-party vendor tools — against a consistent risk tier framework. High-risk use cases include any application where LLM output directly informs a customer-facing decision, a regulatory filing, or a Suspicious Activity Report. Medium-risk covers internal analytical tools with human review. Low-risk includes employee-facing productivity tools with no customer impact. Each tier carries different validation, monitoring, and documentation requirements.
This mirrors the approach recommended in the FS AI RMF, which calls for institutions to evaluate vendor AI directly, including model performance, bias characteristics, hallucination rates, and security posture against AI-specific attacks such as prompt injection. Third-party LLM tools are not exempt from this scrutiny, and fourth-party AI risk, where a vendor's product itself depends on another provider's model, is a real and growing supervisory concern. For more on the FS AI RMF's implications, see our FS AI RMF 90-Day Action Plan for Bank Technology Leaders.
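As a minimal sketch of the tiering step, the inventory can be expressed as structured records with a deterministic tier assignment. The class, field, and function names below (`LLMUseCase`, `assign_tier`, and the boolean flags) are hypothetical and are not taken from the FS AI RMF or SR 11-7. The rules simply encode the three tiers described above, and a real policy would likely add more criteria.

```python
from dataclasses import dataclass
from enum import Enum


class RiskTier(Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


@dataclass(frozen=True)
class LLMUseCase:
    # Hypothetical inventory schema; adapt the fields to your own policy.
    name: str
    stage: str  # "production", "pilot", or "vendor-embedded"
    informs_customer_decision: bool = False
    informs_regulatory_filing: bool = False
    informs_sar: bool = False
    internal_analytics_with_review: bool = False


def assign_tier(use_case: LLMUseCase) -> RiskTier:
    """Assign a tier using the three-tier description in the text."""
    if (
        use_case.informs_customer_decision
        or use_case.informs_regulatory_filing
        or use_case.informs_sar
    ):
        return RiskTier.HIGH
    if use_case.internal_analytics_with_review:
        return RiskTier.MEDIUM
    # Employee-facing productivity tools with no customer impact.
    return RiskTier.LOW


inventory = [
    LLMUseCase("AML narrative drafting", "production", informs_sar=True),
    LLMUseCase("Regulatory filing summarizer", "pilot",
               internal_analytics_with_review=True),
    LLMUseCase("Meeting notes assistant", "vendor-embedded"),
]

for use_case in inventory:
    print(f"{use_case.name}: {assign_tier(use_case).value}")
```

Checking the high-risk conditions first means a tool that is both an internal analytics aid and an input to a SAR lands in the higher tier.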
Hallucination Detection and Output Validation
Every high-risk LLM deployment should implement a multi-stage output validation pipeline. This includes retrieval-augmented generation (RAG) architecture that grounds model responses in verified internal documents rather than relying on training data alone; automated fact-checking layers that cross-reference generated claims against authoritative sources; confidence scoring that flags low-certainty outputs for human review before they reach a decision point; and systematic red-teaming to probe the model for domain-specific failure modes including regulatory misquotation and financial fabrication.
The EY organization's published guidance on managing hallucination risk in LLM deployments describes embedding auditable data records in the model workflow from the outset as the most effective long-term control — not post-hoc monitoring, but architecture that makes every output traceable to its source material.
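The grounding and confidence checks in such a pipeline might look roughly like the sketch below. It assumes the generation step returns each claim with a cited source ID and a supporting quote, and that a model-level confidence score is available. The names `Claim`, `validate_output`, and the 0.8 threshold are illustrative assumptions, not part of any framework cited here. The substring match stands in for whatever fact-checking layer an institution actually uses.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Claim:
    text: str                        # sentence the model generated
    source_id: Optional[str] = None  # document the model cites, if any
    quote: str = ""                  # passage the model says supports the claim


@dataclass(frozen=True)
class ValidationResult:
    release: bool        # True only if every check passed
    reasons: List[str]   # why the output was held for human review


def validate_output(
    claims: List[Claim],
    sources: Dict[str, str],
    confidence: float,
    min_confidence: float = 0.8,  # assumed threshold, set by policy
) -> ValidationResult:
    reasons: List[str] = []
    for claim in claims:
        doc = sources.get(claim.source_id) if claim.source_id else None
        if doc is None:
            reasons.append(f"ungrounded claim: {claim.text!r}")
        elif not claim.quote or claim.quote.lower() not in doc.lower():
            reasons.append(f"quote not found in {claim.source_id}: {claim.text!r}")
    if confidence < min_confidence:
        reasons.append(f"confidence {confidence:.2f} below {min_confidence:.2f}")
    return ValidationResult(release=not reasons, reasons=reasons)


# Hypothetical authoritative store keyed by document ID.
sources = {"rate-sheet-2026-03": "The standard APR for the product is 7.25 percent."}

grounded = [Claim("The APR is 7.25 percent.", "rate-sheet-2026-03",
                  "APR for the product is 7.25 percent")]
ungrounded = [Claim("The APR is 5 percent.")]

print(validate_output(grounded, sources, confidence=0.93))
print(validate_output(ungrounded, sources, confidence=0.50))
```

Any output that fails a check is held with its reasons attached, so the reviewer sees why it was routed to them.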
Audit Trail Engineering
Build logging infrastructure before deploying any LLM to a production workflow. Required fields include: session ID, user ID, timestamp, full prompt text, full response text, model version, retrieval sources cited, confidence score if available, human review decision if applicable, and final action taken. These logs must be retained on a schedule consistent with the institution's record-management policy and be accessible to internal audit, model risk management, and examiners.
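One way to make the field list concrete is an immutable record serialized as one JSON line per interaction. This is a sketch under assumptions: the `LLMAuditRecord` class, its field names, and the `to_log_line` helper are hypothetical, and the storage, retention, and access controls around the log are out of scope here.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class LLMAuditRecord:
    # Fields follow the list in the text; names are illustrative.
    session_id: str
    user_id: str
    prompt: str
    response: str
    model_version: str
    retrieval_sources: List[str]
    final_action: str
    confidence: Optional[float] = None           # if the platform exposes one
    human_review_decision: Optional[str] = None  # if a review took place
    record_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def to_log_line(record: LLMAuditRecord) -> str:
    """Serialize a record as a single JSON line for an append-only log."""
    return json.dumps(asdict(record), sort_keys=True)


record = LLMAuditRecord(
    session_id="s-1001",
    user_id="analyst-42",
    prompt="Summarize the alert history for case 7.",
    response="Three alerts were raised in the review period.",
    model_version="example-model-2026-01",
    retrieval_sources=["case-7-alert-log"],
    final_action="escalated_to_reviewer",
    human_review_decision="approved_with_edits",
)
print(to_log_line(record))
```

Freezing the dataclass keeps a record from being altered in memory after it is created. Tamper resistance of the stored log itself has to come from the storage layer.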
Integrate LLM monitoring into existing model risk dashboards so that hallucination rate trends, escalation volumes, and user override patterns are visible to the model risk committee. Anomalies in these metrics are potential leading indicators of model drift or emerging misuse patterns.
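The dashboard metrics can be derived from reviewed log records. The sketch below is illustrative only: `ReviewedOutput`, `summarize`, `breaches_baseline`, and the 0.02 tolerance are assumed names and values, and it reports escalations and overrides as rates rather than raw volumes so that periods with different traffic can be compared.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class ReviewedOutput:
    # Hypothetical per-output flags, for example derived from audit logs.
    hallucination_found: bool
    escalated: bool
    user_override: bool


def summarize(outputs: Iterable[ReviewedOutput]) -> Dict[str, float]:
    """Compute per-period rates for the model risk dashboard."""
    items = list(outputs)
    total = len(items)
    if total == 0:
        return {"hallucination_rate": 0.0, "escalation_rate": 0.0,
                "override_rate": 0.0}
    return {
        "hallucination_rate": sum(o.hallucination_found for o in items) / total,
        "escalation_rate": sum(o.escalated for o in items) / total,
        "override_rate": sum(o.user_override for o in items) / total,
    }


def breaches_baseline(current: float, baseline: float,
                      tolerance: float = 0.02) -> bool:
    """Flag a metric that exceeds its baseline by more than the tolerance."""
    return current - baseline > tolerance


this_week = [
    ReviewedOutput(hallucination_found=True, escalated=True, user_override=False),
    ReviewedOutput(hallucination_found=False, escalated=True, user_override=False),
    ReviewedOutput(hallucination_found=False, escalated=False, user_override=False),
    ReviewedOutput(hallucination_found=False, escalated=False, user_override=False),
]

metrics = summarize(this_week)
print(metrics)
print("review needed:", breaches_baseline(metrics["hallucination_rate"], 0.05))
```

A fixed tolerance is the simplest possible drift signal. A production dashboard would more plausibly use a statistical control limit sized to the sample.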
Human-in-the-Loop Requirements
Certain decision categories must retain a human checkpoint regardless of LLM performance metrics. Final credit approval, fraud adjudication, SAR filing decisions, customer adverse action notices, and regulatory submission language all fall into this category. The LLM may draft, analyze, or summarize, but the accountable decision-maker must be a qualified human who can independently verify the output and document their review.
This is not merely a best-practice recommendation — it is increasingly the expectation embedded in the FS AI RMF's accountability provisions and consistent with the consumer protection framework under ECOA, Regulation B, and the Fair Housing Act as applied to AI-assisted lending. For context on how agentic AI governance is evolving in parallel, see our analysis of Agentic AI in Banking and the SR 11-7 Framework.
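The checkpoint can be enforced in code rather than left to procedure. The sketch below assumes a hypothetical `finalize` function sitting between the LLM draft and any downstream action. The category labels mirror the list above, while `HumanReview`, `HumanReviewRequired`, and the field names are illustrative.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Decision categories that always require a human checkpoint (from the text).
HUMAN_REQUIRED = frozenset({
    "credit_approval",
    "fraud_adjudication",
    "sar_filing",
    "adverse_action_notice",
    "regulatory_submission",
})


@dataclass(frozen=True)
class HumanReview:
    reviewer_id: str
    verified_independently: bool
    notes: str  # the reviewer's documented rationale


class HumanReviewRequired(Exception):
    """Raised when a gated decision lacks a documented human review."""


def finalize(category: str, llm_draft: str,
             review: Optional[HumanReview] = None) -> Dict[str, Optional[str]]:
    """Release an LLM draft only if the category's review rule is satisfied."""
    if category in HUMAN_REQUIRED:
        if review is None:
            raise HumanReviewRequired(f"{category} requires a human reviewer")
        if not review.verified_independently or not review.notes.strip():
            raise HumanReviewRequired(
                f"{category} requires a documented, independent review"
            )
    return {
        "category": category,
        "content": llm_draft,
        "reviewer_id": review.reviewer_id if review else None,
    }


review = HumanReview("officer-17", verified_independently=True,
                     notes="Checked narrative against transaction records.")
print(finalize("sar_filing", "Draft SAR narrative.", review))

try:
    finalize("sar_filing", "Draft SAR narrative.")
except HumanReviewRequired as exc:
    print("blocked:", exc)
```

The gate does not consult model performance at all, which matches the requirement that the checkpoint holds regardless of LLM metrics.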
What the Regulatory Gap Means for Your Institution Right Now
The agencies' decision to exclude generative AI from the April 2026 revised model risk guidance, while simultaneously signaling that an RFI is coming, creates a defined window of regulatory uncertainty. Institutions that establish rigorous LLM governance now will have documented evidence of a sound internal control environment when examiners eventually arrive with specific LLM questions. Institutions that wait for formal guidance to act will face the harder task of retrofitting governance onto deployed systems under scrutiny.
Examiners from the Fed, OCC, and FDIC are already asking about generative AI in technology examinations. The absence of formal LLM guidance does not mean the absence of examination risk. SR 11-7 still applies to any LLM that meets the model definition, and broader safety-and-soundness standards apply to all operational risk, including technology risk, regardless of model classification.
The institutions that own this space early — that have a written LLM governance policy, a use-case inventory, documented validation results, and functioning audit trails — will shape their own examination narratives rather than responding to examiner findings. That advantage is available right now, before the regulatory framework fully solidifies.
Key Takeaways
- The April 2026 revised interagency model risk guidance excludes generative AI and LLMs — a separate RFI is forthcoming, but banks operating LLMs today must govern them under existing SR 11-7 principles and safety-and-soundness standards.
- Hallucination is a compliance event, not just a quality problem. Any LLM output that informs a regulatory filing, customer decision, or adverse action must have a verifiable audit trail connecting it to authoritative source material. RAG architecture and output validation pipelines are the primary technical controls.
- Every LLM deployment needs a documented use-case classification that determines whether it constitutes a model under SR 11-7, what risk tier it falls into, and what validation and monitoring obligations follow. Blanket determinations are unlikely to withstand examiner scrutiny.
- Audit trail infrastructure must be built before deployment, not added after. Logging at the session, prompt, and response level — with human review decisions — is the minimum required to support both internal audit and regulatory examination.
- High-stakes decisions must retain human checkpoints. Credit approval, fraud adjudication, SAR filings, and adverse action notices require an accountable human reviewer who can independently verify LLM-generated outputs regardless of model performance metrics.
- Institutions that build LLM governance now, during the regulatory gap, will enter the examination cycle with documented evidence of a sound control environment rather than scrambling to retrofit governance under scrutiny.