Glean and the Enterprise AI Search Attack Class: When Indexing Everything Creates New Risk

Glean has become the dominant enterprise AI search platform by doing something genuinely difficult: making the entirety of a company’s knowledge — Confluence, Google Drive, Slack, Salesforce, GitHub, and 96 other connectors — instantly searchable and synthesizable by anyone with an account. At $200M ARR and a $7.2B valuation as of June 2025, with confirmed deployments at Booking.com, Comcast, eBay, LinkedIn, Samsung, Databricks, Canva, T-Mobile, and Intuit, Glean isn’t a niche enterprise tool. It’s infrastructure.

That scale makes it worth examining carefully. Not because Glean has been compromised — no Glean-specific CVEs exist in public databases as of this writing — but because the structural risks of what Glean does don’t disappear with a clean audit report. This piece uses Glean as a lens to examine the enterprise AI search attack class: what it is, where the real risk surfaces are, and what defenders actually need to do about it.

What Glean Gets Right

Start with credit where it’s due. Glean’s security posture is strong relative to the market. The company holds SOC 2 Type II, ISO 27001, HIPAA compliance, and notably ISO 42001 — the first certifiable AI management standard. They operate a single-tenant architecture, meaning your indexed data doesn’t commingle with another customer’s. Their permission system mirrors source ACLs, so if a document is restricted in Confluence, Glean respects that restriction. They co-developed the AWARE framework for AI risk assessment alongside Databricks and Palo Alto Networks.

Their benchmark claims 97.8% prompt injection detection. That’s a serious engineering investment, and it’s better than most enterprises are doing on their own.

So why is this still a case study in structural risk? Because the attack surface isn’t primarily about whether Glean has a vulnerability. It’s about what becomes possible when you build a system that indexes everything and synthesizes across sources.

The Structural Risks That Certifications Don’t Solve

The permission inheritance gap. Glean mirrors your source permissions — it doesn’t fix them. Enterprise data is chronically over-permissioned. Studies consistently find that most employees have access to far more than their role requires, because access control hygiene degrades over time. Permissions accumulate through project assignments, team moves, and inherited group memberships that nobody cleans up.

The LLM synthesis layer changes the calculus here in a fundamental way. Previously, over-permissioned data was partially protected by friction: a user might technically have access to a document but would never navigate directly to it. Glean eliminates that friction by design. The moment someone asks a natural language question, the system can retrieve and synthesize any relevant document they’re authorized to see. Removing the friction that once obscured sensitive content is not a side effect. It is the feature.

The aggregation problem. Individual authorized access doesn’t mean authorized access to the synthesis of all authorized documents. A researcher at Knostic demonstrated this precisely: “An intern technically has access to specific internal docs” — but asking Glean a direct question surfaces a detailed product roadmap that nobody intended to be accessible in that form.

The salary bands example is instructive. An employee may have legitimate access to an HR policy document mentioning that bands exist, a finance spreadsheet containing compensation totals, and a department planning deck referencing headcount. No single document reveals salary bands by role. A natural language query synthesizing all three might. The LLM didn’t break any permissions. The synthesis produced an answer nobody was meant to receive.

Indirect prompt injection via indexed content. Any document in your indexed corpus is a potential attack vector. A Confluence page, a Google Doc, a Slack message, a Jira ticket — any of these could carry hidden instructions that execute when Glean retrieves and processes them.

Glean’s own benchmark detects 90% of indirect injection attempts, which means roughly 1 in 10 gets through in their own testing. EchoLeak (CVE-2025-32711, CVSS 9.3, June 2025) demonstrated that this attack class works at production scale: a zero-click vulnerability in Microsoft 365 Copilot allowed exfiltration of sensitive data from Copilot’s context through a maliciously crafted email. Microsoft 365 Copilot and Glean share the same fundamental pattern: documents are retrieved, synthesized, and acted upon by an LLM. The specific vulnerability does not carry over. The attack class does.
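To put the 1-in-10 figure in perspective, here is a short calculation of how a 90% detection rate compounds over repeated attempts. It assumes each attempt is detected independently, which is our simplification and not something Glean's benchmark states.

```python
def prob_at_least_one_bypass(detection_rate: float, attempts: int) -> float:
    """Probability that at least one of `attempts` injections evades detection.

    Assumes each attempt is detected independently (an illustrative
    simplification, not a property of any real detector).
    """
    if not 0.0 <= detection_rate <= 1.0:
        raise ValueError("detection_rate must be between 0 and 1")
    if attempts < 0:
        raise ValueError("attempts must be non-negative")
    return 1.0 - detection_rate ** attempts


for n in (1, 5, 10, 25):
    print(f"{n:>2} attempts: {prob_at_least_one_bypass(0.90, n):.1%}")
```

Under that assumption, an attacker who plants ten poisoned documents has roughly a 65% chance that at least one is missed. Detection rates matter, but an attacker with write access to the corpus gets to retry.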

RAG poisoning. USENIX Security 2025 published the PoisonedRAG research showing that injecting just five malicious documents into a retrieval corpus achieves 90% attack success against RAG-based systems. In an enterprise context, that means anyone with write access to any indexed source — any document editor, any wiki contributor — is a potential threat vector. The attack doesn’t require elevated privileges. It requires a Confluence account.

Shadow data discovery. Glean makes it trivial to find forgotten sensitive files. This is working as designed. It becomes dangerous when the data estate hasn’t been properly audited — and most enterprise data estates haven’t. Glean’s own documentation surfaces a Google Drive edge case where documents shared with “Anyone with link” become discoverable under four specific conditions that users typically don’t anticipate. That’s a single example from a single connector. Multiply by 100.

Agentic escalation. Glean is actively deploying autonomous agents with write actions. This changes the risk profile substantially. In a read-only search context, a successful prompt injection produces information disclosure. In an agentic context with write permissions, the same injection can produce action execution — sending emails, modifying documents, triggering workflows. The blast radius scales with the capabilities granted to the agent.
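One way to contain that blast radius is to gate write actions whenever untrusted retrieved content was in the model's context. The sketch below is a minimal illustration of the idea. The action names and the `ProposedAction` type are hypothetical and do not reflect Glean's agent API.

```python
from dataclasses import dataclass

# Hypothetical action names for illustration only; these are not Glean's API.
READ_ACTIONS = {"search", "summarize"}
WRITE_ACTIONS = {"send_email", "modify_document", "trigger_workflow"}


@dataclass(frozen=True)
class ProposedAction:
    name: str
    # True when any retrieved (untrusted) document was in the model's context
    # at the time the agent proposed this action.
    influenced_by_retrieved_content: bool


def decide(action: ProposedAction) -> str:
    """Return 'allow', 'require_approval', or 'deny' for a proposed action."""
    if action.name in READ_ACTIONS:
        return "allow"
    if action.name in WRITE_ACTIONS:
        # A write proposed while untrusted content was in context may be the
        # product of an injected instruction, so route it to a human.
        if action.influenced_by_retrieved_content:
            return "require_approval"
        return "allow"
    # Unknown actions are denied by default.
    return "deny"
```

The design choice here is deny-by-default for anything not explicitly listed, plus human approval for any write that retrieved content could have steered. It trades agent autonomy for a smaller blast radius.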

100+ Connectors, 100+ Pivot Points

Glean’s connector ecosystem is a genuine enterprise strength. It’s also a credential attack surface. Each OAuth connection is an authorization grant that, if stolen, provides access to that source system. In August 2025, threat actor UNC6395 compromised OAuth tokens connected to Salesloft and Drift integrations, ultimately affecting 700+ organizations in a single campaign. The attack didn’t require breaking encryption or bypassing MFA — it required stealing tokens that granted existing access.

Glean’s 100+ connectors represent 100+ potential pivot points with the same attack structure. This isn’t a Glean-specific vulnerability. It’s a property of any platform that aggregates OAuth connections at enterprise scale. But the concentration matters: a single Glean credential compromise doesn’t just expose Glean. It potentially exposes every downstream system its connectors touch.

What Actually Helps

Fix permissions at the source before indexing, not after. Glean can only mirror what exists. The inheritance gap is a data hygiene problem masquerading as a search problem. Least-privilege audits upstream of the connector are not optional.

Apply least privilege to connectors themselves. Not every connector needs read access to every folder. Scope OAuth grants to what the use case actually requires.
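A connector scope review can start as a simple set difference between what each connector was granted and what its use case needs. The sketch below assumes you maintain such an inventory yourself. The connector names and scope strings are placeholders, not real OAuth scopes or Glean configuration.

```python
def excess_scopes(granted: set[str], required: set[str]) -> set[str]:
    """Scopes a connector holds but the documented use case does not need."""
    return granted - required


def audit_connectors(connectors: dict[str, dict[str, set[str]]]) -> dict[str, list[str]]:
    """Map each over-scoped connector to its sorted list of excess scopes."""
    findings = {}
    for name, scopes in connectors.items():
        extra = excess_scopes(scopes["granted"], scopes["required"])
        if extra:
            findings[name] = sorted(extra)
    return findings


# Hypothetical inventory; scope strings are placeholders, not real OAuth scopes.
inventory = {
    "wiki": {"granted": {"pages:read"}, "required": {"pages:read"}},
    "file_store": {
        "granted": {"files:read", "files:write", "admin:read"},
        "required": {"files:read"},
    },
}

print(audit_connectors(inventory))
```

Run against this sample inventory, the audit reports `file_store` as holding `admin:read` and `files:write` beyond what it needs, while `wiki` passes cleanly.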

Treat indexed documents as untrusted input. Content in your corpus can carry adversarial instructions. This is a posture shift, not a configuration change — and it should inform how you govern who can write to indexed sources.
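As a rough illustration of what treating content as untrusted can look like, the sketch below flags documents for review before they are indexed. The phrase list is illustrative only and easy to evade. It is a triage heuristic under our own assumptions, not a substitute for a trained injection detector.

```python
import re
import unicodedata

# Illustrative phrases only; real injections vary widely and will evade a list like this.
INSTRUCTION_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior|above) instructions", re.IGNORECASE),
    re.compile(r"disregard (the |your )?system prompt", re.IGNORECASE),
    re.compile(r"do not (tell|inform) the user", re.IGNORECASE),
]


def flag_document(text: str) -> list[str]:
    """Return human-readable reasons a document deserves review before indexing."""
    reasons = []
    if any(p.search(text) for p in INSTRUCTION_PATTERNS):
        reasons.append("instruction-like phrase")
    # Unicode category Cf covers invisible format characters such as
    # zero-width spaces, which can hide text from human reviewers.
    if any(unicodedata.category(ch) == "Cf" for ch in text):
        reasons.append("invisible format characters")
    return reasons
```

Expect false positives: category `Cf` also includes legitimate characters such as soft hyphens and byte-order marks, so flagged documents should go to review rather than be dropped automatically.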

Monitor retrieval patterns, not just access logs. Unusual query patterns — broad topic sweeps, repeated synthesis of sensitive domains — may signal reconnaissance or exploitation. Traditional access logs won’t surface this. Behavioral baselines on retrieval will.
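A behavioral baseline does not need to be sophisticated to be useful. The sketch below flags a user whose daily count of queries touching sensitive topics far exceeds their own history. How you classify a query as sensitive, the threshold `k`, and the minimum history length are all assumptions you would tune for your environment.

```python
from statistics import mean, pstdev


def is_anomalous(history: list[int], today: int, k: float = 3.0, min_days: int = 7) -> bool:
    """Flag `today` if it exceeds the user's own baseline by k standard deviations.

    `history` holds past daily counts of sensitive-topic queries for one user.
    """
    if len(history) < min_days:
        return False  # not enough data to form a baseline
    mu = mean(history)
    sigma = pstdev(history)
    # Floor sigma at 1.0 so a perfectly flat history does not flag tiny changes.
    threshold = mu + k * max(sigma, 1.0)
    return today > threshold


baseline = [2, 3, 1, 2, 4, 2, 3]
print(is_anomalous(baseline, today=4))
print(is_anomalous(baseline, today=40))
```

A per-user baseline matters here because a broad topic sweep that is routine for a security analyst may be highly unusual for an intern.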

Don’t embed secrets in system prompts. Credentials, API keys, and sensitive instructions in system prompts are recoverable through prompt extraction techniques. They don’t belong there.

Red team your own deployment. Ask your Glean instance questions a determined insider or external attacker would ask. You will learn things about your data estate.

The Honest Conclusion

Glean’s security posture is strong relative to the market. That’s not the point.

The point is that a platform which indexes everything and synthesizes across sources creates a risk model that certifications don’t fully address and traditional access controls weren’t designed for. The attack surface lives in the emergent behavior: in what becomes visible when fragmented, individually authorized data gets synthesized at scale; in what becomes executable when agents gain write actions; in what becomes weaponizable when anyone with a Confluence account can influence what the LLM retrieves.

There are currently 402,599 enterprise AI agent hosts publicly reachable with no authentication, according to Capsule Security’s May 2026 report. Glean, properly deployed, is not in that category. But the attack class it belongs to — enterprise AI search and synthesis — is defining the next generation of data exposure risk. The question for every enterprise deploying these systems isn’t whether the vendor passed its audit. It’s whether the security model has caught up to the architecture.
