Building Nemesis AI Agents: A Tale of Many (Inter)faces
- stelladelpilar
When we set out to integrate AI capabilities into Nemesis, our Breach and Attack Simulation platform, we faced a common challenge: how do you build a flexible AI agent that supports multiple LLM backends (OpenAI, Claude/Anthropic, and future models) while exposing them through different user interfaces tailored to distinct workflows?
The answer lay in creating a clean separation of concerns with two layers of abstraction: a backend interface that normalizes different AI providers, and dual UI interfaces that serve different user needs.
The Architecture: One Brain, Two Faces
At the core sits NemesisAssistant, our AI agent wrapper that abstracts away the complexity of different LLM backends (OpenAI/AutoGen and Claude Agent SDK). Think of it as a universal translator—it speaks the same language to our applications regardless of which AI model powers it underneath.
This assistant connects to our security platform through the Model Context Protocol (MCP), giving it access to:
- Real-time assessment data and results
- MITRE ATT&CK technique coverage analysis
- Threat intelligence and detection gaps
- Compliance reporting capabilities
The key insight? Security teams don't want to write queries or navigate dashboards when under pressure. They want to ask: "Which credential theft techniques are we vulnerable to?" or "Show me prevention gaps from last week's ransomware simulation."
Real-Time Chatbot Interface
Our chatbot interface handles high-concurrency, ephemeral connections—perfect for embedded chat widgets or mobile apps where users expect instant responses. This interface is embedded directly into the Nemesis platform, allowing analysts to access AI assistance without leaving their current workflow. They can ask questions about the assessment they're viewing, request analysis of results on screen, or get guidance on configurations—all within the same browser context.
WebSocket architecture for real-time interaction:
We expose a single WebSocket endpoint that provides bidirectional streaming, enabling both real-time response delivery and cancellation of long-running tasks.
The bidirectional nature also enables a keep-alive mechanism through periodic ping/pong messages. This is essential in enterprise environments where load balancers, firewalls, and proxy servers often terminate idle connections after 30-60 seconds. Without keep-alive, an analyst asking a complex question that takes 2 minutes to fully answer would see their connection silently dropped mid-response.
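To make the mechanics concrete, here is a minimal sketch of such an endpoint, assuming a FastAPI WebSocket route; the /ws/chat path, the JSON message shapes, and the EchoBackend stand-in are illustrative rather than the actual Nemesis implementation.

import asyncio
import json

from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()
PING_INTERVAL = 20  # seconds; keeps load balancers and proxies from dropping "idle" connections


class EchoBackend:
    """Stand-in for the real agent backend; streams the message back word by word."""

    async def stream_response(self, message: str):
        for word in message.split():
            yield word + " "


backend = EchoBackend()


@app.websocket("/ws/chat")
async def chat_ws(websocket: WebSocket):
    await websocket.accept()

    async def keep_alive():
        # Application-level ping so intermediaries see traffic even during long generations
        while True:
            await asyncio.sleep(PING_INTERVAL)
            await websocket.send_json({"type": "ping"})

    ping_task = asyncio.create_task(keep_alive())
    try:
        while True:
            data = json.loads(await websocket.receive_text())
            if data.get("type") == "pong":
                continue  # client answered a ping; nothing else to do
            # Stream tokens back as they arrive from the agent backend
            async for token in backend.stream_response(data["message"]):
                await websocket.send_json({"type": "token", "content": token})
            await websocket.send_json({"type": "done"})
    except WebSocketDisconnect:
        pass
    finally:
        ping_task.cancel()

In the real service, the stand-in backend is replaced by the AgentBackend abstraction described later in this post.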
Resource management:
We support streaming cancellation, which immediately terminates the current task and cleans up its resources.
The interface also implements idle timeouts and maximum connection duration limits to prevent resource leaks from forgotten browser tabs—critical when you're paying per-token for LLM calls.
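Both behaviors reduce to a couple of asyncio primitives. A minimal sketch, with the helper names and timeout values as illustrative placeholders:

import asyncio
from typing import Optional


async def cancel_current_task(current_task: Optional[asyncio.Task]) -> None:
    """Invoked when the client sends a cancel message: stop the in-flight LLM stream now."""
    if current_task and not current_task.done():
        current_task.cancel()
        try:
            await current_task  # let the task's finally blocks release resources
        except asyncio.CancelledError:
            pass


async def receive_with_idle_timeout(websocket, idle_timeout: float = 300.0) -> str:
    """Raise TimeoutError on idle connections, so forgotten tabs stop costing tokens."""
    return await asyncio.wait_for(websocket.receive_text(), timeout=idle_timeout)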
Advanced AI Operator Interface
For deeper investigative work, our advanced operator interface provides a persistent, shareable conversation experience backed by PostgreSQL. We built it on Chainlit, a framework designed for AI chat applications, and integrated it deeply with our platform's authentication and authorization system.
The authentication challenge was significant. Chainlit has its own authentication mechanisms when running as a stand-alone application, but we needed it to respect our platform's existing JWT-based authentication and multi-tenant access controls. The solution was implementing Chainlit's authentication callback to intercept the WebSocket handshake, extract our JWT tokens from cookies, validate them against our internal authentication service, and only then create a Chainlit session.
Token validation happens in two steps: first, we extract the JWT from our platform’s auth cookie, then call our validation endpoint. Only after successful validation do we create a chainlit.User object that Chainlit recognizes.
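As a rough sketch, Chainlit's header-based auth callback can be wired up along these lines; the nemesis_jwt cookie name, the validation URL, and the claim fields are assumptions for illustration, not the actual Nemesis values.

from typing import Dict, Optional

import chainlit as cl
import httpx

AUTH_VALIDATE_URL = "http://auth-service/auth/validate"  # hypothetical internal endpoint


@cl.header_auth_callback
def auth_from_platform_jwt(headers: Dict) -> Optional[cl.User]:
    # Step 1: pull the platform JWT out of the cookies sent with the WebSocket handshake
    raw_cookies = headers.get("cookie", "")
    cookies = dict(
        pair.strip().split("=", 1) for pair in raw_cookies.split(";") if "=" in pair
    )
    token = cookies.get("nemesis_jwt")  # assumed cookie name
    if not token:
        return None  # no session: Chainlit rejects the connection

    # Step 2: validate the token against the platform's auth service
    resp = httpx.post(AUTH_VALIDATE_URL, json={"token": token})
    if resp.status_code != 200:
        return None

    claims = resp.json()
    # Only now hand Chainlit a user object it recognizes
    return cl.User(identifier=claims["sub"], metadata={"tenant": claims.get("tenant")})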
JWT tokens expire, so we implemented proactive token refresh by calling our platform's refresh endpoint before the token expires. This ensures analysts never experience session interruptions during critical investigations.
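One way to schedule that refresh is a small background task that renews the token shortly before expiry; the refresh endpoint and the session dict shape below are hypothetical.

import asyncio
import time

import httpx

REFRESH_URL = "http://auth-service/auth/refresh"  # hypothetical internal endpoint
REFRESH_MARGIN = 60                               # renew one minute before expiry


async def keep_token_fresh(session: dict) -> None:
    """Background task: refresh the JWT before it expires so the chat session never breaks."""
    while True:
        sleep_for = max(session["expires_at"] - time.time() - REFRESH_MARGIN, 0)
        await asyncio.sleep(sleep_for)
        async with httpx.AsyncClient() as client:
            resp = await client.post(REFRESH_URL, json={"token": session["token"]})
        resp.raise_for_status()
        session.update(resp.json())  # expects fresh "token" and "expires_at" fields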
Message history restoration was another challenge. When analysts share conversation URLs with colleagues, they expect context to carry over. We serialize the full conversation history into a single context-priming message:
transcript = "=== Previous Conversation Context ===\n"
for msg in chat_history:
    transcript += f"[{msg['role'].upper()}]: {msg['content']}\n"
This approach works for different backends like Claude and OpenAI without requiring backend-specific state management, though it does increase token consumption proportional to conversation length.
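For illustration, the same serialization can be folded into a small helper that prepends the transcript to the next user turn; the history-of-dicts shape and the closing marker are assumptions, not the exact Nemesis code.

def build_context_message(chat_history: list[dict], new_message: str) -> str:
    """Prepend the serialized history to the next user turn so any backend sees prior context."""
    transcript = "=== Previous Conversation Context ===\n"
    for msg in chat_history:
        transcript += f"[{msg['role'].upper()}]: {msg['content']}\n"
    return f"{transcript}=== End of Context ===\n\n{new_message}"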
RAG Knowledge Base: Starting Simple, Planning for Scale
To ground the AI agent in Nemesis-specific knowledge, we implemented a Retrieval-Augmented Generation (RAG) system as a custom agent tool. The agent can query platform documentation, best practices, and troubleshooting guides using semantic similarity search.
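Conceptually, the tool boils down to a nearest-neighbor lookup over precomputed embeddings. Here is a minimal sketch using cosine similarity; the function signature and in-memory store are assumptions for illustration.

import numpy as np


def query_knowledge_base(query_embedding: np.ndarray,
                         doc_embeddings: np.ndarray,
                         documents: list[str],
                         top_k: int = 3) -> list[str]:
    """Return the top_k documentation chunks most similar to the query."""
    # Normalize so dot products become cosine similarities
    q = query_embedding / np.linalg.norm(query_embedding)
    d = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
    scores = d @ q
    best = np.argsort(scores)[::-1][:top_k]
    return [documents[int(i)] for i in best]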
As the next evolutionary step we're designing a comprehensive knowledge management system with:
- Dynamic ingestion pipelines — Automatically generate embeddings from updated documentation and assessment results
- Intra-tenant data isolation — Nemesis environments are standalone, and even within a single environment you can compartmentalize into isolated organizations. Each organization's custom scenarios, internal runbooks, and threat intelligence are stored separately with proper access controls
- Public and private knowledge bases — Shared security best practices remain globally accessible; organization-specific data (vulnerable assets, remediation procedures, incident postmortems) stays private
The beauty of our tool-based architecture is that this upgrade will be transparent to the agent. The query_knowledge_base tool interface remains the same; the backend just gets smarter.
Configuration-Driven Architecture
Everything we've described is fully configurable via YAML. You can swap LLM backends, enable/disable MCP servers, configure built-in tools, adjust system prompts, and control authentication flows without touching code.
Example configuration structure:
agent:
  type: claudeagentsdk
  claudeagentsdk:
    model: claude-sonnet-4-5
    max_turns: 1000
    disallowed_tools:
      - Bash    # Block shell execution
      - Write   # Block file modifications
      - Delete  # Block file deletion
  builtin_tools:
    web_search:
      enabled: true
    read:
      enabled: true
  system_message: |
    You are Nemesis, an expert AI security consultant...
    [extensive prompt engineering here]

mcp_servers:
  enabled: true
  servers:
    - name: nemesis_observer
      enabled: true
      transport: http
      url: ${NEMESIS_OBSERVER_MCP_URL}
      tools:
        - "mcp__nemesis_observer__list_assessments"
        - "mcp__nemesis_observer__fetch_atomic_runs"
    - name: ti_threat_intel
      enabled: true
      transport: http
      url: ${TI_MCP_URL}
      tools:
        - "mcp__ti__get_report"
        - "mcp__ti__search_actors"

knowledge_base:
  ...

tools:
  query_knowledge_base:
    enabled: true
  security_tools:
    threat_analysis:
      enabled: true
    vulnerability_assessment:
      enabled: true
This configuration-first approach means:
- Backend flexibility — Switch LLM backends by changing agent.type (see the sketch after this list)
- Tool governance — Security teams control which operations the AI can perform
- Environment-specific configs — Development uses mock MCP servers; production connects to live platform APIs
- Feature toggles — Enable/disable capabilities (web search, code execution, threat intelligence) per deployment
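A rough sketch of what that backend switch can look like at startup; the concrete backend classes and the openai_autogen key are placeholders, not the real Nemesis implementations.

import yaml


# Stand-ins for the concrete AgentBackend implementations (names are illustrative)
class ClaudeAgentBackend:
    def __init__(self, cfg: dict):
        self.cfg = cfg


class AutoGenBackend:
    def __init__(self, cfg: dict):
        self.cfg = cfg


BACKENDS = {
    "claudeagentsdk": ClaudeAgentBackend,
    "openai_autogen": AutoGenBackend,  # hypothetical key for the OpenAI/AutoGen backend
}


def build_backend(config_path: str):
    """Select the backend class from agent.type; swapping backends is a one-line YAML change."""
    with open(config_path) as f:
        config = yaml.safe_load(f)
    agent_cfg = config["agent"]
    backend_cls = BACKENDS[agent_cfg["type"]]
    return backend_cls(agent_cfg.get(agent_cfg["type"], {}))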
The Backend Abstraction Layer
Both UI interfaces rely on a common AgentBackend abstraction that handles:
- Streaming responses — Yielding tokens as they arrive for responsive UX
- History management — Maintaining conversation context across turns
- Resource cleanup — Shutting down cleanly and releasing connections, processes, and pending tasks
Here's the contract both backends implement:
from abc import ABC, abstractmethod
from typing import AsyncGenerator, Optional

from autogen_core import CancellationToken  # cancellation primitive from the AutoGen backend


class AgentBackend(ABC):
    @abstractmethod
    async def stream_response(
        self, message: str, cancellation_token: Optional[CancellationToken] = None
    ) -> AsyncGenerator[str, None]:
        """Stream response tokens or control markers like [[TOOL:name]]"""

    @abstractmethod
    async def restore_conversation_history(self) -> None:
        """Restore prior conversation context"""

    @abstractmethod
    async def shutdown(self) -> None:
        """
        Shutdown the backend and cleanup resources.

        This includes:
        - Closing connections to LLM APIs
        - Terminating MCP server processes (if using stdio transport)
        - Releasing memory and file handles
        - Cancelling pending async tasks
        """
The real power emerges when the agent calls MCP tools. For example, when an analyst asks about failed prevention tests:
# Agent internally calls via MCP:
mcp__nemesis_observer__get_assessment_atomics_by_run_id(
    run_id=latest_run,
    filter_status="PREVENTED_FALSE"
)
The UI streams back:
[[TOOL:get_assessment_atomics_by_run_id]] # Tool usage indicator
Then the actual analysis results, formatted naturally by the LLM.
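On the consuming side, a small router can split those control markers from answer text. A sketch assuming the [[TOOL:name]] convention from the backend contract above, with show_tool and show_text as placeholder UI callbacks:

import re
from typing import AsyncIterable, Callable

TOOL_MARKER = re.compile(r"^\[\[TOOL:(?P<name>[^\]]+)\]\]$")


async def render_stream(
    stream: AsyncIterable[str],
    show_tool: Callable[[str], None],
    show_text: Callable[[str], None],
) -> None:
    """Route control markers to a tool indicator and everything else to the chat view."""
    async for chunk in stream:
        match = TOOL_MARKER.match(chunk.strip())
        if match:
            show_tool(match.group("name"))  # e.g. "get_assessment_atomics_by_run_id"
        else:
            show_text(chunk)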
Why This Matters for Security
Traditional BAS platforms force security teams to learn proprietary query languages or click through dozens of filters to answer simple questions. Our AI agent approach inverts this:
- Natural language queries replace complex filters
- Contextual follow-ups enable investigative workflows ("Now show me mitigation steps")
- Threat intelligence integration via web search tools provides external context
- Automated report generation from raw assessment data
Most importantly, the dual-interface architecture means:
- SOC analysts get instant answers via the chatbot interface during active incidents
- Security engineers can deep-dive with persistent operator sessions, sharing findings with stakeholders via URLs
- Executives receive AI-generated summaries without touching the underlying complexity
What We Learned
- Token refresh must be proactive — Waiting until expiration causes conversation interruptions
- Backend abstraction pays off — Switching from OpenAI to Claude takes one config change and no interface rewrites; we're ready for future models and even custom LLM hosting
- Configuration is king — YAML-driven architecture lets security teams control AI behavior without developer involvement
- Knowledge is evolving, not static — Start with simple embedding stores, but design for dynamic knowledge ingestion from day one