Architecture Overview¶

Comprehensive system architecture for the AI-Augmented Security Operations Center (AI-SOC) platform.

Executive Summary¶

The AI-SOC platform implements a microservices-based architecture designed for scalability, resilience, and operational intelligence. The system integrates traditional SIEM capabilities with cutting-edge machine learning and large language models to provide autonomous threat detection, analysis, and response capabilities.

Core Design Principles: - Microservices Architecture: Independent, loosely-coupled services enable fault isolation and horizontal scaling - Defense in Depth: Multi-layered security with network segmentation and zero-trust principles - API-First Design: RESTful interfaces enable integration and extensibility - Observable by Default: Comprehensive metrics, logs, and traces for operational visibility - Infrastructure as Code: Complete configuration management via Docker Compose

System Architecture¶

High-Level Architecture¶

┌────────────────────────────────────────────────────────────────────┐
│                        External Data Sources                        │
│  Network Traffic, System Logs, Security Events, Threat Intelligence│
└───────────────────────────────┬────────────────────────────────────┘
                                │
        ┌───────────────────────┴────────────────────────┐
        │                                                  │
        ▼                                                  ▼
┌──────────────────────┐                    ┌─────────────────────────┐
│  Network Analysis    │                    │   External Log Sources  │
│  ─────────────────   │                    │   ──────────────────    │
│  • Suricata IDS/IPS  │                    │   • System Logs         │
│  • Zeek Monitor      │                    │   • Application Logs    │
│  • Packet Capture    │                    │   • Cloud Security Logs │
└──────────┬───────────┘                    └────────────┬────────────┘
           │                                             │
           └─────────────────┬───────────────────────────┘
                             │
                             ▼
            ┌────────────────────────────────┐
            │      SIEM Core (Phase 1)       │
            │  ─────────────────────────     │
            │  • Wazuh Manager (Ingestion)   │
            │  • Wazuh Indexer (Storage)     │
            │  • Wazuh Dashboard (UI)        │
            └───────────┬────────────────────┘
                        │
        ┌───────────────┼────────────────┐
        │               │                 │
        ▼               ▼                 ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────────┐
│  AI Services │ │ SOAR Stack   │ │   Monitoring     │
│  ─────────── │ │ ──────────── │ │   ──────────     │
│  • ML Models │ │ • TheHive    │ │   • Prometheus   │
│  • LLM Agent │ │ • Cortex     │ │   • Grafana      │
│  • RAG/CTI   │ │ • Shuffle    │ │   • AlertManager │
└──────────────┘ └──────────────┘ └──────────────────┘
        │               │                 │
        └───────────────┴─────────────────┘
                        │
                        ▼
        ┌───────────────────────────────┐
        │   Orchestration & Response    │
        │   ───────────────────────     │
        │   • Automated Playbooks       │
        │   • Case Management           │
        │   • Incident Response         │
        └───────────────────────────────┘

Architectural Layers¶

Layer 1: Data Ingestion¶

Purpose: Collect and normalize security telemetry from diverse sources.

Components: - Suricata IDS/IPS - Network-based intrusion detection using signature and anomaly detection - Zeek Network Monitor - Passive network traffic analysis and metadata extraction - Filebeat - Log shipping agent for centralized log collection - Wazuh Agents - Host-based security monitoring and file integrity

Design Rationale: - Multi-source ingestion provides comprehensive visibility across network and host layers - Standard log formats (JSON, CEF, Syslog) enable interoperability - Buffering and retry mechanisms ensure reliable data delivery

Performance Characteristics: - Throughput: 10,000+ events/second sustained - Latency: <100ms from event generation to indexing - Reliability: 99.9% delivery guarantee with persistent queues

Layer 2: SIEM Core¶

Purpose: Centralized log aggregation, correlation, and persistent storage.

Components: - Wazuh Manager - Event processing, correlation engine, API gateway - Wazuh Indexer - OpenSearch-based distributed search and analytics engine - Wazuh Dashboard - Web-based visualization and investigation interface

Technology Stack: - OpenSearch 2.x (distributed search engine) - Wazuh 4.8.2 (security information management) - Kibana fork (visualization framework)

Design Rationale: - OpenSearch provides horizontal scalability for petabyte-scale log storage - Wazuh's rule-based correlation enables real-time threat detection - RESTful API enables programmatic access for automation

Data Flow:

Event → Wazuh Manager → Rule Engine → Correlation → Indexer → Storage
                ↓
          Alert Generation → Webhook → SOAR

Performance Characteristics: - Indexing Rate: 50,000 events/second (3-node cluster) - Query Latency: <500ms for 90th percentile - Retention: 30 days hot storage, 365 days warm/cold tiers - Storage Efficiency: 10:1 compression ratio

Layer 3: AI Services¶

Purpose: Autonomous threat detection, classification, and contextual analysis using machine learning and large language models.

Architecture:

┌──────────────────────────────────────────────────────┐
│              AI Services Layer                        │
├──────────────────────────────────────────────────────┤
│                                                       │
│  ┌───────────────┐      ┌──────────────────┐        │
│  │ ML Inference  │◄────►│  Alert Triage    │        │
│  │    Engine     │      │    Service       │        │
│  ├───────────────┤      ├──────────────────┤        │
│  │ Random Forest │      │ LLaMA 3.1:8b     │        │
│  │ XGBoost       │      │ Risk Scoring     │        │
│  │ Decision Tree │      │ Prioritization   │        │
│  └───────────────┘      └─────────┬────────┘        │
│                                    │                  │
│                         ┌──────────▼────────┐        │
│                         │  RAG Service      │        │
│                         ├───────────────────┤        │
│                         │ MITRE ATT&CK DB   │        │
│                         │ Threat Intel      │        │
│                         │ ChromaDB Vector   │        │
│                         └───────────────────┘        │
└──────────────────────────────────────────────────────┘

Components:

1. ML Inference Engine - Models: Random Forest (primary), XGBoost (low-FP), Decision Tree (interpretable) - Performance: 99.28% accuracy, 0.8ms inference latency - API: FastAPI with automatic OpenAPI documentation - Deployment: Docker containerized with health checks

2. Alert Triage Service - LLM: LLaMA 3.1:8b via Ollama runtime - Function: Natural language analysis of security alerts - Capabilities: - Risk scoring (0-100 scale) - Attack classification - Recommended response actions - Executive summaries

3. RAG Service - Knowledge Base: 823 MITRE ATT&CK techniques - Vector Database: ChromaDB for semantic search - Retrieval: Top-k context retrieval for LLM augmentation - Latency: <50ms for 5 nearest neighbors

Design Rationale: - Ensemble Approach: Multiple ML models provide redundancy and complementary strengths - Hybrid Intelligence: Traditional ML (fast, deterministic) + LLM (contextual, adaptive) - Offline-First: Models deployed locally, no external API dependencies - Explainability: Decision tree model provides full transparency for compliance

Data Flow:

Alert → ML Classification → Prediction (BENIGN/ATTACK)
                          ↓
                    Alert Triage
                          ↓
              ┌───────────┴──────────┐
              ▼                       ▼
        RAG Retrieval           LLM Analysis
    (MITRE Techniques)       (Natural Language)
              │                       │
              └───────────┬───────────┘
                          ▼
              Enriched Alert (Risk Score,
               Classification, Context)
                          ▼
                      TheHive

Layer 4: SOAR Stack¶

Purpose: Security orchestration, automation, and response.

Components: - TheHive - Collaborative case management platform - Cortex - Observable analysis engine with 100+ analyzers - Shuffle - Workflow automation and playbook execution

Integration Points: - Wazuh → TheHive (webhook-based alert ingestion) - TheHive → Cortex (automated IOC enrichment) - TheHive → Shuffle (workflow triggers) - Shuffle → Response Actions (firewall rules, EDR isolation, notifications)

Design Rationale: - TheHive provides centralized case management for multi-analyst collaboration - Cortex automates repetitive analysis tasks (IP reputation, file hashing, threat intel) - Shuffle enables no-code playbook development for rapid response

Workflow Example:

Wazuh Alert → TheHive Case
                    ↓
          Cortex Analysis (IP reputation, geolocation)
                    ↓
         Shuffle Playbook Execution
                    ↓
         ┌──────────┴──────────┐
         ▼                      ▼
   Block IP (Firewall)    Notify SOC Team

Layer 5: Monitoring & Observability¶

Purpose: Real-time health monitoring, performance metrics, and alerting.

Components: - Prometheus - Time-series metrics database - Grafana - Visualization and dashboards - AlertManager - Alert routing and deduplication - Loki - Log aggregation for troubleshooting - cAdvisor + Node Exporter - Container and host metrics

Metrics Collection: - 13 scrape targets across all services - 15-second scrape interval - 30-day retention for high-resolution data

Dashboards: - SIEM Stack Health (Wazuh Manager, Indexer, Dashboard) - ML Model Performance (inference latency, prediction distribution) - AI Services Metrics (LLM response times, RAG retrieval accuracy) - Infrastructure Resources (CPU, RAM, disk, network)

Alerting Rules: - Service down detection (<30 seconds) - Resource exhaustion (CPU >80%, RAM >90%) - ML model drift detection - Abnormal false positive rates

Design Rationale: - Prometheus provides industry-standard metrics format (compatible with all major tools) - Grafana enables custom dashboards for different stakeholder personas (SOC analyst, engineer, executive) - AlertManager prevents alert fatigue through intelligent grouping and inhibition

Network Architecture¶

Network Segmentation¶

Isolation Strategy: Backend/Frontend network separation per stack.

Network	Subnet	Purpose	Security Posture
siem-backend	172.20.0.0/24	SIEM internal comms	No external exposure
siem-frontend	172.21.0.0/24	SIEM web UI	HTTPS only
soar-backend	172.26.0.0/24	SOAR databases	No external exposure
soar-frontend	172.27.0.0/24	SOAR web UIs	HTTP (reverse proxy recommended)
monitoring	172.28.0.0/24	Observability stack	Internal only
ai-network	172.30.0.0/24	AI/ML services	API gateway protected

Benefits: - Compromised web UI cannot directly access backend databases - Lateral movement requires crossing network boundaries - Simplified firewall rule management - Clear trust boundaries for security policies

Port Allocation¶

Externally Accessible: - 443 (Wazuh Dashboard - HTTPS) - 3000 (Grafana) - 9010 (TheHive) - 9011 (Cortex) - 3001 (Shuffle) - 8500 (ML Inference API) - 8100 (Alert Triage API) - 8300 (RAG Service API)

Internal Only: - 9200 (Wazuh Indexer - OpenSearch) - 55000 (Wazuh Manager API) - 9042 (Cassandra) - 8200 (ChromaDB) - 11434 (Ollama LLM)

See Network Topology for complete port mapping.

Technology Stack¶

Backend Services¶

Component	Technology	Version	Justification
SIEM	Wazuh	4.8.2	Open-source, MITRE ATT&CK mapping, active community
Search Engine	OpenSearch	2.x	Elasticsearch fork, scalable, no licensing restrictions
Case Management	TheHive	5.2.9	Purpose-built for SOC workflows, Cortex integration
Orchestration	Shuffle	1.4.0	Open-source SOAR, drag-drop workflows
Database	Cassandra	4.1.3	Distributed, fault-tolerant, scales horizontally
Vector DB	ChromaDB	Latest	AI-native, embedding support, simple API
Object Storage	MinIO	Latest	S3-compatible, self-hosted

AI/ML Stack¶

Component	Technology	Version	Justification
ML Framework	scikit-learn	1.3+	Industry standard, battle-tested algorithms
LLM Runtime	Ollama	Latest	Local inference, model management, OpenAI-compatible API
LLM Model	LLaMA 3.1	8B params	State-of-the-art open-source, optimal size/performance
API Framework	FastAPI	0.100+	Async support, automatic docs, type safety
Vector Embeddings	sentence-transformers	Latest	Pre-trained models, semantic similarity

Infrastructure¶

Component	Technology	Version	Justification
Container Runtime	Docker	24.0+	Industry standard, mature ecosystem
Orchestration	Docker Compose	V2	Simplified multi-container management
Monitoring	Prometheus	2.48+	De facto standard, extensive integrations
Visualization	Grafana	10.2+	Powerful dashboards, alerting, multi-datasource
Log Aggregation	Loki	2.9+	Prometheus-style log queries, low storage overhead

Scalability Considerations¶

Horizontal Scaling¶

SIEM Stack: - Wazuh Manager: Multi-node cluster with load balancing - Wazuh Indexer: OpenSearch cluster (3+ nodes for HA) - Capacity: 100,000+ events/second with 5-node indexer cluster

AI Services: - ML Inference: Stateless, add replicas behind load balancer - Alert Triage: Horizontal scaling limited by Ollama GPU availability - RAG Service: Stateless, ChromaDB supports distributed deployment

SOAR Stack: - TheHive: Multi-master cluster with Cassandra ring - Shuffle: Worker scaling for parallel workflow execution

Vertical Scaling¶

Resource Limits (per service): - Wazuh Indexer: 16GB RAM (configurable JVM heap) - ML Inference: 1GB RAM, 1 CPU (sufficient for 1,000 req/sec) - Ollama LLM: 8GB RAM minimum (16GB for larger models) - ChromaDB: 4GB RAM for 100K vectors

Performance Targets¶

Metric	Small Deployment	Medium	Large
Event Throughput	1,000/sec	10,000/sec	100,000/sec
Concurrent Analysts	5	25	100
Data Retention	30 days	90 days	365 days
Query Response (p95)	<1s	<500ms	<200ms
ML Inference Latency	<5ms	<2ms	<1ms

High Availability Design¶

Service Redundancy¶

Critical Services (require 99.9% uptime): - Wazuh Manager: 2+ nodes with failover - Wazuh Indexer: 3+ nodes (quorum-based) - Cassandra: 3+ nodes (RF=3)

Non-Critical Services (tolerate brief downtime): - Grafana: Single instance acceptable (read-only impact) - Shuffle: Workflow queue prevents data loss

Data Persistence¶

Volumes: - All stateful services use named Docker volumes - Volume backup strategy: daily snapshots - Retention: 30 days for volume backups

Backup Procedures:

# Wazuh Indexer snapshot
docker exec wazuh-indexer curl -X PUT "localhost:9200/_snapshot/backup"

# Cassandra backup
docker exec cassandra nodetool snapshot

# ChromaDB export
docker exec chromadb curl "http://localhost:8000/api/v1/export"

Security Architecture¶

Defense in Depth¶

Layer 1: Network Segmentation - Isolated Docker networks per stack - No direct backend exposure to internet - Firewall rules restrict inter-service communication

Layer 2: Authentication & Authorization - API key authentication for service-to-service - OAuth2/SAML for user authentication - Role-based access control (RBAC) in TheHive

Layer 3: Encryption - TLS 1.3 for all external communication - Self-signed certificates (development) - Let's Encrypt integration (production)

Layer 4: Secrets Management - Environment variable injection - Docker secrets for production - HashiCorp Vault integration (future)

Layer 5: Audit Logging - All API calls logged to Wazuh - Immutable audit trail - Retention: 365 days minimum

Threat Model¶

Assumed Threats: - External network attackers - Compromised web application - Insider threats (malicious analyst) - Supply chain attacks (vulnerable dependencies)

Mitigations: - Web Application Firewall (WAF) recommended - Principle of least privilege - Audit logging and anomaly detection - Dependency scanning (Dependabot, Snyk)

See Security Guide for detailed hardening procedures.

Integration Patterns¶

Event-Driven Architecture¶

Webhooks: - Wazuh → TheHive: Alert creation on rule match - TheHive → Shuffle: Case status changes trigger workflows - AlertManager → Shuffle: Infrastructure alerts trigger remediation

Benefits: - Loose coupling between services - Asynchronous processing prevents blocking - Retry mechanisms handle transient failures

API-First Design¶

RESTful APIs: - All services expose standardized REST endpoints - OpenAPI/Swagger documentation auto-generated - Consistent error handling (RFC 7807 Problem Details)

Example API Flow:

POST /triage
  → GET /ml-inference/predict (ML classification)
  → GET /rag-service/retrieve (MITRE context)
  → POST /ollama/api/generate (LLM analysis)
  → Response: Enriched alert

Development & Deployment¶

CI/CD Pipeline (Planned)¶

Code Commit → GitHub Actions
                    ↓
              Unit Tests
                    ↓
              Docker Build
                    ↓
         Integration Tests
                    ↓
      Deploy to Staging
                    ↓
         Smoke Tests
                    ↓
    Production Deployment

Configuration Management¶

Environment Variables: - .env file for local development - Docker Compose env_file directive - Secrets injected at runtime

Infrastructure as Code: - All configurations version-controlled - Declarative Docker Compose specifications - Idempotent deployment scripts

Future Architecture Enhancements¶

Short-term (Weeks 3-4)¶

Multi-class ML classification (24 attack types)
Reverse proxy (Nginx/Traefik) for HTTPS termination
Secrets management (HashiCorp Vault)
Automated backups

Medium-term (Months 2-3)¶

Kubernetes migration for production deployments
Multi-region deployment for disaster recovery
Advanced ML models (deep learning, transformers)
Custom Cortex analyzers

Long-term (Months 4-6)¶

Multi-agent collaboration framework
Automated playbook generation via LLM
Predictive threat modeling
Zero-trust network architecture

Appendices¶

A. Service Dependencies¶

Wazuh Dashboard → Wazuh Manager → Wazuh Indexer
TheHive → Cassandra + MinIO
Cortex → Cassandra + TheHive
Shuffle → OpenSearch
Alert Triage → ML Inference + RAG Service + Ollama
RAG Service → ChromaDB
Grafana → Prometheus + Loki
AlertManager → Prometheus

B. Resource Requirements¶

Minimum (Development/Testing): - CPU: 4 cores (8 threads) - RAM: 16GB - Disk: 50GB SSD - Network: 100Mbps

Recommended (Production): - CPU: 8 cores (16 threads) - RAM: 32GB - Disk: 250GB NVMe SSD - Network: 1Gbps

See System Requirements for detailed specifications.

C. Glossary¶

SIEM: Security Information and Event Management
SOAR: Security Orchestration, Automation, and Response
RAG: Retrieval-Augmented Generation
CTI: Cyber Threat Intelligence
MITRE ATT&CK: Adversarial Tactics, Techniques, and Common Knowledge framework
IOC: Indicator of Compromise
EDR: Endpoint Detection and Response

Architecture Documentation Version: 1.0 Last Updated: October 24, 2025 Maintained By: AI-SOC Architecture Team