
Architecture Overview

Comprehensive system architecture for the AI-Augmented Security Operations Center (AI-SOC) platform.


Executive Summary

The AI-SOC platform implements a microservices-based architecture designed for scalability, resilience, and operational intelligence. The system integrates traditional SIEM capabilities with cutting-edge machine learning and large language models to provide autonomous threat detection, analysis, and response capabilities.

Core Design Principles:

  • Microservices Architecture: Independent, loosely coupled services enable fault isolation and horizontal scaling
  • Defense in Depth: Multi-layered security with network segmentation and zero-trust principles
  • API-First Design: RESTful interfaces enable integration and extensibility
  • Observable by Default: Comprehensive metrics, logs, and traces for operational visibility
  • Infrastructure as Code: Complete configuration management via Docker Compose


System Architecture

High-Level Architecture

┌────────────────────────────────────────────────────────────────────┐
│                        External Data Sources                        │
│  Network Traffic, System Logs, Security Events, Threat Intelligence│
└───────────────────────────────┬────────────────────────────────────┘
        ┌───────────────────────┴────────────────────────┐
        │                                                  │
        ▼                                                  ▼
┌──────────────────────┐                    ┌─────────────────────────┐
│  Network Analysis    │                    │   External Log Sources  │
│  ─────────────────   │                    │   ──────────────────    │
│  • Suricata IDS/IPS  │                    │   • System Logs         │
│  • Zeek Monitor      │                    │   • Application Logs    │
│  • Packet Capture    │                    │   • Cloud Security Logs │
└──────────┬───────────┘                    └────────────┬────────────┘
           │                                             │
           └─────────────────┬───────────────────────────┘
            ┌────────────────────────────────┐
            │      SIEM Core (Phase 1)       │
            │  ─────────────────────────     │
            │  • Wazuh Manager (Ingestion)   │
            │  • Wazuh Indexer (Storage)     │
            │  • Wazuh Dashboard (UI)        │
            └───────────┬────────────────────┘
        ┌───────────────┼────────────────┐
        │               │                 │
        ▼               ▼                 ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────────┐
│  AI Services │ │ SOAR Stack   │ │   Monitoring     │
│  ─────────── │ │ ──────────── │ │   ──────────     │
│  • ML Models │ │ • TheHive    │ │   • Prometheus   │
│  • LLM Agent │ │ • Cortex     │ │   • Grafana      │
│  • RAG/CTI   │ │ • Shuffle    │ │   • AlertManager │
└──────────────┘ └──────────────┘ └──────────────────┘
        │               │                 │
        └───────────────┴─────────────────┘
        ┌───────────────────────────────┐
        │   Orchestration & Response    │
        │   ───────────────────────     │
        │   • Automated Playbooks       │
        │   • Case Management           │
        │   • Incident Response         │
        └───────────────────────────────┘

Architectural Layers

Layer 1: Data Ingestion

Purpose: Collect and normalize security telemetry from diverse sources.

Components:

  • Suricata IDS/IPS - Network-based intrusion detection using signature and anomaly detection
  • Zeek Network Monitor - Passive network traffic analysis and metadata extraction
  • Filebeat - Log shipping agent for centralized log collection
  • Wazuh Agents - Host-based security monitoring and file integrity monitoring

Design Rationale:

  • Multi-source ingestion provides comprehensive visibility across network and host layers
  • Standard log formats (JSON, CEF, Syslog) enable interoperability
  • Buffering and retry mechanisms ensure reliable data delivery
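As an illustration of the normalization step, the sketch below maps a Suricata EVE JSON alert onto a minimal common schema. The EVE field names on the input side follow Suricata's documented output; the target field names are illustrative, not the platform's actual schema.

```python
import json

def normalize_suricata_event(raw: str) -> dict:
    """Map a Suricata EVE JSON alert onto a minimal common schema."""
    ev = json.loads(raw)
    return {
        "timestamp": ev["timestamp"],
        "source": "suricata",
        "src_ip": ev.get("src_ip"),
        "dest_ip": ev.get("dest_ip"),
        "signature": ev.get("alert", {}).get("signature"),
        "severity": ev.get("alert", {}).get("severity"),
    }
```

The same shape can be produced from Zeek logs or Wazuh agent events, which is what lets the downstream rule engine stay source-agnostic.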

Performance Characteristics:

  • Throughput: 10,000+ events/second sustained
  • Latency: <100ms from event generation to indexing
  • Reliability: 99.9% delivery guarantee with persistent queues


Layer 2: SIEM Core

Purpose: Centralized log aggregation, correlation, and persistent storage.

Components:

  • Wazuh Manager - Event processing, correlation engine, API gateway
  • Wazuh Indexer - OpenSearch-based distributed search and analytics engine
  • Wazuh Dashboard - Web-based visualization and investigation interface

Technology Stack:

  • OpenSearch 2.x (distributed search engine)
  • Wazuh 4.8.2 (security information management)
  • Kibana fork (visualization framework)

Design Rationale:

  • OpenSearch provides horizontal scalability for petabyte-scale log storage
  • Wazuh's rule-based correlation enables real-time threat detection
  • RESTful API enables programmatic access for automation

Data Flow:

Event → Wazuh Manager → Rule Engine → Correlation → Indexer → Storage
                             │
                             └→ Alert Generation → Webhook → SOAR

Performance Characteristics:

  • Indexing Rate: 50,000 events/second (3-node cluster)
  • Query Latency: <500ms for 90th percentile
  • Retention: 30 days hot storage, 365 days warm/cold tiers
  • Storage Efficiency: 10:1 compression ratio
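Because the indexer exposes the standard OpenSearch REST API, stored alerts can be queried programmatically. A hedged sketch of a query body for recent high-severity alerts, assuming Wazuh's default `wazuh-alerts-*` index pattern and its `rule.level`/`timestamp` alert fields:

```python
def recent_alerts_query(min_level: int = 10, minutes: int = 15) -> dict:
    """Build an OpenSearch query DSL body for high-severity Wazuh alerts
    within the last N minutes, newest first."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"range": {"rule.level": {"gte": min_level}}},
                    {"range": {"timestamp": {"gte": f"now-{minutes}m"}}},
                ]
            }
        },
        "sort": [{"timestamp": {"order": "desc"}}],
        "size": 100,
    }

# The body would be POSTed to https://<indexer>:9200/wazuh-alerts-*/_search
# (internal-only port; see Port Allocation below).
```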


Layer 3: AI Services

Purpose: Autonomous threat detection, classification, and contextual analysis using machine learning and large language models.

Architecture:

┌──────────────────────────────────────────────────────┐
│              AI Services Layer                        │
├──────────────────────────────────────────────────────┤
│                                                       │
│  ┌───────────────┐      ┌──────────────────┐        │
│  │ ML Inference  │◄────►│  Alert Triage    │        │
│  │    Engine     │      │    Service       │        │
│  ├───────────────┤      ├──────────────────┤        │
│  │ Random Forest │      │ LLaMA 3.1:8b     │        │
│  │ XGBoost       │      │ Risk Scoring     │        │
│  │ Decision Tree │      │ Prioritization   │        │
│  └───────────────┘      └─────────┬────────┘        │
│                                    │                  │
│                         ┌──────────▼────────┐        │
│                         │  RAG Service      │        │
│                         ├───────────────────┤        │
│                         │ MITRE ATT&CK DB   │        │
│                         │ Threat Intel      │        │
│                         │ ChromaDB Vector   │        │
│                         └───────────────────┘        │
└──────────────────────────────────────────────────────┘

Components:

1. ML Inference Engine
   • Models: Random Forest (primary), XGBoost (low-FP), Decision Tree (interpretable)
   • Performance: 99.28% accuracy, 0.8ms inference latency
   • API: FastAPI with automatic OpenAPI documentation
   • Deployment: Docker containerized with health checks

2. Alert Triage Service
   • LLM: LLaMA 3.1:8b via Ollama runtime
   • Function: Natural language analysis of security alerts
   • Capabilities: risk scoring (0-100 scale), attack classification, recommended response actions, executive summaries

3. RAG Service
   • Knowledge Base: 823 MITRE ATT&CK techniques
   • Vector Database: ChromaDB for semantic search
   • Retrieval: Top-k context retrieval for LLM augmentation
   • Latency: <50ms for 5 nearest neighbors
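The Alert Triage Service's LLM call can be sketched against Ollama's `/api/generate` endpoint. The prompt wording and the alert field names below are illustrative, not the service's actual templates:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # internal-only port (see Port Allocation)

def build_triage_prompt(alert: dict) -> str:
    """Assemble a triage prompt from a normalized alert (field names illustrative)."""
    return (
        "You are a SOC analyst. Assess the alert below.\n"
        f"Rule: {alert.get('rule', 'unknown')}\n"
        f"Source IP: {alert.get('src_ip', 'unknown')}\n"
        "Respond with a risk score (0-100), an attack classification, "
        "and a recommended response action."
    )

def triage(alert: dict, model: str = "llama3.1:8b") -> str:
    """POST the prompt to Ollama and return the generated text."""
    body = json.dumps({"model": model, "prompt": build_triage_prompt(alert), "stream": False}).encode()
    req = urllib.request.Request(OLLAMA_URL, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

In the deployed service the generated text would then be parsed into the structured risk score and classification fields.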

Design Rationale:

  • Ensemble Approach: Multiple ML models provide redundancy and complementary strengths
  • Hybrid Intelligence: Traditional ML (fast, deterministic) + LLM (contextual, adaptive)
  • Offline-First: Models deployed locally, no external API dependencies
  • Explainability: Decision tree model provides full transparency for compliance
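The ensemble approach reduces to a combination rule over the three classifiers' outputs. One possible sketch is a majority vote with ties resolved toward ATTACK; the fail-safe tie-break is an assumption here, not the documented policy:

```python
from collections import Counter

def ensemble_vote(predictions: list) -> str:
    """Majority vote over per-model labels; ties resolve to ATTACK (fail-safe)."""
    counts = Counter(predictions)
    if counts["ATTACK"] >= counts["BENIGN"]:
        return "ATTACK"
    return "BENIGN"
```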

Data Flow:

Alert → ML Classification → Prediction (BENIGN/ATTACK)
                   │
                   ▼
             Alert Triage
          ┌────────┴────────┐
          ▼                 ▼
    RAG Retrieval      LLM Analysis
 (MITRE Techniques)  (Natural Language)
          │                 │
          └────────┬────────┘
                   ▼
     Enriched Alert (Risk Score,
      Classification, Context)
                   │
                   ▼
               TheHive
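At its core, the RAG retrieval step is a nearest-neighbour search over technique embeddings. A toy sketch of the idea follows; the deployed service delegates this to ChromaDB, and the 2-dimensional vectors here are stand-ins for real sentence-transformer embeddings:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query_vec, corpus, k=5):
    """Return the k technique IDs whose embeddings are closest to the query."""
    ranked = sorted(corpus, key=lambda tid: cosine(query_vec, corpus[tid]), reverse=True)
    return ranked[:k]
```

The retrieved technique IDs (and their descriptions) are what get appended to the LLM prompt as context.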


Layer 4: SOAR Stack

Purpose: Security orchestration, automation, and response.

Components:

  • TheHive - Collaborative case management platform
  • Cortex - Observable analysis engine with 100+ analyzers
  • Shuffle - Workflow automation and playbook execution

Integration Points:

  • Wazuh → TheHive (webhook-based alert ingestion)
  • TheHive → Cortex (automated IOC enrichment)
  • TheHive → Shuffle (workflow triggers)
  • Shuffle → Response Actions (firewall rules, EDR isolation, notifications)
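For the Wazuh → TheHive hand-off, the webhook payload transform might look like the sketch below. The output field names follow TheHive's alert model, but the exact schema and the severity mapping are assumptions to verify against the deployed TheHive version:

```python
def wazuh_to_thehive_alert(event: dict) -> dict:
    """Map a Wazuh alert JSON document to a TheHive-style alert body
    (schema and severity mapping are illustrative assumptions)."""
    return {
        "type": "wazuh",
        "source": "wazuh-manager",
        "sourceRef": str(event["id"]),
        "title": event["rule"]["description"],
        # Map Wazuh's 0-15 rule level onto TheHive's 1-3 severity scale.
        "severity": min(3, max(1, event["rule"]["level"] // 5)),
        "description": (
            f"Rule {event['rule']['id']} fired on agent "
            f"{event.get('agent', {}).get('name', 'unknown')}"
        ),
    }
```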

Design Rationale:

  • TheHive provides centralized case management for multi-analyst collaboration
  • Cortex automates repetitive analysis tasks (IP reputation, file hashing, threat intel)
  • Shuffle enables no-code playbook development for rapid response

Workflow Example:

Wazuh Alert → TheHive Case
                   │
                   ▼
  Cortex Analysis (IP reputation, geolocation)
                   │
                   ▼
      Shuffle Playbook Execution
          ┌────────┴────────┐
          ▼                 ▼
  Block IP (Firewall)  Notify SOC Team


Layer 5: Monitoring & Observability

Purpose: Real-time health monitoring, performance metrics, and alerting.

Components:

  • Prometheus - Time-series metrics database
  • Grafana - Visualization and dashboards
  • AlertManager - Alert routing and deduplication
  • Loki - Log aggregation for troubleshooting
  • cAdvisor + Node Exporter - Container and host metrics

Metrics Collection:

  • 13 scrape targets across all services
  • 15-second scrape interval
  • 30-day retention for high-resolution data

Dashboards:

  • SIEM Stack Health (Wazuh Manager, Indexer, Dashboard)
  • ML Model Performance (inference latency, prediction distribution)
  • AI Services Metrics (LLM response times, RAG retrieval accuracy)
  • Infrastructure Resources (CPU, RAM, disk, network)

Alerting Rules:

  • Service down detection (<30 seconds)
  • Resource exhaustion (CPU >80%, RAM >90%)
  • ML model drift detection
  • Abnormal false positive rates
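The service-down and resource-exhaustion rules might be expressed as Prometheus alerting rules along these lines; the group name is illustrative, and the CPU expression assumes node_exporter's `node_cpu_seconds_total` metric:

```yaml
groups:
  - name: ai-soc-availability   # group name is illustrative
    rules:
      - alert: ServiceDown
        expr: up == 0
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.job }} target is unreachable"
      - alert: HighCPU
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPU above 80% on {{ $labels.instance }}"
```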

Design Rationale:

  • Prometheus provides an industry-standard metrics format (compatible with all major tools)
  • Grafana enables custom dashboards for different stakeholder personas (SOC analyst, engineer, executive)
  • AlertManager prevents alert fatigue through intelligent grouping and inhibition


Network Architecture

Network Segmentation

Isolation Strategy: Backend/Frontend network separation per stack.

Network        Subnet          Purpose              Security Posture
siem-backend   172.20.0.0/24   SIEM internal comms  No external exposure
siem-frontend  172.21.0.0/24   SIEM web UI          HTTPS only
soar-backend   172.26.0.0/24   SOAR databases       No external exposure
soar-frontend  172.27.0.0/24   SOAR web UIs         HTTP (reverse proxy recommended)
monitoring     172.28.0.0/24   Observability stack  Internal only
ai-network     172.30.0.0/24   AI/ML services       API gateway protected

Benefits:

  • Compromised web UI cannot directly access backend databases
  • Lateral movement requires crossing network boundaries
  • Simplified firewall rule management
  • Clear trust boundaries for security policies

Port Allocation

Externally Accessible:

  • 443 (Wazuh Dashboard - HTTPS)
  • 3000 (Grafana)
  • 9010 (TheHive)
  • 9011 (Cortex)
  • 3001 (Shuffle)
  • 8500 (ML Inference API)
  • 8100 (Alert Triage API)
  • 8300 (RAG Service API)

Internal Only:

  • 9200 (Wazuh Indexer - OpenSearch)
  • 55000 (Wazuh Manager API)
  • 9042 (Cassandra)
  • 8200 (ChromaDB)
  • 11434 (Ollama LLM)

See Network Topology for complete port mapping.


Technology Stack

Backend Services

Component        Technology  Version  Justification
SIEM             Wazuh       4.8.2    Open-source, MITRE ATT&CK mapping, active community
Search Engine    OpenSearch  2.x      Elasticsearch fork, scalable, no licensing restrictions
Case Management  TheHive     5.2.9    Purpose-built for SOC workflows, Cortex integration
Orchestration    Shuffle     1.4.0    Open-source SOAR, drag-and-drop workflows
Database         Cassandra   4.1.3    Distributed, fault-tolerant, scales horizontally
Vector DB        ChromaDB    Latest   AI-native, embedding support, simple API
Object Storage   MinIO       Latest   S3-compatible, self-hosted

AI/ML Stack

Component          Technology             Version    Justification
ML Framework       scikit-learn           1.3+       Industry standard, battle-tested algorithms
LLM Runtime        Ollama                 Latest     Local inference, model management, OpenAI-compatible API
LLM Model          LLaMA 3.1              8B params  State-of-the-art open-source, optimal size/performance
API Framework      FastAPI                0.100+     Async support, automatic docs, type safety
Vector Embeddings  sentence-transformers  Latest     Pre-trained models, semantic similarity

Infrastructure

Component          Technology      Version  Justification
Container Runtime  Docker          24.0+    Industry standard, mature ecosystem
Orchestration      Docker Compose  V2       Simplified multi-container management
Monitoring         Prometheus      2.48+    De facto standard, extensive integrations
Visualization      Grafana         10.2+    Powerful dashboards, alerting, multi-datasource
Log Aggregation    Loki            2.9+     Prometheus-style log queries, low storage overhead

Scalability Considerations

Horizontal Scaling

SIEM Stack:

  • Wazuh Manager: Multi-node cluster with load balancing
  • Wazuh Indexer: OpenSearch cluster (3+ nodes for HA)
  • Capacity: 100,000+ events/second with 5-node indexer cluster

AI Services:

  • ML Inference: Stateless, add replicas behind load balancer
  • Alert Triage: Horizontal scaling limited by Ollama GPU availability
  • RAG Service: Stateless, ChromaDB supports distributed deployment

SOAR Stack:

  • TheHive: Multi-master cluster with Cassandra ring
  • Shuffle: Worker scaling for parallel workflow execution

Vertical Scaling

Resource Limits (per service):

  • Wazuh Indexer: 16GB RAM (configurable JVM heap)
  • ML Inference: 1GB RAM, 1 CPU (sufficient for 1,000 req/sec)
  • Ollama LLM: 8GB RAM minimum (16GB for larger models)
  • ChromaDB: 4GB RAM for 100K vectors
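Under Docker Compose these limits can be declared per service; a sketch matching the figures above (service keys and values are illustrative, not the project's actual compose file):

```yaml
services:
  ml-inference:
    deploy:
      resources:
        limits:
          cpus: "1.0"
          memory: 1G
  ollama:
    deploy:
      resources:
        limits:
          memory: 8G
```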

Performance Targets

Metric Small Deployment Medium Large
Event Throughput 1,000/sec 10,000/sec 100,000/sec
Concurrent Analysts 5 25 100
Data Retention 30 days 90 days 365 days
Query Response (p95) <1s <500ms <200ms
ML Inference Latency <5ms <2ms <1ms

High Availability Design

Service Redundancy

Critical Services (require 99.9% uptime):

  • Wazuh Manager: 2+ nodes with failover
  • Wazuh Indexer: 3+ nodes (quorum-based)
  • Cassandra: 3+ nodes (RF=3)

Non-Critical Services (tolerate brief downtime):

  • Grafana: Single instance acceptable (read-only impact)
  • Shuffle: Workflow queue prevents data loss

Data Persistence

Volumes:

  • All stateful services use named Docker volumes
  • Volume backup strategy: daily snapshots
  • Retention: 30 days for volume backups

Backup Procedures:

# Wazuh Indexer snapshot (assumes a snapshot repository named "backup" is registered)
docker exec wazuh-indexer curl -X PUT "localhost:9200/_snapshot/backup/snapshot_$(date +%Y%m%d)"

# Cassandra backup
docker exec cassandra nodetool snapshot

# ChromaDB export
docker exec chromadb curl "http://localhost:8000/api/v1/export"


Security Architecture

Defense in Depth

Layer 1: Network Segmentation

  • Isolated Docker networks per stack
  • No direct backend exposure to internet
  • Firewall rules restrict inter-service communication

Layer 2: Authentication & Authorization

  • API key authentication for service-to-service
  • OAuth2/SAML for user authentication
  • Role-based access control (RBAC) in TheHive
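For the service-to-service API keys, a minimal check might use a constant-time comparison; the environment variable name below is illustrative:

```python
import hmac
import os

# Key injected via environment at deploy time (see Secrets Management);
# the variable name is illustrative, not the platform's actual setting.
EXPECTED_KEY = os.environ.get("AI_SOC_API_KEY", "dev-only-key")

def check_api_key(presented: str) -> bool:
    """hmac.compare_digest avoids leaking key prefixes through timing differences."""
    return hmac.compare_digest(presented.encode(), EXPECTED_KEY.encode())
```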

Layer 3: Encryption

  • TLS 1.3 for all external communication
  • Self-signed certificates (development)
  • Let's Encrypt integration (production)

Layer 4: Secrets Management

  • Environment variable injection
  • Docker secrets for production
  • HashiCorp Vault integration (future)

Layer 5: Audit Logging

  • All API calls logged to Wazuh
  • Immutable audit trail
  • Retention: 365 days minimum

Threat Model

Assumed Threats:

  • External network attackers
  • Compromised web application
  • Insider threats (malicious analyst)
  • Supply chain attacks (vulnerable dependencies)

Mitigations:

  • Web Application Firewall (WAF) recommended
  • Principle of least privilege
  • Audit logging and anomaly detection
  • Dependency scanning (Dependabot, Snyk)

See Security Guide for detailed hardening procedures.


Integration Patterns

Event-Driven Architecture

Webhooks:

  • Wazuh → TheHive: Alert creation on rule match
  • TheHive → Shuffle: Case status changes trigger workflows
  • AlertManager → Shuffle: Infrastructure alerts trigger remediation

Benefits:

  • Loose coupling between services
  • Asynchronous processing prevents blocking
  • Retry mechanisms handle transient failures
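The retry behaviour can be sketched as exponential backoff around the webhook delivery call; this is the generic pattern, not the specific mechanism each tool implements:

```python
import time

def deliver_with_retry(send, payload, attempts=5, base_delay=0.5):
    """Call send(payload); on failure wait base_delay * 2**attempt and retry,
    re-raising the final error so the caller can dead-letter the event."""
    for attempt in range(attempts):
        try:
            return send(payload)
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
```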

API-First Design

RESTful APIs:

  • All services expose standardized REST endpoints
  • OpenAPI/Swagger documentation auto-generated
  • Consistent error handling (RFC 7807 Problem Details)

Example API Flow:

POST /triage
  → GET /ml-inference/predict (ML classification)
  → GET /rag-service/retrieve (MITRE context)
  → POST /ollama/api/generate (LLM analysis)
  → Response: Enriched alert


Development & Deployment

CI/CD Pipeline (Planned)

Code Commit → GitHub Actions
                   │
                   ▼
              Unit Tests
                   │
                   ▼
             Docker Build
                   │
                   ▼
          Integration Tests
                   │
                   ▼
          Deploy to Staging
                   │
                   ▼
             Smoke Tests
                   │
                   ▼
       Production Deployment

Configuration Management

Environment Variables:

  • .env file for local development
  • Docker Compose env_file directive
  • Secrets injected at runtime

Infrastructure as Code:

  • All configurations version-controlled
  • Declarative Docker Compose specifications
  • Idempotent deployment scripts


Future Architecture Enhancements

Short-term (Weeks 3-4)

  • Multi-class ML classification (24 attack types)
  • Reverse proxy (Nginx/Traefik) for HTTPS termination
  • Secrets management (HashiCorp Vault)
  • Automated backups

Medium-term (Months 2-3)

  • Kubernetes migration for production deployments
  • Multi-region deployment for disaster recovery
  • Advanced ML models (deep learning, transformers)
  • Custom Cortex analyzers

Long-term (Months 4-6)

  • Multi-agent collaboration framework
  • Automated playbook generation via LLM
  • Predictive threat modeling
  • Zero-trust network architecture

Appendices

A. Service Dependencies

Wazuh Dashboard → Wazuh Manager → Wazuh Indexer
TheHive → Cassandra + MinIO
Cortex → Cassandra + TheHive
Shuffle → OpenSearch
Alert Triage → ML Inference + RAG Service + Ollama
RAG Service → ChromaDB
Grafana → Prometheus + Loki
AlertManager → Prometheus

B. Resource Requirements

Minimum (Development/Testing):

  • CPU: 4 cores (8 threads)
  • RAM: 16GB
  • Disk: 50GB SSD
  • Network: 100Mbps

Recommended (Production):

  • CPU: 8 cores (16 threads)
  • RAM: 32GB
  • Disk: 250GB NVMe SSD
  • Network: 1Gbps

See System Requirements for detailed specifications.

C. Glossary

  • SIEM: Security Information and Event Management
  • SOAR: Security Orchestration, Automation, and Response
  • RAG: Retrieval-Augmented Generation
  • CTI: Cyber Threat Intelligence
  • MITRE ATT&CK: Adversarial Tactics, Techniques, and Common Knowledge framework
  • IOC: Indicator of Compromise
  • EDR: Endpoint Detection and Response

Architecture Documentation Version: 1.0
Last Updated: October 24, 2025
Maintained By: AI-SOC Architecture Team