Production Deployment Guide for AI-SOC¶
Executive Summary¶
This guide provides a comprehensive production deployment strategy for AI-SOC, incorporating high-availability architectures, disaster recovery procedures, deployment patterns (blue-green, canary), observability best practices, and SLA/SLO definitions for Security Operations Centers.
It is based on 2025 industry standards and production-grade practices from leading organizations deploying LLMs at scale.
Table of Contents¶
- High Availability Architecture
- Disaster Recovery Strategy
- Deployment Patterns
- Observability & Monitoring
- SLA/SLO/SLI Definitions
- Production Checklist
1. High Availability Architecture¶
1.1 Multi-Zone Kubernetes Deployment¶
Architecture Diagram:
┌─────────────────────────────────────────────────────────────┐
│ Load Balancer (CloudFlare + NGINX) │
└────────────┬────────────────────────────┬────────────────────┘
│ │
┌───────▼────────┐ ┌──────▼─────────┐
│ Zone A │ │ Zone B │
│ (us-east-1a) │ │ (us-east-1b) │
│ │ │ │
│ ┌──────────┐ │ │ ┌──────────┐ │
│ │ LLM Pods │ │ │ │ LLM Pods │ │
│ │ x3 │ │ │ │ x3 │ │
│ └──────────┘ │ │ └──────────┘ │
│ ┌──────────┐ │ │ ┌──────────┐ │
│ │ API Pods │ │ │ │ API Pods │ │
│ │ x5 │ │ │ │ x5 │ │
│ └──────────┘ │ │ └──────────┘ │
└────────┬───────┘ └────────┬───────┘
│ │
┌────────▼────────────────────────────▼───────┐
│ OpenSearch Cluster (3 masters) │
│ Master-1 (Zone A) | Master-2 (Zone B) │
│ Data-1,2 (Zone A) | Data-3,4 (Zone B) │
└──────────────────────────────────────────────┘
│
┌────────▼────────────────────────────────────┐
│ Persistent Storage (EBS/EFS with snapshots)│
└─────────────────────────────────────────────┘
1.2 Kubernetes HA Configuration¶
# k8s-deployment/llm-service-ha.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: llm-service
namespace: ai-soc
spec:
replicas: 6 # Minimum 6 replicas across zones
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 2
maxUnavailable: 1 # Always maintain at least 5 running pods
selector:
matchLabels:
app: llm-service
template:
metadata:
labels:
app: llm-service
spec:
# Topology spread for HA across zones
topologySpreadConstraints:
- maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
app: llm-service
# Anti-affinity: Don't schedule on same node
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- llm-service
topologyKey: kubernetes.io/hostname
containers:
- name: llm-container
image: ai-soc-llm:1.0.0
resources:
requests:
cpu: "2"
memory: "8Gi"
limits:
cpu: "4"
memory: "16Gi"
# Liveness probe: restart if unhealthy
livenessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 60
periodSeconds: 30
timeoutSeconds: 10
failureThreshold: 3
# Readiness probe: remove from load balancer if not ready
readinessProbe:
httpGet:
path: /ready
port: 8000
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 2
# Startup probe: allow slow startup
startupProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 0
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 30 # 5 minutes to start
---
apiVersion: v1
kind: Service
metadata:
name: llm-service
namespace: ai-soc
spec:
type: LoadBalancer
selector:
app: llm-service
ports:
- port: 80
targetPort: 8000
sessionAffinity: ClientIP # Sticky sessions for conversation continuity
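The rolling-update settings above keep at least five pods serving during image rollouts, but they do not protect against voluntary disruptions such as node drains or cluster upgrades. A PodDisruptionBudget is typically added alongside the Deployment for that; a minimal sketch, assuming the same app: llm-service label:

# k8s-deployment/llm-service-pdb.yaml (illustrative)
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: llm-service-pdb
  namespace: ai-soc
spec:
  minAvailable: 5           # mirrors the rolling-update floor of 5 running pods
  selector:
    matchLabels:
      app: llm-service

With this in place, kubectl drain and similar eviction requests are refused once only five replicas remain available.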
1.3 Control Plane HA (Multi-Master)¶
# kubeadm-config.yaml (for self-hosted Kubernetes)
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: v1.28.0
controlPlaneEndpoint: "k8s-api.ai-soc.local:6443"
# High Availability etcd
etcd:
external:
endpoints:
- https://etcd-1.ai-soc.local:2379
- https://etcd-2.ai-soc.local:2379
- https://etcd-3.ai-soc.local:2379
caFile: /etc/kubernetes/pki/etcd/ca.crt
certFile: /etc/kubernetes/pki/apiserver-etcd-client.crt
keyFile: /etc/kubernetes/pki/apiserver-etcd-client.key
# Load balancer for API servers
apiServer:
certSANs:
- "k8s-api.ai-soc.local"
- "10.0.0.100" # Load balancer IP
extraArgs:
    enable-admission-plugins: NodeRestriction  # PodSecurityPolicy was removed in Kubernetes 1.25; use Pod Security Admission instead
audit-log-path: /var/log/kubernetes/audit.log
audit-log-maxage: "30"
Expected Availability:

- Single master: ~99.5% (~3.6 hours downtime/month)
- Multi-master (3 nodes): 99.95% (~22 minutes downtime/month)
- Multi-master + multi-zone: 99.99% (~4.3 minutes downtime/month)
1.4 Database HA (OpenSearch Cluster)¶
# opensearch-cluster-ha.yaml
apiVersion: opensearch.opster.io/v1
kind: OpenSearchCluster
metadata:
name: ai-soc-logs
namespace: ai-soc
spec:
general:
version: 2.11.0
httpPort: 9200
serviceName: ai-soc-logs
# Dedicated master nodes (3 for quorum)
nodePools:
- component: masters
replicas: 3
diskSize: 50Gi
roles:
- cluster_manager
resources:
requests:
cpu: 2
memory: 8Gi
limits:
cpu: 4
memory: 16Gi
# Spread masters across zones
topologySpreadConstraints:
- maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: DoNotSchedule
# Data nodes (4 for redundancy + performance)
- component: data
replicas: 4
diskSize: 500Gi
roles:
- data
- ingest
resources:
requests:
cpu: 4
memory: 32Gi
limits:
cpu: 8
memory: 64Gi
# HA configuration
dashboards:
enable: true
replicas: 2 # Redundant dashboards
security:
tls:
transport:
generate: true
http:
generate: true
1.5 Network HA with Load Balancing¶
# nginx-ha.conf
upstream llm_backend {
least_conn; # Route to least busy server
# Health checks
server llm-1.ai-soc.local:8000 max_fails=3 fail_timeout=30s;
server llm-2.ai-soc.local:8000 max_fails=3 fail_timeout=30s;
server llm-3.ai-soc.local:8000 max_fails=3 fail_timeout=30s;
server llm-4.ai-soc.local:8000 max_fails=3 fail_timeout=30s;
server llm-5.ai-soc.local:8000 max_fails=3 fail_timeout=30s;
server llm-6.ai-soc.local:8000 max_fails=3 fail_timeout=30s;
# Backup server (if all fail)
server llm-backup.ai-soc.local:8000 backup;
# Connection pooling
keepalive 32;
}
server {
listen 443 ssl http2;
server_name api.ai-soc.local;
ssl_certificate /etc/nginx/ssl/ai-soc.crt;
ssl_certificate_key /etc/nginx/ssl/ai-soc.key;
# SSL optimization
ssl_session_cache shared:SSL:10m;
ssl_session_timeout 10m;
# Health check endpoint
location /nginx-health {
access_log off;
return 200 "healthy\n";
}
location / {
proxy_pass http://llm_backend;
proxy_http_version 1.1;
proxy_set_header Connection "";
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
# Timeouts
proxy_connect_timeout 10s;
proxy_send_timeout 60s;
proxy_read_timeout 60s;
# Retry on failure
proxy_next_upstream error timeout http_502 http_503 http_504;
proxy_next_upstream_tries 3;
}
}
2. Disaster Recovery Strategy¶
2.1 Backup Strategy¶
3-2-1 Backup Rule:

- 3 copies of data
- 2 different storage media (EBS snapshots + S3)
- 1 offsite backup (different region)
# velero-backup-schedule.yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
name: ai-soc-daily-backup
namespace: velero
spec:
schedule: "0 2 * * *" # Daily at 2 AM UTC
template:
includedNamespaces:
- ai-soc
includeClusterResources: true
storageLocation: default
volumeSnapshotLocations:
- default
ttl: 720h # Retain for 30 days
---
apiVersion: velero.io/v1
kind: Schedule
metadata:
name: ai-soc-weekly-backup
namespace: velero
spec:
schedule: "0 3 * * 0" # Weekly on Sunday at 3 AM
template:
includedNamespaces:
- ai-soc
includeClusterResources: true
storageLocation: s3-backup
ttl: 2160h # Retain for 90 days
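The weekly schedule above writes to a storageLocation named s3-backup, which covers the offsite leg of the 3-2-1 rule when it points at a bucket in a different region. A hedged sketch of that BackupStorageLocation (bucket name and regions are assumptions):

# velero-backup-location-offsite.yaml (illustrative)
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: s3-backup
  namespace: velero
spec:
  provider: aws
  objectStorage:
    bucket: ai-soc-velero-offsite    # assumed bucket in a region other than the primary
    prefix: weekly
  config:
    region: us-west-2                # offsite region (assumption; primary runs in us-east-1)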
OpenSearch Snapshots:
# backup/opensearch_snapshots.py
from opensearchpy import OpenSearch
import datetime
os_client = OpenSearch(['https://opensearch:9200'])
# Register snapshot repository (S3)
os_client.snapshot.create_repository(
repository='ai-soc-backups',
body={
"type": "s3",
"settings": {
"bucket": "ai-soc-opensearch-backups",
"region": "us-east-1",
"base_path": "snapshots",
"compress": True,
"server_side_encryption": True
}
}
)
# Create daily snapshot
def create_daily_snapshot():
"""Create incremental snapshot of all indices"""
snapshot_name = f"snapshot-{datetime.date.today()}"
os_client.snapshot.create(
repository='ai-soc-backups',
snapshot=snapshot_name,
body={
"indices": "logs-*,alerts-*,threats-*",
"ignore_unavailable": True,
"include_global_state": False
},
wait_for_completion=False # Async
)
print(f"Snapshot {snapshot_name} initiated")
# Automated retention
def cleanup_old_snapshots(retention_days: int = 30):
"""Delete snapshots older than retention period"""
snapshots = os_client.snapshot.get(
repository='ai-soc-backups',
snapshot='*'
)
cutoff = datetime.datetime.now() - datetime.timedelta(days=retention_days)
for snapshot in snapshots['snapshots']:
start_time = datetime.datetime.fromtimestamp(snapshot['start_time_in_millis'] / 1000)
if start_time < cutoff:
os_client.snapshot.delete(
repository='ai-soc-backups',
snapshot=snapshot['snapshot']
)
print(f"Deleted old snapshot: {snapshot['snapshot']}")
2.2 Recovery Procedures¶
- Recovery Time Objective (RTO): 1 hour
- Recovery Point Objective (RPO): 24 hours
Kubernetes Cluster Recovery:
#!/bin/bash
# disaster-recovery/restore-cluster.sh
set -e
echo "=== AI-SOC Disaster Recovery ==="
echo "Restoring from backup..."
# 1. Restore Kubernetes resources with Velero
velero restore create ai-soc-restore \
--from-backup ai-soc-daily-backup-20251022 \
--wait
# 2. Verify pods are running
kubectl wait --for=condition=ready pod \
-l app=llm-service \
-n ai-soc \
--timeout=300s
# 3. Restore OpenSearch data
python3 restore_opensearch.py --snapshot snapshot-2025-10-22
# 4. Verify services
kubectl get pods -n ai-soc
kubectl get svc -n ai-soc
# 5. Run smoke tests
python3 smoke-tests.py
echo "=== Recovery Complete ==="
OpenSearch Restore:
# disaster-recovery/restore_opensearch.py
from opensearchpy import OpenSearch

os_client = OpenSearch(['https://opensearch:9200'])

def restore_opensearch_snapshot(snapshot_name: str):
"""Restore OpenSearch data from snapshot"""
# Close indices before restore
indices_to_restore = ["logs-*", "alerts-*", "threats-*"]
for index_pattern in indices_to_restore:
os_client.indices.close(index=index_pattern)
# Restore snapshot
os_client.snapshot.restore(
repository='ai-soc-backups',
snapshot=snapshot_name,
body={
"indices": ",".join(indices_to_restore),
"ignore_unavailable": True,
"include_global_state": False
},
wait_for_completion=True
)
# Reopen indices
for index_pattern in indices_to_restore:
os_client.indices.open(index=index_pattern)
print(f"Restored snapshot: {snapshot_name}")
2.3 Disaster Recovery Testing¶
# .github/workflows/dr-test.yml
name: Disaster Recovery Test
on:
schedule:
- cron: '0 4 1 * *' # Monthly on 1st at 4 AM
jobs:
dr-test:
runs-on: ubuntu-latest
steps:
- name: Backup Production
run: |
velero backup create dr-test-backup \
--from-schedule ai-soc-daily-backup
- name: Deploy Test Cluster
run: |
terraform apply -var="environment=dr-test"
- name: Restore to Test Cluster
run: |
velero restore create dr-test-restore \
--from-backup dr-test-backup \
--wait
- name: Run Validation Tests
run: |
python3 dr-validation-tests.py
- name: Measure RTO/RPO
run: |
python3 measure-recovery-time.py
- name: Cleanup Test Environment
run: |
terraform destroy -var="environment=dr-test" -auto-approve
- name: Report Results
run: |
python3 send-dr-report.py
3. Deployment Patterns¶
3.1 Blue-Green Deployment¶
Use Case: Zero-downtime releases with instant rollback capability
# deployment-patterns/blue-green.yaml
---
# Blue environment (current production)
apiVersion: apps/v1
kind: Deployment
metadata:
name: llm-service-blue
namespace: ai-soc
spec:
replicas: 6
selector:
matchLabels:
app: llm-service
version: blue
template:
metadata:
labels:
app: llm-service
version: blue
spec:
containers:
- name: llm-container
image: ai-soc-llm:1.0.0 # Current version
---
# Green environment (new version)
apiVersion: apps/v1
kind: Deployment
metadata:
name: llm-service-green
namespace: ai-soc
spec:
replicas: 6
selector:
matchLabels:
app: llm-service
version: green
template:
metadata:
labels:
app: llm-service
version: green
spec:
containers:
- name: llm-container
image: ai-soc-llm:2.0.0 # New version
---
# Service (switch between blue and green)
apiVersion: v1
kind: Service
metadata:
name: llm-service
namespace: ai-soc
spec:
selector:
app: llm-service
version: blue # Change to "green" to switch traffic
ports:
- port: 80
targetPort: 8000
Deployment Script:
#!/bin/bash
# deployment-patterns/blue-green-deploy.sh
set -e
NAMESPACE="ai-soc"
SERVICE_NAME="llm-service"
NEW_VERSION="2.0.0"
echo "=== Blue-Green Deployment ==="
# 1. Deploy green environment
echo "Deploying green environment (version $NEW_VERSION)..."
kubectl apply -f llm-service-green.yaml
# 2. Wait for green pods to be ready
echo "Waiting for green pods to be ready..."
kubectl wait --for=condition=ready pod \
-l app=llm-service,version=green \
-n $NAMESPACE \
--timeout=300s
# 3. Run smoke tests on green
echo "Running smoke tests on green environment..."
GREEN_POD=$(kubectl get pod -l version=green -n $NAMESPACE -o jsonpath='{.items[0].metadata.name}')
# Run the tests via the command itself; under "set -e" a bare failing
# command would abort the script before a separate $? check could run
if ! kubectl exec -n $NAMESPACE $GREEN_POD -- python3 /app/smoke-tests.py; then
    echo "Smoke tests failed! Aborting deployment."
    exit 1
fi
# 4. Switch traffic to green
echo "Switching traffic from blue to green..."
kubectl patch service $SERVICE_NAME -n $NAMESPACE \
-p '{"spec":{"selector":{"version":"green"}}}'
echo "Traffic switched to green!"
# 5. Monitor for 10 minutes
echo "Monitoring green environment for 10 minutes..."
sleep 600
# 6. Check error rates
# Error ratio over the last 5 minutes (errors / total requests)
ERROR_RATE=$(curl -s -G "http://prometheus:9090/api/v1/query" \
  --data-urlencode 'query=sum(rate(llm_requests_total{status="error"}[5m])) / sum(rate(llm_requests_total[5m]))' \
  | jq -r '.data.result[0].value[1]')
if (( $(echo "$ERROR_RATE > 0.05" | bc -l) )); then
echo "High error rate detected! Rolling back to blue..."
kubectl patch service $SERVICE_NAME -n $NAMESPACE \
-p '{"spec":{"selector":{"version":"blue"}}}'
exit 1
fi
# 7. Success! Scale down blue
echo "Deployment successful! Scaling down blue environment..."
kubectl scale deployment llm-service-blue -n $NAMESPACE --replicas=0
echo "=== Deployment Complete ==="
3.2 Canary Deployment¶
Use Case: Gradual rollout to minimize risk (5% → 25% → 50% → 100%)
# deployment-patterns/canary.yaml
---
# Stable deployment (95% of traffic)
apiVersion: apps/v1
kind: Deployment
metadata:
name: llm-service-stable
namespace: ai-soc
spec:
replicas: 19 # 95% of 20 total pods
selector:
matchLabels:
app: llm-service
track: stable
template:
metadata:
labels:
app: llm-service
track: stable
spec:
containers:
- name: llm-container
image: ai-soc-llm:1.0.0
---
# Canary deployment (5% of traffic)
apiVersion: apps/v1
kind: Deployment
metadata:
name: llm-service-canary
namespace: ai-soc
spec:
replicas: 1 # 5% of 20 total pods
selector:
matchLabels:
app: llm-service
track: canary
template:
metadata:
labels:
app: llm-service
track: canary
spec:
containers:
- name: llm-container
image: ai-soc-llm:2.0.0 # New version
---
# Service (routes to both stable and canary)
apiVersion: v1
kind: Service
metadata:
name: llm-service
namespace: ai-soc
spec:
selector:
app: llm-service # Selects both stable and canary
ports:
- port: 80
targetPort: 8000
Automated Canary Progression:
# deployment-patterns/canary-controller.py
import time
import requests
class CanaryController:
    """Progressively shift traffic to the canary by adjusting replica counts.

    scale_deployment(), wait_for_ready() and delete_deployment() are thin
    wrappers around kubectl / the Kubernetes API and are omitted here.
    """

    def __init__(self, namespace: str, service: str):
self.namespace = namespace
self.service = service
self.stages = [5, 25, 50, 100] # Percentage of traffic
self.stage_duration = 600 # 10 minutes per stage
def deploy_canary(self, new_version: str):
"""Progressively increase canary traffic"""
total_replicas = 20
for stage_pct in self.stages:
canary_replicas = int(total_replicas * stage_pct / 100)
stable_replicas = total_replicas - canary_replicas
print(f"\n=== Canary Stage: {stage_pct}% ===")
print(f"Canary replicas: {canary_replicas}")
print(f"Stable replicas: {stable_replicas}")
# Scale deployments
self.scale_deployment("llm-service-canary", canary_replicas)
self.scale_deployment("llm-service-stable", stable_replicas)
# Wait for pods to be ready
self.wait_for_ready("llm-service-canary", canary_replicas)
# Monitor for duration
print(f"Monitoring for {self.stage_duration}s...")
time.sleep(self.stage_duration)
# Check metrics
if not self.check_canary_health():
print("Canary health check failed! Rolling back...")
self.rollback()
return False
print(f"Stage {stage_pct}% successful!")
print("\n=== Canary deployment complete! ===")
# Cleanup: delete stable deployment
self.delete_deployment("llm-service-stable")
return True
def check_canary_health(self) -> bool:
"""Check if canary is healthy compared to stable"""
# Query Prometheus for error rates
        canary_error_rate = self.get_error_rate('track="canary"')
        stable_error_rate = self.get_error_rate('track="stable"')
print(f"Canary error rate: {canary_error_rate:.4f}")
print(f"Stable error rate: {stable_error_rate:.4f}")
# Canary must not have >2x error rate of stable
if canary_error_rate > stable_error_rate * 2:
return False
# Canary must have <5% error rate absolute
if canary_error_rate > 0.05:
return False
return True
def get_error_rate(self, label_filter: str) -> float:
"""Query Prometheus for error rate"""
query = f'rate(llm_requests_total{{status="error",{label_filter}}}[5m]) / rate(llm_requests_total{{{label_filter}}}[5m])'
response = requests.get(
f"http://prometheus:9090/api/v1/query",
params={"query": query}
)
result = response.json()
if result['data']['result']:
return float(result['data']['result'][0]['value'][1])
return 0.0
def rollback(self):
"""Rollback canary deployment"""
self.scale_deployment("llm-service-canary", 0)
self.scale_deployment("llm-service-stable", 20)
print("Rolled back to stable version")
# Usage
controller = CanaryController("ai-soc", "llm-service")
controller.deploy_canary("2.0.0")
4. Observability & Monitoring¶
4.1 OpenTelemetry Integration¶
Comprehensive observability with logs, metrics, and traces:
# observability/opentelemetry_config.py
import time

from fastapi import FastAPI
from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.sdk.resources import Resource
# Configure resource attributes
resource = Resource(attributes={
"service.name": "ai-soc-llm-service",
"service.version": "2.0.0",
"deployment.environment": "production"
})
# Setup tracing
trace_provider = TracerProvider(resource=resource)
otlp_span_exporter = OTLPSpanExporter(endpoint="http://otel-collector:4317")
trace_provider.add_span_processor(BatchSpanProcessor(otlp_span_exporter))
trace.set_tracer_provider(trace_provider)
# Setup metrics
metric_reader = PeriodicExportingMetricReader(
OTLPMetricExporter(endpoint="http://otel-collector:4317"),
export_interval_millis=60000
)
meter_provider = MeterProvider(resource=resource, metric_readers=[metric_reader])
metrics.set_meter_provider(meter_provider)
# Instrument FastAPI
app = FastAPI()
FastAPIInstrumentor.instrument_app(app)
# Create custom metrics
meter = metrics.get_meter(__name__)
llm_latency_histogram = meter.create_histogram(
name="llm.inference.duration",
description="LLM inference duration in seconds",
unit="s"
)
# Create tracer
tracer = trace.get_tracer(__name__)
# Usage in application
@app.post("/analyze")
async def analyze_threat(prompt: str):
with tracer.start_as_current_span("llm.inference") as span:
span.set_attribute("llm.model", "Foundation-Sec-8B")
span.set_attribute("llm.prompt_length", len(prompt))
start = time.time()
result = await llm_service.generate(prompt)
duration = time.time() - start
# Record metrics
llm_latency_histogram.record(duration, {"model": "Foundation-Sec-8B"})
span.set_attribute("llm.response_length", len(result))
span.set_attribute("llm.duration_ms", duration * 1000)
return {"result": result}
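The exporters above send OTLP data to otel-collector:4317, which is not defined elsewhere in this guide. A minimal OpenTelemetry Collector configuration for that endpoint might look like the following sketch; the Tempo trace backend and the Prometheus scrape port are assumptions:

# observability/otel-collector-config.yaml (illustrative)
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  batch: {}
exporters:
  prometheus:
    endpoint: 0.0.0.0:8889       # scraped by Prometheus
  otlp/tempo:
    endpoint: tempo:4317         # assumed trace backend
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]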
4.2 Distributed Tracing¶
Trace requests across microservices:
# observability/distributed_tracing.py
from fastapi import Request
from opentelemetry import trace
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator

# `app`, `chromadb_client`, `llm_service` and `opensearch_client` are the shared
# instances created in the service entrypoint
tracer = trace.get_tracer(__name__)
propagator = TraceContextTextMapPropagator()
@app.post("/analyze-alert")
async def analyze_alert(alert_data: dict, request: Request):
"""
Endpoint with distributed tracing
Trace propagates: API Gateway -> LLM Service -> ChromaDB -> OpenSearch
"""
# Extract trace context from incoming request
ctx = propagator.extract(carrier=dict(request.headers))
with tracer.start_as_current_span("analyze_alert", context=ctx) as span:
span.set_attribute("alert.type", alert_data.get("type"))
span.set_attribute("alert.severity", alert_data.get("severity"))
# 1. Query threat intel from ChromaDB (span propagates)
with tracer.start_as_current_span("chromadb.query") as db_span:
threat_context = await chromadb_client.query(alert_data["description"])
db_span.set_attribute("chromadb.results", len(threat_context))
# 2. LLM analysis (span propagates)
with tracer.start_as_current_span("llm.analyze") as llm_span:
analysis = await llm_service.analyze(alert_data, threat_context)
llm_span.set_attribute("llm.tokens", analysis["tokens_used"])
# 3. Log to OpenSearch (span propagates)
with tracer.start_as_current_span("opensearch.index") as os_span:
await opensearch_client.index(analysis)
return analysis
4.3 Prometheus Metrics¶
RED Metrics (Rate, Errors, Duration):
# observability/prometheus_metrics.py
import time

from fastapi import Request, Response
from prometheus_client import Counter, Histogram, Gauge, generate_latest

# `app` is the shared FastAPI instance created in the service entrypoint
# Rate: Request throughput
http_requests_total = Counter(
'http_requests_total',
'Total HTTP requests',
['method', 'endpoint', 'status']
)
# Errors: Error rate
llm_errors_total = Counter(
'llm_errors_total',
'Total LLM errors',
['error_type', 'model']
)
# Duration: Latency distribution
http_request_duration_seconds = Histogram(
'http_request_duration_seconds',
'HTTP request duration',
['method', 'endpoint'],
buckets=[0.1, 0.5, 1.0, 2.0, 5.0, 10.0]
)
llm_inference_duration_seconds = Histogram(
'llm_inference_duration_seconds',
'LLM inference duration',
['model', 'quantization'],
buckets=[0.5, 1.0, 2.0, 5.0, 10.0, 30.0]
)
# Custom business metrics
active_conversations = Gauge(
'llm_active_conversations',
'Number of active LLM conversations'
)
chromadb_index_size = Gauge(
'chromadb_index_size',
'Number of vectors in ChromaDB'
)
# Expose collected metrics for Prometheus to scrape
@app.get("/metrics")
async def metrics_endpoint():
    return Response(content=generate_latest(), media_type="text/plain")

# Middleware for automatic metrics
@app.middleware("http")
async def metrics_middleware(request: Request, call_next):
method = request.method
endpoint = request.url.path
start = time.time()
try:
response = await call_next(request)
# Record metrics
duration = time.time() - start
http_requests_total.labels(
method=method,
endpoint=endpoint,
status=response.status_code
).inc()
http_request_duration_seconds.labels(
method=method,
endpoint=endpoint
).observe(duration)
return response
except Exception as e:
# Record error
llm_errors_total.labels(
error_type=type(e).__name__,
model="Foundation-Sec-8B"
).inc()
raise
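For Prometheus to collect these metrics, the /metrics endpoint exposed above has to be scraped. One option, assuming the Prometheus Operator is installed and the llm-service Service carries the app: llm-service label, is a ServiceMonitor; a minimal sketch:

# observability/llm-service-servicemonitor.yaml (illustrative)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: llm-service
  namespace: ai-soc
spec:
  selector:
    matchLabels:
      app: llm-service
  endpoints:
    - targetPort: 8000       # container port serving /metrics
      path: /metrics
      interval: 30s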
4.4 Log Aggregation (Structured JSON Logs)¶
# observability/structured_logging.py
import json
import logging
import sys
import time
import uuid
from datetime import datetime

from fastapi import Request

# `app` and `get_user_id()` are provided by the service entrypoint
class JSONFormatter(logging.Formatter):
"""Format logs as JSON for easy parsing"""
def format(self, record: logging.LogRecord) -> str:
log_data = {
"@timestamp": datetime.utcnow().isoformat(),
"level": record.levelname,
"logger": record.name,
"message": record.getMessage(),
"service": "ai-soc-llm-service",
"environment": "production"
}
# Add exception info if present
if record.exc_info:
log_data["exception"] = self.formatException(record.exc_info)
# Add custom fields
if hasattr(record, "user_id"):
log_data["user_id"] = record.user_id
if hasattr(record, "request_id"):
log_data["request_id"] = record.request_id
if hasattr(record, "duration_ms"):
log_data["duration_ms"] = record.duration_ms
return json.dumps(log_data)
# Configure logger
logger = logging.getLogger("ai-soc")
logger.setLevel(logging.INFO)
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JSONFormatter())
logger.addHandler(handler)
# Usage with request context
@app.middleware("http")
async def logging_middleware(request: Request, call_next):
request_id = str(uuid.uuid4())
request.state.request_id = request_id
start = time.time()
logger.info(
"Request started",
extra={
"request_id": request_id,
"method": request.method,
"path": request.url.path,
"user_id": get_user_id(request)
}
)
response = await call_next(request)
duration_ms = (time.time() - start) * 1000
logger.info(
"Request completed",
extra={
"request_id": request_id,
"status_code": response.status_code,
"duration_ms": duration_ms
}
)
return response
5. SLA/SLO/SLI Definitions¶
5.1 Service Level Indicators (SLIs)¶
What we measure:
| SLI | Description | Measurement |
|---|---|---|
| Availability | % of time service is reachable | (successful_requests / total_requests) * 100 |
| Latency (P95) | 95th percentile response time | histogram_quantile(0.95, http_request_duration_seconds) |
| Error Rate | % of requests returning errors | (error_requests / total_requests) * 100 |
| Throughput | Requests per second | rate(http_requests_total[1m]) |
| MTTD | Mean Time To Detect threats | Average time from alert creation to detection |
| MTTR | Mean Time To Respond | Average time from detection to response |
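These SLIs are cheaper to alert and report on when pre-computed as Prometheus recording rules rather than evaluated ad hoc. A sketch using the metric names from Section 4; the rule names are illustrative:

# observability/sli-recording-rules.yaml (illustrative)
groups:
  - name: ai-soc-sli
    interval: 1m
    rules:
      - record: sli:availability:ratio_5m
        expr: |
          sum(rate(http_requests_total{status=~"2.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
      - record: sli:latency:p95_5m
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
      - record: sli:error:ratio_5m
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))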
5.2 Service Level Objectives (SLOs)¶
What we promise internally:
# slo-definitions.yaml
slos:
availability:
target: 99.9% # 43 minutes downtime per month
measurement_window: 30d
error_budget: 0.1% # 43 minutes per month
latency_p95:
target: 2s # 95% of requests < 2s
measurement_window: 30d
error_rate:
target: 1% # <1% of requests fail
measurement_window: 30d
llm_inference_latency:
target: 3s # P95 inference < 3s
measurement_window: 7d
alert_processing_latency:
target: 30s # P95 alert analysis < 30s
measurement_window: 7d
mttd:
target: 2h # Detect threats within 2 hours
measurement_window: 30d
mttr:
target: 4h # Respond to high-severity threats within 4 hours
measurement_window: 30d
Prometheus Queries for SLOs:
# Availability SLO (99.9%)
(
sum(rate(http_requests_total{status=~"2.."}[30d]))
/
sum(rate(http_requests_total[30d]))
) * 100 > 99.9
# Latency P95 SLO (<2s)
histogram_quantile(0.95,
rate(http_request_duration_seconds_bucket[30d])
) < 2
# Error Rate SLO (<1%)
(
sum(rate(http_requests_total{status=~"5.."}[30d]))
/
sum(rate(http_requests_total[30d]))
) * 100 < 1
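The same expressions can drive alerting so that SLO violations page the on-call before the monthly report surfaces them. A hedged PrometheusRule sketch (assumes the Prometheus Operator; the 14.4x fast-burn factor is a common convention and should be tuned to the SLO window):

# slo/slo-alert-rules.yaml (illustrative)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ai-soc-slo-alerts
  namespace: ai-soc
spec:
  groups:
    - name: slo-burn
      rules:
        - alert: ErrorBudgetFastBurn
          expr: |
            (
              sum(rate(http_requests_total{status=~"5.."}[1h]))
              /
              sum(rate(http_requests_total[1h]))
            ) > (14.4 * 0.01)
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Error-rate SLO burning fast over the last hour"
        - alert: LatencyP95AboveSLO
          expr: |
            histogram_quantile(0.95,
              sum(rate(http_request_duration_seconds_bucket[30m])) by (le)) > 2
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "P95 latency above the 2s SLO"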
5.3 Service Level Agreements (SLAs)¶
What we promise customers:
# AI-SOC Service Level Agreement (SLA)
## Covered Services
- LLM-powered threat analysis
- Alert triage and prioritization
- Threat intelligence enrichment
- Security recommendations
## Availability Commitment
- **99.5% uptime** (3.6 hours downtime/month)
- Measured on a monthly basis
- Excludes planned maintenance windows (notified 7 days in advance)
## Performance Commitments
- **API Response Time**: 95% of requests complete within 5 seconds
- **Alert Analysis**: 95% of alerts analyzed within 60 seconds
- **Threat Detection**: High-severity threats detected within 4 hours
## Support Response Times
| Severity | First Response | Resolution Target |
|----------|---------------|-------------------|
| P0 - Critical (service down) | 15 minutes | 4 hours |
| P1 - High (degraded) | 1 hour | 24 hours |
| P2 - Medium | 4 hours | 72 hours |
| P3 - Low | 24 hours | 7 days |
## Service Credits
If we fail to meet our SLA commitments:
| Uptime Achievement | Service Credit |
|-------------------|---------------|
| < 99.5% but >= 99.0% | 10% of monthly fee |
| < 99.0% but >= 95.0% | 25% of monthly fee |
| < 95.0% | 50% of monthly fee |
## Exclusions
SLA does not apply to:
- Customer misconfigurations
- Third-party service failures (cloud provider outages)
- DDoS attacks or security incidents
- Planned maintenance (with notice)
- Beta features marked as "experimental"
## Measurement & Reporting
- Uptime calculated from successful health checks every 60 seconds
- Monthly SLA reports provided via customer dashboard
- Real-time status page: status.ai-soc.example.com
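The uptime measurement described above (a health-check probe every 60 seconds) can be implemented with the Prometheus blackbox_exporter. A minimal scrape-job sketch, assuming the exporter runs at blackbox-exporter:9115 and the API is reachable at api.ai-soc.local:

# observability/blackbox-uptime-scrape.yaml (illustrative)
scrape_configs:
  - job_name: ai-soc-uptime
    metrics_path: /probe
    params:
      module: [http_2xx]
    scrape_interval: 60s
    static_configs:
      - targets:
          - https://api.ai-soc.local/health
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115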
5.4 Error Budget Policy¶
# slo/error_budget_policy.py
import sys
class ErrorBudgetPolicy:
"""
Error budget determines how much risk we can take
99.9% SLO = 0.1% error budget = 43 minutes/month downtime
"""
def __init__(self, slo_target: float, window_days: int = 30):
self.slo_target = slo_target # e.g., 0.999 for 99.9%
self.error_budget = 1 - slo_target
self.window_seconds = window_days * 24 * 3600
def calculate_remaining_budget(self, current_availability: float) -> dict:
"""Calculate remaining error budget"""
# Time spent in error state
error_rate = 1 - current_availability
error_budget_consumed = error_rate / self.error_budget
# Time remaining
remaining_budget = 1 - error_budget_consumed
# Time in seconds
budget_seconds = self.window_seconds * self.error_budget
consumed_seconds = budget_seconds * error_budget_consumed
remaining_seconds = budget_seconds * remaining_budget
return {
"error_budget": self.error_budget,
"consumed_pct": error_budget_consumed * 100,
"remaining_pct": remaining_budget * 100,
"consumed_seconds": consumed_seconds,
"remaining_seconds": remaining_seconds,
"status": self.get_status(error_budget_consumed)
}
def get_status(self, consumed: float) -> str:
"""Determine deployment policy based on error budget"""
if consumed < 0.5:
return "HEALTHY - Safe to deploy"
elif consumed < 0.75:
return "WARNING - Slow down deployments"
elif consumed < 1.0:
return "CRITICAL - Freeze non-critical deployments"
else:
return "EXHAUSTED - Emergency freeze, focus on reliability"
# Usage
policy = ErrorBudgetPolicy(slo_target=0.999, window_days=30)
current_availability = 0.9985 # 99.85% (below 99.9% SLO)
budget_status = policy.calculate_remaining_budget(current_availability)
print(f"Error budget consumed: {budget_status['consumed_pct']:.2f}%")
print(f"Status: {budget_status['status']}")
# If budget exhausted, block risky changes
if budget_status['consumed_pct'] > 75:
print("⚠️ Deployment blocked due to error budget policy")
sys.exit(1)
6. Production Checklist¶
# AI-SOC Production Deployment Checklist
## High Availability
- [ ] Multi-zone Kubernetes cluster (3+ zones)
- [ ] Multi-master control plane (3+ masters)
- [ ] Pod anti-affinity configured
- [ ] Topology spread constraints applied
- [ ] HPA configured for dynamic scaling
- [ ] VPA configured for resource optimization
- [ ] Pod Disruption Budgets defined
- [ ] Load balancer health checks configured
- [ ] Database clustering (OpenSearch 3+ masters)
- [ ] Network redundancy (multiple subnets/AZs)
## Disaster Recovery
- [ ] Velero backups automated (daily + weekly)
- [ ] OpenSearch snapshots to S3 (daily)
- [ ] Backup retention policy defined (30/90 days)
- [ ] DR runbooks documented
- [ ] RTO/RPO targets defined and tested
- [ ] Cross-region backup replication
- [ ] DR testing scheduled (monthly)
- [ ] Recovery procedures validated
## Deployment Strategy
- [ ] Blue-green deployment pipeline configured
- [ ] Canary deployment automation ready
- [ ] Rollback procedures tested
- [ ] Smoke tests automated
- [ ] Feature flags implemented
- [ ] Database migration strategy defined
- [ ] Zero-downtime deployment verified
## Observability
- [ ] OpenTelemetry instrumentation complete
- [ ] Distributed tracing enabled
- [ ] Prometheus metrics exported
- [ ] Grafana dashboards created
- [ ] Log aggregation (OpenSearch/ELK)
- [ ] Structured JSON logging
- [ ] Alert rules configured
- [ ] On-call rotation established
- [ ] Incident response playbooks created
## SLA/SLO/SLI
- [ ] SLIs defined and measured
- [ ] SLOs set with error budgets
- [ ] SLAs documented for customers
- [ ] Error budget policy enforced
- [ ] Service status page public
- [ ] Monthly SLA reports automated
- [ ] Performance baselines established
## Security (from security-hardening.md)
- [ ] OAuth2 authentication enabled
- [ ] MFA enforced for admins
- [ ] Secrets in HashiCorp Vault
- [ ] Rate limiting configured
- [ ] Network segmentation applied
- [ ] TLS 1.3 enforced
- [ ] Audit logging enabled
- [ ] Security scanning automated
## Performance (from performance-optimization.md)
- [ ] LLM quantization enabled
- [ ] vLLM continuous batching
- [ ] ChromaDB HNSW tuned
- [ ] OpenSearch indexing optimized
- [ ] Docker resources limited
- [ ] Resource requests/limits set
- [ ] Performance benchmarks met
## Documentation
- [ ] Architecture diagrams updated
- [ ] API documentation complete
- [ ] Runbooks for common scenarios
- [ ] Troubleshooting guides
- [ ] On-call procedures
- [ ] Change management process
7. Conclusion¶
This production deployment guide provides a comprehensive framework for deploying AI-SOC with enterprise-grade reliability:
- 99.99% availability through multi-zone HA architecture
- <1 hour RTO, <24 hour RPO disaster recovery
- Zero-downtime deployments with blue-green and canary strategies
- Comprehensive observability with OpenTelemetry, Prometheus, and structured logging
- Customer-facing SLAs with 99.5% uptime commitment
Recommended Implementation Timeline:

- Weeks 1-4: HA architecture setup (Kubernetes multi-zone, database clustering)
- Weeks 5-6: Disaster recovery (backups, DR testing)
- Weeks 7-8: Deployment automation (blue-green, canary pipelines)
- Weeks 9-10: Observability (OpenTelemetry, dashboards, alerts)
- Weeks 11-12: SLA/SLO implementation, final testing, production cutover
Document Version: 1.0
Last Updated: 2025-10-22
Author: The Didact (AI Research Specialist)
Classification: Internal Use