Performance Optimization Guide for AI-SOC¶
Executive Summary¶
This guide provides comprehensive strategies for optimizing AI-SOC performance across LLM inference, vector databases, log management, and infrastructure. Based on 2025 industry best practices and production case studies, these optimizations can achieve:
- 67.8% latency reduction for LLM inference
- 4.2x throughput improvement with advanced techniques
- 75% memory reduction through quantization
- 2-5x speedup with KV cache optimization
- 70-90% cost reduction through efficient resource management
Table of Contents¶
- LLM Inference Optimization
- ChromaDB Performance Tuning
- OpenSearch Optimization
- Docker Resource Optimization
- Kubernetes Scaling Strategies
- Performance Benchmarking
- Production Case Studies
1. LLM Inference Optimization¶
1.1 Model Quantization¶
Overview: Quantization converts model weights from higher precision (FP32/FP16) to lower precision (INT8/INT4), reducing memory usage and increasing inference speed with minimal accuracy loss.
Impact:
- INT8: 2x memory reduction, ~1.5x speedup, negligible quality degradation
- INT4: 4x memory reduction, ~2x speedup, minor quality drop (acceptable for most tasks)
Implementation:
```python
# quantization/quantize_model.py
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

def load_quantized_model(model_name: str, quantization: str = "int8"):
    """
    Load a model with quantization for efficient inference.

    Args:
        model_name: HuggingFace model identifier
        quantization: "int8", "int4", or "fp16"
    """
    if quantization == "int8":
        quantization_config = BitsAndBytesConfig(
            load_in_8bit=True,
            llm_int8_threshold=6.0,
            llm_int8_has_fp16_weight=False
        )
    elif quantization == "int4":
        quantization_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.float16,
            bnb_4bit_quant_type="nf4",  # Normal Float 4
            bnb_4bit_use_double_quant=True
        )
    else:
        quantization_config = None

    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=quantization_config,
        device_map="auto",
        torch_dtype=torch.float16 if quantization == "fp16" else "auto"
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return model, tokenizer

# Usage for Foundation-Sec-8B
model, tokenizer = load_quantized_model(
    "fdtn-ai/Foundation-Sec-8B",
    quantization="int4"  # 4x memory reduction
)

# Inference
def analyze_threat(threat_description: str):
    inputs = tokenizer(threat_description, return_tensors="pt").to(model.device)
    with torch.inference_mode():  # Faster than torch.no_grad()
        outputs = model.generate(
            **inputs,
            max_new_tokens=256,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            use_cache=True  # Enable KV caching
        )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```
1.2 KV Cache Optimization¶
Overview: KV caching stores key-value tensors from previous tokens, eliminating redundant computation during autoregressive generation.
Impact:
- 2-5x speedup for multi-turn conversations
- 75% memory reduction with INT8 KV cache quantization
- Prefix caching: 90%+ reduction for shared prompts
Implementation:
```python
# kv_cache/optimized_inference.py
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

class KVCacheOptimizedInference:
    def __init__(self, model_name: str):
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            device_map="auto",
            torch_dtype=torch.float16
        )
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        # Shared system prompt for all users (prefix caching)
        self.system_prompt = """You are a cybersecurity analyst assistant.
Your role is to analyze security alerts and provide actionable insights.
Always be concise, accurate, and security-focused."""
        # Cache system prompt KV
        self.system_kv_cache = self._compute_system_cache()

    def _compute_system_cache(self):
        """Pre-compute the KV cache for the system prompt (reused across all requests)"""
        inputs = self.tokenizer(
            self.system_prompt,
            return_tensors="pt"
        ).to(self.model.device)
        with torch.inference_mode():
            outputs = self.model(**inputs, use_cache=True, return_dict=True)
        # Store past_key_values for reuse
        return outputs.past_key_values

    def generate_response(self, user_query: str, conversation_history=None):
        """
        Generate a response with KV cache optimization.

        Args:
            user_query: User's question/prompt
            conversation_history: Optional list of past exchanges
        """
        # Reuse a copy of the system prompt cache (generate() mutates the cache in place)
        past_key_values = copy.deepcopy(self.system_kv_cache)

        # Build the full prompt; it must start with the system prompt so the
        # token positions line up with the cached keys/values
        full_prompt = self.system_prompt
        if conversation_history:
            full_prompt += "\n" + "\n".join(
                f"User: {ex['user']}\nAssistant: {ex['assistant']}"
                for ex in conversation_history
            )
        full_prompt += f"\nUser: {user_query}\nAssistant:"

        inputs = self.tokenizer(full_prompt, return_tensors="pt").to(self.model.device)
        with torch.inference_mode():
            outputs = self.model.generate(
                **inputs,
                past_key_values=past_key_values,  # Reuse cached KV for the shared prefix
                max_new_tokens=256,
                use_cache=True,
                do_sample=True,
                temperature=0.7
            )
        response = self.tokenizer.decode(
            outputs[0][inputs.input_ids.shape[1]:],
            skip_special_tokens=True
        )
        return response

# Usage
llm = KVCacheOptimizedInference("fdtn-ai/Foundation-Sec-8B")
# First call: the system prompt cache is already computed
response1 = llm.generate_response("What is a phishing attack?")
# Subsequent calls: reuse the system prompt cache (large savings for the shared prefix)
response2 = llm.generate_response("How do I detect ransomware?")
```
1.3 Continuous Batching with vLLM¶
Overview: Traditional batching waits for all sequences to complete. Continuous batching allows new requests to join mid-flight and completed sequences to leave immediately, maximizing GPU utilization.
Impact:
- 2.7x throughput improvement (vLLM v0.6.0 benchmark)
- 5x latency reduction for time-to-first-token
- Near 100% GPU utilization
Implementation:
```python
# vllm_server/deployment.py
from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

# Initialize vLLM with optimized settings
engine_args = AsyncEngineArgs(
    model="fdtn-ai/Foundation-Sec-8B",
    tensor_parallel_size=2,        # Use 2 GPUs
    dtype="float16",
    max_num_seqs=256,              # Continuous batching: handle 256 concurrent requests
    max_num_batched_tokens=4096,
    enable_prefix_caching=True,    # Enable prefix caching
    gpu_memory_utilization=0.90,   # Use 90% of GPU memory
    quantization="awq",            # Activation-aware Weight Quantization
)
engine = AsyncLLMEngine.from_engine_args(engine_args)

async def generate_streaming(prompt: str, request_id: str):
    """
    Streaming generation with continuous batching.
    vLLM automatically batches this with other concurrent requests.
    """
    sampling_params = SamplingParams(
        temperature=0.7,
        top_p=0.9,
        max_tokens=256
    )
    results_generator = engine.generate(prompt, sampling_params, request_id)

    # Stream cumulative output as it is generated
    async for request_output in results_generator:
        yield request_output.outputs[0].text
        if request_output.finished:
            break

# FastAPI integration
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.post("/v1/analyze")
async def analyze_threat_streaming(prompt: str, request_id: str):
    """
    Streaming endpoint with continuous batching.
    Multiple concurrent requests are automatically batched by vLLM.
    """
    return StreamingResponse(
        generate_streaming(prompt, request_id),
        media_type="text/event-stream"
    )
```
Docker Deployment:
```dockerfile
# Dockerfile.vllm
FROM vllm/vllm-openai:latest

# Install additional dependencies
RUN pip install fastapi uvicorn prometheus-client

# Copy application code
COPY ./vllm_server /app

# Expose ports
EXPOSE 8000

# Start vLLM server with optimized settings
CMD ["python", "-m", "vllm.entrypoints.openai.api_server", \
     "--model", "fdtn-ai/Foundation-Sec-8B", \
     "--tensor-parallel-size", "2", \
     "--max-num-seqs", "256", \
     "--enable-prefix-caching", \
     "--gpu-memory-utilization", "0.9"]
```
1.4 Speculative Decoding¶
Overview: Use a smaller "draft" model to generate candidate tokens, then verify with the larger target model in parallel. Achieves 2-3x speedup.
Impact:
- 2-3x inference speedup
- Same quality as the target model (verification ensures correctness)
- Best for: long-form generation (>256 tokens)
```python
# speculative_decoding/inference.py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

class SpeculativeDecoding:
    def __init__(self, target_model: str, draft_model: str):
        # Large target model (Foundation-Sec-8B)
        self.target_model = AutoModelForCausalLM.from_pretrained(
            target_model,
            torch_dtype=torch.float16,
            device_map="cuda:0"
        )
        # Small draft model (Foundation-Sec-1B or similar)
        self.draft_model = AutoModelForCausalLM.from_pretrained(
            draft_model,
            torch_dtype=torch.float16,
            device_map="cuda:1"
        )
        self.tokenizer = AutoTokenizer.from_pretrained(target_model)

    def generate(self, prompt: str, max_tokens: int = 256, lookahead: int = 5):
        """
        Speculative decoding with draft model + verification.

        Args:
            prompt: Input prompt
            max_tokens: Maximum tokens to generate
            lookahead: How many tokens the draft model generates ahead
        """
        input_ids = self.tokenizer(prompt, return_tensors="pt").input_ids.to("cuda:0")
        generated = input_ids

        for _ in range(0, max_tokens, lookahead):
            # Step 1: Draft model generates K tokens quickly
            draft_input = generated.to("cuda:1")
            with torch.inference_mode():
                draft_outputs = self.draft_model.generate(
                    draft_input,
                    max_new_tokens=lookahead,
                    do_sample=False  # Greedy for speed
                )
            # Move candidates to the target device before comparing
            candidate_tokens = draft_outputs[0][generated.shape[1]:].to("cuda:0")

            # Step 2: Target model verifies all candidates in a single forward pass
            verify_input = torch.cat([generated, candidate_tokens.unsqueeze(0)], dim=1)
            with torch.inference_mode():
                target_logits = self.target_model(verify_input).logits

            # Step 3: Accept tokens that match the target model's predictions
            accepted = 0
            for i in range(len(candidate_tokens)):
                target_prediction = target_logits[0, generated.shape[1] + i - 1].argmax()
                if target_prediction == candidate_tokens[i]:
                    accepted += 1
                else:
                    break

            # Append accepted tokens
            generated = torch.cat(
                [generated, candidate_tokens[:accepted].unsqueeze(0)], dim=1
            )

            if accepted < lookahead:
                # Draft diverged: add the target model's corrected token and continue
                corrected_token = (
                    target_logits[0, generated.shape[1] - 1].argmax().reshape(1, 1)
                )
                generated = torch.cat([generated, corrected_token], dim=1)

        return self.tokenizer.decode(generated[0], skip_special_tokens=True)

# Usage
speculative_llm = SpeculativeDecoding(
    target_model="fdtn-ai/Foundation-Sec-8B",
    draft_model="fdtn-ai/Foundation-Sec-1B"  # Hypothetical smaller model
)
result = speculative_llm.generate("Explain how SQL injection works:", max_tokens=512)
```
1.5 Flash Attention 2¶
Overview: Optimized attention mechanism reducing memory and computation.
Impact:
- 2-4x faster attention computation
- Lower memory use for long contexts
- Supports sequences up to 32k tokens
```python
# Install flash-attention first:
#   pip install flash-attn --no-build-isolation
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "fdtn-ai/Foundation-Sec-8B",
    torch_dtype=torch.float16,
    device_map="auto",
    attn_implementation="flash_attention_2"  # Enable Flash Attention 2
)
```
1.6 Model Compilation with torch.compile()¶
PyTorch 2.0+ feature: Compile model for optimized execution.
Impact:
- 10-30% inference speedup
- Automatic kernel fusion and optimization
```python
import torch
from transformers import AutoModelForCausalLM

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "fdtn-ai/Foundation-Sec-8B",
    torch_dtype=torch.float16,
    device_map="auto"
)

# Compile model (PyTorch 2.0+)
model = torch.compile(model, mode="reduce-overhead")

# The first inference is slow (compilation);
# subsequent inferences are 10-30% faster.
```
2. ChromaDB Performance Tuning¶
2.1 HNSW Index Configuration¶
Overview: Hierarchical Navigable Small World (HNSW) is ChromaDB's default indexing algorithm. Tuning its parameters balances accuracy vs speed.
Key Parameters:
| Parameter | Description | Impact | Recommended Value |
|---|---|---|---|
| `hnsw:construction_ef` | Candidate-list size during index construction | Higher = better recall, slower indexing | 200 (default: 100) |
| `hnsw:M` | Max neighbors per node | Higher = better recall, more memory | 16 (default: 16) |
| `hnsw:search_ef` | Candidate-list size explored per query | Higher = better recall, slower search | 100 (default: 10) |
| `hnsw:batch_size` | Buffering for batch inserts | Higher = faster bulk inserts | 1000 |
Implementation:
```python
# chromadb_config/optimized_collection.py
import chromadb
from chromadb.config import Settings

# Initialize ChromaDB with optimized settings
client = chromadb.Client(Settings(
    chroma_db_impl="duckdb+parquet",  # Persistent storage with Parquet
    persist_directory="./chroma_data",
    anonymized_telemetry=False
))

# Create collection with HNSW tuning
collection = client.create_collection(
    name="threat_intelligence",
    metadata={
        # HNSW parameters for high-accuracy search
        "hnsw:construction_ef": 200,  # Better recall during indexing
        "hnsw:M": 16,                 # Balanced memory/accuracy
        "hnsw:search_ef": 100,        # High search accuracy
        "hnsw:batch_size": 1000,      # Fast batch inserts
        "hnsw:sync_threshold": 1000   # Sync to disk every 1000 adds
    }
)

# Batch insert for optimal performance
def batch_insert_embeddings(documents: list, embeddings: list, metadatas: list):
    """
    Insert embeddings in batches for optimal performance.
    ChromaDB performs best with batches of 1000-5000 documents.
    """
    batch_size = 1000
    for i in range(0, len(documents), batch_size):
        batch_docs = documents[i:i+batch_size]
        batch_embeddings = embeddings[i:i+batch_size]
        batch_metadatas = metadatas[i:i+batch_size]
        collection.add(
            documents=batch_docs,
            embeddings=batch_embeddings,
            metadatas=batch_metadatas,
            ids=[f"doc_{j}" for j in range(i, i + len(batch_docs))]
        )

# Usage
batch_insert_embeddings(threat_docs, threat_embeddings, threat_metadata)
```
2.2 Embedding Model Optimization¶
Overview: Faster embedding models significantly improve ingestion and query speed.
Benchmark (256-token documents):
| Model | Dimensions | Speed (docs/sec) | Quality | Recommendation |
|---|---|---|---|---|
| OpenAI text-embedding-3-small | 1536 | 500 | Excellent | Production |
| nomic-embed-text | 768 | 2000 | Excellent | Best for AI-SOC |
| all-MiniLM-L6-v2 | 384 | 5000 | Good | Fast, lower quality |
| BGE-small-en-v1.5 | 384 | 3000 | Very Good | Balanced |
Implementation with Ollama (Local):
```python
# embeddings/optimized_embedding.py
import ollama
import numpy as np

class FastEmbedding:
    def __init__(self, model: str = "nomic-embed-text"):
        """
        Use Ollama for fast local embeddings.
        nomic-embed-text: ~2000 docs/sec, 768 dimensions
        """
        self.model = model

    def embed_documents(self, documents: list[str]) -> np.ndarray:
        """Embed a batch of documents (one Ollama call per document)"""
        embeddings = []
        for doc in documents:
            response = ollama.embeddings(model=self.model, prompt=doc)
            embeddings.append(response["embedding"])
        return np.array(embeddings)

    def embed_query(self, query: str) -> list:
        """Embed a single query"""
        response = ollama.embeddings(model=self.model, prompt=query)
        return response["embedding"]

    def __call__(self, input: list[str]) -> list[list[float]]:
        """ChromaDB embedding-function interface: list of texts in, embeddings out"""
        return self.embed_documents(input).tolist()

# Usage with ChromaDB
embedding_function = FastEmbedding("nomic-embed-text")
collection = client.create_collection(
    name="threat_intelligence",
    embedding_function=embedding_function,  # Callable: list[str] -> embeddings
    metadata={"hnsw:search_ef": 100}
)
# Significantly faster than the default ChromaDB embedding
```
2.3 Query Optimization¶
Best Practices:
```python
# Optimize query performance
def optimized_semantic_search(query: str, n_results: int = 10):
    """
    Optimized semantic search with ChromaDB.

    Tips:
    1. Use where filters to reduce the search space
    2. Request only the fields you need
    3. Use an appropriate n_results (larger = slower)
    """
    results = collection.query(
        query_texts=[query],
        n_results=n_results,
        # Metadata filtering reduces the search space dramatically.
        # Multiple conditions are combined with $and, and range operators
        # compare numbers, so timestamps are stored as epoch seconds here.
        where={
            "$and": [
                {"severity": {"$in": ["high", "critical"]}},
                {"timestamp": {"$gte": 1759276800}}  # 2025-10-01 as epoch seconds
            ]
        },
        # Only retrieve needed fields (faster);
        # don't include embeddings unless you need them
        include=["documents", "metadatas", "distances"]
    )
    return results

# Advanced: pre-filtering with IVF.
# For very large datasets (>1M vectors), consider an IVF index.
# ChromaDB doesn't support IVF yet, but you can use FAISS.
```
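To make the IVF idea concrete, here is a minimal, library-free sketch of the two-stage search it performs (coarse centroids, then exhaustive search only inside the closest buckets). The names, toy data, and centroid selection are ours for illustration; a production setup would use FAISS's `IndexIVFFlat` with trained centroids instead:

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.standard_normal((2_000, 32)).astype(np.float32)

# Build: pick coarse centroids and assign every vector to its nearest one
n_lists = 16
centroids = vectors[rng.choice(len(vectors), n_lists, replace=False)]
assignments = np.argmin(
    ((vectors[:, None, :] - centroids[None, :, :]) ** 2).sum(-1), axis=1
)
inverted_lists = {c: np.where(assignments == c)[0] for c in range(n_lists)}

def ivf_search(query: np.ndarray, k: int = 10, n_probe: int = 4) -> np.ndarray:
    """Search only the n_probe buckets whose centroids are closest to the query."""
    dists_to_centroids = ((centroids - query) ** 2).sum(-1)
    probe = np.argsort(dists_to_centroids)[:n_probe]
    candidates = np.concatenate([inverted_lists[c] for c in probe])
    dists = ((vectors[candidates] - query) ** 2).sum(-1)
    return candidates[np.argsort(dists)[:k]]

hits = ivf_search(vectors[0], k=10)  # Exact match is in the probed bucket
```

With `n_probe` buckets out of `n_lists`, each query scans only a fraction of the vectors, which is where the speedup over brute-force search comes from.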
2.4 Data Preprocessing¶
```python
import re

def preprocess_documents(documents: list[str]) -> list[str]:
    """
    Preprocessing improves search quality and reduces index size:
    1. Normalize text
    2. Remove redundancy
    3. Truncate to a reasonable length
    """
    processed = []
    for doc in documents:
        # Lowercase normalization
        doc = doc.lower()
        # Collapse extra whitespace
        doc = re.sub(r'\s+', ' ', doc)
        # Truncate long documents (embedding models have token limits);
        # nomic-embed-text accepts 8192 tokens, but ~512 is optimal for search
        words = doc.split()
        if len(words) > 512:
            doc = ' '.join(words[:512])
        processed.append(doc.strip())
    return processed
```
2.5 Persistent Storage Optimization¶
```python
# Use Parquet for efficient storage
client = chromadb.Client(Settings(
    chroma_db_impl="duckdb+parquet",   # Much faster than SQLite
    persist_directory="./chroma_data",
    # Performance tuning: local (in-process) mode, no server round-trips
    chroma_server_grpc_port=None,
    chroma_server_http_port=None,
    # Resource limits
    chroma_memory_limit_bytes=8 * 1024 * 1024 * 1024,  # 8GB RAM limit
))

# Periodic persistence
collection.add(documents, embeddings, metadatas, ids)
client.persist()  # Flush in-memory state to disk
```
3. OpenSearch Optimization¶
3.1 Hardware & Instance Selection¶
Recommendations for AI-SOC Log Management:
| Workload | Instance Type (AWS) | vCPU | RAM | Storage | Notes |
|---|---|---|---|---|---|
| Ingestion-Heavy | OR1.large | 2 | 16GB | 500GB SSD | Log ingestion, cost-effective |
| Search-Heavy | r6gd.2xlarge | 8 | 64GB | 474GB NVMe | Instance store for speed |
| Balanced | r6g.xlarge | 4 | 32GB | EBS gp3 | General purpose |
Java Heap Sizing:
```yaml
# opensearch.yml
bootstrap.memory_lock: true
```

```yaml
# In docker-compose or systemd
environment:
  - "OPENSEARCH_JAVA_OPTS=-Xms16g -Xmx16g"  # 50% of 32GB RAM
```
Rule: Set heap to 50% of available RAM (max 32GB even if you have more RAM).
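A tiny helper (ours, not part of any OpenSearch tooling) makes the rule concrete:

```python
def opensearch_heap_gb(ram_gb: int) -> int:
    """Heap = 50% of RAM, capped at 32GB (the compressed-oops limit)."""
    return min(ram_gb // 2, 32)

print(opensearch_heap_gb(32))   # 16
print(opensearch_heap_gb(128))  # 32 (cap applies)
```

The 32GB cap matters: beyond it the JVM loses compressed object pointers, so a larger heap can actually hold fewer objects.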
3.2 Indexing Performance Tuning¶
Bulk Indexing Optimization:
```python
# opensearch_ingest/optimized_bulk.py
from opensearchpy import OpenSearch, helpers

def bulk_index_logs(os_client: OpenSearch, logs: list[dict], index: str):
    """
    Optimized bulk indexing for high-volume log ingestion.

    Best practices:
    1. Size batches by bytes (5-15MB), not document count
    2. Use helpers.parallel_bulk for multi-threading
    3. Disable refresh during bulk operations
    """
    # Prepare actions
    actions = [{"_index": index, "_source": log} for log in logs]

    # Bulk insert with optimal settings
    success, errors = helpers.bulk(
        os_client,
        actions,
        chunk_size=5000,                   # Documents per batch
        max_chunk_bytes=10 * 1024 * 1024,  # 10MB max per batch
        request_timeout=60,
        raise_on_error=False,
        stats_only=False
    )
    print(f"Indexed {success} documents, {len(errors)} failed")
    return success, errors

# For extreme throughput: parallel bulk
def parallel_bulk_index(os_client: OpenSearch, logs: list[dict], index: str):
    """Multi-threaded bulk indexing (2-3x faster)"""
    actions = [{"_index": index, "_source": log} for log in logs]
    for success, info in helpers.parallel_bulk(
        os_client,
        actions,
        thread_count=4,   # 4 parallel threads
        chunk_size=5000,
        max_chunk_bytes=10 * 1024 * 1024
    ):
        if not success:
            print(f"Failed: {info}")
```
Index Settings for Write Performance:
```json
{
  "settings": {
    "index": {
      "number_of_shards": 5,
      "number_of_replicas": 1,
      "refresh_interval": "30s",
      "translog": {
        "flush_threshold_size": "2gb",
        "durability": "async"
      },
      "merge": {
        "scheduler": {
          "max_thread_count": 1
        }
      }
    }
  }
}
```
Explanation:
- refresh_interval: 30s - Reduce refresh frequency (default 1s) for faster ingestion
- translog.flush_threshold_size: 2gb - Larger translog = fewer flushes
- translog.durability: async - Don't wait for fsync (faster, slight data loss risk)
Disable refresh during bulk operations:
```python
# Temporarily disable refresh for massive bulk operations
os_client.indices.put_settings(
    index="logs-*",
    body={"index": {"refresh_interval": "-1"}}
)

# Perform bulk indexing
bulk_index_logs(os_client, massive_log_batch, "logs-2025-10")

# Re-enable refresh
os_client.indices.put_settings(
    index="logs-*",
    body={"index": {"refresh_interval": "30s"}}
)

# Manual refresh to make the new documents searchable immediately
os_client.indices.refresh(index="logs-2025-10")
```
3.3 Shard Management¶
Shard Sizing Best Practices:
- Target shard size: 10-50GB per shard
- Avoid too many small shards (per-shard overhead) or too few large shards (poor balance)
Calculate optimal shard count:
```python
def calculate_optimal_shards(daily_log_volume_gb: int, retention_days: int) -> int:
    """
    Calculate an optimal shard count for time-series log data.

    Example: 100GB/day, 90-day retention
      Total: 9000GB = 9TB
      Shards: 9000GB / 30GB per shard = 300 shards
    """
    total_data_gb = daily_log_volume_gb * retention_days
    target_shard_size_gb = 30  # Sweet spot: 30GB
    optimal_shards = max(1, total_data_gb // target_shard_size_gb)
    return optimal_shards

# Example: AI-SOC logs
daily_volume = 50   # 50GB per day
retention = 90      # 90 days
optimal = calculate_optimal_shards(daily_volume, retention)
print(f"Recommended shards: {optimal}")  # 150 shards
```
Use Index Templates for Time-Series Data:
```python
# Create an index template for logs
index_template = {
    "index_patterns": ["logs-*"],
    "template": {
        "settings": {
            "number_of_shards": 5,       # Per-day shards
            "number_of_replicas": 1,
            "refresh_interval": "30s",
            "codec": "best_compression"  # Reduce storage by ~30%
        },
        "mappings": {
            "properties": {
                "@timestamp": {"type": "date"},
                "message": {"type": "text"},
                "severity": {"type": "keyword"},
                "source_ip": {"type": "ip"},
                "event_type": {"type": "keyword"}
            }
        }
    }
}

os_client.indices.put_index_template(
    name="logs-template",
    body=index_template
)
```
3.4 Query Optimization¶
Use Filters Instead of Queries (Cached & Faster):
```python
# SLOWER: scored query (computes relevance, results are not cached)
slow_query = {
    "query": {
        "match": {
            "severity": "high"
        }
    }
}

# FAST: filter context (no scoring, results are cached)
fast_query = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"severity": "high"}},
                {"range": {"@timestamp": {"gte": "now-1h"}}}
            ]
        }
    }
}
```
Avoid Leading Wildcards:
```python
# VERY SLOW: a leading wildcard forces a scan of every term in the index
bad_query = {"query": {"wildcard": {"message": "*error*"}}}

# FAST: use an ngram tokenizer or plain term/match queries
good_query = {"query": {"match": {"message": "error"}}}
```
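If substring matching is genuinely required, an ngram analyzer at index time keeps queries fast by paying the cost once, during indexing. A sketch of the index settings (the analyzer and tokenizer names are ours):

```json
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "trigram_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3,
          "token_chars": ["letter", "digit"]
        }
      },
      "analyzer": {
        "trigram_analyzer": {
          "type": "custom",
          "tokenizer": "trigram_tokenizer",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "message": {
        "type": "text",
        "analyzer": "trigram_analyzer"
      }
    }
  }
}
```

A plain `match` query on `message` then finds substrings like `error` inside longer tokens without any wildcard scan, at the cost of a larger index.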
Use _source Filtering:
```python
# Retrieve only the fields you need (faster)
results = os_client.search(
    index="logs-*",
    body={
        "query": {"match_all": {}},
        "_source": ["@timestamp", "message", "severity"],  # Only these fields
        "size": 100
    }
)
```
3.5 Force Merge for Read-Heavy Indices¶
Background: Over time, segments accumulate. Force merge consolidates them for faster searches.
```python
# Force merge old, read-only indices
os_client.indices.forcemerge(
    index="logs-2025-09",  # Old index
    max_num_segments=1,    # Merge down to a single segment
    request_timeout=300
)
```
Automate this with lifecycle management (the ILM-style policy below illustrates the phases; OpenSearch's native equivalent is Index State Management):
```json
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "50GB",
            "max_age": "1d"
          }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "forcemerge": {
            "max_num_segments": 1
          },
          "shrink": {
            "number_of_shards": 1
          }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "allocate": {
            "require": {
              "box_type": "cold"
            }
          }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
```
3.6 Monitoring Slow Queries¶
```yaml
# opensearch.yml
index.search.slowlog.threshold.query.warn: 10s
index.search.slowlog.threshold.query.info: 5s
index.search.slowlog.threshold.query.debug: 2s
index.indexing.slowlog.threshold.index.warn: 10s
index.indexing.slowlog.threshold.index.info: 5s
```
Queries that exceed these thresholds are written to the node's slow log files (e.g. `*_index_search_slowlog.log` in the OpenSearch logs directory), where they can be tailed or shipped like any other log.
4. Docker Resource Optimization¶
4.1 Resource Limits¶
docker-compose.yml with Optimized Resource Allocation:
```yaml
version: '3.8'

services:
  llm-service:
    image: ai-soc-llm:latest
    deploy:
      resources:
        limits:
          cpus: '4.0'
          memory: 16G
        reservations:
          cpus: '2.0'
          memory: 8G
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s

  opensearch:
    image: opensearchproject/opensearch:2.11.0
    deploy:
      resources:
        limits:
          cpus: '4.0'
          memory: 32G
        reservations:
          cpus: '2.0'
          memory: 16G
    environment:
      - "OPENSEARCH_JAVA_OPTS=-Xms16g -Xmx16g"
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536
        hard: 65536

  chromadb:
    image: chromadb/chroma:latest
    deploy:
      resources:
        limits:
          cpus: '2.0'
          memory: 8G
        reservations:
          cpus: '1.0'
          memory: 4G

  redis:
    image: redis:7-alpine
    deploy:
      resources:
        limits:
          cpus: '1.0'
          memory: 2G
        reservations:
          cpus: '0.5'
          memory: 1G
    command: redis-server --maxmemory 1gb --maxmemory-policy allkeys-lru
```
4.2 Multi-Stage Builds¶
Reduce image size by 80%+:
```dockerfile
# Dockerfile.llm (Optimized Multi-Stage Build)
# Stage 1: Builder
FROM python:3.11-slim AS builder

WORKDIR /build

# Install build dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    gcc g++ \
    && rm -rf /var/lib/apt/lists/*

# Install Python dependencies
COPY requirements.txt .
RUN pip install --user --no-cache-dir -r requirements.txt

# Stage 2: Runtime (slim)
FROM python:3.11-slim

WORKDIR /app

# Create the non-root user first so the copied packages land in its home
RUN useradd -m -u 1000 appuser

# Copy only the installed packages from the builder into the runtime user's home
COPY --from=builder --chown=appuser:appuser /root/.local /home/appuser/.local

# Copy application code
COPY --chown=appuser:appuser ./app /app

USER appuser

# Put the user-level scripts on PATH
ENV PATH=/home/appuser/.local/bin:$PATH

# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=40s --retries=3 \
    CMD curl -f http://localhost:8000/health || exit 1

EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "4"]
```
Result: Image size reduced from ~2GB to ~400MB
4.3 Layer Caching Optimization¶
```dockerfile
# Optimize layer caching by ordering from least to most frequently changed

# 1. Install system dependencies (rarely changes)
FROM python:3.11-slim
RUN apt-get update && apt-get install -y curl

# 2. Install Python dependencies (changes occasionally)
COPY requirements.txt .
RUN pip install -r requirements.txt

# 3. Copy application code (changes frequently)
COPY ./app /app

# This ordering maximizes cache hits during rebuilds
```
5. Kubernetes Scaling Strategies¶
5.1 Horizontal Pod Autoscaler (HPA)¶
Auto-scale based on CPU/Memory or custom metrics:
```yaml
# hpa-llm-service.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-service
  minReplicas: 3
  maxReplicas: 20
  metrics:
    # Scale based on CPU
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    # Scale based on a custom metric (requests per second)
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "1000"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5 min before scaling down
      policies:
        - type: Percent
          value: 50                    # Remove at most 50% of pods at once
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0    # Scale up immediately
      policies:
        - type: Percent
          value: 100                   # Double pods if needed
          periodSeconds: 15
```
5.2 Vertical Pod Autoscaler (VPA)¶
Automatically adjust resource requests/limits:
```yaml
# vpa-llm-service.yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: llm-service-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-service
  updatePolicy:
    updateMode: "Auto"  # Automatically apply recommendations
  resourcePolicy:
    containerPolicies:
      - containerName: llm-container
        minAllowed:
          cpu: 1
          memory: 4Gi
        maxAllowed:
          cpu: 8
          memory: 32Gi
        controlledResources: ["cpu", "memory"]
```
5.3 Resource Requests vs Limits¶
Best Practices:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-service
spec:
  replicas: 3
  template:
    spec:
      containers:
        - name: llm-container
          image: ai-soc-llm:latest
          resources:
            requests:
              cpu: "2"        # Guaranteed CPU
              memory: "8Gi"   # Guaranteed memory
            limits:
              cpu: "4"        # Max CPU (can burst)
              memory: "16Gi"  # Max memory (hard limit)
          # Important: in production, set memory requests equal to limits
          # to avoid OOMKilled pods under node memory pressure
```
For memory: set requests = limits so the pod cannot be overcommitted (combined with CPU requests = limits, this yields the "Guaranteed" QoS class).
For CPU: set limits > requests to allow bursting.
5.4 Cluster Autoscaler¶
Auto-add nodes when pods are pending:
```yaml
# cluster-autoscaler.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  template:
    spec:
      containers:
        - name: cluster-autoscaler
          image: k8s.gcr.io/autoscaling/cluster-autoscaler:v1.27.0
          command:
            - ./cluster-autoscaler
            - --cloud-provider=aws
            - --namespace=kube-system
            - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/ai-soc
            - --balance-similar-node-groups
            - --skip-nodes-with-system-pods=false
            - --scale-down-delay-after-add=10m
            - --scale-down-unneeded-time=10m
```
Cost Optimization: Combine with Spot Instances
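One way to wire that up is a dedicated spot node group plus a toleration on interruptible workloads, so only preemption-tolerant pods land on spot capacity. A sketch (the `lifecycle: spot` label/taint and the workload name are our illustrative choices, not AI-SOC defaults):

```yaml
# Taint spot nodes so only tolerant pods schedule there:
#   kubectl taint nodes <spot-node> lifecycle=spot:NoSchedule
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-batch-worker
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-batch-worker
  template:
    metadata:
      labels:
        app: llm-batch-worker
    spec:
      nodeSelector:
        lifecycle: spot        # Label applied to the spot node group
      tolerations:
        - key: "lifecycle"
          operator: "Equal"
          value: "spot"
          effect: "NoSchedule"
      containers:
        - name: worker
          image: ai-soc-llm:latest
```

Keep latency-sensitive services (the user-facing LLM endpoint) on on-demand nodes and push batch or retry-safe work to spot.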
5.5 Pod Disruption Budgets¶
Ensure availability during scaling:
```yaml
# pdb-llm-service.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: llm-service-pdb
spec:
  minAvailable: 2  # Always keep at least 2 pods running
  selector:
    matchLabels:
      app: llm-service
```
6. Performance Benchmarking¶
6.1 LLM Inference Benchmarking¶
```python
# benchmarks/llm_benchmark.py
import time
import torch
import numpy as np
from transformers import AutoModelForCausalLM, AutoTokenizer

def benchmark_llm_inference(model_name: str, num_requests: int = 100):
    """
    Benchmark LLM inference performance.

    Metrics:
    - Throughput (requests/second)
    - Latency (ms per request)
    - Time to First Token (TTFT)
    - Tokens per second
    """
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16,
        device_map="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    test_prompts = [
        "Analyze this phishing email: ",
        "What is SQL injection? ",
        "Explain ransomware detection: "
    ] * (num_requests // 3)
    num_requests = len(test_prompts)  # Actual count after rounding down

    latencies = []
    ttfts = []
    token_counts = []

    print(f"Benchmarking {model_name}...")
    for i, prompt in enumerate(test_prompts):
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

        # Measure time to first token
        start = time.time()
        with torch.inference_mode():
            model.generate(**inputs, max_new_tokens=1, do_sample=False)
        ttft = (time.time() - start) * 1000  # Convert to ms

        # Measure full generation
        start = time.time()
        with torch.inference_mode():
            outputs = model.generate(
                **inputs,
                max_new_tokens=256,
                do_sample=True,
                use_cache=True
            )
        latency = (time.time() - start) * 1000
        token_count = outputs.shape[1] - inputs.input_ids.shape[1]

        latencies.append(latency)
        ttfts.append(ttft)
        token_counts.append(token_count)

        if (i + 1) % 10 == 0:
            print(f"Progress: {i + 1}/{num_requests}")

    # Calculate metrics
    avg_latency = np.mean(latencies)
    p50_latency = np.percentile(latencies, 50)
    p95_latency = np.percentile(latencies, 95)
    p99_latency = np.percentile(latencies, 99)
    throughput = num_requests / (sum(latencies) / 1000)
    avg_ttft = np.mean(ttfts)
    tokens_per_sec = sum(token_counts) / (sum(latencies) / 1000)

    print("\n=== Benchmark Results ===")
    print(f"Model: {model_name}")
    print(f"Requests: {num_requests}")
    print("\nLatency:")
    print(f"  Average: {avg_latency:.2f} ms")
    print(f"  P50: {p50_latency:.2f} ms")
    print(f"  P95: {p95_latency:.2f} ms")
    print(f"  P99: {p99_latency:.2f} ms")
    print(f"\nThroughput: {throughput:.2f} requests/sec")
    print(f"Time to First Token: {avg_ttft:.2f} ms")
    print(f"Tokens/sec: {tokens_per_sec:.2f}")

    return {
        "latency_avg": avg_latency,
        "latency_p95": p95_latency,
        "throughput": throughput,
        "ttft": avg_ttft,
        "tokens_per_sec": tokens_per_sec
    }

# Run benchmark
results = benchmark_llm_inference("fdtn-ai/Foundation-Sec-8B", num_requests=100)
```
6.2 ChromaDB Benchmarking¶
```python
# benchmarks/chromadb_benchmark.py
import time
import numpy as np
import chromadb

def benchmark_chromadb(num_documents: int = 10000, num_queries: int = 100):
    """
    Benchmark ChromaDB performance.

    Metrics:
    - Insertion throughput (docs/sec)
    - Query latency (ms)
    """
    client = chromadb.Client()
    collection = client.create_collection("benchmark")

    # Generate synthetic data
    documents = [f"Security document {i} about threat detection" for i in range(num_documents)]
    embeddings = np.random.rand(num_documents, 768).tolist()  # 768-dim embeddings

    # Benchmark insertion
    print("Benchmarking insertion...")
    start = time.time()
    collection.add(
        documents=documents,
        embeddings=embeddings,
        ids=[f"id{i}" for i in range(num_documents)]
    )
    insertion_time = time.time() - start
    insertion_throughput = num_documents / insertion_time
    print(f"Insertion: {insertion_throughput:.2f} docs/sec")

    # Benchmark queries
    print("Benchmarking queries...")
    query_embeddings = np.random.rand(num_queries, 768).tolist()
    query_latencies = []
    for query_emb in query_embeddings:
        start = time.time()
        collection.query(query_embeddings=[query_emb], n_results=10)
        latency = (time.time() - start) * 1000
        query_latencies.append(latency)

    avg_query_latency = np.mean(query_latencies)
    p95_query_latency = np.percentile(query_latencies, 95)
    print("\nQuery Latency:")
    print(f"  Average: {avg_query_latency:.2f} ms")
    print(f"  P95: {p95_query_latency:.2f} ms")

    return {
        "insertion_throughput": insertion_throughput,
        "query_latency_avg": avg_query_latency,
        "query_latency_p95": p95_query_latency
    }

# Run benchmark
results = benchmark_chromadb(num_documents=50000, num_queries=1000)
```
7. Production Case Studies¶
Case Study 1: Aiera (Financial Services)¶
Challenge: Automated earnings call summarization with LLMs

Solution:
- Selected Claude 3.5 Sonnet after benchmarking multiple models
- Implemented caching for repeated queries
- Used streaming for real-time summaries

Results:
- 90% reduction in analysis time
- High accuracy maintained through model selection
- Cost-effective through smart caching

Lessons for AI-SOC:
- Model selection matters (benchmark before deployment)
- Caching dramatically reduces costs for repeated queries
- Streaming improves user experience
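The caching lesson can be sketched as a content-addressed response cache: hash the (model, prompt) pair and serve repeated queries from memory instead of re-running inference. The names here (`CachedLLM`, the `backend` callable) are illustrative, not part of any AI-SOC or Aiera API.

```python
import hashlib
from typing import Callable, Dict

class CachedLLM:
    """Illustrative response cache: identical prompts skip inference entirely."""

    def __init__(self, backend: Callable[[str], str], model: str = "Foundation-Sec-8B"):
        self.backend = backend          # real generate() call in production
        self.model = model
        self.cache: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def _key(self, prompt: str) -> str:
        # Key on model + prompt so a model swap never serves stale answers
        return hashlib.sha256(f"{self.model}\x00{prompt}".encode()).hexdigest()

    def generate(self, prompt: str) -> str:
        key = self._key(prompt)
        if key in self.cache:
            self.hits += 1
            return self.cache[key]
        self.misses += 1
        result = self.backend(prompt)
        self.cache[key] = result
        return result

# Usage: the second identical query is a cache hit
llm = CachedLLM(backend=lambda p: f"summary of: {p}")
llm.generate("Q3 earnings call")
llm.generate("Q3 earnings call")
print(llm.hits, llm.misses)  # 1 hit, 1 miss
```

In production you would bound the cache (e.g. an LRU with TTL) rather than an unbounded dict, but the cost mechanics are the same: every hit is an inference you did not pay for.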
Case Study 2: Klarna (E-Commerce)¶
Challenge: Customer service automation with LLMs

Solution:
- Multi-tier LLM architecture (fast model for triage, powerful model for complex queries)
- Aggressive rate limiting and abuse detection
- Continuous monitoring and feedback loops

Results:
- Millions of conversations handled monthly
- High customer satisfaction maintained
- Scalable architecture

Lessons for AI-SOC:
- Use smaller models for simple tasks, reserve large models for complex analysis
- Rate limiting essential for production stability
- Continuous monitoring critical for LLM systems
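The multi-tier pattern can be sketched as a router that sends short, routine queries to a fast model and escalates long or analysis-heavy ones. The thresholds, keywords, and model names below are hypothetical, not Klarna's or AI-SOC's actual configuration.

```python
FAST_MODEL = "small-triage-model"      # hypothetical cheap tier
POWERFUL_MODEL = "Foundation-Sec-8B"   # deep-analysis tier

# Keywords that suggest multi-step security analysis (illustrative)
ESCALATION_HINTS = ("correlate", "timeline", "root cause", "campaign")

def route(query: str, max_fast_tokens: int = 64) -> str:
    """Pick the model tier for a query (simple heuristic sketch)."""
    token_estimate = len(query.split())
    if token_estimate > max_fast_tokens:
        return POWERFUL_MODEL          # long queries go straight to the big model
    if any(hint in query.lower() for hint in ESCALATION_HINTS):
        return POWERFUL_MODEL          # analysis keywords trigger escalation
    return FAST_MODEL

print(route("Is 10.0.0.5 on the blocklist?"))           # fast tier
print(route("Correlate these alerts into a timeline"))  # powerful tier
```

A production router would typically use a small classifier or the fast model's own confidence score instead of keywords, but the cost structure is identical: most traffic never touches the expensive tier.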
Case Study 3: Enterprise Documentation Search (Anonymous)¶
Challenge: RAG system for internal documentation

Solution:
- vLLM for 2.7x throughput improvement
- Ray Serve for horizontal scaling
- ChromaDB with optimized HNSW settings

Results:
- 67.8% latency reduction
- 4.2x throughput improvement
- Scalable to 1000+ concurrent users

Lessons for AI-SOC:
- vLLM provides significant performance gains
- Horizontal scaling essential for high concurrency
- Optimize vector DB settings for your data
8. Performance Monitoring¶
8.1 Prometheus Metrics¶
# monitoring/metrics.py
import time

from fastapi import FastAPI, Response
from prometheus_client import (
    CONTENT_TYPE_LATEST,
    Counter,
    Gauge,
    Histogram,
    generate_latest,
)

app = FastAPI()

# Define metrics
llm_inference_duration = Histogram(
    'llm_inference_duration_seconds',
    'LLM inference duration',
    ['model', 'quantization']
)
llm_tokens_generated = Counter(
    'llm_tokens_generated_total',
    'Total tokens generated',
    ['model']
)
llm_requests_total = Counter(
    'llm_requests_total',
    'Total LLM requests',
    ['model', 'status']
)
gpu_memory_usage = Gauge(
    'gpu_memory_usage_bytes',
    'GPU memory usage',
    ['gpu_id']
)
chromadb_query_duration = Histogram(
    'chromadb_query_duration_seconds',
    'ChromaDB query duration',
    ['collection']
)

@app.get("/metrics")
def metrics():
    """Prometheus metrics endpoint (correct exposition content type)"""
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)

# Usage in application (llm_service is the application's LLM client)
@app.post("/analyze")
async def analyze_threat(prompt: str):
    start = time.time()
    try:
        result = await llm_service.generate(prompt)
        duration = time.time() - start
        # Record metrics
        llm_inference_duration.labels(
            model="Foundation-Sec-8B",
            quantization="int4"
        ).observe(duration)
        llm_tokens_generated.labels(model="Foundation-Sec-8B").inc(
            len(result.split())
        )
        llm_requests_total.labels(
            model="Foundation-Sec-8B",
            status="success"
        ).inc()
        return {"result": result}
    except Exception:
        llm_requests_total.labels(
            model="Foundation-Sec-8B",
            status="error"
        ).inc()
        raise
8.2 Grafana Dashboards¶
{
  "dashboard": {
    "title": "AI-SOC Performance Dashboard",
    "panels": [
      {
        "title": "LLM Inference Latency (P95)",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(llm_inference_duration_seconds_bucket[5m]))"
          }
        ]
      },
      {
        "title": "LLM Throughput (requests/sec)",
        "targets": [
          {
            "expr": "sum(rate(llm_requests_total[5m]))"
          }
        ]
      },
      {
        "title": "GPU Memory Usage",
        "targets": [
          {
            "expr": "gpu_memory_usage_bytes"
          }
        ]
      },
      {
        "title": "ChromaDB Query Latency",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(chromadb_query_duration_seconds_bucket[5m]))"
          }
        ]
      }
    ]
  }
}
9. Performance Optimization Checklist¶
# AI-SOC Performance Optimization Checklist
## LLM Inference
- [ ] Model quantization enabled (INT4/INT8)
- [ ] KV caching configured
- [ ] Prefix caching for system prompts
- [ ] vLLM with continuous batching deployed
- [ ] Flash Attention 2 enabled
- [ ] torch.compile() applied
- [ ] Speculative decoding for long-form generation
## ChromaDB
- [ ] HNSW parameters tuned (search_ef, M, construction_ef)
- [ ] Fast embedding model selected (nomic-embed-text)
- [ ] Batch inserts (1000-5000 docs)
- [ ] Parquet storage backend
- [ ] Documents preprocessed (normalized, truncated)
- [ ] Metadata filtering for queries
## OpenSearch
- [ ] Appropriate instance type selected (OR1, r6gd)
- [ ] Java heap = 50% of RAM (max 32GB)
- [ ] Bulk indexing with 5-15MB batches
- [ ] refresh_interval = 30s for write-heavy indices
- [ ] Shard size 10-50GB
- [ ] Filters instead of queries
- [ ] _source filtering enabled
- [ ] Force merge for old indices
- [ ] Index lifecycle policy configured
- [ ] Slow query logging enabled
## Docker
- [ ] Multi-stage builds for small images
- [ ] Resource limits defined (CPU, memory)
- [ ] Health checks configured
- [ ] Non-root user
- [ ] Layer caching optimized
## Kubernetes
- [ ] HPA configured for dynamic scaling
- [ ] VPA for resource optimization
- [ ] Resource requests = limits for memory
- [ ] Pod Disruption Budgets defined
- [ ] Cluster Autoscaler enabled
- [ ] Spot instances for cost optimization
## Monitoring
- [ ] Prometheus metrics exported
- [ ] Grafana dashboards created
- [ ] Alerts configured (latency, errors, resource usage)
- [ ] Distributed tracing (OpenTelemetry)
- [ ] Performance benchmarks established
## Testing
- [ ] Load testing completed (locust, k6)
- [ ] Latency targets met (P95 < 2s for LLM inference)
- [ ] Throughput targets met (>10 requests/sec)
- [ ] Resource utilization optimized (<70% CPU average)
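The OpenSearch bulk-indexing item (5-15MB batches) can be sketched as a helper that packs documents into `_bulk` NDJSON payloads capped by byte size; the index name and the 10MB cap are illustrative defaults, and each returned payload would be POSTed to `/_bulk` (e.g. via opensearch-py).

```python
import json
from typing import Iterable, List

def chunk_bulk_payloads(docs: Iterable[dict], index: str = "soc-logs",
                        max_bytes: int = 10 * 1024 * 1024) -> List[str]:
    """Pack docs into _bulk NDJSON bodies, each at most ~max_bytes."""
    payloads: List[str] = []
    current: List[str] = []
    size = 0
    action = json.dumps({"index": {"_index": index}})
    for doc in docs:
        # Each document is an action line plus a source line (NDJSON)
        line = action + "\n" + json.dumps(doc) + "\n"
        if current and size + len(line) > max_bytes:
            payloads.append("".join(current))   # flush the full batch
            current, size = [], 0
        current.append(line)
        size += len(line)
    if current:
        payloads.append("".join(current))       # flush the remainder
    return payloads

# Usage: stream log events into size-capped bulk bodies
payloads = chunk_bulk_payloads({"event": f"login {i}"} for i in range(1000))
print(len(payloads), "payload(s)")
```

Capping by bytes rather than document count is what the 5-15MB guidance actually asks for: log events vary wildly in size, so a fixed doc count can silently produce payloads far above or below the sweet spot.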
10. Summary & Recommendations¶
Top 10 Optimizations (Ranked by Impact)¶
1. vLLM with Continuous Batching - 2.7x throughput, 5x latency reduction
2. Model Quantization (INT4) - 4x memory reduction, 2x speedup
3. KV Cache + Prefix Caching - 2-5x speedup, 90% reduction for shared prompts
4. OpenSearch Bulk Indexing - 100-250K docs/sec (vs 1K with individual inserts)
5. Kubernetes HPA - 70-90% cost reduction through dynamic scaling
6. ChromaDB Batch Inserts - 10x faster than individual inserts
7. Flash Attention 2 - 2-4x faster attention computation
8. OpenSearch refresh_interval Tuning - 2-3x faster indexing
9. Docker Multi-Stage Builds - 80% image size reduction
10. Speculative Decoding - 2-3x speedup for long-form generation
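Speculative decoding, listed above, can be illustrated with a toy sketch: a cheap draft model proposes several tokens, the target model verifies them, and tokens are accepted up to the first disagreement. Real implementations (e.g. in vLLM) verify all proposals in one batched forward pass over logits; here both "models" are deterministic functions and verification is a per-token loop, purely for clarity.

```python
from typing import Callable, List

SENTENCE = "alerts were correlated into a single incident".split()

def speculative_step(prefix: List[str],
                     draft: Callable[[List[str]], str],
                     target: Callable[[List[str]], str],
                     k: int = 4) -> List[str]:
    """One round of speculative decoding (toy, greedy, deterministic)."""
    # 1) The cheap draft model proposes k tokens autoregressively
    proposed, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)
    # 2) The target model checks the proposals in order: accept until
    #    the first disagreement, then substitute its own token
    accepted, ctx = [], list(prefix)
    for tok in proposed:
        expected = target(ctx)
        if tok != expected:
            accepted.append(expected)
            break
        accepted.append(tok)
        ctx.append(tok)
    return prefix + accepted

# Toy "models": next token of a fixed sentence; the draft errs on one word
def target(ctx: List[str]) -> str:
    return SENTENCE[len(ctx)] if len(ctx) < len(SENTENCE) else "<eos>"

def draft(ctx: List[str]) -> str:
    tok = target(ctx)
    return "the" if tok == "a" else tok  # deliberate draft mistake

print(speculative_step([], draft, target, k=4))
```

The speedup comes from the accept-until-mismatch rule: when the draft is usually right, each expensive target pass yields several tokens instead of one, which is where the 2-3x figure for long-form generation comes from.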
Performance Targets for AI-SOC¶
| Metric | Target | Optimized | Notes |
|---|---|---|---|
| LLM Inference Latency (P95) | < 2s | < 1s | With vLLM + quantization |
| LLM Throughput | 10 req/sec | 50+ req/sec | With continuous batching |
| ChromaDB Query Latency (P95) | < 100ms | < 50ms | With HNSW tuning |
| OpenSearch Indexing | 10K docs/sec | 100K+ docs/sec | With bulk + tuning |
| Resource Utilization (CPU) | < 70% avg | < 60% avg | With autoscaling |
| Cost per 1M tokens | < $5 | < $1 | With quantization + caching |
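The cost target in the last row follows directly from GPU price and sustained throughput. A quick back-of-envelope helper (the $/hour and tokens/sec figures below are assumptions for illustration, not measured AI-SOC numbers):

```python
def cost_per_million_tokens(gpu_dollars_per_hour: float, tokens_per_sec: float) -> float:
    """Serving cost of 1M generated tokens at sustained throughput."""
    seconds_per_million = 1_000_000 / tokens_per_sec
    return gpu_dollars_per_hour / 3600 * seconds_per_million

# Example: a $2/hr GPU at 500 tok/s costs ~$1.11 per 1M tokens;
# doubling throughput (e.g. via quantization) halves the figure.
print(f"${cost_per_million_tokens(2.0, 500):.2f}")
print(f"${cost_per_million_tokens(2.0, 1000):.2f}")
```

This is why throughput optimizations compound into the cost row: quantization and caching do not change the GPU's hourly price, they change how many tokens each hour produces.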
Document Version: 1.0
Last Updated: 2025-10-22
Author: The Didact (AI Research Specialist)
Classification: Internal Use