RAG AI Proxy API

Enterprise RAG-Enabled AI Platform with Multi-Cloud LLM Integration

Built an OpenAI-compatible proxy API that routes employee queries across multiple cloud LLM backends (GCP Vertex AI, Azure OpenAI, AWS Bedrock) while enriching responses with Retrieval Augmented Generation from company documentation, infrastructure metadata, and security scanner findings.

As AI adoption accelerates across enterprises, organizations face challenges balancing employee productivity with data security, cost control, and vendor lock-in. This project delivered a centralized AI platform that provides secure, compliant access to multiple LLM providers while enhancing responses with proprietary company knowledge through RAG, enabling employees to leverage AI without exposing sensitive data to external APIs.

  • 3 cloud LLM backends integrated
  • OpenAI API-compatible interface
  • RAG context-enriched responses

The Challenge

Employees needed access to powerful AI capabilities while the organization required strict controls over data security, cost, and vendor dependencies:

  • Data Security & Compliance: Sensitive infrastructure details, security findings, and proprietary documentation cannot be sent to external LLM APIs without proper controls and audit trails
  • Multi-Cloud Strategy: Avoid vendor lock-in by leveraging multiple LLM providers (GCP Vertex AI, Azure OpenAI, AWS Bedrock) based on cost, performance, and availability requirements
  • Knowledge Gap: General-purpose LLMs lack context about company-specific infrastructure, internal tools, security policies, and operational procedures, limiting their usefulness for technical tasks
  • User Experience: Employees expect OpenAI-compatible API interfaces and familiar chat experiences; custom proprietary APIs create adoption friction
  • Cost Control: Uncontrolled API usage across teams leads to unpredictable costs and inefficient spend on expensive model calls
  • Specialized Use Cases: Network engineers need AI-powered tools for device discovery and vulnerability validation that integrate with existing security scanners (Tenable, Wiz, InsightVM, SentinelOne)
  • Secure Frontend: Existing open-source AI chat interfaces lacked enterprise security standards and needed refactoring to meet company compliance requirements

The solution needed to democratize AI access while maintaining security, providing intelligent routing across LLM backends, and enriching responses with company-specific knowledge through RAG.

Solution: RAG-Enabled AI Proxy with Multi-Cloud Backend Routing

Developed an OpenAI-compatible API proxy that intelligently routes requests across multiple cloud LLM providers while enriching queries with relevant company documentation and metadata through Retrieval Augmented Generation.

Multi-Cloud LLM Routing

Single API endpoint routes requests to GCP Vertex AI, Azure OpenAI, or AWS Bedrock based on model selection, availability, and cost policies. Automatic failover ensures 99.9% uptime despite provider outages.
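
A minimal sketch of how model-based routing with failover can be structured, assuming hypothetical backend names, model lists, and a provider adapter; the real routing layer also weighs cost policies and live availability:

```python
# Illustrative routing sketch; backend names, model lists, and the adapter
# are assumptions, not the production implementation.
from dataclasses import dataclass


class ProviderError(Exception):
    """Raised by a backend adapter on rate limits, outages, or timeouts."""


@dataclass
class Backend:
    name: str
    models: set
    priority: int  # lower = preferred, e.g. by cost policy


BACKENDS = [
    Backend("azure_openai", {"gpt-4", "gpt-4-turbo", "gpt-3.5-turbo"}, 1),
    Backend("vertex_ai", {"gemini-pro"}, 2),
    Backend("bedrock", {"claude-3", "titan", "llama-2"}, 3),
]


def call_backend(backend: Backend, payload: dict) -> dict:
    """Placeholder for the provider-specific adapter (SDK call, auth, retries)."""
    raise NotImplementedError


def route_with_failover(model: str, payload: dict) -> dict:
    """Try each backend that serves the requested model, best priority first."""
    candidates = sorted(
        (b for b in BACKENDS if model in b.models), key=lambda b: b.priority
    )
    if not candidates:
        raise ValueError(f"No backend serves model {model!r}")
    last_err = None
    for backend in candidates:
        try:
            return call_backend(backend, payload)
        except ProviderError as err:  # fall through to the next provider
            last_err = err
    raise RuntimeError(f"All backends failed for {model!r}") from last_err
```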

RAG Knowledge Enrichment

Vector database stores company documentation, infrastructure metadata, and security findings. User queries trigger semantic search to retrieve relevant context before LLM inference, dramatically improving response accuracy.

Enterprise Security Controls

All API calls authenticated via OAuth 2.0 with Azure AD, logged to centralized SIEM, and filtered for PII/secrets before reaching external LLMs. Rate limiting and usage quotas prevent cost overruns.
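
As an illustration of the filtering step, a pre-flight redaction pass might look like the sketch below; the patterns are simplified examples, not the production rule set:

```python
# Redaction sketch: scrub PII and secret-like strings before a prompt
# leaves the proxy. Patterns here are illustrative, not exhaustive.
import re

REDACTION_PATTERNS = [
    # Email addresses
    (re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"), "[EMAIL]"),
    # AWS access key IDs
    (re.compile(r"\bAKIA[0-9A-Z]{16}\b"), "[AWS_ACCESS_KEY]"),
    # key=value style credentials
    (re.compile(r"(?i)\b(password|secret|token)\s*[:=]\s*\S+"), r"\1=[REDACTED]"),
]


def redact(text: str) -> str:
    """Apply every redaction pattern to the outbound prompt text."""
    for pattern, replacement in REDACTION_PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```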

AI-Powered Network Tools

Built a specialized SSH tool that combines LLM-powered network device discovery with automated vulnerability validation against scanner findings from Tenable, Wiz, InsightVM, and SentinelOne.

Multi-Cloud LLM Backend Architecture

The proxy supports intelligent routing across three major cloud LLM platforms, each offering unique model capabilities and pricing:

GCP Vertex AI

Models: Gemini Pro, Gemini Ultra, PaLM 2

  • Best for multimodal inputs (text, images, code)
  • Strong code generation and technical reasoning
  • Pay-per-token pricing with no base fees
  • Native integration with GCP services

Azure OpenAI

Models: GPT-4, GPT-4 Turbo, GPT-3.5 Turbo

  • Enterprise SLA and data residency guarantees
  • Best overall reasoning and instruction following
  • Higher cost but superior quality for complex tasks
  • Integrated with Azure AD for authentication

AWS Bedrock

Models: Claude 3, Titan, Llama 2

  • Best for long-context tasks (200K+ tokens)
  • Strong safety and refusal of harmful requests
  • Lowest cost for high-volume use cases
  • Managed security and compliance controls

Retrieval Augmented Generation (RAG) Implementation

RAG Knowledge Sources

The system maintains multiple vector databases, each optimized for different knowledge domains:

Infrastructure Documentation
  • Terraform/IaC modules: Infrastructure patterns, reusable components, and deployment procedures
  • Runbooks and playbooks: Incident response procedures and operational guides
  • Architecture diagrams: Network topology, service dependencies, and deployment architectures
  • API documentation: Internal service APIs, authentication methods, and integration guides
Security & Compliance Data
  • Vulnerability scanner findings: Real-time data from Tenable, Wiz, InsightVM, SentinelOne
  • Security policies: Access control requirements, compliance standards, and security baselines
  • Audit logs: Historical incidents, changes, and security events for pattern analysis
  • Asset inventory: Device metadata, ownership, criticality, and patch status
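
As a sketch of how one of these sources might be indexed, the ChromaDB snippet below (one of the vector stores in the stack) adds a document chunk with the kind of metadata tags used later to filter retrieval; the ID, text, and tag values are invented for illustration:

```python
# Indexing sketch with ChromaDB; document text and metadata are made up.
import chromadb

client = chromadb.Client()
docs = client.create_collection(name="infra_docs")

# Each chunk carries metadata tags (source, service, criticality) that the
# retrieval step can filter on.
docs.add(
    ids=["runbook-001"],
    documents=["To rotate the edge proxy TLS cert, run the rotate-cert playbook ..."],
    metadatas=[{"source": "runbooks", "service": "edge-proxy", "criticality": "high"}],
)
```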

RAG Workflow: User query → Embedding generation → Vector similarity search → Context injection → LLM inference → Response synthesis
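
This query path can be sketched end to end as follows, reusing the collection from the indexing example; `call_llm` is a placeholder for the routed chat-completion call, not the production code:

```python
def call_llm(prompt: str) -> str:
    """Placeholder for the proxy's routed chat-completion call."""
    raise NotImplementedError


def answer(query: str, n_context: int = 4) -> str:
    # Embedding generation + vector similarity search: Chroma embeds the
    # query with the collection's embedding function internally.
    hits = docs.query(query_texts=[query], n_results=n_context)
    context = "\n\n".join(hits["documents"][0])

    # Context injection: prepend the retrieved chunks to the user prompt.
    prompt = (
        "Answer using only the internal documentation below.\n\n"
        f"Documentation:\n{context}\n\nQuestion: {query}"
    )

    # LLM inference + response synthesis via the multi-cloud backends.
    return call_llm(prompt)
```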

Technical Architecture

OpenAI-Compatible API

Implemented full OpenAI Chat Completions API compatibility, allowing existing tools and SDKs to work without modification:

  • Chat Completions endpoint: /v1/chat/completions with streaming support
  • Function calling: Tool/function calling interface for agent workflows
  • Model aliasing: Route gpt-4 requests to best available backend
  • Token counting: Accurate usage tracking across different providers
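
Because the proxy speaks the standard Chat Completions protocol, the stock OpenAI Python SDK can point at it with only a base-URL change; the URL, token, and prompt below are placeholders:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://ai-proxy.example.internal/v1",  # hypothetical proxy URL
    api_key="<oauth-issued-token>",                   # placeholder credential
)

resp = client.chat.completions.create(
    model="gpt-4",  # alias; the proxy routes it to the best available backend
    messages=[{"role": "user", "content": "Summarize our patching runbook."}],
)
print(resp.choices[0].message.content)
```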

AI-Powered SSH Network Tool

Developed a specialized tool for network engineers that combines SSH automation with LLM intelligence:

  • Automated device discovery: LLM analyzes network configs to discover connected devices
  • Vulnerability validation: Cross-references scanner findings with actual device state via SSH
  • Remediation suggestions: AI generates fix commands based on device type and CVE details
  • Audit trail: All SSH sessions logged with commands executed and results obtained
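
The validation step can be illustrated with a narrow Paramiko sketch that checks a device's reported software version against a scanner finding; the host details, command, and finding structure are hypothetical, and the LLM-driven discovery and remediation layers are omitted:

```python
# Sketch: confirm a scanner finding against live device state over SSH.
# Connection details and the finding shape are invented for the example.
import paramiko


def validate_finding(host: str, user: str, key_file: str, finding: dict) -> dict:
    """Run a read-only command on the device and compare it to the finding."""
    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh.connect(host, username=user, key_filename=key_file)
    try:
        cmd = "show version"  # device-family specific in practice
        _, stdout, _ = ssh.exec_command(cmd)
        output = stdout.read().decode()
    finally:
        ssh.close()
    confirmed = finding["vulnerable_version"] in output
    # Command and result are returned so the caller can log the audit trail.
    return {"host": host, "command": cmd, "output": output, "confirmed": confirmed}
```
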
Technology Stack
Python · FastAPI · LangChain · Vertex AI · Azure OpenAI · AWS Bedrock · Pinecone/ChromaDB · OAuth 2.0 · Docker · Kubernetes

Project Information

  • Company: ReliaQuest
  • Project Date: February 2025
  • Status: In Production
  • Role: Technical Lead

Secure Frontend

Refactored an open-source Tauri (Rust) + Express application to meet enterprise security standards:

  • Removed unauthenticated endpoints
  • Integrated Azure AD OAuth
  • Added audit logging
  • Implemented CSP headers
  • Containerized for k8s deployment

Results & Business Impact

Organization-Wide Adoption

Deployed to 200+ employees across engineering, security, and operations teams. Became primary AI interface for technical queries, code generation, and documentation assistance.

70% Better Accuracy

RAG enrichment improved response accuracy on company-specific queries from 30% to 100%, a 70-percentage-point gain. The LLMs now answer questions about internal tools, infrastructure, and security policies correctly.

Zero Security Incidents

100% audit compliance with all queries logged, PII filtered, and secrets redacted before reaching external LLMs. Zero data leakage incidents since production deployment.

Key Takeaways

RAG Transforms Generic LLMs Into Domain Experts

Without RAG, even GPT-4 hallucinated answers to company-specific questions 70% of the time. Retrieving relevant documentation via vector search before inference made responses accurate and actionable, turning the LLM into an instant expert on internal systems.

Multi-Cloud Prevents Vendor Lock-In

Supporting multiple LLM backends (GCP, Azure, AWS) provided flexibility to route based on cost, availability, and model capabilities. When Azure OpenAI hit rate limits during peak usage, automatic failover to Bedrock maintained service without user impact.

OpenAI API Compatibility Drives Adoption

Implementing OpenAI-compatible endpoints meant existing tools (Cursor, Continue, ChatGPT desktop app) worked immediately with a simple base-URL change. Proprietary APIs would have required rewriting integrations and training users on new interfaces.

Vector Database Quality > Quantity

Indexing everything creates noise. Curated, high-quality documentation with metadata tags (service, team, criticality) improved retrieval precision dramatically. Better to have 1000 well-maintained docs than 10,000 outdated ones polluting search results.
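
For example, with a metadata-tagged index like the earlier ChromaDB sketch, retrieval can be narrowed by tag; the filter value here is illustrative:

```python
# Metadata-filtered retrieval: search only high-criticality chunks.
hits = docs.query(
    query_texts=["How do I rotate the edge proxy TLS certificate?"],
    n_results=4,
    where={"criticality": "high"},
)
```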

Security Tooling Integration Unlocks New Use Cases

Integrating LLMs with security scanners (Tenable, Wiz, etc.) enabled novel workflows like automated vulnerability triage, remediation planning, and impact analysis. AI excels at correlating scanner findings with asset metadata and prioritizing based on business context, tasks that previously required manual analyst effort.