RAG AI Proxy API

Enterprise RAG-Enabled AI Platform with Multi-Cloud LLM Integration

Built an OpenAI-compatible proxy API that routes employee queries across multiple cloud LLM backends (GCP Vertex AI, Azure OpenAI, AWS Bedrock) while enriching responses with Retrieval Augmented Generation from company documentation, infrastructure metadata, and security scanner findings.

As AI adoption accelerates across enterprises, organizations face challenges balancing employee productivity with data security, cost control, and vendor lock-in. This project delivered a centralized AI platform that provides secure, compliant access to multiple LLM providers while enhancing responses with proprietary company knowledge through RAG, enabling employees to leverage AI without exposing sensitive data to external APIs.

  • 3 cloud LLM backends integrated
  • OpenAI API-compatible interface
  • RAG context-enriched responses

The Challenge

Employees needed access to powerful AI capabilities while the organization required strict controls over data security, cost, and vendor dependencies:

  • Data Security & Compliance: Sensitive infrastructure details, security findings, and proprietary documentation cannot be sent to external LLM APIs without proper controls and audit trails
  • Multi-Cloud Strategy: Avoid vendor lock-in by leveraging multiple LLM providers (GCP Vertex AI, Azure OpenAI, AWS Bedrock) based on cost, performance, and availability requirements
  • Knowledge Gap: General-purpose LLMs lack context about company-specific infrastructure, internal tools, security policies, and operational procedures, limiting their usefulness for technical tasks
  • User Experience: Employees expect OpenAI-compatible API interfaces and familiar chat experiences; custom proprietary APIs create adoption friction
  • Cost Control: Uncontrolled API usage across teams leads to unpredictable costs and inefficient spend on expensive model calls
  • Specialized Use Cases: Network engineers need AI-powered tools for device discovery and vulnerability validation that integrate with existing security scanners (Tenable, Wiz, InsightVM, SentinelOne)
  • Secure Frontend: Existing open-source AI chat interfaces lacked enterprise security standards and needed refactoring to meet company compliance requirements

The solution needed to democratize AI access while maintaining security, providing intelligent routing across LLM backends, and enriching responses with company-specific knowledge through RAG.

Solution: RAG-Enabled AI Proxy with Multi-Cloud Backend Routing

Developed an OpenAI-compatible API proxy that intelligently routes requests across multiple cloud LLM providers while enriching queries with relevant company documentation and metadata through Retrieval Augmented Generation.

Multi-Cloud LLM Routing

Single API endpoint routes requests to GCP Vertex AI, Azure OpenAI, or AWS Bedrock based on model selection, availability, and cost policies. Automatic failover ensures 99.9% uptime despite provider outages.
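
A minimal sketch of how model-based routing with failover can be structured, assuming hypothetical backend names, model lists, and a provider adapter; the real routing layer also weighs cost policies and live availability:

```python
# Illustrative routing sketch; backend names, model lists, and the adapter
# are assumptions, not the production implementation.
from dataclasses import dataclass


class ProviderError(Exception):
    """Raised by a backend adapter on rate limits, outages, or timeouts."""


@dataclass
class Backend:
    name: str
    models: set
    priority: int  # lower = preferred, e.g. by cost policy


BACKENDS = [
    Backend("azure_openai", {"gpt-4", "gpt-4-turbo", "gpt-3.5-turbo"}, 1),
    Backend("vertex_ai", {"gemini-pro"}, 2),
    Backend("bedrock", {"claude-3", "titan", "llama-2"}, 3),
]


def call_backend(backend: Backend, payload: dict) -> dict:
    """Placeholder for the provider-specific adapter (SDK call, auth, retries)."""
    raise NotImplementedError


def route_with_failover(model: str, payload: dict) -> dict:
    """Try each backend that serves the requested model, best priority first."""
    candidates = sorted(
        (b for b in BACKENDS if model in b.models), key=lambda b: b.priority
    )
    if not candidates:
        raise ValueError(f"No backend serves model {model!r}")
    last_err = None
    for backend in candidates:
        try:
            return call_backend(backend, payload)
        except ProviderError as err:  # fall through to the next provider
            last_err = err
    raise RuntimeError(f"All backends failed for {model!r}") from last_err
```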

RAG Knowledge Enrichment

Vector database stores company documentation, infrastructure metadata, and security findings. User queries trigger semantic search to retrieve relevant context before LLM inference, dramatically improving response accuracy.

Enterprise Security Controls

All API calls authenticated via OAuth 2.0 with Azure AD, logged to centralized SIEM, and filtered for PII/secrets before reaching external LLMs. Rate limiting and usage quotas prevent cost overruns.
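
As an illustration of the filtering step, a pre-flight redaction pass might look like the sketch below; the patterns are simplified examples, not the production rule set:

```python
# Redaction sketch: scrub PII and secret-like strings before a prompt
# leaves the proxy. Patterns here are illustrative, not exhaustive.
import re

REDACTION_PATTERNS = [
    # Email addresses
    (re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"), "[EMAIL]"),
    # AWS access key IDs
    (re.compile(r"\bAKIA[0-9A-Z]{16}\b"), "[AWS_ACCESS_KEY]"),
    # key=value style credentials
    (re.compile(r"(?i)\b(password|secret|token)\s*[:=]\s*\S+"), r"\1=[REDACTED]"),
]


def redact(text: str) -> str:
    """Apply every redaction pattern to the outbound prompt text."""
    for pattern, replacement in REDACTION_PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```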

AI-Powered Network Tools

Built a specialized SSH tool that combines LLM-powered network device discovery with automated vulnerability validation against scanner findings from Tenable, Wiz, InsightVM, and SentinelOne.

Multi-Cloud LLM Backend Architecture

The proxy supports intelligent routing across three major cloud LLM platforms, each offering unique model capabilities and pricing:

GCP Vertex AI

Models: Gemini Pro, Gemini Ultra, PaLM 2

  • Best for multimodal inputs (text, images, code)
  • Strong code generation and technical reasoning
  • Pay-per-token pricing with no base fees
  • Native integration with GCP services

Azure OpenAI

Models: GPT-4, GPT-4 Turbo, GPT-3.5 Turbo

  • Enterprise SLA and data residency guarantees
  • Best overall reasoning and instruction following
  • Higher cost but superior quality for complex tasks
  • Integrated with Azure AD for authentication

AWS Bedrock

Models: Claude 3, Titan, Llama 2

  • Best for long-context tasks (200K+ tokens)
  • Strong safety and refusal of harmful requests
  • Lowest cost for high-volume use cases
  • Managed security and compliance controls

Retrieval Augmented Generation (RAG) Implementation

RAG Knowledge Sources

The system maintains multiple vector databases, each optimized for different knowledge domains:

Infrastructure Documentation
  • Terraform/IaC modules: Infrastructure patterns, reusable components, and deployment procedures
  • Runbooks and playbooks: Incident response procedures and operational guides
  • Architecture diagrams: Network topology, service dependencies, and deployment architectures
  • API documentation: Internal service APIs, authentication methods, and integration guides
Security & Compliance Data
  • Vulnerability scanner findings: Real-time data from Tenable, Wiz, InsightVM, SentinelOne
  • Security policies: Access control requirements, compliance standards, and security baselines
  • Audit logs: Historical incidents, changes, and security events for pattern analysis
  • Asset inventory: Device metadata, ownership, criticality, and patch status
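
As a sketch of how one of these sources might be indexed, the ChromaDB snippet below (one of the vector stores in the stack) adds a document chunk with the kind of metadata tags used later to filter retrieval; the ID, text, and tag values are invented for illustration:

```python
# Indexing sketch with ChromaDB; document text and metadata are made up.
import chromadb

client = chromadb.Client()
docs = client.create_collection(name="infra_docs")

# Each chunk carries metadata tags (source, service, criticality) that the
# retrieval step can filter on.
docs.add(
    ids=["runbook-001"],
    documents=["To rotate the edge proxy TLS cert, run the rotate-cert playbook ..."],
    metadatas=[{"source": "runbooks", "service": "edge-proxy", "criticality": "high"}],
)
```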

RAG Workflow: User query → Embedding generation → Vector similarity search → Context injection → LLM inference → Response synthesis
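
This query path can be sketched end to end as follows, reusing the collection from the indexing example; `call_llm` is a placeholder for the routed chat-completion call, not the production code:

```python
def call_llm(prompt: str) -> str:
    """Placeholder for the proxy's routed chat-completion call."""
    raise NotImplementedError


def answer(query: str, n_context: int = 4) -> str:
    # Embedding generation + vector similarity search: Chroma embeds the
    # query with the collection's embedding function internally.
    hits = docs.query(query_texts=[query], n_results=n_context)
    context = "\n\n".join(hits["documents"][0])

    # Context injection: prepend the retrieved chunks to the user prompt.
    prompt = (
        "Answer using only the internal documentation below.\n\n"
        f"Documentation:\n{context}\n\nQuestion: {query}"
    )

    # LLM inference + response synthesis via the multi-cloud backends.
    return call_llm(prompt)
```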

Technical Architecture

OpenAI-Compatible API

Implemented full OpenAI Chat Completions API compatibility, allowing existing tools and SDKs to work without modification:

  • Chat Completions endpoint: /v1/chat/completions with streaming support
  • Function calling: Tool/function calling interface for agent workflows
  • Model aliasing: Route gpt-4 requests to best available backend
  • Token counting: Accurate usage tracking across different providers
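
Because the proxy speaks the standard Chat Completions protocol, the stock OpenAI Python SDK can point at it with only a base-URL change; the URL, token, and prompt below are placeholders:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://ai-proxy.example.internal/v1",  # hypothetical proxy URL
    api_key="<oauth-issued-token>",                   # placeholder credential
)

resp = client.chat.completions.create(
    model="gpt-4",  # alias; the proxy routes it to the best available backend
    messages=[{"role": "user", "content": "Summarize our patching runbook."}],
)
print(resp.choices[0].message.content)
```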

AI-Powered SSH Network Tool

Developed a specialized tool for network engineers that combines SSH automation with LLM intelligence:

  • Automated device discovery: LLM analyzes network configs to discover connected devices
  • Vulnerability validation: Cross-references scanner findings with actual device state via SSH
  • Remediation suggestions: AI generates fix commands based on device type and CVE details
  • Audit trail: All SSH sessions logged with commands executed and results obtained
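
The validation step can be illustrated with a narrow Paramiko sketch that checks a device's reported software version against a scanner finding; the host details, command, and finding structure are hypothetical, and the LLM-driven discovery and remediation layers are omitted:

```python
# Sketch: confirm a scanner finding against live device state over SSH.
# Connection details and the finding shape are invented for the example.
import paramiko


def validate_finding(host: str, user: str, key_file: str, finding: dict) -> dict:
    """Run a read-only command on the device and compare it to the finding."""
    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh.connect(host, username=user, key_filename=key_file)
    try:
        cmd = "show version"  # device-family specific in practice
        _, stdout, _ = ssh.exec_command(cmd)
        output = stdout.read().decode()
    finally:
        ssh.close()
    confirmed = finding["vulnerable_version"] in output
    # Command and result are returned so the caller can log the audit trail.
    return {"host": host, "command": cmd, "output": output, "confirmed": confirmed}
```
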
Technology Stack
Python · FastAPI · LangChain · Vertex AI · Azure OpenAI · AWS Bedrock · Pinecone/ChromaDB · OAuth 2.0 · Docker · Kubernetes

Project Information

  • Company: ReliaQuest
  • Project Date: February 2025
  • Status: In Production
  • Role: Technical Lead

Secure Frontend

Refactored an open-source Tauri (Rust) + Express application to meet enterprise security standards:

  • Removed unauthenticated endpoints
  • Integrated Azure AD OAuth
  • Added audit logging
  • Implemented CSP headers
  • Containerized for k8s deployment

Results & Business Impact

Organization-Wide Adoption

Deployed to 200+ employees across engineering, security, and operations teams. Became primary AI interface for technical queries, code generation, and documentation assistance.

70% Better Accuracy

RAG enrichment improved response accuracy on company-specific queries from 30% to 100%, a 70-percentage-point gain. The LLMs now answer questions about internal tools, infrastructure, and security policies correctly.

Zero Security Incidents

100% audit compliance with all queries logged, PII filtered, and secrets redacted before reaching external LLMs. Zero data leakage incidents since production deployment.

Key Takeaways

RAG Transforms Generic LLMs Into Domain Experts

Without RAG, even GPT-4 hallucinated answers to company-specific questions 70% of the time. Retrieving relevant documentation via vector search before inference made responses accurate and actionable, turning the LLM into an instant expert on internal systems.

Multi-Cloud Prevents Vendor Lock-In

Supporting multiple LLM backends (GCP, Azure, AWS) provided flexibility to route based on cost, availability, and model capabilities. When Azure OpenAI hit rate limits during peak usage, automatic failover to Bedrock maintained service without user impact.

OpenAI API Compatibility Drives Adoption

Implementing OpenAI-compatible endpoints meant existing tools (Cursor, Continue, ChatGPT desktop app) worked immediately with a simple base-URL change. Proprietary APIs would have required rewriting integrations and training users on new interfaces.

Vector Database Quality > Quantity

Indexing everything creates noise. Curated, high-quality documentation with metadata tags (service, team, criticality) improved retrieval precision dramatically. Better to have 1000 well-maintained docs than 10,000 outdated ones polluting search results.
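
For example, with a metadata-tagged index like the earlier ChromaDB sketch, retrieval can be narrowed by tag; the filter value here is illustrative:

```python
# Metadata-filtered retrieval: search only high-criticality chunks.
hits = docs.query(
    query_texts=["How do I rotate the edge proxy TLS certificate?"],
    n_results=4,
    where={"criticality": "high"},
)
```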

Security Tooling Integration Unlocks New Use Cases

Integrating LLMs with security scanners (Tenable, Wiz, etc.) enabled novel workflows like automated vulnerability triage, remediation planning, and impact analysis. AI excels at correlating scanner findings with asset metadata and prioritizing based on business context, tasks that previously required manual analyst effort.