Project Phoenix: The Internal AI Assistant

An Alyzom Solution case study on developing a real-time, biometrically authenticated assistant built on a retrieval-augmented generation (RAG) architecture.

The 4-Fold Challenge

Alyzom Solution needed to centralize dynamic internal information, but faced four critical hurdles in creating a truly useful and secure tool.

Policy & Operations

Answering nuanced questions from unstructured data like employee handbooks.

Dynamic Services

Managing real-time, changing data like daily meal subscriptions and menus.

Secure Access

Ensuring internal-only access with a seamless, high-speed biometric (facial) login.

Real-Time Experience

Overcoming high LLM latency to provide a fluid, conversational, and interruptible interface.

The Three-Pillar Architecture

We engineered a full-stack, asynchronous system to manage data ingestion, real-time logic, and an interactive user experience.

PILLAR 1

Data Ingestion Pipeline

Python scripts load and chunk the employee handbooks, create semantic embeddings for each chunk, and store them in a ChromaDB vector database.
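The chunking step of the ingestion pipeline can be sketched in a few lines. This is a minimal, stdlib-only illustration; the chunk size, overlap, and function name are assumptions, not the production values, and in the real pipeline each chunk would then be embedded and written to ChromaDB.

```python
# Minimal sketch of the handbook chunking step (chunk size and
# overlap are illustrative assumptions, not the production values).
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows for embedding."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks

# In the real pipeline, each chunk would be embedded and stored, e.g.
# collection.add(documents=chunks, ids=[...]) against a ChromaDB collection.
```

The overlap keeps sentences that straddle a chunk boundary retrievable from both sides.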

PILLAR 2

Core Backend (FastAPI)

Manages WebSocket connections, biometric auth via DeepFace/FAISS, and the core RAG logic using LangChain.

PILLAR 3

Frontend Interface

A pure HTML/Tailwind/JS interface with a state-based CSS avatar and browser-native Web Speech API for voice I/O.

Key Performance Enhancements

To achieve real-time responsiveness and reduce model latency, multiple optimizations were introduced across both model selection and system design layers.

Model Retuning & Selection

High Latency from Heavy Model

The initial setup used a large Llama3 70B model, which caused slow response times and high compute costs, making real-time interaction infeasible.

Lightweight, Quantized Model Deployment

Replaced the model with Gemma3 27B-IT QAT, reducing latency by over 90% while maintaining comparable accuracy and fluency.
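If the models were served through Ollama with LangChain (an assumption; the case study does not name the serving stack), the swap itself is close to a one-line configuration change. The model tags below are illustrative, not confirmed deployment identifiers.

```python
# Hypothetical sketch of the model swap, assuming an Ollama + LangChain
# serving stack (the stack and model tags are assumptions).
from langchain_ollama import ChatOllama

# Before: heavyweight model, slow first-token latency.
# llm = ChatOllama(model="llama3:70b")

# After: quantization-aware-trained Gemma 3, far cheaper to serve.
llm = ChatOllama(model="gemma3:27b-it-qat", temperature=0)
```

Because the rest of the RAG chain talks to the model through the same interface, the replacement required no changes to the retrieval or prompting logic.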

Real-Time Streaming Architecture

Delayed Full-Turn Responses

The earlier synchronous API calls made users wait until the full response was generated, disrupting the flow of real-time conversation.

Asynchronous WebSocket-Based Streaming

Introduced WebSocket connections for incremental response streaming, allowing output to appear token-by-token for a natural, chat-like experience.
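The token-by-token flow can be illustrated with a pure-asyncio sketch. In the real system the chunks travel over a FastAPI WebSocket (`send` would be `websocket.send_text`) and come from the LLM's streaming API; the function names and sample tokens here are hypothetical.

```python
import asyncio

# Stdlib-only sketch of incremental streaming. In the real system the
# tokens come from the LLM's streaming API and are pushed over a
# FastAPI WebSocket; names here are hypothetical.

async def generate_tokens(prompt: str):
    """Stand-in for the model's streaming API: yields tokens one by one."""
    for token in ("Phoenix ", "answers ", "as ", "it ", "generates."):
        await asyncio.sleep(0)  # simulate per-token model latency
        yield token

async def stream_reply(prompt: str, send) -> str:
    """Forward each token to the client as soon as it is produced."""
    full = []
    async for token in generate_tokens(prompt):
        await send(token)        # websocket.send_text(token) in FastAPI
        full.append(token)
    return "".join(full)
```

The client appends each chunk as it arrives, so the user sees the answer forming instead of waiting for the full turn.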

Efficient Embedding Retrieval

Slow Query Response During RAG

Each user query triggered multiple retrieval calls from ChromaDB, causing lag due to unoptimized vector search and filtering.

Semantic Pre-Filtering & Caching

Optimized ChromaDB queries with semantic indexing and pre-filtered retrieval to reduce query time by 60% and minimize redundant lookups.
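The caching half of this optimization can be sketched with `functools.lru_cache` over the retrieval call. The retrieval function below is a stub standing in for a ChromaDB query (which supports metadata pre-filtering via its `where` argument); the section names and cache size are assumptions.

```python
from functools import lru_cache

# Hedged sketch of the caching layer: repeated identical queries skip
# the vector search entirely. The stub stands in for a ChromaDB query;
# cache size and section names are illustrative assumptions.

CALLS = {"count": 0}

@lru_cache(maxsize=256)
def retrieve(query: str, section: str) -> tuple[str, ...]:
    """Stand-in for collection.query(..., where={'section': section})."""
    CALLS["count"] += 1  # counts real (non-cached) retrievals
    # In production: vector search restricted to the pre-filtered subset.
    return (f"[{section}] chunk matching {query!r}",)
```

Pre-filtering on metadata narrows the candidate set before any vector comparison runs, and the cache absorbs the redundant lookups that a chatty session produces.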

Computation Parallelization

Sequential Inference Bottlenecks

The inference process and data fetching were executed sequentially, creating unnecessary waiting periods between model and database operations.

Multi-Threaded Inference and Prefetching

Distributed model inference and prefetching tasks across multiple threads, enabling concurrent operations and seamless, real-time response flow.
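The pattern above can be sketched with `concurrent.futures`. Both worker functions are stubs standing in for the real operations (LLM inference and database/document prefetch); their names and return values are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch of overlapping model inference with data prefetching.
# Both workers are stubs; in the real system one would call the LLM
# and the other would fetch documents or subscription data.

def run_inference(prompt: str) -> str:
    return f"answer to {prompt!r}"

def prefetch_context(user_id: str) -> list[str]:
    return [f"profile:{user_id}", "today's menu"]

def answer(prompt: str, user_id: str) -> tuple[str, list[str]]:
    """Launch inference and prefetch concurrently instead of sequentially."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        inf = pool.submit(run_inference, prompt)
        pre = pool.submit(prefetch_context, user_id)
        return inf.result(), pre.result()
```

Because the two tasks no longer wait on each other, the end-to-end latency approaches the slower of the two rather than their sum.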

Adaptive Context Windowing

Context Overflow and Token Inefficiency

Long input contexts often exceeded model token limits, increasing latency and reducing efficiency without significantly improving accuracy.

Intelligent Context Summarization

Introduced adaptive truncation and summarization to condense previous interactions while preserving key semantics, optimizing token usage per request.
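The windowing logic can be sketched as a token-budget pass over the conversation history: recent turns are kept verbatim, older turns are collapsed into summaries. Everything below is a simplified assumption — word count stands in for real token counting, and the "summarizer" is a stub that keeps only the first sentence.

```python
# Hedged sketch of adaptive context windowing: newest turns are kept
# whole, older turns that would overflow the budget are condensed.
# Word count approximates token count; the summarizer is a stub.

def summarize(turn: str) -> str:
    """Stub summarizer: keep only the first sentence of a turn."""
    return turn.split(". ")[0] + "."

def fit_context(turns: list[str], budget: int) -> list[str]:
    """Keep newest turns verbatim; summarize older ones that overflow."""
    kept: list[str] = []
    used = 0
    for turn in reversed(turns):           # walk newest-first
        cost = len(turn.split())
        if used + cost <= budget:
            kept.append(turn)
            used += cost
        else:
            kept.append(summarize(turn))   # condensed, key semantics only
            used += len(kept[-1].split())
    kept.reverse()                         # restore chronological order
    return kept
```

A real summarizer would be a cheap LLM call, but the shape is the same: spend the token budget on recency, spend summaries on everything else.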

Project Phoenix: Final Outcomes

The final system successfully delivered on all core goals, transforming a conceptual tool into a high-performance, mission-critical application.

From Concept to Conversation

This project demonstrates Alyzom Solution’s expertise in iterative design, complex AI integration, and strategic performance optimization to deliver mission-critical applications.