
On-device Query Router with Chrome's Prompt API

The web as we know it is undergoing quite some changes. Websites can finally use on-device AI capabilities without downloading gigabytes of their own models, and guess what: the Google Chrome devs are pushing for it. They have proposed around seven APIs that use the underlying AI-capable hardware to run AI experiences, exposing essential tools for building interesting experiences without worrying about cost for simple, practical use cases. No servers, no API keys, no data leaving your machine.

In this post, I’ll show you how I built a hybrid RAG (Retrieval-Augmented Generation) chatbot that routes queries between Chrome’s local Prompt API and a FastAPI RAG backend: a system that knows when to handle queries locally on-device versus when to reach out to more powerful cloud models.

Before we dive deeper into the code or demo, it’s worth knowing a bit of Chrome AI history, just for the sake of it.

From Experiment to Web Standard

Google’s journey to bring AI directly into the browser began publicly at Google I/O 2024 in May, where they announced plans to integrate Gemini Nano, their efficient, on-device language model, directly into Chrome. The vision was to enable web developers to build AI-powered experiences without managing infrastructure, deploying models, or worrying about API costs.

By August 2024, Chrome launched several APIs into origin trials, opening up experimental access to developers worldwide. The Early Preview Program quickly attracted over 13,000 developers eager to explore this new API layer, soon to be web standards. (My team and I back at QED42 were part of the EPP as well; we explored it occasionally but didn’t dive deep, as it’s still being slowly accepted and adopted by MDN and the wider web community.)

The momentum continued through 2024 and into 2025. Chrome 138, released in mid-2025, marked a major milestone by bringing the Summarizer API, Language Detector API, Translator API, and Prompt API for Chrome Extensions into stable release. At Google I/O 2025 in May, Google expanded the offerings further with the Writer, Proofreader, and Rewriter APIs entering origin trials, and unveiled multimodal capabilities for the Prompt API in Chrome Canary.

The Push for Web Standards

Chrome isn’t building these APIs in isolation. Google has actively engaged with web standards bodies to make on-device AI a cross-browser reality. The APIs have been proposed to the W3C Web Incubator Community Group, with several, including the Language Detector, Translator, Summarizer, Writer, and Rewriter APIs, already adopted by the W3C WebML Working Group.

Chrome has formally requested feedback from Mozilla (Firefox) and WebKit (Safari) through their respective standards positions processes. While explicit responses from other browser vendors are still pending, the standardization effort signals Chrome’s intent to make this a web-wide capability, not a proprietary feature.

For the latest updates and official documentation, visit:

The Built-in AI API Landscape

Chrome’s AI capabilities span seven distinct APIs, each optimized for specific tasks:

  1. Prompt API (Origin Trial / Stable in Extensions) The most flexible of the bunch, a general-purpose interface to Gemini Nano for natural language tasks. Supports text, image, and audio inputs (multimodal in Canary). Perfect for classification, Q&A, content analysis, and any custom AI workflow. Some use cases are: Chatbots, content classification, semantic search, custom workflows and Query Routers.

  2. Summarizer API (Stable in Chrome 138+) Generates summaries in various formats: single sentences, paragraphs, bullet lists, or custom lengths. Ideal for condensing long articles, meeting transcripts, or user-generated content. Some use cases are: Article TL;DR, meeting notes and forum post summaries.

  3. Writer API (Origin Trial) Creates new content based on specified writing tasks and optional context. Can draft emails, reviews, blog posts, or any text from scratch. Some use cases are: Email drafting, content generation and writing assistance.

  4. Rewriter API (Origin Trial) Refines existing text by adjusting length or tone. Make content more formal, casual, concise, or elaborate. Some use cases are: Tone adjustment, text polishing and feedback improvement.

  5. Proofreader API (Chrome Canary) Grammar and style corrections for polished writing. Some use cases are: Writing quality checks and error detection.

  6. Translator API (Stable in Chrome 138+) Local language translation using expert models (not Gemini Nano). Some use cases are: Multi-language support and accessibility.

  7. Language Detector API (Stable in Chrome 138+) Identifies the language of text input. Some use cases are: Auto-detection for translation and content routing.

Each API is task-specific and optimized for its domain. But here’s the thing: the Prompt API stands apart.

Prompt API: Custom Prompting for a Simple Query Router

The flexibility of the Prompt API makes it perfect for contextual query routing, a use case that’s both practical and underutilized. Instead of blindly sending every query to an expensive cloud API or handling everything with a constrained on-device model, you can create a hybrid system that:

  1. Uses Prompt API to classify the query (simple vs. complex)
  2. Routes simple queries to on-device processing (fast, free, private)
  3. Routes complex queries to powerful cloud models (when needed)

This is the architectural pattern we’ll explore in depth.

Before diving into the implementation, let’s look at the data: why does routing matter?

The Economics of Query Routing

Recent research reveals that query routing isn’t just a nice-to-have; it’s transformational for cost, performance, and user experience. Here’s what the data shows:

| Implementation | Cost Reduction | Source |
| --- | --- | --- |
| Routers with confidence-based escalation | 70%+ reduction | IBM Research |
| Small-to-large model routing | 50-85% reduction | Arcee AI |
| RouteLLM (GPT-4 → Mixtral routing) | 50% reduction while maintaining 95%+ quality | RouteLLM |
| Selective model routing on MT Bench | 75% reduction vs. random baseline | Anyscale |

Query Complexity Distribution

Not all queries are created equal. In real-world conversational AI deployments, the vast majority of queries are simple-and perfect candidates for on-device routing.

| Query Type | % of Total | Complexity | Ideal Route |
| --- | --- | --- | --- |
| Greetings & pleasantries | 15-20% | Trivial | On-device (Prompt API) |
| Simple follow-ups | 25-30% | Low | On-device (Prompt API) |
| FAQ-style questions | 30-35% | Low-Medium | On-device with context |
| Analytical queries | 10-15% | High | Cloud API |
| Multi-step reasoning | 5-10% | Very High | Cloud API |

What you see right there is a potential cost reduction of more than 50-60% if implemented right.

Why This Matters for Our Little Medical RAG

Applied to our medical knowledge chatbot:

  1. Follow-up queries (“Can you elaborate?”) → ~30% of interactions → 100% on-device
  2. Simple context queries (“What is the heart?”) → ~40% of interactions → On-device with local RAG
  3. Complex queries (“Compare complications…”) → ~30% of interactions → Cloud API

Expected outcome: ~70% of queries handled on-device, saving 70%+ on API costs while improving latency and privacy.
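That estimate is simple arithmetic over the interaction mix; sketched out in a few lines (the shares are the rough figures above, not measured data):

```javascript
// Rough cost model for the routing split above.
// Only cloud-routed queries incur API cost, so the savings
// fraction is approximately the on-device traffic share.
const routes = [
  { name: 'followup', share: 0.30, onDevice: true },
  { name: 'context',  share: 0.40, onDevice: true },
  { name: 'complex',  share: 0.30, onDevice: false },
];

const onDeviceShare = routes
  .filter(r => r.onDevice)
  .reduce((sum, r) => sum + r.share, 0);

console.log(`${Math.round(onDeviceShare * 100)}% on-device, ` +
            `~${Math.round(onDeviceShare * 100)}% API cost saved`);
```

Shift the mix and the savings move with it, which is why the classifier's accuracy directly translates into cost.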

The Idea: A Hybrid RAG Chat Router

Traditional RAG systems are all-or-nothing: every query goes through the same pipeline of vector search, context injection, and LLM generation. But not all queries need the full treatment.

Consider these questions to a medical knowledge chatbot:

  • “What is the heart?” → Needs context from the knowledge base, but straightforward
  • “Can you elaborate on that?” → Just needs conversation history, no retrieval
  • “How do complications of diabetes interact with hypertension in elderly patients?” → Complex, needs deep retrieval and powerful reasoning

Enter the Three-Tier Routing Strategy

I designed a little system that classifies user queries into three categories and routes them accordingly:

Query Router Diagram

Implementation Deep Dive

I’m not going to document the whole backend code; it’s a simple RAG, nothing fancy, in fact it’s quite basic. But you can visit it on GitHub: Chrome AI Demo

Backend: Python RAG API

The backend is a FastAPI service with:

  • FAISS vector index for similarity search
  • embeddinggemma-300m-medical, a fine-tuned sentence-transformers embedding model (picked somewhat at random from Hugging Face)
  • OpenRouter for cloud LLM inference (meta-llama/llama-3.3-8b-instruct)
  • Two endpoints:
    • GET /search - Returns top-k relevant chunks for a query
    • POST /chat - Full RAG pipeline with streaming response

This backend runs on localhost:8000 and handles complex queries.

Frontend: Query Router

This happens in web/chat.js. Here’s how it works:

1. Initialize Two Prompt API Sessions

// Router session: Classifies queries
routerSession = await LanguageModel.create({
    initialPrompts: [{
        role: 'system',
        content: `You are a query classifier. Classify queries into:

        1. "followup": Refers to previous conversation
        2. "context": Asks about specific medical topics
        3. "complex": Requires deep analysis or comparisons

        Respond with JSON: {"category": "followup"|"context"|"complex"}`
    }]
});

// Chat session: Handles conversations
chatSession = await LanguageModel.create();

2. Classify the Query with JSON Schema

async function classifyQuery(message) {
    const schema = {
        type: "object",
        properties: {
            category: {
                type: "string",
                enum: ["followup", "context", "complex"]
            }
        },
        required: ["category"]
    };

    const result = await routerSession.prompt(message, {
        responseConstraint: schema
    });

    return JSON.parse(result).category;
}
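Even with the constraint, a defensive fallback helps: if parsing fails or the category is unexpected, defaulting to the cloud route is the safe choice. A sketch (parseCategory is my own hypothetical helper, not part of the app's code):

```javascript
// Validate the router's raw output and fall back to 'complex'
// (the cloud route) on anything malformed or unexpected.
const VALID_CATEGORIES = ['followup', 'context', 'complex'];

function parseCategory(rawResult) {
  try {
    const { category } = JSON.parse(rawResult);
    return VALID_CATEGORIES.includes(category) ? category : 'complex';
  } catch {
    // Malformed JSON: let the capable cloud model handle the query
    return 'complex';
  }
}
```

Defaulting to the cloud route means a misbehaving classifier degrades cost, not answer quality.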

The JSON Schema constraint ensures we get structured output, no parsing ambiguity.

3. Route Based on Classification

For follow-up queries:

if (category === 'followup') {
    const prompt = buildConversationContext() + `User: ${message}`;
    const stream = chatSession.promptStreaming(prompt);

    let fullContent = '';
    for await (const chunk of stream) {
        fullContent += chunk;
        updateMessageContent(assistantMessageId, fullContent);
    }
}
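The buildConversationContext() helper is referenced but not shown; here’s a minimal sketch of what it could look like, assuming the history is passed in as an array of {role, content} objects (that shape is my assumption, not taken from the repo):

```javascript
// Sketch of a context builder: serialize the most recent turns
// into a plain transcript the on-device model can condition on.
function buildConversationContext(messages, maxTurns = 6) {
  return messages
    .slice(-maxTurns) // keep only the latest turns to bound prompt size
    .map(m => `${m.role === 'user' ? 'User' : 'Assistant'}: ${m.content}`)
    .join('\n') + '\n';
}
```

Capping maxTurns also doubles as a crude guard against the on-device context limits discussed later.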

For context-needed queries:

else if (category === 'context') {
    // Fetch relevant chunks
    const searchResponse = await fetch(
        `${API_BASE_URL}/search?query=${encodeURIComponent(message)}&top_k=3`
    );
    const chunks = await searchResponse.json();

    // Build prompt with context
    const context = chunks.map(chunk =>
        `[${chunk.id}]: ${chunk.content}`
    ).join('\n\n');

    const prompt = buildConversationContext() +
                   `Context from knowledge base:\n${context}\n\n` +
                   `User: ${message}`;

    const stream = chatSession.promptStreaming(prompt);
    // Stream response...
}

For complex queries:

else {
    // Use existing Python API
    const response = await fetch(`${API_BASE_URL}/chat`, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ messages })
    });
    // Handle streaming response...
}
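Both streaming branches above end with an elided comment; a generic reader for a streamed fetch body might look like this (readStreamedResponse is a name of my own, and it assumes the backend streams plain text chunks):

```javascript
// Read a streamed fetch body chunk by chunk, invoking a callback
// with the accumulated text so the UI can update progressively.
async function readStreamedResponse(body, onUpdate) {
  const reader = body.getReader();
  const decoder = new TextDecoder();
  let fullContent = '';
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    fullContent += decoder.decode(value, { stream: true });
    onUpdate(fullContent);
  }
  return fullContent;
}
```

Wired in, it would be something like `await readStreamedResponse(response.body, text => updateMessageContent(assistantMessageId, text));`.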

Origin Trial Setup

Since the Prompt API is still in an Origin Trial, you need to register for a token and include it on the page to use it.

<meta http-equiv="origin-trial" content="A/tiwlx81CZF7NW3Sk...">

You can read more about it in the Google Chrome Dev Docs - Prompt API.

See It In Action

Here’s the system working in a real demo:

Video: Hybrid RAG Chat with Chrome Prompt API

The video shows follow-up handling, context retrieval, routing and some of my console logs.

Credits to Cap - Open Source Loom Alternative for the video recording; it was seamless and quick.

Challenges and Limitations

Let’s be honest about the constraints:

Hardware Requirements

Gemini Nano requires significant resources:

  • 22 GB of free disk space
  • 4+ GB VRAM (GPU) or 16 GB RAM + 4 CPU cores (CPU mode)
  • Desktop OS (Windows 10/11, macOS 13+, Linux, ChromeOS on Chromebook Plus)

Not all users will have compatible devices. Mobile support is not yet available.

And yeah, I’m not a Windows user, but only my Windows machine had the required VRAM readily available.

Model Capabilities

Gemini Nano is optimized for on-device efficiency, not accuracy at all costs. It’s not a replacement for GPT-4 or Claude. Complex reasoning, factual accuracy, and long-context tasks are better suited for cloud models; hence our routing strategy.

Gated Model and Missing LoRA Support

Gemini Nano is still closed source. It would have been better if they had just used a Gemma model and also opened up a LoRA API, which would let folks ship tiny 25-100 MB fine-tunes targeting more niche use cases, like generative UI running locally, and so on.

The early documentation mentioned LoRA fine-tune APIs, but that was later scrubbed from the docs; I assume they are pivoting or changing some plans around it.

I did try doing something, though; or rather, my original vision for this blog was different.

Context Window Limits

On-device models have smaller context windows. For very long conversations or large context injections, you may hit limits. The routing logic helps by keeping complex cases on cloud models, but you would still need logic to keep the context within bounds, such as rolling windows or trimming stale context over time.
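One rough way to stay inside a budget is to drop the oldest turns first; a sketch (the four-characters-per-token ratio is a common approximation, not something the Prompt API specifies):

```javascript
// Trim the oldest turns until the estimated token count fits
// the budget. Token estimate: ~4 characters per token (crude).
function trimToBudget(messages, maxTokens = 1024) {
  const estimateTokens = text => Math.ceil(text.length / 4);
  const kept = [...messages];
  let total = kept.reduce((s, m) => s + estimateTokens(m.content), 0);
  while (kept.length > 1 && total > maxTokens) {
    total -= estimateTokens(kept.shift().content); // drop the oldest turn
  }
  return kept;
}
```

In a real app you would check the session's reported input quota instead of guessing, but the drop-oldest-first policy stays the same.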

Browser Compatibility

This only works in Chrome 138+. Cross-browser support depends on standardization progress and other vendors adopting the APIs. (Opera is, I guess, the only other browser that supports this, since it’s also Chromium-based.)

Looking Forward

Chrome’s built-in AI APIs are still experimental, but the trajectory is clear: on-device AI is becoming a web platform primitive. As standardization progresses and browser support expands, we’ll see patterns like intelligent query routing become standard practice.

Imagine a future where:

  • Static sites have AI features without backend costs (like mine; I’m working on it for fun)
  • Privacy-first AI is the default, not an exception
  • Hybrid architectures seamlessly blend on-device and cloud intelligence
  • Every website can offer personalized, context-aware experiences complete on-device and local

We’re in the early innings, but the potential is enormous. Thanks for reading!

Additional Resources
