Beyond the Vector Database: Engineering High-Performance On-Device Semantic Search with Expo and PGlite

For the past year, the industry obsession has been clear: RAG (Retrieval-Augmented Generation) architectures hosted on massive cloud vector databases like Pinecone or Weaviate. While these are great for enterprise-scale web apps, they introduce significant friction for mobile developers: latency, costs, and the absolute requirement for a data connection.

I’ve been experimenting with moving the intelligence closer to the user. My goal was to build a fully offline, high-performance semantic search engine inside an Expo (React Native) app. The breakthrough came when I combined PGlite, a build of Postgres compiled to WASM, with the pgvector extension that PGlite bundles.

Here’s how I bypassed the cloud and why this architecture is a game-changer for local-first AI.

The Problem with the "Cloud-First" Vector Approach

When we build mobile apps, we often treat them as thin clients. For semantic search, the traditional flow is:

Generate an embedding on a server (or call OpenAI).
Query a remote vector database.
Return IDs and fetch data from a separate API.

On a mobile device, this feels sluggish. If the user is on a spotty 5G connection, the experience breaks. More importantly, many of the use cases I'm interested in, like searching personal journals or private documents, involve data that for privacy reasons should never leave the device.

Enter PGlite: Postgres in the Browser (and Mobile)

PGlite, from the team at ElectricSQL, is a revelation. It’s a build of Postgres compiled to WASM and packaged as a lightweight library. It lets you run a full Postgres instance in memory or persisted to local storage (IndexedDB or OPFS in the browser, or a filesystem directory).

What makes it the "killer app" for local AI is its support for extensions—specifically pgvector. This means we can run cosine similarity searches using standard SQL syntax directly on the mobile device.
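To make "cosine similarity search" concrete, here is a minimal sketch of the distance that pgvector's `<=>` operator computes, written in plain TypeScript. The function name `cosineDistance` is illustrative and not part of any library; in the app itself this math runs inside Postgres, not in JavaScript.

```typescript
// Cosine distance = 1 - cosine similarity.
// 0 means same direction, 1 means orthogonal, 2 means opposite direction.
export function cosineDistance(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error(`Dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Note: an all-zero vector has no direction, so this sketch returns NaN for it.
  return 1 - dot / Math.sqrt(normA * normB);
}
```

Because the measure depends only on direction, scaling a vector does not change its distance to another vector, which is why it suits embeddings of texts with different lengths.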

The Architecture

In my implementation, I used a three-tier local stack:

Expo (React Native): The host environment.
Transformers.js: For generating embeddings on the client (using a quantized model like all-MiniLM-L6-v2).
PGlite + pgvector: For storing and querying those embeddings.

Setting up the Database

First, I had to configure PGlite to handle the vector extension. Here’s the core initialization logic I used:

typescript

import { PGlite } from "@electric-sql/pglite";
import { vector } from "@electric-sql/pglite/vector";

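// Cache the promise (not the instance) so concurrent callers share one initialization.
let dbPromise: Promise<PGlite> | null = null;

const initDb = async (): Promise<PGlite> => {
  const db = await PGlite.create({
    dataDir: 'idb://my-local-db', // Or filesystem for native Expo
    extensions: {
      vector,
    },
  });

  // Enable pgvector, then create the table and its index. Every statement is
  // idempotent, so running this on each app start is safe.
  await db.exec(`
    CREATE EXTENSION IF NOT EXISTS vector;
    CREATE TABLE IF NOT EXISTS documents (
      id SERIAL PRIMARY KEY,
      content TEXT,
      embedding vector(384) -- Dimension for all-MiniLM-L6-v2
    );
    -- Named index with IF NOT EXISTS: avoids building a duplicate on every launch
    CREATE INDEX IF NOT EXISTS documents_embedding_idx
      ON documents USING hnsw (embedding vector_cosine_ops);
  `);

  return db;
};

export const getDb = (): Promise<PGlite> => {
  if (!dbPromise) dbPromise = initDb();
  return dbPromise;
};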
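The setup above covers the schema but not the write path, so here is a hedged sketch of how a document might be embedded and inserted. The names `QueryRunner`, `Embedder`, and `addDocument` are hypothetical; `QueryRunner` describes only the `query(sql, params)` method this sketch relies on, which a PGlite instance should satisfy, and `Embedder` stands in for whatever wraps the Transformers.js model.

```typescript
// Hypothetical minimal shape of the database handle: just `query(sql, params)`.
interface QueryRunner {
  query(sql: string, params?: unknown[]): Promise<{ rows: unknown[] }>;
}

// Hypothetical embedding function, e.g. a wrapper around a Transformers.js pipeline.
type Embedder = (text: string) => Promise<number[]>;

// Must match the vector(384) column declared in the schema.
const EMBEDDING_DIM = 384;

export async function addDocument(
  db: QueryRunner,
  embed: Embedder,
  content: string,
): Promise<number> {
  const embedding = await embed(content);
  if (embedding.length !== EMBEDDING_DIM) {
    throw new Error(`Expected ${EMBEDDING_DIM} dimensions, got ${embedding.length}`);
  }
  // The vector is sent in its text form, "[0.1,0.2,...]", the same way the
  // search query in this post passes its parameter.
  const result = await db.query(
    "INSERT INTO documents (content, embedding) VALUES ($1, $2) RETURNING id;",
    [content, JSON.stringify(embedding)],
  );
  const row = result.rows[0] as { id: number };
  return row.id;
}
```

Checking the dimension before the insert turns a model mismatch into a readable JavaScript error instead of a database error surfaced from inside WASM.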

The Breakthrough: Performance at Scale

One of my biggest concerns was indexing performance. Running a Hierarchical Navigable Small World (HNSW) index in a WASM environment sounded like a recipe for a frozen UI.

However, I found that for datasets under 10,000 vectors (which covers most personal mobile use cases), the search latency was sub-15ms. The HNSW index in pgvector is highly optimized, and even within the WASM overhead, it outperforms any custom JavaScript-based nearest-neighbor implementation I’ve tried.
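Latency figures like these depend heavily on the device, so it is worth measuring on your own hardware. The helper below is a minimal sketch of one way to do that; `medianLatencyMs` is an illustrative name, it assumes `performance.now()` is available in your runtime, and it is not the harness used for the numbers above.

```typescript
// Runs `run` a few times to warm up, then reports the median wall-clock
// duration in milliseconds over `iterations` timed runs.
export async function medianLatencyMs(
  run: () => Promise<unknown>,
  iterations = 20,
  warmup = 3,
): Promise<number> {
  if (iterations < 1) {
    throw new Error("iterations must be at least 1");
  }
  // Warm-up runs are not timed; they absorb one-off costs such as cold caches.
  for (let i = 0; i < warmup; i++) {
    await run();
  }
  const samples: number[] = [];
  for (let i = 0; i < iterations; i++) {
    const start = performance.now();
    await run();
    samples.push(performance.now() - start);
  }
  samples.sort((a, b) => a - b);
  const mid = Math.floor(samples.length / 2);
  return samples.length % 2 === 1
    ? samples[mid]
    : (samples[mid - 1] + samples[mid]) / 2;
}
```

The median is used rather than the mean so that a single slow run, for example one interrupted by garbage collection, does not skew the result.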

Executing a Semantic Search

Once I generate a query embedding locally, the SQL query is as familiar as it gets:

typescript

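async function semanticSearch(queryVector: number[]) {
  const db = await getDb();
  // `<=>` is pgvector's cosine distance operator: smaller means more similar.
  // The query vector is passed in pgvector's text form, e.g. "[0.1,0.2,...]".
  const results = await db.query(`
    SELECT content, embedding <=> $1 AS distance
    FROM documents
    ORDER BY distance ASC
    LIMIT 5;
  `, [JSON.stringify(queryVector)]);

  return results.rows;
}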
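One caveat with passing `JSON.stringify(queryVector)` directly: `JSON.stringify` turns `NaN` and `Infinity` into `null`, which is not a valid vector element. A small guard, sketched here under the hypothetical name `toVectorLiteral`, could validate the vector before it reaches the database.

```typescript
// Serializes a numeric array into pgvector's text form, "[v1,v2,...]",
// rejecting vectors with the wrong length or non-finite values.
export function toVectorLiteral(values: number[], expectedDim = 384): string {
  if (values.length !== expectedDim) {
    throw new Error(`Expected ${expectedDim} dimensions, got ${values.length}`);
  }
  for (const v of values) {
    if (!Number.isFinite(v)) {
      throw new Error(`Vector contains a non-finite value: ${v}`);
    }
  }
  return `[${values.join(",")}]`;
}
```

With this in place, the parameter array in the search function could be `[toVectorLiteral(queryVector)]` instead of `[JSON.stringify(queryVector)]`.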

Engineering Hurdles & Workarounds

It wasn't all smooth sailing. There are a few things you need to be aware of when implementing this:

WASM Memory Limits: In a mobile environment, memory management is aggressive. I had to ensure the PGlite instance didn't balloon by carefully managing the size of the vectors and avoiding unnecessary large-blob storage in the same table.
Native Threading: Expo’s JS engine (Hermes) is fast, but WASM execution can still block the main thread if you’re doing heavy lifting. On web, I moved the embedding generation and the PGlite queries into a Web Worker. On native I also pulled in expo-standard-web-crypto, but be aware that it only polyfills crypto.getRandomValues(); it does not provide threading, so heavy work there still needs its own strategy for staying off the main loop.
Model Loading: Downloading a 90MB model file to a phone is a one-time cost, but you need to cache it aggressively using expo-file-system to avoid re-fetching on every app boot.
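Where a worker is not available, one fallback for bulk work such as indexing many documents is to process items in small chunks and yield to the event loop between them. The sketch below illustrates that idea; `processInChunks` is a hypothetical helper, and it does not make any individual query non-blocking, it only bounds how long the thread is held at a time.

```typescript
// Processes `items` in slices of `chunkSize`, pausing between slices so
// queued work (touch events, rendering callbacks) gets a chance to run.
export async function processInChunks<T>(
  items: T[],
  chunkSize: number,
  handle: (chunk: T[]) => Promise<void>,
): Promise<void> {
  if (chunkSize < 1) {
    throw new Error("chunkSize must be at least 1");
  }
  for (let i = 0; i < items.length; i += chunkSize) {
    await handle(items.slice(i, i + chunkSize));
    // setTimeout(0) schedules a macrotask, handing control back to the
    // event loop before the next chunk starts.
    await new Promise<void>((resolve) => setTimeout(resolve, 0));
  }
}
```

Smaller chunks keep the UI more responsive at the cost of longer total indexing time, so the right size is something to tune per device.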

Why This Matters

We are moving toward an era of "Local-First AI." By shifting vector search to the device, we achieve:

No Network Latency: No round-trip to a server in Virginia; query time depends only on the device.
Privacy by Default: User data never leaves the device. If the phone's storage is encrypted, the database files are encrypted at rest along with it.
Cost Efficiency: No $50/month bill for a hosted vector DB that's mostly idle.

Building this with Expo and PGlite proved that Postgres isn't just for the server anymore. It’s a formidable mobile database that, with pgvector, turns a standard React Native app into a sophisticated AI tool.

If you’re still piping all your embeddings to the cloud, I highly recommend giving PGlite a spin. The future of AI isn't just in the datacenter—it's in your pocket.
