MLOps · 7 min · Feb 2026

AlloyDB pgvector for Production RAG: RLS, Sub-100ms Retrieval, Multi-tenancy

We run AlloyDB with Row-Level Security keyed on tenant_id for every Warble Cloud customer. Here's our schema, index configuration, and how we hit sub-100ms retrieval on 10K+ remediation recipes.

The RAG architecture blog-post canon looks like this: 'use a dedicated vector database.' Pinecone, Weaviate, Qdrant, Milvus, pick one. The problem is that you now have two systems of record — relational (users, tenants, audit trail) and vector (embeddings) — and you have to keep them in sync. That sync layer is where bugs live.

AlloyDB with the pgvector extension collapses both systems into one. You get transactional guarantees between your tenant table and your embeddings, Postgres-native Row-Level Security, and good-enough retrieval latency for most workloads. This post is the specific schema + index + query patterns that make it actually fast in multi-tenant production.

Why AlloyDB specifically (and not vanilla Postgres + pgvector)

  • Columnar engine for the non-vector columns. Our metadata JSONB gets scanned when we filter embeddings by tenant + date + tool_type; AlloyDB's columnar accelerator makes that sub-10ms.
  • ScaNN index available (AlloyDB-specific). About 4× faster than pgvector's IVFFlat on our benchmark with comparable recall@10.
  • Read pool routing baked in. We send vector queries to replica nodes, writes to primary, without needing PgBouncer config gymnastics.

You can absolutely do this on regular Postgres + pgvector. We did for six months. Latency was fine at 10K rows, mediocre at 100K, ugly at 1M. AlloyDB's columnar + ScaNN bought us another 10× headroom at similar cost per GB.

The schema

CREATE EXTENSION IF NOT EXISTS vector;

-- Tenants (RLS key source)
CREATE TABLE tenants (
    id           uuid PRIMARY KEY DEFAULT gen_random_uuid(),
    slug         text UNIQUE NOT NULL,
    plan         text NOT NULL DEFAULT 'free',
    created_at   timestamptz DEFAULT now()
);

-- Remediation recipes (the RAG corpus)
CREATE TABLE recipes (
    id           uuid PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id    uuid NOT NULL REFERENCES tenants(id) ON DELETE CASCADE,
    title        text NOT NULL,
    body         text NOT NULL,
    tool_type    text,                     -- filter axis
    severity     text,                     -- filter axis
    embedding    vector(768) NOT NULL,     -- gemini-embedding-001
    metadata     jsonb DEFAULT '{}'::jsonb,
    created_at   timestamptz DEFAULT now()
);

-- Critical: RLS keyed on tenant_id
ALTER TABLE recipes ENABLE ROW LEVEL SECURITY;

CREATE POLICY tenant_isolation ON recipes
    USING (tenant_id = current_setting('app.tenant_id')::uuid);

The embedding dimension (768) matches gemini-embedding-001. Pick your dimension once; resizing means re-embedding the whole corpus.
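The RLS policy compares tenant_id on every row, and the hybrid queries later in this post pre-filter on tool_type, so plain B-tree indexes on those columns belong next to the vector index. A sketch — the index names are illustrative, not part of the schema above:

```sql
-- Supporting B-tree indexes (illustrative names; tune to your filter axes)
CREATE INDEX recipes_tenant_idx    ON recipes (tenant_id);
CREATE INDEX recipes_tool_type_idx ON recipes (tenant_id, tool_type);
```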

The index that matters

Without an index, pgvector sequential-scans the whole table — fine for 1K rows, catastrophic at 1M. The viable options are IVFFlat and HNSW (both built into pgvector since 0.5.0), plus AlloyDB's ScaNN. We evaluated all three against our 10K-row corpus:

-- Option 1: IVFFlat (classic pgvector, good for moderate scale)
CREATE INDEX recipes_embedding_ivf ON recipes
    USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100);           -- sqrt(row_count), tune per table size

-- Option 2: HNSW (newer, better recall, higher build cost)
CREATE INDEX recipes_embedding_hnsw ON recipes
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

-- Option 3 (AlloyDB-specific): ScaNN — our pick
CREATE INDEX recipes_embedding_scann ON recipes
    USING scann (embedding cosine)
    WITH (num_leaves = 100, quantizer = 'sq8');

At 10K rows, IVFFlat is 40ms median, HNSW is 28ms, ScaNN is 9ms. At 1M, ScaNN stays under 50ms; IVFFlat degrades to 180ms.

The real trap is the recall-vs-speed tradeoff. num_leaves ≈ sqrt(row_count) is the standard heuristic; we run at 100 leaves for a 10K corpus and get 0.94 recall@10 vs exact search. Dropping to 50 leaves cuts latency to 5ms but recall to 0.87 — not worth it for our use case, fine for others.
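These knobs are also adjustable at query time without rebuilding the index. The two pgvector parameters below are documented query-time settings; the third is AlloyDB-specific (scann.num_leaves_to_search), so verify it against your AlloyDB version before relying on it:

```sql
-- Per-session recall/latency knobs (use SET LOCAL for per-transaction scope)
SET ivfflat.probes = 10;             -- IVFFlat: more probes, higher recall, slower
SET hnsw.ef_search = 100;            -- HNSW: larger candidate list, higher recall
SET scann.num_leaves_to_search = 20; -- AlloyDB ScaNN: leaves probed per query
```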

RLS, connection-side

The RLS policy reads current_setting('app.tenant_id'). That setting has to be present on every connection. The pattern:

// Go pgx (v5) example — the same pattern works in any driver.
// Assumes a package-level `pool *pgxpool.Pool` and a `Recipe` struct.
func WithTenantContext(ctx context.Context, tx pgx.Tx, tenantID string) error {
    _, err := tx.Exec(ctx, "SELECT set_config('app.tenant_id', $1, true)", tenantID)
    return err
}

// Every query path wraps in a transaction so the setting is isolated:
func SimilarRecipes(ctx context.Context, tenantID string, query []float32) ([]Recipe, error) {
    var out []Recipe
    err := pgx.BeginTxFunc(ctx, pool, pgx.TxOptions{}, func(tx pgx.Tx) error {
        if err := WithTenantContext(ctx, tx, tenantID); err != nil {
            return err
        }

        rows, err := tx.Query(ctx, `
            SELECT id, title, body, tool_type
            FROM recipes
            WHERE tool_type = $1
            ORDER BY embedding <=> $2
            LIMIT 10
        `, "kubernetes", pgvector.NewVector(query))
        if err != nil {
            return err
        }
        defer rows.Close()
        // ... scan rows into out ...
        return rows.Err()
    })
    return out, err
}

The `true` in set_config is the local flag: it scopes the setting to the current transaction only. Skip it and the setting persists on the session, leaking tenant context to the next connection checkout.
💡 The connection-pool trap
PgBouncer in session-pooling mode: SET commands persist across checkouts. In transaction-pooling mode (what you want for scale) they don't. You need both the local flag on set_config(..., true) *and* transaction pooling. Get either wrong and tenant A's subsequent query can run with tenant B's app.tenant_id.
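A quick way to check your pooler setup from psql. The two-argument missing_ok form of current_setting returns NULL instead of erroring when the setting is absent; through a transaction-pooling PgBouncer, the tenant id should never be visible outside a transaction:

```sql
SELECT current_setting('app.tenant_id', true);  -- expect NULL through the pooler

BEGIN;
SELECT set_config('app.tenant_id', '00000000-0000-0000-0000-000000000001', true);
SELECT current_setting('app.tenant_id', true);  -- the uuid, inside this tx only
COMMIT;

SELECT current_setting('app.tenant_id', true);  -- NULL again after commit
```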

The sub-100ms retrieval budget

For our retrieval workload the end-to-end budget is 100ms. Breakdown we routinely hit:

  • Embed the query: 18ms median via Vertex AI gemini-embedding-001 (same region as AlloyDB, same VPC).
  • DB query: 9ms for the ScaNN lookup + filter on tool_type + RLS check. All in one round-trip.
  • Network RTT within region: 1–2ms.
  • App-side deserialization + ranking: 4ms.
  • Total: ~35ms p50, ~80ms p99. Comfortable under 100ms.

Two knobs that dominated our latency early on:

  1. Hybrid filter query. WHERE tool_type = $1 ORDER BY embedding <=> $2 LIMIT 10 is much faster than filtering after the vector search if your filter is selective. ScaNN can use the tool_type B-tree index to narrow candidates before the vector distance computation.
  2. Connection pooling mode. Switching PgBouncer from session → transaction pooling (with the local-scoped set_config) cut p99 from 180ms to 80ms by eliminating handshake overhead.
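To confirm the pre-filter is applied before the distance computation on your own corpus, read the plan. The query below is a sketch — bind a real 768-dim embedding for the $1 placeholder (e.g. via PREPARE or your driver's EXPLAIN support):

```sql
EXPLAIN ANALYZE
SELECT id, title, tool_type
FROM recipes
WHERE tool_type = 'kubernetes'    -- selective pre-filter
ORDER BY embedding <=> $1         -- bind a real 768-dim query vector here
LIMIT 10;
```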

What we skipped (and why)

  • Re-ranking cross-encoder. Standard advice is: retrieve 100 with the vector index, re-rank with a cross-encoder, return top 10. We tried it. Latency jumped to 340ms and quality went up ~3% on our eval set. Not worth it — we retrieve 10 directly.
  • Hybrid search (BM25 + vector). The incremental quality gain on our technical corpus (remediation recipes, mostly code + shell snippets) was negligible. Vector alone was enough.
  • Separate tenant tables. Early architecture had recipes_acme, recipes_beta, etc. Schema migrations became a nightmare at 40 tenants. RLS on a single table is simpler and AlloyDB handles the selectivity fine.

Takeaways

  1. One database for relational + vector beats syncing two systems. AlloyDB + pgvector is the pragmatic choice unless you're at Pinecone-scale.
  2. RLS keyed on tenant_id + connection-scoped set_config is the safest multi-tenant isolation pattern. Get the PgBouncer mode right or you leak tenants.
  3. ScaNN beats IVFFlat and HNSW on AlloyDB for our scale. Benchmark on your data before picking.
  4. Filter *in the vector query* when you can, not after. Selective pre-filters can cut latency by 3×.
💡 Note on numbers
Index benchmark figures (40ms IVFFlat, 28ms HNSW, 9ms ScaNN @ 10K rows) are approximated from Google's published AlloyDB + pgvector performance guide and are representative, not measurements from Warble production. The schema, RLS policy shape, connection-pool trap, and index tuning heuristics (lists = sqrt(row_count), m = 16) match what we'd ship and what the AlloyDB + pgvector docs recommend. Run EXPLAIN ANALYZE on your own corpus before trusting any of the milliseconds.