AlloyDB pgvector for Production RAG: RLS, Sub-100ms Retrieval, Multi-tenancy
We run AlloyDB with Row-Level Security keyed on tenant_id for every Warble Cloud customer. Here's our schema, index configuration, and how we hit sub-100ms retrieval on 10K+ remediation recipes.
The RAG architecture blog-post canon looks like this: 'use a dedicated vector database.' Pinecone, Weaviate, Qdrant, Milvus, pick one. The problem is that you now have two systems of record — relational (users, tenants, audit trail) and vector (embeddings) — and you have to keep them in sync. That sync layer is where bugs live.
AlloyDB with the pgvector extension collapses both systems into one. You get transactional guarantees between your tenant table and your embeddings, Postgres-native Row-Level Security, and good-enough retrieval latency for most workloads. This post is the specific schema + index + query patterns that make it actually fast in multi-tenant production.
Why AlloyDB specifically (and not vanilla Postgres + pgvector)
- Columnar engine for the non-vector columns. Our metadata JSONB gets scanned when we filter embeddings by tenant + date + tool_type; AlloyDB's columnar accelerator makes that sub-10ms.
- ScaNN index available (AlloyDB-specific). About 4× faster than pgvector's IVFFlat on our benchmark with comparable recall@10.
- Read pool routing baked in. We send vector queries to replica nodes, writes to primary, without needing PgBouncer config gymnastics.
You can absolutely do this on regular Postgres + pgvector. We did for six months. Latency was fine at 10K rows, mediocre at 100K, ugly at 1M. AlloyDB's columnar + ScaNN bought us another 10× headroom at similar cost per GB.
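One way the read-pool routing shows up in application code is simply two pgx pools, one per endpoint, with similarity queries sent to the read pool. A minimal sketch; the DSNs are placeholders and the exact wiring (Auth Proxy, service discovery) will differ per setup:
// Sketch: route vector reads to the AlloyDB read pool, writes to the primary.
// imports: context, github.com/jackc/pgx/v5/pgxpool
// The DSNs below are placeholders for your primary and read pool endpoints.
type Pools struct {
	Primary *pgxpool.Pool // writes and anything transactional
	Read    *pgxpool.Pool // vector similarity queries
}

func NewPools(ctx context.Context) (*Pools, error) {
	primary, err := pgxpool.New(ctx, "postgres://app@primary.internal:5432/warble")
	if err != nil {
		return nil, err
	}
	read, err := pgxpool.New(ctx, "postgres://app@read-pool.internal:5432/warble")
	if err != nil {
		primary.Close()
		return nil, err
	}
	return &Pools{Primary: primary, Read: read}, nil
}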
The schema
CREATE EXTENSION IF NOT EXISTS vector;
-- Tenants (RLS key source)
CREATE TABLE tenants (
id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
slug text UNIQUE NOT NULL,
plan text NOT NULL DEFAULT 'free',
created_at timestamptz DEFAULT now()
);
-- Remediation recipes (the RAG corpus)
CREATE TABLE recipes (
id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
tenant_id uuid NOT NULL REFERENCES tenants(id) ON DELETE CASCADE,
title text NOT NULL,
body text NOT NULL,
tool_type text, -- filter axis
severity text, -- filter axis
embedding vector(768) NOT NULL, -- gemini-embedding-001
metadata jsonb DEFAULT '{}'::jsonb,
created_at timestamptz DEFAULT now()
);
-- Critical: RLS keyed on tenant_id
ALTER TABLE recipes ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON recipes
USING (tenant_id = current_setting('app.tenant_id')::uuid);
The index that matters
Without an index, pgvector sequence-scans the whole table — fine for 1K rows, catastrophic at 1M. Stock pgvector gives you IVFFlat and HNSW; AlloyDB adds ScaNN. We evaluated all three against our 10K-row corpus:
-- Option 1: IVFFlat (classic pgvector, good for moderate scale)
CREATE INDEX recipes_embedding_ivf ON recipes
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100); -- sqrt(row_count), tune per table size
-- Option 2: HNSW (newer, better recall, higher build cost)
CREATE INDEX recipes_embedding_hnsw ON recipes
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
-- Option 3 (AlloyDB-specific): ScaNN — our pick
CREATE INDEX recipes_embedding_scann ON recipes
USING scann (embedding cosine)
WITH (num_leaves = 100, quantizer = 'sq8');
The real trap is the recall-vs-speed tradeoff. num_leaves = sqrt(row_count) is the standard heuristic; we run 100 leaves for a 10K corpus and get 0.94 recall@10 versus exact search. Dropping to 50 leaves takes latency to 5ms but recall to 0.87 — not worth it for our use case, fine for others.
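If you want to reproduce the recall@10 comparison on your own corpus, one approach is to fetch the top 10 through the index, fetch them again with index scans disabled (the standard pgvector trick for exact results), and count the overlap. A rough sketch; RecallAt10 is a hypothetical helper, and you should confirm with EXPLAIN that enable_indexscan = off actually forces the exact path against AlloyDB's ScaNN index:
// Sketch: recall@10 for the ANN path vs an exact scan, for one query vector.
// imports: context, github.com/jackc/pgx/v5, github.com/pgvector/pgvector-go
// Assumes SET LOCAL enable_indexscan = off forces the exact path; verify the
// plan with EXPLAIN before trusting the "exact" side.
func RecallAt10(ctx context.Context, tx pgx.Tx, q []float32) (float64, error) {
	const top10 = `SELECT id::text FROM recipes ORDER BY embedding <=> $1 LIMIT 10`
	fetch := func() (map[string]bool, error) {
		ids := map[string]bool{}
		rows, err := tx.Query(ctx, top10, pgvector.NewVector(q))
		if err != nil { return nil, err }
		defer rows.Close()
		for rows.Next() {
			var id string
			if err := rows.Scan(&id); err != nil { return nil, err }
			ids[id] = true
		}
		return ids, rows.Err()
	}
	approx, err := fetch() // ANN path (index in place)
	if err != nil { return 0, err }
	if _, err := tx.Exec(ctx, "SET LOCAL enable_indexscan = off"); err != nil { return 0, err }
	exact, err := fetch() // exact path
	if err != nil { return 0, err }
	if len(exact) == 0 { return 0, nil }
	hits := 0
	for id := range exact {
		if approx[id] { hits++ }
	}
	return float64(hits) / float64(len(exact)), nil
}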
RLS, connection-side
The RLS policy reads current_setting('app.tenant_id'). That setting has to be present on every connection. The pattern:
// Go pgx example — same pattern works in any driver
// imports: context, github.com/jackc/pgx/v5, github.com/pgvector/pgvector-go
func WithTenantContext(ctx context.Context, tx pgx.Tx, tenantID string) error {
	// local = true scopes the setting to the current transaction
	_, err := tx.Exec(ctx, "SELECT set_config('app.tenant_id', $1, true)", tenantID)
	return err
}
// Every query path wraps in a transaction so the setting is isolated:
func SimilarRecipes(ctx context.Context, tenantID string, query []float32) ([]Recipe, error) {
	var recipes []Recipe
	err := pgx.BeginTxFunc(ctx, pool, pgx.TxOptions{}, func(tx pgx.Tx) error {
		if err := WithTenantContext(ctx, tx, tenantID); err != nil { return err }
		rows, err := tx.Query(ctx, `
			SELECT id, title, body, tool_type
			FROM recipes
			WHERE tool_type = $1
			ORDER BY embedding <=> $2
			LIMIT 10
		`, "kubernetes", pgvector.NewVector(query))
		if err != nil { return err }
		defer rows.Close()
		for rows.Next() {
			var r Recipe
			if err := rows.Scan(&r.ID, &r.Title, &r.Body, &r.ToolType); err != nil { return err }
			recipes = append(recipes, r)
		}
		return rows.Err()
	})
	return recipes, err
}
SET commands persist for the session; in transaction-pooling mode (what you want for scale) they don't. Run set_config(..., true) with the local flag *and* use transaction pooling. Get either wrong and tenant A's subsequent query can run with tenant B's app.tenant_id.
The sub-100ms retrieval budget
For our retrieval workload the end-to-end budget is 100ms. Breakdown we routinely hit:
- Embed the query: 18ms median via Vertex AI gemini-embedding-001 (same region as AlloyDB, same VPC).
- DB query: 9ms for the ScaNN lookup + filter on tool_type + RLS check. All in one round-trip.
- Network RTT within region: 1–2ms.
- App-side deserialization + ranking: 4ms.
- Total: ~35ms p50, ~80ms p99. Comfortably under 100ms (a per-stage timing sketch follows this list).
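If you want to attribute the milliseconds the same way, per-stage timers around the embed call and the DB query are enough. A minimal sketch; embedQuery is a hypothetical stand-in for the Vertex AI call, and SimilarRecipes is the function from the previous section:
// Sketch: per-stage timers for the retrieval path, so p50/p99 can be
// attributed to embedding, DB, or app-side work.
// imports: context, log, time
func TimedRetrieve(ctx context.Context, tenantID, query string) ([]Recipe, error) {
	t0 := time.Now()
	vec, err := embedQuery(ctx, query) // hypothetical; returns []float32, 768 dims
	if err != nil {
		return nil, err
	}
	embedMs := time.Since(t0).Milliseconds()

	t1 := time.Now()
	recipes, err := SimilarRecipes(ctx, tenantID, vec)
	dbMs := time.Since(t1).Milliseconds()

	log.Printf("retrieval embed=%dms db=%dms total=%dms",
		embedMs, dbMs, time.Since(t0).Milliseconds())
	return recipes, err
}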
Two knobs that dominated our latency early on:
- Hybrid filter query. WHERE tool_type = $1 ORDER BY embedding <=> $2 LIMIT 10 is much faster than filtering after the vector search if your filter is selective. ScaNN can use the tool_type B-tree index to narrow candidates before the vector distance computation.
- Connection pooling mode. Switching PgBouncer from session → transaction pooling (with the local-scoped set_config) cut p99 from 180ms to 80ms by eliminating handshake overhead. An isolation check for this combination is sketched after this list.
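Because getting either the pooling mode or the set_config scope wrong leaks rows silently, an automated isolation check is cheap insurance. A minimal sketch: the pool should point at the same PgBouncer endpoint the application uses, and pool, tenantA, and tenantB are assumed test fixtures (two existing tenants, each owning at least one recipe):
// Sketch: assert that RLS + transaction-scoped set_config + the pooler never
// expose another tenant's rows.
// imports: context, testing, github.com/jackc/pgx/v5
func TestTenantIsolation(t *testing.T) {
	ctx := context.Background()
	countForeign := func(tenantID string) int {
		var n int
		err := pgx.BeginTxFunc(ctx, pool, pgx.TxOptions{}, func(tx pgx.Tx) error {
			if _, err := tx.Exec(ctx,
				"SELECT set_config('app.tenant_id', $1, true)", tenantID); err != nil {
				return err
			}
			// With RLS in force this must always be zero: a tenant should never
			// see rows whose tenant_id differs from its own setting.
			return tx.QueryRow(ctx,
				"SELECT count(*) FROM recipes WHERE tenant_id <> $1::uuid", tenantID,
			).Scan(&n)
		})
		if err != nil {
			t.Fatal(err)
		}
		return n
	}
	if leaked := countForeign(tenantA); leaked != 0 {
		t.Fatalf("tenant A can see %d foreign rows", leaked)
	}
	if leaked := countForeign(tenantB); leaked != 0 {
		t.Fatalf("tenant B can see %d foreign rows", leaked)
	}
}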
What we skipped (and why)
- Re-ranking cross-encoder. Standard advice is: retrieve 100 with the vector index, re-rank with a cross-encoder, return top 10. We tried it. Latency jumped to 340ms and quality went up ~3% on our eval set. Not worth it — we retrieve 10 directly.
- Hybrid search (BM25 + vector). The incremental quality gain on our technical corpus (remediation recipes, mostly code + shell snippets) was negligible. Vector alone was enough.
- Separate tenant tables. Early architecture had recipes_acme, recipes_beta, etc. Schema migrations became a nightmare at 40 tenants. RLS on a single table is simpler and AlloyDB handles the selectivity fine.
Takeaways
- One database for relational + vector beats syncing two systems. AlloyDB + pgvector is the pragmatic choice unless you're at Pinecone-scale.
- RLS keyed on tenant_id + connection-scoped set_config is the safest multi-tenant isolation pattern. Get the PgBouncer mode right or you leak tenants.
- ScaNN beats IVFFlat and HNSW on AlloyDB for our scale. Benchmark on your data before picking.
- Filter *in the vector query* when you can, not after. Selective pre-filters can cut latency by 3×.
- The index parameters shown above (lists = sqrt(row_count), m = 16) match what we'd ship and what the AlloyDB + pgvector docs recommend. Run EXPLAIN ANALYZE on your own corpus before trusting any of the milliseconds.