Blog
Database Schema Chunking for AI Agents
16 min read
Most AI coding agents understand databases badly — not because they cannot write SQL. They can. The problem is that they usually see schemas as flat text:
users(id, email, created_at)
orders(id, user_id, total)
payments(id, order_id, status)That is not enough. A database schema is not just a list of tables — it is an operational map of the system. If you want an agent to reason about queries, migrations, performance, and data access, schema chunking needs to preserve more than column names:
relationships
cardinality
row counts
table size
indexes
query patterns
ownership
risk level
production sensitivityWhy schema chunking matters
When an agent writes SQL, it is not only choosing syntax. It is making assumptions about how much data exists, how tables relate, which joins are safe, which filters are selective, which indexes exist, and what queries could hurt production. If the schema chunk omits that context, you get technically correct SQL that is operationally terrible.
Example: same schema, very different reality
Imagine the agent sees only DDL:
CREATE TABLE customers (
id BIGINT PRIMARY KEY,
email TEXT,
created_at TIMESTAMP
);
CREATE TABLE payments (
id BIGINT PRIMARY KEY,
customer_id BIGINT,
amount DECIMAL,
status TEXT,
created_at TIMESTAMP
);The agent writes:
SELECT *
FROM customers c
JOIN payments p ON p.customer_id = c.id
WHERE p.status = 'FAILED';Now add operational metadata:
customers: 12,000,000 rows, 9.4 GB
payments: 240,000,000 rows, 180 GB
payments.status: low cardinality
payments.customer_id: indexed
payments.created_at: indexed
payments.status: not indexedSame tables — different risk. The agent should probably narrow columns, drive from the large indexed fact table, bound time, and cap rows:
SELECT c.id, c.email, p.id, p.amount, p.created_at
FROM payments p
JOIN customers c ON c.id = p.customer_id
WHERE p.status = 'FAILED'
AND p.created_at >= NOW() - INTERVAL '7 days'
LIMIT 500;Same schema text, different operational understanding.
The hidden problem with schema-only context
A schema tells the agent what exists — not how dangerous it is to query. That missing layer drives bad behavior:
full table scans on massive tables
joins across huge unfiltered datasets
selecting unnecessary columns
ignoring indexes
generating expensive aggregations
running exploratory queries against production
misunderstanding tenant boundaries
missing soft-delete filters
ignoring partition keysThis is why schema chunks need operational metadata, not DDL alone.
What a good schema chunk should include
A useful chunk is not just “table name + columns.” It reads more like a compact runbook for that table:
Table: payments
Purpose: Stores payment attempts for customer orders.
Rows: ~240M
Size: ~180 GB
Growth: ~3M rows/day
Risk: high
Important columns:
- id: primary key
- customer_id: FK to customers.id, indexed
- order_id: FK to orders.id, indexed
- status: enum, low cardinality, not indexed alone
- created_at: indexed, used for time filtering
- provider: stripe/paystack/flutterwave
- external_reference: indexed, unique per provider
Common filters:
- created_at >= ?
- customer_id = ?
- order_id = ?
- external_reference = ?
Danger:
- avoid scanning by status alone
- avoid SELECT *
- always include time range for analytics queriesNow the agent can reason about selectivity, join order, and footguns — not just types.
Row counts change agent behavior
Row counts are one of the strongest signals for safe defaults. Compare three tables that look “similar” in DDL:
countries: 250 rows
payments: 240M rows
events: 4B rowsSELECT * FROM countries may be harmless; SELECT * FROM events at billions of rows is a different class of mistake. Without counts, the agent cannot separate harmless from catastrophic.
Table size matters too
Row count alone is not enough. Narrow rows at high volume can still be cheaper to scan than modest row counts with huge payloads:
audit_logs: 80M rows, 60 GB
documents: 400k rows, 900 GBGood chunks carry row count, table size, index size, average row width, and growth so cost estimates track reality.
Index metadata is critical
The agent should know whether WHERE customer_id = ? is likely indexed — and that WHERE LOWER(email) = LOWER(?) may be unsafe unless a functional index exists.
Indexes:
- payments_pkey(id)
- idx_payments_customer_id(customer_id)
- idx_payments_created_at(created_at)
- idx_payments_external_reference(provider, external_reference)
Missing / dangerous:
- status is not indexed alone
- metadata JSONB queries are expensive unless using metadata_gin_idxRelationship chunking
Tables should not be chunked only as isolated islands. Add relationship-level chunks so join paths and cardinality are first-class retrieval targets — not inferred from foreign keys the model might miss.
Customer Billing Graph
customers.id
→ orders.customer_id
→ payments.order_id
→ payment_events.payment_id
→ refunds.payment_id
Join path: customers → orders → payments
Cardinality: customer has many orders; order has many payments; payment has many payment_events
Preferred: payments.order_id = orders.id AND orders.customer_id = customers.id
Avoid: payments.customer_id if stale or denormalizedThat helps answer queries like “failed payments for customers in Lagos” without the model guessing a join graph from filenames or vibes.
Query pattern chunks
Another high-leverage idea: chunk known safe patterns as their own retrieval units — not just tables.
Pattern: Recent failed payments by customer
Use: payments.customer_id, payments.status, payments.created_at
Safe query shape:
- filter by customer_id
- filter by created_at
- limit results
Avoid: filtering by status onlyRetrieval then returns operational wisdom alongside schema — the difference between “syntax that runs” and “SQL that fits how this system is queried in production.”
Migration-aware schema chunks
Agents need to know what is changing, not only what exists today. Deprecated columns linger in DDL and tempt autocomplete:
Column: customers.legacy_credit_score
Status: deprecated
Replacement: customer_risk_scores.score
Migration: in progress
Do not use in new queries.Chunks should carry deprecated fields, replacements, migration status, backfill state, nullable transitions, and compatibility constraints — otherwise agents keep emitting “valid” SQL against columns you are actively retiring.
Should an agent access production directly?
Usually: not directly, not with write access, and not without guardrails. Raw production access can cause serious damage:
expensive queries
accidental writes
data leakage
locking tables
exposing PII
bypassing app-level authorization
misunderstanding tenant boundaries
generating destructive migrations
reading sensitive customer data unnecessarilyThe answer is rarely a binary yes/no — it is: give grounded database context, not unrestricted database power.
Better: read-only grounding
Prefer a controlled layer where the agent can retrieve schema, row counts, sizes, index metadata, foreign keys, query plans, sampled anonymized values, approved patterns, dictionary entries, and migration notes — without freely running arbitrary production SQL.
Production DB
↓
metadata collector
↓
schema/index/statistics snapshot
↓
redacted data dictionary
↓
agent retrieval layerWhat metadata to collect
Per table, a grounding catalog might include:
table_name, row_estimate, total_size, index_size, growth_rate
primary_key, foreign_keys, indexes, partitioning
sensitive_columns, common_filters, dangerous_query_notes
owner_service, last_analyzed_atPer column:
column_name, type, nullable, cardinality estimate
indexed?, PII?, deprecated?, description
example values (if safe/redacted)That makes agents far more useful without handing over credentials that imply trust the model cannot enforce.
Example: grounded vs DDL-only
Instead of only:
CREATE TABLE customer_events (
id BIGINT,
customer_id BIGINT,
event_type TEXT,
payload JSONB,
created_at TIMESTAMP
);Give the agent a chunk that encodes scale, partitions, indexes, and policy:
Table: customer_events
Purpose: Append-only customer activity stream.
Rows: ~1.8B
Total size: ~1.2 TB
Growth: ~12M rows/day
Partitioned by: created_at (monthly)
Risk: very high
Indexes:
- customer_events_pkey(id)
- idx_customer_events_customer_created(customer_id, created_at)
- idx_customer_events_type_created(event_type, created_at)
- gin_customer_events_payload(payload)
Important:
- Always filter by created_at.
- Prefer customer_id + created_at for customer lookups.
- Avoid payload JSONB scans without specific indexed keys.
- Do not SELECT payload unless needed.
- Contains PII in payload. Redact before showing results.Example: query improvement
User: find recent login failures for this customer.
SELECT *
FROM customer_events
WHERE event_type = 'LOGIN_FAILED';SELECT id, customer_id, event_type, created_at
FROM customer_events
WHERE customer_id = :customer_id
AND event_type = 'LOGIN_FAILED'
AND created_at >= NOW() - INTERVAL '30 days'
ORDER BY created_at DESC
LIMIT 100;Better because the chunk told the agent the table is huge, created_at is required, customer_id + created_at is indexed, payload may contain PII, and SELECT * is unsafe.
Should the agent run EXPLAIN?
It can be useful feedback — with guardrails. EXPLAIN ANALYZE actually executes the query; treat it as higher risk than EXPLAIN alone.
Allowed: EXPLAIN SELECT ...
Riskier: EXPLAIN ANALYZE (executes the query)
Safe agent environment might allow:
EXPLAIN only · read-only · statement timeout · row limit
query cost threshold · blocked tables · PII masking · no production writesA practical permission model
Instead of one database role for “the agent,” tier access so capabilities match blast radius:
Tier 0 — Schema only: DDL, comments, relationships ( codegen, drafting, migrations )
Tier 1 — Metadata grounding: row counts, sizes, index stats, partitions ( default sweet spot )
Tier 2 — Read-only replica: masked data, limits, timeouts ( analytics, debugging, support )
Tier 3 — Production read: narrow tools only — audit logs, RBAC, redaction, approvals, strict timeouts
Tier 4 — Production write: almost never — human approval, diff preview, transactions, rollback planAgents should not casually mutate production. Tier 4, if it exists at all, belongs behind human review — not behind a chat prompt.
Schema chunking should encode risk
Every table-oriented chunk can carry a risk profile so the model calibrates caution — the context normal dumps omit:
Risk level: high
Reason: PII · very large · production-critical · frequent writes · used in billing
→ Agent behavior should differ for countries vs payments vs customer_eventsDatabase grounding is not just for query writing
Rich schema chunks also help API design, migration planning, debugging, performance review, analytics, and data model questions — for example, can I drop this column? The agent should see row usage, downstream references, app code, ETL, dashboards, and migration state — not only ALTER TABLE text. That requires connecting the database to the repository graph.
The best version connects schema to code
The unlock is not isolated metadata — it is schema + code + query usage + runtime behavior. Example linkage:
payments.status
used by:
PaymentService.markFailed()
FailedPaymentRecoveryWorkflow
RevenueDashboard
CustomerSupportSearchNow impact analysis stops being a hallucination about “who might read this field” and becomes a retrieval problem over real references. That is where database chunking graduates into repository cognition.
Final takeaway
Bad schema chunks produce agents that write dangerous SQL. Good chunks produce agents that understand operational constraints. The agent should not only know what tables exist — it should know how big they are, how they relate, what indexes exist, what is sensitive, which query shapes are approved, and which patterns are dangerous.
For production systems, the safest default is: ground the agent with metadata, not unrestricted access. Useful database agents do not need blind power — they need grounded context.