KognitaKognita.

Blog

Database Schema Chunking for AI Agents

16 min read

Most AI coding agents understand databases badly — not because they cannot write SQL. They can. The problem is that they usually see schemas as flat text:

What agents often get
users(id, email, created_at)
orders(id, user_id, total)
payments(id, order_id, status)

That is not enough. A database schema is not just a list of tables — it is an operational map of the system. If you want an agent to reason about queries, migrations, performance, and data access, schema chunking needs to preserve more than column names:

What flat DDL omits
relationships
cardinality
row counts
table size
indexes
query patterns
ownership
risk level
production sensitivity

Why schema chunking matters

When an agent writes SQL, it is not only choosing syntax. It is making assumptions about how much data exists, how tables relate, which joins are safe, which filters are selective, which indexes exist, and what queries could hurt production. If the schema chunk omits that context, you get technically correct SQL that is operationally terrible.

Example: same schema, very different reality

Imagine the agent sees only DDL:

CREATE TABLE customers (
  id BIGINT PRIMARY KEY,
  email TEXT,
  created_at TIMESTAMP
);

CREATE TABLE payments (
  id BIGINT PRIMARY KEY,
  customer_id BIGINT,
  amount DECIMAL,
  status TEXT,
  created_at TIMESTAMP
);

The agent writes:

Looks fine in a vacuum
SELECT *
FROM customers c
JOIN payments p ON p.customer_id = c.id
WHERE p.status = 'FAILED';

Now add operational metadata:

Scale + index reality
customers: 12,000,000 rows, 9.4 GB
payments: 240,000,000 rows, 180 GB
payments.status: low cardinality
payments.customer_id: indexed
payments.created_at: indexed
payments.status: not indexed

Same tables — different risk. The agent should probably narrow columns, drive from the large indexed fact table, bound time, and cap rows:

Grounded shape
SELECT c.id, c.email, p.id, p.amount, p.created_at
FROM payments p
JOIN customers c ON c.id = p.customer_id
WHERE p.status = 'FAILED'
  AND p.created_at >= NOW() - INTERVAL '7 days'
LIMIT 500;

Same schema text, different operational understanding.

The hidden problem with schema-only context

A schema tells the agent what exists — not how dangerous it is to query. That missing layer drives bad behavior:

full table scans on massive tables
joins across huge unfiltered datasets
selecting unnecessary columns
ignoring indexes
generating expensive aggregations
running exploratory queries against production
misunderstanding tenant boundaries
missing soft-delete filters
ignoring partition keys

This is why schema chunks need operational metadata, not DDL alone.

What a good schema chunk should include

A useful chunk is not just “table name + columns.” It reads more like a compact runbook for that table:

Example: payments (illustrative)
Table: payments
Purpose: Stores payment attempts for customer orders.

Rows: ~240M
Size: ~180 GB
Growth: ~3M rows/day
Risk: high

Important columns:
- id: primary key
- customer_id: FK to customers.id, indexed
- order_id: FK to orders.id, indexed
- status: enum, low cardinality, not indexed alone
- created_at: indexed, used for time filtering
- provider: stripe/paystack/flutterwave
- external_reference: indexed, unique per provider

Common filters:
- created_at >= ?
- customer_id = ?
- order_id = ?
- external_reference = ?

Danger:
- avoid scanning by status alone
- avoid SELECT *
- always include time range for analytics queries

Now the agent can reason about selectivity, join order, and footguns — not just types.

Row counts change agent behavior

Row counts are one of the strongest signals for safe defaults. Compare three tables that look “similar” in DDL:

countries: 250 rows
payments: 240M rows
events: 4B rows

SELECT * FROM countries may be harmless; SELECT * FROM events at billions of rows is a different class of mistake. Without counts, the agent cannot separate harmless from catastrophic.

Table size matters too

Row count alone is not enough. Narrow rows at high volume can still be cheaper to scan than modest row counts with huge payloads:

Cost is not just COUNT(*)
audit_logs: 80M rows, 60 GB
documents: 400k rows, 900 GB

Good chunks carry row count, table size, index size, average row width, and growth so cost estimates track reality.

Index metadata is critical

The agent should know whether WHERE customer_id = ? is likely indexed — and that WHERE LOWER(email) = LOWER(?) may be unsafe unless a functional index exists.

Indexes + explicit gaps
Indexes:
- payments_pkey(id)
- idx_payments_customer_id(customer_id)
- idx_payments_created_at(created_at)
- idx_payments_external_reference(provider, external_reference)

Missing / dangerous:
- status is not indexed alone
- metadata JSONB queries are expensive unless using metadata_gin_idx

Relationship chunking

Tables should not be chunked only as isolated islands. Add relationship-level chunks so join paths and cardinality are first-class retrieval targets — not inferred from foreign keys the model might miss.

Example: billing subgraph
Customer Billing Graph

customers.id
  → orders.customer_id
  → payments.order_id
  → payment_events.payment_id
  → refunds.payment_id

Join path: customers → orders → payments
Cardinality: customer has many orders; order has many payments; payment has many payment_events
Preferred: payments.order_id = orders.id AND orders.customer_id = customers.id
Avoid: payments.customer_id if stale or denormalized

That helps answer queries like “failed payments for customers in Lagos” without the model guessing a join graph from filenames or vibes.

Query pattern chunks

Another high-leverage idea: chunk known safe patterns as their own retrieval units — not just tables.

Pattern: Recent failed payments by customer

Use: payments.customer_id, payments.status, payments.created_at

Safe query shape:
- filter by customer_id
- filter by created_at
- limit results

Avoid: filtering by status only

Retrieval then returns operational wisdom alongside schema — the difference between “syntax that runs” and “SQL that fits how this system is queried in production.”

Migration-aware schema chunks

Agents need to know what is changing, not only what exists today. Deprecated columns linger in DDL and tempt autocomplete:

Column: customers.legacy_credit_score
Status: deprecated
Replacement: customer_risk_scores.score
Migration: in progress
Do not use in new queries.

Chunks should carry deprecated fields, replacements, migration status, backfill state, nullable transitions, and compatibility constraints — otherwise agents keep emitting “valid” SQL against columns you are actively retiring.

Should an agent access production directly?

Usually: not directly, not with write access, and not without guardrails. Raw production access can cause serious damage:

Why “just connect prod” fails
expensive queries
accidental writes
data leakage
locking tables
exposing PII
bypassing app-level authorization
misunderstanding tenant boundaries
generating destructive migrations
reading sensitive customer data unnecessarily

The answer is rarely a binary yes/no — it is: give grounded database context, not unrestricted database power.

Better: read-only grounding

Prefer a controlled layer where the agent can retrieve schema, row counts, sizes, index metadata, foreign keys, query plans, sampled anonymized values, approved patterns, dictionary entries, and migration notes — without freely running arbitrary production SQL.

Safe production grounding (conceptual)
Production DB
  ↓
metadata collector
  ↓
schema/index/statistics snapshot
  ↓
redacted data dictionary
  ↓
agent retrieval layer

What metadata to collect

Per table, a grounding catalog might include:

table_name, row_estimate, total_size, index_size, growth_rate
primary_key, foreign_keys, indexes, partitioning
sensitive_columns, common_filters, dangerous_query_notes
owner_service, last_analyzed_at

Per column:

column_name, type, nullable, cardinality estimate
indexed?, PII?, deprecated?, description
example values (if safe/redacted)

That makes agents far more useful without handing over credentials that imply trust the model cannot enforce.

Example: grounded vs DDL-only

Instead of only:

CREATE TABLE customer_events (
  id BIGINT,
  customer_id BIGINT,
  event_type TEXT,
  payload JSONB,
  created_at TIMESTAMP
);

Give the agent a chunk that encodes scale, partitions, indexes, and policy:

Operational chunk
Table: customer_events
Purpose: Append-only customer activity stream.

Rows: ~1.8B
Total size: ~1.2 TB
Growth: ~12M rows/day
Partitioned by: created_at (monthly)
Risk: very high

Indexes:
- customer_events_pkey(id)
- idx_customer_events_customer_created(customer_id, created_at)
- idx_customer_events_type_created(event_type, created_at)
- gin_customer_events_payload(payload)

Important:
- Always filter by created_at.
- Prefer customer_id + created_at for customer lookups.
- Avoid payload JSONB scans without specific indexed keys.
- Do not SELECT payload unless needed.
- Contains PII in payload. Redact before showing results.

Example: query improvement

User: find recent login failures for this customer.

Bad (missing filters + PII surface)
SELECT *
FROM customer_events
WHERE event_type = 'LOGIN_FAILED';
Better (indexed path + time bound + limit)
SELECT id, customer_id, event_type, created_at
FROM customer_events
WHERE customer_id = :customer_id
  AND event_type = 'LOGIN_FAILED'
  AND created_at >= NOW() - INTERVAL '30 days'
ORDER BY created_at DESC
LIMIT 100;

Better because the chunk told the agent the table is huge, created_at is required, customer_id + created_at is indexed, payload may contain PII, and SELECT * is unsafe.

Should the agent run EXPLAIN?

It can be useful feedback — with guardrails. EXPLAIN ANALYZE actually executes the query; treat it as higher risk than EXPLAIN alone.

Allowed: EXPLAIN SELECT ...
Riskier: EXPLAIN ANALYZE (executes the query)

Safe agent environment might allow:
EXPLAIN only · read-only · statement timeout · row limit
query cost threshold · blocked tables · PII masking · no production writes

A practical permission model

Instead of one database role for “the agent,” tier access so capabilities match blast radius:

Tiers (abbreviated)
Tier 0 — Schema only: DDL, comments, relationships ( codegen, drafting, migrations )

Tier 1 — Metadata grounding: row counts, sizes, index stats, partitions ( default sweet spot )

Tier 2 — Read-only replica: masked data, limits, timeouts ( analytics, debugging, support )

Tier 3 — Production read: narrow tools only — audit logs, RBAC, redaction, approvals, strict timeouts

Tier 4 — Production write: almost never — human approval, diff preview, transactions, rollback plan

Agents should not casually mutate production. Tier 4, if it exists at all, belongs behind human review — not behind a chat prompt.

Schema chunking should encode risk

Every table-oriented chunk can carry a risk profile so the model calibrates caution — the context normal dumps omit:

Risk level: high
Reason: PII · very large · production-critical · frequent writes · used in billing

→ Agent behavior should differ for countries vs payments vs customer_events

Database grounding is not just for query writing

Rich schema chunks also help API design, migration planning, debugging, performance review, analytics, and data model questions — for example, can I drop this column? The agent should see row usage, downstream references, app code, ETL, dashboards, and migration state — not only ALTER TABLE text. That requires connecting the database to the repository graph.

The best version connects schema to code

The unlock is not isolated metadata — it is schema + code + query usage + runtime behavior. Example linkage:

Column → callers (illustrative)
payments.status
  used by:
    PaymentService.markFailed()
    FailedPaymentRecoveryWorkflow
    RevenueDashboard
    CustomerSupportSearch

Now impact analysis stops being a hallucination about “who might read this field” and becomes a retrieval problem over real references. That is where database chunking graduates into repository cognition.

Final takeaway

Bad schema chunks produce agents that write dangerous SQL. Good chunks produce agents that understand operational constraints. The agent should not only know what tables exist — it should know how big they are, how they relate, what indexes exist, what is sensitive, which query shapes are approved, and which patterns are dangerous.

For production systems, the safest default is: ground the agent with metadata, not unrestricted access. Useful database agents do not need blind power — they need grounded context.