Data Engineers Can't Trust the Schema Until They Understand the Application
The data engineer opens the schema. orders.payment_method: VARCHAR, nullable. They write a filter to exclude null rows — null means no payment method, those are not real orders. The pipeline runs. The revenue dashboard looks right. Three weeks later, a stakeholder flags that a large category of internal orders — created through the admin panel and never requiring payment — is missing from the report entirely. The data engineer did not know those orders existed. The schema did not tell them. The null was intentional, not an error.
This is not an unusual incident. It is the standard outcome when data engineers work from schema documentation without access to application context. The schema tells you what columns exist and what types they hold. It does not tell you what the application does with them.
The database schema is not the application data model
A database schema is a storage contract. It defines the structure of the data at rest: column names, types, constraints, indexes. It is an accurate representation of what the database will accept. It is not an accurate representation of how the application uses the database.
The gap between these two things is where data pipelines break. The application data model includes everything the schema leaves implicit: which code paths produce which values, what conditions determine whether a nullable column is null, which enum values are used for distinct business states versus which ones encode the same state reached through different paths, which numeric fields change their calculation logic based on other columns or account-level configuration.
None of this is in the schema. All of it is in the application code. And data engineers, by the nature of their role, spend most of their time in the data layer, not in the application layer. They read the schema. They look at sample rows. They ask an application engineer for help when something does not make sense. They build a mental model of the data that is a combination of direct schema knowledge and second-hand application knowledge — and the second-hand knowledge is almost always incomplete.
The consequences are not always immediately visible. A pipeline that produces slightly wrong results is much harder to catch than a pipeline that throws an error. When the numbers are wrong by 3% because a specific order population was excluded by a filter that seemed correct, the error compounds quietly for weeks before anyone notices — and the notice usually comes from a stakeholder, not from a monitoring alert.
What data engineers actually need to know about the application
The schema describes the surface. The application determines the semantics. For a data engineer to build correct pipelines, they need to understand both — and the semantics are the part that requires application knowledge.
orders table — what the schema says vs. what the application does:
Column: payment_method (VARCHAR, nullable)
Schema says: stores the customer's payment method, can be null
Application does:
-> null when order is created via admin panel (internal orders bypass payment)
-> null when order originates from a legacy import script (pre-2022 data)
-> null during checkout abandonment (order created before payment capture)
-> "free" (string literal) when order total is $0 — not null, not a recognized enum value
-> populated with Stripe payment method ID in all other cases
Pipeline risk: filtering on payment_method IS NOT NULL silently excludes three distinct
order populations with different business semantics
Column: status (ENUM: 'pending', 'processing', 'fulfilled', 'cancelled')
Schema says: four valid states
Application does:
-> 'pending' set on creation; also reset to 'pending' when a fulfillment retry
is triggered: same value, completely different business state
-> 'processing' used for both "payment captured, awaiting fulfillment" AND
"fulfillment initiated but not yet confirmed"; the distinction is in
a separate fulfillment_started_at column that the schema does not tie to status
-> 'cancelled' set by three different code paths: customer cancellation,
admin cancellation, and automatic expiry — with different downstream effects
that matter for revenue reporting
Pipeline risk: counting by status produces numbers that look correct but are not
Column: total_amount (DECIMAL)
Schema says: numeric, not null
Application does:
-> includes tax in some regions based on a tax_inclusive flag in the accounts table
-> excludes shipping for orders flagged as in-store pickup
-> for subscription orders, reflects the discounted price, not list price
Pipeline risk: summing total_amount for "revenue" mixes three different definitions
of revenue depending on order origin
The orders example is not extreme. Every non-trivial table in a production application has equivalent complexity: columns whose semantics depend on the code path that wrote them, enum values that represent distinct business states despite having the same string value, nullable columns where null means different things depending on the row's origin. The schema documents none of this. The application encodes all of it.
A data engineer who knows the schema but not the application will write filters, joins, and aggregations that are structurally valid but semantically wrong. The query executes. The results are incorrect. There is no error to catch the problem.
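As a rough illustration of what handling the status ambiguity might look like on the pipeline side, the sketch below derives a finer-grained label from status plus its neighbouring columns. The column names fulfillment_started_at and fulfillment_retry_count come from the examples in this post; the function name and the derived labels are hypothetical, and a real mapping would have to be confirmed against the application code.

```python
from typing import Optional


def business_state(
    status: str,
    fulfillment_started_at: Optional[str],
    fulfillment_retry_count: int,
) -> str:
    """Map a raw orders.status value to a less ambiguous, pipeline-side label.

    The label names are hypothetical; only the column names come from the
    orders example.
    """
    if status == "pending":
        # Same enum value, two business states: new order vs. retry after failure.
        return "pending_retry" if fulfillment_retry_count > 0 else "pending_new"
    if status == "processing":
        # The enum has one value; the timestamp carries the distinction.
        if fulfillment_started_at is None:
            return "awaiting_fulfillment"
        return "fulfillment_in_progress"
    # 'fulfilled' and 'cancelled' pass through unchanged in this sketch.
    return status


# (status, fulfillment_started_at, fulfillment_retry_count)
rows = [
    ("pending", None, 0),
    ("pending", None, 2),
    ("processing", None, 0),
    ("processing", "2024-05-01T10:00:00Z", 0),
    ("fulfilled", "2024-05-01T10:00:00Z", 0),
]
states = [business_state(*row) for row in rows]
print(states)
```

Counting by the derived label instead of by raw status keeps the two kinds of 'pending' and the two kinds of 'processing' apart. The three cancellation paths would need a similar treatment, but the post does not say which columns distinguish them, so they are left untouched here.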
Null semantics are the highest-risk gap
Nullable columns are the most frequent source of silent pipeline errors because null is overloaded. In most application databases, null does not reliably mean "this value does not exist." It means whatever the engineer who wrote the code decided it should mean at the time they wrote it — which may be "not yet set," "not applicable," "explicitly cleared," or "an upstream system did not send this field." Different rows in the same column can have null for different reasons, and those reasons matter for aggregation.
A data engineer who filters out nulls is making a business logic decision without knowing it. They are saying: "rows where this column is null are not interesting for this analysis." Whether that is correct depends on why those nulls exist — which is application knowledge, not schema knowledge.
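To make that hidden decision visible, here is a minimal sketch using an in-memory SQLite table. It compares the naive IS NOT NULL filter with a query that names each reason for null instead of discarding it. The created_via column and its values are assumptions made for illustration; in a real schema the write path may have to be inferred from other columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        payment_method TEXT,   -- nullable, as in the example schema
        created_via TEXT       -- hypothetical: 'checkout', 'admin', 'legacy_import'
    )
    """
)
conn.executemany(
    "INSERT INTO orders (payment_method, created_via) VALUES (?, ?)",
    [
        ("pm_123", "checkout"),     # normal paid order
        (None, "admin"),            # internal order, null by design
        (None, "legacy_import"),    # historical row, null by convention
        (None, "checkout"),         # abandoned checkout, null is transitional
        ("free", "checkout"),       # $0 order, string literal instead of null
    ],
)

# Naive filter: treats every null as "not a real order".
naive = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE payment_method IS NOT NULL"
).fetchone()[0]

# Explicit version: every null is assigned to a named population.
classified = dict(
    conn.execute(
        """
        SELECT
            CASE
                WHEN payment_method IS NOT NULL THEN 'has_payment_method'
                WHEN created_via = 'admin' THEN 'null_internal_order'
                WHEN created_via = 'legacy_import' THEN 'null_legacy_import'
                ELSE 'null_payment_not_captured'
            END AS population,
            COUNT(*)
        FROM orders
        GROUP BY population
        """
    ).fetchall()
)

print(naive)
print(classified)
```

The second query drops nothing. Whether each population belongs in a given report becomes an explicit choice that can be reviewed, rather than a side effect of the filter. Note that the 'free' literal survives the naive filter even though it is not a real payment method.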
Three common failures when data pipelines lack application context
The failure modes are consistent across teams and organizations. They vary in how long they take to surface, but the root cause is the same: the data engineer made an assumption about application behavior that the schema did not contradict.
Incorrect population filters. A filter that was intended to isolate a specific business population — paying customers, active subscriptions, completed transactions — silently includes or excludes rows it should not, because the filter logic is based on schema columns whose values are determined by application logic the data engineer was not aware of. Revenue numbers include test orders. Churn reports exclude a category of cancellations. Retention metrics count inactive accounts. None of these produce errors. All of them produce wrong numbers.
Broken pipelines after business logic changes. The application engineer updates the code that determines when a column is populated or how an enum is used. The database schema does not change. The data pipeline's filters and transformations were built against the old behavior, not the new one. The pipeline does not throw an error — it just silently produces results that reflect the old business logic against new data. This is particularly damaging for historical comparisons: a year-over-year metric where the second year is calculated under new business rules and the first year under old ones, without any indication that the two numbers are not comparable.
Incorrect aggregation logic for calculated fields. A numeric column that is a calculated value — a total, a rate, a balance — uses different calculation logic in different code paths. The data engineer aggregates it assuming consistent semantics. The aggregate is arithmetically correct and semantically wrong: it is adding numbers that are not the same kind of number.
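The aggregation problem can be sketched with the total_amount example from earlier. The code below reduces each row to a single definition (merchandise value excluding tax and shipping) before summing. The tax_amount, shipping_amount, and in_store_pickup fields are hypothetical, and the rule that non-pickup totals include shipping is an assumption; the real rules would have to come from the application code.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class OrderRow:
    total_amount: Decimal
    tax_amount: Decimal        # hypothetical column
    shipping_amount: Decimal   # hypothetical column
    tax_inclusive: bool        # flag from the accounts table, per the example
    in_store_pickup: bool      # hypothetical flag


def net_merchandise_amount(row: OrderRow) -> Decimal:
    """Reduce total_amount to one definition: excluding tax and shipping."""
    amount = row.total_amount
    if row.tax_inclusive:
        # Tax is only inside the total for tax-inclusive accounts.
        amount -= row.tax_amount
    if not row.in_store_pickup:
        # Assumed: pickup totals already exclude shipping, others include it.
        amount -= row.shipping_amount
    return amount


rows = [
    # Tax-inclusive account, in-store pickup: 100 merchandise + 20 tax.
    OrderRow(Decimal("120.00"), Decimal("20.00"), Decimal("0.00"), True, True),
    # Tax-exclusive account, shipped: 100 merchandise + 10 shipping.
    OrderRow(Decimal("110.00"), Decimal("20.00"), Decimal("10.00"), False, False),
]

naive_sum = sum((r.total_amount for r in rows), Decimal("0"))
normalized_sum = sum((net_merchandise_amount(r) for r in rows), Decimal("0"))
print(naive_sum, normalized_sum)
```

Both orders carry the same merchandise value, yet their raw totals differ because each row uses a different definition. The naive sum is arithmetically correct and still answers no well-defined question. Subscription discounts are left out of this sketch because the post does not say how the list price is stored.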
Why asking engineers for context does not scale
The conventional answer to the application context gap is: data engineers should ask application engineers when they have questions about a table or column. This is correct advice and it does not scale.
Application engineers are in sprint. They are building features, responding to incidents, and managing their own backlog. Being asked to explain the full behavioral semantics of a table they wrote two years ago is not a quick task. It requires them to re-read their own code, remember decisions that were made for reasons that may no longer be obvious, and translate implementation details into data engineering language. The best engineers do this willingly. It still takes their time away from their own work.
The data team's needs are also not uniformly predictable. A data engineer building a new pipeline will have ten questions about a table spread across a two-week period as they encounter edge cases in the data. Serializing those questions into synchronous conversations with application engineers produces delays on both sides: the data engineer waits for answers before proceeding, the application engineer handles a trickle of context requests across the sprint.
For teams with a poor ratio of application engineers to data consumers — where one backend team supports a data team, a product team, and external analytics requests simultaneously — this bottleneck becomes the limiting factor on data team output. The data team has capacity. They are waiting for context.
The deeper problem is knowledge loss. Application context lives in engineers' heads, in PR descriptions from two years ago, in Slack conversations that are unsearchable at the moment they are relevant. When the engineer who wrote a table leaves the company, the behavioral semantics of that table go with them. The schema remains. The knowledge of what it means does not.
What application-aware data engineering looks like
Application-aware data engineering means building pipelines against an accurate model of how the application produces data, not just against the shape the database stores it in. The practical difference shows up in every filter, join, and aggregation decision.
Questions a data engineer needs answered that are not in the schema:
Null conditions:
-> Under what application conditions is this column null vs. populated?
-> Does null here mean "not applicable," "not yet set," or "an error occurred"?
-> Is null a permanent state or a transitional one?
Enum and string field semantics:
-> What code paths set each enum value?
-> Are there string literals used as de facto enum values that aren't in the ENUM type?
-> Can the same enum value represent different business states depending on other columns?
Timestamp fields:
-> Is this set on creation or on the most recent update?
-> Is it set in application code or by a database trigger?
-> Is it in UTC or local time? (Yes, this is still a real question in 2026)
Calculation and aggregation fields:
-> Does this numeric field use consistent units across all rows?
-> Are there known exceptions where the calculation logic differs?
-> Has the calculation definition changed over time? If so, when?
Write paths:
-> Which application services write to this table?
-> Are there background jobs, import scripts, or admin tools that bypass normal
validation and write directly?
-> Is there any data in this table that was migrated from another schema with
different conventions?
The schema answers none of these questions.
The application code contains all of them.
That list is not theoretical. Every question in it has a concrete answer in the application code. Every question in it is currently answered, on most teams, by one of three methods: asking an application engineer, guessing based on sample data, or writing the wrong pipeline and fixing it when something surfaces.
Application-aware data engineering does not require data engineers to become application engineers. It requires that the behavioral knowledge embedded in the application code be accessible to data engineers without reading the full codebase or blocking on application engineer availability. The knowledge already exists. The access point does not.
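One way to picture what accessible behavioral knowledge could look like is a small structured record per column, kept next to the pipeline code. The sketch below is an assumption about shape only: the class and field names are hypothetical, and the contents simply restate the payment_method example from earlier in this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NullReason:
    condition: str   # when the application leaves the column null
    meaning: str     # e.g. "not applicable", "not yet set"
    permanent: bool  # False if the value is expected to be filled in later


@dataclass(frozen=True)
class ColumnNotes:
    table: str
    column: str
    null_reasons: tuple[NullReason, ...] = ()
    undeclared_literals: tuple[str, ...] = ()  # de facto enum values


payment_method_notes = ColumnNotes(
    table="orders",
    column="payment_method",
    null_reasons=(
        NullReason("order created via admin panel", "not applicable", True),
        NullReason("order loaded by legacy import script", "not applicable", True),
        NullReason("checkout abandoned before payment capture", "not yet set", False),
    ),
    undeclared_literals=("free",),
)


def transitional_nulls(notes: ColumnNotes) -> list[str]:
    """Return the conditions under which a null may later be populated."""
    return [r.condition for r in notes.null_reasons if not r.permanent]


print(transitional_nulls(payment_method_notes))
```

A record like this answers the null-condition and string-literal questions from the list for one column. It goes stale as soon as the application changes, which is why where the answers come from matters more than the format they are stored in.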
Write path visibility changes pipeline design
One of the highest-value pieces of application context for data engineers is the complete set of code paths that write to a table. A table written by five different services — the primary application, an admin tool, a legacy importer, a background worker, and a subscription renewal job — is a fundamentally different modeling challenge than a table written by one service. The rows do not share a single consistent business logic. They share a schema.
Knowing that five write paths exist, and understanding what each one does differently, is the prerequisite for designing a pipeline that handles the data correctly across all five populations. This is not information the schema provides. It is information that requires tracing the full write path through the application code.
Write path visibility also protects against future breakage. When a data engineer knows that the LegacyOrderImporter is responsible for a specific class of null values in payment_method, they can write their pipeline filter with that knowledge explicit. When a future application engineer changes the importer behavior, the data engineer has the context to recognize that the change is relevant to their pipeline — rather than finding out three weeks later when the numbers shift.
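A minimal sketch of writing that knowledge into the pipeline explicitly: the check below counts null payment_method rows per writer and fails when a null arrives from a write path nobody has reviewed. The written_by field is hypothetical (in practice the writer might have to be inferred from other columns), and the writer names are borrowed from the examples in this post.

```python
from collections import Counter

# Write paths known to leave payment_method null. `written_by` is an assumed
# column; real tables may not record the writer directly.
EXPECTED_NULL_WRITERS = {"AdminController", "LegacyOrderImporter", "CheckoutService"}


class UnexpectedNullWriter(Exception):
    """Raised when a null arrives from a write path the pipeline has not reviewed."""


def check_null_writers(rows):
    """Count null payment_method rows per writer, failing on unreviewed writers."""
    counts = Counter(
        row["written_by"] for row in rows if row["payment_method"] is None
    )
    unexpected = set(counts) - EXPECTED_NULL_WRITERS
    if unexpected:
        raise UnexpectedNullWriter(f"unreviewed writers: {sorted(unexpected)}")
    return counts


rows = [
    {"written_by": "OrderService", "payment_method": "pm_123"},
    {"written_by": "AdminController", "payment_method": None},
    {"written_by": "LegacyOrderImporter", "payment_method": None},
    {"written_by": "LegacyOrderImporter", "payment_method": None},
]
print(check_null_writers(rows))

# A new write path starts producing nulls: the check fails instead of the
# numbers shifting silently.
drifted = rows + [{"written_by": "SubscriptionRenewalWorker", "payment_method": None}]
try:
    check_null_writers(drifted)
except UnexpectedNullWriter as exc:
    print(exc)
```

The design choice is to turn an implicit assumption ("only these writers produce nulls") into a check that breaks loudly, so a change in application behavior surfaces as a failed run rather than as a shifted number.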
How Kognita gives data teams application context
Kognita indexes application code semantically and makes it queryable without requiring data engineers to read the application repository themselves. The index understands the relationship between code behavior and data output: which code paths write which values, what conditions determine nullable column behavior, what the full write path to a given table looks like across all services.
How Kognita serves application context to data teams:
Query: "Under what conditions is orders.payment_method null?"
Answer: Three distinct conditions in application code:
1. Admin-created orders (OrderController#create_admin) — always null, by design
2. Legacy import job (LegacyOrderImporter) — null for orders before 2022-03-15
3. Checkout abandonment (CheckoutService#create_pending) — null until payment
captured; payment_captured_at is the reliable indicator of whether null
is intentional or transitional
Query: "What code paths set orders.status to 'pending'?"
Answer: Two distinct paths:
1. Order creation (OrderService#create) — initial state
2. Fulfillment retry (FulfillmentService#retry_failed) — reset after failure
Differentiator: retry path also sets fulfillment_retry_count > 0
and fulfillment_last_failed_at is not null
Query: "Which services write to the orders table?"
Answer: 5 write paths identified:
-> OrderService (primary application writes)
-> AdminController (internal orders, bypasses payment validation)
-> LegacyOrderImporter (historical data, pre-2022 conventions)
-> FulfillmentService (status updates only)
-> SubscriptionRenewalWorker (recurring orders, different total_amount logic)
Result: data engineer writes pipelines against the application model,
not assumptions derived from column names and types.
The answers above are not derived from the schema. They are derived from the application code: the actual OrderController, FulfillmentService, and LegacyOrderImporter as they are written today, not as they were documented at some point in the past.
Because the index is managed and updated continuously, the application context data engineers access reflects the current codebase. When a sprint introduces a new code path that writes to a table differently, the indexed knowledge updates. A data engineer querying after that sprint gets the current answer, not the answer from six months ago when someone last updated a database documentation page.
The Jira MCP integration adds the in-flight dimension: a data engineer can see not just what the application does today but what application work is currently in progress that will change data behavior. If a sprint includes a ticket that changes how the orders.status enum is set, that context surfaces alongside the current behavioral description. The data engineer knows before building the pipeline that the behavior is about to change, not after the pipeline is in production and the numbers shift.
Non-technical access to technical context
Kognita's plain-language interface means this context is not only available to data engineers who are comfortable reading application code. Analytics engineers asking about metric definitions, data leads auditing pipeline logic for business review, product managers verifying whether a specific customer behavior is captured in the data — all of them can query application context directly without routing every question through the application team.
This reduces the load on application engineers without reducing the access data teams have to application knowledge. The bottleneck was never that the knowledge did not exist — it was that the knowledge lived exclusively with people who had their own work to do.
Final take
Data pipelines built on schema knowledge alone are fragile by construction. The schema is necessary but insufficient. It describes the data at rest. The application determines what the data means: which nulls are intentional, which enum values encode distinct business states, which numeric fields use consistent calculation logic, which rows were written by which code path and with which assumptions.
The pattern of asking application engineers for context one question at a time is not a sustainable model for teams where data consumers outnumber application engineers or where the knowledge was built over years and lives in code that nobody is actively maintaining. The knowledge needs to be queryable directly, from the source, by the people who need it.
When data engineers have direct access to application behavioral context, the pipelines they build are more accurate, more resilient to business logic changes, and less dependent on application engineer availability. The schema tells you what exists. The application tells you what it means. Both are required for pipelines that produce numbers the business can trust.