Operations Managers Are Running Processes They Cannot See Into

The operations manager gets a Slack message at 10am: “The bulk export job hasn't run in two days.” They check the job status dashboard — it shows “queued.” They do not know if that is a bug, a configuration change, a capacity issue, or expected behavior after a deployment last Thursday. They file a ticket to engineering. Engineering resolves it in ten minutes — there was a config change that adjusted the job schedule. The ops manager spent three hours waiting for that answer.

That three-hour wait is not an anomaly. It is the default. Operations managers at software companies are accountable for processes that live in software: customer onboarding flows, billing cycles, integration pipelines, data export jobs, scheduled reconciliations. When those processes behave unexpectedly, ops has no way to see into the system to understand whether the cause is a configuration issue, a code problem, or a data problem. The answer always requires engineering.

The dependency exists not because operations managers are incapable, but because the systems they run were never designed to be visible to them. Behavioral understanding of a process (how it works, what it calls, what triggers it, and what configuration governs it) depends on information that lives in code and configuration files ops was never expected to read.

What operations managers own that lives in software

The scope of what operations teams are accountable for has expanded significantly as software companies automate more of their core processes. Ops is no longer managing spreadsheets and email workflows. Ops is managing software-defined processes, and the accountability is real.

Customer onboarding pipelines

The sequence of steps that moves a new customer from signed contract to active account: provisioning, welcome email, Salesforce record creation, billing setup, and permission grants. Each step is defined in code. When a customer reports they never received their welcome email, ops needs to know which step failed and why. That answer is not in the onboarding dashboard. It is in the code that defines the pipeline.

Scheduled jobs and batch processes

Nightly reconciliation jobs, monthly billing exports, daily data sync operations, weekly cohort calculations. Operations teams are responsible for these running correctly and on time. When one does not run, ops needs to understand whether the job is delayed, failed, misconfigured, or intentionally disabled. None of those questions are answered by “status: queued.”
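To make that distinction concrete, here is a minimal Python sketch of the triage ops is trying to do. It assumes a hypothetical job record with `enabled`, `schedule`, `status`, `last_success`, and `expected_interval` fields; real schedulers name and expose these differently. The point is that one raw status value can map to very different diagnoses once schedule and configuration are in view.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class JobRecord:
    # Hypothetical fields; a real scheduler exposes its own names and types.
    enabled: bool                     # False if someone deliberately turned the job off
    schedule: Optional[str]           # cron expression, or None if none is configured
    status: str                       # raw status, e.g. "queued", "running", "failed"
    last_success: Optional[datetime]  # when the job last completed successfully
    expected_interval: timedelta      # how often the job is supposed to succeed


def diagnose(job: JobRecord, now: datetime) -> str:
    """Map raw job fields to the question ops is actually asking."""
    if not job.enabled:
        return "intentionally disabled"
    if not job.schedule:
        return "misconfigured: no schedule defined"
    if job.status == "failed":
        return "failed"
    if job.last_success is None or now - job.last_success > job.expected_interval:
        return "delayed: no successful run within the expected interval"
    return "on schedule"


now = datetime(2025, 1, 10, 10, 0)
stuck = JobRecord(True, "0 2 * * *", "queued", now - timedelta(days=2), timedelta(days=1))
paused = JobRecord(False, "0 2 * * *", "queued", now - timedelta(days=2), timedelta(days=1))
print(diagnose(stuck, now))   # delayed: no successful run within the expected interval
print(diagnose(paused, now))  # intentionally disabled
```

Both example jobs report "queued"; only the surrounding configuration tells them apart.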

Integration workflows

The pipelines that connect the core platform to Salesforce, HubSpot, Stripe, Avalara, Zendesk, Intercom, and whatever else the company has integrated. When a Salesforce record does not get updated after a customer upgrade, ops needs to know which integration is responsible for that sync, whether it ran, and what error it returned if it failed.

SLA monitoring and data quality

Operations teams often own SLA compliance and data quality checks for customer-facing processes. A report that shows incorrect totals, a billing statement that is missing line items, an export that includes stale data — these are ops problems even though the root cause is almost always in code or configuration.

The engineering dependency loop

Every question ops cannot answer themselves becomes a ticket, and every ticket interrupts engineering. Some of those tickets are legitimate engineering work, such as “please fix this defect in the reconciliation logic.” Most are for questions that are, in principle, simple:

“Is the nightly reconciliation job still scheduled to run daily?”
“Which integration is responsible for syncing cancellations to Salesforce?”
“What triggers the bulk export pipeline?”
“Did anything change in the billing job this week?”

These are behavioral questions about configuration and process definition. They require someone with code access to answer, but they do not require engineering judgment to interpret. The answer is either yes or no, or a description of how a process works. Engineering should not be the only path to that information.
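As an illustration of how mechanical these answers can be, the sketch below answers “is the nightly reconciliation job still scheduled to run daily?” from a schedule definition. It assumes the schedule file has already been parsed into a dict holding a standard five-field cron expression; the dict layout and key names are hypothetical. The check is deliberately conservative: only the plain "minute hour * * *" form counts as daily, so equivalent spellings such as `*/1` in the day-of-month field return False.

```python
def runs_daily(cron_expr: str) -> bool:
    """True if a five-field cron expression fires once per day, every day.

    Conservative: only the plain "<minute> <hour> * * *" form is accepted.
    """
    fields = cron_expr.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 cron fields, got {len(fields)}: {cron_expr!r}")
    minute, hour, day_of_month, month, day_of_week = fields
    fixed_time = minute.isdigit() and hour.isdigit()       # one fixed time of day
    every_day = day_of_month == month == day_of_week == "*"  # no day restrictions
    return fixed_time and every_day


# Hypothetical parsed contents of a schedule file such as
# reconciliation-service/config/schedule.yml.
schedule = {"nightly_reconciliation": {"cron": "0 2 * * *", "enabled": True}}

job = schedule["nightly_reconciliation"]
answer = job["enabled"] and runs_daily(job["cron"])
print("yes" if answer else "no")  # yes
```

The answer is a plain yes or no, derived from the definition itself rather than from someone's memory of it.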

The cost accumulates in both directions. Operations managers wait for answers, which means escalations slow down, decisions get delayed, and customers sometimes wait longer than necessary for problems to be acknowledged. Engineering spends time on context questions instead of on the actual code changes that require their expertise. Neither side is getting the better end of this arrangement.

What ops managers currently do instead

Three workarounds that do not scale

Three workarounds ops teams currently use — and why they fail:

1. Ask the engineering lead directly
   Works: yes, usually
   Cost: interrupts an engineer, often takes hours to get a response
   Scale: breaks immediately under volume
   Truth: the answer may be from memory, not from the current codebase

2. Check the dashboard
   Works: for current state only ("is the job running right now?")
   Misses: behavioral questions ("what does the job call?", "why is it queued?")
   Limitation: dashboards show metrics, not configuration or logic

3. Read the documentation
   Works: sometimes, for stable processes that have not changed
   Problem: documentation describes the system as it was when written
   Reality: the job schedule, dependencies, or destination may have changed
   since the last doc update — with no indicator that it drifted

The pattern across all three workarounds is the same: they are all proxies for understanding that should be directly available. Asking an engineer gets you an answer, but through an expensive human intermediary. Checking the dashboard gets you current state, not behavioral understanding. Reading the documentation gets you a snapshot of how the system worked when someone wrote it down.

None of these give operations managers what they actually need: the ability to ask a behavioral question about a process and get an accurate answer that reflects how the system works right now.

The specific questions operations managers need answered

Ops questions that should be self-serve vs. what currently requires a ticket

Questions ops should be able to answer without a ticket:

  Job scheduling:
  -> Is the nightly reconciliation job scheduled to run?
  -> What triggers it — a cron, an event, a manual flag?
  -> Did the schedule change in the last two weeks?

  Integration behavior:
  -> Which external systems does the customer onboarding pipeline touch?
  -> In what order? What happens if one of them fails?

  Configuration:
  -> What environment flag controls whether the welcome email sends immediately?
  -> Is that flag set to batch or immediate in production?

  Data flow:
  -> When a customer cancels, what systems get updated?
  -> Does Salesforce get notified, or does ops have to do that manually?

  Incident triage:
  -> Did anything touch the export pipeline this week?
  -> Is the job silently failing or just queued?

Questions that legitimately require an engineering ticket:
  -> There is a code defect in the reconciliation logic
  -> A job is hitting an infrastructure limit engineering owns
  -> A third-party API is returning unexpected errors

The questions in the first category are not engineering questions. They are behavioral questions about configured processes. The answer does not require judgment about code quality, architectural trade-offs, or technical risk. It requires visibility into how a process is defined and what it currently does.

The questions in the second category are legitimately engineering work. When there is a code defect, an infrastructure limit, or an unexpected third-party API error, ops should file a ticket. But the majority of ops escalations to engineering are not in that second category. They are in the first.

Why “just give ops access to the dashboard” does not work

Dashboards are often proposed as the solution to ops visibility gaps. If operations managers can see job statuses, queue depths, and error rates, they should be able to diagnose most issues without engineering involvement. This sounds reasonable but does not hold up in practice.

Dashboards show current state. Operations needs behavioral understanding. The bulk export job showing “queued” is current state. Why it is queued, what it is waiting for, whether it is waiting because of a deliberate schedule change or because of an upstream failure, what it will do when it eventually runs — that is behavioral understanding. None of that appears in a status dashboard.

To self-serve, operations managers need to know what a process is supposed to do, how it works, what it calls, what triggers it, and what configuration governs its behavior. That information lives in the code and configuration files that define the process, and a dashboard does not expose it. A Grafana panel showing job execution duration does not tell you whether the job schedule was changed last Thursday.
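The Grafana example is worth spelling out, because “did the schedule change?” is a question about configuration history, not metrics. A minimal sketch, assuming two snapshots of a job's configuration are available as dicts (for instance, the same file read at two commits with `git show <commit>:<path>` and then parsed); the keys and values are hypothetical:

```python
from typing import Any, Dict, Tuple


def config_changes(before: Dict[str, Any], after: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Return {key: (old, new)} for every key that was added, removed, or changed.

    A missing key shows up as None on the side where it is absent.
    """
    keys = sorted(before.keys() | after.keys())
    return {k: (before.get(k), after.get(k)) for k in keys if before.get(k) != after.get(k)}


# Hypothetical snapshots of the bulk export job's config, before and after a deploy.
before_deploy = {"cron": "0 3 * * *", "enabled": True, "destination": "s3"}
after_deploy = {"cron": "0 3 * * 1", "enabled": True, "destination": "s3", "batch_size": 5000}

for key, (old, new) in config_changes(before_deploy, after_deploy).items():
    print(f"{key}: {old!r} -> {new!r}")
# batch_size: None -> 5000
# cron: '0 3 * * *' -> '0 3 * * 1'
```

In this made-up example the schedule moved from daily to Mondays only, which is the kind of change that leaves a job sitting in “queued” while every runtime panel looks normal.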

Some companies add runbooks. Runbooks are an improvement over nothing, but they are a documentation artifact — which means they face the same decay problem as all documentation. A runbook that describes the bulk export pipeline as of eighteen months ago is not useful when the export pipeline has since gained two new dependencies and a changed schedule.

What plain-language system access looks like for operations

The useful interface for operations managers is not a code browser. It is not a Confluence page. It is the ability to ask a plain-language behavioral question and receive a plain-language answer that is derived from the actual current system.

Example ops queries with system answers in plain language

Ops query: "How does the nightly reconciliation job work?"

System answer:
  Trigger: cron at 02:00 UTC, defined in reconciliation-service/config/schedule.yml
  Reads from: orders table (PostgreSQL), filters for status = 'pending_reconcile'
  Calls: accounting-service via internal REST API at /reconcile/batch
  Writes to: reconciliation_log table, updates order status to 'reconciled'
  On failure: retries 3 times, then sends alert to #ops-alerts Slack channel
  Average runtime: 4-7 minutes for normal batch sizes (< 10,000 rows)
  Known issue: batches > 50,000 rows have caused timeouts in the past (see incident-2025-09)

Ops query: "What is the export pipeline for the monthly billing report?"

System answer:
  Trigger: first day of month at 06:00 UTC, or manual trigger via admin panel
  Reads from: billing-service, subscription-service, usage-service
  Generates: CSV export to S3 bucket billing-exports/monthly/
  Notifies: finance@company.com via SendGrid after successful export
  Dependencies: requires subscription-service to have completed its month-close job first
  Current status of dependency job: available in job-scheduler dashboard

That kind of answer tells an operations manager what they need to know to triage the situation. If the export job shows “queued,” and the system answer explains that the job requires the subscription-service month-close job to complete first, the operations manager now knows to check whether that dependency ran — without filing a ticket.
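The triage step in that paragraph can be sketched in a few lines. This assumes a hypothetical dependency map derived from the pipeline definitions, plus the set of jobs that have already completed in the current cycle; the job names echo the example answer but are illustrative. It checks only direct dependencies; a fuller version would walk the graph transitively.

```python
from typing import Dict, List, Set

# Hypothetical map derived from pipeline definitions: job -> jobs that must finish first.
DEPENDS_ON: Dict[str, List[str]] = {
    "monthly-billing-export": ["subscription-service-month-close"],
    "subscription-service-month-close": [],
}


def waiting_on(job: str, completed: Set[str], depends_on: Dict[str, List[str]] = DEPENDS_ON) -> List[str]:
    """Return the direct upstream jobs of `job` that have not completed yet."""
    return [dep for dep in depends_on.get(job, []) if dep not in completed]


blockers = waiting_on("monthly-billing-export", completed=set())
if blockers:
    print("Queued because it is waiting on:", ", ".join(blockers))
else:
    print("No unmet dependencies; the cause is elsewhere.")
```

An empty result is useful too: it rules out the dependency explanation, which is worth stating in a ticket if one still needs to be filed.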

This is what Kognita provides for operations teams: a managed semantic index of the actual codebase that translates behavioral questions into plain-language answers. The index stays current because it is refreshed automatically from the live codebase. When the export schedule changes, the answer changes. When a new dependency is added to the onboarding pipeline, the answer reflects the new dependency.

Operations managers do not need to read code. They need the code to answer their questions. That is a different interface, and it is the right one for the role.

Jira and codebase together: the second layer of ops visibility

System behavioral questions are one half of what operations managers need. The other half is work-in-progress visibility: understanding what is changing right now that might affect the processes they run.

Connected to Jira, Kognita lets operations managers ask questions like: “Is there anything in progress that touches the export pipeline?” before filing a new ticket about a job behaving unexpectedly. If there is a Jira ticket in the current sprint that modifies the export job's schedule or dependencies, the operations manager finds out before filing a duplicate ticket and waiting three hours for a response that amounts to “yes, we know, we changed it.”
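For teams that want to run a similar check by hand in Jira, the question maps onto a JQL query. A minimal sketch that builds one, assuming the relevant tickets mention the pipeline by name somewhere in their text; the project key and keywords are hypothetical. `statusCategory`, `text ~`, and `sprint in openSprints()` are standard JQL, though `openSprints()` relies on Jira Software sprints.

```python
from typing import List, Optional


def in_progress_jql(keywords: List[str], project: Optional[str] = None) -> str:
    """Build JQL for unfinished, current-sprint issues that mention any keyword."""
    if not keywords:
        raise ValueError("at least one keyword is required")

    def quote(value: str) -> str:
        # Escape embedded double quotes so the JQL string stays well-formed.
        return '"' + value.replace('"', '\\"') + '"'

    clauses = []
    if project:
        clauses.append(f"project = {quote(project)}")
    clauses.append("statusCategory != Done")
    clauses.append("sprint in openSprints()")
    text_terms = " OR ".join(f"text ~ {quote(kw)}" for kw in keywords)
    clauses.append(f"({text_terms})")
    return " AND ".join(clauses)


print(in_progress_jql(["export pipeline", "bulk export"], project="OPS"))
```

`text ~` is a full-text match, so it finds issues containing those words rather than only the exact phrase; expect some noise, which is one reason a codebase-aware answer is more precise than keyword search.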

This also works in the other direction. When ops identifies an actual process failure that needs engineering work, they can file a more complete and accurate ticket because they have already ruled out the configuration and behavioral explanations. Engineering receives a ticket that says “the reconciliation job is failing and it is not a configuration issue or a schedule change; here is what we already checked” instead of “the reconciliation job is not working.” That makes for a cleaner handoff and a faster resolution.

Final take

Operations managers are not asking for engineering access. They are asking for behavioral visibility into the processes they are accountable for. That is a legitimate need that no dashboard satisfies, that stale runbooks cannot reliably serve, and that the engineering dependency loop handles at a cost neither side should accept.

The question “is the nightly reconciliation job configured to run daily?” should not require an engineering ticket. The question “what does the customer onboarding pipeline call, and in what order?” should not require an engineering ticket. The question “did anything change in the billing export job this week?” should not require an engineering ticket.

These are behavioral questions about processes that operations manages. Plain-language system queries answer them accurately, immediately, and without pulling engineering away from the work that actually requires their expertise. That is the correct division of responsibility, and it is achievable now.