Investigations
Ask Rounds to figure out what's wrong. The investigation agent collects evidence, evaluates hypotheses, checks configured telemetry and Kubernetes context, and writes a structured report you can share with the rest of the team.
What you can do
- Trigger from a symptom — "Investigate the spike in 5xx errors at 14:30 UTC" or "Why is checkout latency up?"
- Auto-correlate with deployment events: the agent pulls `changes.list_recent` to check whether anything shipped near the incident window.
- Multi-signal evidence: the agent queries metrics, logs, recent changes, and Kubernetes state when configured, and cites every datapoint in the report.
- Cited evidence — every claim in the report is anchored to a specific query, log line, or change event so reviewers can verify it. The provenance header on each AI-written section names the model, prompt version, and evidence set used.
- Recommend fixes — if the likely cause is environmental, the agent can propose a remediation. Today this is a Kubernetes plan; planned integrations (GitHub PR, CI/CD rollback, Argo / Flux re-sync) will let the same loop produce non-K8s remediations.
- Approval-gated remediation: interactive runs surface mutating steps inline as Run / Confirm / Apply. Background-agent runs (auto-investigation off a firing alert) emit a `RemediationPlan` with Approve / Reject / Modify controls and notify the owning team / on-call. See Auto-remediation for the background flow.
- Read the report: a sectioned write-up (Summary → Symptoms → Hypotheses → Evidence → Conclusion → Recommended actions).
- Continue the conversation — ask follow-up questions inside the investigation thread; the agent has the full evidence loaded.
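
The approval gate described above can be illustrated with a small sketch. This is not the product's `RemediationPlan` schema: the class `PlanSketch`, its fields, and the rule that a modified plan returns to pending review are all assumptions made for illustration. The only behavior taken from the text is that mutating steps do not run until someone approves them.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class PlanSketch:
    """Hypothetical stand-in for a remediation plan; all fields are assumptions."""
    steps: list[str]
    decision: Decision = Decision.PENDING

    def approve(self) -> None:
        self.decision = Decision.APPROVED

    def reject(self) -> None:
        self.decision = Decision.REJECTED

    def modify(self, steps: list[str]) -> None:
        # Assumption: an edited plan must be reviewed again before it can run.
        self.steps = list(steps)
        self.decision = Decision.PENDING

    def apply(self) -> list[str]:
        # Mutating steps are refused unless the plan has been approved.
        if self.decision is not Decision.APPROVED:
            raise PermissionError(f"plan is {self.decision.value}, not approved")
        return [f"applied: {step}" for step in self.steps]
```

Keeping the check inside `apply` means no caller can skip the review step by accident, which is the point of gating mutations behind approval.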
How to use it
Start an investigation
In the chat panel:
Investigate why the order-service p99 latency jumped at 09:15
The agent calls `investigation.create` with a title and description, then iterates:
- `metrics.range_query` to plot the symptom and find when it started
- `metrics.label_values` to find related dimensions (handler, region, instance)
- `logs.query` for error patterns in the affected window
- `changes.list_recent` to look for deploys / config changes
- Kubernetes inspection tools to check pods, events, rollouts, and resource pressure when a cluster connector is configured
- `investigation.add_section` repeatedly as evidence accumulates
- `investigation.complete` when the analysis converges
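
The sequence of tool calls can be sketched as a small driver. This is an outline under stated assumptions, not the agent's implementation: the `call_tool` dispatcher, the `run_investigation` function, and every argument name are hypothetical, and the real agent presumably chooses its next query based on what it finds rather than following a fixed order. Only the tool names come from the steps listed here.

```python
from typing import Any, Callable

# Hypothetical dispatcher type: takes a tool name and an argument dict.
ToolCaller = Callable[[str, dict[str, Any]], Any]

# Tool names come from the documented steps; the fixed order is a simplification.
EVIDENCE_TOOLS = [
    "metrics.range_query",   # plot the symptom and find when it started
    "metrics.label_values",  # find related dimensions
    "logs.query",            # error patterns in the affected window
    "changes.list_recent",   # deploys and config changes
]


def run_investigation(call_tool: ToolCaller, title: str, description: str) -> list[str]:
    """Call the tools in the documented order and return the names called."""
    called: list[str] = []

    def step(tool: str, args: dict[str, Any]) -> Any:
        called.append(tool)
        return call_tool(tool, args)

    step("investigation.create", {"title": title, "description": description})
    for tool in EVIDENCE_TOOLS:
        # Argument names are assumptions; real calls would carry queries and time ranges.
        result = step(tool, {})
        # Record each finding as a report section as evidence accumulates.
        step("investigation.add_section", {"source": tool, "body": str(result)})
    # Kubernetes inspection is omitted: the text does not name those tools.
    step("investigation.complete", {})
    return called
```

Passing a stub as `call_tool` makes it easy to check the call order in a test without touching any real backend.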
Read the report
Open the investigation from the sidebar. Each section is rendered with:
- Markdown narrative
- Embedded panels (live-querying the same data)
- Links to the underlying datasource + query
Continue investigating
Type a follow-up in the same thread:
Did this also affect the EU region?
The agent reuses the loaded evidence and runs additional queries scoped to the new question.
List past investigations
Go to Sidebar → Investigations and filter by date, status, or tag. `investigation.list` is also exposed via the API for dashboards or external tools.
Examples
| Prompt | Investigation focus |
|---|---|
| Why did the alert "high-error-rate" fire at 03:14? | Pulls alert evaluation logs, queries the alert's metric, correlates with change events |
| What's causing the slow queries on the API last hour? | Range queries on duration histogram, log search for slow-query patterns, deploy diff |
| Why are pods restarting after the deploy? | Checks restart metrics, pod events, rollout status, logs, and resource limits |
| Compare today's traffic with last week's same time | Range queries with offset, deltas plotted per handler |
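
For the last row, a week-over-week comparison can be expressed as one query that subtracts an offset series from the current one. The sketch below assumes a Prometheus-compatible metrics datasource and uses PromQL's `offset` modifier. The helper `week_over_week_delta` and the metric name `http_requests_total` are illustrative choices, not necessarily what the agent generates.

```python
def week_over_week_delta(metric: str, group_by: str = "handler",
                         rate_window: str = "5m", offset: str = "1w") -> str:
    """Build a PromQL expression for the per-label change versus one offset ago."""
    current = f"sum by ({group_by}) (rate({metric}[{rate_window}]))"
    # The offset modifier goes directly after the range selector.
    previous = f"sum by ({group_by}) (rate({metric}[{rate_window}] offset {offset}))"
    return f"{current} - {previous}"


# Hypothetical metric name, used only to show the shape of the query.
print(week_over_week_delta("http_requests_total"))
```

A positive result for a handler means it is receiving more traffic now than at the same time last week.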
Limits
- The investigation agent has the same `allowedTools` as the orchestrator plus investigation-specific ones. Read-only inspection is allowed when permitted; mutating infrastructure actions require approval.
- Time windows default to ±2h around the prompt's time reference. Specify explicitly for longer ranges: "investigate the last 24 hours of error rate spikes".
- Log queries are subject to your datasource's native limits (Loki: 5000 lines per query by default).
- The agent stops when it reaches a confident conclusion or exhausts its token budget; you can always ask "what else could it be?" to push further.
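
As a worked example of the default ±2h window, the arithmetic looks like this. The function name `default_window` and the example date are made up for illustration. How the agent parses a time reference out of the prompt is not covered.

```python
from datetime import datetime, timedelta, timezone


def default_window(reference: datetime, hours: float = 2.0) -> tuple[datetime, datetime]:
    """Return (start, end) spanning `hours` on each side of `reference`."""
    delta = timedelta(hours=hours)
    return reference - delta, reference + delta


# A prompt mentioning 09:15 UTC on an arbitrary example date.
ref = datetime(2024, 1, 1, 9, 15, tzinfo=timezone.utc)
start, end = default_window(ref)
print(start.isoformat(), end.isoformat())
# → 2024-01-01T07:15:00+00:00 2024-01-01T11:15:00+00:00
```

A prompt that names 09:15 therefore covers 07:15 to 11:15 by default, which is why longer incidents need an explicit range.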
Related
- Datasources — connecting metrics + logs backends
- Alert rules — start an investigation from a firing alert; automatic investigation is the next product loop
- Permissions: `investigations:read` and `chat:use` for viewer access