Thoughts & Ideas

Mar 27, 2025

AI in Incident Response: Myth vs. Reality

Eric Abruzzese

Software Engineer @ Aptible

AI-powered incident response tools are popping up left and right, but most of them are overpromising and underdelivering. They promise to revolutionize how outages are detected and resolved, but… how?

What we’ve gotten from the marketing claims is a mix of excitement, skepticism, and general misunderstanding. In this blog, we want to cut through some of that hype and examine how AI is actually being used in incident response.

Let’s separate myth from reality and look at real use cases, case studies, and AI adoption trends, mainly focusing on the following:

  • Myth #1: AI will replace on-call engineers

  • Myth #2: AI will ultimately eliminate incidents

  • Myth #3: AI is just a tech fad

AI will replace on-call engineers ❌ 

A common misconception is that AI systems will fully replace engineers. Within the context of incident response, that would mean that an AI Agent is magically handling incidents end-to-end. But in reality, the best (current) AI tools are being used as copilots to assist engineers rather than doing their jobs for them.

PagerDuty, for example, has emphasized that AI should focus on helping rather than solving: combining automation with an AI copilot yields the best results, helping incident responders get answers to key questions more quickly. This approach augments human precision and speed but does not remove humans from the loop.

So let’s be clear about one thing: AI in its current state IS NOT a reliable source for root cause identification and therefore SHOULD NOT (and some would argue CANNOT) be used as an engineer’s replacement. Not only will it hallucinate if it has to make a guess, but it will return that guess confidently and ultimately make the remediation process take even longer.

AI is a copilot, not an employee ✅ 

The reality is that AI can (and should) be used to rapidly surface likely causes or anomalies, but it doesn’t inherently understand business context or impact. It provides clues, not final answers.

We should be thinking of today’s AI as an extremely fast junior engineer: it will read through logs, metrics, and past incidents in seconds and highlight patterns, but a senior engineer needs to validate those findings and guide the resolution. Misconceptions aside, AI is genuinely useful as a force multiplier for an on-call team, handling tasks at speeds and scales that humans can’t.

AI will ultimately eliminate incidents ❌ 

Unfortunately for all of us, software systems will always have failures, and there’s nothing AI can do about that. As we hinted at earlier, AI is not a magic wand that we can wave at failures and say “be gone now”; it’s a useful tool for some tasks, but it (likely) won’t ever be able to catch and resolve every potential failure before it happens.

AI will reduce the duration and impact of incidents ✅ 

Organizations that are strategically and tactically putting AI to work in their incident response processes have reported seeing fewer large-scale outages over time. The Ops team at FreeWheel, for example, has used BigPanda to help reduce their MTTR by 78%.

Another example: Netflix has spoken about using automated canaries and AI analysis to prevent bad code pushes, effectively catching issues before they hit customers. And Etsy has mentioned using ML models to detect payment anomalies or site crashes and route them to the right team immediately, shaving minutes off response.

The Pragmatic Engineer reported on how tech companies handle incidents and found that many are building internal tools with AI-like capabilities (one example: an internal chatbot that developers can query during outages for relevant metrics or runbook steps). Logz.io’s AI agent is one of the new tools in this space – it can summarize alerts, pinpoint contributing events, and even suggest remediation steps (like “roll back the latest deployment”) as part of its output.

Finally, there’s post-incident AI analysis that could be useful for reducing the duration and impact of future incidents. There’s a tool now that transcribes incident call audio, uses NLP to extract who did what and when, and then automatically drafts postmortem documentation from the incident data. Pretty neat!

AI is just a tech fad ❌ 

The majority of readers here probably haven’t fallen prey to this particular myth, but it is a popular one nonetheless. Of course, we’ve seen plenty of AI tech companies rise and fall already (one report says 90% of AI startups in 2024 failed). There are many complex reasons for these failures (as you can read in the report), but it’s not because AI is a useless technology or a tech fad. It’s more likely that these startups were using AI incorrectly. We’re of the opinion that it’s only going to get more sophisticated and useful with time.

AI is only going to get better ✅ 

We’re seeing most major incident management platforms add AI features – Splunk’s AI-driven insights, Datadog’s Watchdog anomaly detection, and PagerDuty’s Event Intelligence to name a few.

This type of competition and rapid development is already driving improvements in AI.

Many organizations in 2023-2025 are in the evaluation stage, piloting AI assistants in test environments or on non-critical incidents to see how well they perform. About 57% of companies in one survey had GenAI in at least a proof-of-concept or limited production use for incident response.

Sure, full adoption will take time (due to valid concerns around security and accuracy), but there is certainly an overwhelming push for organizations to adopt more AI tools. As with other technological trends, we can probably infer that AI will only get better as engineers work quickly to make it as useful as possible to stay ahead of the competition.

Over time, we can expect AI to handle more of the “mechanics” of incidents: detection, correlation, data gathering, even initial triage. Humans will then be freer to do what they do best: creative problem-solving, complex decision-making, and improving system design to avoid future incidents.

Wrapping up: how to use AI in incident response

If nothing else, here is what you should take away from this blog: AI should be used for certain incident response tasks, but NOT ALL. Here’s what it excels at and why:

  • Anomaly Detection: AI/ML algorithms are very effective at monitoring streams of metrics or logs and flagging unusual patterns. Instead of static thresholds, they learn normal behavior and can catch subtle deviations (e.g. a slight latency creep or error-rate spike) that might escape manual dashboards, which helps with early detection of incidents. The upside is catching issues sooner; the caveat is that these systems can be noisy until tuned. In other words, you have to give the AI a ton of context for it to do anomaly detection well. To reduce the noise, we advise starting with battle-tested classical models (like k-NN or Isolation Forest) as a precursor before handing anomaly detection over to an AI system (see the first sketch after this list).

  • Alert Correlation and Noise Reduction: Incident response often starts with a flood of alerts. AI is really good at correlating and deduplicating alerts to reduce noise. For example, PagerDuty’s platform uses AI to group related alerts into a single incident, so engineers aren’t bombarded by dozens of pages for what is essentially one problem. However, similar to the above, the best practice is to implement tried-and-true non-AI approaches to do the “pre-work” before handing alert correlation over to your AI model (the second sketch below shows one such baseline).

  • Contextual Data Gathering: AI can use the tools at its disposal to pull together information from many sources faster than any human. An incident copilot might automatically fetch recent deployment history, relevant runbook entries, past incident reports, log excerpts around the error timeframe, and so on. Summarization is a related strength: models like GPT-4 can read through lengthy logs or support tickets and pull out the important points, generating a quick incident synopsis or user-impact report that would take a human much longer to write (see the third sketch below).

  • Suggesting Likely Root Causes: AI can correlate clues to point toward probable causes. If every alert that fired is related to Service A, and a code change went out to Service A just before the incident, the AI will flag that. It might not always be right, but it dramatically narrows the search space. For example, Dynatrace’s tool traces transactions across services, and when an error arises, it uses AI to pinpoint which component in the chain is the likely bottleneck or failure – something that’s very hard to do manually in distributed systems. (The final sketch below shows a toy, non-ML version of the deploy-correlation heuristic.)
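
To make these concrete, here are a few minimal sketches in Python. They’re illustrations under stated assumptions, not production implementations, and they don’t reflect any particular vendor’s internals. First, anomaly detection: a sketch using scikit-learn’s IsolationForest to flag outliers in a latency stream. The sample values and contamination rate are made up; in practice you’d fit on a window of real metric history and tune both.

```python
# A minimal anomaly detection sketch using scikit-learn's IsolationForest.
# The latency values below are made up; in practice you'd fit on recent
# metric history and score new points as they arrive.
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical per-minute p95 latency samples (ms): mostly ~100ms, one spike.
latency_ms = np.array([98, 102, 101, 99, 103, 97, 100, 350, 101, 99]).reshape(-1, 1)

# contamination is the expected fraction of anomalies; start conservative.
model = IsolationForest(contamination=0.1, random_state=42)
labels = model.fit_predict(latency_ms)  # -1 = anomaly, 1 = normal

for minute, (value, label) in enumerate(zip(latency_ms.ravel(), labels)):
    if label == -1:
        print(f"minute {minute}: {value}ms flagged as anomalous")
```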
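
Second, a non-AI alert correlation baseline: group alerts that share a fingerprint (here, service plus alert name – an assumed schema, not any vendor’s) inside a time window, so a burst of duplicates produces a single page.

```python
# A non-AI alert deduplication baseline: bucket alerts that share a
# fingerprint (service + alert name) and suppress repeats inside a window.
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Alert:  # hypothetical alert shape
    service: str
    name: str
    fired_at: datetime

WINDOW = timedelta(minutes=5)

def correlate(alerts: list[Alert]) -> dict[tuple[str, str], list[Alert]]:
    """Group alerts by fingerprint; page only once per burst."""
    incidents: dict[tuple[str, str], list[Alert]] = {}
    for alert in sorted(alerts, key=lambda a: a.fired_at):
        key = (alert.service, alert.name)
        group = incidents.setdefault(key, [])
        # Only a gap longer than WINDOW starts a new page-worthy event.
        if not group or alert.fired_at - group[-1].fired_at > WINDOW:
            print(f"page on-call: {key[0]}/{key[1]} at {alert.fired_at:%H:%M}")
        group.append(alert)
    return incidents
```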
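
Third, summarization. This sketch uses the OpenAI Python SDK to turn a raw log excerpt into an incident synopsis; the model name and prompt wording are our assumptions, and any provider your org has approved would slot in the same way.

```python
# A minimal log summarization sketch with the OpenAI Python SDK
# (pip install openai; requires OPENAI_API_KEY in the environment).
# The model name and prompt wording are assumptions, not recommendations.
from openai import OpenAI

client = OpenAI()

def summarize_incident(log_excerpt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # hypothetical choice; use whatever your org has approved
        messages=[
            {"role": "system",
             "content": "Summarize these logs into a short incident synopsis: "
                        "what broke, when, and the probable user impact."},
            {"role": "user", "content": log_excerpt},
        ],
    )
    return response.choices[0].message.content

# Usage: print(summarize_incident(open("api-server.log").read()))
```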
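
Finally, a toy version of the deploy-correlation heuristic from the last bullet: rank services by alert volume and flag any that shipped a change just before the incident started. The data shapes are hypothetical; a real tool would pull them from your alerting and CI/CD systems. It’s a useful sanity check to run before trusting an AI’s root-cause suggestion.

```python
# A toy, non-ML version of the "alerts cluster on a recently deployed
# service" heuristic. Data shapes are hypothetical.
from collections import Counter
from datetime import datetime, timedelta

def suspect_services(alert_services: list[str],
                     deploys: dict[str, datetime],
                     incident_start: datetime,
                     lookback: timedelta = timedelta(hours=1)) -> list[str]:
    """Rank services by alert volume; flag ones deployed just before the incident."""
    ranked = []
    for service, alert_count in Counter(alert_services).most_common():
        deployed_at = deploys.get(service)
        recent_deploy = (deployed_at is not None
                         and timedelta(0) <= incident_start - deployed_at <= lookback)
        ranked.append((service, alert_count, recent_deploy))
    # Services with both a recent deploy and the most alerts float to the top.
    ranked.sort(key=lambda r: (r[2], r[1]), reverse=True)
    return [f"{s}: {n} alerts" + (" + deployed in lookback window" if recent else "")
            for s, n, recent in ranked]

# Usage (hypothetical data): every alert points at service-a, which
# deployed 10 minutes before the incident started.
now = datetime(2025, 3, 27, 12, 0)
print(suspect_services(
    ["service-a", "service-a", "service-b", "service-a"],
    {"service-a": now - timedelta(minutes=10), "service-b": now - timedelta(days=2)},
    incident_start=now,
))
```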

If you’re part of one of the many organizations looking to implement a useful form of AI into your incident response processes, we’d love to show you our own AI Agent 😄 For more info on how we built it, you can read this blog. To see how it works, go ahead and get a demo.
