Context: IBM i as a blind spot in modern IT
In many organizations, IBM i is mission-critical — but it often lives slightly outside the rest of the IT landscape. Teams invest in modern observability for cloud, microservices, and web apps, while IBM i is still monitored in a traditional way: a few metrics, a few thresholds, maybe logs, maybe some scripts and dashboards.
And it works… until it doesn’t.
The problem is that IBM i incidents rarely show up as “CPU > 90% for 5 minutes.” More often, they show up as subtle signals: a mild slowdown, growing queues, odd job behaviour, a batch window that starts slipping, performance degradation that appears only in a specific timeframe. In classic monitoring, that looks like noise. From the business perspective, it’s risk.

The problem: alert fatigue and no answer to "why?"
Classic monitoring is good at detecting threshold breaches. But on IBM i, the real issue is often not that something is "too high" - it's that something behaves differently than it normally does.
That creates two painful outcomes:
Too many alerts, not enough meaning.
Alerts fire, but they don’t lead to diagnosis. Over time, teams learn to ignore them because most don’t result in action.
The most expensive part of an incident is time-to-cause.
Teams don’t need 100 signals. They need one: what is wrong and what most likely caused it.
In practice it looks like this: an incident hits, and the manual investigation begins. Someone checks CPU, someone checks memory, someone checks disks, someone looks at jobs. Best case, you get a decent hypothesis within an hour. Worst case, you're left with another workaround and a lingering mystery.
The insight: thresholds are blind to behaviour
Thresholds assume "high = bad." But IBM i can have high values that are perfectly normal - and low values that are abnormal. Performance is context-dependent: time of day, batch windows, seasonality, specific job mixes, workload patterns.
So the more useful question isn’t: “Is CPU high?” It’s: “Is CPU behaving differently than it typically does in this context?”
That’s the difference between an alarm and a diagnosis.
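To make that concrete, here is a minimal sketch in Python of the two questions side by side. Everything in it is illustrative: the numbers, the 3-sigma cut-off, and the idea of grouping history by hour of day are assumptions, not a prescription. On IBM i, the history could come from Collection Services or any other per-interval metric feed.

from statistics import mean, stdev

def threshold_alert(cpu_now, limit=90.0):
    # Classic rule: "high = bad", regardless of context.
    return cpu_now > limit

def behaviour_alert(cpu_now, same_context_history, k=3.0):
    # Compare the current value with what this metric usually looks like
    # in the same context (e.g. the same hour of day on recent days).
    baseline = mean(same_context_history)
    spread = stdev(same_context_history) or 1.0   # flat history: avoid divide-by-zero
    z = (cpu_now - baseline) / spread
    return abs(z) > k   # unusually high OR unusually low both count as "different"

# Invented numbers: CPU at 02:00 normally sits around 35%, tonight it is 62%.
history_02h = [35.0, 38.0, 33.0, 36.0, 37.0]
print(threshold_alert(62.0))               # False - below 90%, the classic rule stays silent
print(behaviour_alert(62.0, history_02h))  # True  - far outside the usual 02:00 behaviour

Same value, two very different answers, because only one of them knows what "typical" means here.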

What behavioural observability means (in practice)
Behavioural observability is an approach where the system learns what "normal" looks like — and then detects meaningful deviations from that baseline.
A simple model looks like this (sketched in code below the list):
Baseline — what normal behaviour looks like across contexts
Anomaly — what deviates, and by how much
Context — what else was happening at that time (jobs, workload, dependencies, events)
Recommendation — the most likely cause and what to check first
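As a rough code sketch of how those four pieces could hang together (illustrative only: the metric name, the 3-sigma cut-off, and the playbook entry are invented, not a description of any specific product):

from dataclasses import dataclass
from typing import Optional

@dataclass
class Finding:
    metric: str
    baseline: float        # what "normal" looks like in this context
    observed: float        # what was actually measured
    deviation: float       # how far from normal, in units of the typical spread
    context: dict          # what else was happening (jobs, subsystem, time window)
    recommendation: str    # most likely cause / what to check first

# A tiny, invented "playbook" mapping known patterns to a first check.
PLAYBOOK = {
    "QBATCH queue depth": "Check for held or looping jobs in QBATCH first.",
}

def evaluate(metric, observed, baseline, spread, context) -> Optional[Finding]:
    deviation = (observed - baseline) / (spread or 1.0)
    if abs(deviation) < 3.0:
        return None        # within normal behaviour for this context
    return Finding(metric, baseline, observed, deviation, context,
                   PLAYBOOK.get(metric, "Review correlated jobs and workload."))

The point is not the code; it is the shape of the result. Every finding carries its own baseline, its own context, and a first place to look.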
What the difference looks like during a real incident
Let’s take a familiar scenario:
Users report that the application is slow.
Classic monitoring says: “CPU OK, RAM OK, disk OK.”
And yet the system still feels “stuck.”
With behavioural observability, the first question is: What changed its pattern?
For example:
a specific queue starts building up in a way that’s unusual for this time window,
certain job classes start taking longer than their normal baseline,
activity correlates strongly with one subsystem,
or I/O behaviour looks “off” compared to similar days.
You don’t just get “there is a problem.” You get “this problem looks like X and most often points to Y.” That’s the difference between monitoring and diagnostics.
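A toy version of that "what changed its pattern?" question, with invented signal names and numbers: compare each signal with its baseline for the same time window on similar days, and rank by how far it drifted.

from statistics import mean, stdev

def pattern_shift(name, observed, same_window_history):
    baseline = mean(same_window_history)
    spread = stdev(same_window_history) or 1.0
    return name, (observed - baseline) / spread

signals = {
    # signal name: (value right now, values in the same window on similar days)
    "CPU %":              (41.0, [38.0, 40.0, 39.0, 42.0]),
    "QBATCH queue depth": (220,  [12, 9, 15, 11]),
    "ORDERS job avg sec": (95.0, [22.0, 25.0, 21.0, 24.0]),
}

ranked = sorted((pattern_shift(n, v, h) for n, (v, h) in signals.items()),
                key=lambda item: abs(item[1]), reverse=True)
for name, z in ranked:
    print(f"{name:20} deviation {z:+.1f} sigma")
# CPU looks fine; the queue depth and the job duration are what broke pattern.

Exactly the scenario above: CPU is "OK", yet two signals clearly stopped behaving like themselves.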

Proof: what teams typically gain (without hype)
When you move from thresholds to baseline + context, three things usually happen:
Fewer false alarms → less alert fatigue, more trust in signals
Shorter time to a solid hypothesis → faster clarity on where to look
Faster MTTR → less wandering, less “ghost hunting”
Most importantly: IBM i stops being a black box and becomes part of an observable IT system.
FAQ
Does this replace monitoring?
No. Monitoring tells you what is happening. Behavioural observability helps explain why it's happening.
Do I need full microservices-style instrumentation?
Not always. On IBM i, you can extract a lot of value from system-level signals - as long as you analyse them as behavioural patterns, not isolated numbers.
Is this just "Al magic"?
It shouldn't be. A good platform shows: what it detected, why it thinks it's abnormal, how confident it is, and what to check first.
What you can do right now
If you want a quick reality check on whether your IBM i is suffering from "alert noise without diagnosis," ask one question:
When an incident happens, do your alerts lead you to the cause — or do they mostly trigger another manual investigation?
If it's mostly the second: thresholds and dashboards aren't enough. You need observability that's built around behaviour and context.
If you want, we can review 2-3 recent incidents and map which behavioural patterns would have been detected - and what data you need to reduce time-to-root-cause.