Surviving the IBM i skills gap

Richard Berman (Moderator):

I'd like to start by thanking everyone for coming today to talk about a very important topic facing the IBM i market. And we have some really great people here who are experts in the field who will be able to answer questions. This is really an opportunity just to share thoughts on this really important topic related to the skills gap. So before we start, I'd like to very quickly introduce a few people who are going to be part of this conversation. First, I'd like to introduce Marek Walczak from i-Rays.

Marek has worked in the midrange space for a very long time, and this is something that he thinks about a lot, and he was the one who came up with the idea for this as a topic for what we wanted to talk about today. We have Marius who consults in the IBM i space and has spoken and thought about this topic extensively and has helped organizations deal with this very real problem. We have from the other side of the pond, Alan. Again, he’s a real expert in IBM i. I think a lot of people on this call probably know Alan, and he has a lot of thoughts about the skills gap in this area. And we have Dawn, of course, who has been very involved in COMMON, and I think pretty much everyone in the IBM i space knows her.

We know that everyone in the IBM i universe is in some way dealing with this subject or the ramifications of this subject. I'm going to start by asking Alan the big question that I think needs to be asked, which is, “How bad do you think the problem really is and how is that problem manifesting itself?”

Alan Seiden:

Thanks very much. You can divide it into a couple of areas. One is in applications. Another area is in knowledge of how the business actually runs. That knowledge is leaving too. And then there's in the administration of the systems. People have focused so long on the technology part of programming, “We can't find RPG people, it's impossible to learn.” I think companies who've stuck around are starting to realize that they can train, they can recruit people and teach them RPG now, but that's not really the hardest part. It's definitely possible. You just have to commit to it. And then after that, you've got the business logic, how that works, which apparently falls to IT, because business people outside of IT are retiring, too. And then the administration and performance and making sure the system runs effectively.

Richard Berman:

Let me throw that same question over to Dawn. I know this is something you've been talking about and writing about for a long time. How do you see the skills gap manifesting itself?

Dawn May:

Yep, thank you. Yeah, so one of the things that I see, I'm out in the world doing hands-on technical work with businesses that I'm helping make their systems run better. And if there's a lack of skills, and I've seen it happen, things just get missed, and problems that would be easily corrected get overlooked. Maybe the tuning was done decades ago and hasn't been revisited. And I tend to see downtime as a consequence of the skills gap, as the implication of what's happening. And once you talk with the ... And I deal mostly with the administration side of the house, and once you start sharing that this is all information that you can get at, that you can fix the problem, and you just do that mentoring and skills education as you work, you can help these problems go away.

Richard Berman:

Marius, what are some specific areas where you're seeing that skills gap affecting on the technical side where organizations are having trouble meeting their technical missions because they're not addressing the skills gap effectively?

Marius Le Roux:

Certainly, usually in my experience, that go-to person that everybody always ran to in an organization suddenly isn't there anymore. People might not panic now or they've given a good farewell, but then the system starts missing that specific person. And that person or that team might have had very intricate knowledge around how the system behaves - like when you listen to a car engine that starts making a certain noise - and they knew exactly what types of workload all of these systems are running. Now, if you don't have that skill anymore, then the engine is going to start becoming louder and louder as time goes on. The same thing applies on IBM i. And what makes it different is that this is normally a very specialized skillset that has got lots of depth into the system. They know the different parts of the system, for example, how the memory behaves at certain times of the day and how these structures are moving around. So, one needs to really start looking at that point. And as Alan has mentioned, no one is looking at the administration side of things, the technical, what does the box do.

Richard Berman:

Going back to Alan, how are companies trying to fix this problem? I think everyone acknowledges that it's an issue, but what are some of the concrete steps that you're seeing organizations take to fix the problem?

Alan Seiden:

In terms of programming and applications, they can bring people to training. There's even free training. Kisco has a fellowship to get free RPG training. We do mentoring with our developer support. There are also free meetings, such as our “Code for i Fridays” Meetings, for example. And there's also outsourcing, to us and many other companies. As far as administration, companies either try to hire someone, or have a managed service provider try to handle things. They have helpers, Marius, Dawn, other people to come in, but often they wait until there's a major problem and then call them in when there's an emergency. And really, this kind of work is best done over a period of time, to set a baseline and then learn what's normal and then continuing from there. And same for the business knowledge and logic. There's a similar pattern there.

Richard Berman:

Marek, what is the risk to the IBM i ecosystem if this problem isn't fixed?

Marek Walczak:

First of all, the risk is growing. I don’t know what the level of the risk is, but the problem is that the companies running their mission critical applications on IBM i don't have any means or technical capabilities to recognize the risk. They may just be walking on the edge, not even seeing it. And this problem is a little misunderstood because the role of IBM i has changed over time. When I started my career over 30 years ago, I was a programmer in the bank. So there was a solid monolithic ecosystem of applications made in RPG. So there was a front-end green screen, backend, also RPG. And so the management of data, the database DB2 was controlled by RPG programs, and there were very few connections to the outside world. Now, RPG applications have been replaced by Java and whatever. So, we have now a bunch of SQLs grabbing data from IBM i, which is just serving as a database server in most situations. And the complexity of the connections is growing. So, you have just IBM i being a part of a very heterogeneous ecosystem. So, the situation where the IBM i was just designed as a no-touch box, so it was just set up by IBM and it was running continuously without any disruptions for years, even not maintained, is gone. But now you should adjust the configuration anytime something has changed. And still companies do not understand that. They still believe that the risk is not applicable to them. We try to explain this to the market now, with our business management experience and also technical experience, that IBM i is a HILP type of system, so it has high impact, low probability risk characteristics. So it's very easy to overlook that risk.

The consequences can be catastrophic. So what we are saying, instead of waiting for the problems, which they come anyway one day, try to do something with that. So fine-tune your configuration accordingly. Something has changed, look at what has changed in your system behavior and do something with that. And we are just providing that tool that is just helping with that respect.

Marius Le Roux:

I just want to add to what Marek said: it's true that a lot of people just see it as a database and it's been stable. It doesn't need a lot of attention. As long as the little light is going on, CPU is fine, everybody's happy, but we all know business demands change and the integration points have just started becoming more and more complicated over the years as it goes by. And for example, a rogue query that was not a problem before suddenly becomes a problem. If your skill set does not have that capability or that insight to see how the system behaves as a pattern, that becomes a ticking time bomb that can have disastrous consequences in an environment. And normally when it happens, everybody is really trying to solve it, but the clock is ticking.

Richard Berman:

I have a question here for Alan, and then also for Dawn: how is this manifesting in the real world? I mean, we can talk about this from an internal standpoint or a development standpoint, but are there cost overruns? Are there product delays? Downtime? What does this look like for users of IBM i in that ecosystem?

Alan Seiden:

There can be downtime. An idea or architecture that works fine at low scale becomes not fine when it's at high scale. It's efficient up to a point, and then suddenly you hit a certain threshold and it's not efficient anymore. And then you call us, you call Dawn, Marius, and we might say, "You have too many full opens," or whatever we might say, this or that. Then you have a workaround to get them over the crisis and then think about it. "Oh, your architecture should be like this." That's one thing. But I think overall, lack of attention means that systems become inflexible and hard to actually work with and modify. People become afraid to change things in the system too. So business growth and agility are harmed.

Dawn May:

I agree. The phrase I always use is IBM i does amazing work under adverse conditions, and it works until it doesn't. And to Marius's example of a rogue query, I have a real world example where a rogue query was consuming temporary storage for five hours and no one noticed. And guess what happens when that goes on for five hours? The system crashes because you're out of disk space. There's no excuse for that kind of situation to happen, but you have to know what to be looking for, and there's so many things you can be looking for. And that's why, in my experience, it really is catastrophic events that happen because the skills aren't there to know everything that you need to be watching or know what your signatures are so that you can identify that curve as it's happening when it's small before it becomes too big to handle. So, it doesn't tend to show up as downtime or ... And it might not be a system crash, but services aren't available that the business depends on.
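[Editor's illustration] The "identify that curve when it's small" idea can be sketched in a few lines. This is a minimal sketch, not anything the panelists described implementing: it assumes you already have periodic readings of temporary storage usage in a list, and the function name `sustained_growth` and the `min_run` parameter are hypothetical.

```python
def sustained_growth(samples, min_run=6):
    """Return True when the last `min_run` readings are strictly increasing.

    `samples` is a chronological list of usage readings (for example,
    megabytes of temporary storage sampled every few minutes). The
    sampling source is an assumption; this sketch only looks at the shape
    of the curve, not at any fixed threshold.
    """
    if len(samples) < min_run:
        return False  # not enough history to judge a trend
    tail = samples[-min_run:]
    # Compare each reading with the one before it.
    return all(later > earlier for earlier, later in zip(tail, tail[1:]))


if __name__ == "__main__":
    readings = [5, 5, 6, 8, 11, 15, 20, 26]  # made-up numbers
    if sustained_growth(readings):
        print("Temporary storage has grown for six samples in a row")
```

A check like this fires on the trend rather than on a limit, so it can raise a flag long before the disk is actually full.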

Marek Walczak:

Here comes the point because the whole modern world of IT is looking at observability, the tools that allow you to observe holistically the whole infrastructure from the moment when somebody is clicking the mobile application or the web application to the very end of the execution of the transaction. And if something happens for the users, they notice the problem, then you need to react quickly to find the root cause. But it's not anymore just on IBM i, it is just somewhere, and the problem can be hidden from the view of IT staff. And you have very, very different types of skills to manage this. You have database guys, you have, whatever, hardware guys, one for each of the domains, so how can you then find the problem? If they look at their tools, and everybody usually has some tools to observe their part of IT, they may not see it. And then it's just observability that comes into place, but there is no observability for IBM i now. There are some platforms like Dynatrace, Instana, Datadog, whatever. They try to look at IBM i, but usually they don't understand it because it's different by nature. So observing it the same way as you observe other systems usually leads to the wrong conclusions. So that's again why we have designed i-Rays, to just fill the gap.

Richard Berman:

Marek, if you want to talk for just a minute about what i-Rays does and maybe just explain a little bit about what the product does.

Marek Walczak:

Yeah, it looks like we are just jumping in. I've just made it. I just jumped into the conclusion that we have the magic pill for the illness, but it's never like that. But yes, in fact, i-Rays can be kind of the relief or help in that respect. So first of all, we started thinking about this product when we talked to our customers that have Dynatrace as a platform at the same time running the mission critical applications on IBM i, and it was a blind spot for them. They cannot connect the dots between the front end and the backend. So if something happens, it was very difficult for them to justify, what is the source of the problem. So, we started thinking about how can we visualize IBM i in Dynatrace? Over time, we have just ended up with the software that can connect to any platform, but first of all, is doing the data collection.

Then we work on the data with our many years of algorithmized experience. And so we have some algorithms, already being patented, to do that. And we just push the data that are the conclusions to be visualized in any platform. We have our own GUI, but also we can push the data to Dynatrace natively or to another platform via OpenTelemetry. We provide the holistic view of the system. Those not even understanding IBM i can clearly see what's going on, if the problems are coming and how to avoid them. Because at the end, i-Rays is providing the guidance for admins in the form of commands, what to do to get out of problems before they start to be problems.

Richard Berman:

So, a question for Alan, and probably Alan and Marius on this one, and then the follow-up is for Dawn, which is how are companies trying to fix this problem? I mean, obviously you talked about it a little bit earlier, but I want to know a little bit more about over the last 10, 15 years, organizations have tried everything. What are they doing? And then the follow-up question for Dawn is, are those approaches working?

Alan Seiden:

Organizations are learning. They hear so many different things from vendors, from other people and things about the problems. They think they're unique. They think IBM i is unique in terms of the kinds of challenges they have, about aging out, and knowledge being lost, and “legacy.” But that's really true everywhere. We think we have a special problem. We don't. The reason you have legacy is because you've been successful in your business and you got to this point. So then where are we? People have realized, oh, I have to take the responsibility myself to solve this problem. IBM is giving us the options. Now we've got to do something about it. That's in terms of recruitment, in terms of software developer recruitment, mentoring, outsourcing and training and all those things. And in terms of this other area, I hope people come to this seminar and hear what we're saying and meet the people who can help them, but you need the expertise.

Some decide to have a Managed Service Provider (MSP) do things. That's fine. But IBM i does have the best instrumentation to tell us what's going on in the system. And I think what we're also learning is that AI and agents have great potential there, because the system is so consistent, single-level store is there, everything's an object, has better security. The platform’s functions are wrapped around by SQL, even the performance data now. So I think there's actually more potential for us to find solutions than on other platforms.

Marius Le Roux:

IBM i has patented data around its own instrumentation of what it surfaces. Now, for those who might have been involved in ever hearing about IBM iDoctor: if you jump into that, you are going to have a fun time. The problem though is that normally iDoctor is used only after the fact. It's very difficult to try and become proactive with that. Normally that tool just gets invoked when something is wrong or something is clearly going wrong now. So there's always a reactive part to that. But what is happening now with the newest versions of the operating system, it's actually telling you, okay, we can surface all of this data for you, but you need some sort of way to start tying it all together. But that's just one part of the equation. Understanding what that data provides is where knowledge and experience come in. What does it mean if your IO starts driving up in your core hours? Sure, you're doing workload, but what type of workload? Where are these jobs coming from? Which systems? You need to put it into context.

That can give tremendous insights into a business to say, this is an important system because we can see when users are starting to press a button, you are doing sales, but we can also start showing you the pain points now in this specific process that it is doing maybe too much of what it's not supposed to be doing anymore. And normally this problem is compounded by people that are coming to the platform from outside perspectives, trying to maybe do their way of doing things and just bolting on because the code over the years are just ... You never get a chance to actually rewrite everything from scratch. That, as we've mentioned before, it worked in the past, it is not a problem, but the business scales, the business enables a function to put it on a web, for example, maybe even on OpenAI now doing their nice agentic endpoints.

Now suddenly the traffic's going to hit your business from all sides. You need to understand how this behaves in the enterprise and that becomes valuable for your management team to decide on and actually put their private investment also to the platform. And just to maybe give background to why I'm saying that, in the past, everybody bought an IBM i full kit, you bought the whole system there and you didn't have to put too much more investment into that. The times have now shifted. Cloud has adopted new pricing models and it enabled certain businesses to scale as they are growing their business. But on IBM i, we are now starting to see those same factors coming into play as well. What does that mean for your business? You need to have the right answers. You need to show management that these systems that are connected to this box are important.

If we go down, business will suffer with this. So if we do not put attention and actual actions around these hotspots which software now can provide, there's a strong possibility of monetary loss down the road, then they would hopefully react to that.

Dawn May:

I'm going to bounce off Marius's observations because we're both performance people and it is a very reactive environment today and that is not working well because the problem has to manifest itself. And then if you don't collect the data, because not all the data is always collected if you need things such as Job Watcher data, you end up having to endure the problem again. And there are so few of us that have the deep skills to do the performance diagnostics that it's not going to work very well if we continue on the current path. But as was stated, we have the best instrumentation in this box. And if you take some time to learn things like the Performance Data Investigator in Navigator, you'll discover that we have a signature to the behavior of the system. And if a business is running normally, that signature tends to be the same day after day after day.

And this kind of leads us to where we're going in some of this conversation is how can that knowledge of that data that's in the system and the knowledge that we have a signature that can be identified, all this wonderful performance data, can we build tooling to help bridge that gap so you don't have to have really smart people come and look at that data or really experienced people come and look at your performance signatures? Can we have software and build in knowledge to help you do your job so you don't have to necessarily find the skills to do the deep investigation.
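[Editor's illustration] One way to picture a "signature" is as a per-hour baseline learned from normal days, against which new readings are compared. The sketch below is an assumption-laden toy, not the method used by any tool named in this discussion: the function names `build_signature` and `deviates`, the three-standard-deviation tolerance, and the sample numbers are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev


def build_signature(history):
    """Build a per-hour baseline from samples taken on normal days.

    `history` is an iterable of (hour_of_day, value) pairs, where value
    is any metric you trust (CPU %, I/O rate, and so on). Returns a dict
    mapping hour -> (mean, standard deviation).
    """
    by_hour = defaultdict(list)
    for hour, value in history:
        by_hour[hour].append(value)
    # stdev needs at least two samples, so hours with less are skipped.
    return {h: (mean(v), stdev(v)) for h, v in by_hour.items() if len(v) >= 2}


def deviates(signature, hour, value, tolerance=3.0):
    """True when `value` is more than `tolerance` standard deviations
    away from the usual level for that hour of the day."""
    if hour not in signature:
        return False  # no baseline for this hour, so nothing to compare
    mu, sigma = signature[hour]
    if sigma == 0:
        return value != mu
    return abs(value - mu) > tolerance * sigma


if __name__ == "__main__":
    normal_days = [(9, 100), (9, 110), (9, 90), (9, 105), (9, 95),
                   (2, 10), (2, 12), (2, 11)]
    sig = build_signature(normal_days)
    print(deviates(sig, 9, 104))  # a typical 9 a.m. reading
    print(deviates(sig, 2, 100))  # the same level at 2 a.m.
```

The same reading can be ordinary at one hour and abnormal at another, which is why the comparison is made against the hour's own history rather than a single fixed threshold.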

Richard Berman:

I want to pivot now to a subject that is obviously something everyone is talking about, which is AI. And my question, this is a question for Marek. We're going to start with you. What role can AI play in fixing these problems, which often can seem intractable?

Marek Walczak:

The whole world is now waiting for AI to bring some solutions for a lot of problems, and the diagnostics can be augmented definitely by AI. So, all modern platforms like Dynatrace, they use AI, but just simple AI is not enough because as I mentioned before, for IBM i, you need to have an expert knowledge to understand the data. The data are there and it's very easy to collect them, but to make correlations of the data, to draw proper conclusions, you need to have an expert knowledge. So before the AI can act, we need to put some knowledge into algorithms to analyze this. So I would say AI, yes, can help, but it's not that easy.

Alan Seiden (in response to Octavia’s question about choosing career focus on RPG/Db2 or open source and DevOps):

Ideally, I mean, I think modern RPG is great, but has been neglected in a way. We could have solved a lot of our problems if we'd all done more with it, but that said, I appreciate people who can do some of everything, to have a wide range of solutions to bring to bear, to pull all that together and know what fits best where. As the native language of the IBM i, RPG is going to perform great and be optimized. Not all the time. Obviously it takes some skill. But the open source languages are great for new people coming into the platform, and do some things better. Is that the kind of feedback you were looking for? A quick one, anyway.

Dawn May:

So from an administration perspective, today you have many monitoring solutions out there. You can have the native one that comes in Navigator, simple system monitors. There's many third-party monitoring tools, but with these monitoring tools, you have to have some knowledge of what metrics you care about and what thresholds need to be set. And so you need to have some knowledge about the environment in order to correctly use these monitoring tools. And I've seen them fail just because thresholds aren't set correctly, so the notifications don't get out or they happen so fast. I think if we could use AI to understand our performance data, because Collection Services is a goldmine with that data that we have, and understanding that and identifying when things are not behaving normally, that AI could be a very useful tool in predictive and proactive monitoring and management of the system and not waiting until some threshold is crossed and now I'm panicking.

Marius Le Roux:

Yes. To add to that, I see that the DevOps question that was also previously asked, we normally have a lot more predictive analytics around the DevOps platforms on the outside, but not necessarily on IBM i itself, but the DevOps platforms are calling the IBM i through services. Now, what's normally happening with, say, alerts that you do set in all of your systems and you send them all to a monitoring team, you end up getting what they call alert fatigue. When that happens, as humans, we start ignoring the alerts. The alerts are there. They are there for a reason. They're not just being sent for jokes, but AI can now help sort those patterns out. If this system is causing that alert, doing that, it does a time correlation. It's very good at amplifying pattern recognition in data. I always try to convert my own understanding about AI.

It's actually just machine learning models. So it's a machine that's learning patterns, and that is what it is good at. Just feed it the right data sources and ask it to correlate the data. So maybe it can say if there's an incident or maybe the probability of an incident that's going to arise, because there are certain systems that are connecting to your main system-of-record systems. That is definitely the best use, in my opinion, of AI that one can implement in the enterprise.
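[Editor's illustration] The time-correlation idea can be shown without any machine learning at all. The sketch below is a plain rule-based stand-in, not an ML model and not a description of any product: it simply groups alerts that arrive close together so a monitoring team sees one incident instead of many separate notifications. The function name `correlate_alerts`, the tuple layout, and the 60-second window are assumptions made for the example.

```python
def correlate_alerts(alerts, window_seconds=60):
    """Group alerts into incidents by time proximity.

    `alerts` is a list of (timestamp_seconds, source, message) tuples.
    An alert joins the current incident when it arrives within
    `window_seconds` of the previous alert; otherwise it starts a new one.
    Returns a list of incidents, each a list of alerts in time order.
    """
    incidents = []
    for alert in sorted(alerts, key=lambda a: a[0]):
        last_alert_time = incidents[-1][-1][0] if incidents else None
        if last_alert_time is not None and alert[0] - last_alert_time <= window_seconds:
            incidents[-1].append(alert)
        else:
            incidents.append([alert])
    return incidents


if __name__ == "__main__":
    raw = [
        (0, "web", "slow response"),
        (30, "ibmi", "high I/O"),
        (50, "db", "lock wait"),
        (500, "web", "timeout"),
    ]
    for incident in correlate_alerts(raw):
        sources = sorted({source for _, source, _ in incident})
        print(len(incident), "alert(s) from", sources)
```

A learned model would go further, for example by weighing which sources tend to precede which, but even this grouping shows how several alerts from different systems can be read as one event.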

Richard Berman:

What are some of the practical ways that companies can leverage AI to overcome the skills gap? Where can AI come in and help maybe mitigate those risks?

Marek Walczak:

For the purpose of this discussion, I would replace AI by observability, because by nature, observability is bound to AI. So there is no observability without AI. So, let's speak about observability, which has been the buzzword for years for any CIO and IT management, but still for the IBM i environment, it's kind of an unknown word. So that's maybe the biggest problem we see on the market now: when we say you should look at IBM i the same way as you look at other systems, but of course properly understanding the architecture, the majority of administration staff are used to just simple monitoring. So, they just observe CPU consumption, this kind of stuff, which basically on IBM i says nothing, because can you tell me what consumption of CPU is good, 10% or 100%? Both of them can be wrong or bad, depending on the circumstances. So just observing CPU consumption is not enough.

So, observability is holistic understanding of the system state. It's not about observing known states. So, monitoring is capable of telling you that you are reaching the threshold of a known parameter. Here it is about unknowns. So tell me what I don't see. And here observability can help a lot because, first of all, it can augment admins in their daily work, giving them the tool to properly understand the situation. And at the same time, observability is also about giving advice on what to do, not only just telling you that this is wrong, but telling me what to do. So again, observability is there not only to observe if something is running not as expected, so it's going out of the patterns that we observed in the past, but it's also capable of telling you what to do to stay operationally safe. So if you are asking me what observability can do, it can do a lot, because first of all, it can augment admins, of course.

Secondly, as we are talking about the skill gap, it's not only about internal staff in the organizations, but it's also the consultants of the profile of, I would say, Dawn, who can help in almost every situation because she knows the system in great detail. But how many people of that profile do we still have on the market? So in the past when the systems were very stable, there were still some guys in IBM or maybe some other companies capable of helping if something happened. Now it's very difficult to find them. So there is no money that can buy this resource because they don't exist. So, from that perspective, i-Rays, or maybe in the future some other tools that perhaps will be developed, because there is a need for that, can also help staying independent from external, non-existent resources.

So, if something happens, the companies will be in the position to fix the problem by themselves.

Richard Berman:

Dawn just shared in the chat a piece from TechChannel that talks about this. So this is something that everyone should feel free to open up and read. Dawn, I don't know if you want to talk through it a little bit.

Dawn May:

The idea for this article that I wrote, it just went live, I think, Friday, so it's brand new, is I was having a conversation with a fellow IBM i colleague, and we talked about how terminology is part of our problem, in that if you're in the Linux and Windows world, you use words that we don't use, and observability, I think, is one of them. And then telemetry data is another one of them, but IBM i excels at all of this because we have more data built in than anybody else. And so that was the reason behind the idea for this article: these are industry terms. We don't often use them in our IBM i world, but in order to be part of the whole ecosystem, we need to embrace what everyone else around us is doing.

Richard Berman:

So, this is a question I'm going to ask this for all the panelists. What do you see as the trends over the next three to five years related to the skills gap in IBM i? And how do we think that observability and AI and all of this can mitigate this over the next, not just the next six months or 12 months, but really the next three to five years?

Alan Seiden:

Yes, AI is already helping people to get an understanding of unfamiliar technology. I think it's good at interpreting what people are looking at and filling in some gaps. Even so, it helps to have trusted partners to talk to, help you interpret what's there, because often you get some generic advice from AI. I think that’s going to get better and better with AI, but you'll still need people and tools there who are more deterministic like us, or other people, on the panel or other tools, but it'll definitely be in the right direction.

Marius Le Roux:

I'm seeing that there's going to be increased complexity around IBM i. It's not going away in my world. It just plays a different type of role, whether it be a system of record as an authorized data source and secure data source, but the system is still playing in that world, but definitely in a more complicated manner with all of the integration points. Herein lies, again, where AI is also going to start coming into play: the agentic coding agents out there, they are coming one way or another. How are you going to know whether this is “normal”? From an admin's perspective, you can try and predict how a human behaves because 10 humans can only do so many transactions per timeframe. Agentic agents might increase that complexity in various ways due to the many probabilistic outcomes they can provide.

So that's definitely going to come around. And then my other feeling is going to be automation. We need to start bolstering IBM i in a more automated manner. Because of the complexity in the enterprise environments, long gone are the days we have enough time to do manual actions, whether it's your standard operating procedure or just documenting the process. Normally you're going to find that it's going to be an easier way just to start automating those tasks as well.

Ansible is available on IBM i. It's been for quite a while. It helps make a general admin's life so much easier. You just need to know that everything that you are implementing on it still can work successfully. Again, monitoring and observability also follow that around. And should an incident occur, that's again where AI can come into play to say, Hey, these 10 steps that normally run every day, they didn't run. They are important steps. You need to take action on them, or these processes will start having a knock-on effect in the system by overrunning, for example. So these environments, when you have them in a human's capacity and normally that guy that knows everything, all of that workload just gets bolted onto that one specific person for that person to handle; it's going to become more and more and more challenging as time goes by.

You are only human with so many fingers per day. So automation is going to become a more important role as well in a new world. And then we also get audits as well. Auditors require that certain people and certain actions are not being taken any more like they did in the past. So if you are thinking that you can get an admin and a developer in one, businesses are not going to like that. Great.

Marek Walczak:

Automation is key. So the skills will evaporate anyway, we cannot stop it. It's just a retirement process and we will not be able to rebuild the skillset that is just leaving the market. So the only way to solve the problem ... Of course, you can close your eyes and do nothing, but it's not helping. You can try to migrate to other platforms, but we see projects of that kind lasting for ages and never ending and not being justified by business. So I don't think this is the right way. So the automation, so the proper tooling. So, the thing is that i-Rays is the first in the game; new ones will come. I'm pretty sure of that. So we'll have competition.

Today we don't have any competition for the tool. So it is bad and good at the same time because it's a lot of education we have to go through, but this will come. And Ansible, yes, exactly. We are already thinking about closing the loop because we are creating advice that can be converted into manual actions by admins, but we are also talking to Red Hat, which is our partner, about closing the loop with Ansible, which is capable of executing commands on IBM i. So that's the direction we are taking, not only just giving advice, but also closing the loop, so doing this automatically.
