Can Agile Be Used for Production Support?

Can agile be used for production support? I was once asked this question while working with an IT service team that was tasked with keeping several business-critical systems up and running in production.

I think this is a very good question. Agile started as a movement for software developers in the early 2000s. Looking at version 4 of the ITIL framework, it’s hard to not conclude that it took a long time (more or less two decades) for service management to catch up.

Agile approaches can be used for supporting software systems in production. Kanban is the best agile service management method to use for production support. It helps you to visualize support tickets and limit work-in-progress, leading to shorter response and resolution times.

The agile movement started in 2001, when 17 software development thought leaders got together at a ski resort in Snowbird, Utah. They got snowed in and, with no skiing to do, started talking about how they built software and worked with clients instead. Finding a surprising number of commonalities between their ways of working, they wrote them down on a whiteboard and created the Manifesto for Agile Software Development.

How Agile Works for Software Development

Agile is an iterative and incremental approach to software development. Agile practitioners value conversations about what is valuable and time spent building it over endless negotiation of contracts and lengthy documentation of requirements. Instead of developing everything at once and releasing it to production on a big-bang release day, an agile team will work in iterations (like two-week time boxes) and release working software in small and consumable increments (version 1, version 2, version 3, etc.).

Agile teams usually do this by following an agile framework. An agile framework is a set of roles, events, and artifacts that help a team and its sponsors or stakeholders to work in agile ways. Like the rules of a game, the framework helps keep the teams and sponsors or stakeholders balanced, so that they don’t revert to their old waterfall ways or end up in complete chaos.

The most popular agile framework is Scrum. In Scrum, a Dev Team works with a Product Owner to build a product. The members of the Dev Team have all the knowledge, experience, and tools needed to, independently and self-sufficiently, get work done.

The Dev Team works on a “product backlog” that contains all the requirements for the product. Their work happens in iterations, called “sprints.” Each sprint has a backlog of its own, called a “sprint backlog.” The members of the Dev Team determine the scope of each sprint backlog in a “sprint planning” meeting.

At each sprint planning, the Product Owner sets a “sprint goal.” The sprint goal is a short statement, usually a sentence long, that determines what the focus of the next two weeks is going to be. The Product Owner then has a conversation with the Dev Team on which items related to that goal they can pick up from the product backlog (and commit to completing in the sprint). Each item that the Dev Team commits to becomes part of the sprint backlog.

The work is estimated in story points. Story points are a relative estimate of the work, usually drawn from a modified Fibonacci sequence: 0, 0.5, 1, 2, 3, 5, 8, 13, 20, 40, 100. The widening gaps in the sequence are a useful abstraction because estimation is never 100% accurate, and they help the Dev Team and Product Owner make trade-offs early on. The commitment of the Dev Team is guided by their velocity: the amount of work completed, measured in story points, in the previous sprint.
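As a rough illustration of how velocity can guide a commitment, the sketch below walks a backlog in priority order and commits to every item that still fits in the remaining budget. The backlog items, their point values, and the plan_sprint helper are all hypothetical. In practice, the scope of a sprint is agreed in conversation rather than computed by a formula.

```python
# Estimation scale quoted in the text; used here only to validate estimates.
STORY_POINT_SCALE = (0, 0.5, 1, 2, 3, 5, 8, 13, 20, 40, 100)


def plan_sprint(backlog, velocity):
    """Commit to each item, in priority order, that fits the remaining budget.

    backlog: list of (title, story_points) tuples, highest priority first.
    velocity: story points completed in the previous sprint.
    """
    committed, remaining = [], velocity
    for title, points in backlog:
        if points not in STORY_POINT_SCALE:
            raise ValueError(f"{title!r}: {points} is not on the scale")
        if points <= remaining:
            committed.append(title)
            remaining -= points
    return committed


# Hypothetical product backlog, highest priority first.
backlog = [
    ("Password reset flow", 8),
    ("Audit log export", 13),
    ("Fix typo on login page", 0.5),
    ("Dark mode", 5),
]
sprint_backlog = plan_sprint(backlog, velocity=15)
print(sprint_backlog)
```

With a velocity of 15, the 13-point item does not fit after the 8-point item has been committed, so the sketch skips it and picks up the two smaller items instead.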

When the sprint goal is clear and the sprint backlog is full, the planning is done. Then, the Dev Team is left to work on their own for two weeks. Every morning or afternoon, depending on the team’s location(s) and time zone(s), the members of the Dev Team meet for a 15-minute daily standup, and each of them syncs up on what they worked on yesterday, what they’re going to work on today, and whether or not they have any impediments.

Scrum is not prescriptive about the product backlog items (PBIs) and backlogs themselves. In most cases, the PBIs and backlogs tend to take the form of user stories on a Sprint board.

At the end of the sprint, the Dev Team holds a Sprint Demo, where they show completed work to the Product Owner and any sponsors or stakeholders invited by her. The Sprint Demo is where the team gets feedback on their work from others, which they can use to stay synced to their surroundings and course-correct along the way.

Before or after the Sprint Demo, the Dev Team conducts a Sprint Retrospective, where they talk openly, honestly, and in a safe space about what went well and what could have gone better, allowing them to continuously improve their ways of work.

Most Scrum Teams also have a Scrum Master. The Scrum Master is a servant leader and Scrum expert who helps the team foster an agile mindset and master the Scrum framework. The Scrum Master teaches, mentors, facilitates, and coaches the Product Owner, the Dev Team, and its individual members — until they no longer need his help and he moves on to helping another Scrum Team.

Continuous Improvement of Your Production Support Practices

At this stage, most of you coming from a background in production support are probably thinking… Sure, agile and Scrum may work well for software development, but planning one or two weeks ahead is virtually impossible in the field of production support.

Critical incidents come in at any time of day and night, incidents that affect many users simultaneously eat up your team’s response times, and resolution is always a challenge, especially when you come across a new issue that you’ve never solved before.

You simply don’t have the luxury to plan out two weeks ahead and say, “Okay, for the week we expect a Priority 1 (P1) at 3:54 AM PT on Wednesday morning, a P2 on Thursday afternoon, and we’ll spend the rest of our time responding to customer support tickets as usual.”

Agile’s iterative approach won’t help your team service demand for production support, but it will help it continuously improve its practices and tools.

Most production support teams have a set of practices and tools that they use to get work done. At a minimum, this tends to include:

  • A service desk practice that captures demand for support tickets and service requests from your customers;
  • An incident management practice to quickly respond to customers in case of incidents and restore service “back to normal” for customers;
  • A problem management practice to investigate and resolve the root causes behind incidents, reducing the number of support tickets and need for workarounds.

The remaining practices that your team uses depend on your industry, company, and team. In general, the ITIL 4 framework is a good standard to follow. It defines 34 management practices in total, including the three above.

However you are running your production support operations, chances are that you’re using processes, procedures, and tools. Unless you invest the time and money to keep them up to date and continuously improve them, they’ll end up outdated and useless.

You can use the iterative approach of agile to achieve this. Say that you have a production support team of 10 people. Some are Service Desk Analysts, others are Site Reliability Engineers (SREs).

That’s 10 people working 8 hours/day, 5 days/week, 4 weeks/month, which equals 1,600 man-hours. If they spend 5% of their working hours on processes and tools, that’s 80 hours/month spent on continuous improvement.

Do this consistently for one year, and your team is going to spend 960 hours in total on making it easier to do their work. Even if 50% of that time is for communication and collaboration, that’s 480 hours of work (and 480 hours of value-added conversations).

That’s almost 500 hours spent on discovering new ways to do work, automating manual tasks, and keeping the team’s internal knowledge base up to date. Just think about the Return on Investment (ROI) in terms of saved hours and time spent doing higher-value work!
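The arithmetic above can be reproduced with a few lines of Python. The variable names are illustrative, and the figures are the ones assumed in the example (10 people, 5% of working hours, half of the time spent on communication), so swap in your own numbers.

```python
# Figures from the example above; change them to match your own team.
team_size = 10
hours_per_day = 8
days_per_week = 5
weeks_per_month = 4
improvement_percent = 5     # share of working hours spent on processes and tools
communication_percent = 50  # share of that time spent talking rather than building

monthly_hours = team_size * hours_per_day * days_per_week * weeks_per_month
monthly_improvement = monthly_hours * improvement_percent // 100
yearly_improvement = monthly_improvement * 12
conversation_hours = yearly_improvement * communication_percent // 100
hands_on_hours = yearly_improvement - conversation_hours

print("Team hours per month:", monthly_hours)
print("Improvement hours per month:", monthly_improvement)
print("Improvement hours per year:", yearly_improvement)
print("Hands-on hours per year:", hands_on_hours)
```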

Kanban for Your Service Desk

Cards on a Kanban board | Photo by Jeff Lasovski

Kanban is an agile process improvement method that can help you visualize support tickets and limit work-in-progress, helping you streamline your production support team’s collaboration and maximize their efficiency. Here’s how.

If Scrum is all about cyclicity, then Kanban is all about flow. There are no roles or events in Kanban. No Product Owner, no Scrum Master, no Dev Team. No Sprint Plannings, Demos, or Retros. There are only cards on a board. Cards represent individual work items, the board contains them. Getting started in Kanban is as simple as drawing a couple of lines on a whiteboard and putting sticky notes in the “To Do,” “Doing,” or “Done” column.

Kanban originates from the lean manufacturing movement, which precedes agile by roughly 70 years. The lean manufacturing movement traces its roots to Japan in the 1930s, when Toyota started working on a unique operating model that later became known as “The Toyota Way.”

In the 1940s, Toyota started studying supermarkets. The premise was that Toyota could use the same techniques supermarkets used to stock their shelves for supplying the factory floor with car parts. In a supermarket, consumers take food from the shelves as and when they need it. The supermarket only stocks as much as it can sell in a given period, and consumers only buy as much as they intend to buy.

Toyota called this technique “Kanban,” for signboard or billboard in Japanese. Kanban was a way for Toyota to keep only the inventory of car parts it needed in its factory, using it just in time to assemble cars on demand.

To a large extent, Kanban was popularized in software development thanks to the work of David J Anderson, author of multiple books on the topic and creator of Kanban University.

Simply said, Kanban is a work management system that helps you achieve three things: balance demand and supply, improve the flow of work from start to finish, and continuously improve both the flow and the work itself.

Sample Kanban System for a Production Support Team

Say that your service desk team decides to use Kanban for managing support tickets from customers.

To keep our example simple, I’m going to sketch our Kanban flow in Lucidchart. But we’re going to assume that:

We’re also going to set up a basic Kanban board with five states:

  1. Ticket Received
  2. Initial Response
  3. In Investigation
  4. Solution Proposed
  5. Resolved

In Kanban, the Kanban board is split into multiple columns. Each column represents a state that cards go through as they move from left to right. The simplest Kanban board has three states: To Do, Doing, and Done. One thing you need to know about Kanban is that it works best when your board and the columns on it represent your actual way of work.

As David J Anderson wrote in Kanban: Successful Evolutionary Change for Your Technology Business, “Kanban is not a software development lifecycle methodology or an approach to project management. It requires that some process is already in place so that Kanban can be applied to incrementally change the underlying process.”

This is why Kanban works so well for production support and IT operations as a whole.

First, it’s incremental, but not iterative. It allows you to continuously improve your ways of work without working in cycles (unless you actually prefer to).

Second, it doesn’t come into conflict with your existing practices, processes, and procedures — it’s there to visualize them and help you improve them.

Here’s what our Kanban board for production support looks like:

Kanban board for production support teams

A customer opens the self-service support portal and has two options. They can either report a bug or ask a question. Based on which option they choose, they are asked to answer a number of required or optional questions. Their answers get submitted along with the ticket, helping our team to quickly understand what it is about.

When the customer submits a support ticket, the support ticket shows up in the “Ticket Received” column of our Kanban board. One of the service desk team members on shift picks the ticket up, assigns it to himself, and moves it to “Initial Response.” He replies to the customer in the ticket, letting them know that our team is looking into their request. Then, he moves it to “In Investigation.”

The support ticket is now “In Investigation.” Our service desk team members do an initial analysis of the customer’s request and look for possible solutions in their knowledge base. The knowledge base can be any repository of documents stored in a structured way. Some teams like to use a wiki tool like Atlassian Confluence. Others prefer to keep documents on cloud storage services like Box or on a shared file server in the office.

As soon as our service desk agent has found a solution to our customer’s problem, he proposes it to them in a reply and moves the ticket to the “Solution Proposed” state. Now, the customer needs to test the solution and confirm whether it works on their end. If the customer continues to have problems, the service desk agent moves the ticket back to “In Investigation.” If the customer confirms that all is okay, he moves it to “Resolved.”
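The walkthrough above can be read as a small state machine. The sketch below is one possible way to model it, assuming the five board states listed earlier and only the moves described in the text. The Ticket class, the TRANSITIONS table, and the sample ticket are hypothetical and not tied to any particular ticketing tool.

```python
from enum import Enum


class State(Enum):
    TICKET_RECEIVED = "Ticket Received"
    INITIAL_RESPONSE = "Initial Response"
    IN_INVESTIGATION = "In Investigation"
    SOLUTION_PROPOSED = "Solution Proposed"
    RESOLVED = "Resolved"


# Allowed moves, taken from the walkthrough: a proposed solution either
# resolves the ticket or sends it back to investigation.
TRANSITIONS = {
    State.TICKET_RECEIVED: {State.INITIAL_RESPONSE},
    State.INITIAL_RESPONSE: {State.IN_INVESTIGATION},
    State.IN_INVESTIGATION: {State.SOLUTION_PROPOSED},
    State.SOLUTION_PROPOSED: {State.IN_INVESTIGATION, State.RESOLVED},
    State.RESOLVED: set(),
}


class Ticket:
    def __init__(self, summary):
        self.summary = summary
        self.state = State.TICKET_RECEIVED
        self.history = [self.state]  # every state the card has been in

    def move_to(self, new_state):
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(
                f"cannot move from {self.state.value} to {new_state.value}"
            )
        self.state = new_state
        self.history.append(new_state)


# Hypothetical ticket whose first proposed solution did not work.
ticket = Ticket("Cannot export report")
for step in (
    State.INITIAL_RESPONSE,
    State.IN_INVESTIGATION,
    State.SOLUTION_PROPOSED,
    State.IN_INVESTIGATION,
    State.SOLUTION_PROPOSED,
    State.RESOLVED,
):
    ticket.move_to(step)

print(" > ".join(state.value for state in ticket.history))
```

Keeping the allowed moves in one table makes it easy to adjust the model when the board changes, which matters because the board should mirror the team's actual way of working.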

This is how production support work flows from request to completion in a Kanban system. Kanban is a useful system for production support because it is a “pull system.” This is a term from the field of supply chain management and means that work is triggered only when there is customer demand for it. You don’t preemptively do support for the sake of doing support.

Managing Support Bottlenecks in a Kanban System

Kanban starts by visualizing the flow of work, but it doesn’t end there. Kanban is a method that can help you identify bottlenecks as they form and measure the overall efficiency of your system along the way.

Say that a bug affects a large number of customers. There is a workaround, but it needs to be given to customers individually. A high volume of support tickets comes into the “Ticket Received” column and half of the team is struggling to keep up.

A bottleneck has formed. There’s more demand than there is capacity to service support tickets in the “Ticket Received” and “Initial Response” states.

Kanban board for production support teams: a bottleneck comes in

This leads our service desk team to two conclusions:

  1. We need to inform customers about the bug and workaround immediately, so that they can fix the problem on their own and new support tickets stop coming in at this unsustainable pace.
  2. We need to stop working on less urgent support tickets on the right and get all hands on deck to respond to customers’ support tickets as soon as we can.

The team writes a blog post to inform customers about the bug and workaround. They add a notice that briefly describes this situation and links to the blog post at the top of the self-service portal, so that customers who are experiencing this issue see the notice instead of submitting a ticket and resolve it on their own.

All team members, whatever their roles and experience, swarm together to reply to the existing support tickets and share a link to the blog post.

After a couple of hours, support tickets have stopped coming in at this pace and existing support tickets have been answered.

After several hours more, the product team resolves the bug and the problem is now gone.

Measuring Response and Resolution Times in Kanban

Two of the metrics that any support team should track are Response Time and Resolution Time.

Response time is the time it takes a service desk agent to respond to the customer’s request. That response can be as simple as “We’ve received your request and are looking into it.” It’s a small thing from the perspective of the agent, who deals with tens or hundreds of tickets a day, and a big thing in the eyes of the customer, who wants to know that they are being taken care of.

Resolution time is the total time it takes to resolve a ticket from the moment the customer submits it to when they confirm that all is back to normal or that they have the information they need. Measuring resolution times for each ticket type and for specific periods, like every week, month, quarter, or year, is a great way to see if the average time it takes to service customers is getting longer or shorter with time.

Kanban helps you do this with two concepts called Lead Time and Cycle Time.

Kanban board for production support teams: measuring cycle time and lead time

Cycle Time is the total time that a card spent in a single state. It can also be reported as the average time that all cards spent in that state over a given time period. Cycle Time can help you measure not only Response Time, but also the time spent in any other state within your production support flow.

Measuring Cycle Time for each state helps you to pinpoint symptoms of systemic problems in your processes or tools that you can resolve, leading to shorter lead time and improved flow. Simply said, that means more satisfied customers and more engaged employees.

Lead Time is the total time that it took a card to move from the leftmost column to the rightmost one. It can also be reported as the average time that it took to complete all cards over a given time period.

In our service desk example, Lead Time is essentially Resolution Time for support tickets. Measured accurately and consistently, this metric can be very valuable to help you see when you’re headed in a good or bad direction over time. For example, if Lead Time keeps increasing, that means somewhere along the way, tickets are taking longer to resolve. You might want to address this before it turns into a problem for the customers and the organization.
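Both metrics fall out of simple timestamp arithmetic. The sketch below assumes that you can export, for each ticket, the time it entered each of the five board states, and that every ticket visits each state exactly once, which is a simplification (a ticket sent back to “In Investigation” would need extra bookkeeping). The timestamps and helper names are made up for illustration.

```python
from datetime import datetime, timedelta

STATES = [
    "Ticket Received",
    "Initial Response",
    "In Investigation",
    "Solution Proposed",
    "Resolved",
]


def cycle_times(entered):
    """Time a card spent in each state before moving to the next one.

    entered maps each state name to the datetime the card entered it.
    Simplification: every card visits every state exactly once.
    """
    return {
        state: entered[next_state] - entered[state]
        for state, next_state in zip(STATES, STATES[1:])
    }


def lead_time(entered):
    """Total time from "Ticket Received" to "Resolved"."""
    return entered[STATES[-1]] - entered[STATES[0]]


def average(durations):
    """Mean of a list of timedelta values."""
    return sum(durations, timedelta()) / len(durations)


# Hypothetical timestamps for two tickets.
tickets = [
    dict(zip(STATES, [
        datetime(2024, 1, 8, 9, 0),
        datetime(2024, 1, 8, 9, 10),
        datetime(2024, 1, 8, 9, 15),
        datetime(2024, 1, 8, 11, 15),
        datetime(2024, 1, 8, 13, 0),
    ])),
    dict(zip(STATES, [
        datetime(2024, 1, 8, 10, 0),
        datetime(2024, 1, 8, 10, 30),
        datetime(2024, 1, 8, 10, 40),
        datetime(2024, 1, 8, 11, 40),
        datetime(2024, 1, 8, 12, 0),
    ])),
]

avg_wait = average([cycle_times(t)["Ticket Received"] for t in tickets])
avg_lead = average([lead_time(t) for t in tickets])
print("Average time in Ticket Received:", avg_wait)
print("Average lead time:", avg_lead)
```

For these two sample tickets, the average wait in “Ticket Received” is 20 minutes and the average Lead Time is 3 hours. Running the same calculation per week or per month gives you the trend discussed above.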

Limiting Work In Progress with Kanban

One of the ways to improve flow in a Kanban system is to limit work in progress (WIP). Work in progress is the number of cards that a team is working on at any given moment, either across the whole Kanban board or within a single Kanban state.

Setting WIP limits helps you ensure that your production support team only picks up as many support tickets as they can process. This protects individual team members from overload and surfaces systemic bottlenecks in your Kanban system.

Kanban board for production support teams with WIP limits

Through measurement, you’ve found that, on average, your service desk team can investigate 5 support tickets effectively at any one time on a given shift. That number is a natural WIP limit for the “In Investigation” state.

When there’s short-lived demand for more throughput (such as a time when you expect many tickets to come in), the team can temporarily increase its capacity by having more team members work on a single shift.

At the same time, each service desk team member proactively looks for and experiments with small tweaks to their processes and tools which continuously add more efficiency to their way of work over time.
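A WIP limit is easy to express in code: a column refuses new cards once it is full. The sketch below is a minimal model, assuming a limit of 5 on “In Investigation” as in the example above. The Board class, the WipLimitExceeded exception, and the ticket IDs are hypothetical, and treating extra people on a shift as a higher limit is an assumption about how a team might model the added capacity.

```python
class WipLimitExceeded(Exception):
    """Raised when a card is added to a column that is already full."""


class Board:
    def __init__(self, wip_limits):
        self.wip_limits = dict(wip_limits)  # column name -> max cards
        self.columns = {}                   # column name -> list of cards

    def add(self, column, card):
        cards = self.columns.setdefault(column, [])
        limit = self.wip_limits.get(column)  # None means no limit
        if limit is not None and len(cards) >= limit:
            raise WipLimitExceeded(f"{column} is full ({limit} cards)")
        cards.append(card)


board = Board({"In Investigation": 5})
for n in range(1, 6):
    board.add("In Investigation", f"TICKET-{n}")

# A sixth ticket is refused while the limit is 5.
try:
    board.add("In Investigation", "TICKET-6")
    blocked = False
except WipLimitExceeded:
    blocked = True

# More people on the shift: temporarily raise the limit and try again.
board.wip_limits["In Investigation"] = 7
board.add("In Investigation", "TICKET-6")
print(blocked, len(board.columns["In Investigation"]))
```

The refusal is the useful part: a full column is a signal to help finish what is already in progress before pulling in new work.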

The Bottom Line

Agile started out as an approach for developing software, but agile practices work just as well for supporting software applications in production.

When it comes to production support teams, I’ve come to the conclusion that Kanban works best for external work, where customer demand can be anticipated but not planned, whereas Scrum or other iterative frameworks work better for internal work, where planning and iterative work are possible.

What’s your experience? Let me know in the comments below.

By Dim Nikolov

Jack of all trades and master of none. Dim is a Certified Scrum Product Owner (CSPO) and Certified Scrum Master (CSM). He has a decade of experience as a stakeholder, member, leader, and coach for agile teams.