I’ve been working as Tech Lead for an SRE team since September 2023. While I don’t think I have enough experience yet to teach anybody, I’ve learned a lot. An incredible amount, both on a technical level and in leadership. Here, soft skills are just as important as hard skills. I remember that when I got the offer, the first thing I did was look up as much information as I could. What does it mean to be a Tech Lead? What does it mean to lead such a highly technical team? I found a lot and read a lot. Then, over the course of this first year, I learned even more. Little by little, just like with the technical side of my job, I started condensing quick notes, writing down keys that would trigger the stored content in my dicts. Without realizing it, I ended up assembling a cheatsheet. And since this blog is all about sharing and giving back to the Web and the world, byte by byte, I want to share my experience and that cheatsheet, along with another reflection on leading a team.

A thought on my journey from Junior to Tech Lead

I love reading, and I always say that I’m a frustrated historian. Around my early twenties I read The Art of War, and there was a quote that immediately stuck with me, forever: “He must be just, methodical, and reserved.” It really resonated with me, and in some way or another I’ve tried to apply it in what I do. How does a quote from over 2,500 years ago apply to an industry that’s not even 80 years old yet? Well, being methodical is incredibly useful in all things IT.

Being methodical means being well organized: designing a system, validating it, applying it, and then sticking to it. Isn’t that essentially what we do? A rockstar might be great, but if they’re not consistent or organized, they’re not as good as they could be. And we all have stories of that rockstar who one day broke Production. In a moment of overload, emergency, or uncertainty, being methodical gives us the ability to respond, something that confusion and the unknown often strip away from us. A clear runbook that assumes zero prior knowledge when an alert is triggered, a framework for Incident Management during the critical early stages of an incident, a takeoff checklist to ensure we don’t miss anything when starting our shift… These are all examples of a methodical approach to something as chaotic as the work of Reliability Engineering. That’s why, in my experience, striving to be methodical has helped me handle both the technical and soft aspects of leadership. Going back to Sun Tzu’s quote and its parallel with IT: a methodical person adapts and improvises, but does so based on a system. Exactly as we should. As for the parts about being just and reserved, I’ll save them for a future post :p.

My reminder, cheatsheet mode

  • The work of prioritizing, scheduling, and dispatching tasks/tickets has a huge impact on the team. And not doing it has twice the negative impact: bottlenecks, misalignment between the team’s work and the client’s priorities. And overload, whether it’s real or perceived overload.
  • Prioritize, then delegate and let the team do what they do best: their work. If you’re micromanaging without needing to, there’s a problem. If you have to micromanage because you can’t delegate reliably, there’s a problem.
  • As SREs, we can’t be reluctant to change. We like it and we manage it. Finding balance between change and acceptable risk to stability is the key. A solid change management process is essential.
  • If you are not on call, it’s fine to prioritize deep work and take more than 10 minutes to answer an email or message.
  • Finding bottlenecks in the team’s workflow —first, find the KPIs to detect them— has a much greater impact than just resolving one or two stuck tickets.
  • A Tech Lead combines leadership and technical work. It allows you to lead from the front. It’s a unique opportunity to coach and support the growth of everyone. Win-win.
  • Always make time to work on the technical side. Don’t stop coding, don’t stop automating. Don’t stop applying new solutions. Never stop optimizing and solving problems.
  • But sometimes there isn’t time for the technical side, and that’s okay. You should only worry if it’s chronic (like a week without any time for it).
  • You are the bridge between the team and the client (whether internal or external). Sometimes it means being equidistant.
  • Pointing out that something “isn’t right” and just mentioning it doesn’t help. Raising a complaint doesn’t help. What works is detecting something that we believe can be improved, providing evidence and metrics for it, and presenting a solution in the same report.
  • EXTRA: This is a reminder mostly just for myself, but it might also work for others who share my geeky-though-not-as-geeky-as-in-the-90s hobby, video games. “Leading is not an RTS with a minimap, nice interface, top-down vision and instant reaction times. Leading is an [FPS](https://en.wikipedia.org/wiki/First-person_shooter) where you need a clear vision using a limited field of view; then you explain and point out the objective. You must have confidence that it will be accomplished when the next moment requires you to look elsewhere, towards the next challenge.”

Let’s talk metrics

SRE is a role that is heavily focused on and driven by metrics, as it should be. Metrics are the foundation of observability. For me, this doesn’t change when we’re trying to observe our teammates and our workload. But measuring the workload of SREs, in my experience, has a fundamental problem: we handle very diverse requirements, making it difficult to initially estimate the time or effort needed. There are routine decommissions that end up becoming true odysseys, and there are updates that lead to cascading requirements and require forking an unsupported library to update it…

As a result, I’m afraid that the metrics I currently work with aren’t as precise as I would like them to be. I’ll mention a few, very briefly to perhaps expand on this in a future post dedicated solely to metrics. My team and I work in Jira, but I will try to define them in a vendor-agnostic way.

  • Number of tickets assigned to the team (total) and to each member (value and % of the total): Observing how this number varies over time and among team members gives us a very basic idea of how balanced the workload is. It’s a key starting point. Moreover, a higher number of tickets leads to more task rotation and context switching, and that’s where we start to lose a lot to attention residue.
  • List of tickets ordered by priority or volume (%) of tickets at each priority level: This depends on how good we are at organizing and prioritizing tasks, as well as how good those assigning tasks are at giving them a real and honest priority. When everything is urgent, nothing is. Fortunately, in our case, I can prioritize based on customer feedback. The more high-priority tickets there are, the greater the risk of missing a deadline or causing an impact. Increasing the priority of something means lowering the priority of another, since both people and time are finite resources. If we’ve seen that we can handle X high-priority tickets, the acceptance of another must be postponed, or one of the existing tickets has to be deprioritized.
  • Mean Time To Resolution (MTTR): the time from when a request is formally accepted until it is resolved. In our case, due to the need to balance the technical aspect with scheduling during specific maintenance windows (and other factors like budget approvals and costs), it’s not as straightforward to measure or as informative. Additionally, the varying levels of complexity make it difficult to determine when something is taking longer than necessary. However, it can generally provide an idea of the team’s “agility” within the context of the organization.
  • Average Age of tickets: I’m not a fan of averages for any statistics because they can be significantly skewed by outliers. Generally, the higher this average is, the more likely it is that we have many tickets taking longer than ideal or that we have a few that are dragging on for a very long time. Both situations need to be addressed. It can also provide a very generalized idea of whether the team is overwhelmed: if the average age of tickets is consistently increasing, it’s very likely because we’re getting requests constantly and there’s never enough time to work on the backlog. We become increasingly reactive and less proactive.
  • Completed tickets, monthly: I hesitate to mention this because I don’t want to turn anyone into a ticket-closing machine. We are not machines, and we are here to solve problems and manage change, not to close tickets. However, it does provide a very general and imprecise idea of the work pace. If we do good dispatching and the team has a balanced workload, the output should be similar across members. If not, we need to see if it’s because someone has much more complex and time-consuming tasks or if they’re going through a less productive period.
  • Assigned tickets vs. completed tickets, monthly: Again, keeping in mind that we’re not ticket-completing machines, this provides a very basic idea of “performance.” In a well-balanced team, it should tend to be consistent. If at any point the trend becomes unbalanced (with many more assigned tickets than resolved), we might be overloading the team. Conversely, if we see that there are significantly fewer assigned and resolved tickets, we could have someone who is idle.
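
As a rough, vendor-agnostic illustration, a few of the metrics above could be computed from a plain ticket export along these lines. This is only a sketch under stated assumptions: the `Ticket` record, its field names, the sample data, and the use of the creation date as a stand-in for "assigned" are all hypothetical and not tied to Jira or any other tracker.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date
from statistics import mean, median
from typing import Optional


@dataclass
class Ticket:
    # Hypothetical ticket record; field names are assumptions for this sketch.
    assignee: str
    priority: str
    created: date
    resolved: Optional[date] = None  # None means the ticket is still open


def share(values):
    """Map each distinct value to (count, % of total)."""
    counts = Counter(values)
    total = sum(counts.values())
    return {k: (n, round(100 * n / total, 1)) for k, n in counts.items()}


def workload_share(tickets):
    """Open tickets per team member, as count and % of the team's open total."""
    return share(t.assignee for t in tickets if t.resolved is None)


def priority_share(tickets):
    """Open tickets per priority level, as count and % of the open total."""
    return share(t.priority for t in tickets if t.resolved is None)


def open_ticket_age(tickets, today):
    """Mean and median age (in days) of open tickets; None if nothing is open."""
    ages = [(today - t.created).days for t in tickets if t.resolved is None]
    if not ages:
        return None
    return {"mean": mean(ages), "median": median(ages)}


def assigned_vs_completed(tickets, year, month):
    """Tickets created vs. resolved in a month (creation date approximates 'assigned')."""
    assigned = sum(1 for t in tickets
                   if (t.created.year, t.created.month) == (year, month))
    completed = sum(1 for t in tickets
                    if t.resolved is not None
                    and (t.resolved.year, t.resolved.month) == (year, month))
    return assigned, completed


# Made-up sample data, for illustration only.
SAMPLE = [
    Ticket("ana", "high", date(2024, 9, 2)),
    Ticket("ana", "low", date(2024, 9, 10), resolved=date(2024, 9, 12)),
    Ticket("ana", "medium", date(2024, 9, 20)),
    Ticket("ben", "high", date(2024, 6, 1)),
    Ticket("ben", "low", date(2024, 9, 25), resolved=date(2024, 10, 3)),
]

if __name__ == "__main__":
    today = date(2024, 9, 30)
    print("Workload share:", workload_share(SAMPLE))
    print("Priority share:", priority_share(SAMPLE))
    print("Open ticket age:", open_ticket_age(SAMPLE, today))
    print("Assigned vs completed (Sep 2024):", assigned_vs_completed(SAMPLE, 2024, 9))
```

In this made-up sample, a single ticket open since June pulls the mean age up to 53 days while the median stays at 28, which is the outlier effect mentioned above and a reason to look at both numbers side by side.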

Continuous improvement

I know that the above metrics are somewhat general and imprecise, and on their own they don’t convey much. However, combined, they can provide a general idea that allows us to know if we need to take action. Additionally, they are there to be refined over time because I believe in the principle of “Done now is better than perfect later,” which is nothing but iterative and continuous improvement. I have all these metrics on a dashboard that I’ve named “Workload HUD” with the sole intention of having a quick overview of how my team and I are managing our workload. I don’t like to guess or assume things; I prefer to measure them.