SysAdmin Weekly #28: The Right Tool, The Real Cause
Right-size your tech, then go read your vendors' postmortems
TL;DR
Right tool for the right job is a discipline, not a budget line: stop sizing your tech to your ego and start sizing it to the task.
New episode 050: how to run a blameless incident postmortem that ends with a root cause instead of a name, whether you’re a war room of sixty or a shop of one.
The Take: your own postmortem is half the skill; reading everyone else’s, including your vendors’, is the other half, and their transparency is a procurement signal.
Tool of the Week is OneUptime, an open-source stack that captures the incident timeline automatically so your postmortem has real material to work from.
Quick Win: steal the show’s free postmortem template and fill in the boilerplate now, before the pager goes off.
From the Console
One of the most expensive habits in tech has LONG been reaching for the biggest thing on the shelf just because it happens to be sitting there.
I caught myself doing it last week. I had a two-paragraph text file that needed summarizing, and my hand instinctively went straight for the largest, most capable AI model I had access to (because I’ve been doing a lot of complex work lately). Had I not caught myself, it would have been like renting a semi to move a bag of groceries. Opus does not need to summarize your grep output. Haiku is not going to architect your app. The skill is not knowing which model is “best”; it’s knowing which one fits the job in front of you, and most jobs are small.
The lab version of this is worse, because the lab version has blinking lights. Nobody needs a half-rack SAN to hold a few text files and a 2GB media library, and yet the homelab channels are full of gear that exists mostly because it was available and the owner could. I have been guilty of this too. Right tool for the right job is not a spending rule; it’s a discipline, and arguably MORE important in today’s world of insane hardware prices than it used to be. The moment you size the tech to the task instead of to your ego, everything gets cheaper, quieter, and easier to reason about.
The latest on the SysAdmin Weekly Podcast
Episode: How Do You Run a Blameless Incident Postmortem?
Topic: Andy and Eric Siron pull from combined decades of incident reviews to break down what a postmortem actually is, what kind of outage earns one, who belongs in the room, and why the follow-through is the whole point.
Why this one matters:
An effective postmortem turns an outage into tangible fixes instead of finger-pointing, which is the difference between an incident you learn from and one you repeat.
Skip it and the same failure comes back, because nobody wrote down the contributing factors and nobody owned the action items.
It’s for anyone who owns uptime (which is MOST SysAdmins), and the discipline scales: sixty people in a war room or just you writing a summary for one nervous boss.
Watch on YouTube
Listen on Spotify
The Take
Running your own postmortem is half the skill. Reading everyone else’s is the other half, and almost nobody treats it that way.
Every vendor you depend on has a track record of how they behave when they break. Some of them tell you. In February, Clerk put out a postmortem that named the actual root cause: a Postgres auto-analyze flipped a query plan because the planner’s sample size was too low, the bad plan strangled the database, and roughly ninety minutes later a manual re-analyze brought it back. That is a real root cause, with the communication failures owned in the same document. You read it and you know exactly what happened and what they changed.
Now, that willingness to publish a specific cause is itself a signal, and in my experience it tracks closely with how a vendor treats you at 2am. When GitHub’s own engineering leadership publicly attributes a run of outages to tightly coupled architecture that let local problems cascade, and to systems that couldn’t shed load from misbehaving clients, that is a company telling you how it actually failed. Compare that to the outfits whose entire public record is “we experienced elevated error rates” followed by silence. One of those is a partner. The other is a liability you haven’t been billed for yet.
So make it part of procurement. Before you take a hard dependency, go read the vendor’s status history and their last two or three postmortems. Not the marketing uptime number, the writeups. If they don’t publish any, or if every one is corporate mush with no named cause, you already know how the next outage conversation goes. The postmortem discipline Episode 050 is about does not stop at your own perimeter.
Community Signal
Awesome community work worth your attention.
Lorin Hochstein - “Quick thoughts on GitHub CTO’s post on availability” - Hochstein is a resilience-engineering practitioner (ex-Netflix SRE) who writes some of the clearest incident analysis on the internet, and here he picks apart GitHub’s own writeup on its recent outages. It’s the perfect companion to this issue’s Take: this is what it looks like when a working engineer reads someone else’s postmortem and pulls the real lessons out of it.
Google SRE - “Postmortem Culture: Learning from Failure” - Chapter 15 of the free SRE book, and still the clearest articulation of blameless-not-unaccountable in print. If Episode 050 had you nodding, this is the reference to hand the skeptic on your team who thinks “blameless” means “nobody is responsible.”
Tool of the Week
OneUptime - an open-source, self-hostable reliability stack that rolls monitoring, on-call scheduling, status pages, and blameless postmortems into one platform.
The reason it earns the slot this week: it captures the incident timeline automatically while the fire is still burning, which is exactly the raw material a good postmortem needs and exactly what nobody remembers to collect by hand at 2am. Honest scope, though. This is a full platform, not a single container you set and forget. Self-hosting it is real infrastructure with a database and the works, so for a homelab it’s a weekend project, not a five-minute one; if you just want uptime checks, Uptime Kuma is the lighter tool. And to be straight with you: this is a smaller, less grassroots-proven community than a darling like Uptime Kuma, and it’s an all-in-one bet rather than a best-of-breed one. What earns it the slot is that it’s genuinely fully open on GitHub (Apache 2.0, not open-core), actively maintained, and its postmortem tooling actually exists instead of being bolted on.
Quick Win of the Week
Drop a real postmortem template into your team wiki or repo today, before you need it. We linked a markdown postmortem template that I’ve used over the years in the resources section of the last podcast episode. That version is free, CC BY 4.0, and version-tracked in the SysAdmin Weekly community repo here.
Pre-fill the parts that never change (your severity levels, escalation contacts, and the timeline table headers) so that when something does break, you’re filling in facts instead of formatting a document while the help-desk phone screams.
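If you would rather generate that pre-filled skeleton than hand-edit it, here is a minimal Python sketch of the idea. To be clear, this is not the show's template: the function name, the section headings, the timeline column names, and the sample severity levels and contacts are all made up for illustration, so swap in whatever your shop actually uses.

```python
from typing import Dict, List

# Hypothetical timeline columns; replace with the headers your team prefers.
TIMELINE_HEADERS = ["Time (UTC)", "Event", "Source", "Owner"]


def render_skeleton(severities: Dict[str, str], contacts: List[str]) -> str:
    """Return a markdown postmortem skeleton with the never-changing parts filled in."""
    lines = ["# Postmortem: <incident title>", ""]

    # Severity definitions rarely change, so they are written once, up front.
    lines += ["## Severity levels", ""]
    lines += [f"- {name}: {meaning}" for name, meaning in severities.items()]

    lines += ["", "## Escalation contacts", ""]
    lines += [f"- {contact}" for contact in contacts]

    # Empty timeline table: header row plus the markdown separator row.
    lines += ["", "## Timeline", ""]
    lines.append("| " + " | ".join(TIMELINE_HEADERS) + " |")
    lines.append("|" + "|".join(" --- " for _ in TIMELINE_HEADERS) + "|")

    # The parts you fill in with facts after the incident.
    lines += ["", "## Root cause", "", "TODO"]
    lines += ["", "## Action items", "", "- [ ] TODO (owner, due date)"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Illustrative values only.
    print(render_skeleton(
        {"SEV1": "full outage", "SEV2": "degraded service"},
        ["on-call primary", "on-call secondary"],
    ))
```

Redirect the output into a file in your wiki or repo once, commit it, and from then on the only thing you type during an incident is what happened.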
Fun Retro SysAdmin Fact
On January 15, 1990, AT&T’s long-distance network collapsed for about nine hours and blocked roughly half of all calls placed through it, an estimated $60 million in unconnected calls. The root cause was a single misplaced break statement in the C recovery code running on its 4ESS switches. Proof, 36 years early, that “blameless” exists because the cause is almost never a person.
Question I Got Asked (and the Real Answer)
The question: “I’m a one-person shop. Do postmortems even apply to me?”
The common (wrong) answer: “No, that’s enterprise process overhead. You already know what broke.”
The real answer: The document scales down to a paragraph; the discipline does not. You will not remember, three months from now, the exact order you tried things at 2am or why the obvious fix didn’t work. A five-minute writeup with a timeline, a root cause, and two action items you actually close is worth more to a solo admin than a forty-page template is to a team that files it and forgets it. Blameless is easy when the only name in the room is yours. The hard part, same as everywhere, is the follow-through.
Until Next Week
Size the tool to the task, write down what broke, and close the action item you keep meaning to close.
Stay Frosty,
Andy
SysAdmin Weekly



