Infrastructure and its human operators exist in time and space. Many outages result from simply changing a running system, either because the change had unintended effects or because there was a mistake in planning or execution. Conversely, when a change is performed flawlessly, it often goes unseen and unknown.
Therefore, production changes are calendar events. Though we design our systems for idempotency and resiliency, that ideal isn’t always achievable, and sometimes a bug breaks it. The knowledge that a change is upcoming is an organizational power tool. It eliminates the first question everyone asks when something breaks: “What changed?” It also greatly shortens the debug cycle, because the collective organizational intuition has been passively heightened prior to the change and is able to draw connections.
It doesn’t hurt to also announce a change in chat, but a change to our infrastructure requires, at minimum, the same level of coordination we would put into a meeting.
The process looks like:
- Work already captured in a GitHub issue. No change originates in a vacuum. There should be a trail of activity and planning leading up to a change, even if it happens over minutes with one engineer. A single line of configuration can affect thousands of customers.
- Create an event in the Datum Engineering calendar. Link to the relevant GitHub issue from step 1, and invite any active participants or critical stakeholders as named attendees. Optionally include a Zoom link, depending on the blast radius of the change. Folks who don’t play an active role should be subscribed to the calendar for awareness.
- Keep the event updated with changes to the schedule. Changes to the plan for the work happen in the GitHub issue, but scope and timing should be reflected in the event.
When is this applicable?
Much of our release process is automated, so how does this work with frequent deployments? Do we create a calendar entry for every change?
We have an automated release process so that robots can do work for us, and we assume they are regularly doing that work, similar to how users are always changing the system by using it. The guidance here is for exceptional events that are generated by humans, or for releases so substantial that we want to be loud and specific about when we’re going to tell the robots to get to work. This level of change management is mostly for the types of changes that are difficult to automate or that we haven’t yet automated.
As always, use your judgment. The calendar isn’t limited to production changes. Anything you plan to implement with a blast radius large enough that you would likely announce it in Slack could probably use an entry with a little bit of notice.