DevOps Holiday Emergency Checklist
Do you have such a checklist? No? Not yet?
Maybe you should have one. Just one hour set aside for checking things over could save your holiday.
Here's my list, gleaned from many years with many clients. If I'm listing something, I've seen it. If even one of these things prompts you to remediate something before it becomes an issue, then it was worth it.
Printable PDF Checklist
- Do you have a tested and documented rollback process in place?
- Do you have a documented way to ship hotfixes if needed?
- Does your platform support a blue/green deployment strategy?
- Are you in a well-defined and understood code freeze window?
- Do you have any risky / high-impact changes that were just deployed or are about to be deployed before the holidays?
- Have you checked disk usage on all critical systems: databases, message brokers, CI runners, K8S masters/nodes, monitoring/logging/security servers?
- Are replicated services showing healthy replication status?
- Do any replicated services show large or increasing replication lags?
- Are all load balancer and gateway health checks working?
- Are auto-scaling policies correct for expected holiday traffic?
- Are all your pipelines healthy?
- Are all pipelines green on their last runs?
- Is all your IaC up-to-date and checked into repos?
- Have you run drift detection recently on your IaC to ensure that manual changes won't be overwritten in an emergency Terraform run?
- Is your CI/CD service (Azure DevOps, GitHub, GitLab) paid up and not about to expire?
- Have you reviewed your CI/CD guardrails to ensure that there can be no unauthorised deployments or changes that affect production?
- Do you have up-to-date contingency plans?
- Is your holiday on-call schedule agreed, documented and communicated?
- Does every on-call engineer have access to all the systems they might need?
- Do you have a documented and well-understood prioritisation schema for P-levels?
- Are escalation and communication paths clearly defined (customers, senior management, dev teams)?
- Are any SSL certs about to expire - web server, API gateway, load balancers, etc.?
- Are any access credentials about to expire?
- Are any secrets, service principals, service accounts about to expire?
- Do you have any deployment credentials / kubeconfigs / tokens about to expire?
- Do you have an up-to-date inventory of all your secrets (API keys, DB creds, service accounts, OAuth clients, Kubernetes tokens)?
- Are automatic and monitored rotation jobs for certs/tokens functioning correctly?
- Are admin and on-call accounts using enforced MFA/2FA, with tested backup methods?
- Are any temporary security exceptions still open that should be closed before holidays?
- Have you audited and validated your firewall rules recently?
- Are DNS records for APIs, auth endpoints, internal services, etc. correct, with low TTLs?
- Are any domains close to expiry?
- Are all tiles showing current logging & metric data?
- Are any graphs or tiles broken or empty?
- Are key SLOs / SLIs visible: error rates, latency, traffic, saturation (CPU/memory/queue depth)?
- Do you have "war room" dashboards for each critical system (web/API, DB, queue, payment, auth)?
- Are alert rules configured correctly?
- Are alerts configured with up-to-date contact information?
- Do you have alerts on disk space for DBs, logs, and main app servers?
- Are alerts for backup failures in place and going to a monitored channel?
- Are alerts going to any dead endpoints (former employees, dead Slack channels)?
- Are alerts so noisy that they risk drowning out the important ones?
- Are all your plans paid up to date - cloud, monitoring, alerting, third-party APIs, etc.?
- Are payment methods for critical services valid and not about to expire?
- Are spending/budget alerts configured so you don’t hit hard limits or quota caps during holiday peaks?
- Are there any license expiries or contract renewals that could hit over the holidays?
- Have you verified holiday traffic forecasts against current capacity (web, API, DB, queue, cache, etc.)?
- Do you have rate limiting / backpressure in place to protect critical services under extreme load?
- Are database connection pools / limits configured and tested under peak load to avoid pool exhaustion?
- Are backups completing successfully, and have you verified via logs, not just dashboards?
- Do you have up-to-date backups ready to restore if needed?
- Do you have well-documented and tested restore procedures for data and storage systems?
- Is there enough free disk space on DBs, logging/monitoring systems, CI runners and application servers for the holiday period?
- Are log rotation jobs and cleanup tasks working and monitored to keep disk usage low?
- Are jobs for cleanup, compaction, archiving running and monitored for failures?
- Do you have an inventory of ALL critical third-party services (payments, auth, email/SMS, analytics, CDNs, DNS, storage, CI SaaS, etc.)?
- Are you subscribed to third-party status / incident notifications?
- Do you have up-to-date contact information for all third-party services?
- Do you have escalation procedures in place with all third-party services?
- Do you have alternative paths / fallbacks if a third-party service goes down?
- Can your platform handle third-party service outages without falling apart?
- Do you have up-to-date runbooks, especially for top incidents you've already encountered?
- Do all on-call engineers know exactly where all runbooks and other documentation are?
- Is there a simple architecture overview and dependency map for on-call to orient themselves quickly?
- Have you done a quick exercise: "What if production goes down over the holidays?"
- Have you tested paging, i.e. triggering a test alert to verify people actually get notified on their devices?
- Have you recently tested critical rollback or config changes?
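Two of the checks above, disk usage and certificate expiry, are easy to script. Here is a minimal Python sketch using only the standard library. The function names, the 80% disk limit and the 30-day certificate window are my own assumptions for illustration, not recommendations from any tool or vendor, so adjust them to your environment.

```python
import shutil
import socket
import ssl
from datetime import datetime, timezone

# Assumed thresholds for illustration only; tune them to your environment.
DISK_LIMIT_PERCENT = 80.0
MIN_CERT_DAYS = 30


def disk_used_percent(path: str = "/") -> float:
    """Return the percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def cert_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Fetch the 'notAfter' field of the certificate served by host:port.

    The default context verifies the certificate, so an already expired or
    otherwise invalid certificate raises an ssl.SSLError during the handshake.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until(not_after: str, now: datetime) -> int:
    """Whole days between `now` (timezone-aware) and a 'notAfter' timestamp."""
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).days


def findings(disk_percent: float, cert_days: int) -> list[str]:
    """Return a human-readable line for every threshold that is breached."""
    problems = []
    if disk_percent > DISK_LIMIT_PERCENT:
        problems.append(
            f"disk {disk_percent:.0f}% used (limit {DISK_LIMIT_PERCENT:.0f}%)"
        )
    if cert_days < MIN_CERT_DAYS:
        problems.append(
            f"certificate expires in {cert_days} days (minimum {MIN_CERT_DAYS})"
        )
    return problems
```

To use it, call `cert_not_after()` with one of your own hostnames, pass the result to `days_until()` together with `datetime.now(timezone.utc)`, and feed that plus `disk_used_percent("/")` into `findings()`. An empty list means neither threshold was breached. This only covers the host the script runs on and the endpoints you point it at, so it is a supplement to proper monitoring, not a replacement for it.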
If you have things to add based on your experiences, I'd love to hear about it. Drop me a line at welcome@ondemanddevops.com
Too late? Facing an emergency? I can help. See my Emergency Services Page.
Lajos Moczar - 28/12/2025