Your DNS Changed and Nobody Told You. Here's the Nightly-Diff Pattern That Catches It.
A simple baseline-and-diff approach that sends you a heads-up before a silent record change becomes a Tuesday-afternoon incident.
It was a Tuesday at 2:17 PM, and the marketing team's contact form was returning 502s. Not 404. Not a timeout. A clean 502, which means something was answering, just not the thing it was supposed to be.
An hour in, I'd checked the app server logs, restarted the nginx process twice, confirmed the SSL cert was valid, and pinged our cloud provider's status page like it owed me money. Everything looked fine everywhere I looked. Then, almost by accident, I ran `dig +short www.company.com CNAME` and saw a hostname I didn't recognize. Something like `legacy-assets.decommissioned-vendor-name.com`.
The vendor had been off the account for four months. The CNAME had been quietly repointed to their infrastructure during the migration wind-down, sat there untouched, and then their old infrastructure finally went dark. Nobody changed our DNS intentionally. Nobody got notified when it happened. We found out when a sales rep tried to submit a lead form.
That was the day I stopped trusting that "nothing changed in DNS" was a statement anyone could actually verify.
Why DNS Is the Silent-Failure Layer of Every Infrastructure
DNS is configuration. For most teams, though, it isn't configuration that gets stored in a repo, linted on commit, or reviewed in a pull request. It lives in a registrar panel or a DNS provider dashboard, updated by humans who may or may not be following a change-control process, and it's completely invisible until something breaks.
Every other layer of your stack has some kind of drift detection built in these days. Config management tools track the desired state of your servers. Container orchestrators know what's supposed to be running. Infrastructure-as-code tools will tell you if something drifted from the Terraform state. DNS gets none of that by default. You get a text field in a web UI, a change that takes effect whenever the TTL expires, and exactly zero notifications.
The operational pattern most teams rely on is "we'll know when it breaks." And they're right. They will know. They'll know at 2 PM on a Tuesday when a customer reports it, after a sales lead gets lost, after the support team has spent 45 minutes ruling out everything else. The detection mechanism is user reports, which is among the worst possible monitoring strategies.
There's also a subtler problem. The change usually isn't malicious. It's not a security incident, at least not at first. It's a vendor cleanup, a platform migration, someone at a partner org tidying up their infrastructure without realizing your CNAME still pointed at them. It's the kind of change that feels harmless to whoever made it and catastrophic to whoever depends on it.
The fix isn't complicated. What you need is a declared source of truth for what your DNS should look like, a way to compare that against what it actually looks like right now, and something that runs that comparison regularly enough to catch drift before users do.
That's the pattern. The implementation fits in a bash script.
The Pattern, the Four States, and a Wrapper You Can Use Today
The idea is straightforward: declare your expected DNS state once in a YAML file, then run a script nightly that queries your authoritative nameservers and compares what it finds against what you declared. Any gap between the two gets reported.
The baseline file is the key piece. It's not generated. You write it manually, and that act of writing it is itself useful, because it forces you to actually look up what each record currently is and decide "yes, that's correct." Once it exists, it becomes your source of truth. Commit it to your repo. Update it when you make a legitimate DNS change. The baseline is always what you intend, and the script is always asking whether reality matches.
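To make the "look up what each record currently is" step less tedious, here is a minimal sketch that prints live values as candidate baseline lines. It does not replace the manual review: the point is still that a human reads each line and decides it is correct. The function names `lookup` and `baseline_candidates`, the `domain|TYPE|value` line layout, and the `192.0.2.53` resolver address are illustrative assumptions, not part of any existing tool.

```shell
#!/usr/bin/env bash
# Sketch: print live records as candidate baseline lines for manual review.
# Function names and the "domain|TYPE|value" layout are assumptions.

# Thin wrapper around dig so the lookup can be swapped out or stubbed.
# Arguments: domain, record type, resolver address.
lookup() {
    dig +short "@$3" "$1" "$2" 2>/dev/null || true
}

# baseline_candidates DOMAIN RESOLVER TYPE [TYPE...]
# Emits one "domain|TYPE|value" line per live value.
baseline_candidates() {
    local domain="$1" resolver="$2" rtype value
    shift 2
    for rtype in "$@"; do
        lookup "${domain}" "${rtype}" "${resolver}" | while IFS= read -r value; do
            if [ -n "${value}" ]; then
                printf '%s|%s|%s\n' "${domain}" "${rtype}" "${value}"
            fi
        done
    done
}

# Example (requires network access and dig; 192.0.2.53 is a placeholder):
#   baseline_candidates example.com 192.0.2.53 A MX TXT
```

Review the output line by line before pasting anything into your baseline. A record that is already wrong today will otherwise be enshrined as the expected state.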
When the diff runs, every record type for every domain you declared lands in one of four states:
MATCH means the live record matches the baseline exactly. This is the quiet result. Nothing to do.
NEW means a record exists in live DNS that isn't in your baseline. It could be a vendor auto-adding a TXT verification record. It could be someone provisioning a new subdomain. It could be something you should care about. The script surfaces it; you decide.
MISSING means your baseline declared a record that doesn't exist in live DNS anymore. An A record that was decommissioned without cleaning up. An MX record that got deleted. A CNAME that was removed when a vendor migrated their platform.
DRIFT means the baseline and live DNS both have records for a type, but the values don't match. This is the Tuesday-afternoon scenario: the CNAME target changed, the IP behind an A record flipped, the SPF policy was modified.
NEW and MISSING and DRIFT all mean something in your environment changed without you being told. The script exits nonzero when any of those occur, which makes it trivially composable with cron, alerting pipelines, or anything else that reads exit codes.
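The four states reduce to a small decision over two strings: what the baseline declares and what live DNS returns. The sketch below shows that decision in isolation. The function name `classify` and the convention that an empty string means "no record" are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch of the four-state classification. The function name and the
# "empty string means no record" convention are assumptions.

# classify EXPECTED ACTUAL
# Prints MATCH, NEW, MISSING, or DRIFT. Returns 0 only for MATCH,
# so callers can react to the exit status alone.
classify() {
    local expected="$1" actual="$2"
    if [ "${expected}" = "${actual}" ]; then
        echo "MATCH"
        return 0
    elif [ -z "${expected}" ]; then
        echo "NEW"        # live record with nothing declared
    elif [ -z "${actual}" ]; then
        echo "MISSING"    # declared record with nothing live
    else
        echo "DRIFT"      # both present, values differ
    fi
    return 1
}
```

For example, `classify "10 mail.example.com" ""` prints `MISSING` and returns 1, while `classify "93.184.216.34" "93.184.216.34"` prints `MATCH` and returns 0. Both inputs should already be normalized the same way before they reach a function like this.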
Here's a minimal working bash wrapper you can adapt right now. It keeps the dependencies to just `dig` and `bash`, uses a simple shell array for your expected records instead of parsing YAML, and is short enough to read in under five minutes:
#!/usr/bin/env bash
# dns-check.sh
# WHAT: Minimal DNS drift checker - declare expected records, diff against live DNS
# WHY: Catches silent DNS changes before they become incidents
# Usage: ./dns-check.sh
# Add to cron: 0 2 * * * /path/to/dns-check.sh || echo "DNS DRIFT DETECTED" | mail -s "DNS Alert" you@example.com
set -euo pipefail
# ── CONFIGURATION ────────────────────────────────────────────────────────────
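# Nameserver to query. 8.8.8.8 is a public recursive resolver, used here only as
# a placeholder; replace it with one of your domain's authoritative nameservers
# (list them with: dig +short NS example.com)
# WHY: Authoritative answers bypass resolver caches, so changes show up immediately
RESOLVER="8.8.8.8"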
# Declare expected records as: "domain|TYPE|expected_value"
# Get current values with: dig +short example.com A
# Run once to populate, then treat this as your source of truth
EXPECTED_RECORDS=(
"example.com|A|93.184.216.34"
"www.example.com|CNAME|example.com.cdn.cloudflare.net"
"example.com|MX|10 mail.example.com"
"example.com|TXT|v=spf1 include:_spf.google.com ~all"
)
# ── DIFF ENGINE ──────────────────────────────────────────────────────────────
DRIFT_FOUND=0
for record in "${EXPECTED_RECORDS[@]}"; do
    # Parse the declared record into its three parts
    domain="${record%%|*}"
    rest="${record#*|}"
    rtype="${rest%%|*}"
    expected="${rest#*|}"
    # Query live DNS at the configured resolver
    # WHY: +short gives us clean output; @resolver pins which nameserver answers
    # Normalize per line BEFORE joining: strip each value's trailing dot and the
    # quotes dig puts around TXT strings, then sort and join with "|"
    # "|| true" stops set -e/pipefail from aborting the run if dig itself fails;
    # the empty result is then reported as MISSING
    actual=$(dig +short "@${RESOLVER}" "${domain}" "${rtype}" 2>/dev/null \
        | sed -e 's/\.$//' -e 's/"//g' \
        | sort \
        | tr '\n' '|' \
        | sed 's/|$//') || true
    # Normalize expected the same way so both sides compare like for like
    expected_norm=$(printf '%s\n' "${expected}" \
        | sed -e 's/\.$//' -e 's/"//g' \
        | sort \
        | tr '\n' '|' \
        | sed 's/|$//')
    # Compare and classify the result
    if [[ -z "${actual}" ]]; then
        # Record existed in baseline but dig returned nothing: MISSING
        printf "MISSING %s %s (expected: %s)\n" "${domain}" "${rtype}" "${expected}"
        DRIFT_FOUND=1
    elif [[ "${actual}" != "${expected_norm}" ]]; then
        # Record exists but value changed: DRIFT
        printf "DRIFT %s %s\n expected: %s\n actual: %s\n" \
            "${domain}" "${rtype}" "${expected}" "${actual}"
        DRIFT_FOUND=1
    else
        # Values match: MATCH (silent - no output unless you add --verbose logic)
        : # nothing to report
    fi
done
# Exit nonzero on any drift - composable with cron, alerting, CI checks
if [[ "${DRIFT_FOUND}" -eq 1 ]]; then
printf "\nDrift detected. Review records above.\n" >&2
exit 1
fi
printf "All %d declared records match live DNS.\n" "${#EXPECTED_RECORDS[@]}"
exit 0
Save that, drop your actual records into `EXPECTED_RECORDS`, and run it once to confirm it sees what you expect. Then add it to your crontab:
# Run nightly at 2 AM, email on drift
0 2 * * * /path/to/dns-check.sh || echo "DNS drift detected on $(hostname)" | mail -s "[ALERT] DNS Drift" you@example.com
The "NEW record exists in live DNS" state isn't in this minimal version, since detecting it requires knowing which record types to scan for beyond what you declared. The four-state model handles that fully once you know which types to watch, which is what the complete kit covers. For a first pass, catching MISSING and DRIFT gets you most of the value.
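If you want a rough version of NEW detection in the meantime, one way to approach it is to pick a fixed list of record types to watch and report any type that returns a live answer without a matching baseline entry. This is a sketch under stated assumptions, not how the full kit does it: the names `lookup` and `report_new`, the newline-separated `domain|TYPE` key list, and the `192.0.2.53` resolver address are all hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: report record types that answer in live DNS but are not declared.
# Function names and the "domain|TYPE" key format are assumptions.

# Thin wrapper around dig so the lookup can be swapped out or stubbed.
# Arguments: domain, record type, resolver address.
lookup() {
    dig +short "@$3" "$1" "$2" 2>/dev/null || true
}

# report_new DOMAIN RESOLVER DECLARED_KEYS TYPE [TYPE...]
# DECLARED_KEYS is a newline-separated list of "domain|TYPE" keys.
report_new() {
    local domain="$1" resolver="$2" declared="$3" rtype live
    shift 3
    for rtype in "$@"; do
        # Skip types the baseline already covers for this domain
        if printf '%s\n' "${declared}" | grep -qxF "${domain}|${rtype}"; then
            continue
        fi
        live="$(lookup "${domain}" "${rtype}" "${resolver}")"
        if [ -n "${live}" ]; then
            printf 'NEW %s %s (live: %s)\n' "${domain}" "${rtype}" \
                "$(printf '%s' "${live}" | tr '\n' ' ')"
        fi
    done
}

# Example (requires network access and dig; 192.0.2.53 is a placeholder):
#   report_new example.com 192.0.2.53 "example.com|A" A AAAA CNAME MX TXT
```

This only sees names you already know about. A brand-new subdomain will not show up, because nothing in a query-based check can enumerate names it was never told to ask for.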
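A few practical notes. Use `dig +short` rather than plain `dig`, or you'll spend time parsing the human-readable output format. Always query a specific nameserver with `@resolver` rather than relying on your local resolver, and make it one of the domain's authoritative nameservers where you can, since a caching resolver can hide drift for as long as the record's TTL. Be careful with normalization: `dig +short` returns the MX priority as part of the value (`10 mail.example.com.`), ends hostnames with a trailing dot, and wraps TXT values in double quotes, so either strip those before comparing or write your expected strings exactly the way `dig` prints them. The wrapper also compares one declared value per name and type, so a name with several TXT or MX values will report DRIFT until you extend the comparison. And commit the script alongside your baseline declaration. If the baseline lives in the repo, you get history, diffs, and code review for DNS changes as a side effect.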
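The normalization advice is easier to get right when it lives in one small helper that both sides of the comparison pass through. Here is a minimal sketch, assuming `dig +short` style input with one value per line. The function name `normalize` is an assumption, and stripping every double quote is a simplification that would also remove quotes embedded inside a TXT value.

```shell
#!/usr/bin/env bash
# Sketch: normalize dig +short output so values compare stably.
# The function name is an assumption; input is one value per line on stdin.

normalize() {
    # 1. Strip each line's trailing dot (hostnames in CNAME and MX answers)
    # 2. Strip the double quotes dig prints around TXT strings
    # 3. Sort so answer order does not matter
    # 4. Join lines with "|" and drop the trailing separator
    sed -e 's/\.$//' -e 's/"//g' \
        | sort \
        | tr '\n' '|' \
        | sed 's/|$//'
}

# Example:
#   printf '20 mx2.example.com.\n10 mx1.example.com.\n' | normalize
#   prints: 10 mx1.example.com|20 mx2.example.com
```

The order of the steps matters. The trailing-dot strip has to run while each value is still on its own line, because once the lines are joined with `|` only the very last value ends the line.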
What Else Lives in the Full Kit
The script above covers the core pattern. The full DNS Drift Detector kit is what you reach for once you've outgrown the wrapper.
The main `dns-drift-detector.sh` handles all five record types: A, AAAA, CNAME, MX, and TXT. That last type matters more than it seems. TXT records are where SPF policies live, where DKIM selectors sit, where domain verification tokens accumulate. Quiet SPF drift can break your email deliverability for days before anyone notices. DKIM drift means legitimate mail starts hitting spam folders. These aren't hypothetical edge cases.
The color-coded output makes the diff results readable at a glance during incident response, and the `--quiet` flag strips all of that for cron-friendly logging where you only want the exit code to speak. There's a `--no-color` flag too, so piping to a log file doesn't fill it with ANSI escape sequences.
`install-cron.sh` is a one-command idempotent installer. It checks that `dig` is available, puts the scripts where they belong, creates a log directory, and writes the cron entry with a duplicate-guard marker so running it twice doesn't add the job twice. That kind of thing is boring to write and annoying to get wrong.
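To show what a duplicate-guard marker means in practice, here is a small sketch of the idea. It is not the kit's `install-cron.sh`. The function name `add_cron_once` and the marker text `# dns-drift-check` are hypothetical. The function reads an existing crontab on stdin and prints the updated one on stdout, so it never touches the real crontab by itself.

```shell
#!/usr/bin/env bash
# Sketch: add a cron entry only if its marker comment is not already present.
# Function name and marker text are assumptions, not the kit's installer.

# add_cron_once MARKER ENTRY
# Reads the current crontab on stdin, writes the resulting crontab to stdout.
add_cron_once() {
    local marker="$1" entry="$2" current
    current="$(cat)"
    if printf '%s\n' "${current}" | grep -qF -- "${marker}"; then
        # Marker found: leave the crontab exactly as it was
        printf '%s\n' "${current}"
        return 0
    fi
    if [ -n "${current}" ]; then
        printf '%s\n' "${current}"
    fi
    # Marker goes on its own comment line directly above the entry
    printf '%s\n%s\n' "${marker}" "${entry}"
}

# Example (writes to your real crontab, so inspect new_tab first):
#   new_tab="$(crontab -l 2>/dev/null | add_cron_once '# dns-drift-check' \
#       '0 2 * * * /path/to/dns-check.sh')"
#   printf '%s\n' "${new_tab}" | crontab -
```

Running it a second time over its own output returns the same text, which is the whole point of the marker.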
The `baseline.yaml` in the kit is annotated with examples for ten common services: Google Workspace MX, Cloudflare CDN CNAMEs, SendGrid SPF, common DKIM selectors, and a few others. It's the reference you use when you're populating your own baseline for the first time.
The runbook covers install, baseline setup, how to read the output, what to do for each of the four states, how to update the baseline after a legitimate change, and an FAQ. That last one matters during an incident, when you don't want to be making judgment calls about whether a NEW record means "update the baseline" or "call the registrar."
Try the Pattern Today
The pattern described here is genuinely useful as-is. Declare your records, schedule the diff, react to the exits. That alone puts you ahead of the "we'll know when it breaks" approach that most infrastructure environments are actually running.
If you want the full kit, it's at shop.asthegeeklearns.com/products/dns-drift-detector for $19. You get the complete `dns-drift-detector.sh` with all five record types, color output, quiet mode, and cron logging; the idempotent `install-cron.sh`; the annotated `baseline.yaml` with ten real-world service examples; and the full operator runbook.
The Tuesday-afternoon incident I described at the top cost more than $19 worth of everyone's time. The detection script would have caught it the night before.
As The Geek Learns is a newsletter about systems engineering, automation, and the gap between knowing something and actually applying it. If this was useful, subscribe for free to get new articles as they land.





