As The Geek Learns

Managing Anthropic Agent SDK Costs: A Post-June 15 Billing Playbook

James Cruce — Sat, 16 May 2026 20:48:17 GMT

Your background agents are about to run out of money. Anthropic's new credit pool system means your automation could die in a single week. Here is how I re-engineered my stack to stay under budget without breaking my workflows.

The Setup

You've built a small fleet of agents. They sort your mail, watch your repos, file your daily briefings.

My current setup before the June 15th cutover:

Then May 13 lands, and Anthropic announces the change: on June 15, every programmatic Claude call moves into a metered monthly credit pool. $100 a month on Max 5x. No rollover.

Run the math against your actual schedule. If you've got anything polling on the order of minutes (cron pipelines, hourly digests, watchdog sweeps), that pool drains in 7 to 10 days. And here's the kicker. Your interactive Claude Code keeps working. Your headless automation just stops. You wake up to a dead pipeline, a drained pool, and a subscription that still says active.
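To run that math against your own schedule, a back-of-the-envelope sketch like the one below is enough. The job name and the per-run cost are made-up placeholders, not measured figures; swap in the averages you see in your own usage data.

```typescript
// Rough pool-drain estimate. costPerRunUsd is an assumed placeholder:
// replace it with the average cost you observe for your own agent runs.
interface ScheduledJob {
  name: string;
  intervalMinutes: number; // how often the job fires
  costPerRunUsd: number;   // hypothetical average cost of one run
}

function dailySpendUsd(jobs: ScheduledJob[]): number {
  const minutesPerDay = 24 * 60;
  return jobs.reduce(
    (sum, job) => sum + (minutesPerDay / job.intervalMinutes) * job.costPerRunUsd,
    0,
  );
}

function daysUntilDrained(poolUsd: number, jobs: ScheduledJob[]): number {
  const perDay = dailySpendUsd(jobs);
  return perDay === 0 ? Infinity : poolUsd / perDay;
}

const every15Min: ScheduledJob[] = [
  { name: "watchdog-sweep", intervalMinutes: 15, costPerRunUsd: 0.12 },
];
const hourly: ScheduledJob[] = [
  { name: "watchdog-sweep", intervalMinutes: 60, costPerRunUsd: 0.12 },
];

console.log(daysUntilDrained(100, every15Min).toFixed(1)); // 96 runs/day
console.log(daysUntilDrained(100, hourly).toFixed(1));     // 24 runs/day
```

With these placeholder numbers, a single 15-minute job empties a $100 pool in under nine days, and the same job run hourly lasts roughly four times as long.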

What's Actually Going On

This isn't just a random pricing tweak. There is a clear economic driver here. Throughout early 2026, many third-party tools used the Agent SDK at a $20 Pro subscription rate to run workloads that would cost hundreds at standard API rates. It was essentially compute arbitrage at scale.

Anthropic started cracking down in April, but the May 13 announcement is the structural fix. They are moving to dedicated monthly credit pools to restore access under metered billing. The reality is that most agentic operating systems are built directly on the Agent SDK. Because these agents lack a human in the loop to throttle their usage, they are now metered by default. Interactive sessions stay on the flat-rate subscription because the human provides the natural brake. Programmatic agents do not.

The June 15th Split

The Fix

I implemented a two-phase mitigation to deploy before the June 15 deadline.

Phase 1 was a hot patch designed to provide immediate protection. I added a BILLING_MODE environment variable with three states: unmetered, metered, and paused. The paused state blocks every programmatic call across all providers, while metered enforces a strict cap on the Anthropic route.

Billing Mode Cap

I also added a file-backed JSON ledger at store/billing-ledger.json to track monthly costs. It uses a write-then-rename pattern to ensure crash safety during updates. To handle errors, I introduced a BillingCapExceeded error class. Callers detect it with instanceof, the same pattern as my KillSwitchRefusal logic, rather than by matching on message text, so a typo in a message cannot accidentally trigger a retry loop.
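A minimal sketch of that ledger and error class might look like the following. The ledger shape (one running total per month key, computed in UTC), the helper names, and the retry check are assumptions for illustration; only BillingCapExceeded and the write-then-rename idea come from the text.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Thrown by the billing gate. Callers branch on `instanceof`, not on message text.
class BillingCapExceeded extends Error {
  constructor(message: string) {
    super(message);
    this.name = "BillingCapExceeded";
  }
}

// Hypothetical ledger shape: one running total per "YYYY-MM" key (UTC).
type Ledger = Record<string, number>;

function monthKey(date: Date): string {
  return date.toISOString().slice(0, 7);
}

function readLedger(file: string): Ledger {
  if (!fs.existsSync(file)) return {};
  return JSON.parse(fs.readFileSync(file, "utf8")) as Ledger;
}

// Write to a temp file next to the target, then rename over it. A rename within
// one filesystem is atomic on POSIX, so a crash leaves the old or the new ledger.
function recordCost(file: string, costUsd: number, now: Date = new Date()): number {
  const ledger = readLedger(file);
  const key = monthKey(now);
  ledger[key] = (ledger[key] ?? 0) + costUsd;
  const tmp = `${file}.${process.pid}.tmp`;
  fs.writeFileSync(tmp, JSON.stringify(ledger, null, 2));
  fs.renameSync(tmp, file);
  return ledger[key];
}

// A retry wrapper can refuse to retry billing refusals without parsing strings.
function shouldRetry(err: unknown): boolean {
  return !(err instanceof BillingCapExceeded);
}

// Usage: a throwaway ledger in a temp directory.
const demoFile = path.join(fs.mkdtempSync(path.join(os.tmpdir(), "ledger-")), "billing-ledger.json");
recordCost(demoFile, 1.25);
console.log(readLedger(demoFile));
```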

The logic lives in a single chokepoint: runAgent() in src/agent.ts. The pre-call gate checks the cap, and the post-call gate records result.totalCostUsd from the SDK, firing a Telegram alert if a threshold is crossed. As a final safety measure, I cut the cadence on my two highest-frequency tasks: the pipeline-advance cron moved from 15 minutes to hourly, and I paused the council-evening task entirely under metered mode.

Dispatcher Flow

// src/config.ts — tri-state env that gates programmatic agent calls
export const BILLING_MODE = optional('BILLING_MODE', 'unmetered');
export const BILLING_CAP_USD = number('BILLING_CAP_USD', 80);

// src/agent.ts — pre-call gate in the dispatcher
function assertBillingAllowed(provider: Provider): void {
  if (BILLING_MODE === 'paused') {
    throw new BillingCapExceeded(
      'BILLING_MODE=paused — programmatic agent calls are disabled.',
    );
  }
  if (provider === 'anthropic' && BILLING_MODE === 'metered') {
    const total = getMonthlyTotal();
    if (total >= BILLING_CAP_USD) {
      throw new BillingCapExceeded(
        `Anthropic monthly credit cap reached: $${total.toFixed(2)} >= $${BILLING_CAP_USD.toFixed(2)}.`,
      );
    }
  }
}

export async function runAgent(opts: AgentOptions): Promise {
  assertEnabled('AGENTS_ENABLED');
  const provider: Provider = opts.provider ?? 'anthropic';
  assertBillingAllowed(provider);

  if (provider === 'ollama') return runOllamaAgent(opts);
  if (provider === 'codex') return runCodexAgent(opts);
  return runAnthropicAgent(opts);
}
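The snippet above covers the pre-call gate only. For the post-call half, one way to decide when to alert is to check which thresholds a single call has just pushed the monthly total across. This is a sketch: the function name and the 50/80/100 percent thresholds are assumptions, not the author's values.

```typescript
// Hypothetical post-call check: returns the alert thresholds (as fractions of the
// cap) that this call has just crossed, so each alert fires once per crossing.
function crossedThresholds(
  previousTotalUsd: number,
  callCostUsd: number,
  capUsd: number,
  fractions: number[] = [0.5, 0.8, 1.0],
): number[] {
  const newTotal = previousTotalUsd + callCostUsd;
  return fractions.filter(
    (f) => previousTotalUsd < f * capUsd && newTotal >= f * capUsd,
  );
}

// A $0.60 call that moves the month from $39.70 to $40.30 crosses the 50% line of an $80 cap.
console.log(crossedThresholds(39.7, 0.6, 80)); // [ 0.5 ]
```

Comparing the total before and after the call, rather than only the new total, keeps the alert from repeating on every subsequent call once a threshold has been passed.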

Phase 2 focuses on the long-term router infrastructure. I promoted runAgent() from a direct SDK caller to a dispatcher that can route across anthropic, ollama, and codex providers. I also extended the agent.yaml schema with provider: and local_model: fields.

I shipped a single-turn Ollama runner that wraps the local-LLM client. It returns totalCostUsd: 0 and a model tag like ollama:llama4:scout. I deliberately avoided tool calls in this initial version to keep the scope small.
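A single-turn runner along those lines could be sketched as below. AgentResult, FetchLike, and runOllamaOnce are illustrative names rather than the real codebase's types; the request uses Ollama's /api/generate endpoint with streaming turned off, and the fetch implementation is passed in so the function can be exercised without a running daemon.

```typescript
// Minimal single-turn runner sketch. AgentResult and FetchLike are assumed shapes.
interface AgentResult {
  text: string;
  totalCostUsd: number;
  model: string;
}

type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

async function runOllamaOnce(
  prompt: string,
  localModel: string,
  fetchImpl: FetchLike,
  baseUrl = "http://localhost:11434",
): Promise<AgentResult> {
  const res = await fetchImpl(`${baseUrl}/api/generate`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    // stream: false asks Ollama for one JSON object instead of a chunked stream.
    body: JSON.stringify({ model: localModel, prompt, stream: false }),
  });
  if (!res.ok) throw new Error(`Ollama returned HTTP ${res.status}`);
  const data = (await res.json()) as { response?: string };
  return {
    text: data.response ?? "",
    totalCostUsd: 0, // local inference never touches the metered pool
    model: `ollama:${localModel}`,
  };
}
```

In Node 18 or later the built-in fetch can be passed as the third argument, for example runOllamaOnce("Summarize this log", "llama4:scout", fetch).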

# agents//agent.yaml — new fields, validated at load
id: scout
name: SCOUT
model: claude-sonnet-4-6
provider: anthropic        # default. flip to 'ollama' to route locally.
# local_model: llama4:scout  # used when provider: ollama
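Load-time validation for the two new fields might look like this sketch. It takes the already-parsed YAML object, so any YAML parser works in front of it. The function name and the rule that an ollama agent must set local_model are assumptions; the source only says the fields are validated at load.

```typescript
const PROVIDERS = ["anthropic", "ollama", "codex"] as const;
type Provider = (typeof PROVIDERS)[number];

interface AgentRouting {
  provider: Provider;
  localModel?: string;
}

// raw is the parsed agent.yaml document.
function validateRouting(raw: Record<string, unknown>): AgentRouting {
  const provider = raw.provider ?? "anthropic"; // same default as the dispatcher
  if (typeof provider !== "string" || !PROVIDERS.some((p) => p === provider)) {
    throw new Error(`agent.yaml: unknown provider ${JSON.stringify(provider)}`);
  }
  const localModel = raw.local_model;
  if (localModel !== undefined && typeof localModel !== "string") {
    throw new Error("agent.yaml: local_model must be a string");
  }
  // Assumed rule: an ollama agent must name the model it routes to.
  if (provider === "ollama" && !localModel) {
    throw new Error("agent.yaml: provider 'ollama' requires local_model");
  }
  return { provider: provider as Provider, localModel };
}
```

Failing at load rather than at dispatch means a typo in provider: surfaces when the daemon starts, not when the first scheduled task fires.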

To be honest, I did not actually flip any agents to Ollama in this specific PR. The agents I need to move, like STEWARD or WATCHMAN, execute Bash and SQLite queries. A local runner without tool-call support would break them silently. Building a proper tool-call shim takes a few more days, but the cadence reduction and the billing breaker alone are enough to keep my spend under $80 per month.

Machine State

Why This Matters

Every person using an agent OS is in the same boat. Whether you use ClaudeClaw, Cline, Aider, or Roo Code, the underlying SDK is the same, and the June 15 cliff is approaching. The playbook I used generalizes: you need one chokepoint, one ledger, and one way to audit your cadence.

We also need to be honest about workload requirements. Tasks like editorial review or complex code deliberation still justify the Sonnet price tag. However, simple tasks like classification, routing, or summarization run perfectly fine on a local model with zero metered cost. The router infrastructure makes this migration a simple config flip rather than a massive code refactor.

Finally, this reflects where the industry is heading. OpenAI has used usage-based pricing for a long time, and GitHub Copilot is moving toward credit pools. In the next year, more vendors will split consumption between interactive flat-rate plans and programmatic metered usage. Building this abstraction now means you won't have to scramble the next time a vendor changes their terms.

Quick Reference

Single Chokepoint: Ensure every agent call flows through one function. This turned a three-week refactor into a one-week job.
Cadence over Architecture: Reducing task frequency (e.g., 15m to 1h) cuts spend faster than migrating to local models.
Ship the Breaker First: Implement the cost ledger and the BillingCapExceeded error as insurance before you attempt the complex provider migration.

3 Rules to Survive

# The cutover, June 14: flip the env, restart, reseed, smoke-test
BILLING_MODE=metered
BILLING_CAP_USD=80

# then
launchctl kickstart -k gui/$(id -u)/com.claudeclaw.app
npm run pipeline -- schedule-advance
npm run schedule -- pause council-evening

Found this useful? I share practical lessons from my systems engineering journey at As The Geek Learns


ChatGPT Just Invented an Entirely Fake Version of My MCP Server

James Cruce — Fri, 08 May 2026 12:03:13 GMT

I asked ChatGPT to tell me about my own MCP server. It returned about a thousand words of confident, beautifully formatted, completely fabricated nonsense. Tables. Comparisons. A made-up acronym. A "thinking substrate" that sits above data and below agents. None of it is real, and that's the part worth talking about.

The Setup

My project is called `mcp-astgl-knowledge`. It's an MCP server with 15 tools for searching my newsletter articles, backed by sqlite-vec and Ollama. The whole thing fits on a laptop. ASTGL stands for "As The Geek Learns," which is the name of this newsletter. I wrote it. I shipped it. There is a public GitHub repo and a public package.json.

So when a friend asked me what the MCP server actually does, I figured I'd see how each big AI assistant explained it. ChatGPT was first up. I typed in "ASTGL MCP Knowledge" and hit enter.

What I got back wasn't an answer. It was a hallucination wearing the suit of an answer.

"ASTGL (Abstract Semantic Task Graph Layer) MCP Knowledge Server is an emerging MCP server focused on structured knowledge representation and reasoning... it turns knowledge into graph-based, machine-reasonable structures that agents can query and evolve."

That paragraph alone has three fabrications: the acronym expansion (made up), the "graph-based, machine-reasonable structures" (the server stores text chunks with vector embeddings, no graph), and "evolve" (the index is static, refreshed every six hours by a cron job, agents do not edit it).

Then it kept going. A four-row "MCP stack" table positioning ASTGL as "the thinking substrate" between data and agents. A comparison matrix against fictional products called "Totem" and "SwarmClaw" that don't exist. A capabilities list including "task decomposition" and "reasoning over structure." Use cases. "Real-world examples." A confident sign-off: "If AST-grep is about seeing code better, then ASTGL is about thinking better."

Every word of it written with the calm, structured, lightly-emoji'd authority that makes ChatGPT sound right by default.

What ChatGPT said versus what’s actually shipping

What's Actually Going On

When you ask an LLM about a topic it doesn't have indexed, it has two options: say "I don't know," or fill in the gap with something plausible. In practice, models default to the second one. They're trained to be helpful, and "I don't know" reads as unhelpful. So the gap gets filled.

The result is what I'd call a fluency hallucination. The output has no factual grounding, but the writing is structured well enough that a casual reader can't tell. There are bullet points. There are tables. There's a "👉 In plain terms" callout. The rhetorical scaffolding looks like a real explainer because it's been pattern-matched to one. The contents underneath are pure fiction.

Three states an under-indexed creator can be in. Only one is actionable.

This is a worse failure mode than search engines have. When Google doesn't know about you, you don't appear in results, and the user can see the gap. When an LLM doesn't know about you, the user gets a beautifully written description of someone the LLM made up, and your real work is still missing, but now there's a fake version sitting in front of it.

For under-indexed creators (which, right now, is most of us), this is the default. Not the edge case.

Two paths from the same question. The model picks the second one by default.

The Fix

There's no quick patch for this on the engine side. The model isn't broken. It's doing what it was trained to do. The only handle I have is on my own side: make sure my real content reaches the retrieval surface, and measure whether it's working.

So I built a citation tester. It's a small TypeScript script that hits Perplexity, Claude, and ChatGPT through their APIs, asks each one twenty target questions tied to articles I've already published, and parses the cited URLs from the response. If `astgl.ai` shows up, that's a hit. If it doesn't, that's the data.
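The hit test itself is small. A sketch, assuming the cited URLs have already been pulled out of each engine's response into a plain string array (the function name is illustrative):

```typescript
// Does any cited URL point at the target domain, and at what position?
// Position is 1-based; 0 means "not cited".
function citationPosition(citedUrls: string[], domain: string): number {
  for (let i = 0; i < citedUrls.length; i++) {
    let host: string;
    try {
      host = new URL(citedUrls[i]).hostname.toLowerCase();
    } catch {
      continue; // skip anything that is not a parseable absolute URL
    }
    // Match the domain itself or any subdomain, but not look-alikes such as "notastgl.ai".
    if (host === domain || host.endsWith(`.${domain}`)) return i + 1;
  }
  return 0;
}

console.log(citationPosition(
  ["https://example.com/a", "https://www.astgl.ai/p/some-article"],
  "astgl.ai",
)); // 2
```

Comparing parsed hostnames rather than searching the raw string avoids counting a URL that merely mentions the domain in its path or query.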

First automated weekly run. Zero citations across 59 successful queries.

The point isn't that the floor is bad. I knew it would be. The point is that without a number, "improve our AEO" is a vibe, not a project. Every Monday at 9am the script runs again, writes a fresh row to a SQLite table, and tells me whether the floor moved. When it does move, I'll know which engine moved first, on which questions, and at what citation position. That's the actual feedback loop.

Same root cause as the hallucination: my content isn't reaching the retrieval surface. Same fix: get it there. Different observability.

Sixty queries, three engines, one row per result. About two minutes of API time.

Why This Matters

If you write online and you care whether AI assistants represent you accurately, this is the thing to internalize: the alternative to being cited is not being silent. It's being replaced.

Replaced by a confident summary of work you didn't do, opinions you don't hold, and product features you'd never ship. People who ask an LLM about your work and read its answer don't know they're reading fiction. They walk away with a model of you that you didn't write.

The traditional AEO playbook talks about ranking, authority, and citation rate. All real, all worth measuring. But there's a tier underneath that, and it's the one most independent creators are stuck on right now: existence. Until your content is in the index, ranking doesn't apply. You aren't competing with anyone. You're competing with the LLM's imagination of you.

The four steps that turn an unknowable problem into a measurable one.

Measurement is the cheapest part of fixing it, and it's the part most people skip.

Quick Reference

Four things that matter, in order:

1. Pick 20 questions your articles should answer. Tie each one to a specific URL on your site.

2. Hit each engine via API weekly. Perplexity returns a `citations[]` array. Claude returns search results in `web_search_tool_result` blocks. OpenAI returns `url_citation` annotations on `output_text` items.

3. Record the result to a small database, not a spreadsheet. You want trend data, not a snapshot.

4. Look at the floor first. Zero is a fine starting number as long as you're tracking it.
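Once rows accumulate, the number to watch is the citation rate per engine per run. A sketch of that aggregation over in-memory rows follows; the row shape is an assumption, since the actual table schema isn't shown here.

```typescript
interface CitationRow {
  runDate: string;  // ISO date of the weekly run
  engine: string;   // e.g. "perplexity", "claude", "chatgpt"
  question: string;
  cited: boolean;   // did the target domain appear in the citations?
}

// Citation rate per engine for one run: hits divided by successful queries.
function citationRateByEngine(rows: CitationRow[], runDate: string): Record<string, number> {
  const hits: Record<string, number> = {};
  const totals: Record<string, number> = {};
  for (const row of rows) {
    if (row.runDate !== runDate) continue;
    totals[row.engine] = (totals[row.engine] ?? 0) + 1;
    hits[row.engine] = (hits[row.engine] ?? 0) + (row.cited ? 1 : 0);
  }
  const rates: Record<string, number> = {};
  for (const engine of Object.keys(totals)) {
    rates[engine] = hits[engine] / totals[engine];
  }
  return rates;
}
```

Comparing the output for two consecutive run dates shows which engine moved first.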

The full script I'm using, including the gotcha where Node's `--env-file` silently dropped my Anthropic key on a fresh keypair, is in the repo. The article about the Anthropic key bug is coming separately.

The Ollama Model-Swap Death Spiral That Killed Every Cron at Once

James Cruce — Wed, 06 May 2026 13:03:19 GMT

3 a.m. Every cron job on the Mac Studio failed inside the same 90-second window. No code changes. No model updates. No new jobs. Just a wall of timeout errors that lit up every channel I had wired to alerts. The culprit was hiding in plain sight: a fallback chain doing exactly what I told it to.

The Setup

One Mac Studio. One Ollama daemon. A handful of cron jobs each calling the local LLM for different tasks: code review, log summarization, doc indexing, a nightly digest. Each cron specified a preferred model. Each one inherited a "be resilient" fallback chain from the task router: try the preferred model, fall back to a smaller one, fall back to a tiny one if both fail.

It looked clean on paper. Big model for the smart stuff, smaller model when the big one chokes, tiny model as a safety net. Classic graceful degradation. The kind of pattern you'd put in a "production-ready" checklist without thinking twice.

The models on disk ranged from 4GB to 22GB. Loading the big one into VRAM took roughly 60 seconds cold. Generation, once warm, took 5 to 10 seconds. Guess which number I used to set the timeout.

What's Actually Going On

Here's the cascade. Cron A fires at 3:00:00 and asks for `qwen2.5-coder:32b`. The model isn't loaded. Ollama spends the entire 30-second timeout just paging the weights into VRAM. It never gets to generation. The request fails. The fallback chain kicks in and asks for `qwen2.5-coder:14b`. Ollama evicts the half-loaded 32b, starts loading the 14b. Another 30 seconds gone. Fallback again. Tiny model loads, finally generates. Cron A "succeeds" with degraded output.

Meanwhile, Cron B fires at 3:00:15 expecting the 32b model that Cron A's first attempt was loading. Now there's a tiny model in VRAM instead. Cron B starts the same dance from a different starting point. Cron C lands on top of that. Within 90 seconds, every cron is waiting on a model swap that the next cron is about to invalidate.

The fallback chain wasn't degrading gracefully. It was thrashing the VRAM and guaranteeing nobody finished. Every safety net I'd added was making the failure worse.

Model Swap Cascade

The Fix

Two changes. No clever code. Just operational discipline.

First, pin one model in VRAM with `keep_alive: 24h`. This is a request-level option that tells Ollama to keep the model loaded for 24 hours after the response instead of evicting it. The default is to unload after 5 minutes of idle, and that eviction is what lets the next caller's load attempt thrash everything.

# Pin model in VRAM with keep_alive
curl -s http://localhost:11434/api/generate -d '{
  "model": "qwen2.5-coder:32b",
  "prompt": "test",
  "keep_alive": "24h"
}'

Second, force every frequent cron to use that same pinned model. Kill the fallback chain for hot-path workloads. Fallback is fine for one-off scripts you run by hand. It's poison when three crons fire in parallel against shared VRAM.

To make sure the model is loaded before any cron fires, I added a LaunchAgent that runs the warm-up curl on boot:


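<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>com.local.ollama-warmup</string>
  <key>RunAtLoad</key>
  <true/>
  <key>ProgramArguments</key>
  <array>
    <string>/usr/bin/curl</string>
    <string>-s</string>
    <string>http://localhost:11434/api/generate</string>
    <string>-d</string>
    <string>{"model":"qwen2.5-coder:32b","prompt":"warmup","keep_alive":"24h"}</string>
  </array>
</dict>
</plist>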

Load it with `launchctl load ~/Library/LaunchAgents/ollama-warmup.plist`. Now the model is hot before login completes. Every cron hits a warm model and finishes in the 5-to-10-second window the timeouts were designed for.

Result: zero model-swap thrashing since the change. Crons that used to fail intermittently now run consistently.

VRAM Thrashing

Why This Matters

The lesson isn't about Ollama. It's about cold-load math. Anytime your "graceful degradation" path is slower than your timeout, every retry makes the next caller's situation worse. Fallback chains assume the fallback is fast. Model loads aren't fast. Database failovers aren't fast. Cold containers aren't fast.
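The cold-load math can be made concrete with a small simulation. The 60-second load for the 32b model and the 30-second timeout come from the story above; the load and generation times for the other two models are illustrative guesses, and the model assumes a fully cold start with one caller.

```typescript
interface FallbackStep {
  model: string;
  coldLoadSeconds: number; // time to page weights in when the model is not resident
  generateSeconds: number; // time to answer once the model is warm
}

// Walk a fallback chain from a cold start. Every step that cannot load and
// generate inside the timeout burns the full timeout before the next is tried.
function simulateColdChain(chain: FallbackStep[], timeoutSeconds: number) {
  let elapsed = 0;
  for (const step of chain) {
    const needed = step.coldLoadSeconds + step.generateSeconds;
    if (needed <= timeoutSeconds) {
      return { servedBy: step.model, elapsedSeconds: elapsed + needed };
    }
    elapsed += timeoutSeconds;
  }
  return { servedBy: null, elapsedSeconds: elapsed };
}

// Times for the 14b and the tiny model are assumed values.
const fallbackChain: FallbackStep[] = [
  { model: "qwen2.5-coder:32b", coldLoadSeconds: 60, generateSeconds: 8 },
  { model: "qwen2.5-coder:14b", coldLoadSeconds: 35, generateSeconds: 6 },
  { model: "tiny", coldLoadSeconds: 10, generateSeconds: 3 },
];

console.log(simulateColdChain(fallbackChain, 30)); // tiny model answers after two burned timeouts
console.log(simulateColdChain(fallbackChain, 90)); // the 32b answers on the first attempt
```

With a 30-second timeout the caller waits 73 seconds for degraded output; with a timeout that covers the cold load it waits 68 seconds for the model it asked for. The simulation ignores the second caller, whose arrival makes the short-timeout case worse still.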

Operational discipline beats clever code here. One hot model, no swaps, every cron pointed at the same target. The "less resilient" design is actually more reliable because it removes the failure mode entirely.

If you're running local LLMs on shared hardware, assume VRAM is a single resource that gets thrashed under parallelism. Pin what matters. Warm it before it's needed. Don't trust fallback chains during peak hours.

Quick Reference

Cold model load on a 20GB+ model: roughly 60 seconds
Warm generation: 5 to 10 seconds
Default Ollama eviction: 5 minutes of idle
Pin a model: `keep_alive: 24h` in the API request body
Warm-up on boot: LaunchAgent (macOS) or systemd unit (Linux)
Hot path rule: one model, no fallback, same model across every concurrent caller
Reserve fallback chains for interactive, single-caller use

I Killed OpenClaw and Built ClaudeClaw Mission Control

James Cruce — Sat, 02 May 2026 23:01:21 GMT

Two months ago I wrote about ripping Notion out of my workflow and replacing it with OpenClaw—a self-hosted AI agent framework running on my Mac Studio. No cloud. No subscription. No black box.

Last weekend I shut it down. Disabled 38 cron jobs. Moved 23 LaunchAgents into a _retired-openclaw/ quarantine folder. Killed the Ollama daemon. Archived the directory with a 30-day deletion timer.

Everything in that original article still reads as true. Local-first is still right. Data ownership is still right. The critique of SaaS “well-enough” software is still right. What I got wrong was believing OpenClaw was the right vehicle for any of it.

This is the post-mortem and the replacement: an agent OS I built on top of the Claude Agent SDK called ClaudeClaw Mission Control. Thirteen themed agents. One daemon. A scheduler I can actually see into. Zero silent failures slipping past me for a week before I notice.

POST-MORTEM 2026-05-02

Let me explain how I got here.

The Setup

OpenClaw was doing real work. 38 cron jobs. Morning briefings. Evening summaries. A content pipeline that pulled research from web sources, structured it, scored it, and queued articles for ASTGL. An email triage pass. A model-usage monitor. A nerve-health monitor watching the other monitors.

On paper: impressive. In practice: I had no idea if any of it was working.

The system was so noisy that when something broke, I learned about it four days later when I noticed my morning briefing hadn’t arrived. Or I didn’t learn about it at all, because the cron job was exiting 0 while the script inside it was crash-looping.

That last one is the killer. Let me show you what I mean.

What’s Actually Going On

Three failure modes hit me in a 48-hour window, and each one was invisible to the system watching the system.

Failure one: successful exits, 100% broken payload. My content pipeline was ingesting URLs, and a regression introduced a trailing-slash bug that made example.com/foo and example.com/foo/ look like different URLs to the dedup layer. Every new article hit a UNIQUE constraint violation inside a subprocess. The outer wrapper caught the error, logged it to a file nobody was reading, and exited 0. For two weeks the cron appeared green while 100% of structurings were crashing.
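The trailing-slash class of bug goes away if URLs are normalized before they reach the dedup layer. A sketch of one reasonable normalization follows; these rules are a choice for illustration, not the fix that was actually shipped.

```typescript
// Build a dedup key: drop the fragment and strip trailing slashes from the path.
// The URL parser already lowercases the scheme and host.
function dedupKey(rawUrl: string): string {
  const url = new URL(rawUrl);
  url.hash = "";
  const path = url.pathname.replace(/\/+$/, "");
  return `${url.protocol}//${url.host}${path}${url.search}`;
}

console.log(dedupKey("https://example.com/foo") === dedupKey("https://example.com/foo/")); // true
```

Applying the same function on both insert and lookup is what matters; a UNIQUE constraint on the normalized key then rejects only true duplicates.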

Failure two: PATH-resolved Node. I had the daemon running Node 24 (absolute path, explicit). A subagent it spawned inherited a PATH that fell through to Homebrew’s Node 25. One of the native modules (better-sqlite3) was compiled against 24, so every subagent invocation crashed with ERR_DLOPEN_FAILED and a NODE_MODULE_VERSION mismatch. The smoke test I’d written passed because it ran from the daemon’s shell. The actual production path failed every time.
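One way to rule out this class of mismatch is to spawn children with the parent's own binary instead of whatever "node" resolves to on the inherited PATH. A minimal sketch, assuming the child is itself a Node script (the helper name is illustrative):

```typescript
import { spawnSync } from "node:child_process";

// Spawn a child with the exact Node binary the parent is running, instead of
// letting "node" be resolved through whatever PATH the child inherits.
function runWithSameNode(args: string[]): string {
  const result = spawnSync(process.execPath, args, { encoding: "utf8" });
  if (result.status !== 0) {
    throw new Error(`child exited ${result.status}: ${result.stderr}`);
  }
  return result.stdout.trim();
}

// The child reports the same version as the parent, so native modules built
// for the parent's ABI load in the child too.
console.log(runWithSameNode(["-p", "process.version"]) === process.version);
```

A smoke test that runs through this same helper exercises the production path rather than the developer's shell.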

Failure three: auth expiry with no escape hatch. OpenClaw stored some credentials in pass (the Unix password store). When my GPG key timed out, the daemon couldn’t start. Which meant the health monitor couldn’t start. Which meant the thing that would have told me about the outage was the thing that was out. OpenClaw had no watcher that lived outside the daemon it was watching.

None of these are OpenClaw-specific bugs in the upstream sense. They’re pattern problems that emerge anywhere you have:

1. A monolithic daemon responsible for its own monitoring.

2. Flat-file state (HEARTBEAT.md, LEARNINGS.md) that gets appended to rather than queried.

3. Exit codes treated as truth when the real signal is in stderr.

4. No separation between “Did it run?” and “Did it work?”

OpenClaw was built for a different job. It was a personal automation gateway—great at “kick off this script at 6:30 AM.” It wasn’t built to be an agent OS with observability. I was using a shovel to drive screws.

I also couldn’t ignore the security posture. February’s disclosures—135,000 exposed instances, 15,000 vulnerable to RCE, the ClawHavoc plugin-registry incident, nine CVEs—had pushed me to patch hard and lock down. But every week I spent hardening OpenClaw was a week I wasn’t building what I actually wanted: themed agents that owned workstreams, could be reasoned about individually, and fail loudly.

The Fix

ClaudeClaw Mission Control is a Node.js daemon built on the Claude Agent SDK. It runs as a single LaunchAgent (com.claudeclaw.app), owns a SQLite store at store/claudeclaw.db, polls a scheduled_tasks table every 60 seconds, and dispatches due tasks to agents by ID.

The interesting part isn’t the daemon. It’s the agents.

I set up thirteen of them, themed after the small council of a certain fictional kingdom, because if I’m going to stare at this UI every day, I’d rather it amused me.

The War Room

Thirteen themed agents, each owning a workstream. STEWARD drives my mornings and evenings. MAESTER runs the ASTGL content pipeline. WATCHMAN watches the whole system from outside it.

Each agent lives in its own directory at agents//, with an agent.yaml (model, personality, cwd, MCP servers) and a CLAUDE.md system prompt. A scheduled task carries an agentId column in the DB, and the dispatcher routes like this:

if (shouldRouteViaAgent(task.agentId, listAgentIds())) {
  const result = await delegateToAgent(task.agentId, task.prompt, {
    fromAgent: SCHEDULER_FROM_AGENT,
    chatId: task.chatId,
  });
  return result.text ?? '(empty response)';
}

Adding a new agent is now: drop a folder under agents/, write a CLAUDE.md, run schedule reassign . No source changes. The dispatcher picks it up on next tick.

That’s the piece I kept trying and failing to get with OpenClaw—modular ownership. In OpenClaw, everything was “the daemon.” In ClaudeClaw, MAESTER owning the content pipeline means if content alerts stop firing, the log line says maester: task failed instead of openclaw-gateway: subprocess exited nonzero. Attribution is free.

The Watchman probes

WATCHMAN runs every hour at :05. It has seven probes, each targeting a failure mode that burned me on OpenClaw:

1. Failed tasks. status='failed' in the DB. Trivial.

2. Stuck tasks. status='running' AND last_run < now - 10min. This catches hangs.

3. Missed slots. status='active' AND next_run < now - 60s. Catches scheduler drift.

4. Daemon liveness. launchctl print gui/$UID/com.claudeclaw.app—does launchd still have it?

5. Content-pipeline health. Tails the structured log file, parses the JSON, checks for crash shapes.

6. Hidden failures. Scans the last_result text column for ERR_DLOPEN_FAILED, MODULE_VERSION, Traceback, and other “the job exited zero but it sure didn’t work” signals. This is the probe that would have caught my trailing-slash bug in an hour instead of two weeks.

7. Delegation crashes. inter_agent_tasks WHERE status='failed': on-demand agent invocations that blew up.
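Probes 2, 3, and 6 reduce to simple predicates over a task row. The sketch below works on in-memory rows; the TaskRow shape and field names are assumptions, since the real checks run as queries against the SQLite store.

```typescript
interface TaskRow {
  id: string;
  status: "active" | "running" | "failed";
  lastRunMs: number | null; // epoch ms of the last start
  nextRunMs: number | null; // epoch ms of the next scheduled slot
  lastResult: string;       // captured output of the last run
}

// Crash signatures named in probe 6; extend the list with your own.
const CRASH_SIGNATURES = ["ERR_DLOPEN_FAILED", "MODULE_VERSION", "Traceback"];

// Probe 2: still "running" more than 10 minutes after it started.
const isStuck = (t: TaskRow, nowMs: number): boolean =>
  t.status === "running" && t.lastRunMs !== null && t.lastRunMs < nowMs - 10 * 60 * 1000;

// Probe 3: an active task whose slot passed more than 60 seconds ago.
const isMissed = (t: TaskRow, nowMs: number): boolean =>
  t.status === "active" && t.nextRunMs !== null && t.nextRunMs < nowMs - 60 * 1000;

// Probe 6: the job did not fail by status, but its output contains crash text.
const hasHiddenFailure = (t: TaskRow): boolean =>
  CRASH_SIGNATURES.some((sig) => t.lastResult.includes(sig));
```

The third predicate deliberately ignores status: its whole purpose is to doubt rows that claim success.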

On top of that, there’s a separate LaunchAgent running a healthcheck every 30 minutes that lives outside the main daemon and uses a keychain-backed Telegram token. If the daemon is dead, the healthcheck still delivers the alert. That’s the lesson from failure three: the watcher cannot share fate with the watched.

Memory v2

OpenClaw’s memory was HEARTBEAT.md and LEARNINGS.md—flat files I appended to. Eventually they got long enough that the agent stopped reading them usefully, and I had no query surface to pull just the relevant bits.

ClaudeClaw’s Memory v2 is a five-layer context stack:

1. Semantic recall: cosine similarity against stored memory embeddings, top 5 by score, chat-scoped.

2. Recent high-importance memories: memories with importance >= 0.7 written in the last 7 days.

3. Consolidation insights: a 30-minute loop that summarizes the short-term buffer into durable notes.

4. Cross-agent hive: stubbed for now; eventually lets MAESTER peek at something STEWARD noted this morning.

5. Conversation history: last N turns.

Layers dedupe by memory ID. The whole thing is safe to drop into the SDK’s systemPrompt option. It’s not magic. It’s just queryable instead of append-only, which is the delta between “context I can use” and “a log file I’ll never re-read.”
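The first layer and the dedupe step can be sketched in a few lines. The Memory shape and function names are assumptions, embeddings are assumed to share one dimension, and a real store would run the similarity search in the database rather than in a loop.

```typescript
interface Memory {
  id: number;
  text: string;
  embedding: number[];
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Layer 1: top-k memories by cosine similarity to the query embedding.
function semanticRecall(query: number[], memories: Memory[], k = 5): Memory[] {
  return memories
    .map((m) => ({ m, score: cosineSimilarity(query, m.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((x) => x.m);
}

// Merge layers in priority order, keeping the first occurrence of each memory ID.
function mergeLayers(...layers: Memory[][]): Memory[] {
  const seen = new Set<number>();
  const merged: Memory[] = [];
  for (const layer of layers) {
    for (const memory of layer) {
      if (seen.has(memory.id)) continue;
      seen.add(memory.id);
      merged.push(memory);
    }
  }
  return merged;
}
```

Because layers merge in priority order, a memory that surfaces in both semantic recall and the recent high-importance layer appears once, in the position of the earlier layer.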

Forum-topic routing instead of bot-per-agent

A small but satisfying piece. All thirteen agents post to one Telegram bot, into one supergroup, but each agent has a dedicated forum topic:

Alerts → thread 22 (WATCHMAN)

ASTGL → thread 23 (MAESTER)

Council → thread 24

Steward → thread 25

Whisperers → thread 26

War Room - Security → thread 40 (WAR)

One token. One chat. Threaded conversations per domain. The ergonomics are dramatically better than 13 separate bots with 13 separate tokens, which is the architecture I almost built before I remembered that Telegram supergroups have forum topics now.
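Routing by topic comes down to one extra field on the Bot API's sendMessage call. A sketch of the payload builder, using the thread IDs from the list above; the topic keys and the chat ID are placeholders.

```typescript
// Topic key -> forum thread ID. The keys are illustrative names.
const TOPIC_THREADS: Record<string, number> = {
  alerts: 22,
  astgl: 23,
  council: 24,
  steward: 25,
  whisperers: 26,
  "war-room-security": 40,
};

// Build the JSON body for sendMessage. message_thread_id is the parameter
// that targets a forum topic inside a supergroup.
function topicMessage(chatId: number, topic: string, text: string) {
  const threadId = TOPIC_THREADS[topic];
  if (threadId === undefined) {
    throw new Error(`no forum topic configured for "${topic}"`);
  }
  return { chat_id: chatId, message_thread_id: threadId, text };
}

// The chat ID below is a placeholder, not a real supergroup.
console.log(topicMessage(-1001234567890, "alerts", "daemon restarted"));
```

Throwing on an unknown topic keeps a misconfigured agent from quietly posting into the supergroup's general thread.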

Why This Matters

A few things I want to flag for anyone planning something similar.

Build the rollback before you build the new thing. I wrote scripts/retire-openclaw.sh with explicit --rollback semantics before I disabled a single cron job. Plists get moved (not deleted) into _retired-openclaw/. Cron jobs get flipped enabled: false with a timestamped backup (jobs.json.bak.pre-retire-20260419). The OpenClaw directory sits untouched for 30 days with a calendar reminder to delete it. If ClaudeClaw had cratered on day two, I was one shell command away from being back on the old system in under a minute.

Silent success is worse than loud failure. The design principle I pulled from this whole experience: every job in the system needs someone whose job it is to doubt that job ran correctly. That’s WATCHMAN. That’s the external healthcheck. That’s probe #6 specifically scanning success logs for crash text. If your system can tell you “everything’s green” without that green being adversarially checked, the green doesn’t mean anything.

Themed agents beat generic workers. This one I didn’t expect. Giving each workstream a named agent with its own CLAUDE.md persona made the system more debuggable, not less—because now when STEWARD’s morning briefing has weird tone issues, I know exactly which file to edit, and I’m not risking regressions in seven other jobs that would have shared a single “universal assistant” prompt. The theme is cosmetic. The isolation is load-bearing.

The Claude Agent SDK is the right abstraction for this. I spent a while trying to decide whether to keep hacking on OpenClaw, fork it, or start over. Starting over was the right call specifically because the Agent SDK handles the parts I was getting wrong: sub-agent dispatch, MCP tool wiring, system-prompt composition, retry on transient errors. I wrote the parts that are mine (the scheduler, the memory stack, the Telegram layer, the agent router) and let the SDK own the parts that are undifferentiated heavy lifting.

What I gave up. Ollama. Local models. Full offline operation. ClaudeClaw talks to Anthropic’s API, and that’s a real philosophical loss versus the local-first thing I was doing with OpenClaw. I thought about this a lot. The honest answer is that Claude Opus is enough better at long-context agentic work than anything I could run locally that the tradeoff pays for itself. I still own my data—every memory, every document, every log is on my SSD. I just don’t own the weights. For this phase, that’s the right trade.

What I kept. The philosophy. Every document is a file I can grep. Every config is version-controlled. Every decision has a session note I can link to in a future article. The system is mine to read, mine to modify, mine to understand. The whole reason I left Notion is still the whole reason I left Notion.

Quick Reference

The migration, by the numbers:

- 5 days: start of retirement to all 13 agents live (2026-04-19 → 2026-04-21)
- 30+ PRs: one atomic change per commit, conventional-commit format
- 38 cron jobs disabled, 23 LaunchAgents quarantined
- 13 agents onboarded, 7 Watchman probes live, 14 scheduled tasks dispatched via agentId
- 30-day rollback window still open

The retired vs. the replacement:

Retired vs. Replacement

Seven dimensions where the new system pays for itself—from runtime surface to routing to the memory model.

The rule I wrote for myself: No job ships without an external watcher that shares no fate with it. That’s the whole story. Two months of OpenClaw and 48 hours of cascading invisible failures reduced to one sentence I’ll never forget.

I’ll keep writing the ClaudeClaw build-out week by week—the Council orchestration pattern, the Curator autonomous publishing workflow, the voice-mode bridge, the stuff that’s too long for one article. If you want the view from inside while it’s happening, that’s what this is.

Nightshift: I Went to Sleep and My Mac Ran 118 Experiments

James Cruce — Wed, 22 Apr 2026 19:00:22 GMT

I went to sleep. My Mac ran 118 experiments. When I woke up, a small GPT had trained itself from `val_bpb` 1.563 down to 1.289, beating every documented Apple Silicon overnight run in the project's public README. I wrote no code overnight. I just left a Claude Code session running against a markdown file named `program.md`, and the agent did the rest.

This is the first morning I've ever genuinely understood why people talk about AI agents with something other than skepticism.

What autoresearch is

The idea, which is Karpathy's not mine, goes like this. You give an AI agent a real-but-small LLM training setup. One Python file (`train.py`) contains the model, optimizer, and training loop. A second file (`prepare.py`) contains the data pipeline and evaluation, and the agent isn't allowed to touch it. A third file (`program.md`) is a plain markdown document telling the agent what the experiment rules are.

The agent edits `train.py`, runs a training experiment with a fixed 5-minute wall-clock budget, checks `val_bpb` (validation bits per byte, a loss metric where lower is better), and either keeps the change with a git commit or does `git reset --hard` and tries something else. Then it does it again. And again. Indefinitely, until you stop it.
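Here is a toy sketch of that keep-or-revert cycle. It is not the repo's code: `run_experiment` is a hypothetical stand-in for a 5-minute `train.py` run that returns a made-up `val_bpb`, and keeping or dropping a dictionary plays the role of `git commit` and `git reset --hard`.

```python
import random

def run_experiment(config, rng):
    """Hypothetical stand-in for one training run; returns a fake val_bpb."""
    # Pretend the loss is lowest near lr = 0.5, plus a little run-to-run noise.
    return 1.3 + (config["lr"] - 0.5) ** 2 + rng.uniform(0.0, 0.01)

def research_loop(n_experiments, seed=0):
    rng = random.Random(seed)
    best_config = {"lr": 1.0}
    best_bpb = run_experiment(best_config, rng)
    log = []
    for _ in range(n_experiments):
        # Propose a small edit to the current best configuration.
        candidate = dict(best_config)
        candidate["lr"] = max(0.01, candidate["lr"] + rng.uniform(-0.2, 0.2))
        bpb = run_experiment(candidate, rng)
        if bpb < best_bpb:
            best_config, best_bpb = candidate, bpb  # plays the role of "git commit"
            log.append(("keep", bpb))
        else:
            log.append(("discard", bpb))            # plays the role of "git reset --hard"
    return best_config, best_bpb, log

best_config, best_bpb, log = research_loop(20)
```

The only state that survives a failed experiment is the log entry. That is the property the real loop gets from git.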

Karpathy's original repo is NVIDIA and CUDA only. A developer named trevin-creator ported it to Apple Silicon using MLX, no PyTorch required. It runs natively on the M-series chips, eating unified memory instead of GPU VRAM. Which is why I could run it on a Mac Studio sitting on my desk.

Setup and the surprise baseline

Install took about three minutes. `uv sync` pulled MLX and six other small dependencies. `uv run prepare.py` downloaded eleven training shards from the public HuggingFace dataset and trained a BPE tokenizer in 41 seconds.

Then I did one manual run, as the setup instructions said to: a single 5-minute training experiment to establish a hardware baseline, no modifications.

The first surprise: `val_bpb 1.563`. The public README documents a manual walk on older Apple Silicon that bottomed out at `1.807` after four experiments. My first run, before the AI agent had done anything, was already 13% better than that published best. I didn't tune anything. I pulled the repo and ran it.

The reason is in how the loop is constructed. The training budget is fixed at 5 minutes of wall clock. The M3 Ultra throughput is high enough that it fits 555 optimizer steps into that window, while the older hardware fits fewer. Same code. Different step count. Different result.
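The arithmetic behind that is short enough to sketch. The 5-minute budget and the 555 steps come from my run; the "slower machine" is a hypothetical one that takes twice as long per step, not a measured number.

```python
BUDGET_SECONDS = 5 * 60  # fixed wall-clock training budget

def steps_in_budget(seconds_per_step, budget=BUDGET_SECONDS):
    """How many whole optimizer steps fit into the budget at a given step time."""
    return int(budget // seconds_per_step)

# 555 steps in 300 seconds implies roughly 0.54 s per step on the M3 Ultra.
m3_ultra_step_time = BUDGET_SECONDS / 555

# Hypothetical slower machine: twice the time per step.
slower_step_time = 2 * m3_ultra_step_time

print(round(m3_ultra_step_time, 2))
print(steps_in_budget(slower_step_time))
```

Same code, same budget, about half the weight updates. That alone is enough to move `val_bpb`.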

The hardware is a parameter, not a constant.

Specs for replication
- Hardware: Mac Studio M3 Ultra, 128 GB unified memory
- OS and runtime: macOS 15, Python 3.12, `uv` 0.10
- Framework: MLX 0.31 with Metal backend (no PyTorch, no CUDA)
- Agent runner: Claude Code (Anthropic)
- Fork used: `github.com/trevin-creator/autoresearch-mlx`
- Per-experiment budget: 5 minutes training, ~90 seconds compile and eval overhead
- Peak unified memory during training: 21.2 GB

Launching the agent overnight

Here's where you have to decide. Karpathy's default advice is to "disable all permissions" and let the agent go. That's the fastest path and it works. But it's also a permission-free Claude Code session running unattended on your Mac for eight hours, with the ability to execute arbitrary shell commands. If the agent hallucinates a destructive action at 3 AM, you won't be there to interrupt it.

I went with a scoped allowlist instead. A `.claude/settings.local.json` file listing exactly the commands the loop actually needs: `uv run train.py`, `git add train.py`, `git commit`, `git reset --hard`, `grep`, `tail`, a few others. Everything else prompts. The agent can't `rm`, can't `git push`, can't install packages, can't touch any file outside the repo.
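A minimal sketch of generating such a file, in Python so the list stays easy to audit. Two assumptions to verify against the current Claude Code settings documentation: the `permissions.allow` key, and the `Bash(...)` rule syntax with `:*` for prefix matches. Only the commands named above are included. The "few others" in my real file are not reproduced here.

```python
import json

# Commands the loop needs, as listed in the article. Entries ending in ":*"
# are meant as prefix matches (an assumption about the rule syntax).
ALLOWED_COMMANDS = [
    "uv run train.py",
    "git add train.py",
    "git commit:*",
    "git reset --hard",
    "grep:*",
    "tail:*",
]

def build_settings(commands):
    """Build the settings dict; anything not listed should fall back to a prompt."""
    return {"permissions": {"allow": [f"Bash({cmd})" for cmd in commands]}}

settings_json = json.dumps(build_settings(ALLOWED_COMMANDS), indent=2)
print(settings_json)  # paste into .claude/settings.local.json after review
```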

Then I pointed a fresh Claude Code session at `program.md`, pasted "start the experimentation loop, don't stop," and went to bed.

The morning, by the numbers

The morning log:

Comparison to the three overnight runs documented in the public README:

Final `val_bpb` of 1.289 lands below the best documented Apple Silicon overnight result. New territory for the public log.

What the agent actually did

Five phases overnight. Each tells you something.

Phase one: find the big axis. Four experiments in, the agent had halved the batch size three times (1.56, 1.40, 1.39, 1.38), then tried a fourth halving that bounced back to 1.44. The annotation on the discard: "gradient noise." Correct diagnosis. Below a threshold, batch becomes too small for the optimizer to converge inside 5 minutes.

Phase two: schedule tuning, six keeps in a row. The learning-rate schedule was undertuned. The agent walked `WARMDOWN_RATIO` from 0.7 to 1.0, then `WARMUP_RATIO` from 0.02 to 0.2. Every step dropped `val_bpb`. Floor went from 1.38 to 1.34. Biggest easy win of the night, and it was entirely in the schedule.

Phase three: the moment that mattered most. After schedule tuning, the agent retried `TOTAL_BATCH_SIZE = 2^14`. The same configuration it had rejected in phase one. This time it won.

The agent had discovered the thing most humans miss in hyperparameter tuning: the optimal value of one knob depends on the values of all the other knobs. You don't find N independent settings; you find a consistent N-tuple. The only way to find it is to retry earlier-rejected values after each structural change. I've watched human researchers lock in early wins and never revisit them. The agent didn't. It revisited `EMBEDDING_LR` three times over the night, landing at 1.0, then 1.5, then 1.75 across different phases. Each retry, a small win.

Phase four: two structural wins, one line each. `has_ve()` went from alternating-layers-get-Value-Embeddings to all-layers-get-Value-Embeddings, one `return True` replacing a modular-arithmetic expression. `MLP.__call__()` swapped `ReLU²` for `SiLU`, one function call for another. Both character-count-sized changes. Each dropped `val_bpb` by about 0.01.

Phase five: the 37-experiment grind. The agent spent 37 consecutive experiments without a single keep, testing every nearby hyperparameter against the current local minimum. Most humans would have quit and tried a wild leap. The agent didn't. It finished the neighborhood, then found the next structural win. Disciplined exhaustion.

And two catastrophes, both correctly reverted. Tied embeddings came back at `val_bpb 4.29`, three times worse than anything else. The agent annotated it "LR mismatch destroys." Tied embeddings is actually a good idea in general, but incompatible with the differential layer-wise learning rates the architecture uses. The agent reverted in seconds. On another experiment, removing QK-norm after RoPE spiked `val_bpb` to 1.67. Annotation: "massive regression." Reverted. A human would have spent an hour trying to salvage tied embeddings. The agent spent ten seconds on the revert. The revert discipline is the whole game.

What it taught me

Two things crystallized overnight.

Disciplined exhaustion beats creative leaps. Humans get bored. After a few hours on the same hyperparameter axis, we start reaching for something new because the exploration stops feeling productive. The agent doesn't have that pressure. It spent 37 experiments without a win because that's what the local search called for, and then it found the next jump. Most humans couldn't do that. Not because we lack the ability, but because we lack the emotional neutrality. The agent's advantage isn't intelligence. It's the absence of boredom, ego, and social pressure. That isn't a 20× productivity gap. It's a categorical one.

Generation is cheap, evaluation is sacred. Every one of the agent's wins was a one-line diff. So was every catastrophe. The "research" wasn't in writing the code. The research was in the metric's ability to rank one-line diffs instantly and unambiguously. Karpathy's genius isn't the agent. It's `val_bpb` plus a 5-minute budget plus `git reset --hard`. That design slots the agent into exactly what AI is orders of magnitude better at (generating variants, executing at volume) and leaves the hard part (what to measure) to the human who built the loop.

The loop runs on me too

Here's the thing I can't stop thinking about. The loop the agent ran overnight is structurally identical to the one I'm building for my Stoic practice on the same machine.

Morning intention. Five-minute run. Evening review. Keep or discard. Iterate.

Marcus Aurelius wasn't optimizing `val_bpb`. He was optimizing a harder metric with no closed form. But the shape of the loop is the same. Karpathy designed an overnight research org. Epictetus designed an overnight self. Both are the same thing running in different mediums.

The 118-experiment loop ran on a machine on my desk. The second loop runs on me.

If you have a Mac Studio and a spare evening, the repo is at `github.com/trevin-creator/autoresearch-mlx`. Clone it, run `prepare.py`, point a Claude Code session at `program.md`, go to sleep. You wake up to a log of experiments and a better model. And if you're anything like me, you also wake up thinking about which of your own loops could run this way.

A Quick AI Glossary For This Article

Because not everyone speaks ML fluently, here’s a plain-English guide to the terms in this post. I’m still learning too, so these are “practitioner” definitions—enough to follow what’s happening, not academic deep-dives.

The Big Picture

GPT. A type of language model. Stands for “Generative Pre-trained Transformer.” In this article I’m training a tiny one from scratch, not using the big ones like ChatGPT. Same architecture family, just much smaller.

Pre-training. The step where a model learns to predict the next word (or “token”) across a huge pile of text. This is what `train.py` is doing. It happens before any of the fine-tuning that turns a base model into a chatbot.

val_bpb (validation bits per byte). The score the agent is optimizing. Lower is better. It’s a measure of how surprised the model is by held-out text it hasn’t seen during training. A model that predicts well has low surprise. Bits per byte is a way of measuring that surprise that works across different tokenizers, so you can compare different architectures fairly.
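For the curious, here is the usual conversion from a model's per-token loss to bits per byte. This is a sketch of the standard formula, not a copy of what `prepare.py` does, and the token and byte counts are made up.

```python
import math

def bits_per_byte(mean_loss_nats, n_tokens, n_bytes):
    """Convert mean per-token cross-entropy (in nats) to bits per byte of text."""
    total_bits = mean_loss_nats * n_tokens / math.log(2)  # nats -> bits
    return total_bits / n_bytes

# Hypothetical held-out text: 1,000 tokens covering 4,000 bytes,
# with a loss of 4 bits (4 * ln 2 nats) per token.
example_bpb = bits_per_byte(mean_loss_nats=4 * math.log(2), n_tokens=1000, n_bytes=4000)
print(example_bpb)  # close to 1.0
```

Dividing by bytes instead of tokens is what makes the number comparable across tokenizers: a tokenizer with bigger tokens has fewer of them, but the byte count of the text does not change.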

Loss metric. Any number that tells you how wrong a model is on a given task. Training is the process of making that number go down. `val_bpb` is a loss metric.

The Stack

Apple Silicon. Apple’s own CPU/GPU chip family (M1, M2, M3, M4). Uses unified memory, which means the CPU and GPU share the same pool of RAM instead of having separate memory pools. For AI workloads this is a big deal because you don’t have to copy data between CPU RAM and GPU VRAM.

MLX. Apple’s open-source machine learning framework, built specifically for Apple Silicon. Think of it as Apple’s answer to PyTorch but native to Metal (Apple’s GPU API). No PyTorch, no CUDA, no NVIDIA drivers needed.

PyTorch. The dominant open-source ML framework. Most research code you see online assumes PyTorch. It runs on NVIDIA GPUs (via CUDA) and, with caveats, on Apple GPUs (via MPS). MLX is an alternative that sidesteps PyTorch entirely.

CUDA. NVIDIA’s API for running general-purpose compute on their GPUs. If you’ve ever seen a blog post say “requires a CUDA-capable GPU,” they mean an NVIDIA card.

GPU VRAM. The memory that lives on a GPU card, separate from your computer’s main RAM. On Apple Silicon, VRAM and main RAM are the same pool (that’s the “unified memory” thing).

Tokenization & Data

Tokenizer. The thing that turns text into numbers the model can actually work with. “Hello world” might become `[15496, 995]`. The model only ever sees the numbers.

BPE (Byte-Pair Encoding). The most common algorithm for building a tokenizer. It starts with individual characters and iteratively merges the most common pairs until you have a vocabulary of “tokens” that balance common words (one token) and rare words (split into pieces).
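A toy version of the merge step, to make "iteratively merges the most common pairs" concrete. This is a character-level sketch on a tiny string, not the tokenizer training code from `prepare.py`.

```python
from collections import Counter

def most_common_pair(tokens):
    """Return the most frequent adjacent pair of tokens."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with a single merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = list("aaabdaaabac")
for _ in range(3):  # three merge rounds
    tokens = merge_pair(tokens, most_common_pair(tokens))

print(tokens)  # ['aaab', 'd', 'aaab', 'a', 'c']
```

After three rounds the frequent chunk "aaab" is one token and the rare characters are still on their own. A real tokenizer does the same thing thousands of times over a large corpus.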

Shards. Chunks of a large dataset, split into files for parallel download and loading. Our setup uses 11 shards from a public text dataset.

Training Mechanics

Optimizer. The algorithm that actually updates the model’s weights during training. AdamW is the one used here. Every “optimizer step” is one update.

Batch size. How many training examples the model looks at before making one weight update. Bigger batches give smoother gradient estimates but use more memory. Smaller batches fit more weight updates into a fixed time budget.

Gradient accumulation. A trick for getting large effective batch sizes on limited hardware. Process smaller mini-batches sequentially, add up their gradients, then apply one update. `TOTAL_BATCH_SIZE / DEVICE_BATCH_SIZE` tells you how many mini-batches per update.
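A small sketch of the bookkeeping. The `2**14` total matches the value mentioned in the article; the `2**12` device batch size and the gradient numbers are hypothetical. Whether a codebase sums or averages the micro-batch gradients is an implementation detail. This sketch averages.

```python
def accumulation_steps(total_batch_size, device_batch_size):
    """Number of mini-batches processed before one optimizer update."""
    if total_batch_size % device_batch_size != 0:
        raise ValueError("total batch size must be a multiple of device batch size")
    return total_batch_size // device_batch_size

def accumulated_gradient(micro_batch_grads):
    """Average per-mini-batch gradients into the single gradient used for the update."""
    n = len(micro_batch_grads)
    return [sum(component) / n for component in zip(*micro_batch_grads)]

steps = accumulation_steps(total_batch_size=2**14, device_batch_size=2**12)
grad = accumulated_gradient([[1.0, 2.0], [3.0, 4.0]])
print(steps, grad)  # 4 [2.0, 3.0]
```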

Gradient noise. When your batch is so small that the gradient estimate becomes statistically unreliable. The optimizer starts jerking around instead of smoothly descending, and training slows or stalls. The agent correctly identified this as the failure mode at batch 2^12.

Learning rate (LR). How big a step the optimizer takes each update. Too high, and training blows up. Too low, and it barely progresses. The sweet spot depends on everything else.

Learning rate schedule. How the learning rate changes over time. Typically: warm up from zero to peak, cruise, then warm down to zero. `WARMUP_RATIO = 0.3` means the first 30% of training is the warm-up.
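One common way to write that shape as a function of training progress. This is a sketch, not the schedule code from `train.py`: the linear ramps are an assumption, and the 0.5 warm-down ratio in the example is hypothetical.

```python
def lr_multiplier(progress, warmup_ratio, warmdown_ratio):
    """Scale factor in [0, 1] applied to the peak LR; progress runs from 0 to 1."""
    if warmup_ratio > 0 and progress < warmup_ratio:
        return progress / warmup_ratio                      # linear warm-up from zero
    if warmdown_ratio > 0 and progress > 1.0 - warmdown_ratio:
        return max(0.0, (1.0 - progress) / warmdown_ratio)  # linear warm-down to zero
    return 1.0                                              # cruise at peak

# WARMUP_RATIO = 0.3 as in the definition above, hypothetical WARMDOWN_RATIO = 0.5.
for progress in (0.15, 0.4, 0.75, 1.0):
    print(progress, lr_multiplier(progress, warmup_ratio=0.3, warmdown_ratio=0.5))
```

With these numbers the multiplier is 0.5 halfway through the warm-up, 1.0 during the short cruise, 0.5 halfway through the warm-down, and 0.0 at the end.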

Differential / layer-wise learning rates. Using different learning rates for different parts of the model. In the nightshift setup, the embedding layer gets LR 1.75, but the output projection (`lm_head`) gets 0.006 — a 290× difference. This matters because different parameter types have very different sensitivities.

Architecture Pieces

Attention (or attention layer). The core mechanism that lets a transformer model “pay attention to” relevant earlier tokens when predicting the next one. Modern LLMs are mostly stacks of attention layers alternating with MLPs.

MLP (multi-layer perceptron). A simple feed-forward neural network with one or two hidden layers. In a transformer, an MLP sits between each pair of attention layers and does the “thinking” on the representations attention produced.

Activation function. A nonlinear function applied inside a neural net. Without activations, no matter how many layers you stack, the whole thing collapses mathematically into one linear transformation. Examples in this article: `ReLU²` and `SiLU`.

SiLU (Sigmoid Linear Unit). `x * sigmoid(x)`. A smooth, differentiable activation function. Also called Swish. Used in many modern models because it plays nicely with optimizers.

ReLU² (squared ReLU). `max(x, 0) ** 2`. The activation that nanoGPT-speedrun and some research codebases use. Produces sparse, squared activations. Theoretically expressive but less numerically stable than SiLU for short training runs, which is why SiLU won overnight.
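The two activations side by side, as plain scalar Python rather than MLX, so you can see how differently they treat the same input.

```python
import math

def silu(x):
    """SiLU / Swish: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def relu_squared(x):
    """Squared ReLU: zero for negative inputs, x squared otherwise."""
    return max(x, 0.0) ** 2

for x in (-2.0, -0.5, 0.0, 0.5, 3.0):
    print(x, round(silu(x), 4), relu_squared(x))
```

ReLU² is exactly zero for every negative input and grows quadratically for positive ones. SiLU stays smooth, dips slightly below zero for negative inputs, and grows roughly linearly.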

Embedding. The lookup table that converts each input token (a number) into a vector of real numbers. The model learns what each vector should be during training. `wte` = word token embedding.

Value Embeddings (VE). An additional set of embeddings injected into attention layers as the “value” vectors. Think of them as a skip connection from the raw input that every attention layer can consult, on top of what the previous layer produced. Helps information flow when the network is deep.

Tied embeddings. Sharing the input embedding weights with the output projection weights (the thing that produces final logits). Saves millions of parameters. Commonly used in GPT-2 and many others. Broke catastrophically in our run because the differential learning rate setup couldn’t handle the shared weight.

QK-norm (Query-Key normalization). A stabilization trick: normalize the query and key vectors inside attention before computing attention scores. Without it, score magnitudes can spike, saturating the softmax. The agent tried removing QK-norm and `val_bpb` spiked to 1.67, about 28% worse.

RoPE (Rotary Position Embedding). How the model knows the order of tokens. Rotates the query and key vectors by an angle that depends on the token’s position. Standard in modern transformers.
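A scalar sketch of the rotation, assuming the common formulation where consecutive pairs of dimensions are rotated and the frequency base is 10000. Real implementations differ in how they pair dimensions, so treat this as an illustration of the idea rather than what `train.py` does.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Rotate consecutive (even, odd) pairs of `vec` by position-dependent angles."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)  # lower dimensions rotate faster
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * cos_t - y * sin_t, x * sin_t + y * cos_t])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]

# Same relative offset (2 tokens apart) at two different absolute positions.
score_near_start = dot(rope_rotate(q, 5), rope_rotate(k, 3))
score_later = dot(rope_rotate(q, 12), rope_rotate(k, 10))
```

The two scores match up to floating-point error. That is the point of RoPE: the attention score depends on how far apart two tokens are, not on where they sit in the sequence.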

Softmax. The function that turns raw attention scores into a probability distribution over the tokens you might attend to. Highly peaked inputs cause “softmax saturation” — most of the weight collapses onto one token and gradients downstream get weak. That’s why QK-norm matters.
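A quick way to see saturation for yourself. The scores are made up; the second set is just the first multiplied by ten, the kind of blow-up that normalizing queries and keys is meant to prevent.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

modest = softmax([1.0, 2.0, 3.0])
peaked = softmax([10.0, 20.0, 30.0])

print([round(p, 3) for p in modest])
print([round(p, 5) for p in peaked])
```

With modest scores the weight is spread across all three tokens. With the scaled-up scores nearly all of it lands on the last token, and the other two contribute almost nothing.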

Methodology

Hyperparameter. Any configuration value you set *before* training, as opposed to weights the model learns *during* training. Batch size, learning rate, WARMUP_RATIO, depth—all hyperparameters.

Hyperparameter tuning. The art (and mostly the grind) of finding good hyperparameter values. Most of what the agent did overnight was hyperparameter tuning.

Interaction effect. When the optimal value of hyperparameter A changes depending on what hyperparameter B is set to. A consistent set of hyperparameters is not N independent optima — it’s one N-tuple.
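A toy illustration of why revisiting a rejected value can pay off. The loss function is invented for the example, and `a` and `b` are abstract knobs, not real hyperparameters from the run.

```python
GRID = [0, 1, 2, 3, 4]

def toy_loss(a, b):
    """Made-up loss with an interaction term: the best `a` tracks `b`."""
    return 0.1 * (a - b) ** 2 + (b - 3) ** 2

def best_a(b):
    return min(GRID, key=lambda a: toy_loss(a, b))

def best_b(a):
    return min(GRID, key=lambda b: toy_loss(a, b))

a, b = 0, 1
a = best_a(b)                 # first pass: a = 1 wins while b = 1, a = 3 loses
b = best_b(a)                 # a later change moves b to 3
stale_loss = toy_loss(a, b)   # a is now tuned for a b that no longer exists
a = best_a(b)                 # revisit a: the previously rejected a = 3 now wins
final_loss = toy_loss(a, b)

print(a, b, stale_loss, final_loss)
```

Nothing about `a = 3` changed between the two passes. What changed was everything around it.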

Local search. A research strategy: after finding an improvement, test every nearby variation of your current best before venturing somewhere completely different. Tedious for humans. Perfect for agents that don’t get bored.

If I missed a term you’d have liked defined, please let me know in the comments and I’ll add it.

Hosted RAG vs. Self-Hosted RAG for MCP Servers—When Does Paying Actually Win?

James Cruce — Tue, 21 Apr 2026 00:42:22 GMT

I shipped an MCP knowledge server in a weekend with sqlite-vec and Ollama. It answers questions about my own articles. It runs on a laptop. It costs $0/month.

Then someone asked the obvious next question: "Can you point it at our Confluence? And Notion? And the Google Drive?"

Suddenly self-hosted isn't free anymore. It's a part-time job—PDF parsing, OCR, re-indexing schedules, dealing with 50-page slide decks where the first 20 pages are a title card. The embedding pipeline that was elegant for 20 markdown articles starts to sweat when you throw a 400-page SOC 2 audit at it.

So here's the question I had to answer for myself: when does paying Cloudflare, AWS, or Pinecone actually beat running your own stack?

I spent a research pass comparing the live services. Here's what I found.

TL;DR

Self-host when: content is static, under about a thousand docs, single source, you control ingestion cadence, and privacy or cost-per-query matters more than your time.

Hosted when: multiple unstructured sources, frequent re-indexing, non-engineers uploading docs, you need SLAs, or you're shipping this to customers.

Hybrid is increasingly common: hosted RAG for the customer-facing product, self-hosted for internal dogfooding and dev. The two aren't mutually exclusive.

The Contenders

Five options worth your attention. One paragraph each.

Cloudflare AI Search (AutoRAG)

The newest entrant, currently in open beta. Cloudflare stitched together R2 for storage, Vectorize for embeddings, and Workers AI for inference, then wrapped the whole thing in a management API. Strongest pitch: near-zero config, pay-as-you-go, and an official MCP server ships with it. Weakest point: retrieval is vector-first. Cloudflare added optional reranking in October 2025, but there's still no published BM25 or hybrid-search path as of this writing. If your corpus is well-structured, you probably won't notice. If you're indexing messy enterprise content, you will.

AWS Bedrock Knowledge Bases

The enterprise default if you're already on AWS. Hybrid search (vector + BM25) is built in, Cohere reranking is available, and chunking modes range from fixed-size to semantic to custom Lambda. Titan V2 embeddings run at $0.02 per million tokens. There's an official AWS Labs MCP server for retrieval. And then there's the OCU landmine—which I'll get to in a minute, because it deserves its own sidebar.

Pinecone Assistants

Best-in-class retrieval, managed. Hybrid sparse-dense search with automatic reranking, configurable alpha weighting, managed embeddings abstracted away from you, and an official remote MCP server. Pricing is fully usage-based—$5 per million context retrieval tokens, plus input/output token, storage, and ingestion charges on top. The Standard plan has a $50/month minimum; the old $0.05/assistant-hour fee was removed. Free tier is real but tight—5 assistants per project, 1 GB storage, 500k input tokens, and 500k context retrieval tokens per month. Past that you're paying, but the retrieval quality is noticeably better than anything else on this list.

LlamaCloud

Managed LlamaIndex. Multimodal parsing that actually handles diagrams, configurable chunking modes, hybrid retrieval, reranking. The free tier gives you 10,000 credits a month, about a thousand pages. Paid tiers start at $50/month (Starter, 40K credits) and scale to $500/month (Pro, 400K credits). For a LlamaIndex-native team, the Starter tier is genuinely cheap; Pro is where the platform pays off. LlamaIndex ships `run-llama/llamacloud-mcp` (Python) and `run-llama/mcp-server-llamacloud` (TypeScript), plus a hosted gateway at mcp.llamaindex.ai. The MCP story is actually stronger here than I initially realized.

Self-Hosted (sqlite-vec + Ollama)

This is what the ASTGL Knowledge MCP server actually runs on. sqlite-vec for vectors, FTS5 for keyword search (that's your hybrid search right there, no cloud required), and Ollama serving nomic-embed-text for embeddings, all of it on a $10/month Hetzner VPS or a Mac mini on my desk. Works well for up to around a million vectors in my testing. Real cost: infrastructure plus your time. The second one is the variable.

The Six Axes That Actually Matter

Pricing gets the attention, but it’s rarely the deciding factor. Here’s what I look at:

Setup cost

Time-to-first-query is where hosted services actually earn their money. Pinecone Assistants and Cloudflare AI Search will have you chatting with your docs in under a minute after signup—upload and go. Bedrock is the outlier on the hosted side: AWS documentation puts CloudFormation infrastructure deployment at 7–10 minutes, with a full hand-wired setup typically landing at 20–30 minutes. That's hosted pricing with self-hosted-ish friction.

Self-hosted with sqlite-vec and Ollama is about 30 minutes from `apt install` to first working query if you know what you're doing, longer if you're learning. For me it's fast because I've done it. For someone new to local LLMs it's a weekend.

Ongoing cost

This is where the story flips. For a small corpus with low query volume—think a few hundred docs and a few thousand queries a month—Cloudflare AI Search is genuinely cheap, maybe $5–15/month in storage and API costs. Pinecone Assistants sits at $20–50 in that range. Bedrock KB looks innocent until you hit the OCU minimum (more on that below). LlamaCloud's $50/month Starter floor is reasonable; the $500/month Pro tier is where the platform pays off at real scale.

Self-hosted is $10/month for a Hetzner VPS, flat. Mac mini on your desk? $0/month plus electricity. The per-query cost of hosted RAG is the thing that compounds when you scale—or when someone builds something that hammers it.

Ingestion complexity

This is the axis where hosted services earn their keep without argument. Bedrock KB and LlamaCloud both handle PDFs with embedded tables, Word docs, and (in LlamaCloud's case) actual diagrams, not just the text around them. Bedrock's Data Automation service charges $0.010 per page for parsing—not free, but a lot cheaper than writing your own PDF extractor.

Self-hosted with Ollama and sqlite-vec doesn't ship with any of that. If your corpus is markdown, you're fine. If it's a pile of PDFs from your legal team, you're either writing parsers or paying someone to.

Retrieval quality

Three of the four hosted services offer hybrid retrieval. The exception is Cloudflare AI Search, which is vector-first as of this writing, with optional reranking but no published hybrid path. Pinecone Assistants has automatic reranking baked in. Bedrock KB has optional Cohere reranking. Self-hosted with sqlite-vec can do hybrid via FTS5 for keyword matching combined with vector similarity, which is genuinely good, but you're the one writing the ranking logic.

For most queries on well-structured content, vector-only is fine. For ambiguous queries over messy content, reranking earns its cost.
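If you do end up writing the ranking logic for a self-hosted hybrid setup, it can be small. Here is a sketch using reciprocal rank fusion, one common way to merge a keyword ranking with a vector ranking. It is not necessarily what the ASTGL server does, the document ids are hypothetical, and `k=60` is just the value usually quoted for RRF.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids into one list, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

keyword_hits = ["doc_a", "doc_b", "doc_c"]  # hypothetical FTS5 result order
vector_hits = ["doc_c", "doc_a", "doc_d"]   # hypothetical vector-search order

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)  # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Documents that show up in both lists float to the top without you having to put keyword scores and vector distances on the same scale. That scale problem is the annoying part of hand-rolled hybrid search.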

Data residency

Self-hosted wins this one by default. The data never leaves your machine.

On the hosted side: Pinecone has US and EU regions with a DPA, and LlamaCloud has SOC 2 Type II and HIPAA. Bedrock's EU region support has been inconsistent in 2026 documentation—verify before you commit. Cloudflare's Data Localization Suite handles this at the platform level.

If you're in a regulated industry, audit the provider before you pick. Don't trust the marketing page.

Ops burden

This is the one nobody advertises. Self-hosted means you're responsible for:

Keeping Ollama updated
Monitoring embedding drift when you upgrade models
Backing up knowledge.db
Scheduling re-indexing when source content changes
Debugging why sqlite-vec suddenly returns zero results (hint: usually the embedding model changed dimensions)

Hosted services handle all of that. That's most of what you're paying for.
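The dimension-change failure in particular is cheap to guard against yourself. A minimal sketch, assuming you record the index's vector dimension somewhere at build time. The function name and the 768 and 1024 figures are illustrative, not taken from my server.

```python
def check_embedding_dim(stored_dim, new_vector):
    """Fail loudly if a fresh embedding doesn't match the index's dimension."""
    if len(new_vector) != stored_dim:
        raise ValueError(
            f"embedding has {len(new_vector)} dimensions but the index was built "
            f"with {stored_dim}; re-index before querying"
        )
    return True

STORED_DIM = 768  # hypothetical value recorded when the index was built

ok = check_embedding_dim(STORED_DIM, [0.0] * 768)

try:
    check_embedding_dim(STORED_DIM, [0.0] * 1024)  # simulated model upgrade
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

Run it once at startup against a test embedding and a silent "zero results" bug turns into an error message that tells you what to do.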

Sidebar: The Bedrock OCU Landmine

Bedrock Knowledge Bases advertises "no charge for the Knowledge Bases feature itself." Technically true. What they don't mention on the pricing page is that the vector storage layer requires a minimum of 2 OCUs—OpenSearch Compute Units—at roughly $0.24/hour each.

Do the math: 2 OCUs × $0.24/hour × 730 hours/month = about $350 per month whether your knowledge base has 10 documents or 10 million.
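The same arithmetic as a function, so you can plug in your own numbers. The defaults are the rough figures quoted above. Check current AWS pricing for your region before relying on them.

```python
HOURS_PER_MONTH = 730

def monthly_ocu_floor(ocus=2, price_per_ocu_hour=0.24, hours=HOURS_PER_MONTH):
    """Fixed monthly cost of the minimum OCU allocation, in dollars."""
    return ocus * price_per_ocu_hour * hours

floor = monthly_ocu_floor()
print(round(floor, 2))       # 350.4
print(round(floor / 10, 1))  # multiple of a $10/month VPS
```

That floor is about 35 times the flat cost of the self-hosted VPS, before a single document is indexed.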

Nobody else on this list has a fixed cost floor like that. Cloudflare AI Search scales down to pennies. Pinecone Assistants has a real free tier. Self-hosted is $10.

If you're building something small and you're not already deep in AWS—Bedrock KB is the wrong answer. If you're running enterprise-scale search over millions of docs, that $350 becomes a rounding error, and the hybrid+rerank features earn their keep.

Know where you sit before you commit.

The MCP Angle

Here's the thing I didn't expect to find: every production RAG service on this list ships an official MCP server. Cloudflare, Bedrock, Pinecone, LlamaCloud—all of them. This went from "experimental" to "table stakes" over the past year.

Cloudflare AI Search → The official Cloudflare MCP server exposes AI Search endpoints
Bedrock KB → AWS Labs ships `bedrock-kb-retrieval-mcp-server`
Pinecone Assistants → Each assistant gets its own remote MCP endpoint, plus a local Docker option
LlamaCloud → `run-llama/llamacloud-mcp` plus the hosted MCP Gateway at mcp.llamaindex.ai

This wasn't true a year ago. The MCP ecosystem has absorbed the big RAG providers fast enough that "hosted RAG you can query from Claude Desktop" is now a checkbox feature.

Self-hosted doesn't ship with an MCP server—but wrapping one around your sqlite-vec database is a weekend of TypeScript. That's what the ASTGL Knowledge MCP server actually is: an MCP wrapper around vector search and Q&A retrieval over a SQLite database. The MCP part is trivial. The content curation and ingestion pipeline is 90% of the work.

The real insight: hosted RAG plus MCP wrapper is the modern middle path. You don't have to pick pure self-hosted or pure managed. Point a custom MCP server at Pinecone Assistants or Bedrock KB, and you get the retrieval quality of managed services with the MCP-native interface your agents expect. The `cloudflare/ai-search` MCP server does exactly this.

That changes the decision. It's not "hosted vs. self-hosted RAG" anymore. It's "Whose retrieval layer do I want behind my MCP server?"

Decision Framework

Enough philosophy. Here's the checklist I use.

1. Is your corpus under 500 docs and mostly static? Self-host. You'll spend more time reading hosted RAG docs than it would take to `npm install sqlite-vec`.

2. Do you have under 20 hours to ship this? Hosted. Pinecone Assistants or Cloudflare AI Search will get you to a demo faster than you can read the Bedrock IAM setup guide.

3. Are you charging money for this? Either hosted (you need the SLA) or self-hosted with a real infra budget and a pager rotation. Don't split the difference on production.

4. Is any of this data regulated—PHI, PII under GDPR, or financial? Self-host, or audit the hosted provider's compliance posture before you upload anything. Don't trust the marketing page. Ask for the SOC 2 report.

5. Are you already in AWS? Bedrock KB makes sense if your scale justifies the OCU floor. Otherwise, Pinecone.

6. Everything else? Prototype self-hosted with sqlite-vec. Migrate to hosted when a specific pain point forces the move. "We keep hitting embedding model drift" is a real reason. "It seems complicated" isn't.

The rule of thumb I use: pay for what hurts, self-host what you enjoy. If PDF parsing makes you want to quit, pay Bedrock or LlamaCloud. If SQL and vector search are fun, keep sqlite-vec.

What I'd Actually Build in 2026

If you asked me right now, for real scenarios:

Weekend side project. sqlite-vec plus Ollama plus nomic-embed-text. Runs on a laptop, costs nothing, and teaches you how RAG actually works. This is where I'd start every time.

Customer-facing SaaS feature. Cloudflare AI Search. Pay-per-query pricing means your costs track your usage. Official MCP server means Claude Desktop users can plug in directly. The open-beta caveat is real—verify the SLA matches your product's uptime needs before launch.

Enterprise RAG over thousands of internal docs. Bedrock Knowledge Bases if you're already in AWS and you'll comfortably exceed the OCU floor. Pinecone Assistants if you're not. LlamaCloud if your team is already deep in LlamaIndex and the multimodal parsing earns its cost. All three have hybrid search; all three ship MCP servers. Pick based on where your infrastructure—and your team's existing expertise—already lives.

Team knowledge base. Self-hosted if it's under five people and you've got one engineer who cares about it. Hosted the moment it crosses twenty users or someone non-technical needs to upload docs. The threshold isn't the document count—it's the human factor.

The sqlite-vec era isn't ending. It's just not the only answer anymore. A year ago, self-hosted was the serious choice, and hosted was for people who didn't want to learn. In 2026, that framing doesn't hold. Hosted RAG is production-ready, MCP-native, and sometimes cheaper than your own ops time.

Pick the tool that matches the job. That's it.

FAQ

What is Cloudflare AI Search?

Cloudflare AI Search (formerly AutoRAG) is a managed Retrieval-Augmented Generation service built on Cloudflare's platform. It combines R2 storage, Vectorize for embeddings, and Workers AI for inference into a single API. It's currently in open beta with vector-first retrieval and optional reranking, and ships with an official MCP server that lets Claude and other AI assistants query your indexed documents directly.

When should I use hosted RAG instead of sqlite-vec for an MCP server?

Use hosted RAG when your corpus exceeds a few thousand documents, you're ingesting multiple source types like PDFs or Word docs, non-engineers need to upload content, or you need a production SLA. Stick with sqlite-vec when the corpus is static markdown under about 1,000 documents, you control ingestion, and cost-per-query matters more than ops time.

Can I use the Cloudflare AI Search MCP with Claude Desktop?

Yes. Cloudflare ships an official MCP server that exposes AI Search endpoints as MCP tools. Add the Cloudflare MCP server to your Claude Desktop config, provide your API token, and Claude can query your indexed documents through the same interface it uses for any other MCP tool. The setup is documented in the Cloudflare MCP repository.

Related reading:
How I Shipped an MCP Knowledge Server in a Weekend: the self-hosted case study this article references
How Do MCP Registries Work (Smithery, mcpt)?: finding MCP servers, including the ones in this article
Cortex: An Event-Sourced Memory Architecture for AI Coding Assistants: related exploration of the memory/retrieval landscape

What's the Future of MCP Servers in 2026-2027?

James Cruce — Mon, 13 Apr 2026 04:21:50 GMT

MCP servers have gone from a niche protocol announcement to the backbone of AI tool integration in under two years. But we're still early.

Here's where the ecosystem is heading, what's changing, and what it means for anyone building with AI tools today.

The Short Answer

MCP is becoming the standard way AI connects to tools. The next 18 months will bring better security, larger registries, enterprise adoption, and a shift toward local-first AI architectures. If you're building skills in this space now, you're ahead of the curve.

| Trend | 2025 | 2026 (Now) | 2027 (Projected) |
|-------|------|------------|-------------------|

Trend 1: The Local-First AI Shift

The most significant trend in AI infrastructure isn't a new model—it's where models run.

What's Happening

Open-source models are improving at a staggering pace. Gemma 4, Qwen 3, Llama 3.3, and their successors close the gap with cloud models every quarter. A 26B parameter model running on a Mac Studio today outperforms cloud GPT-4 from 18 months ago.

This changes the economics. When local models handle 90% of tasks at zero marginal cost, the question shifts from "should I use AI?" to "should I pay for cloud AI when local works?"

What This Means for MCP

Local AI + MCP servers = fully autonomous local automation. No cloud dependency. No API costs. No data leaving your machine. The stack is:

Ollama → Local model serving
MCP servers → Tool integration
Gateway (OpenClaw, n8n) → Orchestration
Local storage → Data and knowledge

This stack runs on consumer hardware. A Mac Mini with 32 GB handles it. This is enterprise-grade automation on a consumer budget.

The Prediction

By 2027, running a personal AI stack locally will be as common as running a home media server is today. The early adopters are doing it now. The mainstream follows when setup becomes one-click.

Trend 2: Registry Maturation

From Wild West to App Store

Current MCP registries are like the early npm ecosystem—anyone can publish anything, quality varies wildly, and discovery is hit-or-miss. That's changing.

What's coming:

Verified publishers—Registries will distinguish between official, verified, and community servers
Security scanning—Automated analysis of server code for vulnerabilities and suspicious behavior
Dependency management—Tools to manage, update, and audit all installed MCP servers
Usage analytics—Data on which servers are most used, most reliable, most maintained
Compatibility testing—Verified compatibility with specific AI clients (Claude, VS Code, etc.)

The Consolidation Question

Will one registry dominate? Probably not in the npm-monopoly sense. More likely:

Smithery stays the largest general-purpose registry
mcpt establishes the quality-curated niche
Platform-specific registries emerge (VS Code marketplace, Claude's built-in catalog)
Enterprise registries appear for internal MCP server management

The Prediction

By 2027, installing an MCP server will feel like installing a browser extension. Browse, click install, authenticate, done. The manual JSON editing of today will be a historical footnote.

Trend 3: Protocol Evolution

What MCP Gets Right Today

Simplicity—The client-server model is easy to understand and implement
Language agnostic—Servers can be built in any language
Tool abstraction—AI sees tools, not implementation details
Local-first—Servers run on your machine by default

What's Coming

Authentication and Security

Current MCP servers handle auth inconsistently. The protocol will standardize:

OAuth 2.0 integration for services that need it
Fine-grained permission scoping (read vs. write, specific resources)
Credential management (secure storage, rotation)
Audit logging (who accessed what, when)

Streaming and Real-Time Data

Today's MCP is mostly request-response. Future versions will support:

Event streams (new email arrives, calendar changes, file modified)
WebSocket-based persistent connections
Real-time monitoring and dashboards

Multi-Modal Support

Current MCP tools primarily handle text. Expect the protocol to expand to:

Vision tools (analyze images, screenshots, documents)
Audio tools (transcription, speech synthesis)
Video tools (clip extraction, analysis)
Document tools (PDF processing, spreadsheet manipulation)

Server-to-Server Communication

Future versions may let MCP servers call each other:

A calendar server queries a contacts server to enrich meeting attendee data
A research server calls a web search server to fetch sources
Composable server chains without client involvement

The Prediction

MCP 2.0 (or equivalent major version) will land by mid-2027 with standardized auth, streaming, and multi-modal support. The protocol will feel complete rather than minimal.

Trend 4: Enterprise Adoption

The Enterprise AI Problem

Large organizations want AI automation but face:

Security concerns—Data can't leave the corporate network
Compliance requirements—audit trails, access controls, data residency
Integration complexity—Hundreds of internal tools, custom APIs, legacy systems
Governance—Who approves which AI can access what?

How MCP Solves This

MCP's local-first architecture is inherently enterprise-friendly:

Servers run inside the network—data stays on-premises
Per-server permissions—Each server accesses only what's allowed
Standard protocol—one integration pattern for all tools
Audit capability — All tool calls are loggable

What's Coming

Enterprise MCP platforms—Companies like Cloudflare, AWS, and Azure offering managed MCP infrastructure
Internal MCP registries—Corporate app stores for approved MCP servers
Policy engines—Centralized rules for which AI can use which tools, when, with what data
SOC 2 / HIPAA compliant servers—Certified MCP servers for regulated industries

The Prediction

By late 2027, enterprise MCP infrastructure will be a recognized market category, similar to how API gateways became standard enterprise infrastructure.

Trend 5: The Knowledge Server Pattern

Beyond Tool Servers

Most MCP servers today are tool servers—they do things (send email, search web, manage files). A growing pattern is knowledge servers—they know things.

A knowledge server exposes structured information that AI can query:

Company knowledge base
Product documentation
FAQ databases
Research libraries
Personal notes and archives

Why This Matters

When AI can query your knowledge directly, it answers from your data—not its training data. This means:

Answers grounded in your actual documentation
No hallucination about your specific products or processes
Always current (the knowledge server reads live data)
Personalized to your context

The Example: mcp-astgl-knowledge

This is what I'm building—an MCP server that indexes all 20 articles in this series and makes them queryable by any AI client.

Tools it will expose:

`search_answers` — Semantic search across all articles
`get_answer` — Retrieve a specific article by topic
`list_topics` — Browse all available topics
`get_faq` — Pull FAQ entries for specific questions

How it works:

Articles parsed from markdown (frontmatter + body)
Embeddings generated via local Ollama (nomic-embed-text)
Stored in SQLite with sqlite-vss for vector search
Served as an MCP server that any AI client can connect to

When someone asks Claude "What's the best local LLM for coding?" and this server is connected, Claude queries the knowledge base and answers with information from article 12—not from its training data, which might be outdated.
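Here is a minimal sketch of the parse, embed, and search steps, assuming toy three-dimensional vectors in place of real nomic-embed-text embeddings and an in-memory list in place of the SQLite vector index. The function names and the `key: value` frontmatter layout are illustrative assumptions, not the actual mcp-astgl-knowledge code.

```python
import math


def parse_article(text: str) -> dict:
    """Split a markdown file into frontmatter fields and body text.

    Assumes simple `key: value` frontmatter between two `---` lines.
    """
    meta, body = {}, text
    if text.startswith("---\n"):
        header, _, body = text[4:].partition("\n---\n")
        for line in header.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                meta[key.strip()] = value.strip()
    return {"meta": meta, "body": body.strip()}


def cosine(a, b) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search_answers(query_vec, index, top_k=2):
    """Rank (title, vector) pairs by similarity to the query vector."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [title for title, _ in ranked[:top_k]]
```

In a real server the query vector would come from the same embedding model used at index time, and the ranking would be pushed down into the vector store instead of sorting in Python.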

The Prediction

Knowledge servers will be the fastest-growing MCP server category in 2027. Every company with documentation, every creator with a content library, every expert with a knowledge base will want one.

What This Means for You

If You're a Developer

Now: Learn the MCP SDK, build a server, publish it. The ecosystem rewards early builders—servers published now accumulate installs and reputation.

Next 12 months: Expect demand for custom MCP servers at companies integrating AI. MCP development will become a marketable skill alongside API development.

Key skill: Understanding how AI agents use tools. The protocol is simple—the design of good tools is the hard part.

If You're a Business Owner

Now: Connect MCP servers to Claude for immediate productivity gains. Start with email, calendar, and file access. Automate one workflow.

Next 12 months: Evaluate local AI infrastructure for privacy and cost benefits. Budget for hardware that pays for itself in reduced API costs.

Key opportunity: Businesses that adopt AI automation now will have 12-18 months of compounding efficiency gains over competitors who wait.

If You're an Individual User

Now: Set up Claude Desktop with 2-3 MCP servers. Experience the difference between chatbot AI and tool-connected AI.

Next 12 months: Consider a local AI setup (Ollama on existing hardware or a Mac Mini). Automate your most repetitive tasks.

Key realization: AI connected to your tools is dramatically more useful than AI in a chat window. MCP servers are what make that connection possible.

How I Actually Do This

I've been building in this ecosystem for over a year. Here's what the trajectory looks like from the inside:

What I'm Building Next

mcp-astgl-knowledge—The knowledge server I mentioned above. It's the capstone of this article series: 20 articles become a queryable knowledge base that any AI can access.

The plan:

1. Build with TypeScript + @modelcontextprotocol/sdk

2. Index all 20 articles with local embeddings

3. Publish to npm

4. Register on Smithery and mcpt

5. Share the build process as an ASTGL tutorial

This is the pattern I believe will explode: experts building knowledge servers that make their expertise available to AI systems.

What I've Observed

1. The tooling is getting better fast. Building an MCP server in early 2025 required reading the spec and figuring things out. In 2026, the SDK handles most of the boilerplate and registries provide distribution.

2. Local model quality jumped significantly. Gemma 4 was a step change. Tasks that needed cloud models a year ago now run locally at comparable quality. The gap keeps narrowing.

3. The automation compound effect is real. My first automation saved 30 minutes per day. Twenty-six automations save hours. Each new automation builds on the infrastructure of previous ones. The marginal cost of the 27th automation is near zero.

4. Community momentum is accelerating. The number of MCP servers, tools, and tutorials appearing weekly is orders of magnitude higher than a year ago. This is the network effect in action.

5. The biggest barrier is awareness, not technology. Most people who would benefit enormously from MCP servers don't know they exist. That's why I wrote this series—and why I'm building a knowledge server to make the information accessible.

The 18-Month Outlook

| Quarter | What to Expect |
|---------|---------------|
| Q2 2026 (now) | Local models competitive for 85% of tasks. MCP registries at 5,000+ servers. |
| Q3 2026 | Enterprise MCP pilots at major companies. Visual workflow builders support MCP natively. |
| Q4 2026 | MCP auth standardization lands. Knowledge servers emerge as a category. |
| Q1 2027 | Local models reach 90%+ parity for common tasks. MCP registries pass 15,000 servers. |
| Q2 2027 | Enterprise MCP platforms from major cloud providers. One-click server installation standard. |

Will MCP be replaced by a competing standard?

Unlikely in the near term. MCP has strong momentum, broad adoption, and backing from Anthropic with buy-in from Microsoft and Google. Standards wars are possible, but MCP's open protocol design and existing ecosystem make it the safe bet for building today.

What if I invest in MCP and it becomes obsolete?

The skills transfer. Understanding tool integration, agent patterns, and local AI architecture is valuable regardless of the specific protocol. If a successor to MCP emerges, it will solve the same problems in a similar way, and your experience will translate directly.

Are local models improving fast enough to matter?

Yes. The improvement curve for open-source models is steep. Every 6-12 months brings a significant quality jump. Hardware you buy today runs better models next year—your investment appreciates in capability over time.

When should I start building with MCP?

Now. The ecosystem is mature enough for production use but early enough that builders and early adopters have significant advantages. Every month you wait is a month of compounding automation benefits you don't capture.

What's the single most important thing to do today?

Connect one MCP server to Claude and use it for one real task. That first experience—seeing AI interact with your actual tools instead of just your text—changes how you think about what AI can do. Everything else follows from that realization.

*This is part of the ASTGL Definitive Answers series—structured, practical answers to the questions people actually ask about AI automation, MCP servers, and local AI infrastructure.*

How Do I Automate Workflows with AI Agents?

James Cruce — Mon, 13 Apr 2026 04:01:55 GMT

Article 13 in this series introduced what AI agents are. This article goes deeper: how to design, build, and operate agent workflows that handle real work—from simple scheduled tasks to multi-agent orchestration.

If you've already automated a few tasks and want to level up, this is your guide.

The Short Answer

Agent workflows combine AI reasoning with tool access and scheduling to complete multi-step tasks autonomously. The architecture ranges from simple (one agent, one task) to complex (multiple agents coordinating on a pipeline).

| Complexity | Architecture | Example |
|-----------|-------------|---------|
| Simple | One agent, one task, scheduled | Morning briefing at 6:30 AM |
| Chained | Multiple steps, sequential | Research → Draft → Edit → Publish |
| Parallel | Multiple agents, simultaneous | 5 news sources searched concurrently |
| Orchestrated | Coordinator + specialist agents | Content council with 5 roles |

Workflow Patterns

Pattern 1: Scheduled Single-Agent

The simplest useful workflow. One agent runs one task on a schedule.

[Schedule] → [Agent + Tools] → [Output + Delivery]

Example: Weekly security audit

Schedule: Saturday 8:00 AM
Agent: Gemma 4 31B with filesystem MCP
Task: Read all config files, check for common misconfigurations, compare against best practices
Output: Audit report delivered to Discord

When to use: Any standalone task that repeats on a schedule and benefits from AI reasoning.

Pattern 2: Sequential Chain

Multiple steps execute in order, each feeding into the next.

[Step 1: Research] → [Step 2: Draft] → [Step 3: Edit] → [Step 4: Publish]

Example: Content creation pipeline

Step 1: SCOUT agent searches for trending topics, produces research brief
Step 2: QUILL agent writes article from research brief
Step 3: LEDGER agent fact-checks article against sources
Step 4: MAVEN agent generates distribution pieces

When to use: Tasks with natural stages where each stage's output becomes the next stage's input.
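A sequential chain can be sketched as a list of functions where each one takes the previous step's output. The four step functions below are hypothetical stand-ins for the model calls, so treat this as the shape of the pipeline, not my production code.

```python
def research(topic: str) -> dict:
    """Stand-in for the research step: produce a brief for the topic."""
    return {"topic": topic, "brief": f"notes on {topic}"}


def draft(state: dict) -> dict:
    """Stand-in for the drafting step: write from the research brief."""
    return {**state, "draft": f"Article about {state['topic']}: {state['brief']}"}


def fact_check(state: dict) -> dict:
    """Stand-in for fact-checking: confirm the draft reflects the brief."""
    return {**state, "verified": state["brief"] in state["draft"]}


def distribute(state: dict) -> dict:
    """Stand-in for distribution: derive a short teaser from the draft."""
    return {**state, "teaser": state["draft"][:40]}


def run_chain(initial, steps):
    """Feed each step's output into the next step, in order."""
    state = initial
    for step in steps:
        state = step(state)
    return state


result = run_chain("mcp registries", [research, draft, fact_check, distribute])
```

Passing a dictionary along the chain keeps every earlier stage's output available to later stages, which makes the fact-check step possible without re-running research.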

Pattern 3: Fan-Out / Fan-In

One task spawns multiple parallel tasks; their results are then collected and synthesized.

           → [Agent A: Source 1] →
[Dispatch] → [Agent B: Source 2] → [Collect + Synthesize]
           → [Agent C: Source 3] →

Example: Competitive research

Dispatch: "Research these 5 competitors"
5 parallel agents each research one competitor
Collector synthesizes all 5 reports into a single competitive brief

When to use: Tasks that can be decomposed into independent subtasks, where parallelism saves time.
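As a rough sketch of fan-out/fan-in, the standard library's thread pool is enough when each subtask mostly waits on a model or a web request. `research_competitor` is a placeholder for the real agent call, and the brief format is an assumption for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def research_competitor(name: str) -> dict:
    """Placeholder for one research agent run against one competitor."""
    return {"competitor": name, "summary": f"recent activity for {name}"}


def fan_out_fan_in(names, worker=research_competitor, max_workers=5):
    """Run one worker per name in parallel, then merge into a single brief."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in input order, even if workers finish out of order
        reports = list(pool.map(worker, names))
    brief = "\n".join(f"- {r['competitor']}: {r['summary']}" for r in reports)
    return reports, brief
```

Because `pool.map` preserves input order, the synthesized brief lists competitors in the same order they were dispatched, which keeps week-over-week reports easy to compare.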

Pattern 4: Router + Specialists

A lightweight router examines each incoming task and dispatches it to the best specialist.

[Input] → [Router] → [Specialist A (code)]
                   → [Specialist B (writing)]
                   → [Specialist C (research)]
                   → [Specialist D (triage)]

Example: Notification processing

Router: Gemma 4 e4B classifies incoming notifications (fast, cheap)
Critical → Immediate Discord alert
Important → Queue for hourly batch
Routine → Queue for 3-hour digest
Spam → Discard and log

When to use: High-volume inputs that need different handling based on content or urgency.
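The routing logic itself is small. In this sketch the classifier is a keyword rule set standing in for the small model, and the queue names are hypothetical labels for the four destinations listed above.

```python
from collections import defaultdict

# Hypothetical destination names for the four classes described above
ROUTES = {
    "critical": "alert_now",
    "important": "hourly_batch",
    "routine": "three_hour_digest",
    "spam": "discard_and_log",
}


def classify(notification: str) -> str:
    """Stand-in for the small-model classifier: simple keyword rules."""
    text = notification.lower()
    if "outage" in text or "security" in text:
        return "critical"
    if "unsubscribe" in text or "winner" in text:
        return "spam"
    if "invoice" in text or "deadline" in text:
        return "important"
    return "routine"


def route(notifications):
    """Group notifications by the destination their class maps to."""
    queues = defaultdict(list)
    for note in notifications:
        queues[ROUTES[classify(note)]].append(note)
    return dict(queues)
```

Swapping the keyword rules for a model call changes only `classify`; the dispatch table and the queues stay the same.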

Pattern 5: Multi-Agent Council

Multiple specialized agents collaborate on a complex task, each contributing their expertise.

[SCOUT] → findings → [FORGE] → outline → [QUILL] → draft → [LEDGER] → verified → [MAVEN] → published
                                                         ↑                              |
                                                         └──── revision request ────────┘

Example: Content production council (my actual setup)

SCOUT: Research and topic discovery
FORGE: Structure and outlining
QUILL: Drafting with voice profile
LEDGER: Fact-checking and validation
MAVEN: SEO, distribution, and publishing

Agents can request revisions from earlier agents—LEDGER can send a draft back to QUILL if facts don't check out.

When to use: Complex, multi-faceted work where specialized expertise improves quality.
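The revision loop between a checker and a drafter can be sketched as a bounded retry. The `draft_fn` and `check_fn` callables are hypothetical stand-ins for the QUILL and LEDGER agents, and the cap on revisions is my assumption so that a stubborn draft cannot loop forever.

```python
def run_with_revisions(draft_fn, check_fn, brief, max_revisions=2):
    """Draft, check, and send back for revision until verified or out of attempts."""
    feedback = None
    draft = None
    for attempt in range(1, max_revisions + 2):
        draft = draft_fn(brief, feedback)
        ok, feedback = check_fn(draft)
        if ok:
            return {"draft": draft, "attempts": attempt, "verified": True}
    return {"draft": draft, "attempts": max_revisions + 1, "verified": False}


def quill(brief, feedback):
    """Stand-in drafter: adds sources only after being asked to."""
    return brief + (" [sources added]" if feedback else "")


def ledger(draft):
    """Stand-in checker: passes drafts that cite sources."""
    ok = "[sources added]" in draft
    return ok, None if ok else "add sources"


outcome = run_with_revisions(quill, ledger, "MCP registries overview")
```

Returning `verified: False` instead of raising lets the final publishing step decide whether to hold the piece for a human.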

Building a Workflow: Step by Step

Let's build a real workflow from scratch—a weekly competitive intelligence report.

Step 1: Define the Goal

"Every Monday at 7 AM, research the top 5 competitors in my space, summarize their recent activity, identify notable changes, and deliver a structured report to Discord."

Step 2: Choose the Architecture

This is a fan-out/fan-in pattern:

Fan-out: Research 5 competitors in parallel
Fan-in: Synthesize into one report

Step 3: Design the Agents

Research Agent (runs 5 times, once per competitor):

Model: Gemma 4 26B
Tools: Web search MCP server
Input: Competitor name + website
Output: Structured findings (recent blog posts, product changes, social mentions, job postings)

Synthesis Agent (runs once):

Model: Gemma 4 26B
Tools: Filesystem MCP (to save the report)
Input: All 5 research outputs
Output: Formatted competitive brief with highlights, threats, and opportunities

Delivery Agent:

Input: Final report
Output: Discord message with report content

Step 4: Define the Schedule

# Cron expression: Every Monday at 7:00 AM
0 7 * * 1
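The five cron fields are minute, hour, day of month, month, and day of week, so `0 7 * * 1` fires at 07:00 every Monday. A quick way to sanity-check a schedule like this is to compute the next run time yourself; the helper below is an illustrative sketch for this one expression, not a general cron parser.

```python
from datetime import datetime, timedelta


def next_monday_7am(now: datetime) -> datetime:
    """Return the next 07:00 Monday strictly after `now`.

    Cron numbers Monday as 1; Python's weekday() numbers Monday as 0.
    """
    days_ahead = (0 - now.weekday()) % 7
    candidate = (now + timedelta(days=days_ahead)).replace(
        hour=7, minute=0, second=0, microsecond=0
    )
    if candidate <= now:
        candidate += timedelta(days=7)
    return candidate
```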

Step 5: Build Error Handling

| Failure | Handling |
|---------|---------|
| Web search fails for one competitor | Skip that competitor, note in report |
| Model times out | Retry once, then use smaller model as fallback |
| All searches fail | Alert human, skip this week's report |
| Discord delivery fails | Save report to file, alert via email |
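Two of those rows translate directly into code. This sketch assumes the model and search calls are passed in as plain callables and that a timeout surfaces as Python's built-in `TimeoutError`; both are assumptions for illustration.

```python
def call_with_fallback(call_model, prompt, primary, fallback, retries=1):
    """Try the primary model, retry on timeout, then fall back to a smaller model."""
    for _ in range(retries + 1):
        try:
            return call_model(primary, prompt), primary
        except TimeoutError:
            continue
    return call_model(fallback, prompt), fallback


def gather_research(search, competitors):
    """Search each competitor; skip individual failures, abort if all fail."""
    findings, skipped = {}, []
    for name in competitors:
        try:
            findings[name] = search(name)
        except Exception:
            skipped.append(name)  # noted in the report instead of failing the run
    if not findings:
        raise RuntimeError("all searches failed: alert a human and skip this week")
    return findings, skipped
```

Returning the name of the model that actually answered makes it easy to log how often the fallback is carrying the load.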

Step 6: Add Logging

Every agent execution logs:

Timestamp
Input received
Tools called and their responses
Output generated
Execution time
Any errors or retries
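One way to capture those fields is a single JSON line per execution. The wrapper below is a sketch: the record keys are my own naming, and it assumes the agent function appends to a `tool_calls` list that the wrapper hands it.

```python
import json
import time
from datetime import datetime, timezone


def run_logged(agent_name, agent_fn, task_input, log_lines):
    """Run one agent call and append a JSON log record for it."""
    started = time.perf_counter()
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "agent": agent_name,
        "input": task_input,
        "tool_calls": [],  # the agent appends {"tool": ..., "response": ...} here
        "output": None,
        "errors": [],
    }
    try:
        record["output"] = agent_fn(task_input, record["tool_calls"])
    except Exception as exc:
        record["errors"].append(repr(exc))
    record["execution_seconds"] = round(time.perf_counter() - started, 3)
    log_lines.append(json.dumps(record))
    return record
```

In practice `log_lines` would be a file opened in append mode; a list keeps the sketch easy to test.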

Step 7: Test and Iterate

Run the workflow manually first. Review the output. Adjust prompts, model choice, and error handling based on real results. Only schedule it after 3 successful manual runs.

Orchestration Tools

OpenClaw (Local Gateway)

OpenClaw is a local AI gateway that manages model routing, task scheduling, and tool execution.

Strengths:

Runs entirely locally—no cloud dependency
Routes tasks to appropriate models based on complexity
Manages MCP server connections
Built-in scheduling and delivery (Discord, Slack, email)
Logging and monitoring

Best for: Users who want full local control over their agent workflows.

n8n (Visual Workflow Builder)

n8n provides a visual drag-and-drop interface for building workflows.

Strengths:

No-code visual builder
Hundreds of pre-built integrations
Self-hostable (runs on your machine)
Supports webhooks, schedules, and event triggers

Best for: Non-developers who want automation without writing code.

Cron + Scripts (DIY)

The simplest orchestration: cron jobs that run scripts calling the Ollama API.

Strengths:

Zero additional software
Works on any Unix system
Complete control
No abstraction overhead

Best for: Developers comfortable with bash scripting who want minimal dependencies.
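For the DIY route, the script that cron runs can be a few lines. This sketch targets Ollama's local `/api/generate` endpoint on its default port with streaming turned off; the model tag `gemma4:26b` is a placeholder assumption, so substitute whatever `ollama list` shows on your machine.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_request(prompt: str, model: str = "gemma4:26b") -> urllib.request.Request:
    """Build a non-streaming generate request. The model tag is a placeholder."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def generate(prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt to the local Ollama server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

A script that calls `generate(...)` and writes the result to a file or webhook is all the "orchestration" a simple scheduled task needs.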

Claude Agent SDK (Custom Code)

Anthropic's SDK for building custom agent logic in Python or TypeScript.

Strengths:

Full programmatic control
Access to Claude's tool-use capabilities
Complex agent logic (loops, conditionals, multi-turn)
Production-grade error handling

Best for: Developers building sophisticated custom agents.

How I Actually Do This

My workflow orchestration runs through OpenClaw on a Mac Studio. Here's the production architecture:

The Orchestration Layer

OpenClaw Gateway
├── Schedule Manager (cron-like)
├── Model Router (triage → specialist)
├── MCP Connector (15+ servers)
├── Delivery Manager (Discord, file system)
└── Log Aggregator

Daily Workflow Map

Multi-Agent Council Integration

The ACA Council (SCOUT/FORGE/QUILL/LEDGER/MAVEN) runs as an orchestrated multi-agent workflow:

1. Morning meeting (7 AM): SCOUT presents topic research, council prioritizes

2. Production cycle: Sequential chain through all 5 agents

3. Evening meeting (8 PM): Review completed articles, queue for publishing

4. Publishing: Automated sync to site, Substack, and social channels

Paperclip Integration

Paperclip (a separate agent management platform) provides additional orchestration for agents that need web-based interfaces and team collaboration. It runs alongside OpenClaw—some workflows use OpenClaw's local scheduling, others use Paperclip's cloud features.

The key insight: you don't need one orchestration tool. Different workflows have different needs. Simple schedules use cron. Complex pipelines use OpenClaw. Team-visible workflows use Paperclip.

Lessons from Production

1. Start with single-agent workflows. Get one agent reliable before adding coordination complexity. My first 10 workflows were all single-agent scheduled tasks.

2. The router pattern is the highest-leverage addition. Adding a triage router that classifies incoming work and dispatches to the right model immediately improved quality and speed across all workflows.

3. Logging saved me dozens of hours. When an agent produces bad output, logs show exactly what happened. Without logs, you're guessing. I log every tool call, every model response, every delivery.

4. Agents need guardrails, not just goals. "Research competitors" is too vague. "Search for blog posts published in the last 7 days from these 5 domains, extract titles and summaries, skip anything older than 7 days" — that produces reliable results.

5. Schedule slack prevents cascading failures. My 6:00 AM research pipeline sometimes takes 25 minutes. The 6:30 AM briefing doesn't depend on it; the two run independently. Dependent workflows have explicit wait conditions, not just time offsets.

Monitoring and Maintenance

What to Monitor

| Metric | Why It Matters | Alert Threshold |
|--------|---------------|----------------|
| Execution time | Detect slowdowns before they cascade | >2x normal duration |
| Error rate | Catch model or tool failures | >10% of executions |
| Output quality | Detect model drift or prompt degradation | Spot-check weekly |
| Token usage | Track resource consumption | Unexpected spikes |
| Tool call failures | MCP server or API issues | Any persistent failure |
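The two numeric thresholds are easy to check automatically. This is a sketch that assumes you already record each run's duration and success flag; the alert names are illustrative.

```python
def check_alerts(last_duration, baseline_duration, outcomes):
    """Return alert names for the execution-time and error-rate thresholds.

    outcomes: list of booleans, True for a successful execution.
    """
    alerts = []
    if last_duration > 2 * baseline_duration:
        alerts.append("slow_run")  # more than 2x normal duration
    if outcomes:
        error_rate = outcomes.count(False) / len(outcomes)
        if error_rate > 0.10:
            alerts.append("high_error_rate")  # more than 10% of executions failed
    return alerts
```

With a baseline of 60 seconds, a 130-second run trips `slow_run`, and two failures in ten runs trips `high_error_rate`.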

Weekly Maintenance

Review error logs—fix recurring issues
Spot-check 2-3 outputs per workflow for quality
Update models if new versions improve quality
Review and trim logs (they grow fast)
Check MCP server updates

How many agent workflows can I run simultaneously?

Depends on your hardware and model sizes. A Mac Mini with 32 GB comfortably runs 3-5 concurrent lightweight workflows. A Mac Studio with 192+ GB runs 20+ concurrent workflows across multiple models. The bottleneck is usually model memory, not CPU.

Can agent workflows interact with each other?

Yes—through shared data. One workflow writes results to a file or database; another reads them. For direct coordination, use a message queue or orchestration layer. Keep interactions simple to maintain debuggability.

What's the failure rate for agent workflows?

Well-designed workflows with proper error handling run at 95%+ success rates. The remaining failures are usually transient (API timeouts, network issues) that resolve on retry. Poorly designed workflows (vague goals, no error handling) fail 20-40% of the time.

Should I use local or cloud models for agent workflows?

Local for volume, cloud for quality. If a workflow runs 50+ times per day, local models save significant money. If a workflow runs once per week and quality is critical, cloud models may be worth the cost. Most production setups use both.

How do I debug a failing agent workflow?

Logs are everything. Check: (1) What input did the agent receive? (2) What tools did it call? (3) What did the tools return? (4) What output did the agent produce? The failure is usually in step 2 or 3—a tool returned unexpected data, or the model misinterpreted the tool response.

How Do MCP Registries Work (Smithery, mcpt)?

James Cruce — Mon, 13 Apr 2026 03:43:15 GMT

There are thousands of MCP servers available. Finding the right one, evaluating whether it's trustworthy, and installing it correctly—that's where registries come in.

Here's how MCP registries work, which ones matter, and how to use them effectively.

The Short Answer

MCP registries are directories of MCP servers—searchable, categorized, and installable. They solve the discovery problem: "Which MCP server does X, and how do I install it?"

How Discovery Works

The Problem Registries Solve

Without registries, finding an MCP server means:

1. Searching GitHub for "mcp-server-[thing you want]"

2. Hoping the README has install instructions

3. Guessing if it's maintained, secure, and compatible

4. Manually configuring everything

With registries:

1. Search or browse by category

2. Read description, reviews, and install command

3. Copy-paste the command

4. Done

Search Patterns

Most registries support these discovery methods:

Category browsing: Productivity, Development, Data, Communication, Creative, Business

Keyword search: "gmail", "database", "web scraping", "calendar"

Tag filtering: "official", "verified", "popular", "new"

Sort options: Most installed, highest rated, recently updated

Smithery: The Largest Registry

Smithery is the de facto standard for MCP server discovery. Here's how to use it effectively.

Browsing Smithery

Visit smithery.ai and you'll see:

Featured servers—editorially curated highlights
Categories—organized by use case
Search—keyword search across all servers
Trending—most popular servers this week

Reading a Server Listing

Each server page shows:

| Section | What It Tells You |
|---------|------------------|
| Description | What the server does and its capabilities |
| Tools | Specific tools the server exposes (e.g., `list_events`, `create_event`) |
| Install command | Copy-paste for Claude Desktop or Claude Code |
| Configuration | Required API keys or settings |
| Author | Who built it: official orgs vs. community |
| Stats | Install count, last updated, GitHub stars |
| Reviews | User feedback on reliability and quality |

Installing from Smithery

For Claude Desktop:

Smithery provides the exact JSON block to add to your config file:

{
  "mcpServers": {
    "server-name": {
      "command": "npx",
      "args": ["-y", "@scope/mcp-server-name"],
      "env": {
        "API_KEY": "your-key"
      }
    }
  }
}

Copy it, paste it into `claude_desktop_config.json`, add your API key, restart Claude.
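If you would rather not hand-edit JSON, the merge can be scripted. This is a sketch under the assumption that you pass the path to your own `claude_desktop_config.json`; the `add_server` helper is hypothetical and not part of any registry's tooling.

```python
import json
from pathlib import Path


def add_server(config_path, name, command, args, env=None):
    """Add or replace one entry under "mcpServers", keeping existing servers."""
    path = Path(config_path)
    config = json.loads(path.read_text()) if path.exists() else {}
    servers = config.setdefault("mcpServers", {})
    entry = {"command": command, "args": list(args)}
    if env:
        entry["env"] = dict(env)
    servers[name] = entry
    path.write_text(json.dumps(config, indent=2) + "\n")
    return config
```

Calling `add_server(path, "server-name", "npx", ["-y", "@scope/mcp-server-name"], {"API_KEY": "your-key"})` produces the same block shown above while leaving any servers already in the file untouched. Back up the file before running anything that rewrites it.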

For Claude Code:

Smithery often shows the CLI command:

claude mcp add server-name -- npx -y @scope/mcp-server-name

Evaluating Quality on Smithery

Not all servers are equal. Here's how to assess quality:

| Signal | Good Sign | Warning Sign |
|--------|-----------|-------------|
| Author | Official org or verified developer | Anonymous, no GitHub link |
| Last updated | Within the past 3 months | Over 6 months ago |
| Install count | Hundreds or thousands | Single digits |
| GitHub stars | Active community | No repository linked |
| Reviews | Specific positive feedback | No reviews or vague complaints |
| Tools listed | Clear, well-documented tools | Vague or missing tool descriptions |
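Several of these warning signs can be checked mechanically once you have a listing's metadata in hand. The sketch below assumes a plain dictionary with hypothetical field names (`last_updated`, `installs`, `repo_url`, `tools`); no registry exposes exactly this shape as far as I know.

```python
from datetime import date


def quality_warnings(listing: dict, today: date) -> list:
    """Return the warning signs a listing trips, using the thresholds in the table."""
    warnings = []
    if (today - listing["last_updated"]).days > 180:
        warnings.append("last updated over 6 months ago")
    if listing["installs"] < 10:
        warnings.append("single-digit install count")
    if not listing.get("repo_url"):
        warnings.append("no repository linked")
    if not listing.get("tools"):
        warnings.append("missing tool descriptions")
    return warnings
```

An empty list does not mean a server is safe; it only means none of the obvious signals fired, so the manual review still applies.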

mcpt: The Curated Alternative

mcpt takes a quality-first approach. Fewer servers, but higher average reliability.

The mcpt CLI

mcpt provides a command-line tool for managing MCP servers:

# Install the CLI
npm install -g mcpt

# Search for servers
mcpt search calendar

# Install a server
mcpt install google-calendar

# List installed servers
mcpt list

# Update all servers
mcpt update

Advantages of mcpt

| Feature | Smithery | mcpt |
|---------|----------|------|
| Curation | Community-driven | Editorially reviewed |
| Install method | Manual config editing | CLI tool handles everything |
| Updates | Manual | `mcpt update` handles all |
| Quality bar | Low (anyone can list) | Higher (review process) |
| Size | Largest selection | Smaller, more reliable |

When to Use mcpt vs. Smithery

Use Smithery when you need a server for an obscure tool or want maximum choice
Use mcpt when you want reliable, well-maintained servers with easy management

OpenTools and Other Registries

OpenTools

OpenTools is a growing registry with a clean search interface. It focuses on categorization and discoverability. Worth checking if you don't find what you need on Smithery.

npm Direct

Most Node.js-based MCP servers are published to npm. You can search npm directly:

npm search mcp-server

This gives you access to everything, including servers not yet listed on any registry. But there's no curation, reviews, or quality signals—you're on your own to evaluate.

GitHub

Many MCP servers live on GitHub before they're registered anywhere. Search GitHub for:

`mcp-server` (general)
`modelcontextprotocol` (official repos)
`mcp-server-[tool-name]` (specific tools)

GitHub gives you access to source code, issues, commit history, and contributor activity—the deepest quality signals available.

Security Considerations

MCP servers run code on your machine. Take security seriously.

Before Installing Any Server

1. Check the source. Is the GitHub repo linked? Can you see the code?

2. Check the author. Is it an organization you recognize? A developer with a history?

3. Read the permissions. What tools does it expose? What data can it access?

4. Check for credentials. Does it need API keys? Where does it send data?

5. Check freshness. When was it last updated? Are dependencies current?

Red Flags

| Red Flag | Risk |
|----------|------|
| No source code available | Can't verify what the code does |
| Requests unusual permissions | May access more than needed |
| No clear author or organization | Harder to trust, no accountability |
| Hasn't been updated in 12+ months | May have unpatched vulnerabilities |
| Very few installs, no reviews | Unvalidated by the community |

Best Practices

Start with official servers from Anthropic, Google, Microsoft, and established organizations
Read the code if you're installing a community server that handles sensitive data
Scope permissions—only give file system access to directories you actually need
Monitor behavior—check logs if a server seems to be making unexpected network calls

How I Actually Do This

I use a mix of Smithery, npm, and direct GitHub sources for my MCP server setup.

My Discovery Workflow

1. Need a server → Search Smithery first (broadest selection)

2. Found candidates → Check GitHub repo for each (code quality, maintenance)

3. Evaluate → Look at install count, last update, and whether tools match my needs

4. Test → Install and test with a simple prompt before adding to automation

5. Production → Only servers that pass testing go into my daily workflow

Building for Registries

I'm building `mcp-astgl-knowledge` — an MCP server that exposes all 20 articles in this series as searchable knowledge. The plan:

1. Build with TypeScript and the `@modelcontextprotocol/sdk`

2. Publish to npm (`npm publish`)

3. Register on Smithery (submit listing with description, tools, install command)

4. Register on mcpt (submit for review)

5. Maintain—update when new articles are added, respond to issues

This will be a real example of the full lifecycle: build → publish → register → maintain. I'll update this article with the actual experience once it's done.

What I've Learned

1. Smithery is the starting point for everyone. It has the most servers and the most familiar interface. Start there.

2. mcpt's CLI is underrated. Managing updates across 15 servers manually is tedious. `mcpt update` handles it.

3. Official servers are worth the premium. Anthropic's official MCP servers for filesystem, web search, and databases are rock-solid. Community servers vary widely in quality.

4. Read the tools list carefully. A "Gmail MCP server" might only support reading emails, not sending them. The tools list tells you exactly what's possible.

5. The ecosystem is young and growing fast. New servers appear daily. Check registries monthly—the server you wished existed last month might exist now.

Publishing Your Own MCP Server

If you've built something useful, publishing it helps the community and builds your reputation.

The Publishing Process

1. Build your server using the MCP SDK

2. Test it locally with Claude Desktop or Claude Code

3. Publish the npm package: `npm publish`

4. Register on Smithery: Submit your listing with description, install command, and documentation

5. Register on mcpt: Submit for editorial review

6. Maintain: Respond to issues, update dependencies, improve based on feedback

What Makes a Good MCP Server Listing

| Element | Why It Matters |

|---------|---------------|

| Clear description | Users need to know what it does in 2 sentences |

| Tool documentation | Every tool should have a name, description, and example |

| Install command | Copy-paste ready for Claude Desktop and Claude Code |

| Configuration guide | What API keys or settings are needed |

| Source code link | Transparency builds trust |

| Changelog | Shows the server is actively maintained |

Are MCP registries safe?

Registries are directories, not security guarantees. They make discovery easier but don't audit every server's code. Treat registry listings like npm packages—check the source, author, and community signals before installing. Official and popular servers are generally safe. Obscure, unreviewed servers deserve more scrutiny.

Can I use MCP servers not listed on any registry?

Yes. Any MCP server can be installed manually by pointing your config to the npm package or local path. Registries are for discovery—they're not gatekeepers. You can even build private MCP servers that never appear on any registry.

How often should I update my MCP servers?

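Monthly is a good cadence. Security patches, bug fixes, and new features arrive regularly. If you use mcpt, `mcpt update` handles everything. For manual installs via `npx`, you usually get the latest published version at launch, but npx can reuse a cached copy, so ask for `package-name@latest` when you want to force a fresh check.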

Will there be one dominant MCP registry?

Probably not—the ecosystem benefits from multiple registries with different strengths. Smithery for breadth, mcpt for quality, npm for raw access. This mirrors how package managers work: npm, GitHub, and specialized registries coexist.

Can I run my own private MCP registry?

Yes, for organizations that want to share internal MCP servers. MCP doesn't require public registries; any endpoint your team can discover works. Some companies run internal registries for proprietary servers that access internal systems.

*This is part of the ASTGL Definitive Answers series—structured, practical answers to the questions people actually ask about AI automation, MCP servers, and local AI infrastructure.*

What's the ROI of Local AI Infrastructure?

James Cruce — Mon, 13 Apr 2026 03:29:13 GMT

The question isn't whether local AI saves money—it does. The question is how fast and how much, based on your specific usage pattern.

Here's the real math, with actual hardware costs, cloud API pricing, and the breakeven points where local infrastructure pays for itself.

The Short Answer

Local AI has high upfront cost and near-zero ongoing cost. Cloud AI has zero upfront cost and scales linearly forever. The crossover point depends on your usage volume.

| | Local AI | Cloud AI |

|--|----------|----------|

| Upfront cost | $600-8,000 (hardware) | $0 |

| Monthly cost | $5-15 (electricity) | $50-5,000+ (API fees) |

| Per-call cost | $0 | $0.001-0.10 per call |

| Scales with usage | No—flat cost | Yes—more usage = more cost |

| Quality ceiling | Very good (not frontier) | Frontier models available |

| Privacy | Complete—data stays local | Data sent to provider |

Rule of thumb: If you'd spend more than $100/month on API calls, local AI probably pays for itself within a year.

The Cloud Cost Reality

Cloud AI pricing is per-token. Here's what real usage patterns cost:

Typical Monthly Cloud Costs

| User Type | Calls/Day | Model | Monthly Cost |

|--------------|-----------|-------|-------------|

| Casual user | 10-20 | Claude Sonnet | $10-30 |

| Power user | 50-100 | Claude Sonnet | $50-200 |

| Developer with AI tools | 200-500 | Mixed models | $200-800 |

| Automated workflows | 500-1,000 | Claude Haiku + Sonnet | $500-2,000 |

| Full automation pipeline | 2,000-5,000 | Mixed models | $2,000-8,000 |

| Enterprise scale | 10,000+ | Mixed models | $10,000+ |

The jump from casual to automated is where costs explode. A morning briefing that runs daily, a content pipeline that generates articles, and notification routing that processes hundreds of messages—these add up fast.

The Subscription Alternative

Claude Pro ($20/month) and Claude Max ($100-200/month) offer high-volume access at flat rates. These are excellent value for interactive use. But they have rate limits that don't work well for automated pipelines running 24/7.

The Local Cost Reality

Hardware Options

| Device | RAM | Price | Best For |

|--------|-----|-------|----------|

| Mac Mini M4 | 32 GB | ~$800 | Entry-level: runs 7-12B models comfortably |

| Mac Mini M4 Pro | 48 GB | ~$1,400 | Mid-range: runs 26B models, 2-3 concurrent |

| Mac Studio M3 Max | 96 GB | ~$3,000 | Serious: runs 70B models, full automation |

| Mac Studio M3 Ultra | 192 GB | ~$5,000 | Professional: multiple large models simultaneously |

| Mac Studio M3 Ultra | 512 GB | ~$8,000 | Maximum: every model, every size, all at once |

Apple Silicon advantage: Unified memory means the GPU can access all system RAM. A 192 GB Mac Studio can run models that would require multiple $2,000 GPUs on Linux.

Ongoing Costs

| Cost | Monthly | Annual |

|------|---------|--------|

| Electricity (always-on Mac Studio) | $10-15 | $120-180 |

| Internet (already have it) | $0 incremental | $0 |

| Software (Ollama, open-source models) | $0 | $0 |

| Maintenance time (~2 hours/month) | Time cost | Time cost |

| Total cash cost | $10-15 | $120-180 |

Breakeven Analysis

Scenario 1: Light Automation

Setup: Mac Mini 32 GB ($800) running morning briefings and email triage.

Cloud alternative: ~$150/month in API calls (500 calls/day, mixed models).

Breakeven: $800 ÷ $150/month = 5.3 months

Year 1 savings: ($150 × 12) - $800 - $150 electricity = $850

Scenario 2: Content Creator

Setup: Mac Mini 48 GB ($1,400) running content pipeline, research, and repurposing.

Cloud alternative: ~$400/month in API calls (content generation is token-heavy).

Breakeven: $1,400 ÷ $400/month = 3.5 months

Year 1 savings: ($400 × 12) - $1,400 - $150 = $3,250

Scenario 3: Full Automation

Setup: Mac Studio 192 GB ($5,000) running 26 daily tasks, content pipeline, multi-agent council.

Cloud alternative: ~$2,000/month in API calls (thousands of daily calls across multiple agents).

Breakeven: $5,000 ÷ $2,000/month = 2.5 months

Year 1 savings: ($2,000 × 12) - $5,000 - $180 = $18,820

The Pattern

The more you automate, the faster local infrastructure pays for itself. Light users might take a year to break even. Heavy automation users break even in months.
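If you want to rerun the three scenarios with your own numbers, here's a small Python sketch. It uses the same formulas as above: breakeven is hardware cost divided by the monthly cloud estimate, and year-one savings subtracts hardware and electricity from twelve months of cloud spend. The `Scenario` class and its field names are illustrative, not part of any tool.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    name: str
    hardware_cost: float       # one-time purchase, USD
    cloud_monthly: float       # estimated cloud API spend avoided, USD/month
    electricity_yearly: float  # first-year electricity, USD

    def breakeven_months(self) -> float:
        # Simple breakeven: months of avoided cloud spend needed to cover the hardware.
        return self.hardware_cost / self.cloud_monthly

    def year_one_savings(self) -> float:
        return self.cloud_monthly * 12 - self.hardware_cost - self.electricity_yearly


scenarios = [
    Scenario("Light automation", 800, 150, 150),
    Scenario("Content creator", 1400, 400, 150),
    Scenario("Full automation", 5000, 2000, 180),
]

for s in scenarios:
    print(f"{s.name}: breakeven {s.breakeven_months():.1f} months, "
          f"year-1 savings ${s.year_one_savings():,.0f}")
```

Swap in your own hardware price and a month of real API invoices, and the same two formulas tell you where you land.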

Beyond Dollar Savings: Hidden ROI

The financial math is compelling, but the less obvious benefits matter too.

Privacy ROI

With local AI, sensitive business data never leaves your machine. No data processing agreements. No compliance concerns about which country your data is processed in. No risk of training data leakage.

For regulated industries (healthcare, legal, finance), this alone can justify the hardware cost—the alternative is expensive enterprise AI contracts with compliance guarantees.

Availability ROI

Cloud APIs have outages. Rate limits. Capacity constraints during peak hours. Your automated pipeline at 6 AM shouldn't depend on whether a cloud provider's servers are congested.

Local AI is available whenever your computer is on. No rate limits. No outages (except your own hardware). No "please try again later."

Latency ROI

Local inference is fast, especially on Apple Silicon. With a Gemma 4 26B running locally there's no network round trip, so responses often start sooner than they do from a cloud API. Sustained tokens per second still depends on your hardware and model size.

For interactive use, this means snappier responses. For automation, this means faster pipeline throughput.

Experimentation ROI

When every API call costs money, you hesitate to experiment. With local models, experimentation is free. Try 50 different prompt variations. Run A/B tests on voice profiles. Process your entire email archive to build training data. The marginal cost is zero.

This freedom to experiment accelerates learning and leads to better automation designs.

How I Actually Do This

I run a Mac Studio M3 Ultra with 256 GB unified memory. Here's the real financial picture:

My Costs

| Item | Cost |

|------|------|

| Mac Studio M3 Ultra 256 GB | $7,000 (one-time) |

| Electricity (~120W average, 24/7) | ~$12/month |

| Cloud Claude (10% of tasks) | ~$20/month (Pro subscription) |

| Total monthly ongoing | ~$32/month |

What I'd Pay With Cloud APIs

| Workload | Estimated Monthly Cloud Cost |

|----------|----------------------------|

| 26 scheduled agent tasks | $800-1,200 |

| Content pipeline (ACA Council) | $400-600 |

| Ad-hoc development assistance | $200-400 |

| Research and analysis | $100-200 |

| Total estimated | $1,500-2,400/month |

My Breakeven

$7,000 ÷ $1,500/month = 4.7 months

I passed breakeven months ago. Every month now is pure savings.
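The 4.7-month figure divides the hardware price by the gross cloud estimate. A slightly stricter version subtracts the ongoing local costs first. Here's a sketch of that variation in Python; `net_breakeven_months` is a hypothetical helper, and the inputs are the numbers from the two tables above.

```python
def net_breakeven_months(hardware_cost, cloud_monthly, local_monthly):
    """Months to recover the hardware cost from net monthly savings."""
    net_savings = cloud_monthly - local_monthly
    if net_savings <= 0:
        raise ValueError("local running costs meet or exceed the cloud estimate")
    return hardware_cost / net_savings


# $7,000 hardware, ~$32/month ongoing, cloud estimate of $1,500-2,400/month.
slow = net_breakeven_months(7000, 1500, 32)
fast = net_breakeven_months(7000, 2400, 32)
print(f"Breakeven between {fast:.1f} and {slow:.1f} months")
```

Counting the ongoing $32 a month barely moves the result, which is the point: at this volume the hardware price dominates and running costs are noise.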

The Honest Caveats

1. Cloud Claude is still better for some tasks. Complex architectural decisions, nuanced code review, novel problem-solving—I still reach for cloud Claude. Local models handle 90% of the volume but not 100% of the difficulty.

2. Setup time is real. I spent about 40 hours over several weeks getting the full automation stack running. That's an investment of time that wouldn't exist with cloud APIs.

3. Hardware depreciates. In 3-4 years, I'll want newer hardware. The Mac Studio will still work, but newer models will be faster and more capable. Budget for replacement cycles.

4. Not everyone needs this. If you make 20 AI calls a day interactively, a Claude Pro subscription ($20/month) is the right answer. Local infrastructure makes sense when you're automating at volume.

Decision Framework

Choose Cloud When:

- You're just starting with AI (explore before investing)
- Your usage is primarily interactive (chatting, not automating)
- You need frontier model quality for every task
- You want zero hardware management
- Monthly API costs stay under $100

Choose Local When:

- You're automating workflows that run daily/hourly
- Privacy is a requirement (regulated industry, sensitive data)
- You'd spend $200+/month on cloud APIs
- You want unlimited experimentation without cost anxiety
- You're running multiple concurrent AI tasks

Choose Hybrid (Best for Most):

- Local models for volume tasks (triage, automation, content generation)
- Cloud models for high-value tasks (complex reasoning, frontier quality)
- Result: 90% of compute is free, 10% is cloud-quality

Can I start with a cheaper machine and upgrade later?

Absolutely. A Mac Mini with 32 GB ($800) runs a solid automation stack. If you outgrow it, sell it (Macs hold resale value well) and upgrade. You don't need to start with the most expensive option.

What about Linux with NVIDIA GPUs?

Competitive for raw inference speed—an RTX 4090 (24 GB VRAM) is fast. But limited VRAM means you can only run one large model at a time. For multi-model architectures (triage + workhorse + specialist), Apple Silicon's unified memory is more flexible. Linux rigs are better for single-model, high-throughput workloads.

Does model quality improve fast enough to justify local hardware?

Yes. Open-source models improve dramatically every 6-12 months. A 26B model today outperforms a 70B model from two years ago. Your hardware runs better models over time without any additional cost—just download the new model.

What if I already have a powerful gaming PC?

If it has an NVIDIA GPU with 12+ GB VRAM, you can run local AI today at zero additional cost. Install Ollama, pull a model, and start experimenting. This is the cheapest possible entry point.

Is the electricity cost significant?

No. A Mac Studio draws about 40-120W depending on load. At US average electricity rates (~$0.15/kWh), that's $4-13/month running 24/7. An RTX 4090 draws more (300-450W under load) but idles much lower. Electricity is a rounding error compared to API costs.
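To check that range against your own utility rate, here's the arithmetic as a short Python sketch. It assumes a constant average draw and a 730-hour month; the function name is just for illustration.

```python
HOURS_PER_MONTH = 24 * 365 / 12  # 730 hours in an average month


def monthly_electricity_cost(watts, usd_per_kwh=0.15):
    """Cost of running a device 24/7 at a constant average draw."""
    kwh = watts / 1000 * HOURS_PER_MONTH
    return kwh * usd_per_kwh


for watts in (40, 120):
    print(f"{watts} W around the clock: ${monthly_electricity_cost(watts):.2f}/month")
```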

*This is part of the ASTGL Definitive Answers series—structured, practical answers to the questions people actually ask about AI automation, MCP servers, and local AI infrastructure.*

How Do I Build an AI Pipeline for Content Creation?

James Cruce — Mon, 13 Apr 2026 03:11:56 GMT

Writing one blog post is straightforward. Turning that post into a newsletter, social threads, SEO metadata, and platform-specific variations—that's where the hours disappear.

A content pipeline automates the repetitive parts so you focus on ideas, not formatting. Here's how to build one, from simple to fully autonomous.

The Short Answer

An AI content pipeline takes raw ideas and produces finished, distributed content through automated stages. The simplest pipeline has 3 stages. A production pipeline has 7 or more.

Pipeline Architecture: The 7 Stages

Every content pipeline, regardless of complexity, follows the same logical stages. You can automate as many or as few as you want.

Stage 1: Discovery

What: Find topics worth writing about.

How: AI monitors news feeds, social media trends, competitor content, and search queries in your niche. It surfaces topics with high interest and low competition.

Manual version: You browse industry sites and note ideas.

Automated version: A scheduled agent searches 10+ sources daily and delivers a ranked topic list every morning.

Stage 2: Research

What: Gather information, data, and sources for the chosen topic.

How: AI searches the web, reads relevant articles, extracts key facts, and compiles a structured research brief.

Output: A 1-2 page brief with key points, statistics, source URLs, and suggested angles.

Stage 3: Drafting

What: Write the first draft.

How: AI takes the research brief, applies your voice profile and article template, and generates a full draft.

Critical input: A voice profile—a document that describes your writing style, preferred vocabulary, sentence patterns, and tone. Without this, AI output sounds generic. With it, the draft sounds like you.

Stage 4: Editing

What: Improve the draft's quality, flow, and accuracy.

How: AI reviews for clarity, removes filler, tightens sentences, checks structure against the template, and verifies the voice matches your profile.

This is best done as a separate pass with a different prompt than the drafting stage. A fresh "editor" perspective catches issues the "writer" misses.

Stage 5: Fact-Checking

What: Verify claims, statistics, and technical accuracy.

How: AI identifies every factual claim in the article, searches for supporting sources, flags anything unverified, and provides confidence ratings.

Two-phase approach:

1. Extraction: Pull every claim from the article

2. Verification: Check each claim against web sources and known data

This step catches hallucinations before they reach your audience.
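The two-phase shape is easy to see in code. Below is a minimal Python sketch with deliberately naive stand-ins: extraction treats any sentence containing a digit as a claim, and verification does a string lookup against a dictionary of facts you already trust. In a real pipeline both phases would be model calls with web search. Every name here (`Claim`, `extract_claims`, `verify_claims`) is hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Claim:
    text: str
    verified: bool = False
    source: Optional[str] = None


def extract_claims(article: str) -> list:
    """Phase 1 (naive stand-in): any sentence containing a digit counts as a claim."""
    sentences = re.split(r"(?<=[.!?])\s+", article.strip())
    return [Claim(s) for s in sentences if re.search(r"\d", s)]


def verify_claims(claims: list, known_facts: dict) -> list:
    """Phase 2 (naive stand-in): a claim is verified if it contains a known fact."""
    for claim in claims:
        for fact, source in known_facts.items():
            if fact.lower() in claim.text.lower():
                claim.verified = True
                claim.source = source
                break
    return claims


article = (
    "Local models are flexible. "
    "A Mac Mini with 32 GB costs about $800. "
    "Cloud API calls run about $150 per month."
)
report = verify_claims(extract_claims(article), {"$800": "hardware price table"})
for claim in report:
    status = f"verified ({claim.source})" if claim.verified else "UNVERIFIED"
    print(f"{status}: {claim.text}")
```

Keeping the phases separate means the unverified list is explicit. Anything that comes back flagged goes to the human review step instead of being silently dropped or silently published.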

Stage 6: Repurposing

What: Transform the article into platform-specific content pieces.

How: AI reads the finished article and generates variations for every distribution channel.

| Derivative | Platform | Format |

|-----------|----------|--------|

| 5 social posts | LinkedIn, X, Facebook, Instagram, Threads | Platform-native length and style |

| 5 short-form notes | Substack Notes | Teaser + link |

| 1 newsletter intro | Email | Hook paragraph + article link |

| 1 SEO document | Search engines | Meta description, keywords, schema markup |

| 1 voiceover script | Podcast / video | Conversational spoken version |

| 5 graphic suggestions | Design tools | Visual concepts with text overlays |

| 3 pull quotes | Social graphics | Shareable quote cards |

One article → 25+ content pieces. The marginal cost of each derivative is near zero.

Stage 7: Publishing

What: Distribute content to all channels.

How: Automated posting through APIs, MCP servers, or scheduling tools.

Current best practice: Publish to your own site first (canonical URL), then syndicate to Substack, social platforms, and newsletters. This is POSSE: Publish (on your) Own Site, Syndicate Elsewhere.

Building Your First Pipeline

Start simple. You can always add stages.

Level 1: Manual Pipeline (30 Minutes)

Tools needed: Claude with filesystem MCP server.

1. Write an article (or paste an existing one)

2. Ask Claude: "Read this article and generate 5 LinkedIn posts, 5 tweets, a newsletter intro, and SEO metadata."

3. Review and publish manually

Time investment: 30 minutes per article cycle.

Output: 1 article + ~15 derivatives.

Level 2: Template Pipeline (15 Minutes)

Tools needed: Claude Code with custom commands.

Create a slash command (`.claude/commands/repurpose.md`) that contains your repurposing prompt with voice profile, platform specs, and output format. Then:

/repurpose path/to/article.md

One command generates all derivatives. You review and publish.

Time investment: 15 minutes per article cycle.

Output: 1 article + 25+ derivatives.

Level 3: Automated Pipeline (5 Minutes of Review)

Tools needed: Local AI (Ollama), scheduling (OpenClaw/cron/n8n), MCP servers.

The pipeline runs on schedule:

1. Discovery agent finds a topic (or you assign one)

2. Research agent compiles a brief

3. Draft agent writes the article

4. Edit agent polishes it

5. Fact-check agent verifies claims

6. Repurpose agent generates derivatives

7. Draft appears in your review queue

Your only job: Read the final output, approve or request changes, hit publish.

Time investment: 5 minutes per article cycle (review only).

Output: 1 article + 25+ derivatives, fully automated.
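Structurally, a Level 3 pipeline is a list of stages that each take the current state and add their output to it. Here's a minimal Python sketch of that skeleton. The stages are placeholders that write a dummy string where a real stage would call a model, and names like `make_stage` and `run_pipeline` are hypothetical rather than taken from OpenClaw, n8n, or any other tool.

```python
from typing import Callable, Optional

Stage = Callable[[dict], dict]


def make_stage(name: str, output_key: str) -> Stage:
    """Build a placeholder stage; a real one would call a model here."""
    def stage(state: dict) -> dict:
        if output_key in state:
            return state  # already supplied, e.g. a topic you assigned yourself
        new_state = dict(state)
        new_state[output_key] = f"<{name} output>"
        new_state["history"] = state.get("history", []) + [name]
        return new_state
    return stage


PIPELINE = [
    make_stage("discovery", "topic"),
    make_stage("research", "brief"),
    make_stage("draft", "draft"),
    make_stage("edit", "edited"),
    make_stage("fact_check", "claims_report"),
    make_stage("repurpose", "derivatives"),
]


def run_pipeline(stages: list, state: Optional[dict] = None) -> dict:
    """Run each stage in order, then park the result for human review."""
    state = dict(state or {})
    for stage in stages:
        state = stage(state)
    state["status"] = "awaiting_review"  # step 7: nothing publishes without approval
    return state


result = run_pipeline(PIPELINE, {"topic": "MCP registries"})
print(result["history"], result["status"])
```

Passing state through as a plain dictionary makes it easy to add or remove a stage later, which matches the advice to automate one stage at a time.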

Voice Profiles: The Secret to Human-Sounding AI Content

The #1 differentiator between generic AI content and content that sounds like you is the voice profile.

What a Voice Profile Contains

## Writing Voice: [Your Name]

### Tone
- Conversational but authoritative
- Direct — lead with the answer, not the preamble
- Practical over theoretical

### Sentence Structure
- Short sentences (10-15 words average)
- Active voice (95%+)
- Start paragraphs with statements, not questions

### Vocabulary
- Plain language — "use" not "utilize", "help" not "facilitate"
- Technical terms only when the audience expects them
- No corporate jargon: avoid "leverage", "synergy", "paradigm"

### Patterns to Follow
- Open with a hook that acknowledges the reader's problem
- Use tables for comparisons
- Include "How I Actually Do This" sections with real examples
- End FAQ sections with practical answers, not theory

### Patterns to Avoid
- Never use "In today's fast-paced world..."
- Never start with a definition from Wikipedia
- No filler paragraphs that don't add information
- Don't hedge excessively — state positions clearly

How to Build Your Voice Profile

1. Gather 5-10 pieces of your best writing—articles, emails, presentations

2. Ask AI to analyze your voice: "Read these writing samples and describe my writing style—tone, sentence structure, vocabulary choices, recurring patterns."

3. Edit the analysis—AI will capture 80%, you refine the rest

4. Include it in every content prompt—either inline or as a system prompt

How I Actually Do This

I run a multi-agent content pipeline called the ACA Council. Five specialized agents handle different stages:

The ACA Council

| Agent | Role | What It Does |

|-------|------|-------------|

| SCOUT | Discovery & Research | Monitors 15+ sources, surfaces trending topics, compiles research briefs |

| FORGE | Outlining & Structure | Takes research briefs and generates structured article outlines |

| QUILL | Drafting | Writes full articles following voice profile and article templates |

| LEDGER | Fact-Checking & Validation | Two-phase verification of every claim, flags uncertainties |

| MAVEN | SEO & Distribution | Generates all derivative content, optimizes for search and social |

The Daily Schedule

The Council meets twice daily:

- Morning session (7 AM): SCOUT presents research, team prioritizes topics
- Evening session (8 PM): Review completed articles, queue for publishing

The 7-Step Build Pipeline

For each article:

1. SCOUT delivers a research brief

2. FORGE generates the outline

3. QUILL writes the draft using voice profile

4. LEDGER fact-checks (two-phase: extract claims → verify sources)

5. Human review checkpoint (me—5 minutes)

6. MAVEN generates 25+ derivatives

7. Auto-publish to site, sync to Substack, queue social posts

Results

- Output: 3-5 articles per week with full distribution packages
- My time per article: ~10 minutes (topic approval + final review)
- AI time per article: ~20 minutes of model compute
- Cloud API cost: $0/month, because everything runs on local Ollama models
- Quality: Consistent voice, verified facts, platform-optimized distribution

Common Pipeline Mistakes

| Mistake | Why It Fails | Fix |

|---------|-------------|-----|

| No voice profile | Content sounds generic and robotic | Build a voice profile from your existing writing |

| Skipping fact-check | AI hallucinates statistics and quotes | Always run a verification pass before publishing |

| One-shot generation | Single prompt produces mediocre results | Split into stages—research, draft, edit are separate steps |

| No human review | Errors slip through, voice drifts | Always scan final output before publishing |

| Over-automating too early | Complex pipeline before simple one is proven | Start manual, automate one stage at a time |

How long does it take to build a content pipeline?

A basic manual pipeline (Level 1) takes 30 minutes to set up. A template pipeline (Level 2) takes 1-2 hours. A fully automated pipeline (Level 3) takes 1-2 weeks of iterating on prompts, voice profiles, and scheduling. Start at Level 1 and upgrade when you feel the friction.

Will Google penalize AI-generated content?

Google penalizes low-quality content, regardless of how it's created. AI content that's well-researched, accurate, and genuinely useful ranks well. The key factors: original insights (your "How I Actually Do This" sections), verified facts, and genuine expertise. A pipeline that produces generic AI slop will be penalized. One that produces expert-informed, fact-checked content won't.

Can I use this for client work?

Yes, with transparency. Many agencies use AI pipelines to increase throughput. The ethical approach: AI handles research, drafting, and repurposing; humans provide expertise, review, and final approval. Disclose AI assistance if your clients expect it.

How do I handle topics the AI gets wrong?

This is why the fact-checking stage exists. For technical topics, include authoritative sources in the research brief so the AI has correct information to work from. For evolving topics, always include a web search step to get current data. And always review—the pipeline produces drafts, not published pieces.

What's the best AI model for content pipelines?

For drafting: Gemma 4 26B or 31B (best voice quality at the local tier). For fact-checking: a model with web search access. For repurposing: Gemma 4 26B (handles format transformation well). For the full pipeline on a budget: a single Gemma 4 26B handles all stages adequately.

*This is part of the ASTGL Definitive Answers series—structured, practical answers to the questions people actually ask about AI automation, MCP servers, and local AI infrastructure.*

Can Small Businesses Benefit from MCP Servers?

James Cruce — Mon, 13 Apr 2026 03:02:20 GMT

Small businesses run on limited time and limited people. Every hour spent on repetitive tasks—reformatting content, sorting email, prepping for meetings—is an hour not spent growing the business.

MCP servers change that math. They let AI handle the repetitive work using the tools you already have. No enterprise budget. No IT department. No coding.

The Short Answer

Yes. MCP servers give small businesses access to the same AI automation capabilities that enterprises are spending millions on—at a fraction of the cost. The ROI is immediate and measurable.

| Business Size | Without MCP | With MCP |

|--------------|-------------|----------|

| Solo operator | You do everything manually | AI handles content, email, research while you focus on clients |

| Small team (2-10) | Everyone wears multiple hats | AI automates shared workflows — briefings, reports, scheduling |

| Growing business (10-50) | Processes are inconsistent, tribal knowledge | AI standardizes workflows and makes knowledge accessible |

The Five Highest-ROI Automations for Small Business

These five automations deliver the most time savings for the least setup effort. Start with one, prove the value, then add more.

1. Content Repurposing (Save 2-4 Hours Per Article)

You write one blog post. MCP-connected AI turns it into:

| Output | Platform | Time to Create Manually |

|--------|----------|------------------------|

| 5 social media posts | LinkedIn, X, Facebook, Instagram, Threads | 45 min |

| 1 newsletter intro | Email / Substack | 20 min |

| 1 SEO summary | Google / site meta | 15 min |

| 3 pull quotes | Social graphics | 15 min |

| 1 thread/carousel | X or LinkedIn | 30 min |

| 5 Substack Notes | Substack | 25 min |

| 1 voiceover script | Podcast / video | 20 min |

| 5 alt-angle headlines | A/B testing | 15 min |

Total manual time: ~3 hours. With AI: ~15 minutes (review and approve drafts).

The AI reads your original article through the filesystem MCP server, generates all the variations, and saves them as files. You review, tweak if needed, and publish.

One article → 25+ content pieces. For a small business publishing weekly, that's 12+ hours saved per month.

2. Morning Briefing (Save 30 Minutes Per Day)

Instead of checking email, calendar, tasks, and news manually each morning:

Setup: Connect Calendar + Email + Task Manager + Web Search MCP servers.

What AI delivers at 6:30 AM:

- Today's meetings with prep notes
- Important emails that need attention (prioritized)
- Overdue tasks
- Relevant industry news

Monthly time saved: ~10 hours

This one automation pays for a Claude Pro subscription ($20/month) in the first week.

3. Email Triage and Draft Responses (Save 30-60 Minutes Per Day)

Setup: Connect Gmail MCP server.

How it works:

- AI reads incoming emails
- Classifies by priority (urgent, important, routine, spam)
- Drafts responses for routine emails
- Flags emails that need your personal attention

You review the drafts, hit send on the good ones, and personally handle only the ones that matter. Instead of reading 50 emails, you review 10 drafts and write 5 personal responses.

Monthly time saved: 10-20 hours

4. Meeting Prep and Follow-Up (Save 20-30 Minutes Per Meeting)

Setup: Connect Calendar + Email + CRM MCP servers.

Before the meeting, AI generates:

- Attendee background (from CRM and recent emails)
- Last interaction summary
- Suggested talking points
- Relevant documents or proposals

After the meeting, AI generates:

- Action item list
- Follow-up email drafts
- Updated CRM notes
- Calendar entries for follow-ups

For a business with 5 meetings per week, that's 8-12 hours saved per month.

5. Competitive Research (Save 1-2 Hours Per Week)

Setup: Connect Web Search MCP server.

Weekly automated research:

- Competitor website changes
- New product announcements in your space
- Industry news and trends
- Social media mentions of competitors

AI summarizes everything into a structured brief. You read a 2-page summary instead of spending hours browsing websites and social feeds.

Monthly time saved: 4-8 hours

The ROI Math

Let's make this concrete. Assume a small business owner's time is worth $75/hour (conservatively).

| Automation | Monthly Hours Saved | Monthly Value |

|-----------|-------------------|---------------|

| Content repurposing | 12 hours | $900 |

| Morning briefing | 10 hours | $750 |

| Email triage | 15 hours | $1,125 |

| Meeting prep/follow-up | 10 hours | $750 |

| Competitive research | 6 hours | $450 |

| Total | 53 hours | $3,975 |

Monthly cost:

- Claude Pro: $20/month
- Or local models (Ollama on a Mac Mini): ~$600 one-time, no monthly API fees

Even with just one or two of these automations, the ROI is overwhelming.
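To run the same math with your own hourly rate, or with only the automations you plan to set up, here's the table as a few lines of Python. The hours are the midpoints used in the table; the variable names are just for illustration.

```python
HOURLY_RATE = 75  # USD; swap in what your own time is worth

hours_saved = {
    "Content repurposing": 12,
    "Morning briefing": 10,
    "Email triage": 15,
    "Meeting prep/follow-up": 10,
    "Competitive research": 6,
}

total_hours = sum(hours_saved.values())
total_value = total_hours * HOURLY_RATE

for name, hours in hours_saved.items():
    print(f"{name}: {hours} h = ${hours * HOURLY_RATE:,}")
print(f"Total: {total_hours} h = ${total_value:,} per month")
```

Delete the rows you won't automate and the total still tells you whether a $20 subscription or a one-time hardware purchase is worth it.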

Getting Started: The 30-Minute Setup

Here's the fastest path from zero to your first useful automation.

Minute 0-5: Install Claude Desktop

Download from claude.ai, install, and log in. A Claude Pro subscription ($20/month) includes MCP server support.

Minute 5-10: Install Node.js

This is the one-time prerequisite. Download from nodejs.org (LTS version), install, done.

Minute 10-20: Add Your First MCP Servers

Open Claude Desktop settings (Settings > Developer > Edit Config, which opens `claude_desktop_config.json`) and add two servers:

Filesystem (access your business documents):

{
  "mcpServers": {
    "filesystem": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem",
               "/path/to/your/business/documents"]
    }
  }
}

Web Search (research capabilities):

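{
  "mcpServers": {
    "filesystem": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem",
               "/path/to/your/business/documents"]
    },
    "web-search": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-brave-search"],
      "env": {
        "BRAVE_API_KEY": "your-free-api-key"
      }
    }
  }
}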
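One thing that trips people up: the two snippets above are not two files. Every server lives under the same `mcpServers` object in a single config. Here's a small Python sketch that shows the merge. `add_server` is a hypothetical helper, and `example-search`, `example-search-package`, and `EXAMPLE_API_KEY` are placeholders rather than real package or variable names.

```python
import json


def add_server(config, name, command, args, env=None):
    """Return a copy of the config with one more entry under mcpServers."""
    merged = {**config, "mcpServers": dict(config.get("mcpServers", {}))}
    entry = {"command": command, "args": list(args)}
    if env:
        entry["env"] = dict(env)
    merged["mcpServers"][name] = entry
    return merged


config = add_server(
    {}, "filesystem", "npx",
    ["-y", "@modelcontextprotocol/server-filesystem",
     "/path/to/your/business/documents"],
)
config = add_server(
    config, "example-search", "npx",
    ["-y", "example-search-package"],       # placeholder package name
    env={"EXAMPLE_API_KEY": "your-api-key"},  # placeholder variable name
)

print(json.dumps(config, indent=2))
```

The printed JSON has one `mcpServers` key with both servers inside it, which is the shape the finished config file needs.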

Minute 20-30: Run Your First Automation

Restart Claude Desktop. Then try:

"Read my latest blog post from Documents/blog/ and create 5 LinkedIn posts, 3 tweet-length posts, and a newsletter introduction from it."

Claude reads the file through the filesystem server, generates all the content variations, and presents them. Copy, review, publish.

You just saved 2 hours of work in 10 minutes.

Scaling Up: Month-by-Month Plan

Month 1: Foundation

- Set up filesystem + web search MCP servers
- Use Claude for content repurposing (manual: you ask each time)
- Track time saved vs. time spent

Month 2: Communication

- Add Gmail MCP server
- Start using email triage and draft responses
- Add Google Calendar server for meeting prep

Month 3: Automation

- Set up scheduled tasks (morning briefing via OpenClaw or n8n)
- Create templates for recurring workflows
- Add CRM or project management MCP server

Month 4+: Optimization

Refine prompts based on output quality
Add specialized servers for your industry
Consider local models (Ollama) to eliminate API costs

How I Actually Do This

I use MCP servers to run a content-first small business. Here's the actual pipeline:

Content Pipeline (1 Article → 25+ Pieces)

My ACA Council—five specialized AI agents—handles the full content lifecycle:

1. SCOUT finds trending topics and research material

2. FORGE generates structured outlines

3. QUILL writes the draft

4. LEDGER fact-checks and validates claims

5. MAVEN optimizes for SEO and generates distribution pieces

One article generates: the full blog post, SEO metadata, social posts for 5 platforms, a newsletter version, Substack Notes, a voiceover script, and graphic suggestions. All automated.

Morning Briefing (30 Minutes Saved Daily)

OpenClaw runs a morning briefing at 6:30 AM using local models:

Calendar summary (Google Calendar MCP)
Priority email scan (Gmail MCP)
Overnight log review (filesystem MCP)
Industry news (web search MCP)

Delivered to Discord before I finish my coffee. I read a 2-page summary instead of spending 30 minutes checking 4 different apps.

The Business Case

My entire automation stack runs on local models—$0/month in API costs. The Mac Studio hardware was a significant upfront investment, but you don't need that to start. A Mac Mini with 32 GB ($800) runs a morning briefing and content repurposing pipeline comfortably. That investment pays for itself in the first month of time savings.

Common Concerns

"What if the AI produces bad content?"

It will, sometimes. That's why every automation includes a review step. AI generates drafts; you approve or edit. The time savings come from not starting from a blank page—editing a draft is always faster than writing from scratch.

"I'm worried about data privacy."

MCP servers run locally on your computer. Your business data doesn't leave your machine unless you explicitly connect to cloud services. For maximum privacy, use local models through Ollama—then nothing touches the internet at all.

"My business is too niche for AI."

AI doesn't need to understand your niche deeply to save you time. Email triage, meeting prep, and content formatting are universal tasks. The AI handles the structure; you provide the domain expertise. That combination is powerful regardless of industry.

"I tried ChatGPT and it wasn't that helpful."

ChatGPT without MCP servers is limited to what you paste into the chat. With MCP servers, AI accesses your actual data—files, emails, calendars, web. The difference is dramatic. It's the difference between describing your schedule to someone and handing them your calendar.

Thanks for reading As The Geek Learns! This post is public, so feel free to share it.

Which industries benefit most from MCP automation?

Any industry with significant knowledge work: professional services (law, accounting, consulting), real estate, marketing agencies, e-commerce, healthcare administration, education. The common thread is repetitive text-based tasks—email, documents, research, content.

Can I use MCP servers if I'm not tech-savvy?

Yes. The initial setup requires following step-by-step instructions (similar to installing any software). Once configured, you interact with AI using plain language. "Summarize my emails" is the interface—not code.

How do MCP servers compare to Zapier or Make?

Zapier and Make connect apps with if-then rules. MCP servers connect apps to AI with understanding. A Zapier automation moves data between apps in a fixed pattern. An MCP-connected AI reads, interprets, decides, and acts. They complement each other—use Zapier for simple triggers and MCP for intelligent processing.

What's the minimum investment to get started?

Claude Pro subscription: $20/month
Hardware: Your existing computer
Time: 30 minutes for initial setup
Total first-month cost: $20

Compare that to a virtual assistant ($500-2000/month) or a part-time employee ($1500+/month) for the same tasks.

Can my team share MCP server setups?

Yes. Configuration files can be shared and standardized across a team. Everyone gets the same MCP servers connected the same way, ensuring consistent AI capabilities across the business.

*This is part of the ASTGL Definitive Answers series—structured, practical answers to the questions people actually ask about AI automation, MCP servers, and local AI infrastructure.*

Can I Use MCP Servers Without Being a Developer?

James Cruce — Mon, 13 Apr 2026 02:51:54 GMT

You don't need to write code to use MCP servers. You don't need to understand APIs, protocols, or SDKs. If you can edit a text file and follow instructions, you can connect AI to your real tools in minutes.

Here's exactly how—three methods, zero coding required.

The Short Answer

Yes, you can use MCP servers without being a developer. The setup involves editing a configuration file or running a single command. No programming skills needed.

|--------|-----------|----------------|----------|

Prerequisites (One-Time Setup)

Before installing MCP servers, you need two things on your computer. If you already have them, skip ahead.

Node.js

Most MCP servers are built with Node.js. Install it once and forget about it.

Mac:

# Open Terminal and run:
brew install node

Windows:

Download from nodejs.org and run the installer. Choose the LTS version.

Verify it's installed:

node --version
# Should show something like: v22.x.x

Python (Optional)

Some MCP servers use Python instead of Node.js. Install it if you encounter a Python-based server.

Mac:

brew install python

Windows:

Download from python.org. Check "Add to PATH" during installation.

That's it for prerequisites. You won't need to write any JavaScript or Python—these are just runtimes that MCP servers need to execute.

Method 1: Claude Desktop (GUI)

Claude Desktop is the easiest starting point. You edit one configuration file, restart Claude, and your MCP servers appear as tools.

Step 1: Find the Config File

Open Claude Desktop, then:

Mac: Claude menu → Settings → Developer → Edit Config
Windows: File → Settings → Developer → Edit Config

This opens `claude_desktop_config.json` in your text editor.

Step 2: Add an MCP Server

The config file has a `mcpServers` section. Each server gets a block. Here's an example adding a web search server:

{
  "mcpServers": {
    "web-search": {
      "command": "npx",
      "args": ["-y", "@anthropic/mcp-server-web-search"],
      "env": {
        "BRAVE_API_KEY": "your-api-key-here"
      }
    }
  }
}

What each part means:

`"web-search"` — A name you choose (anything you want)
`"command"` — How to run the server (`npx` for Node.js servers)
`"args"` — The server package name
`"env"` — API keys or settings the server needs

Step 3: Add Multiple Servers

Stack them in the same config file:

{
  "mcpServers": {
    "web-search": {
      "command": "npx",
      "args": ["-y", "@anthropic/mcp-server-web-search"],
      "env": {
        "BRAVE_API_KEY": "your-key"
      }
    },
    "filesystem": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "/Users/you/Documents"]
    },
    "github": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-github"],
      "env": {
        "GITHUB_PERSONAL_ACCESS_TOKEN": "your-token"
      }
    }
  }
}
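Because one malformed entry can keep every server from loading, it can help to check the file before restarting Claude Desktop. Here is a minimal sketch in Python. The function name and the specific checks (a string `command`, a list of string `args`, a string-to-string `env`) are assumptions for illustration based on the examples above, not an official schema.

```python
import json


def validate_mcp_config(text: str) -> list[str]:
    """Return a list of problems found in a Claude Desktop style MCP config."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg} (line {exc.lineno})"]

    servers = config.get("mcpServers") if isinstance(config, dict) else None
    if not isinstance(servers, dict):
        return ['missing "mcpServers" object']

    problems = []
    for name, entry in servers.items():
        if not isinstance(entry, dict):
            problems.append(f"{name}: entry must be an object")
            continue
        # Assumed rules, taken from the shape of the examples in this post.
        if not isinstance(entry.get("command"), str):
            problems.append(f'{name}: "command" must be a string')
        args = entry.get("args", [])
        if not (isinstance(args, list) and all(isinstance(a, str) for a in args)):
            problems.append(f'{name}: "args" must be a list of strings')
        env = entry.get("env", {})
        if not (isinstance(env, dict) and all(isinstance(v, str) for v in env.values())):
            problems.append(f'{name}: "env" must map names to strings')
    return problems
```

Pass it the contents of `claude_desktop_config.json`; an empty list means the structure looks sane, though it says nothing about whether the packages or keys are valid.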

Step 4: Restart Claude Desktop

Close and reopen Claude Desktop. You should see a hammer icon showing your connected tools. Click it to see which tools each server provides.

Step 5: Use Them

Just talk to Claude normally:

"Search the web for MCP server tutorials"
"List the files in my Documents folder"
"Show me my recent GitHub pull requests"

Claude automatically decides which MCP server tools to use based on your request.

Method 2: Claude Code CLI (Fastest)

If you use Claude Code (the terminal version), adding MCP servers is a single command.

Add a Server

claude mcp add web-search -- npx -y @anthropic/mcp-server-web-search

That's it. One command. The server is immediately available in your next Claude Code session.

Add With Environment Variables

# Options such as --env go before the "--"; everything after it is the server command
claude mcp add github --env GITHUB_PERSONAL_ACCESS_TOKEN=your-token \
  -- npx -y @modelcontextprotocol/server-github

List Connected Servers

claude mcp list

Remove a Server

claude mcp remove web-search

Scoping

Claude Code lets you scope servers to a single project or to all of them:

# Available in all your projects (user scope)
claude mcp add --scope user web-search -- npx -y @anthropic/mcp-server-web-search

# Shared with the current project through its .mcp.json file (project scope)
claude mcp add --scope project web-search -- npx -y @anthropic/mcp-server-web-search

Method 3: VS Code (GUI)

If you use VS Code with the Claude extension, MCP servers can be configured through the settings GUI.

Step 1: Open Settings

`Cmd+,` (Mac) or `Ctrl+,` (Windows) → Search for "MCP"

Step 2: Add Server Configuration

VS Code's settings UI lets you add MCP server entries with fields for command, arguments, and environment variables. Same information as the JSON config, just in a form.

Step 3: Reload Window

`Cmd+Shift+P` (Mac) or `Ctrl+Shift+P` (Windows) → "Reload Window" to activate the new servers.

Finding MCP Servers to Install

Public Registries

| Registry | URL | Notes |
|----------|-----|-------|
| Smithery | smithery.ai | Largest registry, reviews, install commands |
| mcpt | mcpt.ai | Curated, quality-focused |
| OpenTools | opentools.ai | Growing collection, search by category |
| npm | npmjs.com (search "mcp-server") | Raw package listings |

How to Browse

1. Go to a registry (start with Smithery)

2. Browse categories: Productivity, Development, Data, Communication

3. Click a server to see: description, required config, install command

4. Copy the install command into your config file or CLI

Popular Servers for Non-Developers

| Server | What It Does | Config Complexity |
|--------|-------------|-------------------|
| Filesystem | Read/write files on your computer | Simple: just specify allowed directories |
| Web Search | Search the internet | Needs one API key (Brave) |
| Gmail | Read and draft emails | OAuth setup (guided) |
| Google Calendar | Check and create events | OAuth setup (guided) |
| Slack | Read and send messages | Bot token from Slack |
| Notion | Read and edit Notion pages | API key from Notion |
| GitHub | Manage repos, PRs, issues | Personal access token |

Real Setup Walkthrough: 3 Servers in 15 Minutes

Let's set up three useful MCP servers from scratch using Claude Code.

Server 1: Filesystem (2 minutes)

claude mcp add filesystem -- npx -y @modelcontextprotocol/server-filesystem ~/Documents ~/Desktop

Now Claude can read and write files in your Documents and Desktop folders. Try:

"List the files on my Desktop"
"Read the contents of Documents/notes.txt"
"Create a file called todo.txt on my Desktop with today's tasks"

Server 2: Web Search (3 minutes)

1. Get a free API key from brave.com/search/api

2. Run:

claude mcp add web-search --env BRAVE_API_KEY=your-key-here \
  -- npx -y @anthropic/mcp-server-web-search

Now Claude can search the internet. Try:

"Search for the latest news about MCP servers"
"Find tutorials on Ollama setup"

Server 3: GitHub (5 minutes)

1. Go to GitHub → Settings → Developer settings → Personal access tokens → Generate new token

2. Select scopes: `repo`, `read:org`

3. Run:

claude mcp add github --env GITHUB_PERSONAL_ACCESS_TOKEN=your-token \
  -- npx -y @modelcontextprotocol/server-github

Now Claude can interact with your GitHub repos. Try:

"Show my open pull requests"
"List issues in my project repo"

How I Actually Do This

I run about 15 MCP servers connected to Claude Code for daily work. Here's my philosophy:

The Config File Approach

For Claude Desktop, I keep a curated config file that I've refined over months. New servers get tested individually before joining the main config—one bad server config can prevent Claude Desktop from loading any of them.

The CLI Approach

For Claude Code, I use `claude mcp add` for project-specific servers and user scope for universal ones:

# User scope: available everywhere
claude mcp add --scope user web-search -- npx -y @anthropic/mcp-server-web-search
claude mcp add --scope user filesystem -- npx -y @modelcontextprotocol/server-filesystem ~

# Project-specific: only in this repo (the default local scope)
claude mcp add github -- npx -y @modelcontextprotocol/server-github

What I've Learned

1. Start with 2-3 servers. File system + web search covers most needs. Add more only when you feel the gap.

2. Test one at a time. If you add 5 servers at once and something breaks, you won't know which one caused it. Add one, verify it works, then add the next.

3. Keep API keys in environment variables. Never paste keys directly in config files that might be synced or shared. Use `.env` files or your system's keychain.

4. Read the server docs. Each server has specific capabilities and limitations. A filesystem server configured with `~/Documents` can only access Documents; it can't read your whole drive. That's a feature, not a bug.

5. The MCP ecosystem is growing fast. When I started, there were maybe 100 servers available. Now there are thousands. Check registries periodically—there might be a server for that tool you've been wishing Claude could access.
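One way to follow the advice in point 3 is to generate the config from a small script, so the file you share or sync never contains a key. The sketch below assumes the token lives in an environment variable named `GITHUB_PERSONAL_ACCESS_TOKEN`. The `build_config` helper is hypothetical, and the generated output still contains the secret, so it should stay on your machine.

```python
import json
import os


def build_config(environ=os.environ) -> dict:
    """Assemble an mcpServers config, pulling secrets from environment variables."""
    servers = {
        "filesystem": {
            "command": "npx",
            "args": ["-y", "@modelcontextprotocol/server-filesystem",
                     os.path.expanduser("~/Documents")],
        }
    }
    token = environ.get("GITHUB_PERSONAL_ACCESS_TOKEN")
    if token:  # only add the server when its secret is available
        servers["github"] = {
            "command": "npx",
            "args": ["-y", "@modelcontextprotocol/server-github"],
            "env": {"GITHUB_PERSONAL_ACCESS_TOKEN": token},
        }
    return {"mcpServers": servers}


if __name__ == "__main__":
    # Print the config; redirect it to your local config file if it looks right.
    print(json.dumps(build_config(), indent=2))
```

The script itself is safe to commit or share with a team, since the key only appears in its output.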

Troubleshooting

| Problem | Likely Cause | Fix |
|---------|-------------|-----|
| Server doesn't appear in Claude | Config syntax error | Validate your JSON at jsonlint.com |
| "Command not found" error | Node.js not installed | Install Node.js (see Prerequisites) |
| Server connects but tools don't work | Missing API key or wrong permissions | Check the server's documentation for required env variables |
| All servers stopped working | One server config is broken | Remove servers one at a time to find the broken one |
| Slow responses | Too many servers loaded | Remove servers you don't use regularly |

Is it safe to give MCP servers access to my files?

MCP servers only access what you explicitly allow. A filesystem server configured with `~/Documents` cannot read your email, browser history, or other folders. Always scope access to the minimum needed directories.

Do MCP servers send my data to the cloud?

Not by default. MCP servers run locally on your machine. Data only leaves your computer if the server explicitly calls an external API (like web search or Gmail). Local-only servers like filesystem never send data anywhere.

Can I use MCP servers on my phone?

Not directly—MCP servers currently run on desktop/laptop computers. However, if you set up servers on a home computer, you can access them remotely through tools like Claude Code over SSH or a web-based interface like Open WebUI.

What happens if an MCP server crashes?

Claude continues working—it just loses access to that server's tools. Other servers remain connected. Restart the crashed server by restarting Claude Desktop or, in Claude Code, by starting a new session; the `/mcp` command shows each server's connection status.

Do I need to update MCP servers?

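Occasionally. Servers launched via `npx` download the package the first time they run, but npx can keep reusing a cached copy afterward. Append `@latest` to the package name (for example `@modelcontextprotocol/server-github@latest`) if you always want the newest release. Servers installed globally with `npm install -g` need manual updates with `npm update -g`. Check for updates monthly.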

*This is part of the ASTGL Definitive Answers series—structured, practical answers to the questions people actually ask about AI automation, MCP servers, and local AI infrastructure.*

What Is AI Agent Automation and How Do I Start?

James Cruce — Mon, 13 Apr 2026 02:32:05 GMT

You've probably used AI to answer questions or draft emails. But what if AI could handle entire workflows—research, decide, act, report—while you focus on something else?

That's what AI agents do. And you can start building them today with tools you already have.

The Short Answer

An AI agent is an AI system that works autonomously—it receives a goal, makes a plan, uses tools to execute that plan, and delivers results without you supervising each step. Unlike chatbots that wait for your next message, agents take initiative.

| Chatbot | AI Agent |
|---------|----------|
| Responds to one message at a time | Executes multi-step workflows |
| Waits for your next input | Works independently until done |
| Uses knowledge only | Uses knowledge AND tools |
| "Here's what I think" | "Here's what I did" |
| You drive the conversation | You define the goal, agent drives execution |

The Agent Spectrum

Not all automation is agent automation. Understanding the spectrum helps you pick the right level for each task.

Level 1: Simple Automation (Scripts)

Traditional automation. A script runs predefined steps in a fixed order. No AI involved.

If new_email → forward to team → done

Strengths: Predictable, fast, zero AI cost.

Limits: Can't handle exceptions, can't adapt, brittle.

Level 2: AI-Assisted (Chat + Tools)

You prompt an AI with MCP servers connected. The AI uses tools when you ask, but you direct every step.

You: "Check my calendar for tomorrow"
AI: [calls calendar tool] → "You have 3 meetings..."
You: "Draft prep notes for each"
AI: [drafts notes] → "Here are your prep notes..."

Strengths: Flexible, handles ambiguity, easy to start.

Limits: Requires your attention. Doesn't run without you.

Level 3: Autonomous Agent

The AI receives a goal and handles the entire workflow. It decides which tools to use, in what order, and handles errors along the way.

Goal: "Every morning at 6:30 AM, deliver a briefing with my calendar, important emails, overdue tasks, and relevant news."
Agent: [checks calendar] → [scans email] → [queries task manager] → [searches news] → [synthesizes briefing] → [delivers to Discord]

Strengths: Runs without you, handles variations, scales.

Limits: Needs well-defined boundaries. Can fail on truly novel situations.

Level 4: Multi-Agent Systems

Multiple agents coordinate. One agent researches, another writes, another edits, another publishes. They pass work between them.

SCOUT agent: [finds trending topics] → passes to
FORGE agent: [generates article outline] → passes to
QUILL agent: [writes draft] → passes to
LEDGER agent: [fact-checks and validates] → passes to
MAVEN agent: [optimizes for SEO and publishing]

Strengths: Specialized agents outperform generalists. Parallel execution.

Limits: Complex to build and debug. Coordination overhead.

Building Your First Agent

You don't need a framework or custom code to start. Here's the progression from simple to sophisticated.

Method 1: Scheduled Prompts (Easiest)

The simplest agent is a prompt that runs on a schedule. No framework needed.

Tools: Ollama + cron (or n8n, Make, Zapier)

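# crontab entry (must fit on one line): run the briefing script at 6:30 AM
30 6 * * * /path/to/morning-briefing.sh

# morning-briefing.sh
# "stream": false makes Ollama return a single JSON object that jq can parse
curl -s http://localhost:11434/api/chat -d '{
  "model": "gemma4:26b",
  "stream": false,
  "messages": [{"role": "user", "content": "Generate a morning briefing..."}]
}' | jq -r '.message.content' | send-to-discord  # send-to-discord: your own delivery script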

This is technically an agent—it runs autonomously, uses a model, and delivers results. It just can't use tools or handle multi-step logic.

Method 2: AI Gateway (Recommended Starting Point)

An AI gateway like OpenClaw sits between your scheduled tasks and your models. It adds tool calling, model routing, and error handling.

What a gateway provides:

Schedule management (cron-like)
Model selection per task
MCP server connections (tools)
Output delivery (Discord, Slack, email)
Error handling and retries
Logging and monitoring

This is where most people should start for real agent automation. You get 80% of the power with 20% of the complexity.

Method 3: Agent Frameworks (For Developers)

If you want full control, frameworks like LangChain, CrewAI, or the Claude Agent SDK let you build custom agent logic in code.

# Conceptual example
agent = Agent(
    model="gemma4:26b",
    tools=[calendar, email, web_search, task_manager],
    goal="Generate a morning briefing"
)
result = agent.run()

Use frameworks when: You need custom logic, complex coordination between agents, or integration with specific systems that don't have MCP servers.
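To make the idea concrete, here is a minimal sketch of the loop that agent frameworks wrap for you: ask the model for the next action, run the matching tool, feed the result back, repeat. It is not the API of LangChain, CrewAI, or the Claude Agent SDK. The action format, `fake_calendar`, and `scripted_model` are hypothetical stand-ins so the loop runs without a real model.

```python
from typing import Callable


def run_agent(goal: str, model: Callable, tools: dict, max_steps: int = 5) -> str:
    """Loop: ask the model for the next action, run the tool, feed the result back."""
    history = [{"role": "user", "content": goal}]
    for _ in range(max_steps):
        action = model(history)
        if "final" in action:
            return action["final"]
        tool = tools.get(action["tool"])
        if tool is None:
            result = f"unknown tool: {action['tool']}"
        else:
            result = tool(**action.get("args", {}))
        history.append({"role": "tool", "name": action["tool"], "content": str(result)})
    return "stopped: step limit reached"


def fake_calendar(day: str) -> str:
    """Hypothetical tool: a real one would query a calendar API or MCP server."""
    return f"3 meetings on {day}"


def scripted_model(history: list) -> dict:
    """Stub 'model': call the calendar once, then summarize what came back."""
    if history[-1]["role"] == "tool":
        return {"final": f"Briefing: {history[-1]['content']}"}
    return {"tool": "calendar", "args": {"day": "tomorrow"}}


print(run_agent("Generate a morning briefing", scripted_model, {"calendar": fake_calendar}))
```

The `max_steps` cap is the important design choice: an agent that never reaches a final answer stops instead of looping forever.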

Designing Reliable Agent Workflows

Agents fail when their scope is unclear. Here's how to design workflows that actually work in production.

The Three Constraints

Every reliable agent workflow defines:

1. Goal—What does "done" look like? Be specific. "Research AI trends" is vague. "Find 5 news articles from the past week about MCP servers and summarize each in 2 sentences" is actionable.

2. Tools—What can the agent access? Only give it the tools it needs. An agent that can search the web, read email, and post to social media has a larger blast radius than one that can only search and summarize.

3. Boundaries—What should the agent NOT do? "Never send messages directly—always save as draft for review." "Never delete files." "If uncertain, log the uncertainty and skip."

Error Handling Patterns

| Pattern | How It Works | When to Use |
|---------|-------------|-------------|
| Retry with backoff | Try again after a delay | Transient failures (API timeouts, rate limits) |
| Fallback model | Switch to a different model | Model-specific failures |
| Human escalation | Alert a human and pause | High-stakes decisions, ambiguous situations |
| Skip and log | Log the failure, continue with remaining tasks | Non-critical subtasks in a batch |
| Circuit breaker | Stop after N consecutive failures | Prevent cascading failures |
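Two of these patterns are short enough to sketch directly. The code below is an illustration under simple assumptions (exponential delays, a plain failure counter), not the behavior of any particular gateway. The `sleep` parameter exists only so the delay can be swapped out in tests.

```python
import time


def retry_with_backoff(fn, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**n and try again, up to `attempts` times."""
    for n in range(attempts):
        try:
            return fn()
        except Exception:
            if n == attempts - 1:
                raise  # out of retries: let the caller escalate or skip-and-log
            sleep(base_delay * 2 ** n)


class CircuitBreaker:
    """Refuse further calls after `limit` consecutive failures."""

    def __init__(self, limit=3):
        self.limit = limit
        self.failures = 0

    def call(self, fn):
        if self.failures >= self.limit:
            raise RuntimeError("circuit open: too many consecutive failures")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            raise
        self.failures = 0  # any success resets the count
        return result
```

Wrapping a tool call as `breaker.call(lambda: retry_with_backoff(fetch_news))` (with `fetch_news` being whatever your task does) retries transient errors but stops hammering a service that is clearly down.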

The Review Step

For any agent workflow that produces output seen by others, add a review step:

Draft mode: Agent creates drafts, human approves before sending
Threshold mode: Agent acts autonomously when its confidence is above a threshold and escalates to a human when it falls below
Audit log: Agent acts freely but logs everything for post-hoc review

Start with draft mode. Move to threshold mode once you trust the workflow. Keep audit logs always.
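The three modes can be expressed as one small decision function. This is a sketch, reading threshold mode as "high confidence sends, low confidence goes to a human". The `Reviewer` class, the 0.8 cutoff, and the idea that your agent reports a numeric confidence are all assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Reviewer:
    mode: str = "draft"      # "draft", "threshold", or "audit"
    threshold: float = 0.8   # hypothetical cutoff; tune per workflow
    audit_log: list = field(default_factory=list)

    def decide(self, output: str, confidence: float) -> str:
        """Return "send" or "hold_for_review" and always record the decision."""
        if self.mode == "draft":
            decision = "hold_for_review"
        elif self.mode == "threshold":
            decision = "send" if confidence >= self.threshold else "hold_for_review"
        elif self.mode == "audit":
            decision = "send"
        else:
            raise ValueError(f"unknown mode: {self.mode}")
        # Logging happens in every mode, matching "keep audit logs always".
        self.audit_log.append(
            {"output": output, "confidence": confidence, "decision": decision}
        )
        return decision
```

Moving a workflow from draft to threshold mode is then a one-word config change rather than a rewrite.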

How I Actually Do This

I run 26 automated agent tasks on a Mac Studio through OpenClaw. Here's the architecture:

The Task Schedule

|------|-----------|-------|--------|

Routing Architecture

Every incoming task hits a router first:

Task arrives → Triage model classifies complexity →
  Simple → Gemma 4 e4B (fast, cheap)
  Standard → Gemma 4 26B (quality)
  Code → Qwen 3 Coder (specialized)
  Complex → Gemma 4 31B (deep reasoning)

The triage classification takes under 100ms. The right model gets the right task.

What I've Learned Running Agents Daily

1. Scope tight, iterate wide. Start each agent with a narrow, well-defined task. Expand scope only after it's been reliable for weeks.

2. Logs are everything. When an agent produces bad output, you need to know what prompt it received, what tools it called, and what each tool returned. Without logs, debugging is guesswork.

3. Models narrate instead of acting. Some models—particularly smaller ones—will describe what they would do rather than actually calling tools. This was the single most frustrating failure mode. Solution: test every model with actual tool-calling prompts before deploying it.

4. Schedule slack matters. If your 6:00 AM research pipeline occasionally takes 20 minutes, don't schedule the next task at 6:05 AM. Build buffer between dependent tasks.

5. $0/month is real. All 26 tasks run on local models through Ollama. The total cloud API cost is zero. Hardware paid for itself within two months of not paying for API calls.

Can AI agents make mistakes?

Yes. Agents can misinterpret goals, use tools incorrectly, or produce low-quality output. The mitigation is constraint design—narrow scope, clear boundaries, human review for high-stakes outputs. Well-constrained agents on well-defined tasks are remarkably reliable.

How much does it cost to run AI agents?

If you run local models: hardware cost only ($0/month in API fees). A Mac Mini with 32 GB runs several agent tasks. A Mac Studio with 64+ GB runs dozens. Cloud-based agents using API calls typically cost $0.01-0.10 per task execution, depending on model and complexity.

What's the difference between AI agents and RPA (Robotic Process Automation)?

RPA automates UI clicks—it literally moves the mouse and types. AI agents understand context, make decisions, and use APIs directly. RPA is brittle (breaks when a UI changes). AI agents are flexible (adapt to variations in input). They solve different problems, though AI agents are increasingly replacing RPA for knowledge work.

Can agents work with my existing tools?

Yes, through MCP servers. If your tool has an MCP server (or an API), agents can use it. Calendar, email, databases, file systems, web search, Slack, GitHub—all connectable. The MCP ecosystem covers most common business tools.

How do I monitor agents in production?

Log every execution: timestamp, prompt, tool calls, responses, and final output. Set up alerts for failures, timeouts, and anomalous output. Review logs weekly to catch quality drift. This is the same principle as monitoring any automated system—visibility is everything.
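A log like that does not need special tooling. Here is a minimal sketch that appends one JSON line per run; the field names and the `log_execution` helper are assumptions, so adapt them to whatever your gateway or scripts already record.

```python
import json
import time


def log_execution(path, task, prompt, tool_calls, output, error=None):
    """Append one JSON line per agent run so runs can be searched and reviewed later."""
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S%z"),
        "task": task,
        "prompt": prompt,
        "tool_calls": tool_calls,  # e.g. [{"tool": "calendar", "result": "..."}]
        "output": output,
        "error": error,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record
```

One line per run keeps the file easy to filter with `jq` or `grep`, and a non-null `error` field is a natural hook for alerts.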

*This is part of the ASTGL Definitive Answers series—structured, practical answers to the questions people actually ask about AI automation, MCP servers, and local AI infrastructure.*

What's the Best Local LLM for Your Specific Task?

James Cruce — Mon, 13 Apr 2026 02:16:10 GMT

Not every task needs the biggest model. A 4-billion parameter model can sort your notifications just as well as a 70-billion parameter one—and it'll do it 10x faster.

The trick isn't finding one "best" model. It's matching the right model to each job. Here's how to think about it, with specific recommendations for every common use case.

The Short Answer

There is no single best local LLM. The best model depends on your task, your hardware, and your tolerance for slower responses. Here's the quick decision matrix:

|------|-----------|---------------|-----|

Understanding Model Tiers

Local LLMs come in rough capability tiers. Knowing where each tier sits helps you avoid overspending memory on simple tasks—or underpowering complex ones.

Tier 1: Micro (1-4B parameters)

Models: Gemma 4 e4B, Phi-3 Mini, Qwen 3 1.7B

Good for: Classification, routing, keyword extraction, simple formatting, notification triage, yes/no decisions.

Not good for: Complex reasoning, nuanced writing, multi-step analysis, coding beyond snippets.

Memory: 4-8 GB

These models are fast and cheap. Use them for high-volume, low-complexity tasks where speed matters more than depth. If you're processing hundreds of notifications per day, a micro model handles the sorting while your bigger models handle the interesting work.

Tier 2: Small (7-14B parameters)

Models: Gemma 3 12B, Qwen 3 8B, Llama 3.2 11B

Good for: General conversation, basic coding, summarization, email drafting, structured output, moderate reasoning.

Not good for: Complex multi-step reasoning, long document analysis, production-grade code generation.

Memory: 12-20 GB

The sweet spot for most people starting out. A Gemma 3 12B on a 16 GB MacBook handles daily tasks surprisingly well. These models punch above their weight on focused tasks.

Tier 3: Medium (26-34B parameters)

Models: Gemma 4 26B, Gemma 4 31B, Qwen 3 Coder 32B, Qwen 3 32B

Good for: Almost everything—coding, writing, research, complex analysis, tool calling, agent tasks.

Not good for: Tasks requiring frontier-model reasoning (use cloud Claude for those).

Memory: 32-64 GB

This is where local models become genuinely competitive with cloud APIs for most tasks. Gemma 4 26B is my daily workhorse—it handles 80% of everything I throw at it.

Tier 4: Large (65-70B+ parameters)

Models: Llama 3.3 70B, Qwen 3 72B, DeepSeek V3

Good for: Maximum local quality, deep reasoning, complex research, academic analysis.

Not good for: Anything where speed matters—these are slow without enterprise hardware.

Memory: 96-128+ GB

Impressive quality, but the hardware requirements limit accessibility. If you have a Mac Studio with 192+ GB or a multi-GPU server, these are worth exploring. Otherwise, use cloud models for tasks that need this level.

Task-Specific Recommendations

Coding

Best: Qwen 3 Coder 32B

Runner-up: Gemma 4 26B

Budget: Qwen 3 8B

Coding requires strong structured output, function understanding, and the ability to follow precise instructions. Qwen 3 Coder is purpose-built for this. It handles:

Code generation across languages
Tool calling and function signatures
Debugging and refactoring
Test generation
Structured JSON output

Important finding: Some models narrate what they would do instead of actually executing tool calls. Qwen 3 8B sometimes exhibits this behavior—it describes the function call rather than producing the structured output. Qwen 3 32B (and the Coder variant) follows tool-call instructions reliably. Test your chosen model with actual tool-calling prompts before committing to it for automation.
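A quick way to run that test is to check whether a reply contains a structured tool call or only prose describing one. The sketch below assumes the response shape of Ollama's `/api/chat` endpoint, where calls appear under `message.tool_calls`. The two sample replies are hand-written illustrations, not captured model output, and `get_calendar` is a made-up tool name.

```python
def called_tool(response: dict, expected_tool: str) -> bool:
    """True if the reply contains a structured call to expected_tool, not just prose about it."""
    calls = response.get("message", {}).get("tool_calls") or []
    return any(c.get("function", {}).get("name") == expected_tool for c in calls)


# A reply that acts: the call is structured data the gateway can execute.
acted = {"message": {"role": "assistant", "content": "",
                     "tool_calls": [{"function": {"name": "get_calendar",
                                                  "arguments": {"day": "tomorrow"}}}]}}

# A reply that narrates: the tool is only mentioned in the text.
narrated = {"message": {"role": "assistant",
                        "content": "I would call get_calendar with day=tomorrow."}}
```

Running a handful of representative prompts through each candidate model and counting how often `called_tool` returns `True` gives a rough reliability score before you commit to one.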

Writing and Content

Best: Gemma 4 31B

Runner-up: Gemma 4 26B

Budget: Gemma 3 12B

Writing quality improves noticeably with model size. Larger models produce more natural voice, better paragraph flow, and more nuanced tone. For blog posts, documentation, and professional writing, the jump from 12B to 26B is significant.

For content pipelines where you're generating dozens of pieces, 26B offers the best throughput-to-quality ratio. Save 31B for final drafts or pieces where voice really matters.

Research and Analysis

Best: Llama 3.3 70B

Runner-up: Gemma 4 26B

Budget: Gemma 3 12B

Deep research tasks benefit from maximum reasoning capability. A 70B model catches connections and nuances that smaller models miss. But for daily research briefs and competitive monitoring, 26B is more than sufficient.

Automation and Agent Tasks

Best: Gemma 4 26B (general) + Qwen 3 8B (structured output)

Budget: Gemma 4 e4B (triage) + Gemma 3 12B (execution)

Agent automation needs reliable tool calling, consistent structured output, and the ability to follow multi-step instructions. You don't always need the smartest model — you need the most reliable one.

For triage tasks (sorting, routing, classification), micro models are perfectly reliable and run 10x faster. Route the complex work to bigger models.

The Multi-Model Architecture

Running a single model for everything is like using a sledgehammer for every nail. The smart approach is a tiered architecture where each model handles the tasks it's best at.

My 4-Tier Setup

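| Tier | Model | Handles |
|------|-------|---------|
| Triage | Gemma 4 e4B | Simple classification, yes/no, sorting |
| Daily | Gemma 4 26B | Research, drafting, analysis, summarization |
| Code | Qwen 3 Coder | Code generation, tool calling, structured output |
| Heavy | Gemma 4 31B | Complex reasoning, long documents, multi-step logic |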

A routing layer (itself running on the triage model) examines each incoming task and sends it to the appropriate tier. Simple tasks go fast. Complex tasks get the power they need.

Why Not Just Use the Biggest Model?

Three reasons:

1. Speed. A 4B model responds in milliseconds. A 31B model takes seconds. For 500 daily triage operations, that's the difference between instant and painfully slow.

2. Concurrency. Smaller models use less memory, so you can run more models simultaneously. Four small models serving different task types beats one large model handling a queue.

3. Cost-efficiency. Even though local models are "free," memory is finite. Using a 31B model for notification sorting wastes 17 GB of memory that could serve other workloads.

How I Actually Do This

My Mac Studio M3 Ultra with 256 GB unified memory runs all four tiers simultaneously through Ollama:

# All models pinned permanently — no cold starts
export OLLAMA_KEEP_ALIVE=-1
export OLLAMA_MAX_LOADED_MODELS=4

OpenClaw (my local AI gateway) routes tasks to the appropriate model based on a classification prompt:

Given this task: [task description]
Which tier should handle it?
- TRIAGE: Simple classification, yes/no, sorting
- DAILY: Research, drafting, analysis, summarization
- CODE: Code generation, tool calling, structured output
- HEAVY: Complex reasoning, long documents, multi-step logic
Reply with only the tier name.

The triage model runs this classification in under 100ms. The task then goes to the right model. This architecture has been running 26 automated tasks daily for months with zero model-related failures.
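As a rough sketch of that routing step in Python (not OpenClaw's actual code): the tier-to-model mapping below is illustrative, `classify` is a hypothetical callable standing in for the request to the triage model, and falling back to DAILY on an unrecognized reply is an assumption.

```python
from typing import Callable

# Tier -> Ollama model tag. Illustrative mapping; adjust to the models you run.
TIER_MODELS = {
    "TRIAGE": "gemma4:e4b",
    "DAILY": "gemma4:26b",
    "CODE": "qwen3-coder:fast",
    "HEAVY": "gemma4:31b",
}

PROMPT_TEMPLATE = """Given this task: {task}
Which tier should handle it?
- TRIAGE: Simple classification, yes/no, sorting
- DAILY: Research, drafting, analysis, summarization
- CODE: Code generation, tool calling, structured output
- HEAVY: Complex reasoning, long documents, multi-step logic
Reply with only the tier name."""


def parse_tier(reply: str, default: str = "DAILY") -> str:
    """Return the first known tier name found in the reply, else the default."""
    for word in reply.strip().upper().split():
        cleaned = word.strip(".,:;!\"'")
        if cleaned in TIER_MODELS:
            return cleaned
    return default  # assumption: unknown replies go to the general-purpose tier


def route(task: str, classify: Callable[[str], str]) -> str:
    """Ask the classifier for a tier and return the model tag that should run the task."""
    reply = classify(PROMPT_TEMPLATE.format(task=task))
    return TIER_MODELS[parse_tier(reply)]
```

In practice `classify` would send the prompt to the triage model and return its text reply. Parsing defensively matters because small models sometimes add punctuation or extra words even when told to reply with only the tier name.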

When I Still Use Cloud Claude

Local models have a ceiling. For tasks that need frontier reasoning—complex architectural decisions, nuanced code review across large codebases, novel problem-solving—I reach for cloud Claude. It's not about local vs. cloud. It's about using the right tool.

My split: ~90% local, ~10% cloud. The 10% is high-value work where the quality difference justifies the cost.

Picking Your Starting Model

Don't overthink it. Here's the decision tree:

1. How much memory do you have?

8-16 GB → Start with Gemma 3 12B
32-64 GB → Start with Gemma 4 26B
96+ GB → Start with Gemma 4 26B + Qwen 3 Coder 32B

2. What's your primary use?

General → Gemma 4 26B
Coding → Qwen 3 Coder 32B
Writing → Gemma 4 26B (or 31B if you have memory)

3. Run it for a week. If quality falls short on specific tasks, add a specialist model for those tasks. If it handles everything, you're done.


How often do new models come out?

Major model releases happen every few months. Ollama makes upgrading easy — `ollama pull model:latest` downloads the new version. You don't need to chase every release. Upgrade when a new model offers clear improvements for your specific tasks.

Should I use quantized models?

Quantization shrinks models by reducing numerical precision. A Q4 quantized 70B model fits in roughly the same memory as a full-precision 26B model. The quality trade-off is usually small—Q4 and Q5 quantizations preserve most capability. For memory-constrained systems, quantized larger models often outperform full-precision smaller ones.

Can I fine-tune local models?

Yes, but you probably shouldn't—at least not yet. Fine-tuning requires significant expertise and compute. For most use cases, prompt engineering and Modelfiles (custom system prompts) get you 90% of the way there. Save fine-tuning for when you've exhausted prompt-based approaches.

Do model benchmarks matter?

Benchmarks measure synthetic tasks that may not reflect your workload. A model that scores highest on HumanEval might not be the best at drafting your emails. Use benchmarks as rough filters, then test with your actual tasks. Five real-world test runs tell you more than any leaderboard.

What about multimodal models (vision, audio)?

Ollama supports multimodal models like LLaVA and Gemma 4 with vision. These can analyze images, screenshots, and documents. Useful for automation tasks like reading invoices, analyzing charts, or processing visual data. The vision capabilities are still maturing but improving rapidly.


*This is part of the ASTGL Definitive Answers series—structured, practical answers to the questions people actually ask about AI automation, MCP servers, and local AI infrastructure.*

How Do I Set Up Ollama on Mac, Windows, and Linux?

James Cruce — Mon, 13 Apr 2026 02:05:33 GMT

Running AI locally means no API bills, no data leaving your machine, and no rate limits. Ollama makes that possible in under 10 minutes on any platform.

Here's how to install, configure, and optimize Ollama on Mac, Windows, and Linux—plus the configuration I use to run 26 automated AI tasks daily on a Mac Studio.

The Short Answer

Ollama is a free tool that runs large language models locally. Install it, pull a model, and you have a fully functional AI running on your hardware with zero cloud dependencies.


Installation: Platform by Platform

macOS

Option A: Direct download

1. Go to ollama.com and download the macOS installer

2. Open the `.dmg` and drag Ollama to Applications

3. Launch Ollama—a menu bar icon appears

Option B: Homebrew

brew install ollama
ollama serve   # Start the server

Pull your first model:

ollama pull gemma3
ollama run gemma3

That's it. You're running a local AI.

Apple Silicon note: If you have an M1/M2/M3/M4 Mac, Ollama automatically uses Metal acceleration and unified memory. A MacBook Air with 16 GB can comfortably run 7-8B parameter models. A Mac Studio with 192-512 GB can run the largest open models available.

Windows

1. Download the Windows installer from ollama.com

2. Run the installer—Ollama installs as a Windows service

3. Open PowerShell or Command Prompt:

ollama pull gemma3
ollama run gemma3

NVIDIA GPU: Detected automatically if CUDA drivers are installed. Check with `nvidia-smi` in a terminal. If your GPU shows up there, Ollama will use it.

No GPU? Ollama falls back to CPU. It works, but responses will be slower. Stick to smaller models (3-7B parameters) on CPU-only systems.

Linux

One-line install:

curl -fsSL https://ollama.com/install.sh | sh

This installs Ollama and creates a `systemd` service that starts automatically.

ollama pull gemma3
ollama run gemma3

GPU setup:

NVIDIA: Install CUDA drivers first (`nvidia-driver-xxx` package), then install Ollama. It detects the GPU automatically.
AMD: ROCm support is available. Install ROCm drivers, then Ollama picks them up.

Headless/server: Ollama runs perfectly without a desktop environment. The API listens on `localhost:11434` by default.

Choosing Your First Models

Don't overthink this. Start with one general-purpose model and add specialized ones later.


Rule of thumb: You need roughly 1.2x the model file size in available memory. If a model is 17 GB, you need at least 20 GB free.

# Pull a model
ollama pull gemma4:26b

# List installed models
ollama list

# Remove a model
ollama rm gemma3:4b
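The 1.2x rule of thumb above is easy to turn into a quick pre-flight check. This is a minimal sketch; the function names are made up for illustration, and the factor is only the rough estimate from this article, not a figure Ollama reports.

```python
def required_memory_gb(model_file_gb: float, factor: float = 1.2) -> float:
    """Estimate the free memory needed to load a model of the given file size."""
    return model_file_gb * factor


def fits(model_file_gb: float, free_memory_gb: float) -> bool:
    """True if the estimated requirement fits in the memory currently available."""
    return required_memory_gb(model_file_gb) <= free_memory_gb
```

For a 17 GB model this gives an estimate of a little over 20 GB, so `fits(17, 24)` is true and `fits(17, 16)` is false.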

Configuration for Real Use

The defaults work fine for casual use. For daily automation or development, tune these settings.

Environment Variables

Set these in your shell profile (`.zshrc`, `.bashrc`) or systemd service file:

# Where models are stored (default: ~/.ollama/models)
export OLLAMA_MODELS="/path/to/models"

# Listen on all interfaces (for network access)
export OLLAMA_HOST="0.0.0.0:11434"

# Keep models in memory longer (default: 5m)
export OLLAMA_KEEP_ALIVE="24h"

# Max models loaded simultaneously
export OLLAMA_MAX_LOADED_MODELS=3

# GPU layers (0 = CPU only, -1 = all GPU)
export OLLAMA_NUM_GPU=-1

The API

Ollama serves a REST API on `localhost:11434`. Every tool that supports Ollama (Open WebUI, Continue, LangChain) connects through this API.

# Quick test
curl http://localhost:11434/api/generate -d '{
  "model": "gemma4:26b",
  "prompt": "What is MCP?",
  "stream": false
}'

# Chat format (most common)
curl http://localhost:11434/api/chat -d '{
  "model": "gemma4:26b",
  "messages": [{"role": "user", "content": "What is MCP?"}],
  "stream": false
}'
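The same chat request can be made from Python with only the standard library. This is a minimal sketch that assumes Ollama is listening on the default address and that the non-streaming response carries the reply under `message.content`; the helper names are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address used in this article


def build_chat_payload(model: str, prompt: str) -> dict:
    """Build the JSON body for a non-streaming /api/chat request."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }


def chat(model: str, prompt: str, base_url: str = OLLAMA_URL, timeout: float = 120.0) -> str:
    """Send one user message and return the assistant's reply text."""
    body = json.dumps(build_chat_payload(model, prompt)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/api/chat",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        data = json.load(response)
    return data["message"]["content"]
```

With the server running, `print(chat("gemma4:26b", "What is MCP?"))` should print the model's answer. Nothing is sent until `chat` is called, so the payload builder can be tested offline.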

Modelfiles (Custom Configurations)

Create a `Modelfile` to customize model behavior:

FROM gemma4:26b

PARAMETER temperature 0.3
PARAMETER num_ctx 8192

SYSTEM """You are a technical writing assistant. Write clearly and concisely. Use active voice. Avoid jargon unless the audience is technical."""

ollama create my-writer -f Modelfile
ollama run my-writer

This lets you create task-specific variants without downloading the same model multiple times.

How I Actually Do This

I run Ollama on a Mac Studio M3 Ultra with 256 GB of unified memory. Here's what my production setup looks like:

Models Pinned in VRAM

I keep multiple models loaded permanently — no cold starts, instant responses:

# Keep models alive indefinitely
export OLLAMA_KEEP_ALIVE=-1
export OLLAMA_MAX_LOADED_MODELS=4

| Model | Role | Why This One |
|-------|------|-------------|
| `gemma4:e4b` | Triage/routing | Fast, cheap, handles notification sorting and simple classification |
| `gemma4:26b` | Daily workhorse | Balanced quality; handles 80% of tasks including research, drafting, analysis |
| `qwen3-coder:fast` | Code tasks | Optimized for code generation, tool calling, structured output |
| `gemma4:31b` | Heavy reasoning | Complex analysis, long documents, multi-step reasoning |

Automated Task Runner

OpenClaw (my local AI gateway) connects to Ollama's API and runs 26 scheduled tasks:

6:00 AM — Research pipeline hits web sources, summarizes findings
6:15 AM — Log review scans overnight system logs for anomalies
6:30 AM — Morning briefing synthesizes calendar + priorities + news
Every hour — Notification batching across all channels

Total cloud API cost: $0/month. Everything runs locally.

Performance Tuning

For Apple Silicon specifically:

Set `num_gpu` to `-1` (all layers on GPU) — Apple's unified memory means the GPU can access all 256 GB
Set `num_ctx` based on task — 4096 for quick tasks, 16384 for long documents, 32768 for code analysis
Monitor with `ollama ps` to see which models are loaded and how much memory they use

# Check what's running
ollama ps

# Example output:
# NAME              SIZE    PROCESSOR  UNTIL
# gemma4:26b        17 GB   100% GPU   Forever
# gemma4:e4b        3 GB    100% GPU   Forever
# qwen3-coder:fast  5 GB    100% GPU   Forever
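One way to apply the per-task `num_ctx` guidance is to set it per request through the API's `options` field instead of baking it into each model. A small sketch, assuming the three context sizes listed above; the task-kind labels are invented for illustration.

```python
# Context window per kind of task, using the sizes suggested above.
NUM_CTX_BY_TASK = {
    "quick": 4096,
    "long_document": 16384,
    "code_analysis": 32768,
}


def generate_payload(model: str, prompt: str, task_kind: str = "quick") -> dict:
    """Build an /api/generate request body with a task-appropriate num_ctx."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": NUM_CTX_BY_TASK[task_kind]},
    }
```

A larger context window costs memory, so keeping the default small and opting in to bigger windows only for the tasks that need them leaves more headroom for the other pinned models.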

Troubleshooting Common Issues

| Problem | Cause | Fix |
|---------|-------|-----|
| Model runs slowly | Not enough GPU memory, layers on CPU | Check `ollama ps`: if PROCESSOR shows "CPU", you need a smaller model or more memory |
| "Out of memory" error | Model too large for your system | Try a smaller quantization (`q4_0` instead of `q8_0`) or a smaller model |
| API connection refused | Ollama not running | Run `ollama serve` or start the desktop app |
| Model download stuck | Network issue | Ctrl+C and re-run `ollama pull`; it resumes from where it stopped |
| GPU not detected (Linux) | Missing CUDA/ROCm drivers | Install drivers first, then reinstall Ollama |


Is Ollama free?

Yes, completely free and open source. The models are also free. There are no subscriptions, API fees, or usage limits.

Can I use Ollama with Claude Desktop?

Not directly — Claude Desktop uses cloud Claude. But you can connect Ollama to tools like Open WebUI, Continue (VS Code), or any application that supports the OpenAI-compatible API format. Many MCP servers can also route to local Ollama models.

How much disk space do models need?

Models range from 2 GB (small 3-4B models) to 45+ GB (large 70B models). Budget 5-20 GB for a typical setup with 2-3 models. Ollama stores them in `~/.ollama/models` by default.

Can I run Ollama on a Raspberry Pi?

Technically yes, but practically no for anything useful. Even a Raspberry Pi 5 with 8 GB RAM can only run tiny models (1-3B) very slowly. A mini PC with 32 GB RAM is a much better entry point for local AI.

Does Ollama support tool calling / function calling?

Yes. Models like Qwen 3, Gemma 4, and Llama 3.3 support structured tool calling through Ollama's API. This is essential for MCP server integration and agent automation.


*This is part of the ASTGL Definitive Answers series: structured, practical answers to the questions people actually ask about AI automation, MCP servers, and local AI infrastructure.*

How I Shipped an MCP Knowledge Server in a Weekend

James Cruce — Sun, 12 Apr 2026 11:30:52 GMT

Turning static content into an AI-citable knowledge base via npm

You have likely used Claude or Cursor to write code, but you have also probably noticed them hallucinating or missing the specific nuances of niche technical topics. I wanted my AI assistant to actually know what I write about, so I built a way to give it a direct line to my content.

The Setup

I spend most of my time writing about MCP servers, local LLMs, and AI automation. The problem is that while my articles are great for humans, LLMs often rely on outdated training data or generic web crawls. I needed a way to inject my specific, up-to-date knowledge into the context window of an AI assistant without the manual friction of copying and pasting links every time I had a question.

I decided to build a Model Context Protocol (MCP) server that acts as a specialized, searchable knowledge base for my writing. The goal was simple: when I ask an AI about a topic I have covered, it should be able to query this server, find the relevant article, and cite the source URL back to astgl.ai.

What I Built

The architecture is designed to be zero-config for the end user. I built a TypeScript-based MCP server using the @modelcontextprotocol/sdk (v1.12.1). Instead of requiring the user to run a heavy vector database or even have Ollama installed, I pre-computed the embeddings at build time.

I used nomic-embed-text (768 dimensions) via Ollama to turn my articles into vectors. These are stored in a single SQLite database using sqlite-vec. Because the vectors are baked into the package, the server is incredibly lightweight.

The server provides three main tools:

1. search_articles: This performs a vector similarity search and returns ranked results with relevance scores (from 0 to 1).

2. get_answer: This is a direct Q&A tool. It has a preference for FAQ entries and returns a direct answer, the source URL, and related articles.

3. list_topics: This simply lists all ingested articles with their descriptions, URLs, and section headings.

The Build Steps

Building this was less about complex coding and more about making smart decisions regarding data structure and portability.

First, I had to figure out how to chunk the data. I did not just dump whole articles into the database. Instead, I used a per-H2 chunking strategy. Since my technical sections are usually self-contained (typically between 300 and 800 tokens), this keeps the context clean and prevents the AI from getting overwhelmed by irrelevant text. I also treated FAQ entries as their own individual chunks so the get_answer tool could hit them directly for high-precision queries.
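To make the per-H2 idea concrete, here is a minimal Python sketch of splitting a Markdown article at its `##` headings. The real server is written in TypeScript, so this only illustrates the technique; `Chunk` and `chunk_by_h2` are hypothetical names, and it does not special-case `##` lines that appear inside code fences.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    heading: str
    text: str


# Matches "## Heading" lines only; "###" and deeper stay inside their parent chunk.
H2_PATTERN = re.compile(r"^## +(.+?)\s*$", re.MULTILINE)


def chunk_by_h2(markdown: str) -> list[Chunk]:
    """Split a Markdown document into one chunk per H2 section."""
    matches = list(H2_PATTERN.finditer(markdown))
    chunks = []
    for i, match in enumerate(matches):
        start = match.end()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(markdown)
        body = markdown[start:end].strip()
        if body:  # skip headings with no content under them
            chunks.append(Chunk(heading=match.group(1), text=body))
    return chunks  # text before the first H2 is intentionally dropped
```

Each chunk keeps its heading, which can be stored next to the embedding so a search result can point at the specific section as well as the article.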

Next, I chose the storage engine. I opted for sqlite-vec over an external vector database like Pinecone or Weaviate. I wanted the entire server to be a single, portable npm package. There is no infrastructure to manage and no API keys to handle. The resulting database is about 3.2MB, which compresses down to just 450KB inside the npm package. It is small enough to ship anywhere.

I also made a specific choice regarding the math. I used cosine distance instead of L2 (Euclidean) distance. Because the embeddings are normalized, both metrics rank results in the same order, but cosine gives a bounded score that is much easier to read and threshold. During testing, I saw on-topic relevance scores around 0.89, while off-topic noise stayed around 0.73. That gap is wide enough for the AI to clearly distinguish between a “good” match and a “maybe” match.
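For readers who want to see the scoring itself, here is a plain-Python sketch of cosine similarity and ranking. In the actual server the comparison happens inside sqlite-vec; the function names and the tiny two-dimensional vectors are purely illustrative.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as matching nothing
    return dot / (norm_a * norm_b)


def rank(query: list[float], documents: dict[str, list[float]]) -> list[tuple[str, float]]:
    """Score every document against the query, best match first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in documents.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

Because the score depends only on direction, a long article chunk and a short FAQ entry about the same topic can score similarly even though their raw vectors differ in magnitude before normalization.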

Finally, I kept this project in its own separate Git repository rather than burying it inside my larger openclaw-localllm monorepo. It makes the versioning much cleaner and allows me to publish updates to the knowledge base without touching my main codebase.

Publishing It

The goal was to make this “install and forget.” I published the package as mcp-astgl-knowledge@1.0.0 on npm, and I also listed it on Smithery under @jmeg8r/mcp-astgl-knowledge.

However, publishing wasn’t entirely seamless. I ran into a classic npm 2FA (Two-Factor Authentication) nightmare. When trying to automate the publish process, a standard Automation token and the --auth-type=web flag both failed with an E403 error. I eventually realized the fix: I had to create a Granular Access Token in npm and specifically enable the “Bypass 2FA for automation” option. Once I updated my local config with that new token, npm publish worked perfectly.

The end result is that anyone can add my knowledge base to their Claude Desktop or Cursor setup with a single JSON snippet. There is no need to install Ollama or set up a local database. You just run it via npx.

Why This Pattern Matters

This project taught me a valuable lesson about the future of content. We often think of MCP servers as bridges to live APIs, like a way to query GitHub or Google Search. But an MCP server can also be a way to ship pre-computed knowledge.

This is a new way to think about content strategy. By shipping an MCP server, you are making your writing “AI-citable.” When an AI assistant uses this tool, it is not just answering a question; it is actively citing your URL. This creates a technical loop where the AI’s answer drives referral traffic back to your site. It turns your technical documentation into a programmable asset that lives directly inside the user’s development environment.

Quick Reference

To use the server, add this to your mcpServers configuration in Claude Desktop or Cursor:

{
  "mcpServers": {
    "astgl-knowledge": {
      "command": "npx",
      "args": ["-y", "mcp-astgl-knowledge"]
    }
  }
}

You can find the source code and the full history of the build on GitHub: https://github.com/Jmeg8r/mcp-astgl-knowledge

Found this useful? I share practical lessons from my systems engineering journey at As The Geek Learns.


Behind ASTGL: How I Built an Autonomous AI Product Team That Ships Without Me

James Cruce — Fri, 03 Apr 2026 16:27:52 GMT

A technical case study on running a 5-agent AI council that researches, debates, builds, and publishes digital products—entirely on local hardware.

The Problem I Was Trying to Solve

I run a tech blog called As The Geek Learns. I also work full-time as a systems engineer. I also maintain two newsletters. I don’t have time to research product ideas, validate markets, write content, build deliverables, create marketing copy, and publish to a storefront.

But I had a Mac Studio M3 Ultra sitting on my desk with 256 GB of unified memory, and I kept thinking: what if I could build a team that does most of this for me?

Not a chatbot. Not a single prompt chain. An actual team—with specializations, disagreements, voting, and accountability.

This is what I built. Here’s how it works, where it breaks, and what I’ve learned.

The Architecture: Five Agents, One Pipeline

The system runs on OpenClaw, an open-source AI agent framework. The “team” is a council of five agents, each with a distinct role and scoring rubric. They share a single pipeline—one product at a time, from idea to published storefront listing.

The Agents

SCOUT—Market Intelligence

Scans for demand signals, competitor gaps, and audience fit. Scores ideas on: demand evidence, pain severity, competitor gap, ASTGL brand fit, and freshness. SCOUT’s job is to make sure we’re not building something nobody wants.

FORGE—Feasibility & Build

Estimates build time, assesses toolchain requirements, and defines scope ceilings. Scores on: build time, tool availability, format clarity, scope containment, and reusability. FORGE is the one who says, “That's a 40-hour build, not a weekend project,” and gets outvoted anyway. (More on that later.)

QUILL—Sellability & Messaging

Tests headlines, evaluates audience clarity, and writes all marketing copy. Scores on: headline test, audience clarity, urgency, differentiation, and shareability. If QUILL can’t write a compelling one-liner, the product doesn’t move forward.

LEDGER—Revenue & Pricing

Analyzes price points, margin viability, market size, and willingness to pay. Scores on: price point, addressable market, willingness to pay, recurring potential, and margin. LEDGER killed our first micro-eBook idea at $1.99—“below the $19 floor individually; bundle pricing required for margin viability.”

MAVEN—Customer Value & Quality

The quality gate. MAVEN scores on pain severity, solution completeness, time to value, trust signals, and predicted satisfaction. Nothing ships without MAVEN’s approval. MAVEN also runs the final quality review, fact-checking deliverables against their own constraints.

Five Agents - One Pipeline

How They Score

Each agent scores every product idea on their 5 criteria, 1-5 points each. Maximum: 25 per agent, 125 total. The pipeline threshold is 80% (100/125). Below that, the idea goes back to the backlog.
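The arithmetic behind that threshold can be sketched in a few lines of Python. The agent names and the 80% cutoff come from this article; the data layout and function names are assumptions made for illustration.

```python
AGENTS = ("SCOUT", "FORGE", "QUILL", "LEDGER", "MAVEN")
CRITERIA_PER_AGENT = 5
MAX_POINTS = 5
THRESHOLD_PERCENT = 80


def council_total(scores: dict[str, list[int]]) -> int:
    """Sum all agents' criterion scores after checking each is in the 1-5 range."""
    total = 0
    for agent in AGENTS:
        points = scores[agent]
        if len(points) != CRITERIA_PER_AGENT or not all(1 <= p <= MAX_POINTS for p in points):
            raise ValueError(f"{agent} needs {CRITERIA_PER_AGENT} scores between 1 and {MAX_POINTS}")
        total += sum(points)
    return total


def passes(scores: dict[str, list[int]]) -> bool:
    """True if the idea reaches 80% of the 125-point maximum (100 points)."""
    max_total = len(AGENTS) * CRITERIA_PER_AGENT * MAX_POINTS  # 5 * 5 * 5 = 125
    # Integer comparison avoids floating-point edge cases right at the cutoff.
    return council_total(scores) * 100 >= THRESHOLD_PERCENT * max_total
```

An idea where every agent gives 4 on every criterion lands exactly on 100 and passes; dropping a single criterion to 3 puts it at 99 and sends it back to the backlog.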

This isn’t a formality. I’ve watched LEDGER tank a product the other four loved because the margin math didn’t work. I’ve watched FORGE dissent on sequencing even when the vote went against them. The scoring rubric creates genuine tension, and that tension produces better decisions.

The Pipeline: Seven Steps

When the council reaches consensus on a product, it enters a sequential pipeline:

1. Market Research—SCOUT deep-dives demand signals, competitor analysis, pricing benchmarks

2. Pricing—LEDGER sets price point with live competitor data

3. Creative Brief—QUILL writes the brief: audience, tone, format, constraints

4. Build—FORGE constructs the deliverables (PDFs, templates, code bundles, worksheets)

5. Quality Review—MAVEN fact-checks, verifies constraints, scores 1-10 (must be ≥7 to pass)

6. Marketing—QUILL produces product descriptions, social posts, email sequences

7. Package & Publish—Final ZIP assembly, Stripe product creation, storefront listing

Each step runs as a cron job on a 30-minute cycle. The pipeline runner checks for the active product, determines which step is next, and executes it. A full product can go from consensus to published in under 24 hours—though in practice it usually takes 2-3 days because of quality gates and my own review bottlenecks.
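The "determine which step is next" logic can be sketched as a small pure function. This is an illustration under assumptions, not the actual runner: progress is tracked as a count of completed steps, and a product that has not scored at least 7 in Quality Review is simply held at that step.

```python
from typing import Optional

PIPELINE_STEPS = [
    "Market Research",
    "Pricing",
    "Creative Brief",
    "Build",
    "Quality Review",
    "Marketing",
    "Package & Publish",
]
QUALITY_PASS_SCORE = 7  # MAVEN's minimum review score


def next_step(completed_steps: int, quality_score: Optional[int] = None) -> Optional[str]:
    """Return the step to run on this cycle, or None when the product is published."""
    if completed_steps >= len(PIPELINE_STEPS):
        return None
    review_index = PIPELINE_STEPS.index("Quality Review")
    gate_failed = quality_score is None or quality_score < QUALITY_PASS_SCORE
    if completed_steps > review_index and gate_failed:
        return "Quality Review"  # assumption: hold here until the gate passes
    return PIPELINE_STEPS[completed_steps]
```

Keeping this as a pure function of stored state means each 30-minute cron run can be stateless: load the product record, ask for the next step, execute it, save the result.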

The Governance: How Five Agents Make Decisions

This is the part I’m most proud of and the part that surprised me the most.

Voting

The council uses ranked-choice instant-runoff voting. Each agent ranks their preferred ideas. If no idea gets >60% of the weighted score in Round 1, the lowest-scoring idea is eliminated and votes redistribute. This continues until consensus or deadlock.

In practice, most decisions resolve in 2-3 rounds. The debates are real—FORGE might argue for a faster build while SCOUT pushes for a higher-signal market. LEDGER might prefer a premium product, while QUILL argues the audience can’t bear that price point.
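Here is a simplified sketch of instant-runoff voting with a 60% supermajority. It makes several assumptions that may differ from the council's real protocol: every agent's ballot carries equal weight, ties are broken alphabetically, and the vote is declared deadlocked if only two ideas remain and neither clears the threshold.

```python
from collections import Counter
from typing import Optional


def instant_runoff(ballots: list[list[str]], threshold_percent: int = 60) -> tuple[Optional[str], int]:
    """Return (winner, rounds_used); winner is None when the vote deadlocks."""
    remaining = {idea for ballot in ballots for idea in ballot}
    rounds = 0
    while remaining:
        rounds += 1
        tally = Counter()
        for ballot in ballots:
            # Each ballot counts for its highest-ranked idea still in the race.
            top = next((idea for idea in ballot if idea in remaining), None)
            if top is not None:
                tally[top] += 1
        total = sum(tally.values())
        if total == 0:
            return None, rounds
        leader, votes = max(tally.items(), key=lambda kv: (kv[1], kv[0]))
        if votes * 100 > threshold_percent * total:  # strictly more than 60%
            return leader, rounds
        if len(remaining) <= 2:
            return None, rounds  # deadlock: escalate
        loser = min(remaining, key=lambda idea: (tally[idea], idea))
        remaining.discard(loser)
    return None, rounds
```

With five voters, clearing 60% takes four votes, so a 3-2 split between the last two ideas counts as a deadlock under these assumptions.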

Deadlock Protocol

If the evening session can’t reach >60% consensus, it escalates to a frontier model (Claude Sonnet) for a tie-breaking analysis. This has happened twice. Both times, the frontier model sided with the minority agent, which I found interesting.

Kill Switches and Stall Prevention

The system has several circuit breakers:

3 consecutive no-consensus meetings → Alert sent to me via Discord

7 days with no active build → Next morning meeting must activate the highest-scored backlog item

48+ hours stalled on one pipeline step → Council votes to hold or shelve

Global kill switch—I can halt everything with a single signal file

The stall prevention matters because I learned early that without it, the council will endlessly debate scoring refinements instead of building products. The 48-hour timer forces decisions.
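The circuit breakers above boil down to a handful of threshold checks. A minimal sketch, assuming the counters are tracked somewhere in the council's state; the `CouncilState` fields, the action labels, and the kill-file path are all hypothetical.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class CouncilState:
    consecutive_no_consensus: int
    days_without_active_build: int
    hours_on_current_step: float


def circuit_breaker_actions(state: CouncilState, kill_file: Path) -> list[str]:
    """Return the actions the circuit breakers call for on this cycle."""
    if kill_file.exists():
        return ["halt"]  # global kill switch overrides everything else
    actions = []
    if state.consecutive_no_consensus >= 3:
        actions.append("alert_discord")
    if state.days_without_active_build >= 7:
        actions.append("activate_top_backlog_item")
    if state.hours_on_current_step >= 48:
        actions.append("vote_hold_or_shelve")
    return actions
```

Checking the signal file first keeps the kill switch dependable: it works even if the rest of the state is corrupted or missing.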

SOUL.md: The Operating Constitution

Every agent operates under a shared SOUL.md—a set of principles that override everything else:

Be genuinely helpful (no filler content)

Have opinions (agents must take positions, not hedge)

Be resourceful before asking (try to solve problems before escalating to me)

Earn trust through competence

When SOUL.md conflicts with a protocol decision, SOUL.md wins. This has saved me from shipping mediocre products more than once—MAVEN has invoked SOUL principles to block a product that technically passed all numeric thresholds but didn’t meet the “genuinely helpful” standard.

The Infrastructure: Running It Locally

Everything runs on a single Mac Studio. No cloud APIs for the pipeline work—just local models.

Models

Primary: Qwen3 32B (served via llama.cpp, not Ollama—4x faster wall time)

Code tasks: Qwen 2.5 Coder 32B

Heavy reasoning: Qwen 2.5 72B (for consolidation and ideation tasks)

Light tasks: Qwen3 8B (monitoring, notifications, briefings)

The 32B model handles all pipeline steps. llama.cpp serves it on port 8081 with flash attention, 3 parallel slots, and 128K total context. I migrated from Ollama after benchmarking showed 3.7 seconds vs 14.4 seconds on the same prompt.

Cron Fleet

26 cron jobs run the system:

Pipeline runner every 30 minutes during business hours

Morning briefing at 6:30 AM

Evening summary at 8:00 PM

Market scans daily

Council meetings (morning standup + evening strategy session)

Notification triage every 5 minutes (critical), hourly (important), every 3 hours (low)

Weekly deep scan, docs audit, test suite, security audits

All delivery goes through Discord with a Slack mirror. The system has its own Discord server with dedicated channels for ops alerts, council meetings, pipeline status, product reviews, and market intelligence.

The Numbers That Matter

VRAM usage: ~97 GB pinned across 4 models (out of 256 GB available)

Pipeline throughput: Consensus to published in 12-48 hours depending on product complexity

Quality scores: Averaging 8-8.5/10 on MAVEN reviews

Council consensus rate: >80% of votes resolve without escalation

What I’ve Shipped

Since the council went live in late March 2026:

Incident Response Runbook & Playbook Toolkit—112/125, $19-24, targeting DevOps/SRE teams

Sysadmin Documentation Toolkit—113/125, $24, Markdown + Notion + Obsidian formats

Server Security Hardening Checklist Kit—119/125 (highest score), $24, Linux + Windows

Homelab Disaster Recovery Kit—116/125, $29 (premium tier), Proxmox + TrueNAS focused

Each product went through all seven pipeline steps, MAVEN quality review, and full marketing asset generation. The marketing copy, product descriptions, and email sequences were all council-produced.

I review everything before it goes live. The council builds it; I sanity-check it. So far I’ve shipped every product MAVEN approved—their quality bar has been higher than mine in several cases.

Where It Breaks

I’d be dishonest if I didn’t cover the failure modes. There are several.

The Approval Bottleneck

The biggest problem has been me. The system is designed to be autonomous, but exec permissions—the ability for agents to run shell commands—require approval policies. When those policies are too restrictive, the pipeline stalls waiting for me to approve a command I would have approved anyway. One product sat blocked for 30+ hours because the agent couldn’t run a script to rebuild a ZIP file.

The fix was widening exec permissions for the pipeline agent. The tradeoff is real: more autonomy means more trust in the model not to do something destructive. I’m comfortable with it because everything runs locally and I have kill switches, but it’s a genuine tension.

Model Limitations

32B parameter models are not frontier models. They occasionally:

Lose track of multi-step instructions across long pipeline runs

Generate marketing copy that’s technically correct but tonally flat

Miss nuanced quality issues that a larger model would catch

The council structure mitigates this—five agents checking each other catches most issues. But I’ve had to add explicit constraints (like MAVEN’s “interface-agnostic instructions” rule) after catching problems the models didn’t flag.

The Constant Infrastructure Churn

OpenClaw ships updates frequently. Each update can break config schemas, change entry points, invalidate auth tokens, or require new mandatory config keys. I’ve turned off auto-updates and moved to manual, controlled upgrades—the same approach I use for enterprise infrastructure at my day job.

Stale Sessions

When an agent session accumulates enough failed attempts, the model sometimes learns the wrong patterns from its own conversation history. The fix is clearing the session, but diagnosing when this is happening versus a legitimate configuration issue takes time.


What I’ve Learned

Governance matters more than model quality. The scoring rubric, voting mechanics, and kill switches produce better outcomes than throwing a more powerful model at an unstructured problem. Five constrained agents outperform one unconstrained agent.

Autonomy is a spectrum, not a switch. The right level of autonomy depends on the task, the blast radius, and your comfort level. I give the pipeline agent full exec access. I keep the notification agent on a tight allowlist. Both are correct.

Local inference is viable for production work. A 32B model on good hardware, served properly (llama.cpp, not Ollama, with enough context per slot), handles pipeline tasks reliably. You don’t need cloud APIs for everything.

Your AI team will reflect your engineering discipline. If you don’t build in stall prevention, it won’t prevent stalls. If you don’t define quality gates, quality will drift. If you don’t document your governance, your agents will improvise governance—poorly.

The hardest part isn’t the AI. It’s the ops. Model selection, prompt engineering, and agent design are maybe 30% of the work. The other 70% is cron scheduling, delivery routing, auth management, monitoring, and debugging why the Discord webhook stopped working at 3 AM.

What’s Next

The council is evaluating enterprise-tier products (an AI Agent Readiness Guide targeting IT leaders) and exploring whether the pipeline can handle longer-form content like courses. I’m also working on making the council’s own meeting transcripts and decision logs available as a teaching resource—because the most interesting thing about this system isn’t the products it ships, but the way five AI agents argue about what to build next.

If you want to build something like this, start smaller than I did. One agent, one cron job, one delivery channel. Get that working reliably before you add governance. The infrastructure complexity compounds fast.


James Cruce is a systems engineer and the human behind As The Geek Learns. He runs an autonomous AI product team on a Mac Studio in his home office. The council has not yet voted to replace him, but LEDGER has noted the margin improvement if they did.

Technical Stack: OpenClaw v2026.4.2 · Qwen3 32B (llama.cpp) · Mac Studio M3 Ultra (256 GB) · Discord + Slack delivery · 26 cron jobs · 5-agent council with IRV voting

Published on As The Geek Learns—astgl.com

From Notion Export to Local Knowledge Base in One Afternoon

James Cruce — Mon, 30 Mar 2026 01:01:06 GMT

I hit the Export button in Notion and got a zip file. Three megabytes. Years of notes, content calendars, CRM records, meeting summaries, and project tracking—all flattened into Markdown and CSV files with garbled names.

Turning that export into a structured local knowledge base for an AI agent wasn’t hard. But it was full of the kind of small surprises that make you appreciate why Notion charges a subscription.

Knowledge Word Cloud

The Export

Notion’s export is a zip file. Or rather, it’s a zip inside a zip. The outer archive contains an inner archive, and the inner archive contains your data. This is apparently normal. It’s also apparently undocumented.

The files inside follow Notion’s internal naming convention: every file has a 32-character hex ID appended to its name. You get Content Calendar 1a2b3c4d5e6f.csv (ID shortened here) instead of just Content Calendar.csv. The directory structure mirrors your Notion workspace, but the folder names have IDs too.
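Cleaning those names up is a one-regex job. A minimal sketch, assuming the ID is always 32 lowercase hex characters separated from the name by whitespace; `strip_notion_id` is a hypothetical helper that takes a single file or folder name, not a full path.

```python
import re
from pathlib import PurePosixPath

# Whitespace followed by a 32-character lowercase hex ID at the end of the name.
NOTION_ID = re.compile(r"\s+[0-9a-f]{32}$")


def strip_notion_id(name: str) -> str:
    """Remove the trailing Notion ID from a file or folder name, keeping the extension."""
    path = PurePosixPath(name)
    stem = NOTION_ID.sub("", path.stem)
    return stem + path.suffix
```

Names without an ID pass through unchanged, so the function is safe to run over an export more than once.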

For 47 files, this was manageable. I organized them into a workspace structure that made sense for both me and Tars:

Every file is Markdown or CSV. Every file lives on my SSD. Every file is readable by both ClawPad (the editor) and Tars (the agent). No database. No API. Just files.

The CSV Gotcha

The content calendar was the most important import. Thirty articles in various stages—ideas, drafts, ready to publish. Notion exports these as CSV with all the metadata: title, status, publish date, and tags.

My first attempt to parse it produced rows titled “Untitled” for every entry. The data was there, but the title column wasn’t matching.

The culprit: BOM (Byte Order Mark) characters. Notion’s CSV export prepends \ufeff to the beginning of the file. This invisible character attaches itself to the first column header, so Title becomes \ufeffTitle. Your code reads the header, doesn’t find a match for Title, and returns empty strings.

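import csv

# Wrong: the default decoding leaves the BOM attached to the first header
with open('calendar.csv', newline='') as f:
    reader = csv.DictReader(f)

# Right: utf-8-sig strips the BOM before csv sees the header row
with open('calendar.csv', encoding='utf-8-sig', newline='') as f:
    reader = csv.DictReader(f)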

The utf-8-sig encoding strips the BOM automatically. This is a Python-specific fix; other languages have their own BOM handling. The universal lesson, though: always inspect the actual bytes of imported data before writing parsing code. Running head -c 20 file.csv | xxd would have shown me the BOM in seconds.
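The same check works from inside Python. This small sketch uses an in-memory sample (the two-column CSV is made up for illustration) to show the BOM sticking to the first header under plain utf-8 and disappearing under utf-8-sig.

```python
import codecs
import csv
import io


def has_utf8_bom(raw: bytes) -> bool:
    """True if the byte string starts with the UTF-8 byte order mark."""
    return raw.startswith(codecs.BOM_UTF8)


# Illustrative sample, not real export data.
raw = codecs.BOM_UTF8 + b"Title,Status\nFirst post,Draft\n"

# Decoding as plain utf-8 keeps the BOM glued to the first header.
plain = next(csv.DictReader(io.StringIO(raw.decode("utf-8"))))

# Decoding as utf-8-sig drops it, so the Title key matches again.
clean = next(csv.DictReader(io.StringIO(raw.decode("utf-8-sig"))))
```

Looking up "Title" in plain finds nothing, which is exactly how every row ended up “Untitled.”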

Building the Content Pipeline

The raw CSV became pages/astgl/pipeline.md—a living document Tars checks during every heartbeat cycle:

HEARTBEAT.md includes the instruction: “Check pages/astgl/pipeline.md for content with Target Publish dates this week.” Now Tars flags upcoming deadlines in morning briefings without me checking a dashboard.
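Tars does this check by reading the file, not by running code, but the date logic it has to get right looks roughly like the sketch below. The assumptions are mine: rows arrive as dictionaries with Title and Target Publish keys, dates are in YYYY-MM-DD form, and “this week” means Monday through Sunday.

```python
from datetime import date, timedelta


def due_this_week(rows: list[dict[str, str]], today: date) -> list[str]:
    """Return titles whose Target Publish date falls in today's week."""
    # Assumption: the week runs Monday through Sunday.
    start = today - timedelta(days=today.weekday())
    end = start + timedelta(days=6)
    due = []
    for row in rows:
        raw = row.get("Target Publish", "").strip()
        if not raw:
            continue  # ideas with no date yet
        try:
            target = date.fromisoformat(raw)
        except ValueError:
            continue  # skip dates that are not YYYY-MM-DD
        if start <= target <= end:
            due.append(row.get("Title", "Untitled"))
    return due
```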

This is the difference between a document editor and an agent-backed workspace. In Notion, I’d set a reminder on a database entry and get a notification. Here, Tars reads the pipeline, cross-references dates, and proactively tells me what’s due—in the same morning briefing that includes my calendar and reminders.

Structuring a Newsletter Workspace

The import included data for Resist & Rise, a weekly newsletter covering national politics and mutual aid. In Notion, this lived in its own database with its own views. Locally, it became a directory structure that mirrors the actual workflow:

resist-rise/
├── newsletter/
├── drafts/
├── investigations/
├── sources/
└── research/

The research folder gets a source credibility index. When you’re writing about politics and policy, not all sources carry equal weight, and an AI agent needs to understand that.

Each has a research index with source credibility tiers:

Tars has context about the project in AGENTS.md and IDENTITY.md. It understands the newsletter’s focus and can route research requests appropriately. When I say “What’s the latest on voting rights legislation?” Tars knows to pull from the Resist & Rise research pipeline and prioritize primary and secondary tier sources.

This is where the local-first approach starts paying off. The source tiers aren’t just a reference table for me. They’re instructions Tars follows when gathering material. The agent doesn’t just store your research workflow. It participates in it.

What Didn’t Export

The export was incomplete. Forty-seven files out of what should have been hundreds. My VMware training notes, meeting summaries, and a chunk of project documentation didn’t make it.

Notion’s export is a single monolithic operation. When it fails, it fails silently—you get whatever it managed to grab before it gave up. There’s no error log, no partial manifest, no indication that files are missing unless you know what should be there.

The workaround: smaller, targeted exports. Instead of exporting the entire workspace at once, export individual top-level pages or databases. The success rate is dramatically higher with smaller payloads.
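Since Notion won’t tell you what is missing, you have to bring your own manifest. A rough sketch, assuming you can write down the page titles you expect and that exported names carry the 32-character hex ID suffix; exported_titles and missing_pages are hypothetical helper names.

```python
import re
from pathlib import Path

# Assumption: exported names end in a space plus a 32-character hex ID.
_ID = re.compile(r"\s+[0-9a-fA-F]{32}$")


def exported_titles(export_root: Path) -> set[str]:
    """Collect file titles under export_root with the ID suffix removed."""
    return {_ID.sub("", p.stem) for p in export_root.rglob("*") if p.is_file()}


def missing_pages(expected: set[str], export_root: Path) -> set[str]:
    """Return the expected titles that did not show up in the export."""
    return expected - exported_titles(export_root)
```

Run it after each targeted export and the gaps show up immediately instead of weeks later.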

This reinforced something I already knew but needed to feel again: any system that holds your data hostage through export limitations is a system you should leave. Notion isn’t malicious about this—the export just isn’t robust. But the effect is the same. Your data is easy to put in and annoying to get out.

The Missing Piece: RAG

Right now, Tars reads files directly. It can open a Markdown file, parse it, and use the contents. But it can’t search across hundreds of documents semantically. If I ask, “What did we discuss about VMware licensing in January?” Tars would need to open every file and look.

RAG (Retrieval-Augmented Generation) solves this by embedding documents into a vector database and retrieving relevant chunks at query time. The model I pulled—nomic-embed-text at 274 MB—is sitting ready for this. AnythingLLM can provide the RAG layer.
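To make the retrieval step concrete, here is a toy version in plain Python. It swaps the real embedding model for simple word counts and the vector database for a dictionary, so it illustrates the shape of the idea (embed, compare, return the top matches) and nothing about how nomic-embed-text or AnythingLLM actually work. All names are illustrative.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: dict[str, str], k: int = 2) -> list[str]:
    """Return the names of the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda name: cosine(q, embed(chunks[name])), reverse=True)
    return ranked[:k]
```

Word counts only match exact terms. A real embedding model is what lets “licensing” find a note that says “subscription renewal.”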

That’s deferred to a future phase. The workspace works without it—Tars knows where files are and can read them on demand. But RAG is the difference between “organized file system” and “searchable knowledge base.” It’s coming.


Quick Reference


This is Part 4 of the Notion Replacement series. Part 3 covered Channel Communication with an AI. Next up: securing the whole thing against a 7-layer threat model. Follow along at As The Geek Learns.

I Gave My AI Agent Hands. It Immediately Started Texting My Wife.

James Cruce — Tue, 24 Mar 2026 22:31:05 GMT

My local AI agent, Tars, could talk. It could answer questions via iMessage and Slack. But it couldn’t actually do anything—couldn't create files, couldn’t read my calendar, couldn’t manage tasks. It was a chatbot trapped behind glass.

Phase 2 of the Notion replacement project was about giving Tars hands. Phases 3 and 4 were about teaching it what to do with them. Phase “oh no” was when it started doing things I didn’t ask for.

From Chatbot to Agent

OpenClaw ships with something called “tool profiles”—presets that control what an agent can and can’t do. Out of the box, Tars was running the messaging profile. Sounds reasonable. Here’s what that actually means:

Messaging profile: sessions_list, sessions_history, sessions_send, message

Four tools. All it could do was read and write messages. It couldn’t touch the filesystem, run commands, or use any of the skills I was about to install. I needed the full profile.

I found this by reading OpenClaw’s bundled JavaScript source code. The tool profiles aren’t documented anywhere obvious—they're defined in a minified file called tool-catalog-CDe8aNjS.js. The full profile is literally an empty object: {}. No restrictions.

{
  "tools": {
    "profile": "full"
  }
}

One config change, gateway restart, and Tars went from a chatbot to an autonomous agent with read, write, edit, and execute permissions on my machine.

This is the moment where self-hosting gets real. You’re not toggling a feature in a SaaS dashboard with guardrails someone else built. You’re handing your local AI unrestricted shell access. The power is exhilarating. The implications are sobering. We’ll come back to that.

ClawPad: The Editor That Sees What The Agent Writes

ClawPad is a Notion-style document editor that connects to the same workspace Tars uses. When Tars creates a file at ~/.openclaw/workspace/pages/daily-notes/2026-03-07.md, it shows up in ClawPad instantly. When I edit a page in ClawPad, Tars can read the changes.

The setup was straightforward—clone, pnpm install, build for production, create a LaunchAgent for auto-start. One gotcha: ClawPad authenticates to the gateway via WebSocket, and every gateway restart invalidates the auth token. If ClawPad shows “Reconnecting…” after a gateway restart, stop and restart ClawPad too.

The result is something Notion can’t do: a document editor where an AI agent and a human work in the same workspace simultaneously. I type notes. Tars creates daily briefings. Same files. Same directory. No sync service.

Teaching Tars Useful Skills

With the full tool profile, Tars could theoretically do anything. In practice, it needed specific skills:

calctl for Apple Calendar access

apple-reminders for task management

cairn-cli for project-level task tracking

brainrepo for knowledge organization

Installing skills from ClawHub (OpenClaw’s skill marketplace) revealed an interesting tension. ClawHub integrates VirusTotal scanning, and cairn-cli got flagged as “suspicious.” The flag was a false positive—VirusTotal flagged the npm install requirement as risky behavior.

I reviewed the actual source code on GitHub. It’s a local-only tool that writes .jsonl files. No network calls. No shell exec. No dependencies beyond the Node.js standard library. Clean.

But ClawHub wouldn’t install it. Rate limiting kicked in after multiple attempts. The workaround: install the npm package directly and create the OpenClaw skill wrapper manually.

npm install -g @valpet/cairn-cli

Then create ~/.openclaw/workspace/skills/cairn-cli/SKILL.md with YAML frontmatter (required for OpenClaw to detect it) and a _meta.json file. This discovery—that skills need YAML frontmatter with name and description fields—cost me an hour of debugging.
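A quick pre-flight check would have saved me that hour. The sketch below is not OpenClaw’s detection logic, which I have not seen in readable form. It only tests for a leading --- block containing non-empty name and description lines, using a deliberately minimal key: value parser rather than a real YAML library.

```python
def frontmatter_fields(text: str) -> dict[str, str]:
    """Parse simple 'key: value' lines between leading '---' delimiters."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields  # closing delimiter reached
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return {}  # no closing delimiter: treat as missing frontmatter


def missing_skill_fields(text: str) -> list[str]:
    """List which of the two required fields are absent or empty."""
    fields = frontmatter_fields(text)
    return [name for name in ("name", "description") if not fields.get(name)]
```

An empty list means the file has both fields. Anything else names what to add.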

The lesson here generalizes: automated security scanning is essential but imperfect. A tiered vetting approach works better. Bundled skills get auto-trusted. Community skills get an automated scan. Anything that needs npm install or runs shell scripts gets a manual code review. You can’t deep-review everything, but you can calibrate the depth to the risk.
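Written down as a decision rule, the tiers look something like this. The Skill fields and the review_tier function are illustrative, not part of OpenClaw or ClawHub, and checking the risky conditions first is my reading of “anything that needs npm install.”

```python
from dataclasses import dataclass
from enum import Enum


class Review(Enum):
    AUTO_TRUST = "auto-trust"
    AUTOMATED_SCAN = "automated scan"
    MANUAL_REVIEW = "manual code review"


@dataclass
class Skill:
    name: str
    bundled: bool = False
    needs_npm_install: bool = False
    runs_shell_scripts: bool = False


def review_tier(skill: Skill) -> Review:
    """Pick the review depth for a skill, riskiest condition first."""
    if skill.needs_npm_install or skill.runs_shell_scripts:
        return Review.MANUAL_REVIEW
    if skill.bundled:
        return Review.AUTO_TRUST
    # Everything else is treated as a community skill.
    return Review.AUTOMATED_SCAN
```

Under this rule cairn-cli lands in the manual review tier, which is where it ended up in practice.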

The Automation That Changes Everything

Phase 4 is where Tars stopped being a tool I use and started being an assistant that works for me.

OpenClaw has a cron system. I configured two jobs:

Morning briefing (6:30 AM):

1. Create today’s daily note from a template

2. Check Apple Calendar for today’s schedule

3. Check Apple Reminders due today

4. Send a bullet-point summary via iMessage to my phone

Evening summary (8:00 PM):

1. Read today’s daily note

2. Summarize what happened

3. Note any incomplete reminders

4. Send the recap via iMessage

I wake up to a briefing I didn’t ask for. I go to bed with a summary I didn’t write. The HEARTBEAT.md file—standing instructions the agent checks every 30 minutes—includes explicit guard rails: “Do NOT repeat tasks from prior heartbeats. Do NOT infer tasks from old conversations. Only act on what HEARTBEAT.md says.”

Those guards exist because Tars hallucinated actions from stale context during early testing. A local 32B model without guardrails will confidently do things you never asked for. The instructions have to be specific enough that “creative interpretation” isn’t an option.

The Wife Incident

This is the part nobody warns you about when they demo autonomous AI agents.

During Phase 1 setup, I added my wife’s phone number to Tars’s iMessage allow list for testing. We verified it worked, moved on, and I removed the number from the config later.

Two days later, my wife got a message from Tars. Then another. It was responding to her texts to me—not maliciously, not incorrectly, but helpfully. She was not amused.

The root cause was subtle: removing a number from allowFrom doesn’t revoke existing sessions. OpenClaw had created a conversation session for her number during testing. That session persisted even after I removed authorization. Her inbound messages hit the existing session and Tars dutifully replied.

The fix required surgery:

1. Stop the gateway

2. Delete her session from sessions.json

3. Delete the conversation transcript file

4. Fix the main session’s lastTo field (which still pointed to her number)

5. Restart the gateway

There’s no OpenClaw sessions delete command. I had to edit JSON by hand.

This is a real failure mode for any autonomous messaging agent. Authorization removal must be complete—not just revoking future access but destroying existing sessions. If you’re building anything that sends messages autonomously, test the removal path as carefully as the setup path.
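The JSON surgery in steps 2 and 4 can be sketched as a function over the loaded session data. Treat the shape as a guess: I am assuming a dictionary of sessions keyed by ID, a hypothetical peer field holding the contact’s number, and a main session stored under the key main. Only lastTo comes from the actual file. Stopping the gateway and deleting the transcript file still happen outside this function.

```python
def revoke_contact(
    sessions: dict[str, dict], number: str, main_key: str = "main"
) -> list[str]:
    """Drop every session for `number` and clear a stale lastTo pointer.

    Assumed layout (hypothetical): {session_id: {"peer": ..., "lastTo": ...}}.
    Returns the session keys that were removed.
    """
    removed = [key for key, s in sessions.items() if s.get("peer") == number]
    for key in removed:
        del sessions[key]
    main = sessions.get(main_key)
    if main is not None and main.get("lastTo") == number:
        main["lastTo"] = None  # stop the main session pointing at them
    return removed
```

The point of returning the removed keys is testability: the removal path gets an assertion, just like the setup path.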

The Pairing Message Problem (And A Creative Fix)

With the wife situation resolved, a new issue surfaced. Anyone who texted my number got a cryptic pairing message from Tars:

OpenClaw: access not configured.
Your iMessage sender id: +1XXXXXXXXXX
Pairing code: QW29MHCQ
Ask the bot owner to approve with:
openclaw pairing approve imessage QW29MHCQ

My friends don’t know what OpenClaw is. They don’t know what a pairing code is. This message looks like spam.

The pairing message is hardcoded in OpenClaw’s source. There’s no config option to customize it. But I wanted a friendly follow-up that explains what’s happening.

The solution: a file watcher that monitors the pairing store and sends a custom message immediately after OpenClaw’s default one.

# fswatch monitors the pairing file for changes
fswatch -o "$PAIRING_FILE" | while read -r _count; do
 sleep 1
 # Parse new sender IDs, skip ones we’ve already messaged
 # Send custom TARS introduction via imsg
done

Now when a friend texts me and triggers the pairing flow, they get the technical pairing message followed immediately by:

“This is TARS. I am an OpenClaw bot who assists James. You can safely ignore this message or contact James and ask to approve the Pairing message if you would like me (TARS) to be a part of your conversation.”

A LaunchAgent keeps the watcher running persistently. It’s a workaround for a missing feature, built from fswatch and imsg send. Sometimes the best solution to “the framework doesn’t support this” is a 30-line shell script that watches a file.
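The “skip ones we’ve already messaged” step is the part worth getting right, because fswatch fires on every change to the file. Here is that bookkeeping sketched in Python rather than shell. The function name and the JSON state file are my own choices for illustration, and how sender IDs are read out of the pairing store is left to the caller.

```python
import json
from pathlib import Path


def senders_to_greet(pending: list[str], seen_file: Path) -> list[str]:
    """Return sender IDs not greeted before and record them as seen."""
    seen = set(json.loads(seen_file.read_text())) if seen_file.exists() else set()
    # dict.fromkeys removes duplicates while keeping first-seen order.
    fresh = [s for s in dict.fromkeys(pending) if s not in seen]
    seen_file.write_text(json.dumps(sorted(seen | set(fresh))))
    return fresh
```

Each ID is returned exactly once across runs, so a friend gets one introduction no matter how many times the watcher wakes up.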

Quick Reference

| Problem | Solution |
| --- | --- |
| Agent can’t create files | Change tools.profile to "full" |
| ClawPad stuck on “Reconnecting” | Restart ClawPad after any gateway restart |
| Skill flagged by VirusTotal | Review source manually, install via npm if clean |
| Skills not detected by OpenClaw | Add YAML frontmatter to SKILL.md |
| Cron job won’t run by name | Use the UUID, not the name |
| Agent texts unauthorized contacts | Delete session AND transcript, not just allowFrom entry |
| Pairing message too cryptic | File watcher + custom follow-up message via imsg |

This is Part 3 of the Notion Replacement series. Part 1 covered why I’m making the switch. Part 2 covered the install gauntlet. Next up: importing my Notion data. Follow along at As The Geek Learns.