I wanted a setup where my assistant could search my notes locally—fast, private, and without shipping my personal text to a third-party embedding API. The goal was simple: fast, private search over my own notes, with nothing leaving the machine.
This post is a practical write-up of what worked, what broke, and the specific gotchas I hit along the way.
The setup has three layers:
1) Assistant (OpenClaw)
2) Memory backend (QMD)
3) Local inference runtime (node-llama-cpp / llama.cpp)
Key idea: the assistant does not load your entire memory into the prompt every message. Instead it does retrieval on demand and injects only the top matching snippets into the model context.
The first mental shift was realizing that “memory” is usually a search problem, not a “stuff every file into context” problem.
In practice, that means the assistant answers each message by calling memory_search and injecting only the top results. This keeps latency and token usage under control.
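That flow can be sketched in shell. Everything here is illustrative: the "qmd search" invocation is an assumption about the CLI, not a documented interface, and QMD_BIN is a hypothetical override so the sketch can be exercised without QMD installed.

```shell
#!/usr/bin/env bash
# Sketch of retrieval-on-demand: fetch only the top-matching snippets and
# splice them into the prompt. "qmd search" is an assumed invocation, not
# QMD's documented CLI; QMD_BIN is a hypothetical override for testing.
QMD_BIN="${QMD_BIN:-qmd}"

build_context() {
  local query="$1"
  local snippets
  snippets="$("$QMD_BIN" search "$query")"   # hypothetical invocation
  printf 'Relevant notes:\n%s\n\nUser message: %s\n' "$snippets" "$query"
}
```

The point is the shape, not the flags: the model only ever sees the handful of snippets the search returned, never the whole corpus.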
I configured OpenClaw to use QMD as the memory backend. The benefits were exactly the ones I was after: everything stays local and private, and retrieval is fast.
I also leaned into a clean separation between MEMORY.md and a memory/ folder. That distinction matters because retrieval only works on what you actually index.
On macOS, the goal was GPU acceleration via Metal.
This is where things got spicy.
The first time you run commands that touch the model runtime, you can trigger a long build of native components (CMake + C++ compilation). This is normal, and it can look like it’s “stuck” when it’s actually building.
Something I didn’t expect: running certain commands in a non-interactive environment (no TTY, piped output) can behave differently than running them directly in Terminal.
I saw a few failure modes:

- SIGKILL near 99% during builds (often resource pressure)
- broken pipe errors (EPIPE) when output was piped

My takeaway: do first-run builds and GPU validation in a real Terminal first.
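A cheap way to catch this class of problem early is to check for a TTY before kicking off anything long-running. This small guard just warns when it detects a headless invocation:

```shell
#!/usr/bin/env bash
# Warn when stdin/stdout are not attached to a terminal, since first-run
# native builds and macOS permission prompts can silently stall there.
warn_if_headless() {
  if [ -t 0 ] && [ -t 1 ]; then
    echo "interactive terminal detected"
  else
    echo "warning: no TTY; long builds and permission prompts may look hung"
  fi
}
```

Dropping something like this at the top of any script the assistant runs makes the "it's not stuck, it's headless" cases much easier to diagnose.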
One particularly sneaky issue: macOS may ask for permission (e.g., a Photos/Media prompt) when a library is accessed. If your command is running headless, it may look like a hang.
When I approved the prompt, the status output immediately reported success. Once that happened, I switched the assistant to use the “Metal” configuration by default.
I ended up using a wrapper script so the assistant always invokes QMD with the right environment.
A wrapper is useful for pinning the runtime environment in one place (for example, selecting the metal backend). This keeps the assistant’s config simple: it just calls one command.
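A minimal version of that wrapper, reduced to a function so it is easy to test, might look like this. The variable name QMD_GPU_BACKEND and the qmd CLI shape are assumptions, not documented QMD settings; substitute whatever your runtime actually reads.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper (e.g. saved as ~/bin/qmd-metal), reduced to a function.
# QMD_GPU_BACKEND is an assumed variable name, not a documented QMD setting.
# A standalone wrapper script would instead end with: exec qmd "$@"
qmd_metal() {
  QMD_GPU_BACKEND=metal "${QMD_BIN:-qmd}" "$@"
}
```

In the real script you would use exec for the final call so the wrapper does not linger as a parent process; the function form here is just easier to verify in isolation.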
Even with everything “working,” I noticed retrieval could be brittle. For example, a long, specific query could come back empty even when a shorter keyword version of the same question found the right note.
So I added an easy improvement at the assistant layer: if memory_search returns empty, automatically retry with simplified keywords (1–2 attempts). This made the system feel much more forgiving.
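The fallback itself is tiny. Here is a sketch with a stand-in do_search (the real call goes through memory_search); the "first two keywords" heuristic is just one simple way to simplify a query.

```shell
#!/usr/bin/env bash
# Fallback retrieval: if the full query returns nothing, retry once with a
# simplified query (first two keywords). do_search is a stand-in for the
# real memory_search call; override it to test or to swap backends.
do_search() { "${QMD_BIN:-qmd}" search "$@"; }   # hypothetical invocation

search_with_fallback() {
  local query="$1" result
  result="$(do_search "$query" 2>/dev/null || true)"
  if [ -z "$result" ]; then
    local simpler
    simpler="$(printf '%s\n' "$query" | awk '{ print $1, $2 }')"
    result="$(do_search "$simpler" 2>/dev/null || true)"
  fi
  printf '%s\n' "$result"
}
```

Swallowing the first failure (|| true) matters: the whole point is that an empty or failed search should trigger the retry, not abort the turn.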
If you’re setting this up yourself, here’s the order I’d recommend:
1) Confirm indexing works (list indexed docs)
2) Confirm you can get a known doc by ID
3) Confirm search returns expected snippets
4) Confirm GPU backend in an interactive terminal
5) Only then switch the assistant to use the GPU command by default
6) Add “retry with simpler keywords” behavior for better UX
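The first three steps lend themselves to a tiny smoke script. Every subcommand name here (list, get, search) is an assumption about the CLI, and "some-doc-id" and "known phrase" are placeholders for a document you know is indexed; swap in whatever QMD actually exposes.

```shell
#!/usr/bin/env bash
# Smoke test for steps 1-3 of the checklist. Subcommand names (list, get,
# search) are assumptions about the CLI; "some-doc-id" and "known phrase"
# are placeholders for content you know is indexed.
QMD_BIN="${QMD_BIN:-qmd}"

smoke_test() {
  "$QMD_BIN" list                   # 1) indexing works: list indexed docs
  "$QMD_BIN" get some-doc-id        # 2) fetch a known doc by ID
  "$QMD_BIN" search "known phrase"  # 3) search returns expected snippets
}
```

Running something like this before touching GPU config keeps the two failure domains (indexing vs. inference runtime) from blurring together.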
Once it’s dialed in, local memory search feels like a superpower.
The biggest surprise wasn’t the GPU work—it was the number of small “systems integration” papercuts: non-interactive shells, OS privacy prompts, and retrieval quirks.
But after a little wiring, it’s solid.