I wanted a setup where my assistant could search my notes locally—fast, private, and without shipping my personal text to a third-party embedding API. The goal was simple: on-demand search over my own notes, running entirely on my machine.

This post is a practical write-up of what worked, what broke, and the specific gotchas I hit along the way.


The moving pieces

The setup has three layers:

1) Assistant (OpenClaw)

2) Memory backend (QMD)

3) Local inference runtime (node-llama-cpp / llama.cpp)

Key idea: the assistant does not load your entire memory into the prompt every message. Instead it does retrieval on demand and injects only the top matching snippets into the model context.
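That retrieve-then-inject loop can be sketched in a few lines. This is a toy keyword-overlap scorer, not QMD's actual ranking, and `Note`, `topSnippets`, and `buildPromptContext` are illustrative names:

```typescript
// Minimal sketch of retrieval-on-demand: score each note against the query,
// keep only the top-k matches, and build a small context block from them.
type Note = { id: string; text: string };

function topSnippets(query: string, notes: Note[], k: number): Note[] {
  // Toy relevance score: how many query terms appear in the note.
  const terms = query.toLowerCase().split(/\W+/).filter(Boolean);
  const score = (n: Note) =>
    terms.filter((t) => n.text.toLowerCase().includes(t)).length;
  return notes
    .map((n) => ({ n, s: score(n) }))
    .filter((x) => x.s > 0)
    .sort((a, b) => b.s - a.s)
    .slice(0, k)
    .map((x) => x.n);
}

function buildPromptContext(query: string, notes: Note[]): string {
  // Only the top matches reach the model -- never the whole memory store.
  return topSnippets(query, notes, 3)
    .map((h) => `[${h.id}] ${h.text}`)
    .join("\n");
}
```

The important property is the shape, not the scoring: the full note store never touches the prompt.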


Step 1: Make memory retrieval explicit (and lightweight)

The first mental shift was realizing that “memory” is usually a search problem, not a “stuff every file into context” problem.

In practice, the assistant issues a search query against the memory index, takes the top few matches, and injects only those snippets into the prompt.

This keeps latency and token usage under control.
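The token-usage side can be sketched as a simple budget check. The 4-characters-per-token estimate below is a rough heuristic, not a real tokenizer:

```typescript
// Rough token estimate -- good enough for capping injected context.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Keep snippets (in ranked order) until the next one would blow the budget.
function fitToBudget(snippets: string[], maxTokens: number): string[] {
  const kept: string[] = [];
  let used = 0;
  for (const s of snippets) {
    const cost = estimateTokens(s);
    if (used + cost > maxTokens) break;
    kept.push(s);
    used += cost;
  }
  return kept;
}
```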


Step 2: Use a local backend (QMD)

I configured OpenClaw to use QMD as the memory backend.

The main benefits: nothing leaves the machine, there is no dependency on a third-party embedding API, and search stays fast.

I also leaned into a clean separation between what merely lives on disk and what actually gets indexed for search.

That distinction matters because retrieval only works on what you actually index.
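One way to make that separation concrete is an explicit indexing predicate. The directory name and extension rule below are assumptions for illustration, not QMD defaults:

```typescript
// Hypothetical rule: only Markdown files under notes/ get indexed.
// Everything else still exists on disk but is invisible to retrieval.
function shouldIndex(path: string): boolean {
  const inNotesDir = path.startsWith("notes/");
  const isMarkdown = path.endsWith(".md");
  return inNotesDir && isMarkdown;
}
```

Whatever the real rule is in your setup, it pays to be able to state it in one sentence—otherwise "why didn't search find X?" debugging gets slow.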


Step 3: Get Metal GPU acceleration working

On macOS, that meant one thing: Metal.

This is where things got spicy.

The build is real

The first time you run commands that touch the model runtime, you can trigger a long build of native components (CMake + C++ compilation). This is normal, and it can look like it’s “stuck” when it’s actually building.

Non-interactive runs can be flaky

Something I didn’t expect: running certain commands in a non-interactive environment (no TTY, piped output) can behave differently than running them directly in Terminal.

I hit a few failure modes this way, mostly commands that appeared to hang or quietly behaved differently than they did interactively.

My takeaway: for first-run builds and GPU validation, do it in a real Terminal first.
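Node makes the interactive/non-interactive distinction easy to probe with the standard `process.stdout.isTTY` / `process.stdin.isTTY` flags (they are `undefined` when output is piped), so a small check can warn before a long headless build:

```typescript
// In a piped/headless run, isTTY is undefined on stdin/stdout.
function isInteractive(): boolean {
  return Boolean(process.stdout.isTTY && process.stdin.isTTY);
}

function buildAdvice(): string {
  return isInteractive()
    ? "interactive terminal: safe to run the first build here"
    : "no TTY: first-run builds and GPU checks may behave differently";
}
```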

macOS privacy prompts can block “status”

One particularly sneaky issue: macOS may ask for permission (e.g., a Photos/Media prompt) when a library is accessed. If your command is running headless, it may look like a hang.

When I approved the prompt, the status output immediately reported that the Metal backend was up and running.

Once that happened, I switched the assistant to use the “Metal” configuration by default.


Step 4: Wrap it so it’s stable

I ended up using a wrapper script so the assistant always invokes QMD with the right environment.

A wrapper is useful for pinning environment variables, selecting the right backend, and giving the assistant a single stable entry point.

This keeps the assistant’s config simple: it just calls one command.
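A sketch of that wrapper idea in Node, assuming a `qmd` binary on `PATH` and a hypothetical `QMD_GPU_BACKEND` variable (substitute whatever your build actually reads):

```typescript
import { spawnSync } from "node:child_process";

// Build the pinned environment the wrapper always uses.
// QMD_GPU_BACKEND is a placeholder name, not a real QMD variable.
function qmdEnv(): Record<string, string | undefined> {
  return { ...process.env, QMD_GPU_BACKEND: "metal" };
}

// The assistant only ever calls this one entry point.
function runQmd(args: string[]) {
  return spawnSync("qmd", args, { env: qmdEnv(), encoding: "utf8" });
}
```

The same idea works as a three-line shell script; the point is that the environment decisions live in exactly one place.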


Step 5: Retrieval quality is (sometimes) about query phrasing

Even with everything “working,” I noticed retrieval could be brittle.

For example, a question phrased as a full sentence could come back empty, while two or three plain keywords found the note immediately.

So I added an easy improvement at the assistant layer: if a search returns nothing, retry with simpler keywords.

This made the system feel much more forgiving.
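The fallback can be sketched like this. The stopword list and `toKeywords` helper are toy assumptions, and the search function is injected so the same logic works with any backend:

```typescript
type SearchFn = (query: string) => string[];

// Toy stopword list -- tune for your own phrasing habits.
const STOPWORDS = new Set(["how", "do", "i", "the", "a", "an", "my", "to", "what"]);

// Strip filler words so "How do I enable Metal?" becomes "enable metal".
function toKeywords(query: string): string {
  return query
    .toLowerCase()
    .split(/\W+/)
    .filter((w) => w && !STOPWORDS.has(w))
    .join(" ");
}

// Try the query as-is; if it comes back empty, retry with bare keywords.
function searchWithFallback(query: string, search: SearchFn): string[] {
  const first = search(query);
  if (first.length > 0) return first;
  const simplified = toKeywords(query);
  if (simplified && simplified !== query) return search(simplified);
  return first;
}
```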


Practical checklist

If you’re setting this up yourself, here’s the order I’d recommend:

1) Confirm indexing works (list indexed docs)

2) Confirm you can get a known doc by ID

3) Confirm search returns expected snippets

4) Confirm GPU backend in an interactive terminal

5) Only then switch the assistant to use the GPU command by default

6) Add “retry with simpler keywords” behavior for better UX


Closing thoughts

Once it’s dialed in, local memory search feels like a superpower: fast, private recall without shipping a single note to a third party.

The biggest surprise wasn’t the GPU work—it was the number of small “systems integration” papercuts: non-interactive shells, OS privacy prompts, and retrieval quirks.

But after a little wiring, it’s solid.