I wanted a setup where my assistant could search my notes locally—fast, private, and without shipping my personal text to a third-party embedding API. The goal was simple: fast, private search over my own notes, with nothing leaving the machine.
This post is a practical write-up of what worked, what broke, and the specific gotchas I hit along the way.
The setup has three layers:
1) Assistant (OpenClaw)
2) Memory backend (QMD)
3) Local inference runtime (node-llama-cpp / llama.cpp)
Key idea: the assistant does not load your entire memory into the prompt every message. Instead it does retrieval on demand and injects only the top matching snippets into the model context.
The first mental shift was realizing that “memory” is usually a search problem, not a “stuff every file into context” problem.
In practice, that means the assistant answers each message by calling memory_search and injecting only the top results. This keeps latency and token usage under control.
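That flow can be sketched in shell. Everything here is illustrative: the "qmd search" invocation is an assumption about the CLI, not a documented interface, and QMD_BIN is a hypothetical override so the sketch can be exercised without QMD installed.

```shell
#!/usr/bin/env bash
# Sketch of retrieval-on-demand: fetch only the top-matching snippets and
# splice them into the prompt. "qmd search" is an assumed invocation, not
# QMD's documented CLI; QMD_BIN is a hypothetical override for testing.
QMD_BIN="${QMD_BIN:-qmd}"

build_context() {
  local query="$1"
  local snippets
  snippets="$("$QMD_BIN" search "$query")"   # hypothetical invocation
  printf 'Relevant notes:\n%s\n\nUser message: %s\n' "$snippets" "$query"
}
```

The point is the shape, not the flags: the model only ever sees the handful of snippets the search returned, never the whole corpus.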
I configured OpenClaw to use QMD as the memory backend. The benefits were exactly the ones I was after: everything stays local and private, and retrieval is fast.
I also leaned into a clean separation between MEMORY.md and a memory/ folder. That distinction matters because retrieval only works on what you actually index.
On macOS, the goal was GPU acceleration via Metal.
This is where things got spicy.
The first time you run commands that touch the model runtime, you can trigger a long build of native components (CMake + C++ compilation). This is normal, and it can look like it’s “stuck” when it’s actually building.
Something I didn’t expect: running certain commands in a non-interactive environment (no TTY, piped output) can behave differently than running them directly in Terminal.
I saw a few failure modes:

- SIGKILL near 99% during builds (often resource pressure)
- broken pipe errors (EPIPE) when output was piped

My takeaway: do first-run builds and GPU validation in a real Terminal first.
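A cheap way to catch this class of problem early is to check for a TTY before kicking off anything long-running. This small guard just warns when it detects a headless invocation:

```shell
#!/usr/bin/env bash
# Warn when stdin/stdout are not attached to a terminal, since first-run
# native builds and macOS permission prompts can silently stall there.
warn_if_headless() {
  if [ -t 0 ] && [ -t 1 ]; then
    echo "interactive terminal detected"
  else
    echo "warning: no TTY; long builds and permission prompts may look hung"
  fi
}
```

Dropping something like this at the top of any script the assistant runs makes the "it's not stuck, it's headless" cases much easier to diagnose.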
One particularly sneaky issue: macOS may ask for permission (e.g., a Photos/Media prompt) when a library is accessed. If your command is running headless, it may look like a hang.
When I approved the prompt, the status output immediately reported success. Once that happened, I switched the assistant to use the “Metal” configuration by default.
I ended up using a wrapper script so the assistant always invokes QMD with the right environment.
A wrapper is useful for pinning the runtime environment in one place (for example, selecting the metal backend). This keeps the assistant’s config simple: it just calls one command.
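A minimal version of that wrapper, reduced to a function so it is easy to test, might look like this. The variable name QMD_GPU_BACKEND and the qmd CLI shape are assumptions, not documented QMD settings; substitute whatever your runtime actually reads.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper (e.g. saved as ~/bin/qmd-metal), reduced to a function.
# QMD_GPU_BACKEND is an assumed variable name, not a documented QMD setting.
# A standalone wrapper script would instead end with: exec qmd "$@"
qmd_metal() {
  QMD_GPU_BACKEND=metal "${QMD_BIN:-qmd}" "$@"
}
```

In the real script you would use exec for the final call so the wrapper does not linger as a parent process; the function form here is just easier to verify in isolation.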
Even with everything “working,” I noticed retrieval could be brittle. For example, a long, specific query could come back empty even when a shorter keyword version of the same question found the right note.
So I added an easy improvement at the assistant layer: if memory_search returns empty, automatically retry with simplified keywords (1–2 attempts). This made the system feel much more forgiving.
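The fallback itself is tiny. Here is a sketch with a stand-in do_search (the real call goes through memory_search); the "first two keywords" heuristic is just one simple way to simplify a query.

```shell
#!/usr/bin/env bash
# Fallback retrieval: if the full query returns nothing, retry once with a
# simplified query (first two keywords). do_search is a stand-in for the
# real memory_search call; override it to test or to swap backends.
do_search() { "${QMD_BIN:-qmd}" search "$@"; }   # hypothetical invocation

search_with_fallback() {
  local query="$1" result
  result="$(do_search "$query" 2>/dev/null || true)"
  if [ -z "$result" ]; then
    local simpler
    simpler="$(printf '%s\n' "$query" | awk '{ print $1, $2 }')"
    result="$(do_search "$simpler" 2>/dev/null || true)"
  fi
  printf '%s\n' "$result"
}
```

Swallowing the first failure (|| true) matters: the whole point is that an empty or failed search should trigger the retry, not abort the turn.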
If you’re setting this up yourself, here’s the order I’d recommend:
1) Confirm indexing works (list indexed docs)
2) Confirm you can get a known doc by ID
3) Confirm search returns expected snippets
4) Confirm GPU backend in an interactive terminal
5) Only then switch the assistant to use the GPU command by default
6) Add “retry with simpler keywords” behavior for better UX
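The first three steps lend themselves to a tiny smoke script. Every subcommand name here (list, get, search) is an assumption about the CLI, and "some-doc-id" and "known phrase" are placeholders for a document you know is indexed; swap in whatever QMD actually exposes.

```shell
#!/usr/bin/env bash
# Smoke test for steps 1-3 of the checklist. Subcommand names (list, get,
# search) are assumptions about the CLI; "some-doc-id" and "known phrase"
# are placeholders for content you know is indexed.
QMD_BIN="${QMD_BIN:-qmd}"

smoke_test() {
  "$QMD_BIN" list                   # 1) indexing works: list indexed docs
  "$QMD_BIN" get some-doc-id        # 2) fetch a known doc by ID
  "$QMD_BIN" search "known phrase"  # 3) search returns expected snippets
}
```

Running something like this before touching GPU config keeps the two failure domains (indexing vs. inference runtime) from blurring together.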
Once it’s dialed in, local memory search feels like a superpower.
The biggest surprise wasn’t the GPU work—it was the number of small “systems integration” papercuts: non-interactive shells, OS privacy prompts, and retrieval quirks.
But after a little wiring, it’s solid.