SHUR IQ — Memory System Evaluation
April 2026
Exploratory Evaluation

MemPalace: Celebrity, Benchmarks, and What's Actually There

Limore flagged this project after it hit 5,400 GitHub stars in 24 hours. I asked SHUR IQ to run a full evaluation against our memory architecture. Here's what we found.

01

Executive Summary

Not recommended as a SHUR IQ memory component. The headline benchmark claims (96.6% LongMemEval, "first perfect score") are based on a category error—measuring retrieval recall instead of end-to-end question answering. The 100% "perfect score" was achieved by inspecting failing test cases and hardcoding fixes. Independent analyses confirm the claims-to-code gap is significant. However, two architectural ideas are worth absorbing into our own system: the L0–L3 token budget layering and the spatial metaphor for organizing memory by person/project.

02

The Benchmark Problem

MemPalace claims the "highest LongMemEval score ever published." Five specific problems undermine this claim:

Issue 1
Retrieval-Only Metric Passed Off as End-to-End QA

LongMemEval tests retrieve + generate + judge. MemPalace skips generation and judging entirely—it only checks whether the right session ID appears in ChromaDB's top-5 results. That's a fundamentally different (easier) task than what other systems are scored on.
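The difference between the two scoring regimes can be made concrete. This is a hypothetical sketch, not MemPalace's or LongMemEval's actual code; all function names are illustrative:

```python
# Retrieval-only scoring: credit if the gold session ID is in the top-k hits.
def retrieval_recall(retrieved_ids, gold_id, k=5):
    return gold_id in retrieved_ids[:k]

# End-to-end QA scoring (what LongMemEval actually measures): retrieve,
# generate an answer from the retrieved context, then judge it against the
# gold answer. The three stages are passed in as callables here.
def end_to_end_score(question, gold_answer, retrieve, generate, judge, k=5):
    context = retrieve(question)[:k]       # step 1: retrieval
    answer = generate(question, context)   # step 2: answer generation
    return judge(answer, gold_answer)      # step 3: judge comparison
```

Retrieval recall can be perfect while end-to-end accuracy is not: having the right session in context does not guarantee the model answers correctly, which is exactly why the two numbers are not comparable.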

Issue 2
Teaching to the Test

The "perfect 100%" score was achieved by inspecting three specific failing questions, then hardcoding targeted fixes: a quoted-phrase boost, a person-name boost for "Rachel," and pattern matching for high school reunion references. Their own docs admit this openly.
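What "teaching to the test" looks like in scoring code, sketched with hypothetical weights and patterns modeled on the fixes their docs describe (this is not MemPalace's source):

```python
import re

def boosted_score(base_score, query, doc):
    score = base_score
    # Quoted-phrase boost: reward documents containing an exact quote.
    for phrase in re.findall(r'"([^"]+)"', query):
        if phrase in doc:
            score += 0.5
    # Person-name boost hardwired to a single benchmark entity.
    if "Rachel" in query and "Rachel" in doc:
        score += 0.3
    # Pattern match targeting one known failing question.
    if "high school reunion" in query and "reunion" in doc:
        score += 0.3
    return score
```

Rules like these lift exactly the failing benchmark cases and generalize to nothing else, which is why the resulting "100%" tells you nothing about real-world performance.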

Issue 3
LoCoMo Bypass

MemPalace claims "100% on LoCoMo" using top_k=50, but LoCoMo conversations contain only 19–32 sessions. When retrieval exceeds the candidate pool, you're just dumping everything into the language model. As their own documentation states: "the embedding retrieval step is bypassed entirely."
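A minimal sketch of why this bypasses retrieval (illustrative code, not theirs): once k meets or exceeds the candidate count, the ranking step filters nothing out.

```python
# Rank sessions by score and keep the top k.
def top_k_sessions(sessions, score_of, k=50):
    return sorted(sessions, key=score_of, reverse=True)[:k]

sessions = [f"session-{i}" for i in range(32)]   # a LoCoMo-sized pool
hits = top_k_sessions(sessions, score_of=lambda s: 0.0, k=50)
# All 32 sessions come back regardless of score: the "retrieval" step
# handed the entire conversation to the LLM.
```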

Issue 4
Lossy "Lossless" Compression

AAAK is advertised as "30x lossless compression." Performance drops 12.4 percentage points with AAAK enabled (84.2% vs 96.6%). Text gets truncated at 55 characters. Lossless compression by definition cannot produce measured quality degradation.
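A lossless codec must round-trip exactly; truncation cannot. This is the property check AAAK would need to pass. The function names below are illustrative, not AAAK's API:

```python
import zlib

# Lossless means decompress(compress(x)) == x for every input.
def is_lossless(compress, decompress, samples):
    return all(decompress(compress(s)) == s for s in samples)

samples = ["x" * 200, "a short note", "unicode: café"]

# zlib round-trips byte-for-byte, so it passes the check.
zlib_ok = is_lossless(
    lambda s: zlib.compress(s.encode("utf-8")),
    lambda b: zlib.decompress(b).decode("utf-8"),
    samples,
)

# A 55-character truncation "compressor" fails on anything longer.
truncate_ok = is_lossless(lambda s: s[:55], lambda s: s, samples)
```

Any scheme that measurably degrades downstream quality has, by definition, discarded information and forfeited the "lossless" label.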

Issue 5
Advertised Features Don't Exist in Code

"Contradiction detection catches wrong names, wrong pronouns, wrong ages before you ever see them." The codebase contains zero implementation of this. Deduplication only blocks identical triples, allowing conflicting facts to accumulate indefinitely.
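Why exact-match deduplication cannot catch contradictions, in a minimal sketch (illustrative, not their code): only byte-identical triples are blocked, so mutually exclusive facts accumulate side by side.

```python
# Dedup that only rejects exact duplicates.
def add_triple(store, triple):
    if triple not in store:   # the only check performed
        store.add(triple)

store = set()
add_triple(store, ("user", "age", "34"))
add_triple(store, ("user", "age", "34"))   # exact duplicate: blocked
add_triple(store, ("user", "age", "29"))   # contradiction: accepted
# The store now holds two mutually exclusive ages and raises no conflict.
```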

For context, the legitimate state-of-the-art on LongMemEval is Mastra's Observational Memory at 94.87% with gpt-5-mini—a proper end-to-end evaluation. The Penfield Labs analysis puts it directly: the variable wasn't engineering quality, it was celebrity attribution.

03

What's Actually There

Beneath the hype, there are real ideas. Here's an honest assessment:

Spatial Metaphor: Genuinely nice UX concept for organizing memories by person/project (wings, halls, rooms). Novel naming, but the underlying technique is standard metadata filtering.
Token Budget Design: L0–L3 tiered loading (~170 tokens at startup) is sound architecture. Similar to Letta's core/archival split, but the framing is clean.
Zero-LLM Write Path: Fully offline, deterministic extraction without API calls. Good for cost, and a real advantage for local-first use cases.
Verbatim Preservation: Keeping original text in "drawers" while summaries live in "closets" is a healthy separation of source from interpretation.
Maturity: 7 commits and 4 test files for 21 modules. "Beta" status is generous; this is a proof-of-concept, not a production system.
04

How It Compares to SHUR IQ's Memory Stack

Capability: SHUR IQ (Current) vs. MemPalace

Structural Intelligence: InfraNodus (cluster gaps, bridges, modularity scoring) vs. flat ChromaDB vectors + SQLite triples.
Tiered Memory: Letta (core/archival/blocks) + Claude memory (fast cache) vs. L0–L3 token budget tiers (a comparable concept).
Compression: InfraNodus entity extraction and graph-based condensation vs. AAAK (lossy despite the "lossless" claim; 12.4pp quality loss).
Contradiction Detection: InfraNodus gap analysis finds structural contradictions vs. advertised but not implemented.
Knowledge Graph: InfraNodus (1,600+ triples in SBPI) with ontology layers vs. SQLite triples with no multi-hop traversal.
Modularity: six systems, each owning a cognitive type, with graceful degradation vs. a monolithic ChromaDB + SQLite store.
Cost: mixed (InfraNodus API, embeddings) vs. free, fully local, no API keys needed.
MCP Integration: 3 operational servers (InfraNodus, DEVONthink, Mem0) vs. a 19-tool MCP server (quantity, but unvalidated).
05

Our Modular Memory Philosophy

Memory in AI is evolving faster than any other layer of the stack. New approaches appear monthly: Letta, Mem0, Observational Memory, MemPalace, MuninnDB, Neotoma. We decided early never to hitch our wagon to a single system, but instead to design a modular architecture in which memory systems can be swapped, layered, and evaluated independently, based on context.

SHUR IQ's memory architecture uses six complementary systems, each owning a specific cognitive role. No system is mandatory except vault files; all others enhance but never block:

Facts
Vault Files

Canonical source of truth. Session records, specs, deliverables, daily notes. Queryable via Bases dashboards and search.

Structure
InfraNodus

Entity relationships, structural gaps, bridge analysis, cluster detection. Turns unstructured knowledge into navigable graphs.

Observation
Letta Witness

Confidence-scored patterns, intent alignment tracking, behavioral guidance. Evolving observer agent with structured memory blocks.

State Cache
Claude Memory

Fast cold-start context. Infrastructure state, project registry, people index. Optimized for session resumption speed.

Semantic Recall
Mem0 / OpenMemory

Vector-based semantic memory. Cross-surface recall across Claude Code, Desktop, and future interfaces. Transitioning to Letta archival.

Provenance
Git History

Intent-in-action via structured commit trailers. Session ID, declared intent, surface, and stakeholder recorded per commit.

This architecture follows three principles: reinforcement over redundancy (same event, different representation per system), graceful degradation (only vault files are mandatory), and explicit conflict resolution (vault > Letta > InfraNodus > Claude > Git).
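The precedence rule can be sketched as code. This is a minimal illustration of the resolution order, assuming a hypothetical `resolve()` helper, not an existing SHUR IQ API:

```python
# Highest-priority system with an answer wins; others are ignored.
PRECEDENCE = ["vault", "letta", "infranodus", "claude", "git"]

def resolve(answers):
    """answers maps system name -> value; absent systems have no opinion."""
    for system in PRECEDENCE:
        if system in answers:
            return system, answers[system]
    return None, None

# Example: Letta and Claude disagree on a project's status.
# Letta outranks Claude, so its answer wins.
winner, value = resolve({"claude": "paused", "letta": "active"})
```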

The modular approach means we can evaluate every new memory system that appears—MemPalace, Mastra's Observational Memory, MuninnDB, Neotoma—without needing to rearchitect. If something proves itself, it slots into the stack. If it doesn't, we lose nothing. That's the competitive advantage of designing for optionality rather than committing to a single vendor's approach.

06

Ideas For Further Consideration

Despite the benchmark issues, two design patterns from MemPalace are worth absorbing:

Pattern 1
L0–L3 Token Budget Layering

MemPalace loads ~170 tokens at startup (identity + critical facts), with deeper layers activated on demand. We already do something similar with Letta's core/archival split and Claude's layered memory files, but formalizing explicit token budgets per layer would sharpen our cold-start efficiency. Concrete next step: define target token counts for each SHUR IQ memory tier.
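A sketch of what explicit per-tier budgets could look like, assuming hypothetical tier names and a crude 4-characters-per-token estimate (not a real tokenizer, and not settled SHUR IQ numbers):

```python
# Token budget per tier; only L0 is loaded at cold start.
TIER_BUDGETS = {
    "L0_identity": 170,    # identity + critical facts, always loaded
    "L1_state":    600,    # infrastructure state, project registry
    "L2_people":  1200,    # person/project context, loaded on reference
    "L3_archive": 4000,    # deep recall, loaded only on explicit request
}

def estimate_tokens(text):
    return max(1, len(text) // 4)   # rough heuristic: ~4 chars per token

def fits_budget(tier, text):
    return estimate_tokens(text) <= TIER_BUDGETS[tier]
```

Enforcing a check like `fits_budget` at write time would keep each tier's cold-start cost predictable instead of discovering bloat at load time.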

Pattern 2
Spatial Metaphor for Memory UX

Organizing memories as "wings" (people/projects), "halls" (memory types), and "rooms" (topics) is a strong user-facing metaphor. The underlying mechanism is standard metadata filtering, but the naming makes memory navigable and intuitive. When we build SHUR IQ's memory dashboard, this kind of spatial framing could improve how clients and team members understand what the system remembers and why.
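The wings/halls/rooms metaphor reduces to plain metadata filtering, as this minimal sketch shows (field names and records are hypothetical):

```python
memories = [
    {"text": "prefers async standups", "wing": "acme", "hall": "people",   "room": "workflow"},
    {"text": "Q3 launch slipped",      "wing": "acme", "hall": "projects", "room": "timeline"},
    {"text": "uses she/her pronouns",  "wing": "jane", "hall": "people",   "room": "identity"},
]

def navigate(memories, **filters):
    """Walk the 'palace' by filtering on wing/hall/room metadata."""
    return [m for m in memories
            if all(m.get(k) == v for k, v in filters.items())]

acme_wing = navigate(memories, wing="acme")               # everything about acme
jane_people = navigate(memories, wing="jane", hall="people")
```

The mechanism is unremarkable; the value is entirely in the naming, which is exactly why it transfers cleanly to a dashboard without importing any of MemPalace's machinery.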

07

Verdict

Don't adopt MemPalace as a SHUR IQ component. The benchmark methodology is unsound, the codebase is pre-alpha (7 commits, 4 test files), and several advertised features don't exist in code. Our existing modular stack—Letta, InfraNodus, Mem0, Claude memory, git provenance—already covers every capability MemPalace offers, with better structural intelligence and production maturity.

Do formalize two patterns from this evaluation: explicit token budgets per memory tier, and spatial metaphor UX for the future memory dashboard. These are lightweight wins that cost nothing to integrate into our design thinking.

The real lesson here: a good narrative plus name recognition drove 5,400 GitHub stars in 24 hours. The engineering didn't justify the attention. But it's a reminder that in a fast-moving landscape, we should keep evaluating everything that crosses our radar—sometimes the signal is in the design, not the benchmarks.

08

Sources