Limore flagged this project after it hit 5,400 GitHub stars in 24 hours. I asked SHUR IQ to run a full evaluation against our memory architecture. Here's what we found.
Not recommended as a SHUR IQ memory component. The headline benchmark claims (96.6% LongMemEval, "first perfect score") are based on a category error—measuring retrieval recall instead of end-to-end question answering. The 100% "perfect score" was achieved by inspecting failing test cases and hardcoding fixes. Independent analyses confirm the claims-to-code gap is significant. However, two architectural ideas are worth absorbing into our own system: the L0–L3 token budget layering and the spatial metaphor for organizing memory by person/project.
MemPalace claims the "highest LongMemEval score ever published." Five specific problems undermine this claim:
LongMemEval tests retrieve + generate + judge. MemPalace skips generation and judging entirely—it only checks whether the right session ID appears in ChromaDB's top-5 results. That's a fundamentally different (easier) task than what other systems are scored on.
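The difference between the two tasks is concrete. A minimal sketch, using hypothetical `retrieve`/`generate`/`judge` stand-ins (not MemPalace APIs), of what MemPalace checks versus what LongMemEval actually scores:

```python
# Hypothetical sketch: retrieval recall vs. end-to-end QA scoring.
# retrieve(), generate(), and judge() are stand-ins, not MemPalace code.

def recall_at_k(question, gold_session_id, retrieve, k=5):
    """What MemPalace measures: is the right session in the top-k hits?"""
    hits = retrieve(question, top_k=k)
    return gold_session_id in hits

def end_to_end(question, gold_answer, retrieve, generate, judge, k=5):
    """What LongMemEval scores: retrieve, generate an answer, judge it."""
    context = retrieve(question, top_k=k)
    answer = generate(question, context)   # LLM call: skipped by MemPalace
    return judge(answer, gold_answer)      # LLM judge: also skipped

# A system can pass the first check and still fail the second:
retrieve = lambda q, top_k: ["s42", "s7", "s13"]   # right session retrieved
generate = lambda q, ctx: "I don't know"           # ...but the answer is wrong
judge = lambda ans, gold: ans.strip() == gold.strip()

print(recall_at_k("q", "s42", retrieve))                     # True
print(end_to_end("q", "Rachel", retrieve, generate, judge))  # False
```

Passing the recall check while failing the end-to-end check is exactly the gap the headline number hides.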
The "perfect 100%" score was achieved by inspecting three specific failing questions, then hardcoding targeted fixes: a quoted-phrase boost, a person-name boost for "Rachel," and pattern matching for high school reunion references. Their own docs admit this openly.
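What that kind of benchmark overfitting looks like in a scoring function can be sketched; the rule bodies and weights below are invented for illustration, not MemPalace's actual code:

```python
# Hypothetical illustration of benchmark overfitting (not MemPalace's code):
# each rule exists only because one known failing test question needed it.

def score(query: str, doc: str, base: float) -> float:
    s = base
    if '"' in query:                              # fix for failing question #1
        s += 2.0                                  # quoted-phrase boost
    if "Rachel" in query and "Rachel" in doc:     # fix #2
        s += 1.5                                  # boost one person's name
    if "reunion" in query and "high school" in doc:  # fix #3
        s += 1.0                                  # pattern-match one topic
    return s
```

Rules like these raise the score on the three known failures while generalizing to nothing else, which is why "100%" here means "100% on the questions we patched for."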
MemPalace claims "100% on LoCoMo" using top_k=50, but conversations only contain 19–32 sessions. When retrieval exceeds the candidate pool, you're just dumping everything into a language model. As their own documentation states: "the embedding retrieval step is bypassed entirely."
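The degenerate case is easy to demonstrate with a stand-in ranking function:

```python
# When top_k exceeds the candidate pool, "retrieval" returns everything,
# so the retrieval step cannot fail. sorted() stands in for similarity ranking.
def retrieve(sessions: list[str], query: str, top_k: int) -> list[str]:
    ranked = sorted(sessions)          # stand-in for embedding similarity
    return ranked[:top_k]

sessions = [f"session-{i}" for i in range(32)]   # LoCoMo: 19-32 sessions
hits = retrieve(sessions, "anything", top_k=50)
print(len(hits) == len(sessions))                # True: the whole conversation
```

With `top_k=50` against at most 32 sessions, any retriever, including a random one, achieves "100% retrieval."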
AAAK is advertised as "30x lossless compression." Yet with AAAK enabled, performance drops 12.4 percentage points (84.2% vs. 96.6%) and text is truncated at 55 characters. Lossless compression, by definition, cannot produce measurable quality degradation.
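"Lossless" is a testable property: the round trip must reproduce the input byte for byte. A sketch contrasting real lossless compression (zlib) with 55-character truncation; the truncating function is an illustrative stand-in, not AAAK's code:

```python
import zlib

text = "a memory worth keeping " * 10            # 230 characters

# Actual lossless compression: the round trip is the identity function.
assert zlib.decompress(zlib.compress(text.encode())).decode() == text

# A "compressor" that truncates at 55 characters cannot be lossless:
def truncating_compress(s: str) -> str:          # stand-in, not AAAK's code
    return s[:55]

assert truncating_compress(text) != text         # information is gone for good
```

Any scheme that fails the round-trip test is lossy, whatever the README calls it, and the 12.4pp benchmark drop is the downstream symptom.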
"Contradiction detection catches wrong names, wrong pronouns, wrong ages before you ever see them." The codebase contains zero implementation of this. Deduplication only blocks identical triples, allowing conflicting facts to accumulate indefinitely.
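Exact-match deduplication can be sketched in a few lines to show why conflicting facts accumulate; the store and facts below are invented for illustration:

```python
# Exact-triple deduplication: only byte-identical facts are blocked,
# so directly conflicting facts coexist indefinitely.
store: set[tuple[str, str, str]] = set()

def add_fact(subj: str, pred: str, obj: str) -> bool:
    triple = (subj, pred, obj)
    if triple in store:                 # the only "contradiction" check
        return False
    store.add(triple)
    return True

add_fact("Rachel", "age", "29")
add_fact("Rachel", "age", "29")         # duplicate: blocked
add_fact("Rachel", "age", "34")         # contradiction: sails right in
print(sorted(o for s, p, o in store if p == "age"))   # ['29', '34']
```

Real contradiction detection would need to compare new facts against existing facts for the same subject and predicate, which is precisely what this write path never does.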
For context, the legitimate state of the art on LongMemEval is Mastra's Observational Memory at 94.87% with gpt-5-mini, a proper end-to-end evaluation. The Penfield Labs analysis puts it directly: the variable wasn't engineering quality; it was celebrity attribution.
Beneath the hype, there are real ideas. Here's an honest assessment:
| Aspect | Assessment |
|---|---|
| Spatial Metaphor | Genuinely nice UX concept for organizing memories by person/project (wings, halls, rooms). Novel naming, but the underlying technique is standard metadata filtering. |
| Token Budget Design | L0–L3 tiered loading (~170 tokens at startup) is sound architecture. Similar to Letta's core/archival split, but the framing is clean. |
| Zero-LLM Write Path | Fully offline, deterministic extraction without API calls. Good for cost. A real advantage for local-first use cases. |
| Verbatim Preservation | Keeping original text in "drawers" while summaries live in "closets" is a healthy separation of source from interpretation. |
| Maturity | 7 commits. 4 test files for 21 modules. Beta status is generous. This is a proof-of-concept, not a production system. |
Here's how MemPalace compares with SHUR IQ's current stack, capability by capability:

| Capability | SHUR IQ (Current) | MemPalace |
|---|---|---|
| Structural Intelligence | InfraNodus: cluster gaps, bridges, modularity scoring | Flat ChromaDB vectors + SQLite triples |
| Tiered Memory | Letta (core/archival/blocks) + Claude memory (fast cache) | L0–L3 token budget tiers (comparable concept) |
| Compression | InfraNodus entity extraction, graph-based condensation | AAAK (lossy despite "lossless" claim, 12.4pp quality loss) |
| Contradiction Detection | InfraNodus gap analysis finds structural contradictions | Advertised but not implemented |
| Knowledge Graph | InfraNodus (1,600+ triples in SBPI), ontology layers | SQLite triples, no multi-hop traversal |
| Modularity | 5 systems, each owning a cognitive type, graceful degradation | Monolithic ChromaDB + SQLite |
| Cost | Mixed (InfraNodus API, embeddings) | Free, fully local, no API keys needed |
| MCP Integration | 3 operational servers (InfraNodus, DEVONthink, Mem0) | 19-tool MCP server (quantity, unvalidated) |
Memory in AI is evolving faster than any other layer of the stack. New approaches appear monthly: Letta, Mem0, Observational Memory, MemPalace, MuninnDB, Neotoma. We decided early never to hitch our wagon to any single system, and instead to design a modular architecture where memory systems can be swapped, layered, and evaluated independently based on context.
SHUR IQ's memory architecture uses five complementary systems, each owning a specific cognitive role. No system is mandatory except vault files. All others enhance but never block:
- **Vault files:** Canonical source of truth. Session records, specs, deliverables, daily notes. Queryable via Bases dashboards and search.
- **InfraNodus:** Entity relationships, structural gaps, bridge analysis, cluster detection. Turns unstructured knowledge into navigable graphs.
- **Letta:** Confidence-scored patterns, intent-alignment tracking, behavioral guidance. An evolving observer agent with structured memory blocks.
- **Claude memory:** Fast cold-start context. Infrastructure state, project registry, people index. Optimized for session-resumption speed.
- **Mem0:** Vector-based semantic memory. Cross-surface recall across Claude Code, Desktop, and future interfaces. Transitioning to Letta archival.
- **Git provenance:** Intent-in-action via structured commit trailers. Session ID, declared intent, surface, and stakeholder recorded per commit.
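The commit-trailer convention can be sketched as a small parser. The trailer key names below (`Session-ID`, `Intent`, etc.) are assumptions for illustration; the source only says session ID, intent, surface, and stakeholder are recorded per commit:

```python
# Parsing git-style trailers from a commit message.
# Trailer key names here are illustrative, not a settled SHUR IQ schema.
def parse_trailers(message: str) -> dict[str, str]:
    trailers = {}
    for line in message.strip().splitlines():
        if ": " in line:
            key, value = line.split(": ", 1)
            if key and " " not in key:      # trailer keys contain no spaces
                trailers[key] = value
    return trailers

msg = """Fix retrieval scoring

Session-ID: 2026-02-11-a
Intent: improve-recall
Surface: claude-code
Stakeholder: internal
"""
print(parse_trailers(msg)["Intent"])        # improve-recall
```

Because trailers travel with the commit itself, this provenance layer survives even if every other memory system is offline.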
This architecture follows three principles: reinforcement over redundancy (same event, different representation per system), graceful degradation (only vault files are mandatory), and explicit conflict resolution (vault > Letta > InfraNodus > Claude > Git).
The modular approach means we can evaluate every new memory system that appears—MemPalace, Mastra's Observational Memory, MuninnDB, Neotoma—without needing to rearchitect. If something proves itself, it slots into the stack. If it doesn't, we lose nothing. That's the competitive advantage of designing for optionality rather than committing to a single vendor's approach.
Despite the benchmark issues, two design patterns from MemPalace are worth absorbing:
MemPalace loads ~170 tokens at startup (identity + critical facts), with deeper layers activated on demand. We already do something similar with Letta's core/archival split and Claude's layered memory files, but formalizing explicit token budgets per layer would sharpen our cold-start efficiency. Concrete next step: define target token counts for each SHUR IQ memory tier.
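As a starting point, explicit per-tier budgets might look like the sketch below. The tier names L0-L3 come from MemPalace; the budget numbers beyond ~170 for L0, the token estimator, and the sample items are placeholder assumptions to tune:

```python
# Sketch of explicit per-tier token budgets. Budget values are placeholders.
TIER_BUDGETS = {"L0": 170, "L1": 600, "L2": 2000, "L3": 8000}

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)           # rough 4-chars-per-token heuristic

def load_tiers(tiers: dict[str, list[str]], max_tier: str) -> list[str]:
    """Load memory items tier by tier, capping each tier at its budget."""
    loaded = []
    for tier in ["L0", "L1", "L2", "L3"]:
        spent = 0
        for item in tiers.get(tier, []):
            cost = estimate_tokens(item)
            if spent + cost > TIER_BUDGETS[tier]:
                break                       # budget exhausted for this tier
            loaded.append(item)
            spent += cost
        if tier == max_tier:                # cold start stops at L0
            break
    return loaded

tiers = {"L0": ["identity: SHUR IQ ops agent", "critical: vault is canonical"],
         "L1": ["project registry summary"]}
print(load_tiers(tiers, max_tier="L0"))     # only the startup tier loads
```

The point of making budgets explicit is that cold-start cost becomes a number we can measure and enforce, rather than an emergent property of whatever files happen to load.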
Organizing memories as "wings" (people/projects), "halls" (memory types), and "rooms" (topics) is a strong user-facing metaphor. The underlying mechanism is standard metadata filtering, but the naming makes memory navigable and intuitive. When we build SHUR IQ's memory dashboard, this kind of spatial framing could improve how clients and team members understand what the system remembers and why.
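The mechanism reduces to metadata filtering; a minimal sketch with invented wing/hall/room values:

```python
# The "palace" metaphor as plain metadata filtering.
# Field values here are made up for illustration.
memories = [
    {"text": "Prefers async standups", "wing": "acme-corp",
     "hall": "preferences", "room": "meetings"},
    {"text": "Q3 launch slipped", "wing": "acme-corp",
     "hall": "events", "room": "launches"},
    {"text": "Allergic to cilantro", "wing": "personal",
     "hall": "facts", "room": "food"},
]

def navigate(wing=None, hall=None, room=None):
    """'Walking' the palace is just filtering on metadata fields."""
    return [m["text"] for m in memories
            if (wing is None or m["wing"] == wing)
            and (hall is None or m["hall"] == hall)
            and (room is None or m["room"] == room)]

print(navigate(wing="acme-corp", hall="preferences"))  # ['Prefers async standups']
```

Nothing here is novel retrieval technology, but as a dashboard vocabulary ("show me the acme-corp wing, preferences hall") it makes scoped recall legible to non-engineers.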
Don't adopt MemPalace as a SHUR IQ component. The benchmark methodology is unsound, the codebase is pre-alpha (7 commits, 4 test files), and several advertised features don't exist in code. Our existing modular stack—Letta, InfraNodus, Mem0, Claude memory, git provenance—already covers every capability MemPalace offers, with better structural intelligence and production maturity.
Do formalize two patterns from this evaluation: explicit token budgets per memory tier, and spatial metaphor UX for the future memory dashboard. These are lightweight wins that cost nothing to integrate into our design thinking.
The real lesson here: a good narrative plus name recognition drove 5,400 GitHub stars in 24 hours. The engineering didn't justify the attention. But it's a reminder that in a fast-moving landscape, we should keep evaluating everything that crosses our radar—sometimes the signal is in the design, not the benchmarks.