Challenges in Reproducing HippoRAG Results
Reproducing the HippoRAG results isn't just a technical hurdle; it's a growing pain point in modern RAG workflows. Recent experiments show accuracy near 42 and recall at 21, far below the 53 and 47 reported in the original paper. But the gap isn't just about model size or data: it's about hidden variables in reproducibility.
Here's what's actually happening:
- The core commit linked to the paper's experiments remains elusive: no official hash or branch is pinned to the results.
- The reported hyperparameters (llm_model_max_token_size = 8000, top_k = 4 across all parameters) are standard, yet tweaking them slightly (like adjusting token limits or sampling) often bridges the gap.
- The graph size discrepancy (22k vs. 35k+ nodes) reveals a deeper issue: the preprint likely used Llama-3-8B, not MultihopRAG's documented base model.
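One way to catch silent hyperparameter drift across runs is to fingerprint the full configuration and compare it against a reference. A minimal sketch, assuming a flat config dict (the two values quoted above are from the reported runs; the base-model key is a hypothetical placeholder for illustration):

```python
import hashlib
import json

def config_fingerprint(config: dict) -> str:
    """Return a stable hash of a run configuration.

    Serializing with sorted keys makes the hash independent of
    key insertion order, so two runs with identical settings
    always produce the same fingerprint.
    """
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Reported values; remaining keys are illustrative assumptions.
reference = {
    "llm_model_max_token_size": 8000,
    "top_k": 4,
    "base_model": "Llama-3-8B",  # assumption, see discussion above
}

print(config_fingerprint(reference)[:12])
```

Storing the fingerprint alongside each result makes a mismatched rerun immediately visible, without eyeballing every parameter.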
Beyond the numbers, there's a blind spot in how researchers treat context length and prompt engineering. A single tweak, like adding a context window hint or adjusting beam search, can transform performance. The real elephant in the room? Without transparent, versioned runs, reproducibility remains wishful thinking.
Don't assume defaults equal success. Check hyperparameters line by line, verify model identity, and question edge-case assumptions. When reproducing results, treat every variable like a link in a chain: each one essential to avoid misleading conclusions.
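That line-by-line check can be automated. A hedged sketch of a config diff that flags every mismatched, missing, or unexpected key (the keys below are illustrative, not HippoRAG's actual schema):

```python
def diff_configs(reference: dict, actual: dict) -> list[str]:
    """List human-readable mismatches between two flat configs."""
    problems = []
    for key in sorted(set(reference) | set(actual)):
        if key not in actual:
            problems.append(f"missing: {key}")
        elif key not in reference:
            problems.append(f"unexpected: {key}")
        elif reference[key] != actual[key]:
            problems.append(
                f"{key}: expected {reference[key]!r}, got {actual[key]!r}"
            )
    return problems

paper = {"llm_model_max_token_size": 8000, "top_k": 4}
mine = {"llm_model_max_token_size": 4096, "top_k": 4, "beam_size": 2}

for issue in diff_configs(paper, mine):
    print(issue)
```

Running this before a long evaluation surfaces exactly the kind of small mismatch (a halved token limit, an extra decoding knob) that quietly sinks a reproduction.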
Is your setup truly aligned with the original experiment? Small mismatches make a world of difference.