Challenges in Reproducing HippoRAG Results
Reproducing the HippoRAG results isn't just a technical hurdle; it's a growing pain point in modern RAG workflows. Recent experiments show accuracy near 42 and recall at 21, far below the 53 and 47 reported in the original paper. But the gap isn't just about model size or data: it's about hidden variables in reproducibility.
Here's what's actually happening:
- The core commit linked to the paper's experiments remains elusive: no official hash or branch is pinned to the results.
- The reported hyperparameters (llm_model_max_token_size = 8000, top_k = 4 across all parameters) are standard, yet tweaking them slightly (like adjusting token limits or sampling) often bridges the gap.
- The graph size discrepancy (22k vs. 35k+ nodes) reveals a deeper issue: the preprint likely used Llama-3-8B, not MultihopRAG's documented base model.
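One way to catch silent hyperparameter drift across runs is to fingerprint the full configuration and compare it against a reference. A minimal sketch, assuming a flat config dict (the two values quoted above are from the reported runs; the base-model key is a hypothetical placeholder for illustration):

```python
import hashlib
import json

def config_fingerprint(config: dict) -> str:
    """Return a stable hash of a run configuration.

    Serializing with sorted keys makes the hash independent of
    key insertion order, so two runs with identical settings
    always produce the same fingerprint.
    """
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Reported values; remaining keys are illustrative assumptions.
reference = {
    "llm_model_max_token_size": 8000,
    "top_k": 4,
    "base_model": "Llama-3-8B",  # assumption, see discussion above
}

print(config_fingerprint(reference)[:12])
```

Storing the fingerprint alongside each result makes a mismatched rerun immediately visible, without eyeballing every parameter.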
Beyond the numbers, there's a blind spot in how researchers treat context length and prompt engineering. A single tweak, like adding a context window hint or adjusting beam search, can transform performance. The real elephant in the room? Without transparent, versioned runs, reproducibility remains wishful thinking.
Don't assume defaults equal success. Check hyperparameters line by line, verify model identity, and question edge-case assumptions. When reproducing results, treat every variable like a link in a chain: each one essential to avoid misleading conclusions.
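That line-by-line check can be automated. A hedged sketch of a config diff that flags every mismatched, missing, or unexpected key (the keys below are illustrative, not HippoRAG's actual schema):

```python
def diff_configs(reference: dict, actual: dict) -> list[str]:
    """List human-readable mismatches between two flat configs."""
    problems = []
    for key in sorted(set(reference) | set(actual)):
        if key not in actual:
            problems.append(f"missing: {key}")
        elif key not in reference:
            problems.append(f"unexpected: {key}")
        elif reference[key] != actual[key]:
            problems.append(
                f"{key}: expected {reference[key]!r}, got {actual[key]!r}"
            )
    return problems

paper = {"llm_model_max_token_size": 8000, "top_k": 4}
mine = {"llm_model_max_token_size": 4096, "top_k": 4, "beam_size": 2}

for issue in diff_configs(paper, mine):
    print(issue)
```

Running this before a long evaluation surfaces exactly the kind of small mismatch (a halved token limit, an extra decoding knob) that quietly sinks a reproduction.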
Is your setup truly aligned with the original experiment? Small mismatches make a world of difference.