Participant Guide
About the Challenge
Agentic RAG Legal Challenge 2026 is an international engineering competition focused on building production-grade Retrieval-Augmented Generation (RAG) systems for the legal domain. The challenge runs for two weeks and culminates in winner announcements at Machines Can See 2026 in Dubai.
This is not a hackathon — it is a benchmark-style competition with objective evaluation, private test sets, telemetry, and strict anti-gaming mechanisms.
For technical details (scoring formulas, API format, telemetry schema) refer to the starter kit: EVALUATION.md, API.md, and README.md.
Your Goal
Build a RAG system that maximizes:
Legal accuracy
Grounded retrieval quality
Low latency (TTFT)
Robust document ingestion
Faithfulness / no hallucinations
Production realism
2. What You Will Receive
Participants are provided with:
Document corpus
~300 public legal documents (regulations, case law, etc.) in varied formats.
Demo set
30 documents + 100 public questions for pipeline debugging.
Final evaluation set
900 private questions across the full corpus.
Submission API
Pull-model interface where your system fetches questions, processes them locally, and submits answers with telemetry.
Dataset structure
As defined in the dataset spec document:
deterministic answers (numbers, names, booleans, lists, dates) + assistant-style free-text answers (max 280 chars).
3. What You Need to Build
A fully functional RAG pipeline. It must include:
1. Document Ingestion & Parsing
Legal documents come in heterogeneous formats. You must handle:
PDF → text extraction (including OCR for scanned documents)
Mixed formatting and structural inconsistencies
Long “stress test” documents
Clause-level segmentation and hierarchy detection
Metadata extraction (titles, sections, case numbers, dates)
Your ingestion stage is critical — evaluation heavily depends on retrieval grounding.
2. Indexing & Chunking
Chunking must respect legal structure, not fixed token windows.
Recommended components:
clause-aware or heading-aware segmentation
dense embeddings + re-ranking
hybrid search (BM25 + semantic)
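The heading-aware segmentation recommended above can be sketched as follows. This is a minimal illustration under stated assumptions, not the organizers' method: the heading pattern (Article / Section / Clause followed by a number) and the function name split_by_headings are hypothetical, and real legal documents will need a richer set of patterns.

```python
import re

# Hypothetical heading pattern: lines such as "Article 12", "Section 3.1", "Clause 4".
# Real corpora will need more patterns (schedules, annexes, numbered paragraphs).
HEADING_RE = re.compile(
    r"^(Article|Section|Clause)[ \t]+\d+(\.\d+)*\b.*$", re.MULTILINE
)


def split_by_headings(text: str) -> list:
    """Split text into chunks that each start at a detected heading."""
    matches = list(HEADING_RE.finditer(text))
    chunks = []
    # Text before the first heading (title, preamble) becomes its own chunk.
    first_start = matches[0].start() if matches else len(text)
    preamble = text[:first_start].strip()
    if preamble:
        chunks.append({"heading": None, "text": preamble})
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        chunks.append(
            {"heading": match.group(0).strip(), "text": text[match.start():end].strip()}
        )
    return chunks


sample = (
    "Data Protection Law\n\n"
    "Article 1 Scope\nThis law applies.\n\n"
    "Article 2 Definitions\nTerms are defined.\n"
)
for chunk in split_by_headings(sample):
    print(chunk["heading"], "|", len(chunk["text"]), "chars")
```

Keeping the heading with each chunk gives the retriever and re-ranker structural context that a fixed token window would cut through.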
3. Retrieval
Your system must return:
relevant chunks
minimal noise
high recall (missing evidence leads to penalties)
Every answer must include retrieved_chunk_pages; otherwise the retrieval score for that answer becomes zero.
4. Generation
Two groups of questions:
Deterministic factual questions
Answer types: number, boolean, name, names, date. If the answer is not present in the corpus, return JSON null.
Free-text assistant questions
Up to 280 characters, legally faithful, concise, well-grounded.
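The two question groups imply a post-processing step between the model and the submission. The sketch below shows one way to coerce raw model output into a typed value. The function name coerce_answer, the ";" separator for lists, and the accepted spellings for booleans and null are assumptions for illustration; the answer type names come from this guide, and API.md defines the real format.

```python
from datetime import date
from typing import Optional

MAX_FREE_TEXT_CHARS = 280  # limit stated in this guide


def coerce_answer(raw: Optional[str], answer_type: str):
    """Convert raw model output into a JSON-serializable value for answer_type."""
    value = (raw or "").strip()
    if answer_type == "free_text":
        # Hard cut; a real system should shorten at a sentence boundary instead.
        return value[:MAX_FREE_TEXT_CHARS]
    if value.lower() in {"", "null", "none"}:
        return None  # serializes to JSON null
    if answer_type == "number":
        return float(value.replace(",", ""))
    if answer_type == "boolean":
        if value.lower() in {"true", "yes"}:
            return True
        if value.lower() in {"false", "no"}:
            return False
        raise ValueError(f"not a boolean: {value!r}")
    if answer_type == "names":
        # Assumes the prompt asks the model to separate names with ";".
        return [part.strip() for part in value.split(";") if part.strip()]
    if answer_type == "date":
        # Raises ValueError if the model did not produce an ISO 8601 date.
        return date.fromisoformat(value).isoformat()
    return value  # "name" is passed through unchanged
```

Raising on malformed output, rather than guessing, lets you retry or log the question instead of submitting a value that fails exact-match scoring.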
5. Telemetry
Every answer must include:
ttft_ms
token usage
retrieved chunks
total runtime
Missing telemetry → −10% penalty for that answer.
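Because a missing field costs points on every affected answer, it is worth checking each telemetry record before submitting it. The sketch below is one possible pre-submission check. Only ttft_ms and retrieved_chunk_pages are named in this guide; the other keys (total_time_ms, input_tokens, output_tokens) are hypothetical placeholders, so take the real schema from API.md.

```python
# Only "ttft_ms" and "retrieved_chunk_pages" are named in this guide.
# The remaining keys are hypothetical; replace them with the names from API.md.
REQUIRED_TELEMETRY = {
    "ttft_ms": (int, float),
    "total_time_ms": (int, float),  # hypothetical name for total runtime
    "input_tokens": int,            # hypothetical name for token usage
    "output_tokens": int,           # hypothetical name for token usage
    "retrieved_chunk_pages": list,
}


def telemetry_problems(telemetry: dict) -> list:
    """Return human-readable problems; an empty list means the record looks complete."""
    problems = []
    for key, expected in REQUIRED_TELEMETRY.items():
        if key not in telemetry:
            problems.append(f"missing {key}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(telemetry[key], bool) or not isinstance(telemetry[key], expected):
            problems.append(f"wrong type for {key}")
    # Sanity check: the first token cannot arrive after the whole answer finished.
    if not problems and telemetry["ttft_ms"] > telemetry["total_time_ms"]:
        problems.append("ttft_ms exceeds total_time_ms")
    return problems
```

Running this check on the demo set first surfaces schema mistakes before they affect the private evaluation.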
4. Evaluation
Your solution is scored across four dimensions:
1. Deterministic Accuracy
Simple, strict rules:
numeric tolerance (±1%)
exact match for booleans and names
Jaccard similarity for lists
ISO 8601 exact match for dates
JSON null for absent information
(See starter kit — EVALUATION.md — for full details on scoring rules.)
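For local debugging on the demo set, the rules listed above can be approximated with a small scorer. This is a sketch, not the official implementation: the function name score_deterministic is hypothetical, and details such as relative tolerance against the expected value, case sensitivity, and name normalization are assumptions. EVALUATION.md remains authoritative.

```python
def score_deterministic(predicted, expected, answer_type: str) -> float:
    """Approximate per-question score in [0, 1] for a deterministic answer."""
    # JSON null: correct only when both sides are null.
    if predicted is None or expected is None:
        return 1.0 if predicted is expected else 0.0
    if answer_type == "number":
        # Assumption: the 1% tolerance is relative to the expected value.
        if expected == 0:
            return 1.0 if predicted == 0 else 0.0
        return 1.0 if abs(predicted - expected) <= 0.01 * abs(expected) else 0.0
    if answer_type == "names":
        # Jaccard similarity: size of intersection over size of union.
        a, b = set(predicted), set(expected)
        return len(a & b) / len(a | b) if (a | b) else 1.0
    # boolean, name, and date (ISO 8601 string) use exact match.
    return 1.0 if predicted == expected else 0.0


print(score_deterministic(100.5, 100, "number"))
print(score_deterministic(["A", "B"], ["B", "C"], "names"))
```

Note how a list answer with one correct and one wrong name scores only a third under Jaccard similarity, so extra names are penalized as well as missing ones.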
2. Free-Text Quality (LLM Judge)
Each answer is scored on 5 criteria:
Correctness
Completeness
Grounding
Confidence calibration
Clarity & conciseness
Evaluation uses a cascade of multiple LLMs for consistency.
3. Retrieval / Grounding Score
Penalties for:
irrelevant pages (noise)
missing required evidence (recall)
Balanced to reward precise, minimal, faithful retrieval.
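The trade-off between noise and missing evidence can be monitored locally with page-level precision and recall. The sketch below uses an F-beta score as a stand-in; the actual grounding formula is defined in EVALUATION.md and may differ. Representing a page as a (document id, page number) tuple and the function name grounding_scores are assumptions for illustration.

```python
def grounding_scores(retrieved: set, required: set, beta: float = 1.0) -> dict:
    """Page-level precision, recall, and F-beta for one answer.

    Pages are hypothetical (doc_id, page_number) tuples.
    """
    # Nothing required and nothing retrieved: treat as a perfect match.
    if not retrieved and not required:
        return {"precision": 1.0, "recall": 1.0, "f_beta": 1.0}
    hits = len(retrieved & required)
    precision = hits / len(retrieved) if retrieved else 0.0  # penalizes noise
    recall = hits / len(required) if required else 0.0       # penalizes missing evidence
    if precision + recall == 0:
        f_beta = 0.0
    else:
        b2 = beta * beta
        f_beta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return {"precision": precision, "recall": recall, "f_beta": f_beta}


scores = grounding_scores(
    retrieved={("doc1", 3), ("doc1", 4), ("doc2", 1)},
    required={("doc1", 3), ("doc1", 4)},
)
print(scores)
```

In this example every required page is found, but one extra page lowers precision, which is the kind of noise the grounding score is described as penalizing.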
4. Latency (TTFT Modifier)
Speed is part of the score:
<1s → +5% bonus
1–2s → +2%
2–3s → no modifier
>3s → penalty up to −15%
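To see how latency feeds into a local score estimate, the tiers above can be written as a function. The guide does not state how the penalty above 3 s grows or which tier the exact boundaries fall into, so the linear ramp, the penalty_full_at_ms parameter, and the boundary handling below are all assumptions. Check EVALUATION.md for the real curve.

```python
def ttft_modifier(ttft_ms: float, penalty_full_at_ms: float = 10_000.0) -> float:
    """Return the score modifier as a fraction, e.g. 0.05 for a +5% bonus."""
    if ttft_ms < 1000.0:
        return 0.05
    if ttft_ms < 2000.0:
        return 0.02
    if ttft_ms <= 3000.0:
        return 0.0
    # Assumption: the penalty grows linearly from 0 at 3 s to the full -15%
    # at penalty_full_at_ms, then stays capped at -15%.
    ramp = min(1.0, (ttft_ms - 3000.0) / (penalty_full_at_ms - 3000.0))
    return -0.15 * ramp


for ms in (800, 1500, 2500, 5000, 20000):
    print(ms, ttft_modifier(ms))
```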
5. Submission Workflow
Fill in the registration form in the Discord #welcome-challenge channel.
After a moderator verifies your submission, you will receive a login and password for the competition platform.
Connect your system to the pull-model API.
Fetch questions in batches or streaming mode.
Run your pipeline locally.
Submit answers with telemetry for each question.
Provide a short Architecture Summary describing your models and retrieval strategy (for transparency and post-competition publication).
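The fetch, process, and submit steps above can be sketched as a simple loop. To avoid inventing endpoints, the three operations are passed in as callables: fetch_batch, answer_question, and submit_answer are hypothetical names, and the convention that an empty batch means "no more questions" is an assumption. The real client calls come from API.md.

```python
from typing import Callable, Iterable


def run_submission_loop(
    fetch_batch: Callable[[], Iterable[dict]],
    answer_question: Callable[[dict], dict],
    submit_answer: Callable[[dict], None],
) -> int:
    """Pull batches until an empty one arrives; return the number of answers submitted."""
    submitted = 0
    while True:
        batch = list(fetch_batch())
        if not batch:  # assumption: an empty batch signals the end
            break
        for question in batch:
            # answer_question runs the local pipeline and attaches telemetry.
            submit_answer(answer_question(question))
            submitted += 1
    return submitted


# Usage with in-memory stand-ins for the API.
batches = iter([[{"id": "q1"}, {"id": "q2"}], [{"id": "q3"}], []])
sent = []
count = run_submission_loop(
    fetch_batch=lambda: next(batches),
    answer_question=lambda q: {"question_id": q["id"], "answer": None},
    submit_answer=sent.append,
)
print(count, "answers submitted")
```

Injecting the API calls this way also makes it easy to replay the demo set offline without touching the competition platform.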
6. Rules & Requirements
Mandatory
Only public APIs and public models may be used.
Telemetry required for every answer.
No manual answering or partial automation.
No leaking or sharing private questions.
Allowed
Any embedding model or search engine accessible via public API.
Model ensembles and hybrid pipelines.
Local preprocessing of documents.
Custom re-rankers.
Prohibited
Hardcoding answers.
Synthetic leakage.
Manually editing logs or telemetry.
7. Recommendations for a Competitive Solution
Focus on retrieval precision
Legal RAG systems fail on irrelevant context. Use hybrid retrievers, re-ranking, and clause-aware chunking.
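One common way to combine a BM25 ranking with a dense-embedding ranking is reciprocal rank fusion (RRF), where each chunk receives the sum of 1 / (k + rank) over the rankings it appears in. The guide does not prescribe a fusion method, so treat this as one option among several; the chunk ids are made up, and k = 60 is the commonly used default.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list, k: int = 60) -> list:
    """Fuse several ranked lists of chunk ids into one list, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by chunk id for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


bm25_ranking = ["c1", "c2", "c3"]   # hypothetical lexical results
dense_ranking = ["c2", "c4", "c1"]  # hypothetical semantic results
print(reciprocal_rank_fusion([bm25_ranking, dense_ranking]))
# -> ['c2', 'c1', 'c4', 'c3']
```

RRF needs only ranks, not comparable scores, so it works without calibrating BM25 scores against cosine similarities. A re-ranker can then be applied to the top of the fused list.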
Optimize for TTFT
Consider:
fast small models for retrieval
streaming generation
caching for long documents
batching efficiently
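Streaming only helps if TTFT is measured from the moment the question is received, so that retrieval time is included. The wrapper below is a sketch of that measurement for any token iterator. The function name stream_with_timing and the total_time_ms key are hypothetical, and fake_model stands in for a real streaming client.

```python
import time
from typing import Iterable, Iterator


def stream_with_timing(
    tokens: Iterable[str], timings: dict, started_at: float
) -> Iterator[str]:
    """Yield tokens unchanged while recording ttft_ms and total_time_ms.

    started_at is a time.perf_counter() reading taken when the question
    was received, before retrieval starts.
    """
    first_seen = False
    for token in tokens:
        if not first_seen:
            timings["ttft_ms"] = (time.perf_counter() - started_at) * 1000.0
            first_seen = True
        yield token
    timings["total_time_ms"] = (time.perf_counter() - started_at) * 1000.0


def fake_model() -> Iterator[str]:
    """Stand-in for a streaming LLM call."""
    for token in ("Article ", "5 ", "applies."):
        time.sleep(0.01)
        yield token


timings = {}
started_at = time.perf_counter()
answer = "".join(stream_with_timing(fake_model(), timings, started_at))
print(answer, timings)
```

If the stream yields no tokens, ttft_ms is never set, which a telemetry check should flag before submission.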
Avoid hallucinations
For many questions the correct answer is JSON null (deterministic types) or a natural-language statement such as "There is no information on this question in the provided documents." (free_text). Return an empty retrieved_chunk_pages array in both cases.
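The rule above can be captured in a small helper that your pipeline calls whenever retrieval finds no supporting evidence. The "answer" key and the function name no_answer_response are assumptions for illustration; the free_text type name, the null convention, the fallback sentence, and retrieved_chunk_pages come from this guide.

```python
import json

NO_INFO_TEXT = "There is no information on this question in the provided documents."


def no_answer_response(answer_type: str) -> dict:
    """Build the response for a question the corpus cannot answer."""
    # Free-text questions get a natural-language statement;
    # deterministic types get None, which serializes to JSON null.
    answer = NO_INFO_TEXT if answer_type == "free_text" else None
    return {"answer": answer, "retrieved_chunk_pages": []}


print(json.dumps(no_answer_response("boolean")))
# -> {"answer": null, "retrieved_chunk_pages": []}
```

Deciding when to call this helper is the hard part: a relevance threshold on the re-ranker score is one possible trigger, and it should be tuned on the demo questions.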
Telemetry correctness matters
Malformed telemetry destroys your score even if the answer is good.
8. Prize Categories
Expected categories include:
1st–3rd overall places
Speed Champion (lowest TTFT)
Efficiency Expert (best score/token ratio)
Retrieval Master (highest grounding score)
Best Publication (blogpost/video)
Teams may win multiple prizes.
9. Final Advice
This challenge rewards engineering rather than brute force.
Strong teams typically:
build robust ingestion pipelines
chunk carefully
test retrieval thoroughly
optimize latency pragmatically
keep answers short, grounded, and legally faithful
verify telemetry early and often
If you treat this like a real production RAG system, you will perform well.