
Building LetsGit.IT with an LLM: Quiz Fairness

How we fixed a surprisingly common quiz bug: users could guess the correct answer just by picking the longest option.

Jakub Merta · Dec 29, 2025 · 10 min read · Engineering
#letsgit-it #llm #codex-cli #quiz #data-quality #product-building

Summary

A practical build story: we removed “longest answer wins” bias from the LetsGit.IT quiz without sacrificing full explanations outside the quiz.

Key takeaways

  • Measure bias before changing UX.
  • Prefer quiz-specific answer variants over trimming.
  • Make distractors plausible by removing “absolute” cues.
  • Use lightweight generation (trade-off flips) to add subtle wrong options.
  • Ship in clean slices with the commit-work skill.


This is the first post in the series about building LetsGit.IT with LLM assistance. The focus here isn’t “AI magic”, but the boring, reliable parts of product engineering: measuring what’s wrong, changing the data model (not just the UI), and shipping in a way that keeps the codebase maintainable.

This chapter is based on the 2025-12-29 session notes (sessions/articles/2025-12-29.md). If you prefer Polish, read the PL version here: /pl/blog/tworzenie-letsgit-it-z-llm-uczciwy-quiz.

The problem: the quiz was solvable by length

LetsGit.IT has a multiple-choice quiz mode: one correct answer and (usually) two distractors. In theory, you win by understanding the topic. In practice, the dataset created a strong bias:

  • the correct answer was almost always the longest option
  • wrong answers were short, vague, or “obviously wrong” (often with absolute wording)

That’s not a subtle UX issue. It’s a validity issue: the quiz stops measuring knowledge and starts measuring pattern recognition.

Step 1: measure it (don’t argue about it)

Before touching UI or copy, we ran a simple audit over the seed batches (recruitment-assistant/prisma/data/migrations/*.json), checking, for each question, whether the correct answer was strictly the longest option among the quiz candidates.

The headline result from that audit was blunt: 100% of questions across categories (EN/PL) had the correct answer as the longest in the generated quiz choices.
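
The audit itself doesn’t need to be fancy. Here’s a minimal sketch of that check, assuming each seed batch is a JSON array of questions; the field names (correctAnswer, wrongAnswers) are illustrative, not the actual schema:

import { readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";

// Illustrative shape; the real seed schema may differ.
type SeedQuestion = {
  slug: string;
  correctAnswer: string;
  wrongAnswers: string[];
};

const dir = "recruitment-assistant/prisma/data/migrations";
let total = 0;
let longestWins = 0;

for (const file of readdirSync(dir).filter((f) => f.endsWith(".json"))) {
  const questions: SeedQuestion[] = JSON.parse(readFileSync(join(dir, file), "utf8"));
  for (const q of questions) {
    total += 1;
    // "Strictly longest": the correct answer is longer than every distractor.
    if (q.wrongAnswers.every((w) => q.correctAnswer.length > w.length)) {
      longestWins += 1;
    }
  }
}

console.log(`${longestWins}/${total} (${((100 * longestWins) / total).toFixed(1)}%) solvable by picking the longest option`);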

The key lesson is not the specific number. It’s the workflow:

  1. pick a simple metric that reflects user behavior (“does length predict correctness?”)
  2. make it cheap to recompute
  3. use it as a guardrail for future changes

Step 2: the tempting quick fix — and why we rolled it back

The first remediation attempt was a classic “ship now” move: normalize option height and clamp/trim quiz answers in the quiz UI. The idea was simple: if users can’t see the length difference, the bias disappears.

That approach did reduce the visual cue, but it introduced new problems:

  • trimming can remove the clause that makes an answer correct
  • you end up “hiding meaning” instead of improving content quality
  • it conflicts with the product requirement: full answers must remain available outside the quiz (lists, detail pages)

So we rolled it back and switched to a cleaner model.

Step 3: quiz-specific answer variants (short answers)

The better fix is to admit what the quiz needs: short, quiz-optimized answer variants, while keeping full answers intact everywhere else.

We implemented that with a small, explicit layer:

  • recruitment-assistant/src/data/quiz-answers.ts contains per-question overrides (EN + PL)
  • recruitment-assistant/src/lib/questions.ts resolves the quiz answer variant at runtime
  • the quiz UI consumes the “quiz answer” field; list/detail views still display the full answer

Conceptually:

type QuizAnswerVariant = {
  questionSlug: string;
  en: { quizAnswer: string };
  pl: { quizAnswer: string };
};
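
On the resolution side, questions.ts stays equally small. A rough sketch of the lookup, with assumed names and import path (the real module differs in details):

import { quizAnswerVariants } from "@/data/quiz-answers"; // assumed export: QuizAnswerVariant[]

// Return the short quiz-specific variant when an override exists,
// otherwise fall back to the full answer used everywhere else.
function resolveQuizAnswer(
  questionSlug: string,
  fullAnswer: string,
  locale: "en" | "pl",
): string {
  const override = quizAnswerVariants.find((v) => v.questionSlug === questionSlug);
  return override?.[locale].quizAnswer ?? fullAnswer;
}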

This is intentionally boring:

  • it’s explicit (no heuristic trimming at render time)
  • it’s testable (a question either has an override or not)
  • it keeps the product promise: “short answers in quiz, full answers in detail”

This is also where an LLM is actually useful: generating first drafts of short answers that you can then refine. The important constraint: the override must preserve correctness, not just be shorter.

Step 4: make distractors plausible (without rewriting everything by hand)

Once “correct answer length” stopped being a tell, the next weakness became obvious: the wrong answers weren’t competitive. Two patterns stood out:

  • absolute cues (“always”, “never”, “impossible”) that are rarely true in engineering
  • distractors that were either too short or semantically unrelated

We ran an automated cleanup pass over wrong answers and rewrote ~2.4k entries to remove the most obvious tells while preserving the intended “wrongness”.
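
The detection half of that pass is plain string work. A minimal sketch of flagging absolute cues (the cue list is illustrative, and the real pass also covered PL wording):

// Illustrative cue list; the real pass used a broader EN/PL set.
const ABSOLUTE_CUES = /\b(always|never|impossible|guaranteed|cannot|only)\b/i;

function hasAbsoluteCue(distractor: string): boolean {
  return ABSOLUTE_CUES.test(distractor);
}

// Flag entries for rewrite rather than auto-softening them blindly:
// the rewrite still has to preserve the intended "wrongness".
const sample = [
  "Indexes always make every query faster",
  "Adding an index can slow down writes on that table",
];
console.log(sample.filter(hasAbsoluteCue));
// -> ["Indexes always make every query faster"]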

This isn’t about making the quiz mean. It’s about making it useful. A good distractor should be “close to true”, forcing the learner to recall the detail that matters.

“Exclude list” was generated… then postponed

We also generated a list of “exclude candidates” for questions that are too broad for a multiple-choice quiz, mostly based on answer length and scope. For now, we decided not to exclude any questions: the generated list was too aggressive.

The bigger point: automation can propose a list, but product decisions still need a human pass.

Trade-off flips: cheap, believable wrong options

For some answers, you can generate a plausible wrong option by flipping a trade-off. Examples:

  • “higher consistency, higher latency” → “higher consistency, lower latency”
  • “fewer network hops, less flexibility” → “fewer network hops, more flexibility”

We added a lightweight runtime step in the quiz option generator: when the correct answer contains an explicit trade-off pattern, generate an “inverted” variant as an extra candidate distractor.

Here’s the simplified flow, sketched as code rather than a diagram (the trade-off pairs and helper names below are illustrative, not the exact production implementation):
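
// Trade-off word pairs to invert; illustrative, not the full production list.
const TRADE_OFF_PAIRS: Array<[string, string]> = [
  ["higher", "lower"],
  ["fewer", "more"],
  ["faster", "slower"],
];

// If the correct answer states an explicit trade-off, return an "inverted"
// variant as an extra candidate distractor; otherwise return null.
function invertTradeOff(correctAnswer: string): string | null {
  for (const [a, b] of TRADE_OFF_PAIRS) {
    const pattern = new RegExp(`\\b(${a}|${b})\\b`, "i");
    const match = correctAnswer.match(pattern);
    if (!match) continue;
    const replacement = match[1].toLowerCase() === a ? b : a;
    return correctAnswer.replace(pattern, replacement);
  }
  return null;
}

// "higher consistency, higher latency" -> "lower consistency, higher latency"
console.log(invertTradeOff("higher consistency, higher latency"));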

This is not “LLM generation in production”. It’s deterministic string-level augmentation that improves plausibility when the data already contains a trade-off.

Step 5: ship it cleanly (this is where the commit-work skill matters)

Session notes often stop at “we implemented it”. But the last 10% matters:

  • make sure the change is reviewable later
  • make sure you can bisect issues
  • make sure you can revert safely

In Codex CLI, that’s where the commit-work skill fits. The skill is essentially a structured checklist + helper:

  1. review what changed (and what should not have changed)
  2. propose a logical split into commits (code vs docs, refactor vs content)
  3. produce a Conventional Commit message that matches the intent

In this case, the repo history reflects that shape:

  • feat(quiz): improve distractors and quiz answers
  • docs: update session log

A practical “LLM-assisted but still responsible” shipping loop looks like this:

cd recruitment-assistant
bun lint
bun run test

Then:

  • stage only the intended files
  • run the commit skill to draft a clean message and sanity-check the diff

The meta-lesson: LLMs help, but only if your workflow still enforces boundaries.

What this taught us about building LetsGit.IT with an LLM

  • The dataset is part of the product. If it’s biased, the UX will be biased.
  • UI tricks are not a substitute for content quality. Trimming “fixes” the symptom.
  • Prefer explicit contracts over heuristics. quizAnswer is a contract; “trim in UI” is a hack.
  • Automate passes, then review the edge cases. A bulk rewrite gets you 80%; humans finish the last 20%.
  • Treat shipping as a first-class step. Clean commits make future LLM work cheaper (context is clearer).

Next in the series

In the next posts, we’ll go deeper on:

  • manual per-category tuning (starting with categories like /en/category/data-structures)
  • measuring quiz difficulty and “distractor quality” over time
  • keeping EN/PL content aligned without turning content editing into a chore

If you’re building your own interview-prep tool, you can reuse the pattern even if your stack is different: measure bias → design a content contract → augment distractors deterministically → ship in reviewable slices.