
Building LetsGit.IT with an LLM: Quiz Fairness

How we fixed a surprisingly common quiz bug: users could guess the correct answer just by picking the longest option.

Jakub Merta · Dec 29, 2025 · 10 min read · Engineering
#letsgit-it #llm #codex-cli #quiz #data-quality #product-building

Summary

A practical build story: we removed “longest answer wins” bias from the LetsGit.IT quiz without sacrificing full explanations outside the quiz.

Key takeaways

  • Measure bias before changing UX.
  • Prefer quiz-specific answer variants over trimming.
  • Make distractors plausible by removing “absolute” cues.
  • Use lightweight generation (trade-off flips) to add subtle wrong options.
  • Ship in clean slices with the commit-work skill.


This is the first post in the series about building LetsGit.IT with LLM assistance. The focus here isn’t “AI magic”, but the boring, reliable parts of product engineering: measuring what’s wrong, changing the data model (not just the UI), and shipping in a way that keeps the codebase maintainable.

This chapter is based on the 2025-12-29 session notes (sessions/articles/2025-12-29.md). If you prefer Polish, read the PL version here: /pl/blog/tworzenie-letsgit-it-z-llm-uczciwy-quiz.

The problem: the quiz was solvable by length

LetsGit.IT has a multiple-choice quiz mode: one correct answer and (usually) two distractors. In theory, you win by understanding the topic. In practice, the dataset created a strong bias:

  • the correct answer was almost always the longest option
  • wrong answers were short, vague, or “obviously wrong” (often with absolute wording)

That’s not a subtle UX issue. It’s a validity issue: the quiz stops measuring knowledge and starts measuring pattern recognition.

Step 1: measure it (don’t argue about it)

Before touching UI or copy, we ran a simple audit over the seed batches (recruitment-assistant/prisma/data/migrations/*.json), checking, for each question, whether the correct answer was strictly the longest option among the quiz candidates.

The headline result from that audit was blunt: 100% of questions across categories (EN/PL) had the correct answer as the longest in the generated quiz choices.
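
The audit itself doesn’t need to be fancy. Here’s a minimal sketch of that check, assuming each seed batch is a JSON array of questions; the field names (correctAnswer, wrongAnswers) are illustrative, not the actual schema:

import { readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";

// Illustrative shape; the real seed schema may differ.
type SeedQuestion = {
  slug: string;
  correctAnswer: string;
  wrongAnswers: string[];
};

const dir = "recruitment-assistant/prisma/data/migrations";
let total = 0;
let longestWins = 0;

for (const file of readdirSync(dir).filter((f) => f.endsWith(".json"))) {
  const questions: SeedQuestion[] = JSON.parse(readFileSync(join(dir, file), "utf8"));
  for (const q of questions) {
    total += 1;
    // "Strictly longest": the correct answer is longer than every distractor.
    if (q.wrongAnswers.every((w) => q.correctAnswer.length > w.length)) {
      longestWins += 1;
    }
  }
}

console.log(`${longestWins}/${total} (${((100 * longestWins) / total).toFixed(1)}%) solvable by picking the longest option`);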

The key lesson is not the specific number. It’s the workflow:

  1. pick a simple metric that reflects user behavior (“does length predict correctness?”)
  2. make it cheap to recompute
  3. use it as a guardrail for future changes

Step 2: the tempting quick fix — and why we rolled it back

The first remediation attempt was a classic “ship now” move: normalize option height and clamp/trim quiz answers in the quiz UI. The idea was simple: if users can’t see the length difference, the bias disappears.

That approach did reduce the visual cue, but it introduced new problems:

  • trimming can remove the clause that makes an answer correct
  • you end up “hiding meaning” instead of improving content quality
  • it conflicts with the product requirement: full answers must remain available outside the quiz (lists, detail pages)

So we rolled it back and switched to a cleaner model.

Step 3: quiz-specific answer variants (short answers)

The better fix is to admit what the quiz needs: short, quiz-optimized answer variants, while keeping full answers intact everywhere else.

We implemented that with a small, explicit layer:

  • recruitment-assistant/src/data/quiz-answers.ts contains per-question overrides (EN + PL)
  • recruitment-assistant/src/lib/questions.ts resolves the quiz answer variant at runtime
  • the quiz UI consumes the “quiz answer” field; list/detail views still display the full answer

Conceptually:

type QuizAnswerVariant = {
  questionSlug: string;
  en: { quizAnswer: string };
  pl: { quizAnswer: string };
};
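
On the resolution side, questions.ts stays equally small. A rough sketch of the lookup, with assumed names and import path (the real module differs in details):

import { quizAnswerVariants } from "@/data/quiz-answers"; // assumed export: QuizAnswerVariant[]

// Return the short quiz-specific variant when an override exists,
// otherwise fall back to the full answer used everywhere else.
function resolveQuizAnswer(
  questionSlug: string,
  fullAnswer: string,
  locale: "en" | "pl",
): string {
  const override = quizAnswerVariants.find((v) => v.questionSlug === questionSlug);
  return override?.[locale].quizAnswer ?? fullAnswer;
}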

This is intentionally boring:

  • it’s explicit (no heuristic trimming at render time)
  • it’s testable (a question either has an override or not)
  • it keeps the product promise: “short answers in quiz, full answers in detail”

This is also where an LLM is actually useful: generating first drafts of short answers that you can then refine. The important constraint: the override must preserve correctness, not just be shorter.

Step 4: make distractors plausible (without rewriting everything by hand)

Once “correct answer length” stopped being a tell, the next weakness became obvious: the wrong answers weren’t competitive. Two patterns stood out:

  • absolute cues (“always”, “never”, “impossible”) that are rarely true in engineering
  • distractors that were either too short or semantically unrelated

We ran an automated cleanup pass over wrong answers and rewrote ~2.4k entries to remove the most obvious tells while preserving the intended “wrongness”.
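
The detection half of that pass is plain string work. A minimal sketch of flagging absolute cues (the cue list is illustrative, and the real pass also covered PL wording):

// Illustrative cue list; the real pass used a broader EN/PL set.
const ABSOLUTE_CUES = /\b(always|never|impossible|guaranteed|cannot|only)\b/i;

function hasAbsoluteCue(distractor: string): boolean {
  return ABSOLUTE_CUES.test(distractor);
}

// Flag entries for rewrite rather than auto-softening them blindly:
// the rewrite still has to preserve the intended "wrongness".
const sample = [
  "Indexes always make every query faster",
  "Adding an index can slow down writes on that table",
];
console.log(sample.filter(hasAbsoluteCue));
// -> ["Indexes always make every query faster"]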

This isn’t about making the quiz mean. It’s about making it useful. A good distractor should be “close to true”, forcing the learner to recall the detail that matters.

“Exclude list” was generated… then postponed

We also generated a list of “exclude candidates” for questions that are too broad for a multiple-choice quiz, mostly based on answer length and scope. For now, we decided not to exclude any questions: the generated list was too aggressive.

The bigger point: automation can propose a list, but product decisions still need a human pass.

Trade-off flips: cheap, believable wrong options

For some answers, you can generate a plausible wrong option by flipping a trade-off. Examples:

  • “higher consistency, higher latency” → “higher consistency, lower latency”
  • “fewer network hops, less flexibility” → “fewer network hops, more flexibility”

We added a lightweight runtime step in the quiz option generator: when the correct answer contains an explicit trade-off pattern, generate an “inverted” variant as an extra candidate distractor.

Here’s the simplified flow, sketched as code rather than a diagram (the trade-off pairs and helper names below are illustrative, not the exact production implementation):
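
// Trade-off word pairs to invert; illustrative, not the full production list.
const TRADE_OFF_PAIRS: Array<[string, string]> = [
  ["higher", "lower"],
  ["fewer", "more"],
  ["faster", "slower"],
];

// If the correct answer states an explicit trade-off, return an "inverted"
// variant as an extra candidate distractor; otherwise return null.
function invertTradeOff(correctAnswer: string): string | null {
  for (const [a, b] of TRADE_OFF_PAIRS) {
    const pattern = new RegExp(`\\b(${a}|${b})\\b`, "i");
    const match = correctAnswer.match(pattern);
    if (!match) continue;
    const replacement = match[1].toLowerCase() === a ? b : a;
    return correctAnswer.replace(pattern, replacement);
  }
  return null;
}

// "higher consistency, higher latency" -> "lower consistency, higher latency"
console.log(invertTradeOff("higher consistency, higher latency"));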

This is not “LLM generation in production”. It’s deterministic string-level augmentation that improves plausibility when the data already contains a trade-off.

Step 5: ship it cleanly (this is where the commit-work skill matters)

Session notes often stop at “we implemented it”. But the last 10% matters:

  • make sure the change is reviewable later
  • make sure you can bisect issues
  • make sure you can revert safely

In Codex CLI, that’s where the commit-work skill fits. The skill is essentially a structured checklist + helper:

  1. review what changed (and what should not have changed)
  2. propose a logical split into commits (code vs docs, refactor vs content)
  3. produce a Conventional Commit message that matches the intent

In this case, the repo history reflects that shape:

  • feat(quiz): improve distractors and quiz answers
  • docs: update session log

A practical “LLM-assisted but still responsible” shipping loop looks like this:

cd recruitment-assistant
bun lint
bun run test

Then:

  • stage only the intended files
  • run the commit skill to draft a clean message and sanity-check the diff

The meta-lesson: LLMs help, but only if your workflow still enforces boundaries.

What this taught us about building LetsGit.IT with an LLM

  • The dataset is part of the product. If it’s biased, the UX will be biased.
  • UI tricks are not a substitute for content quality. Trimming “fixes” the symptom.
  • Prefer explicit contracts over heuristics. quizAnswer is a contract; “trim in UI” is a hack.
  • Automate passes, then review the edge cases. A bulk rewrite gets you 80%; humans finish the last 20%.
  • Treat shipping as a first-class step. Clean commits make future LLM work cheaper (context is clearer).

Next in the series

In the next posts, we’ll go deeper on:

  • manual per-category tuning (starting with categories like /en/category/data-structures)
  • measuring quiz difficulty and “distractor quality” over time
  • keeping EN/PL content aligned without turning content editing into a chore

If you’re building your own interview-prep tool, you can reuse the pattern even if your stack is different: measure bias → design a content contract → augment distractors deterministically → ship in reviewable slices.