Building the System Design Simulator with Codex CLI
An end-to-end account of building the System Design Simulator with Codex CLI, covering data modeling, diagrams, scoring, and feedback.
Summary
Walkthrough of the end-to-end build: data model, scoring matrix, and how decisions translate into a readable system design narrative.
Key takeaways
- Model scenarios as structured steps with explicit tradeoffs and feedback.
- Treat diagram rendering as a first-class output, not a byproduct.
- Keep scoring transparent so users can reason about outcomes.
- Store revisions so the simulator evolves without losing history.
Here’s the full build story: what we decided, what we built, and what we actually verified (and what we didn’t).
The simulator is a multi-step system design experience where choices update the diagram and the scoring in real time. This post walks through the path we took to make that work end to end.
If you want the interview framing first, see System Design Interview: From Zero to Hero. For product context, start with System Design Simulator: Uber-like Architecture Walkthrough.
Planning the simulator with Codex CLI
We started with a quick planning pass to make the scope explicit. The non‑negotiables were:
- The simulator is a multi-step quiz-style flow.
- It should show a diagram on one side that updates after each choice.
- There are no strictly correct or incorrect answers.
- Every answer should produce per-answer feedback.
- Earlier choices should influence later outcomes (in scoring and feedback).
- Start with a single scenario, fully fleshed: an Uber-like app.
To keep scope tight, we wrote a clear definition of done: one scenario, a fixed number of steps, a small set of choices per step, fully localized copy, and a results view that explains every signal shift. We also decided early that all metrics would share the same directionality (higher is better) to avoid mixed polarity. That choice shaped the model, the UI, and the feedback language.
We picked Mermaid because it was quick to integrate and good enough for a polished MVP.
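For context on how light that integration is, here is a minimal sketch of a diagram renderer, assuming Mermaid v10's promise-based `render` API and a React client component; the component name and props are illustrative, not the actual implementation:

```tsx
"use client"; // assumes a Next.js-style client component

import { useEffect, useRef } from "react";
import mermaid from "mermaid";

// We control rendering ourselves, so auto-start is disabled.
mermaid.initialize({ startOnLoad: false });

export function DiagramRenderer({ definition }: { definition: string }) {
  const containerRef = useRef<HTMLDivElement>(null);

  useEffect(() => {
    let cancelled = false;
    // mermaid.render (v10+) returns a promise resolving to the generated SVG markup.
    mermaid
      .render(`diagram-${Date.now()}`, definition)
      .then(({ svg }) => {
        if (!cancelled && containerRef.current) {
          containerRef.current.innerHTML = svg;
        }
      })
      .catch(() => {
        // If a new definition fails to parse, keep the previous diagram on screen.
      });
    return () => {
      cancelled = true;
    };
  }, [definition]);

  return <div ref={containerRef} aria-label="Architecture diagram" />;
}
```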
Then we laid out a simple build plan:
- Add Mermaid dependency.
- Create a localized public route for the simulator.
- Build a diagram renderer component.
- Build a simulator component with step logic and result matrix.
- Define a scenario data model with metrics, flags, and conditional effects.
- Update dictionaries with simulator UI text.
Before coding, we asked a few blunt questions: when should feedback show up, can users change their mind, how many steps is enough, and how do we keep choices balanced. That quick Q&A prevented scope drift and gave us concrete acceptance criteria.
We also compared a few strategies for the core mechanics:
- Diagram rendering: a full graph editor vs. a lightweight renderer. We picked Mermaid for speed, clarity, and low integration cost.
- Feedback timing: inline after each step vs. a consolidated end summary. We chose end‑summary to keep the flow focused and to make cross‑step influence easier to understand.
- Scoring model: mixed polarity vs. unified polarity. We chose “higher is better” for all metrics to avoid mental overhead.
- Data organization: logic in code vs. declarative data. We chose a data‑first model with flags and conditional deltas so the scenario drives behavior.
Once those choices were made, the rest was straightforward: the code was mostly a translation of the plan.
The Uber-like scenario (single, fully fleshed)
We started with a single scenario: an Uber-like ride-hailing platform. It includes:
- A base diagram with Rider/Driver apps, an API gateway, and core services.
- Five steps, each with three options (3^5 = 243 possible paths).
Step 1: Service boundary strategy
- Modular monolith + single Postgres
- Service-oriented + per-domain databases
- Event-driven core with a shared bus
Step 2: Location storage
- Redis Geo with TTL
- Postgres + PostGIS
- Cassandra wide-column store
Step 3: Real-time updates
- WebSockets gateway
- Server-Sent Events
- Client polling
Step 4: Matching pipeline
- Synchronous matching
- Queue + matcher workers
- Stream processing pipeline
Step 5: Caching and resilience
- No cache, rely on DB
- Redis cache + fallbacks
- Geo-sharded cache + fallback
Each choice includes metric deltas and per-choice feedback. This keeps the
scenario concrete and ensures the final result matrix is meaningful.
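To make that shape concrete, here is a rough sketch of what the declarative scenario model could look like; the type and field names (and the example values) are illustrative assumptions, not the exact schema in the codebase:

```ts
// Illustrative scenario schema -- names and values are assumptions, not the real data.
type Metric = "latency" | "cost" | "reliability" | "complexity" | "scalability";

type Choice = {
  id: string;
  label: string;
  metrics: Partial<Record<Metric, number>>; // unconditional deltas; higher is always better
  flags?: string[];                         // markers that later steps can react to
  conditionalMetrics?: {
    whenFlag: string;                       // applies only if an earlier choice set this flag
    metrics: Partial<Record<Metric, number>>;
    note: string;                           // feeds the "influences" section of the summary
  }[];
  feedback: string;                         // per-choice note shown in the results view
};

type Step = { id: string; title: string; choices: Choice[] };

type Scenario = {
  id: string;
  baseDiagram: string;                      // Mermaid definition of the starting architecture
  steps: Step[];
};

// Example choice (values invented for illustration): a WebSockets gateway that
// gains a reliability bonus if a queue-based matcher was chosen in an earlier step.
const websockets: Choice = {
  id: "websockets",
  label: "WebSockets gateway",
  metrics: { latency: 2, complexity: -1 },
  flags: ["persistent-connections"],
  conditionalMetrics: [
    { whenFlag: "queue-matching", metrics: { reliability: 1 }, note: "Queue absorbs reconnect bursts." },
  ],
  feedback: "Lowest push latency, but you now own connection state.",
};
```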
If you want to refresh the fundamentals, browse Architecture questions.
System design simulator UI and step engine
Under the hood, the simulator is state-driven:
- `answers` maps step IDs to chosen choice IDs.
- `currentStepIndex` tracks progress.
- The diagram is recomputed every time answers change.
Navigation is intentionally simple. You can go back and change an answer, but you cannot move on without picking one for the current step. That keeps the diagram and scoring in sync. When a choice changes, we reset the summary so it always reflects the latest path. Small detail, big difference.
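As a rough sketch of that state (reusing the illustrative `Scenario` type from the earlier snippet and assuming React hooks; the hook name and the `buildDiagram` helper are hypothetical):

```ts
import { useMemo, useState } from "react";

export function useSimulator(
  scenario: Scenario,
  buildDiagram: (scenario: Scenario, answers: Record<string, string>) => string,
) {
  const [answers, setAnswers] = useState<Record<string, string>>({}); // stepId -> choiceId
  const [currentStepIndex, setCurrentStepIndex] = useState(0);
  const [showSummary, setShowSummary] = useState(false);

  // The diagram is a pure function of the answers, so it recomputes whenever they change.
  const diagram = useMemo(() => buildDiagram(scenario, answers), [scenario, answers]);

  function choose(stepId: string, choiceId: string) {
    setAnswers((prev) => ({ ...prev, [stepId]: choiceId }));
    setShowSummary(false); // changing any answer invalidates the previous summary
  }

  function next() {
    const step = scenario.steps[currentStepIndex];
    if (!answers[step.id]) return; // cannot advance without picking a choice for the current step
    setCurrentStepIndex((i) => Math.min(i + 1, scenario.steps.length - 1));
  }

  function back() {
    setCurrentStepIndex((i) => Math.max(i - 1, 0));
  }

  return { answers, currentStepIndex, diagram, showSummary, setShowSummary, choose, next, back };
}
```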
Scoring logic
Scoring works in a few steps:
- Initializes a metric totals object with zeros.
- Iterates over each answered step.
- Applies each choice’s `metrics` deltas to the totals.
- Applies any `conditionalMetrics` when flags match.
- Stores per-step results for the final report.
We do not call answers right or wrong; we label the tradeoffs like this:
- Strong if total ≥ +2
- Balanced if total is between −1 and +1
- Risk if total ≤ −2
That is the intent: show tradeoffs instead of right answers.
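Here is a minimal sketch of that pass, reusing the illustrative types from the scenario snippet; the thresholds match the labels above, while the function and field names are assumptions:

```ts
type StepResult = { stepId: string; total: number; verdict: "strong" | "balanced" | "risk"; feedback: string };

function score(scenario: Scenario, answers: Record<string, string>) {
  const totals: Record<Metric, number> = { latency: 0, cost: 0, reliability: 0, complexity: 0, scalability: 0 };
  const activeFlags = new Set<string>();
  const stepResults: StepResult[] = [];

  for (const step of scenario.steps) {
    const choice = step.choices.find((c) => c.id === answers[step.id]);
    if (!choice) continue; // unanswered steps contribute nothing

    let stepTotal = 0;

    // Unconditional deltas from the choice itself.
    for (const [metric, delta] of Object.entries(choice.metrics) as [Metric, number][]) {
      totals[metric] += delta;
      stepTotal += delta;
    }

    // Conditional deltas that only apply when an earlier choice set the matching flag.
    for (const cond of choice.conditionalMetrics ?? []) {
      if (!activeFlags.has(cond.whenFlag)) continue;
      for (const [metric, delta] of Object.entries(cond.metrics) as [Metric, number][]) {
        totals[metric] += delta;
        stepTotal += delta;
      }
    }

    choice.flags?.forEach((flag) => activeFlags.add(flag));

    stepResults.push({
      stepId: step.id,
      total: stepTotal,
      verdict: stepTotal >= 2 ? "strong" : stepTotal <= -2 ? "risk" : "balanced",
      feedback: choice.feedback,
    });
  }

  return { totals, stepResults };
}
```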
Scoring matrix and feedback details
At the end, users get a single summary panel. It includes:
- Summary metrics (latency, cost, reliability, complexity, scalability)
- Per-step strengths and risks
- A tradeoff summary
- An "influences" section for conditional effects
- The new "Why signals moved" section
We keep the flow clean by saving feedback for the end. That lets people stay focused while they choose, then reflect afterward. Pairing per-step notes with global metrics makes the story feel cohesive — you can see both the local reasoning and the overall system behavior.
This is where the story comes together: you can see how early decisions were softened or amplified by later choices.
Because the matrix is derived from the scenario config, new scenarios do not require changes to the rendering logic.
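For illustration, one row of that matrix might look like the shape below, built entirely from the per-step results in the scoring sketch; the field names are assumptions:

```ts
// Hypothetical shape of a results-matrix row; everything here is derived from scenario data plus answers.
type MatrixRow = {
  stepTitle: string;                           // taken straight from the scenario config
  choiceLabel: string;                         // the option the user picked
  verdict: "strong" | "balanced" | "risk";     // from the per-step total
  feedback: string;                            // the per-choice note defined in the scenario
  influences: string[];                        // notes from conditionalMetrics entries that actually fired
};
```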
Verification and known gaps (system design simulator)
We did a light validation of the dictionary JSON by parsing it with Node. We did not run the full lint/test suite during this build.
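For reference, that check amounts to a one-off parse along these lines; the file names and paths are assumptions about where the dictionaries live:

```ts
// check-dictionaries.ts -- hypothetical helper script; run with Node or Bun.
import { readFileSync } from "node:fs";

for (const locale of ["en", "pl"]) {
  const path = `dictionaries/${locale}.json`; // assumed location of the locale dictionaries
  JSON.parse(readFileSync(path, "utf8"));     // throws (and fails the script) if the JSON is malformed
  console.log(`${path} parsed OK`);
}
```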
If you want a full verification pass, the next steps should be:
- `bun install`
- `bun lint`
- Open the simulator in both English and Polish locales in light and dark themes.
This is the minimum to confirm the simulator renders, Mermaid loads correctly, and the UI looks good under both themes.
Using Codex skills for planning and commits
We leaned on Codex to keep the work tidy. The planning step forced us to answer
the tricky questions up front: scope, feedback timing, and how choices should
influence each other. That made the definition of done concrete and kept us from
drifting.
If you want the exact skill definitions we used, they live in codex-skills.
We also compared options side by side: diagram engines, scoring polarity, and where logic should live. We chose the simpler, data-driven path so new scenarios could be added without changing the engine. It saved us from over-engineering a custom editor or hardcoding a decision tree.
Here is a trimmed version of the Q&A we kept in front of us:
Q: When should feedback be shown? A: At the end, so users stay in flow while making decisions.
Q: Can users change their mind? A: Yes. Backtracking is allowed, and results recompute immediately.
Q: Are there correct answers? A: No. Every option is viable but comes with tradeoffs.
Q: How do earlier choices affect later ones? A: Use flags + conditional deltas to model dependencies.
Q: How many steps should the MVP include? A: Five steps with three choices each.
We also kept a tiny tradeoff matrix to make decisions explicit:
| Decision | Option A | Option B |
| --- | --- | --- |
| Diagram engine | Mermaid (fast, readable) | Custom editor (powerful, heavy) |
| Feedback timing | End summary (cohesive) | Inline (interrupts flow) |
| Scoring polarity | Higher-is-better (clear) | Mixed polarity (confusing) |
| Logic placement | Data-driven (extensible) | Hardcoded (brittle) |
| Scenario scope | One deep scenario | Many shallow scenarios |
Finally, the commit workflow kept history clean: review the changes, stage them as one coherent unit, and write a clear Conventional Commit message. A clean example looks like this:
feat: add system design simulator with diagram feedback
Final summary
The System Design Simulator is now a first-class experience in the app:
- It shows a dynamic architecture diagram that updates after each choice.
- It gives a transparent, reasoned explanation of every metric shift.
- It captures inter-step dependencies via conditional scoring.
- It stays fully localized and theme-aware.
Most importantly, it creates a learning experience where users understand why tradeoffs exist and how architecture decisions compound over time.
If you want a deeper technical breakdown, I can include exact code snippets for any part of the simulator or generate a full decision tree for the 243 possible paths.
Read next
System Design Interview: From Zero to Hero
Learn how to approach complex system design problems and communicate your architecture decisions effectively.
System Design Simulator: Uber-like Architecture Walkthrough
A behind-the-scenes look at the new System Design Simulator: how decisions shape the diagram, metrics, and tradeoffs.