2026-05-13

App Settings Page

Depth pick: Claude Opus 4.7. More ARIA, more breakpoints, more design tokens.
VS
Efficiency pick: GPT-5.5. ~4× faster, ~9× cheaper, perfect Lighthouse a11y score, cleaner HTML.
01

The Prompt

Identical for both models
Verbatim — pasted into a fresh chat
Build a polished application Settings page for a fictional product called "Stack" — the kind of settings UI that ships in a modern SaaS dashboard. It should look and feel like a real product page that could ship today.

Requirements (all must be present):
1. Left-side section nav with at least 4 sections — for example: Profile, Account, Notifications, Appearance. Clicking a nav item switches the content area without a page reload. The current section must be visually indicated and announced to assistive tech (e.g. aria-current="page"). On mobile this nav must collapse (top tabs, accordion, hamburger — your call) while staying keyboard operable.
2. Profile section: avatar placeholder (initials in a colored circle is fine), display-name input, email input, short bio textarea. Each field shows a clear label and a helper-text or error slot.
3. Notifications section: at least 5 settings rendered as toggle switches — e.g. email digest, product updates, security alerts, mention notifications, marketing. Each toggle has a label and a one-line description.
4. Appearance section: theme control (Auto / Light / Dark — segmented control), accent-color picker (4–6 swatches), interface density (Comfortable / Compact). Selecting Light or Dark must actually change the page theme; Auto follows prefers-color-scheme. The chosen accent color must visibly drive at least one accent in the UI (primary button background, active nav indicator, etc.).
5. A sticky "save bar" at the bottom of the content area that appears only when at least one field has unsaved changes. It must contain a primary "Save changes" button and a secondary "Discard" button. Save and Discard must both clear the dirty state.
6. Dark / Light support wired throughout the page — every section, control, and the save bar must look polished in both themes. (The Appearance theme control drives this.)
7. Responsive: works at mobile (375px), tablet (768px), and desktop (1280px+). On mobile, the sidebar nav must collapse and content must remain readable; no horizontal scroll at any of those widths.
8. Accessible:
   - All controls reachable and operable by keyboard, with visible focus rings.
   - Toggles use role="switch" (or <button> with aria-checked) — not raw checkboxes hidden under fake skins.
   - Segmented controls use role="group" with aria-pressed per option (or role="radiogroup" + role="radio").
   - Honor prefers-reduced-motion — no large animations for users who disable motion.
   - Color contrast must meet WCAG AA in both themes.
9. Visually polished: thoughtful typography, spacing, hierarchy, hover/focus/active states, subtle micro-interactions where they help (toggle thumb animation, save-bar slide-in, etc.).

Output:
- A single self-contained index.html file.
- All CSS inline in <style>. All JavaScript inline in <script>. No external dependencies, no CDN, no images, no icon libraries (inline SVG is fine).
- The file must render correctly when opened directly in a browser (file://) or embedded in an iframe at any of the three viewport widths above.
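For reference, requirement 8's switch pattern in its simplest form. A minimal sketch, not taken from either model's output; the class names and the data-setting hook are illustrative:

    <button type="button" role="switch" aria-checked="false"
            class="toggle" data-setting="email-digest">
      <span class="toggle-thumb" aria-hidden="true"></span>
    </button>
    <script>
      // A native <button> is keyboard-operable for free (Enter/Space fire
      // click); flipping aria-checked is the only state JS has to touch.
      document.querySelectorAll('[role="switch"]').forEach(function (sw) {
        sw.addEventListener('click', function () {
          var on = sw.getAttribute('aria-checked') === 'true';
          sw.setAttribute('aria-checked', on ? 'false' : 'true');
        });
      });
    </script>
    <style>
      .toggle-thumb { transition: transform 150ms ease; }
      [aria-checked="true"] .toggle-thumb { transform: translateX(16px); }
      /* Requirement 8 also asks for reduced-motion support: */
      @media (prefers-reduced-motion: reduce) {
        .toggle-thumb { transition: none; }
      }
    </style>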
02

Live Result

The actual rendered output, side-by-side
Viewport 1280 px
Opus 4.7
GPT-5.5
03

The Code

Raw source as written by each model
single self-contained index.html
04

Scorecard

The metrics behind what you just saw
Duration
~26m vs ~7m
Opus 4.7 GPT-5.5

Wall clock from prompt submit to last file write. Approximate.

Bundle size
50.1 KB vs 40.0 KB
Opus 4.7 GPT-5.5

Bytes of the produced index.html.

Lines of code
1,379 vs 1,327
Opus 4.7 GPT-5.5

Total lines including blank lines.

Inline JS
14.1 KB vs 6.4 KB
Opus 4.7 GPT-5.5

Bytes inside <script> tags.

Inline CSS
18.5 KB vs 20.0 KB
Opus 4.7 GPT-5.5

Bytes inside <style> tags.

HTML errors
14 vs 2
Opus 4.7 GPT-5.5

W3C Nu HTML Checker error count.

ARIA attributes
86 vs 59
Opus 4.7 GPT-5.5

Total aria-* attribute usages — proxy for a11y attention.

Spec adherence (tie)
20/20 vs 20/20
Opus 4.7 GPT-5.5

Automated structural checks on the 9 prompt requirements.

Lighthouse a11y
96/100 vs 100/100
Opus 4.7 GPT-5.5

Both models: Performance 100, Best Practices 100, SEO 82.

Cost est.
~$1.10 vs ~$0.12
Opus 4.7 GPT-5.5

Estimated from output size × public per-token pricing.

05

Reviewer Notes

Per-model deep-dive

Claude Opus 4.7

Richer build
Visual polish
8 / 10
Code quality
9 / 10
Spec adherence
10 / 10
Would you ship it?
With minor edits

The architecture is the standout. State lives on <html> as data-theme / data-accent / data-density, so theming + accent + density are all pure CSS — no JS style writes, no per-element class flipping. 90 CSS custom properties, 86 ARIA usages, 31 explicit role attributes, and six media-query breakpoints. Reads like something a senior dev wrote on purpose.
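A sketch of that shape, boiled down; the attribute values and property names here are illustrative stand-ins for the 90 real tokens:

    <html data-theme="dark" data-accent="violet" data-density="compact">
    <style>
      :root                    { --bg: #fff;    --fg: #15181d; --row-pad: 16px; }
      [data-theme="dark"]      { --bg: #15181d; --fg: #e8eaef; }
      [data-accent="violet"]   { --accent: #7c5cff; }
      [data-density="compact"] { --row-pad: 8px; }
      body          { background: var(--bg); color: var(--fg); }
      .settings-row { padding: var(--row-pad); }
      .btn-primary  { background: var(--accent); }
    </style>
    <script>
      // The entire cost of a theme change is one attribute write;
      // the cascade restyles every section, control, and the save bar.
      document.documentElement.dataset.theme = 'light';
    </script>

Every new control inherits the tokens for free, which is why this reads as architecture rather than styling.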

The polish is "stock SaaS-good," not "designer-good." Type rhythm, spacing, hover states are all correct; nothing is wrong. But the overall feel is the safe interpretation of the brief — solid, readable, slightly utilitarian. The W3C validator also flags 14 errors, almost all the same pattern: <button> elements missing type="button", plus a handful of misplaced aria-pressed. Mechanical to fix; ship it after a five-minute pass.
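The pattern and its fix, reconstructed rather than quoted from the validator output:

    <!-- As written: inside a <form>, an untyped button defaults to type="submit". -->
    <button class="nav-item" aria-current="page">Profile</button>
    <!-- After the pass: -->
    <button type="button" class="nav-item" aria-current="page">Profile</button>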

GPT-5.5

Lean artifact
Visual polish
9 / 10
Code quality
8 / 10
Spec adherence
10 / 10
Would you ship it?
Yes

The visual taste shows. 28px corner radii, glassmorphic save bar with backdrop-filter, HSL-component-split tokens (--accent-h/s/l) so tints + shades derive live from one accent. Less ARIA layering than Opus (59 vs 86) but the spec rules don't punish that — both meet every keyboard + role requirement. This is what shows up in a Dribbble shot.
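The token trick, sketched. --accent-h/s/l are the names the review quotes; the values and the swatch selector are illustrative:

    :root {
      --accent-h: 262; --accent-s: 83%; --accent-l: 58%;
      --accent:      hsl(var(--accent-h) var(--accent-s) var(--accent-l));
      --accent-soft: hsl(var(--accent-h) var(--accent-s) 93%);  /* tint  */
      --accent-deep: hsl(var(--accent-h) var(--accent-s) 36%);  /* shade */
    }
    /* Picking a swatch rewrites three numbers; every derived tint follows. */
    [data-accent="teal"] { --accent-h: 174; --accent-s: 70%; --accent-l: 42%; }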

It also did it ~4× faster, smaller, and cleaner. 7 minutes wall-clock, 40.0 KB total, 6.4 KB of inline JS — less than half of Opus's. HTML validation came back nearly clean: only 2 errors, both the same minor pattern (aria-label on a bare <div>). For a one-shot first draft, this is the closer of the two to "drop it in and ship."
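That error is the Nu checker objecting to an accessible name on an element with no role to attach it to; the likely shape of the fix, reconstructed:

    <!-- Flagged: a bare <div> has no role for the label to name. -->
    <div aria-label="Notification preferences">
      <!-- toggles -->
    </div>
    <!-- Fixed: give it a role (or use a native grouping element). -->
    <div role="group" aria-label="Notification preferences">
      <!-- toggles -->
    </div>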

06

Methodology

Every benchmark on this site follows the same rules. The point is to measure the model's first instinct under realistic conditions, not its ceiling after coaching.

  • Same prompt, verbatim. Pasted into a fresh chat. No system-prompt edits, no rules pre-loaded, no skills attached.
  • Single shot. One prompt in, one response out. No follow-ups, no "fix the toggle" iterations. Whatever ships first is what gets scored.
  • Agent mode. Cursor's Agent mode is on. The model may use any tool calls it wants; tool-call count is a metric, not a constraint.
  • Same starting state. Output goes into an empty benchmarks/<slug>/<model-slug>/ directory. No scaffolding, no example file to imitate.
  • Self-contained output. All HTML, CSS, and JS must live inside the index.html the model produces. No CDN, no external assets, no images.
  • Captured immediately. Duration, tool calls, tokens in/out, and cost are recorded the moment the run completes — closing the chat in Cursor wipes those numbers.