2026-05-13

App Settings Page

Depth pick: Claude Opus 4.7. More ARIA, more breakpoints, more design tokens.
VS
Efficiency pick: GPT-5.5. ~4× faster, ~9× cheaper, perfect Lighthouse a11y score, cleaner HTML.
01

The Prompt

Identical for both models
Verbatim — pasted into a fresh chat
Build a polished application Settings page for a fictional product called "Stack" — the kind of settings UI that ships in a modern SaaS dashboard. It should look and feel like a real product page that could ship today.

Requirements (all must be present):
1. Left-side section nav with at least 4 sections — for example: Profile, Account, Notifications, Appearance. Clicking a nav item switches the content area without a page reload. The current section must be visually indicated and announced to assistive tech (e.g. aria-current="page"). On mobile this nav must collapse (top tabs, accordion, hamburger — your call) while staying keyboard operable.
2. Profile section: avatar placeholder (initials in a colored circle is fine), display-name input, email input, short bio textarea. Each field shows a clear label and a helper-text or error slot.
3. Notifications section: at least 5 settings rendered as toggle switches — e.g. email digest, product updates, security alerts, mention notifications, marketing. Each toggle has a label and a one-line description.
4. Appearance section: theme control (Auto / Light / Dark — segmented control), accent-color picker (4–6 swatches), interface density (Comfortable / Compact). Selecting Light or Dark must actually change the page theme; Auto follows prefers-color-scheme. The chosen accent color must visibly drive at least one accent in the UI (primary button background, active nav indicator, etc.).
5. A sticky "save bar" at the bottom of the content area that appears only when at least one field has unsaved changes. It must contain a primary "Save changes" button and a secondary "Discard" button. Save and Discard must both clear the dirty state.
6. Dark / Light support wired throughout the page — every section, control, and the save bar must look polished in both themes. (The Appearance theme control drives this.)
7. Responsive: works at mobile (375px), tablet (768px), and desktop (1280px+). On mobile, the sidebar nav must collapse and content must remain readable; no horizontal scroll at any of those widths.
8. Accessible:
   - All controls reachable and operable by keyboard, with visible focus rings.
   - Toggles use role="switch" (or <button> with aria-checked) — not raw checkboxes hidden under fake skins.
   - Segmented controls use role="group" with aria-pressed per option (or role="radiogroup" + role="radio").
   - Honor prefers-reduced-motion — no large animations for users who disable motion.
   - Color contrast must meet WCAG AA in both themes.
9. Visually polished: thoughtful typography, spacing, hierarchy, hover/focus/active states, subtle micro-interactions where they help (toggle thumb animation, save-bar slide-in, etc.).

Output:
- A single self-contained index.html file.
- All CSS inline in <style>. All JavaScript inline in <script>. No external dependencies, no CDN, no images, no icon libraries (inline SVG is fine).
- The file must render correctly when opened directly in a browser (file://) or embedded in an iframe at any of the three viewport widths above.
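For reference, requirement 8's switch pattern in its simplest form. A minimal sketch, not taken from either model's output; the class names and the data-setting hook are illustrative:

    <button type="button" role="switch" aria-checked="false"
            class="toggle" data-setting="email-digest">
      <span class="toggle-thumb" aria-hidden="true"></span>
    </button>
    <script>
      // A native <button> is keyboard-operable for free (Enter/Space fire
      // click); flipping aria-checked is the only state JS has to touch.
      document.querySelectorAll('[role="switch"]').forEach(function (sw) {
        sw.addEventListener('click', function () {
          var on = sw.getAttribute('aria-checked') === 'true';
          sw.setAttribute('aria-checked', on ? 'false' : 'true');
        });
      });
    </script>
    <style>
      .toggle-thumb { transition: transform 150ms ease; }
      [aria-checked="true"] .toggle-thumb { transform: translateX(16px); }
      /* Requirement 8 also asks for reduced-motion support: */
      @media (prefers-reduced-motion: reduce) {
        .toggle-thumb { transition: none; }
      }
    </style>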
02

Live Result

The actual rendered output, side-by-side
Viewport 1280 px
Opus 4.7
GPT-5.5
03

The Code

Raw source as written by each model
single self-contained index.html
04

Scorecard

The metrics behind what you just saw
Duration
~26m vs ~7m
Opus 4.7 GPT-5.5

Wall clock from prompt submit to last file write. Approximate.

Bundle size
50.1 KB vs 40.0 KB
Opus 4.7 GPT-5.5

Bytes of the produced index.html.

Lines of code
1,379 vs 1,327
Opus 4.7 GPT-5.5

Total lines including blank lines.

Inline JS
14.1 KB vs 6.4 KB
Opus 4.7 GPT-5.5

Bytes inside <script> tags.

Inline CSS
18.5 KB vs 20.0 KB
Opus 4.7 GPT-5.5

Bytes inside <style> tags.

HTML errors
14 vs 2
Opus 4.7 GPT-5.5

W3C Nu HTML Checker error count.

ARIA attributes
86 vs 59
Opus 4.7 GPT-5.5

Total aria-* attribute usages — proxy for a11y attention.

Spec adherence (tie)
20/20 vs 20/20
Opus 4.7 GPT-5.5

Automated structural checks on the 9 prompt requirements.

Lighthouse a11y
96/100 vs 100/100
Opus 4.7 GPT-5.5

Both models: Performance 100, Best Practices 100, SEO 82.

Cost est.
~$1.10 vs ~$0.12
Opus 4.7 GPT-5.5

Estimated from output size × public per-token pricing.

05

Reviewer Notes

Per-model deep-dive

Claude Opus 4.7

Richer build
Visual polish
8 / 10
Code quality
9 / 10
Spec adherence
10 / 10
Would you ship it?
With minor edits

The architecture is the standout. State lives on <html> as data-theme / data-accent / data-density, so theming + accent + density are all pure CSS — no JS style writes, no per-element class flipping. 90 CSS custom properties, 86 ARIA usages, 31 explicit role attributes, and six media-query breakpoints. Reads like something a senior dev wrote on purpose.
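A sketch of that shape, boiled down; the attribute values and property names here are illustrative stand-ins for the 90 real tokens:

    <html data-theme="dark" data-accent="violet" data-density="compact">
    <style>
      :root                    { --bg: #fff;    --fg: #15181d; --row-pad: 16px; }
      [data-theme="dark"]      { --bg: #15181d; --fg: #e8eaef; }
      [data-accent="violet"]   { --accent: #7c5cff; }
      [data-density="compact"] { --row-pad: 8px; }
      body          { background: var(--bg); color: var(--fg); }
      .settings-row { padding: var(--row-pad); }
      .btn-primary  { background: var(--accent); }
    </style>
    <script>
      // The entire cost of a theme change is one attribute write;
      // the cascade restyles every section, control, and the save bar.
      document.documentElement.dataset.theme = 'light';
    </script>

Every new control inherits the tokens for free, which is why this reads as architecture rather than styling.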

The polish is "stock SaaS-good," not "designer-good." Type rhythm, spacing, hover states are all correct; nothing is wrong. But the overall feel is the safe interpretation of the brief — solid, readable, slightly utilitarian. The W3C validator also flags 14 errors, almost all the same pattern: <button> elements missing type="button", plus a handful of misplaced aria-pressed. Mechanical to fix; ship it after a five-minute pass.
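The pattern and its fix, reconstructed rather than quoted from the validator output:

    <!-- As written: inside a <form>, an untyped button defaults to type="submit". -->
    <button class="nav-item" aria-current="page">Profile</button>
    <!-- After the pass: -->
    <button type="button" class="nav-item" aria-current="page">Profile</button>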

GPT-5.5

Lean artifact
Visual polish
9 / 10
Code quality
8 / 10
Spec adherence
10 / 10
Would you ship it?
Yes

The visual taste shows. 28px corner radii, glassmorphic save bar with backdrop-filter, HSL-component-split tokens (--accent-h/s/l) so tints + shades derive live from one accent. Less ARIA layering than Opus (59 vs 86) but the spec rules don't punish that — both meet every keyboard + role requirement. This is what shows up in a Dribbble shot.
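The token trick, sketched. --accent-h/s/l are the names the review quotes; the values and the swatch selector are illustrative:

    :root {
      --accent-h: 262; --accent-s: 83%; --accent-l: 58%;
      --accent:      hsl(var(--accent-h) var(--accent-s) var(--accent-l));
      --accent-soft: hsl(var(--accent-h) var(--accent-s) 93%);  /* tint  */
      --accent-deep: hsl(var(--accent-h) var(--accent-s) 36%);  /* shade */
    }
    /* Picking a swatch rewrites three numbers; every derived tint follows. */
    [data-accent="teal"] { --accent-h: 174; --accent-s: 70%; --accent-l: 42%; }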

It also did it ~4× faster, smaller, and cleaner. 7 minutes wall-clock, 40.0 KB total, 6.4 KB of inline JS — less than half of Opus's. HTML validation came back nearly clean: only 2 errors, both the same minor pattern (aria-label on a bare <div>). For a one-shot first draft, this is the closer of the two to "drop it in and ship."
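That error is the Nu checker objecting to an accessible name on an element with no role to attach it to; the likely shape of the fix, reconstructed:

    <!-- Flagged: a bare <div> has no role for the label to name. -->
    <div aria-label="Notification preferences">
      <!-- toggles -->
    </div>
    <!-- Fixed: give it a role (or use a native grouping element). -->
    <div role="group" aria-label="Notification preferences">
      <!-- toggles -->
    </div>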

06

Methodology

Every benchmark on this site follows the same rules. The point is to measure the model's first instinct under realistic conditions, not its ceiling after coaching.

  • Same prompt, verbatim. Pasted into a fresh chat. No system-prompt edits, no rules pre-loaded, no skills attached.
  • Single shot. One prompt in, one response out. No follow-ups, no "fix the toggle" iterations. Whatever ships first is what gets scored.
  • Agent mode. Cursor's Agent mode is on. The model may use any tool calls it wants; tool-call count is a metric, not a constraint.
  • Same starting state. Output goes into an empty benchmarks/<slug>/<model-slug>/ directory. No scaffolding, no example file to imitate.
  • Self-contained output. All HTML, CSS, and JS must live inside the index.html the model produces. No CDN, no external assets, no images.
  • Captured immediately. Duration, tool calls, tokens in/out, and cost are recorded the moment the run completes — closing the chat in Cursor wipes those numbers.