
A Design Assistant That Runs On Your Device — No Account, No Server

PinePaper's on-device AI assistant turns a prompt into a real scene using a language model that runs entirely in your browser. Here's the architecture — constrained tool calls, two providers, and an honest account of what a tiny model can and can't do.

The premise

Most AI design tools send your prompt to a server, run a large model, and send code back. That requires an account, a network round-trip, and trust in someone else's infrastructure, because your work-in-progress passes through it.

We wanted to know how far the other extreme goes: a design assistant where the model runs on your own machine. No account. No upload. The prompt never leaves the browser. This is now live in PinePaper's editor as an experimental AI / Code → Assistant tab, and this post is an honest account of how it works and where it falls short.

Why not just ask the model for code?

The obvious approach is to ask the on-device model to write JavaScript against PinePaper's API and run it. We tried it. It fails badly.

Small models — the ones that fit on a laptop — have no knowledge of a specific app's API. Ask a 0.5–2B model to call PinePaper.create() and it improvises: a key the method doesn't read (an SVG-style fill where the API expects color), a method it imagined, arguments in the wrong shape. The output looks like code and silently does the wrong thing.

So we don't ask for code. We ask for a constrained list of tool calls:

[
  { "name": "pinepaper_set_background_color", "arguments": { "color": "#0F0F1A" } },
  { "name": "pinepaper_create_item", "arguments": { "itemType": "star", "x": 400, "y": 300, "radius": 90, "color": "#E74C3C", "animationType": "pulse" } }
]

This is the same vocabulary PinePaper's MCP server exposes to external agents, but here it is emitted by a model running locally and executed on the canvas through a small dispatcher.
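To make the dispatcher idea concrete, here is a minimal sketch of how a list of tool calls could be executed in order. Only the two tool names and their arguments come from the example above; the CanvasLike interface, the handler table, and the error-reporting behavior are assumptions made for illustration, not PinePaper's actual code.

```typescript
type ToolCall = { name: string; arguments: Record<string, unknown> };

// Stand-in for the editor canvas (hypothetical; the real API is not shown in this post).
interface CanvasLike {
  setBackground(color: string): void;
  createItem(spec: Record<string, unknown>): void;
}

type Handler = (canvas: CanvasLike, args: Record<string, unknown>) => void;

// One handler per tool name. A Map avoids accidental hits on inherited
// object keys such as "toString".
const handlers = new Map<string, Handler>([
  ["pinepaper_set_background_color", (canvas, args) => {
    const color = args["color"];
    if (typeof color !== "string") throw new Error("color must be a string");
    canvas.setBackground(color);
  }],
  ["pinepaper_create_item", (canvas, args) => {
    const itemType = args["itemType"];
    if (typeof itemType !== "string") throw new Error("itemType must be a string");
    canvas.createItem(args);
  }],
]);

// Runs calls in order and collects one message per failed call
// instead of aborting the whole batch.
function dispatch(canvas: CanvasLike, calls: ToolCall[]): string[] {
  const errors: string[] = [];
  for (const call of calls) {
    const handler = handlers.get(call.name);
    if (handler === undefined) {
      errors.push(`unknown tool: ${call.name}`);
      continue;
    }
    try {
      handler(canvas, call.arguments);
    } catch (e) {
      errors.push(`${call.name}: ${(e as Error).message}`);
    }
  }
  return errors;
}
```

Rejecting unknown tool names outright, rather than guessing, keeps an imagined method from ever reaching the canvas.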

Making invalid output impossible

The key move is constrained decoding. Instead of hoping the model produces a valid shape, we constrain what tokens it's even allowed to emit:

  • On Chrome's built-in Prompt API (Gemini Nano), we pass a JSON Schema as a response constraint. The model can only produce objects the schema permits — with additionalProperties: false, any argument the schema doesn't define is literally unrepresentable.
  • On WebLLM (an open model running on WebGPU), we attach an EBNF grammar via XGrammar. The grammar dictates the structure; the model only fills in the values.
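As a rough illustration of the schema route, the sketch below defines a JSON Schema for a single create_item call and a tiny helper showing what additionalProperties: false excludes. The property list is an assumption based on the example call earlier in this post, and the helper is hypothetical. In the browser, a schema like this would be handed to the Prompt API as the response constraint for a prompt; the helper only demonstrates the effect and is not how constrained decoding is enforced.

```typescript
// Hypothetical schema for one tool call; property names are taken from the
// example call in this post, the rest is an assumption.
const createItemSchema = {
  type: "object",
  properties: {
    name: { const: "pinepaper_create_item" },
    arguments: {
      type: "object",
      properties: {
        itemType: { type: "string" },
        x: { type: "number" },
        y: { type: "number" },
        radius: { type: "number" },
        color: { type: "string" },
        animationType: { type: "string" },
      },
      required: ["itemType"],
      // Any key not listed above cannot appear in the output at all.
      additionalProperties: false,
    },
  },
  required: ["name", "arguments"],
  additionalProperties: false,
} as const;

// Lists argument keys that the schema does not define. With constrained
// decoding these keys are never generated; this check only makes the
// rule visible.
function unknownArguments(call: { arguments: Record<string, unknown> }): string[] {
  const allowed: string[] = Object.keys(createItemSchema.properties.arguments.properties);
  return Object.keys(call.arguments).filter((key) => allowed.indexOf(key) === -1);
}
```

An SVG-style fill key, the kind of improvisation described earlier, is reported as unknown here, while color passes.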

Both produce the same tool-call shape. A small repair pass then salvages the near-misses a tiny model still makes — a dropped name, arguments without their wrapper — by inferring the tool from the argument shape. The result is that "draw a green pentagon" becomes a real create_item call instead of an error.
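A repair pass of this kind might look roughly like the following. The signature table (which argument keys identify which tool) and the function names are assumptions for illustration; the post does not show the actual rules.

```typescript
type ToolCall = { name: string; arguments: Record<string, unknown> };

// Hypothetical signature table. Order matters: more specific shapes first,
// because a create_item call also carries a color.
const signatures: Array<{ name: string; keys: string[] }> = [
  { name: "pinepaper_create_item", keys: ["itemType"] },
  { name: "pinepaper_set_background_color", keys: ["color"] },
];

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Picks the first tool whose identifying keys are all present.
function inferName(args: Record<string, unknown>): string | null {
  for (const sig of signatures) {
    if (sig.keys.every((key) => key in args)) return sig.name;
  }
  return null;
}

// Salvages two near-misses: a dropped name, and arguments without their wrapper.
function repair(raw: unknown): ToolCall | null {
  if (!isRecord(raw)) return null;
  const rawName = raw["name"];
  const rawArgs = raw["arguments"];

  let args: Record<string, unknown>;
  if (isRecord(rawArgs)) {
    args = rawArgs;
  } else {
    // No wrapper: treat every remaining key as an argument.
    args = { ...raw };
    delete args["name"];
    delete args["arguments"];
  }

  const name =
    typeof rawName === "string" && rawName.length > 0 ? rawName : inferName(args);
  return name === null ? null : { name, arguments: args };
}
```

Anything the table cannot identify comes back as null, so the caller can count it as a failed attempt rather than execute a guess.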

Two providers, one contract

On-device AI isn't one thing — it depends on the browser:

  • Browser AI — Chrome's built-in Gemini Nano (window.LanguageModel). The model ships with the browser; there's nothing to download from us. This is the most reliable on-device path today, because the JSON-schema constraint is cheap and well-supported.
  • PinePaper AI — WebLLM running a Qwen2.5 model in any WebGPU browser. It downloads a model once (cached thereafter), then runs offline.
  • Languages — non-English prompts are translated to English on-device first (via the browser's built-in translator), because the small code-oriented models are English-centric. The generated scene is the same regardless of prompt language.

The user picks the engine; everything downstream — the constraint, the executor, the canvas — is identical.
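One way to picture "two providers, one contract" is a shared interface plus a capability check that decides which engines to offer. This is a sketch under assumptions: the Provider interface, the engine ids, and the function names are hypothetical. The check relies on the Prompt API being exposed as a LanguageModel global and WebGPU as navigator.gpu; the presence of the global does not by itself mean the model is ready to use.

```typescript
type ToolCall = { name: string; arguments: Record<string, unknown> };

// The single contract both engines satisfy (hypothetical shape).
interface Provider {
  readonly id: "browser-ai" | "pinepaper-ai";
  generate(prompt: string): Promise<ToolCall[]>;
}

interface Capabilities {
  hasPromptApi: boolean;
  hasWebGpu: boolean;
}

// Takes the global scope as a parameter so it can be exercised outside a
// browser. In a page, pass globalThis.
function detectCapabilities(scope: Record<string, unknown>): Capabilities {
  const nav = scope["navigator"];
  return {
    hasPromptApi: "LanguageModel" in scope,
    hasWebGpu: typeof nav === "object" && nav !== null && "gpu" in nav,
  };
}

// The engines the user can choose between in this browser.
function availableEngines(caps: Capabilities): Array<Provider["id"]> {
  const ids: Array<Provider["id"]> = [];
  if (caps.hasPromptApi) ids.push("browser-ai");
  if (caps.hasWebGpu) ids.push("pinepaper-ai");
  return ids;
}

// Downstream code depends only on the Provider interface, so the executor
// is the same whichever engine produced the calls.
async function runTurn(
  provider: Provider,
  prompt: string,
  execute: (calls: ToolCall[]) => void,
): Promise<number> {
  const calls = await provider.generate(prompt);
  execute(calls);
  return calls.length;
}
```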

Editing, not just generating

A one-shot generator isn't an assistant. To support "make the star red," the model needs to know what's already on the canvas. So each turn we feed it a compact snapshot of the current items — their ids, types, and colors — exactly like a server-side chat would. "Make the star red" then resolves to a real modify_item call against the item's id. (And because a small model sometimes refers to "the_circle" when it means the star, the executor resolves fuzzy references against the live canvas.)
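The snapshot and the fuzzy lookup could be sketched as follows. The ItemSummary fields mirror the ids, types, and colors mentioned above; the snapshot text format and the three-step resolution order are assumptions, not a description of PinePaper's executor.

```typescript
interface ItemSummary {
  id: string;
  type: string;
  color: string;
}

// Compact per-turn snapshot: one short line per item keeps the prompt small.
function snapshot(items: ItemSummary[]): string {
  return items.map((item) => `${item.id} ${item.type} ${item.color}`).join("\n");
}

// Resolves a model-supplied reference against the live canvas.
function resolveReference(ref: string, items: ItemSummary[]): ItemSummary | null {
  // 1. Exact id match.
  const exact = items.find((item) => item.id === ref);
  if (exact !== undefined) return exact;

  // 2. A type word inside the reference, e.g. "the_star".
  const words = ref.toLowerCase().split(/[^a-z0-9]+/).filter((w) => w.length > 0);
  const byType = items.filter((item) => words.indexOf(item.type.toLowerCase()) !== -1);
  if (byType.length === 1) return byType[0] ?? null;

  // 3. Only one item exists, so the reference can only mean that one,
  //    even if the model named the wrong shape.
  if (items.length === 1) return items[0] ?? null;

  // Ambiguous: let the caller treat this as a failed attempt.
  return null;
}
```

With a single star on the canvas, a reference to "the_circle" still lands on the star through the last rule; with several items and no match, the lookup declines to guess.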

The conversation log is kept on your device in local storage. Nothing about it is sent anywhere — unless you explicitly opt in to share failed prompts, which helps us improve the prompts and grammar.

When the small model isn't enough — escalate

Here's the honest part: a 0.5–2B model is the weakest tier. Constrained decoding guarantees valid structure, but the model still has to pick the right tool and sensible values, and it won't always. Ask for a pentagon and an underpowered model gives you a hexagon; ask twice for "red" and it might run a generator instead.

So the architecture treats on-device as the first tier, not the only one. After a few failed attempts the assistant offers to hand the whole conversation up to the Cloud model — same intent, same canvas, a far more capable model — and you continue exactly where you left off. Low fidelity, free and private, with a one-click path to high fidelity when you need it.
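The escalation rule can be pictured as a small counter. In this sketch the threshold of three stands in for "a few failed attempts", the reset-on-success policy is an assumption, and the Handoff shape is hypothetical; the post does not specify any of them.

```typescript
// Hypothetical handoff payload: the conversation so far plus the current canvas.
interface Handoff {
  conversation: string[];
  canvasSnapshot: string;
}

interface EscalationTracker {
  record(success: boolean): void;
  shouldOfferCloud(): boolean;
}

// threshold = 3 is an assumed value for "a few failed attempts".
function createEscalationTracker(threshold: number = 3): EscalationTracker {
  let consecutiveFailures = 0;
  return {
    record(success: boolean): void {
      // Assumed policy: any successful turn resets the streak.
      consecutiveFailures = success ? 0 : consecutiveFailures + 1;
    },
    shouldOfferCloud(): boolean {
      return consecutiveFailures >= threshold;
    },
  };
}

// Packages what the Cloud model needs to continue from the same point.
function buildHandoff(conversation: string[], canvasSnapshot: string): Handoff {
  // Copy the log so later on-device turns do not alter what was handed off.
  return { conversation: conversation.slice(), canvasSnapshot };
}
```

Counting consecutive failures rather than total failures means one bad turn in a long, mostly successful session does not trigger the offer.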

What we've learned

  • Constraints beat prompting. A grounded system prompt helps; a grammar/schema that makes invalid output impossible helps far more. The single biggest reliability jump came from constrained decoding, not from a better prompt.
  • Model size still dominates. Going from a 0.5B to a 1.5B model noticeably improved how often the assistant picks the right tool. There is no prompt that turns a tiny model into a smart one.
  • The full grammar can be too big. Our first grammar encoded every possible operation — including a bounded-length escape hatch for arbitrary drawing code. It was so large that constrained decoding stalled the page. A compact grammar covering the common operations is the right default; the full surface is opt-in.
  • Run it off the main thread. A model doing inference on the UI thread freezes the page. WebLLM runs in a Web Worker so the editor stays responsive while a model loads and generates.

What's next

This is experimental and improving. On the roadmap:

  • Wider tool coverage in the constrained path — more of PinePaper's operations expressible without the freeform escape hatch.
  • Larger on-device models as WebGPU model catalogs grow, trading download size for reliability.
  • A tighter eval loop — using opted-in failure reports to measure which prompts trip the on-device models and harden the grammar against them.
  • A smoother on-device → Cloud handoff, so escalation feels like turning up the quality dial rather than switching tools.

The throughline: the same declarative tool vocabulary drives a model on either side of the browser boundary. An external agent calls these tools over MCP; an on-device model calls the same shapes locally. One contract, agents wherever they run.

Try it in the editor — open AI / Code → Assistant. It's free, runs on your device, and needs no account.
