Object Detection

Detect the objects inside an image and compose graphics onto them — for example, drop a heart on a dog so it stays on the dog when you move the image. Detection runs entirely on your device (nothing is uploaded), using a built-in AI model that recognizes 80 everyday object classes (people, animals, vehicles, food, furniture, and more).

This is bounding-box detection: it finds where each object is and labels it. It does not produce pixel-accurate cut-out masks (segmentation).

Detect objects in an image

Select an image on the canvas (import one from the Add → Import panel or the SVGs library first).
In the Item panel, click Detect objects. (The button only appears when an image is selected.)
The first run downloads the AI model, which usually takes a few seconds depending on your connection; later runs reuse the cached copy and skip the download. Detection runs on your GPU when available, otherwise on the CPU.
Each detected object is outlined and labeled on the canvas.

Place a shape on a detected object

After detection, you’re in compose mode:

Pick a shape (heart, star, circle, box) and a placement — On, Around, or Above — from the banner at the top of the canvas.
Click a detected object. The shape is placed there and bound to the object with a follows relation.
Because the binding tracks the object, the shape stays on the object when you move or scale the image.
Press Esc or Done when you’re finished.

You can place as many shapes as you like before exiting.
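
The three placements can be pictured as rectangle arithmetic on the detected box. The sketch below is only an illustration under stated assumptions, not the app's actual code: the `BBox` and `Placement` types, the `placeShape` function, and the 8-unit default padding are all hypothetical.

```typescript
// Hypothetical types for illustration; the app's real data model is not documented here.
type Placement = "on" | "around" | "above";

interface BBox {
  x: number;      // left edge, canvas units
  y: number;      // top edge, canvas units (y grows downward)
  width: number;
  height: number;
}

// Returns the rectangle a square shape of `shapeSize` would occupy
// relative to a detected object's bounding box.
function placeShape(
  target: BBox,
  placement: Placement,
  shapeSize: number,
  padding: number = 8 // assumed gap between shape and object
): BBox {
  const centreX = target.x + target.width / 2;
  const centreY = target.y + target.height / 2;

  if (placement === "on") {
    // Centre the shape on the object's box.
    return {
      x: centreX - shapeSize / 2,
      y: centreY - shapeSize / 2,
      width: shapeSize,
      height: shapeSize,
    };
  }
  if (placement === "around") {
    // Grow the object's box by the padding on every side; shapeSize is ignored.
    return {
      x: target.x - padding,
      y: target.y - padding,
      width: target.width + 2 * padding,
      height: target.height + 2 * padding,
    };
  }
  // "above": centre horizontally and sit just over the top edge.
  return {
    x: centreX - shapeSize / 2,
    y: target.y - padding - shapeSize,
    width: shapeSize,
    height: shapeSize,
  };
}
```

In this sketch, On and Above keep the shape at a fixed size, while Around sizes the shape from the object's box so that it encloses the object.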

What each detected object becomes

Every detection becomes an addressable design node on the canvas — an invisible anchor you can target with relations. It carries:

a label (e.g. “dog”, “car”, “cup”),
a type drawn from a public, real-world vocabulary (so agents and the AI assistant understand what it is), and
a position that tracks the image.

That means anything relation-based works against it: follows to ride the object, circumscribes to ring it, or points_at to aim an arrow at it.
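
One way to picture "a position that tracks the image" is to store each detection box in the image's own pixel coordinates and recompute its canvas position from the image's current placement. The following is a minimal sketch under that assumption; `DetectionAnchor`, `ImagePlacement`, `anchorBoxOnCanvas`, and `followCentre` are hypothetical names, and the app's real relation engine may work differently.

```typescript
interface BBox {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Hypothetical: where the image currently sits on the canvas.
interface ImagePlacement {
  x: number;     // canvas position of the image's top-left corner
  y: number;
  scale: number; // uniform scale applied to the image
}

// Hypothetical: an invisible anchor created from one detection.
interface DetectionAnchor {
  label: string;    // e.g. "dog"
  imageId: string;  // the image this detection belongs to
  boxInImage: BBox; // box in the image's own pixel coordinates
}

// Recompute the anchor's box on the canvas from the image's current placement.
function anchorBoxOnCanvas(anchor: DetectionAnchor, image: ImagePlacement): BBox {
  return {
    x: image.x + anchor.boxInImage.x * image.scale,
    y: image.y + anchor.boxInImage.y * image.scale,
    width: anchor.boxInImage.width * image.scale,
    height: anchor.boxInImage.height * image.scale,
  };
}

// A follows-style binding: the bound shape is centred on the anchor's canvas box.
function followCentre(anchor: DetectionAnchor, image: ImagePlacement): { x: number; y: number } {
  const box = anchorBoxOnCanvas(anchor, image);
  return { x: box.x + box.width / 2, y: box.y + box.height / 2 };
}
```

Because the canvas position is derived rather than stored, moving or scaling the image changes the result of `followCentre` with no extra bookkeeping on the shape.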

Find — search your scene, or describe what to look for

The Find box searches your whole scene by meaning, not pixels: it matches your text against every item’s content, name, type, and any detected labels — so “robot” or “parasol” selects matching items instantly, with no model and no download. (It searches the design graph, so it works on drawn shapes, text, SVGs, and detected objects alike.)

If nothing in the scene matches and a photo is selected, Find offers to look inside that photo with open-vocabulary AI — type what you want (“a red mug”, “a parasol”) and it finds it even if it isn’t one of the 80 standard classes. This uses a larger model, so it’s opt-in (see below).
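
The two-stage behavior described above (search the design graph first, offer the AI search only when that finds nothing and a photo is selected) can be sketched as follows. This is an illustration, not the app's implementation: `SceneItem`, `searchScene`, `runFind`, and the outcome names are hypothetical, and the case-insensitive substring match stands in for whatever matching the app really uses.

```typescript
// Hypothetical shape of an item in the design graph.
interface SceneItem {
  id: string;
  content?: string;          // e.g. the text of a text item
  name?: string;
  type?: string;
  detectedLabels?: string[]; // labels from object detection, if any
}

// Match the query against every text field of every item (no model involved).
function searchScene(items: SceneItem[], query: string): SceneItem[] {
  const q = query.trim().toLowerCase();
  if (q === "") return [];
  return items.filter((item) => {
    const fields: string[] = [];
    if (item.content) fields.push(item.content);
    if (item.name) fields.push(item.name);
    if (item.type) fields.push(item.type);
    if (item.detectedLabels) fields.push(...item.detectedLabels);
    return fields.some((field) => field.toLowerCase().indexOf(q) !== -1);
  });
}

type FindOutcome =
  | { kind: "matches"; items: SceneItem[] }
  | { kind: "offer-ai-search"; photoId: string } // user must still opt in
  | { kind: "no-matches" };

// Graph search first; fall back to offering the open-vocabulary search
// only when nothing matched and a photo is selected.
function runFind(items: SceneItem[], query: string, selectedPhotoId?: string): FindOutcome {
  const matches = searchScene(items, query);
  if (matches.length > 0) return { kind: "matches", items: matches };
  if (selectedPhotoId !== undefined) return { kind: "offer-ai-search", photoId: selectedPhotoId };
  return { kind: "no-matches" };
}
```

Note that the fallback branch only returns an offer; in this sketch nothing is downloaded or run until the user accepts it, which mirrors the opt-in wording above.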

Models & sources

Detection runs on your device via transformers.js + ONNX runtime (Apache-2.0). The image never leaves your device — only the model weights download (once, then cached).

Feature	Model	License	Download	What it does
Detect objects	DETR ResNet-50 — Meta AI	Apache-2.0	on first use; runs fp16 on WebGPU / q8 on CPU	Finds 80 common classes (COCO: people, animals, vehicles, food, furniture…).
Find → AI (open-vocabulary)	OWL-ViT base-patch32 — Google Research	Apache-2.0	~150 MB, opt-in — downloads once, only after you agree	Finds anything you describe in text, not just fixed classes.

Open each linked model card for full capabilities, training data, and known limitations. Both models are photo-trained, so they work best on photographic content (see limits below).
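
The GPU/CPU split in the table can be expressed as a small backend choice. The sketch below is an assumption-laden illustration: `chooseBackend`, `toLabeledBox`, and the interfaces are hypothetical names. The `device` and `dtype` strings mirror option values that transformers.js accepts, and the `RawDetection` shape is assumed from its object-detection pipeline output. The model-loading call itself is left out, since how the app invokes it is not documented here.

```typescript
// Hypothetical: the options this app might pass when loading the detector.
interface BackendChoice {
  device: "webgpu" | "wasm"; // "wasm" is the CPU path in the browser
  dtype: "fp16" | "q8";      // half precision on GPU, 8-bit quantized on CPU
}

// Pick fp16 on WebGPU when available, otherwise q8 on the CPU.
// In a browser, the presence of `navigator.gpu` is the usual WebGPU check.
function chooseBackend(hasWebGPU: boolean): BackendChoice {
  return hasWebGPU
    ? { device: "webgpu", dtype: "fp16" }
    : { device: "wasm", dtype: "q8" };
}

// Assumed result shape: corner coordinates plus a label and a score.
interface RawDetection {
  label: string;
  score: number;
  box: { xmin: number; ymin: number; xmax: number; ymax: number };
}

interface LabeledBox {
  label: string;
  score: number;
  x: number;
  y: number;
  width: number;
  height: number;
}

// Convert corner coordinates into the x/y/width/height form used for outlines.
function toLabeledBox(d: RawDetection): LabeledBox {
  return {
    label: d.label,
    score: d.score,
    x: d.box.xmin,
    y: d.box.ymin,
    width: d.box.xmax - d.box.xmin,
    height: d.box.ymax - d.box.ymin,
  };
}
```

Keeping the backend decision in one pure function makes it easy to check on machines without a GPU.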

Notes & limits

Private by design — the image and the results never leave your device; only model weights download (from the model host, like any on-device AI here).
Two vocabularies — Detect objects knows 80 common classes; Find → AI (OWL-ViT) is open-vocabulary but heavier (opt-in download).
Photo-trained — both models are trained on photographs, so they’re unreliable on flat vector/illustration art. For vector, use Find (graph search) or Break apart instead.
Boxes, not masks — detections are rectangular regions, not pixel-accurate cut-outs.
Multiple images — detect across many images; each detection stays anchored to its own image and traces back to it.
Video — detection currently works on still images. Per-frame tracking for video is on the roadmap.

Relations — the behaviors (follows, circumscribes, points_at) that bind a shape to a detected object.
On-Device AI Assistant — the local AI that also powers detection.