Edge AI and on-device inference explained

For years, "AI" meant sending data to a remote datacenter and waiting for a response. That still works — cloud APIs remain the default for frontier reasoning — but a parallel stack is maturing fast: edge AI, where models run on the phone, laptop, or embedded device in your hand. Dedicated NPUs (neural processing units), aggressive quantization, and runtime engines like Core ML, ONNX Runtime, and llama.cpp make it possible to run speech recognition, vision classifiers, and even small language models without a network round trip. This guide explains what edge inference actually is, why hardware vendors are betting on it, how hybrid local-cloud architectures route requests, and when on-device beats cloud — and when it does not.

What "edge AI" means in practice

Edge AI is inference — running a trained model to produce predictions — on hardware close to the data source rather than in a centralized cloud. The "edge" can be a smartphone, a laptop with an integrated NPU, a factory camera, a car ECU, or a home router. The unifying idea: minimize the distance between sensor input and model output.

Edge AI is not the same as training models locally. Training still overwhelmingly happens in GPU clusters. Edge deployments consume compressed checkpoints — often INT8 or INT4 weights — optimized for latency and power, not for gradient updates. The model arrives pre-trained; the device executes forward passes only.

Three deployment tiers matter for product architects:

  • Cloud-only — every prompt or image goes to an API like GPT-4o or Claude. Maximum capability, highest latency and privacy exposure.
  • On-device only — a quantized 1B–8B model runs entirely locally. Works offline, but quality ceilings are lower.
  • Hybrid — a small local model handles easy tasks (dictation cleanup, intent classification, PII redaction) and escalates hard queries to the cloud. This is where most consumer products are heading in 2026.

Why run models on-device?

Cloud inference is convenient, but it carries costs that edge deployments avoid or reduce:

  • Latency — a round trip over LTE or Wi-Fi adds 50–300 ms before inference even starts. Voice assistants, live camera filters, and game NPCs feel sluggish above ~100 ms. On-device inference can respond in single-digit milliseconds for small models.
  • Privacy — medical dictation, keyboard suggestions, and document summarization on a lawyer's laptop are sensitive. Processing locally keeps plaintext off third-party servers. This is a regulatory selling point under GDPR and HIPAA, not just marketing.
  • Offline and intermittent connectivity — planes, rural areas, and congested conference Wi-Fi break cloud-only apps. Edge models keep core features alive.
  • Marginal cost at scale — cloud API bills scale linearly with tokens. Shipping a 3B-parameter model on-device shifts compute to hardware the user already paid for. At hundreds of millions of daily active users, that math drives platform decisions.
  • Availability — your feature does not go down when OpenAI or Anthropic has an outage, rate-limits your key, or changes pricing overnight.

The trade-off is capability. A 3B INT4 model on a phone cannot match a 400B cloud model on complex reasoning, long-context synthesis, or rare-domain knowledge. Edge wins on speed, privacy, and cost for narrow, high-frequency tasks — not on open-ended research questions.

Hardware: NPUs, GPUs, and the memory wall

General-purpose CPUs can run neural networks, but matrix math at scale wants parallel hardware. Three accelerator classes dominate edge inference today:

  • NPUs / AI engines — dedicated matrix-multiply blocks on Apple Silicon (Neural Engine), Qualcomm Hexagon, Google Tensor, Intel AI Boost, and AMD XDNA. Tuned for low-precision INT8/INT4 ops at minimal wattage. Apple and mobile SoC vendors lead here because battery life is non-negotiable.
  • Integrated and discrete GPUs — Apple, Intel, AMD, and NVIDIA GPUs run larger edge models on laptops and desktops. A 7B Q4_K_M GGUF model fits comfortably on a 16 GB unified-memory MacBook; discrete RTX cards with enough VRAM handle 13B+ models with room for context.
  • CPU fallback — llama.cpp uses x86 AVX2/AVX-512 and ARM NEON vector instructions to run on machines without GPUs. Slower, but universal, which makes it useful for servers, CI, and older hardware.

The binding constraint on edge LLMs is usually memory bandwidth, not raw FLOPs. Autoregressive generation reads every weight in the model once per output token, so decode speed is capped at roughly memory bandwidth divided by model size. A 7B model in FP16 needs ~14 GB just for weights; INT4 quantization cuts that to ~4 GB (3.5 GB of raw 4-bit values plus scales and metadata) with acceptable quality loss on many tasks. See our quantization guide for format details (GPTQ, AWQ, GGUF, Core ML palettes).
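The arithmetic behind those figures is easy to reproduce. The sketch below is a back-of-the-envelope estimate, not a benchmark: the function names are made up for illustration, the 100 GB/s bandwidth is an assumed round number rather than a measurement of any specific device, and it ignores the KV cache, activations, and the extra bits real 4-bit formats spend on scales.

```python
def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Bytes needed to store the weights alone (no KV cache, no activations)."""
    return n_params * bits_per_weight / 8


def max_tokens_per_second(n_params: float, bits_per_weight: float,
                          bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed if every weight is read once per token."""
    return bandwidth_gb_s * 1e9 / weight_bytes(n_params, bits_per_weight)


if __name__ == "__main__":
    # 100 GB/s is an assumed, illustrative memory bandwidth.
    for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
        gb = weight_bytes(7e9, bits) / 1e9
        tps = max_tokens_per_second(7e9, bits, bandwidth_gb_s=100)
        print(f"7B {label}: {gb:.1f} GB of weights, at most {tps:.1f} tokens/s")
```

Under these assumptions a 7B model tops out at roughly 7, 14, and 29 tokens per second for FP16, INT8, and INT4 respectively, which is why quantization helps speed as well as memory footprint.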

Thermal limits matter on phones. Sustained inference can trigger throttling within 30–60 seconds, depending on the device and ambient temperature. Product design must account for burst versus sustained workloads: a photo filter runs for about 200 ms, while a 2,000-token summary may heat the chassis and drain the battery noticeably.

Runtimes and model formats

A trained PyTorch or TensorFlow checkpoint is not what ships on a phone. Edge deployment converts weights into a runtime-specific format:

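  • Core ML (Apple) — .mlpackage bundles for Neural Engine and GPU execution on iOS and macOS. Models are converted and compressed with coremltools, which supports palettization (for example, 4-bit lookup-table weight compression) and flexible input shapes; Xcode then compiles the package for the target device.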
  • ONNX Runtime — cross-platform graph format with execution providers for CUDA, DirectML, CoreML, and TensorRT. Common for Windows and cross-vendor apps.
  • TensorRT / OpenVINO — NVIDIA and Intel-optimized engines for datacenter-edge gateways and industrial PCs.
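  • llama.cpp / MLX — open-source runtimes for quantized LLMs. llama.cpp is a C/C++ engine that loads GGUF files; MLX is Apple's array framework for Apple Silicon and typically uses weights converted to its own format. Popular for local ChatGPT-style apps on Mac and PC (LM Studio, Ollama, Jan).
  • ExecuTorch / LiteRT — Meta's and Google's mobile-focused runtimes for on-device vision and small language models on Android and iOS.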

Format choice locks you into tooling. Core ML models do not run on Android; GGUF via llama.cpp is more portable but misses Neural Engine optimizations unless you re-export to Core ML. Plan the export pipeline early: re-converting and re-quantizing a 70B model because you picked the wrong runtime is expensive.

Hybrid architectures: routing between local and cloud

Pure on-device and pure cloud are extremes. Production systems increasingly use hybrid routing:

  1. Classifier gate — a tiny local model (100M–1B params) labels the query: "simple FAQ", "needs web search", "requires frontier reasoning". Cheap and fast.
  2. Local attempt — if confidence is high, the on-device 3B–8B model answers directly. User sees instant response.
  3. Cloud escalation — low-confidence or high-stakes queries forward to a cloud API with full context. Optionally show the user: "Getting a better answer online…"
  4. Result merge — cloud response returns; local model may rewrite for tone consistency or strip sensitive spans before display.
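
The four steps above can be sketched as a single routing function. This is a minimal illustration, not a production router: `route`, `LocalAnswer`, the "simple" label, the 0.8 threshold, and the `demo_*` stand-ins are all hypothetical names and values chosen for the example, and real models would replace the callables.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class LocalAnswer:
    text: str
    confidence: float  # 0.0 to 1.0; how it is estimated is left open here


def route(query: str,
          classify: Callable[[str], str],
          local_model: Callable[[str], LocalAnswer],
          cloud_model: Callable[[str], str],
          postprocess: Callable[[str], str] = lambda text: text,
          threshold: float = 0.8) -> Tuple[str, str]:
    """Return (source, answer), where source is "local" or "cloud"."""
    # Step 1: classifier gate decides whether a local attempt is worthwhile.
    if classify(query) == "simple":
        # Step 2: local attempt, accepted only if confidence is high enough.
        answer = local_model(query)
        if answer.confidence >= threshold:
            return "local", answer.text
    # Step 3: cloud escalation for hard or low-confidence queries.
    cloud_text = cloud_model(query)
    # Step 4: result merge, e.g. a local rewrite or redaction before display.
    return "cloud", postprocess(cloud_text)


# Hypothetical stand-ins so the sketch runs without any real model.
def demo_classify(query: str) -> str:
    return "simple" if len(query.split()) <= 6 else "hard"


def demo_local(query: str) -> LocalAnswer:
    return LocalAnswer(text=f"[local] {query}", confidence=0.9)


def demo_cloud(query: str) -> str:
    return f"[cloud] {query}"


if __name__ == "__main__":
    print(route("what time is it", demo_classify, demo_local, demo_cloud))
```

Passing the models in as callables keeps the policy (gate, threshold, merge) separate from whichever runtime actually executes the local model.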

Hybrid design pairs well with agent tool use: local models handle intent parsing and PII scrubbing; cloud models execute code generation or multi-step planning. RAG retrieval can also split — embed and search a local document index on-device, then send only relevant chunks (not the full corpus) to the cloud LLM.
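
To make the split-RAG idea concrete, here is a toy sketch in which retrieval happens locally and only the top-ranked chunks are placed in the prompt that would go to the cloud. The bag-of-words `embed` function is a deliberately crude stand-in for an on-device embedding model, and all function names are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy stand-in for an on-device embedding model: bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_chunks(query: str, chunks: List[str], k: int = 2) -> List[str]:
    """Rank chunks against the query on-device and keep the best k."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_cloud_prompt(query: str, chunks: List[str], k: int = 2) -> str:
    """Only the selected chunks, never the whole corpus, enter the prompt."""
    context = "\n".join(f"- {c}" for c in top_chunks(query, chunks, k))
    return f"Context:\n{context}\n\nQuestion: {query}"


if __name__ == "__main__":
    corpus = [
        "The lease ends on 31 March.",
        "Parking is available in lot B.",
        "Rent is due on the first of each month.",
    ]
    print(build_cloud_prompt("When does the lease end?", corpus))
```

In this example the parking chunk never leaves the device, because it shares no terms with the query and ranks last.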

Routing policy is a product decision, not just engineering. Aggressive local-first saves API cost but increases hallucination risk on hard questions. Conservative cloud-first preserves quality but loses offline and privacy benefits. A/B test escalation thresholds against your actual query distribution.
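
One way to ground that threshold choice is an offline sweep over logged queries before running a live A/B test. The sketch below assumes a hypothetical log of (local confidence, was the local answer correct) pairs; the six sample entries are invented purely to show the shape of the trade-off.

```python
from typing import List, Tuple


def sweep(log: List[Tuple[float, bool]],
          thresholds: List[float]) -> List[Tuple[float, float, float]]:
    """For each threshold return (threshold, escalation_rate, local_error_rate).

    log holds (local_confidence, local_answer_was_correct) pairs.
    """
    rows = []
    for t in thresholds:
        # Answers kept local are those at or above the confidence threshold.
        kept = [ok for conf, ok in log if conf >= t]
        escalation_rate = 1 - len(kept) / len(log)
        local_error_rate = kept.count(False) / len(kept) if kept else 0.0
        rows.append((t, escalation_rate, local_error_rate))
    return rows


if __name__ == "__main__":
    # Invented sample data, for illustration only.
    sample_log = [(0.95, True), (0.90, True), (0.85, False),
                  (0.70, True), (0.60, False), (0.40, False)]
    for t, esc, err in sweep(sample_log, [0.5, 0.8, 0.9]):
        print(f"threshold={t:.1f} escalated={esc:.0%} local_errors={err:.0%}")
```

Raising the threshold sends more traffic to the cloud (higher cost) while lowering the error rate of answers served locally; the sweep shows where that curve bends for your own query mix.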

Multimodal edge: vision and speech on-device

Text LLMs get the headlines, but most edge AI today is still classical deep learning: speech-to-text, face detection, object classification, OCR, and on-camera segmentation. These models are smaller (1–500M parameters), run at 30+ FPS, and power features users expect to work instantly — keyboard dictation, photo search, live translation overlays.

Multimodal LLMs that accept images and audio in the same prompt are harder at the edge. A vision-language model needs both a vision encoder and a language decoder in memory. Phone vendors ship 1B–3B multimodal variants (Gemini Nano-class, Apple on-device vision APIs) by aggressively quantizing both towers and limiting image resolution. Full GPT-4o-class vision still requires cloud offload for high-resolution document analysis or chart reasoning.

Speech pipelines often chain edge stages: on-device wake-word detection, local streaming ASR for low-latency partial transcripts, cloud ASR for rare words or noisy environments. Each stage is a separate model with its own latency budget.

Security, privacy, and update mechanics

On-device inference improves privacy but does not eliminate risk:

  • Model extraction — weights in an app bundle can be reverse-engineered. Treat edge models as obfuscated, not secret. Do not embed API keys or proprietary data in weights.
  • Adversarial inputs — local models are still vulnerable to prompt injection if they call tools or render HTML. Sandboxing matters on-device too.
  • Model updates — cloud models update server-side instantly. Edge models require app updates or over-the-air delta downloads (often 500 MB–2 GB for LLMs). Plan staged rollouts and rollback.
  • Telemetry — "on-device" marketing fails if you log every prompt to analytics. Audit what leaves the device even when inference is local.
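
For the staged rollouts mentioned under model updates, a common pattern is deterministic bucketing: hash a device identifier into a stable bucket and ship the new checkpoint only to buckets below the current rollout percentage. The sketch below illustrates that pattern under assumptions; the function names and the identifier format are hypothetical.

```python
import hashlib


def rollout_bucket(device_id: str, model_version: str) -> int:
    """Map a device to a stable bucket in [0, 100) for a given model version."""
    digest = hashlib.sha256(f"{model_version}:{device_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100


def should_receive(device_id: str, model_version: str,
                   rollout_percent: int) -> bool:
    """True if this device falls inside the current rollout percentage."""
    return rollout_bucket(device_id, model_version) < rollout_percent


if __name__ == "__main__":
    devices = [f"device-{i}" for i in range(1000)]
    for pct in (1, 10, 50, 100):
        n = sum(should_receive(d, "llm-v2", pct) for d in devices)
        print(f"{pct:>3}% rollout reaches {n} of {len(devices)} devices")
```

Because the bucket is a pure function of the device and version, widening the rollout only ever adds devices, and setting the percentage back to 0 acts as the rollback switch for devices that have not yet downloaded the update.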

Choosing edge vs cloud: a decision framework

Use this checklist when scoping a feature:

  • Latency SLA — need sub-100 ms? Edge or bust. Can tolerate 2–5 s? Cloud is fine.
  • Quality bar — legal/medical/coding tasks likely need cloud escalation at least as fallback.
  • Data sensitivity — PII, biometrics, or air-gapped environments strongly favor edge.
  • Query volume per user — high-frequency autocomplete justifies edge amortization; once-a-week summaries do not.
  • Offline requirement — hard constraint? Edge is mandatory for core path.
  • Model size budget — how much RAM and storage can you consume? A 7B Q4 model needs ~5 GB once KV cache and runtime overhead are included; a 1B INT8 model needs ~1 GB for weights alone.
  • Update cadence — weekly model improvements favor cloud; stable behavior favors pinned edge checkpoints.
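
As a rough illustration, part of the checklist can be encoded as a first-pass scoping function. This is one possible reading of the checklist, not a rule: the field names, the ordering of the checks, and the returned labels are assumptions, and query volume and update cadence are left out to keep the sketch short.

```python
from dataclasses import dataclass


@dataclass
class FeatureSpec:
    latency_sla_ms: int          # target end-to-end latency
    needs_offline: bool          # must the core path work without a network?
    sensitive_data: bool         # PII, biometrics, air-gapped environments
    high_quality_bar: bool       # legal, medical, or coding-grade answers
    model_ram_gb: float          # footprint of the candidate local model
    device_ram_budget_gb: float  # RAM the feature may consume on-device


def recommend(spec: FeatureSpec) -> str:
    """Return "edge", "hybrid", "cloud", or "rescope" as a first-pass answer."""
    fits = spec.model_ram_gb <= spec.device_ram_budget_gb
    edge_required = spec.needs_offline or spec.latency_sla_ms < 100
    if edge_required:
        if not fits:
            return "rescope"  # hard constraint, but the model does not fit
        # Edge runs the core path; cloud is only a fallback for hard cases.
        return "hybrid" if spec.high_quality_bar else "edge"
    if spec.sensitive_data and fits:
        return "hybrid" if spec.high_quality_bar else "edge"
    return "cloud"


if __name__ == "__main__":
    autocomplete = FeatureSpec(50, False, True, False, 1.0, 2.0)
    weekly_summary = FeatureSpec(3000, False, False, True, 5.0, 2.0)
    print(recommend(autocomplete), recommend(weekly_summary))
```

Treat the output as a conversation starter for scoping, since the real decision also depends on the factors the sketch omits.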

Many teams ship v1 cloud-only to validate product-market fit, then add edge layers once query patterns stabilize and unit economics hurt. That sequencing is rational — premature edge optimization burns engineering time on the wrong bottleneck.

Key takeaways

  • Edge AI runs inference on local hardware — phones, laptops, embedded devices — to cut latency, improve privacy, and reduce cloud API costs.
  • NPUs and quantization (INT8/INT4) make billion-parameter models feasible within phone and laptop memory budgets.
  • Hybrid routing — local model for easy tasks, cloud escalation for hard ones — is the dominant 2026 consumer architecture.
  • Edge excels at high-frequency, narrow tasks; cloud still wins on frontier reasoning and long-context synthesis.
  • Plan runtime format (Core ML, ONNX, GGUF), thermal limits, and OTA update size before committing to on-device LLMs.
  • Privacy claims require auditing telemetry — local inference alone is not enough.
