Guide

Gradio fundamentals explained

You trained a classifier, wired up a Hugging Face Transformers pipeline, and now stakeholders want to click buttons instead of reading Jupyter cells. Gradio is a Python library that wraps your inference function in a web UI — text boxes, image uploaders, audio recorders, chat panes — and serves it locally or on Hugging Face Spaces with a shareable URL. It sits between a notebook prototype and a full FastAPI product: minutes to demo, hours to harden, not months of frontend work. This guide covers the Interface and Blocks APIs, component types, event wiring, ChatInterface for LLMs, queuing and authentication, Spaces deployment, a Harbor Support ticket triage UI worked example, a tooling decision table, common pitfalls, and a practitioner checklist.

What Gradio is (and is not)

Gradio is a Python-first UI framework for machine learning demos. You write a function that accepts typed inputs (string, image, audio, file) and returns typed outputs; Gradio renders matching widgets and calls your function on submit. The library handles WebSocket streaming, file uploads, progress bars, and optional public tunnels via share=True.

It is not a general web application framework, a model trainer, or a production API gateway. Complex multi-page apps with custom auth flows belong in React plus FastAPI. Gradio excels at single-purpose inference surfaces: try my model, compare two prompts, label this batch, preview this RAG answer. Pair it with Python backends and Ollama or cloud APIs for the model layer; use LangChain inside your handler when retrieval or agents are needed.

Core concepts

Components — typed UI widgets (gr.Textbox, gr.Image, gr.Audio, gr.Dropdown) with validation and serialization.
Interface — high-level wrapper: one function, declared inputs/outputs, auto-layout.
Blocks — low-level layout API with rows, columns, tabs, and arbitrary event graphs.
Events — .click(), .submit(), .change() bind UI actions to Python callables.
State — gr.State persists values across events without showing them to the user.
Queue — serializes concurrent requests so GPU-heavy handlers do not OOM.
Client — gradio_client calls a deployed Space or local app programmatically from another script.

Installation and your first Interface

Install with pip: pip install gradio. A minimal text classifier demo:

import gradio as gr

def classify(text: str) -> dict:
    # call your model here
    label = "billing" if "invoice" in text.lower() else "general"
    scores = {label: 0.92, "other": 0.08}
    return scores

demo = gr.Interface(
    fn=classify,
    inputs=gr.Textbox(lines=4, placeholder="Paste support ticket..."),
    outputs=gr.Label(num_top_classes=3),
    title="Ticket triage preview",
    examples=[["My payment failed"], ["How do I reset my password?"]],
)
demo.launch()

Run the script; Gradio opens http://127.0.0.1:7860. The examples array pre-fills inputs so reviewers can smoke-test without typing. For batch review, pass batch=True on inputs so Gradio sends lists to your function.

When to reach for Blocks instead

Switch to gr.Blocks() when you need multiple buttons updating different panels, conditional visibility, tabs, or chained events (upload image, then crop, then classify). Blocks mirrors a declarative layout tree:

with gr.Blocks() as demo:
    with gr.Row():
        inp = gr.Textbox(label="Question")
        btn = gr.Button("Ask")
    out = gr.Textbox(label="Answer")
    btn.click(fn=answer, inputs=inp, outputs=out)

gr.Row() and gr.Column() control responsive layout; gr.Tab() groups unrelated flows on one page.

ChatInterface and streaming LLM demos

LLM chat UIs are Gradio’s sweet spot. gr.ChatInterface wraps a function that accepts message history and returns the assistant reply — including streaming via Python generators:

def respond(message, history):
    for chunk in llm.stream(message, history):
        yield chunk

gr.ChatInterface(
    fn=respond,
    type="messages",  # OpenAI-style role/content pairs
    title="Policy Q&A preview",
).launch()

Set type="messages" for modern chat models; legacy tuple history still works with type="tuples". Enable retry and undo buttons for reviewer workflows. For RAG, load retrieved chunks inside respond and optionally expose a second output (gr.JSON) showing citations — transparency builds trust in internal demos.

Multimodal inputs

Vision and speech demos combine components: gr.Image(type="pil") feeds PIL images to your handler; gr.Audio(sources=["microphone"]) returns (sample_rate, numpy_array). Always document expected dtypes in your function signature — Gradio infers component types from annotations when using gr.Interface with typed hints.

Queuing, auth, and deployment

GPU inference is slow; without a queue, ten simultaneous clicks spawn ten model loads and crash VRAM. Call demo.queue(max_size=20) before launch(). Tune default_concurrency_limit per event when some handlers are cheap (format JSON) and others are expensive (diffusion).

Authentication and networking

auth=("user", "pass") or a custom auth function gates internal demos.
server_name="0.0.0.0" binds publicly — pair with reverse-proxy TLS, never raw exposure.
share=True creates a temporary Gradio tunnel; fine for quick reviews, not production.
root_path mounts behind nginx subpaths (e.g. /demo/).

Hugging Face Spaces

Push a app.py plus requirements.txt to a Space with SDK Gradio; Hugging Face builds and hosts the UI. Use gradio>=4.0 pins and secrets for API keys. Spaces give you HTTPS, versioning, and a discoverable URL for open models — ideal for community releases. Private Spaces mirror internal staging. When traffic outgrows free tiers, export the same app.py to your own container behind Docker.

Worked example: Harbor Support ticket triage UI

Harbor Support needed a reviewer UI for a fine-tuned ticket router without building React. The Gradio Blocks app:

Left column — gr.Textbox for raw ticket body, gr.Dropdown for priority override, gr.Button("Classify").
Right column — gr.Label for predicted queue, gr.JSON for top-3 logits, gr.Textbox for suggested macro reply (from a secondary LLM call).
State — gr.State stores ticket ID from pasted CRM JSON header.
Feedback loop — thumbs-up/down buttons append rows to a local SQLite log for relabeling.
Queue — demo.queue(default_concurrency_limit=2) because the 7B model shares one GPU.

Reviewers classified 200 tickets in a pilot week; misroute rate dropped 18% versus manual queue guessing because logits were visible. The same handler later moved behind FastAPI — Gradio proved the I/O contract.

Tooling decision table

Goal	Favor	Avoid
Quick ML demo for stakeholders	Gradio Interface or ChatInterface, examples preloaded	Custom React SPA before validating the workflow
Public open-model release	Hugging Face Space, pinned requirements, model card link	`share=True` tunnels as permanent hosting
Multi-tab internal tool with auth	Gradio Blocks + `auth` + queue	Unauthenticated `0.0.0.0` on a GPU server
High-QPS production API	FastAPI + dedicated inference (vLLM, TGI)	Gradio as the customer-facing backend
Data exploration dashboard	Streamlit or Plotly Dash for charts	Gradio when layout is mostly tables and plots, not inference
Programmatic batch from CI	`gradio_client` against a deployed Space	Headless browser clicking the UI

Common pitfalls

No queue on GPU handlers — concurrent requests OOM the GPU; always demo.queue().
Blocking the event loop — long synchronous CPU work freezes the UI; offload to threads or async handlers.
Leaking secrets — API keys in app.py committed to a public Space; use HF Secrets or env vars.
Wrong output component — returning a dict to gr.Textbox shows [object Object]; match types.
Giants unbounded uploads — 50 MB images without resize exhaust RAM; validate size in the handler.
Gradio as production auth boundary — basic auth is fine for internal demos, not customer PII at scale.
Version drift — Gradio 3 vs 4 API differences break Spaces; pin gradio==4.x in requirements.
No error surfacing — bare exceptions become opaque traces; catch and return user-readable gr.Error messages.

Production checklist

Define handler input/output types and document expected tensor or PIL shapes.
Prototype with gr.Interface; refactor to Blocks only when layout demands it.
Add examples= for one-click smoke tests by non-engineers.
Enable demo.queue() before any GPU or LLM call.
Load models once at module scope, not inside every button click.
Configure auth or place behind SSO reverse proxy for internal tools.
Pin Gradio and torch/transformers versions in requirements.txt.
Store API keys in environment variables or HF Secrets, never in git.
Validate upload size and MIME type before passing to inference.
Log latency and error rate; set show_error=True only in staging.
Plan graduation path to FastAPI when QPS or custom UX exceeds Gradio’s scope.

Key takeaways

Gradio turns Python inference functions into shareable web UIs in minutes.
Interface covers simple demos; Blocks handles complex layouts and event chains.
ChatInterface plus queuing is the fastest path to an LLM review surface.
Hugging Face Spaces host public demos; private deployments need auth and pinned deps.
Treat Gradio as a prototype and review layer — production APIs still want FastAPI and dedicated inference servers.