Gradio fundamentals explained
You trained a classifier, wired up a Hugging Face Transformers pipeline, and now stakeholders want to click buttons instead of reading Jupyter cells. Gradio is a Python library that wraps your inference function in a web UI — text boxes, image uploaders, audio recorders, chat panes — and serves it locally or on Hugging Face Spaces with a shareable URL. It sits between a notebook prototype and a full FastAPI product: minutes to demo, hours to harden, not months of frontend work. This guide covers the Interface and Blocks APIs, component types, event wiring, ChatInterface for LLMs, queuing and authentication, Spaces deployment, a Harbor Support ticket triage UI worked example, a tooling decision table, common pitfalls, and a practitioner checklist.
What Gradio is (and is not)
Gradio is a Python-first UI framework for machine learning demos.
You write a function that accepts typed inputs (string, image, audio, file) and
returns typed outputs; Gradio renders matching widgets and calls your function on
submit. The library handles streaming of partial results to the browser, file
uploads, progress bars, and optional public tunnels via share=True.
It is not a general web application framework, a model trainer, or a production API gateway. Complex multi-page apps with custom auth flows belong in React plus FastAPI. Gradio excels at single-purpose inference surfaces: try my model, compare two prompts, label this batch, preview this RAG answer. Pair it with Python backends and Ollama or cloud APIs for the model layer; use LangChain inside your handler when retrieval or agents are needed.
Core concepts
- Components: typed UI widgets (gr.Textbox, gr.Image, gr.Audio, gr.Dropdown) with validation and serialization.
- Interface: high-level wrapper: one function, declared inputs/outputs, auto-layout.
- Blocks: low-level layout API with rows, columns, tabs, and arbitrary event graphs.
- Events: .click(), .submit(), .change() bind UI actions to Python callables.
- State: gr.State persists values across events without showing them to the user.
- Queue: serializes concurrent requests so GPU-heavy handlers do not OOM.
- Client: gradio_client calls a deployed Space or local app programmatically from another script.
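To make the Events and State concepts concrete, here is a minimal sketch of a per-session counter. The handler name record_review and the build_demo wrapper are illustrative, not part of Gradio; the gradio import sits inside build_demo only so the handler can be exercised without the UI installed.

```python
def record_review(ticket: str, seen: list[str]) -> tuple[list[str], str]:
    """Add the ticket to the session's list and return (new state, summary)."""
    updated = seen + [ticket]  # build a new list instead of mutating the state value
    return updated, f"{len(updated)} ticket(s) reviewed this session"


def build_demo():
    import gradio as gr  # imported lazily so record_review stays testable on its own

    with gr.Blocks() as demo:
        seen = gr.State([])  # hidden value, kept separately for each browser session
        ticket = gr.Textbox(label="Ticket")
        summary = gr.Textbox(label="Session summary")
        # Pressing Enter in the textbox fires .submit(); the State is both input and output.
        ticket.submit(fn=record_review, inputs=[ticket, seen], outputs=[seen, summary])
    return demo
```

Calling build_demo().launch() serves the page; each visitor gets an independent list because gr.State is scoped to the session.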
Installation and your first Interface
Install with pip: pip install gradio. A minimal text classifier demo:
```python
import gradio as gr

def classify(text: str) -> dict:
    # call your model here
    label = "billing" if "invoice" in text.lower() else "general"
    scores = {label: 0.92, "other": 0.08}
    return scores

demo = gr.Interface(
    fn=classify,
    inputs=gr.Textbox(lines=4, placeholder="Paste support ticket..."),
    outputs=gr.Label(num_top_classes=3),
    title="Ticket triage preview",
    examples=[["My payment failed"], ["How do I reset my password?"]],
)
demo.launch()
```
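Run the script; by default Gradio serves the app at http://127.0.0.1:7860. The
examples array pre-fills inputs so reviewers can smoke-test without
typing. For batch review, pass batch=True (and optionally max_batch_size) to
gr.Interface so Gradio groups queued requests and sends lists to your function;
the function must then accept a list per input and return a list per output.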
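A batched handler has a different shape from the single-item classify function, which is easy to get wrong. The sketch below assumes the same toy keyword rule as the demo above; classify_batch and build_demo are illustrative names, and max_batch_size=8 is an arbitrary choice.

```python
def classify_batch(texts: list[str]) -> list[list[dict[str, float]]]:
    """Batched handler: receives a list of inputs, returns one list per output component."""
    results = []
    for text in texts:
        label = "billing" if "invoice" in text.lower() else "general"
        results.append({label: 0.92, "other": 0.08})
    return [results]  # outer list has one entry because there is one output component


def build_demo():
    import gradio as gr  # lazy import keeps classify_batch testable without the UI

    return gr.Interface(
        fn=classify_batch,
        inputs=gr.Textbox(lines=4),
        outputs=gr.Label(num_top_classes=3),
        batch=True,        # Gradio groups queued requests into lists
        max_batch_size=8,  # assumed cap; tune to what the model fits in memory
    )
```

For example, classify_batch(["Invoice overdue", "hello"]) returns a single outer list holding two score dicts, one per ticket.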
When to reach for Blocks instead
Switch to gr.Blocks() when you need multiple buttons updating different
panels, conditional visibility, tabs, or chained events (upload image, then crop,
then classify). Blocks mirrors a declarative layout tree:
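```python
import gradio as gr

def answer(question: str) -> str:
    # Placeholder handler; call your model here.
    return f"You asked: {question}"

with gr.Blocks() as demo:
    with gr.Row():
        inp = gr.Textbox(label="Question")
        btn = gr.Button("Ask")
    out = gr.Textbox(label="Answer")
    btn.click(fn=answer, inputs=inp, outputs=out)
```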
gr.Row() and gr.Column() control responsive layout;
gr.Tab() groups unrelated flows on one page.
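Chained events and tabs can be sketched in a few lines. This is an illustrative example, not a prescribed pattern: normalize and route are stand-ins for real preprocessing and model calls, and the second step is attached with .then() so it runs after the first finishes.

```python
def normalize(text: str) -> str:
    """Step 1: collapse whitespace and lowercase the ticket."""
    return " ".join(text.split()).lower()


def route(cleaned: str) -> str:
    """Step 2: toy routing rule standing in for a model call."""
    return "billing" if "invoice" in cleaned else "general"


def build_demo():
    import gradio as gr  # lazy import so the two steps can be tested without the UI

    with gr.Blocks() as demo:
        with gr.Tab("Triage"):
            raw = gr.Textbox(label="Raw ticket")
            btn = gr.Button("Run")
            cleaned = gr.Textbox(label="Cleaned text")
            queue_name = gr.Textbox(label="Queue")
        with gr.Tab("About"):
            gr.Markdown("Internal preview.")
        # click runs normalize first; .then() runs route once that step has completed
        btn.click(fn=normalize, inputs=raw, outputs=cleaned).then(
            fn=route, inputs=cleaned, outputs=queue_name
        )
    return demo
```

Keeping each step as its own function means the intermediate value is visible in the UI and each step can be unit-tested separately.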
ChatInterface and streaming LLM demos
LLM chat UIs are Gradio’s sweet spot. gr.ChatInterface wraps a
function that accepts message history and returns the assistant reply —
including streaming via Python generators:
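```python
import gradio as gr

def respond(message, history):
    # `llm` is a placeholder for your own streaming client.
    partial = ""
    for chunk in llm.stream(message, history):
        partial += chunk
        yield partial  # each yield replaces the reply shown so far

gr.ChatInterface(
    fn=respond,
    type="messages",  # OpenAI-style role/content dicts
    title="Policy Q&A preview",
).launch()
```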
Set type="messages" for modern chat models; legacy tuple history
still works with type="tuples". Enable retry and
undo buttons for reviewer workflows. For RAG, load retrieved
chunks inside respond and optionally expose a second output
(gr.JSON) showing citations — transparency builds trust in
internal demos.
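The streaming contract is easier to see with a stand-in model. In this sketch fake_token_stream is a hypothetical replacement for a real LLM client, and the history is assumed to arrive as a list of role/content dicts, as it does with type="messages". Each yielded string is the whole reply so far, not just the newest token.

```python
from collections.abc import Iterator


def fake_token_stream(prompt: str) -> Iterator[str]:
    """Hypothetical stand-in for a streaming LLM client."""
    for word in f"Echo: {prompt}".split(" "):
        yield word + " "


def respond(message: str, history: list[dict]) -> Iterator[str]:
    # Count earlier user turns from the role/content history.
    turns = sum(1 for m in history if m["role"] == "user")
    partial = ""
    for token in fake_token_stream(f"(turn {turns + 1}) {message}"):
        partial += token
        yield partial.rstrip()  # cumulative text; the UI overwrites the previous value


def build_demo():
    import gradio as gr  # lazy import so respond can be tested without the UI

    return gr.ChatInterface(fn=respond, type="messages", title="Policy Q&A preview")
```

Swapping fake_token_stream for a real client should only require that it yields text fragments in order.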
Multimodal inputs
Vision and speech demos combine components: gr.Image(type="pil") feeds
PIL images to your handler; gr.Audio(sources=["microphone"]) passes a
(sample_rate, numpy_array) tuple. Always document expected dtypes in your
function signature; Gradio hands over whatever the component's type setting
produces and does not check it against your annotations, so the hints serve
readers rather than the framework.
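As an illustration of handling the audio tuple, the sketch below summarizes a recording. It assumes the numpy form described above, with samples shaped (n,) for mono or (n, channels) for multi-channel; describe_audio is an illustrative name and the returned keys are arbitrary.

```python
from typing import Optional

import numpy as np


def describe_audio(audio: Optional[tuple[int, np.ndarray]]) -> dict:
    """Summarize a (sample_rate, samples) tuple; None means nothing was recorded."""
    if audio is None:
        return {"duration_s": 0.0, "channels": 0, "peak": 0.0}
    sample_rate, samples = audio
    channels = 1 if samples.ndim == 1 else samples.shape[1]
    # Cast before abs(): the most negative int16 value overflows if negated in int16.
    peak = float(np.abs(samples.astype(np.float64)).max()) if samples.size else 0.0
    return {
        "duration_s": round(samples.shape[0] / sample_rate, 3),
        "channels": channels,
        "peak": peak,
    }
```

A handler like this pairs naturally with a gr.JSON output, so reviewers can see what the model actually received.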
Queuing, auth, and deployment
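GPU inference is slow; without a concurrency limit, ten simultaneous clicks can
run ten inference calls at once (or ten model loads, if the model is loaded
inside the handler) and exhaust VRAM. Call demo.queue(max_size=20) before
launch(). Set default_concurrency_limit in queue() and override it with
concurrency_limit on individual events when some handlers are cheap (format
JSON) and others are expensive (diffusion).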
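A sketch of mixed limits, assuming a Gradio 4-style API where event listeners accept concurrency_limit and queue() accepts default_concurrency_limit. The handlers format_json and generate are placeholders, and the specific numbers are arbitrary.

```python
import json


def format_json(raw: str) -> str:
    """Cheap handler: pretty-print a JSON string."""
    return json.dumps(json.loads(raw), indent=2, sort_keys=True)


def generate(prompt: str) -> str:
    """Placeholder for an expensive GPU call."""
    return f"[generated for: {prompt}]"


def build_demo():
    import gradio as gr  # lazy import so the handlers can be tested without the UI

    with gr.Blocks() as demo:
        raw = gr.Textbox(label="JSON")
        pretty = gr.Textbox(label="Formatted")
        fmt_btn = gr.Button("Format")
        prompt = gr.Textbox(label="Prompt")
        result = gr.Textbox(label="Result")
        gen_btn = gr.Button("Generate")
        # Cheap work: None lifts the limit for this event only.
        fmt_btn.click(format_json, inputs=raw, outputs=pretty, concurrency_limit=None)
        # Expensive work: one call at a time so a single GPU is never oversubscribed.
        gen_btn.click(generate, inputs=prompt, outputs=result, concurrency_limit=1)
    # Events without their own limit fall back to the default set here.
    demo.queue(max_size=20, default_concurrency_limit=2)
    return demo
```

With max_size=20, requests beyond the twentieth waiting job are rejected instead of piling up indefinitely.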
Authentication and networking
- auth=("user", "pass") or a custom auth function gates internal demos.
- server_name="0.0.0.0" binds to all network interfaces; pair with reverse-proxy TLS, never raw exposure.
- share=True creates a temporary Gradio tunnel; fine for quick reviews, not production.
- root_path mounts behind nginx subpaths (e.g. /demo/).
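A custom auth function is simply a callable that takes a username and password and returns a boolean. The sketch below is one possible shape, not a recommended security design: the environment variable names DEMO_USER and DEMO_PASS are assumptions, and the serve helper only illustrates where the launch options go.

```python
import hmac
import os


def check_credentials(username: str, password: str) -> bool:
    """Compare against credentials read from the environment (variable names are assumed)."""
    expected_user = os.environ.get("DEMO_USER", "")
    expected_pass = os.environ.get("DEMO_PASS", "")
    if not expected_user or not expected_pass:
        return False  # refuse every login when nothing is configured
    # compare_digest avoids leaking match position through timing differences
    user_ok = hmac.compare_digest(username.encode(), expected_user.encode())
    pass_ok = hmac.compare_digest(password.encode(), expected_pass.encode())
    return user_ok and pass_ok


def serve(demo) -> None:
    """Launch behind a reverse proxy that terminates TLS and forwards /demo/."""
    demo.launch(
        auth=check_credentials,
        server_name="127.0.0.1",  # loopback only; the proxy is the public face
        root_path="/demo",
    )
```

Binding to loopback and letting the proxy handle exposure keeps the Gradio process off the public network entirely.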
Hugging Face Spaces
Push an app.py plus requirements.txt to a Space with the Gradio SDK
selected; Hugging Face builds and hosts the UI. Constrain the Gradio version
(gradio>=4.0 at minimum; an exact gradio==4.x pin is safer) and use secrets for
API keys. Spaces give you HTTPS, versioning, and a discoverable URL for open
models, which is ideal for community releases. Private Spaces mirror internal
staging. When traffic outgrows free tiers, move the same app.py into your own
Docker container.
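Space secrets reach the app as environment variables, so the same lookup works on Spaces and in a local container. A minimal sketch follows; the secret name MODEL_API_KEY is an assumption for illustration.

```python
import os


def load_api_key(name: str = "MODEL_API_KEY") -> str:
    """Read a key from the environment and fail loudly at startup if it is missing."""
    key = os.environ.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {name} secret in the Space settings or your shell")
    return key
```

Calling this once at import time in app.py makes a misconfigured Space fail during build rather than on the first user request.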
Worked example: Harbor Support ticket triage UI
Harbor Support needed a reviewer UI for a fine-tuned ticket router without building React. The Gradio Blocks app:
- Left column: gr.Textbox for raw ticket body, gr.Dropdown for priority override, gr.Button("Classify").
- Right column: gr.Label for predicted queue, gr.JSON for top-3 logits, gr.Textbox for suggested macro reply (from a secondary LLM call).
- State: gr.State stores ticket ID from pasted CRM JSON header.
- Feedback loop: thumbs-up/down buttons append rows to a local SQLite log for relabeling.
- Queue: demo.queue(default_concurrency_limit=2) because the 7B model shares one GPU.
Reviewers classified 200 tickets in a pilot week; misroute rate dropped 18% versus manual queue guessing because logits were visible. The same handler later moved behind FastAPI — Gradio proved the I/O contract.
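The feedback loop can be sketched with the standard-library sqlite3 module. The table layout, column names, and function names below are assumptions made for illustration; the source does not describe Harbor Support's actual schema.

```python
import sqlite3
from datetime import datetime, timezone


def init_log(path: str = "feedback.db") -> sqlite3.Connection:
    """Open (or create) the feedback log; the schema is an assumed example."""
    # check_same_thread=False because UI handlers may run on worker threads;
    # guard writes with a lock if several reviewers click at once.
    conn = sqlite3.connect(path, check_same_thread=False)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS feedback (
               ticket_id TEXT NOT NULL,
               predicted_queue TEXT NOT NULL,
               thumbs_up INTEGER NOT NULL,
               logged_at TEXT NOT NULL
           )"""
    )
    conn.commit()
    return conn


def log_feedback(conn: sqlite3.Connection, ticket_id: str,
                 predicted_queue: str, thumbs_up: bool) -> int:
    """Append one row and return the total number of rows logged."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO feedback VALUES (?, ?, ?, ?)",
            (ticket_id, predicted_queue, int(thumbs_up),
             datetime.now(timezone.utc).isoformat()),
        )
    return conn.execute("SELECT COUNT(*) FROM feedback").fetchone()[0]
```

A thumbs-up button's click handler would call log_feedback with the ticket ID held in gr.State and the predicted queue; the returned count is handy for a small "rows logged" indicator.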
Tooling decision table
| Goal | Favor | Avoid |
|---|---|---|
| Quick ML demo for stakeholders | Gradio Interface or ChatInterface, examples preloaded | Custom React SPA before validating the workflow |
| Public open-model release | Hugging Face Space, pinned requirements, model card link | share=True tunnels as permanent hosting |
| Multi-tab internal tool with auth | Gradio Blocks + auth + queue | Unauthenticated 0.0.0.0 on a GPU server |
| High-QPS production API | FastAPI + dedicated inference (vLLM, TGI) | Gradio as the customer-facing backend |
| Data exploration dashboard | Streamlit or Plotly Dash for charts | Gradio when layout is mostly tables and plots, not inference |
| Programmatic batch from CI | gradio_client against a deployed Space | Headless browser clicking the UI |
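For the last row, a programmatic batch might look like the sketch below. The Space ID passed to connect and the /predict endpoint name are assumptions; check the deployed app's API page for the real endpoint. The batch function takes the client as a parameter so it can be tested with a stub.

```python
def classify_tickets(client, tickets: list[str], api_name: str = "/predict") -> list:
    """Send each ticket to the deployed app and collect the raw results in order."""
    return [client.predict(text, api_name=api_name) for text in tickets]


def connect(space_id: str):
    """Create a client for a Space ID such as 'your-org/your-space' (placeholder)."""
    from gradio_client import Client  # lazy import: only needed for real calls

    return Client(space_id)
```

In CI this would be called as classify_tickets(connect("your-org/your-space"), tickets), with the Space ID replaced by a real one.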
Common pitfalls
- No queue on GPU handlers: concurrent requests OOM the GPU; always demo.queue().
- Blocking the event loop: long synchronous CPU work freezes the UI; offload to threads or async handlers.
- Leaking secrets: API keys in app.py committed to a public Space; use HF Secrets or env vars.
- Wrong output component: returning a dict to gr.Textbox shows the dict's raw string form instead of a structured view; match types (gr.JSON or gr.Label for dicts).
- Unbounded uploads: 50 MB images without resize exhaust RAM; validate size in the handler.
- Gradio as production auth boundary: basic auth is fine for internal demos, not customer PII at scale.
- Version drift: Gradio 3 vs 4 API differences break Spaces; pin gradio==4.x in requirements.
- No error surfacing: bare exceptions become opaque traces; catch them and raise gr.Error with a user-readable message.
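Two of these pitfalls, unbounded uploads and opaque errors, can be handled together. The sketch assumes the image component is configured to pass a file path; the 10 MB cap and the extension allow-list are arbitrary choices, and an extension check is only a rough proxy for real MIME validation.

```python
import os

MAX_UPLOAD_BYTES = 10 * 1024 * 1024  # assumed 10 MB cap
ALLOWED_EXTENSIONS = {".png", ".jpg", ".jpeg"}


def check_upload(filename: str, size_bytes: int,
                 max_bytes: int = MAX_UPLOAD_BYTES) -> None:
    """Raise ValueError with a readable message if the upload should be rejected."""
    ext = os.path.splitext(filename)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"Unsupported file type: {ext or 'no extension'}")
    if size_bytes > max_bytes:
        raise ValueError(f"File is {size_bytes} bytes; the limit is {max_bytes}")


def classify_image(path: str) -> str:
    """Handler sketch: validate first, then surface problems as gr.Error."""
    import gradio as gr  # lazy import so check_upload is testable without the UI

    try:
        check_upload(path, os.path.getsize(path))
    except (ValueError, OSError) as exc:
        raise gr.Error(str(exc)) from exc  # shown to the user as a readable message
    return "accepted"  # placeholder for the real inference call
```

Keeping the validation free of Gradio imports means the rules can be unit-tested without starting the app.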
Production checklist
- Define handler input/output types and document expected tensor or PIL shapes.
- Prototype with gr.Interface; refactor to Blocks only when layout demands it.
- Add examples= for one-click smoke tests by non-engineers.
- Enable demo.queue() before any GPU or LLM call.
- Load models once at module scope, not inside every button click.
- Configure auth or place the app behind an SSO reverse proxy for internal tools.
- Pin Gradio and torch/transformers versions in requirements.txt.
- Store API keys in environment variables or HF Secrets, never in git.
- Validate upload size and MIME type before passing to inference.
- Log latency and error rate; set show_error=True only in staging.
- Plan graduation path to FastAPI when QPS or custom UX exceeds Gradio’s scope.
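The "load models once" item is the one most often missed. One way to do it, sketched here with a cached loader instead of a bare module-level variable, is shown below; the lambda stands in for a real model object and the function names are illustrative.

```python
from functools import lru_cache


@lru_cache(maxsize=1)
def get_model():
    """Load the model on first use and reuse it afterwards (placeholder model)."""
    # A real implementation would load weights here, which is the slow part.
    return lambda text: "billing" if "invoice" in text.lower() else "general"


def classify(text: str) -> str:
    """Handler: fetches the cached model instead of reloading it per click."""
    return get_model()(text)
```

Calling get_model() once at startup, before launch(), moves the load time out of the first reviewer's request.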
Key takeaways
- Gradio turns Python inference functions into shareable web UIs in minutes.
- Interface covers simple demos; Blocks handles complex layouts and event chains.
- ChatInterface plus queuing is the fastest path to an LLM review surface.
- Hugging Face Spaces host public demos; private deployments need auth and pinned deps.
- Treat Gradio as a prototype and review layer — production APIs still want FastAPI and dedicated inference servers.
Related reading
- Hugging Face Transformers explained — pipelines and models Gradio demos wrap
- FastAPI fundamentals explained — graduate demos to production HTTP APIs
- LLM fine-tuning explained — train the model your Gradio UI previews
- RAG explained — ground chat demos with retrieved context