Jupyter fundamentals explained
A product manager asks why checkout conversion dropped 4% last Tuesday. You could
grep log files in a terminal, but the fastest path to an answer is often a
Jupyter notebook: load that day’s Parquet export, plot funnel
steps, slice by browser and payment method, and annotate findings in Markdown
beside the code that produced them. Jupyter is the de facto interactive environment
for Python data work — and increasingly for R, Julia, and SQL kernels too.
This guide covers what notebooks are and how they differ from scripts, JupyterLab
vs classic Notebook vs VS Code, cells and kernel lifecycle, IPython magic commands,
widgets and rich outputs, reproducibility with version pins and
nbconvert, production patterns with Papermill and scheduled runs, a
Harbor Analytics funnel dropout worked example, a Jupyter vs scripts vs Streamlit
decision table, common pitfalls, and a production checklist. Pair it with our
pandas fundamentals guide,
Python fundamentals guide,
and
Plotly fundamentals guide
for the full analysis stack.
What Jupyter is and why data teams adopt it
Jupyter is an open-source project for computational notebooks: documents that interleave executable code, formatted text, equations, and visualizations. The name reflects its original languages — Julia, Python, and R — though Python dominates today. A kernel executes code in a chosen language; the frontend (JupyterLab or Notebook) sends cells to the kernel and renders outputs inline.
Notebooks excel at exploratory analysis: you inspect a DataFrame,
notice a skew, filter, replot, and iterate without re-running an entire script from
scratch. Narrative Markdown cells document assumptions (“we exclude bot
traffic where user_agent matches our crawler list”) next to the
code that enforces them. That pairing is why Jupyter ships in university courses,
Kaggle kernels, and enterprise data platforms.
When notebooks are the right default
- Ad hoc investigation — funnel drops, fraud spikes, A/B test sanity checks.
- Teaching and onboarding — runnable tutorials with inline plots.
- Prototyping ML features — quick sklearn fits before hardening a training pipeline.
- Parameterized reports — weekly KPI notebooks executed by Papermill or cron.
Graduate to plain .py modules when logic stabilizes, needs unit tests,
or must run in CI without a browser. Graduate to
Streamlit
when non-technical stakeholders need self-serve dashboards, not developer cells.
JupyterLab, Notebook, VS Code, and Colab
JupyterLab is the current flagship interface: a tabbed IDE with a
file browser, terminal, extension manager, and draggable notebook panes. Install with
pip install jupyterlab and launch via jupyter lab. Most
teams should start here in 2026.
Classic Jupyter Notebook (jupyter notebook) is the
older single-document UI. It still works but receives fewer features; migrate to
Lab unless you depend on a legacy extension.
VS Code and PyCharm embed notebook editors with
IntelliSense, Git diff, and debugger breakpoints. Developers who live in an IDE
often prefer this; the kernel protocol is identical, so .ipynb files
move between Lab and VS Code freely.
Google Colab and JupyterHub host notebooks on remote GPUs or shared servers. Colab is free-tier friendly for ML experiments; JupyterHub scales notebooks for classrooms and internal platforms. Treat hosted runtimes as ephemeral — download artifacts and pin versions locally.
Cells, kernels, and the execution model
Notebooks store content as JSON (.ipynb). Each cell has
a type:
- Code — sent to the kernel; outputs (text, images, DataFrames) attach below.
- Markdown — rendered headings, lists, LaTeX, and links for narrative.
- Raw — passed through unexecuted; rare outside nbconvert templates.
The kernel maintains state between cell runs: imports, variables,
and open connections persist until you restart. That is powerful (define
df once, slice it in ten cells) and dangerous (run cells out of order
and df may mean something different than the saved output suggests).
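Because a notebook is plain JSON, the out-of-order risk described above can be checked mechanically. The sketch below uses a hand-written, minimal nbformat-4 style document (not schema-complete, and the cell contents are made up) and two hypothetical helpers, `cell_types` and `ran_out_of_order`, to show where cell types and execution counters live.

```python
import json

# A hand-written, minimal notebook document (illustrative data only).
NOTEBOOK_JSON = """
{
  "nbformat": 4,
  "nbformat_minor": 5,
  "metadata": {},
  "cells": [
    {"cell_type": "markdown", "metadata": {}, "source": ["# Funnel check"]},
    {"cell_type": "code", "metadata": {}, "execution_count": 3,
     "outputs": [], "source": ["df = load()"]},
    {"cell_type": "code", "metadata": {}, "execution_count": 2,
     "outputs": [], "source": ["df.head()"]}
  ]
}
"""


def cell_types(nb):
    """Return the cell_type of every cell, in document order."""
    return [cell["cell_type"] for cell in nb["cells"]]


def ran_out_of_order(nb):
    """True if executed code cells were not run strictly top to bottom."""
    counts = [
        cell["execution_count"]
        for cell in nb["cells"]
        if cell["cell_type"] == "code" and cell.get("execution_count") is not None
    ]
    return any(later <= earlier for earlier, later in zip(counts, counts[1:]))


nb = json.loads(NOTEBOOK_JSON)
print(cell_types(nb))        # ['markdown', 'code', 'code']
print(ran_out_of_order(nb))  # True: the second code cell ran before the first
```

A check like this only inspects saved counters; it cannot tell whether a cell was edited after it last ran, so Restart & Run All remains the stronger guarantee.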
Kernel hygiene
- Restart & Run All before sharing or committing — proves top-to-bottom reproducibility.
- Clear outputs in Git diffs with nbstripout or Jupyter’s “Clear All Outputs” to avoid megabyte JSON commits.
- One kernel per notebook — mixing Python 3.11 and 3.12 kernels across copies can cause silent dtype bugs when library versions differ between the environments.
- Interrupt vs restart — interrupt stops a hung cell; restart wipes memory when imports fail mysteriously.
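To make the "clear outputs" item concrete, here is a rough sketch of what clearing amounts to at the JSON level. `strip_outputs` is a hypothetical helper written for this guide, not nbstripout's implementation, and real tools may also clean metadata that this version leaves alone.

```python
import copy


def strip_outputs(nb):
    """Return a copy of a notebook dict with code-cell outputs cleared."""
    stripped = copy.deepcopy(nb)
    for cell in stripped["cells"]:
        if cell["cell_type"] == "code":
            cell["outputs"] = []            # drop rendered results
            cell["execution_count"] = None  # drop the In[n] counter
    return stripped


# Minimal made-up notebook with one executed cell.
nb = {
    "nbformat": 4, "nbformat_minor": 5, "metadata": {},
    "cells": [
        {"cell_type": "code", "metadata": {}, "execution_count": 7,
         "source": ["1 + 1"],
         "outputs": [{"output_type": "execute_result", "execution_count": 7,
                      "data": {"text/plain": ["2"]}, "metadata": {}}]},
    ],
}
clean = strip_outputs(nb)
print(clean["cells"][0]["outputs"])  # []
```

Working on a copy keeps the in-memory notebook intact, which matters if the same dict is later written back with its outputs for a local archive.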
Select kernels from a named conda or venv environment. Document the environment in
the first Markdown cell: python 3.12, pandas 2.2, pyarrow 16.
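One way to keep that header honest is to print the versions from the running kernel rather than typing them by hand. The sketch below relies only on the standard library; `environment_summary` is a hypothetical helper name, and the package list is just an example.

```python
import sys
from importlib import metadata


def environment_summary(packages):
    """Map 'python' and each package name to its installed version string."""
    summary = {"python": ".".join(str(part) for part in sys.version_info[:3])}
    for name in packages:
        try:
            summary[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Report the gap instead of failing the first cell of the notebook.
            summary[name] = "not installed"
    return summary


for name, version in environment_summary(["pandas", "pyarrow"]).items():
    print(f"{name}: {version}")
```

Running this as the first code cell means the recorded versions always match the kernel that produced the outputs below it.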
IPython magic commands and shell integration
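Python code cells run on the IPython kernel (ipykernel), which adds
magic commands: line magics (%) and cell magics (%%). The examples below are
separate cells, because a cell magic such as %%bash must be the first line of its cell: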
%timeit df.groupby("region")["revenue"].sum()
%pip install --quiet polars
%%bash
aws s3 cp s3://harbor-analytics/events/2026-06-08.parquet ./data/
Common magics:
- %time / %timeit — profile slow pandas groupbys before rewriting in Polars.
- %matplotlib inline — render plots under cells (still useful in Lab).
- %load_ext autoreload + %autoreload 2 — reload edited .py modules without kernel restart during library development.
- %%writefile utils.py — extract stabilized helpers into scripts (use sparingly; prefer the editor).
- ?function_name or ??class — introspect docstrings and source.
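Magics only exist inside IPython, so timing code that has been promoted to a script needs a plain-Python equivalent. The sketch below uses the standard library's timeit module on a made-up stand-in for the region/revenue groupby; `ROWS` and `revenue_by_region` are illustrative names, and the measured numbers will vary by machine.

```python
import timeit
from collections import defaultdict

# Made-up rows standing in for a DataFrame with region and revenue columns.
ROWS = [("emea", 10.0), ("amer", 20.0), ("apac", 5.0)] * 1000


def revenue_by_region(rows):
    """Sum revenue per region, the same shape of work as a groupby-sum."""
    totals = defaultdict(float)
    for region, revenue in rows:
        totals[region] += revenue
    return dict(totals)


# repeat() returns one total time per run; the minimum is the least noisy.
CALLS_PER_RUN = 20
timings = timeit.repeat(lambda: revenue_by_region(ROWS),
                        number=CALLS_PER_RUN, repeat=3)
best_per_call = min(timings) / CALLS_PER_RUN
print(f"best of 3: {best_per_call * 1e6:.1f} us per call")
```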
Prefixing a line with ! runs shell commands:
!pytest tests/test_features.py -q. Keep shell calls out of production
pipelines; they break on Windows paths and skip dependency tracking.
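When a notebook step does need an external command in something closer to production, one option is to call it through subprocess so that failures raise instead of passing quietly. This is a sketch under that assumption; `run_checked` is a hypothetical helper, and the command shown is a harmless placeholder rather than a real test run.

```python
import subprocess
import sys


def run_checked(args):
    """Run a command without a shell and raise on a non-zero exit code."""
    completed = subprocess.run(args, check=True, capture_output=True, text=True)
    return completed.stdout


# sys.executable avoids guessing which "python" is first on PATH.
output = run_checked([sys.executable, "-c", "print('tests would run here')"])
print(output.strip())
```

Passing the arguments as a list sidesteps shell quoting differences between platforms, which is one of the portability problems the paragraph above warns about.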
Rich outputs, widgets, and interactive exploration
Modern kernels render HTML tables for pandas, interactive Plotly figures, and image
thumbnails inline. Control display with from IPython.display import display, Markdown
when building multi-step reporting cells.
ipywidgets add sliders, dropdowns, and buttons bound to Python callbacks — useful for threshold tuning (“show funnel chart when minimum session count > N”). Widget state is not saved in the notebook JSON by default; document default parameter values in Markdown when sharing static exports.
For publication-quality static reports, export with
jupyter nbconvert --to html notebook.ipynb or use
Quarto / Jupyter Book to compile notebooks into
PDF and websites with cross-references.
Reproducibility: versions, data paths, and git discipline
The number-one notebook failure mode is “works on my machine.” Harden reproducibility early:
- Pin dependencies in pyproject.toml or requirements.txt; run %pip list in a hidden cell or print versions in the header.
- Parameterize paths — DATA_DIR = Path(os.getenv("HARBOR_DATA", "./data")) instead of hard-coded /Users/alice/....
- Seed randomness — np.random.seed(42) and random_state=42 in sklearn splits.
- Strip outputs before commit — use pre-commit hooks with nbstripout so Git tracks source, not megabyte plots.
- Separate data from notebook — store Parquet in object storage; notebooks reference URIs, not embedded CSV blobs.
Treat notebooks like lab notebooks: enough narrative that a colleague can Restart & Run All on Monday and reach the same conclusions you did on Friday.
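The path and seed items in the list above can be sketched with the standard library alone. The example swaps NumPy's global seed for a local `random.Random` generator so it runs without third-party packages; `data_dir` and `sample_sessions` are hypothetical helper names, and the session IDs are made up.

```python
import os
import random
from pathlib import Path


def data_dir():
    """Resolve the data directory from HARBOR_DATA, defaulting to ./data."""
    return Path(os.getenv("HARBOR_DATA", "./data"))


def sample_sessions(session_ids, k, seed=42):
    """Draw a repeatable sample using a local, seeded generator."""
    rng = random.Random(seed)
    return rng.sample(session_ids, k)


sessions = [f"s{i}" for i in range(100)]
print(data_dir() / "events.parquet")
print(sample_sessions(sessions, 3) == sample_sessions(sessions, 3))  # True
```

A local generator avoids one cell's seeding silently affecting random draws in another, which fits the guide's broader advice about hidden kernel state.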
Production patterns: Papermill, scheduling, and promotion to code
Exploration notebooks should not run unchecked in production cron. Mature teams use:
- Papermill — execute a parameterized notebook: pm.execute_notebook("template.ipynb", "output.ipynb", parameters={"report_date": "2026-06-08"}). Inject dates and tenant IDs without editing cells by hand.
- nbconvert — headless execution for CI smoke tests on tutorial notebooks.
- Promote stable logic — move validated transforms into importable harbor_analytics/funnel.py modules covered by pytest.
- Orchestrators — Airflow or Dagster tasks that trigger Papermill after upstream ETL succeeds.
The notebook remains the spec; tested Python packages become the contract downstream services import.
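Parameter injection is easier to reason about once you see the shape of the result. Papermill looks for a cell tagged `parameters` and adds a cell of overrides after it; the sketch below imitates that idea on a plain dict. It is a simplified illustration, not Papermill's code: `inject_parameters` is a hypothetical function, and the default date in the template is a placeholder.

```python
import copy


def inject_parameters(nb, params):
    """Insert a cell of overrides right after the cell tagged 'parameters'."""
    result = copy.deepcopy(nb)
    source = "\n".join(f"{name} = {value!r}" for name, value in params.items())
    injected = {
        "cell_type": "code", "metadata": {"tags": ["injected-parameters"]},
        "execution_count": None, "outputs": [], "source": source,
    }
    for index, cell in enumerate(result["cells"]):
        if "parameters" in cell.get("metadata", {}).get("tags", []):
            result["cells"].insert(index + 1, injected)
            return result
    # No tagged cell: this sketch puts the overrides first.
    result["cells"].insert(0, injected)
    return result


template = {
    "nbformat": 4, "nbformat_minor": 5, "metadata": {},
    "cells": [
        {"cell_type": "code", "metadata": {"tags": ["parameters"]},
         "execution_count": None, "outputs": [],
         "source": "report_date = '2026-01-01'"},  # placeholder default
        {"cell_type": "code", "metadata": {}, "execution_count": None,
         "outputs": [], "source": "print(report_date)"},
    ],
}
output = inject_parameters(template, {"report_date": "2026-06-08"})
print(output["cells"][1]["source"])  # report_date = '2026-06-08'
```

Because the override cell runs after the defaults, the template stays runnable by hand with its default values while scheduled runs get the injected ones.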
Worked example: Harbor Analytics checkout funnel dropout
Harbor Analytics operates a self-serve BI product. On 2026-06-07, the CEO noticed
signup-to-paid conversion fell from 12.1% to 8.4% week-over-week. An analyst opens
funnel_dropout_2026-06.ipynb in JupyterLab with the project’s
Python 3.12 kernel.
- Load events — pd.read_parquet("s3://harbor/events/dt=2026-06-07/", columns=["session_id","step","ts","browser","plan"]), then cast browser and plan to categorical dtypes to save memory.
- Define funnel steps — map raw event names to ordered stages: landing → signup_start → signup_complete → payment_submit → payment_success.
- Session-level max step — df.groupby("session_id")["step_rank"].max() to see where sessions stall.
- Compare weeks — join against prior-week Parquet; compute step conversion rates with pd.crosstab normalized by row.
- Slice the drop — filter browser == "Mobile Safari" and plan == "pro"; discover 80% of the regression concentrates there.
- Plot with Plotly — stacked bar of drop-off counts by step; annotate the payment_submit → payment_success cliff.
- Hypothesis cell — Markdown notes that a Stripe Elements upgrade shipped 2026-06-06; query error logs for card_declined spikes.
- Export artifact — save a summary CSV and HTML chart and share them in Slack; file a ticket with the notebook link and pinned kernel spec.
Total time: 45 minutes. The notebook is archived in Git (outputs stripped); Papermill
reruns the template every Monday with a fresh report_date parameter for
the ops review.
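The "define funnel steps" and "session-level max step" items above can be sketched without pandas to show the underlying logic. The events below are made up for illustration and do not reproduce the Harbor Analytics numbers; `furthest_rank`, `sessions_reaching`, and `step_conversion` are hypothetical helper names.

```python
from collections import Counter

STEPS = ["landing", "signup_start", "signup_complete",
         "payment_submit", "payment_success"]
STEP_RANK = {name: rank for rank, name in enumerate(STEPS)}

# Made-up events: (session_id, step). Real data would come from Parquet.
EVENTS = [
    ("a", "landing"), ("a", "signup_start"), ("a", "signup_complete"),
    ("a", "payment_submit"), ("a", "payment_success"),
    ("b", "landing"), ("b", "signup_start"), ("b", "signup_complete"),
    ("b", "payment_submit"),
    ("c", "landing"), ("c", "signup_start"),
    ("d", "landing"),
]


def furthest_rank(events):
    """Map each session to the highest step rank it reached."""
    furthest = {}
    for session_id, step in events:
        rank = STEP_RANK[step]
        furthest[session_id] = max(rank, furthest.get(session_id, rank))
    return furthest


def sessions_reaching(events):
    """Count sessions that reached each step or any later one."""
    stalled_at = Counter(furthest_rank(events).values())
    reached, running = {}, 0
    for rank in range(len(STEPS) - 1, -1, -1):
        running += stalled_at.get(rank, 0)
        reached[STEPS[rank]] = running
    return {step: reached[step] for step in STEPS}


def step_conversion(events):
    """Share of sessions at each step that continue to the next one."""
    reached = sessions_reaching(events)
    return {
        f"{prev} -> {nxt}": reached[nxt] / reached[prev]
        for prev, nxt in zip(STEPS, STEPS[1:])
        if reached[prev]
    }


print(sessions_reaching(EVENTS))
print(step_conversion(EVENTS)["payment_submit -> payment_success"])  # 0.5
```

Computing the same dictionary for the prior week and comparing the two per step is the plain-Python counterpart of the week-over-week crosstab; in the notebook itself the pandas version is the more natural choice.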
Jupyter vs scripts vs Streamlit vs Quarto
| Need | Jupyter notebook | Python script / package | Streamlit / Dash | Quarto / Jupyter Book |
|---|---|---|---|---|
| Exploratory analysis with narrative | Best fit | Awkward without REPL | Overkill | Good for polished reports |
| Unit-tested production logic | Promote to modules | Best fit | Thin UI layer only | Not applicable |
| Non-technical stakeholder UI | Poor | Requires frontend | Best fit | Static dashboards OK |
| Parameterized weekly reports | Papermill + nbconvert | Cron + matplotlib | Hosted app | Render to PDF/HTML |
| Git-friendly diffs | Strip outputs; consider Jupytext | Native | Moderate | Source is Markdown |
| GPU remote experiments | Colab, SageMaker Studio | SSH + script | Rare | Not applicable |
Common pitfalls
- Out-of-order execution — saved outputs reflect cell 12 run before cell 8; always Restart & Run All before publishing.
- Notebook as untested production — cron-running a 40-cell notebook with !pip install mid-flight breaks silently on version drift.
- Committing outputs — bloated repos and leaked PII in error tracebacks; strip outputs and scrub samples.
- Hidden global state — mutating df in place across cells makes debugging painful; assign df_clean = df[df["valid"]] explicitly.
- Hard-coded absolute paths — notebooks that only run on one laptop; use env vars and pathlib.
- Giant monolith notebooks — 200 cells without headings; split by question or promote helpers to modules.
- Mixing ETL and analysis — heavy Spark jobs belong in pipelines; notebooks read curated Parquet slices.
- Ignoring kernel death — OOM from loading full tables; read only the columns and rows you need, or push to Polars lazy scans.
Production checklist
- Install JupyterLab in a dedicated venv; document the jupyter lab launch command in the README.
- Pin kernel Python version and core libs (pandas, pyarrow, matplotlib) in lockfiles.
- Add a header Markdown cell with purpose, data sources, and owner contact.
- Parameterize dates and paths; avoid machine-specific directories.
- Run Restart & Run All before merge; fix any ordering bugs.
- Configure nbstripout or an equivalent pre-commit hook.
- Keep raw data out of Git; reference S3/GCS URIs with access docs.
- Extract reusable functions into tested packages once logic stabilizes.
- Use Papermill for scheduled reports with injected parameters.
- Export HTML/PDF for stakeholders who do not run kernels locally.
Key takeaways
- Jupyter pairs code and narrative in cells executed by a persistent kernel.
- JupyterLab is the default frontend; VS Code and Colab share the same .ipynb format.
- Reproducibility requires Restart & Run All, version pins, stripped outputs, and parameterized paths.
- Magic commands accelerate profiling and shell tasks; keep them to exploration, not production pipelines.
- Promote stable logic to tested Python modules; use Papermill for parameterized reporting.
Related reading
- Pandas fundamentals explained — DataFrames, groupby, and I/O patterns notebooks rely on
- NumPy fundamentals explained — ndarray operations beneath pandas columns
- Plotly fundamentals explained — interactive charts inside notebook outputs
- Streamlit fundamentals explained — when analysis graduates to a dashboard