Guide

Pandas fundamentals explained

pandas is the standard Python library for loading, cleaning, reshaping, and analyzing tabular data. Built on NumPy, it introduces two core structures: Series (one-dimensional labeled arrays) and DataFrame (two-dimensional tables with named columns and an index). Data scientists reach for pandas when CSV exports, SQL query results, API logs, and feature stores need exploration before modeling — it is the glue between raw data and libraries like scikit-learn. Unlike spreadsheet tools, pandas operations are scriptable, reproducible, and scale to millions of rows on a single machine when you choose efficient dtypes and avoid Python loops. This guide covers object fundamentals, I/O, indexing with loc and iloc, dtype hygiene, missing values, groupby aggregations, merges and joins, time-series basics, a Harbor Fleet trip analytics worked example, a tooling decision table, common pitfalls, and a practitioner checklist — alongside our Python fundamentals guide, machine learning overview, and feature engineering guide.

Series and DataFrame — the tabular model

A Series is a single column with an index — think one spreadsheet column with row labels. A DataFrame is a collection of aligned Series sharing the same index. Columns can hold different dtypes: integers, floats, strings, booleans, datetimes, and pandas category for low-cardinality enums.

import pandas as pd

df = pd.DataFrame({
    "trip_id": [101, 102, 103],
    "driver_id": ["D12", "D07", "D12"],
    "fare_usd": [14.50, 22.00, 9.75],
    "completed_at": pd.to_datetime(["2026-06-01 08:12", "2026-06-01 09:44", "2026-06-01 11:03"]),
})
df.dtypes  # inspect column types before analysis

The index defaults to 0, 1, 2… but can be a datetime, user ID, or composite key. Setting a meaningful index (e.g. trip_id) speeds lookups and joins. df.shape returns (rows, columns); df.info() prints non-null counts and memory; df.describe() summarizes numeric columns. Always run these three after loading unfamiliar data.

Why dtypes matter

Loading CSVs often infers object dtype for numeric columns with a few bad strings, or stores ZIP codes as floats. Downcasting saves memory and prevents silent bugs: use pd.to_numeric(errors="coerce") for messy numbers, astype("category") for country codes with dozens of unique values, and pd.to_datetime with an explicit format or utc=True for timestamps. A 10 GB float64 column becomes ~2.5 GB as float32 when precision allows.

Loading and exporting data

pandas reads most tabular formats through a consistent API:

  • pd.read_csv() — delimited text; set parse_dates, dtype, and usecols to control inference and memory.
  • pd.read_parquet() — columnar binary format; preserves dtypes; preferred for analytics pipelines and S3 archives.
  • pd.read_sql() — pull query results from PostgreSQL or SQLite via SQLAlchemy connections.
  • pd.read_json() — nested JSON with lines=True for newline-delimited logs.

Export with symmetric writers: to_csv, to_parquet, to_sql. For large files, pass chunksize to read_csv and process incrementally instead of loading everything into RAM. Parquet with Snappy or Zstd compression is the default interchange format between ETL jobs and notebook exploration.

Indexing and selection — loc, iloc, and boolean filters

pandas offers label-based and position-based indexing. Mixing them incorrectly is the most common beginner bug.

loc — label-based

df.loc[row_label, column_label] includes the end point in slices. Select rows where fare_usd > 15 and only the driver_id column:

high_fares = df.loc[df["fare_usd"] > 15, "driver_id"]

iloc — integer position

df.iloc[0:2, 1:3] selects the first two rows and columns at positions 1 and 2 (half-open interval, like Python slicing). Use iloc when order matters and labels are irrelevant.

Boolean indexing and query

Combine conditions with &, |, and ~ — wrap each condition in parentheses. df.query("fare_usd > 10 and driver_id == 'D12'") reads cleanly for ad hoc filters. For repeated filters in production, prefer explicit boolean masks stored as variables so logic is testable.

SettingWithCopyWarning appears when you chain filters then assign — pandas cannot tell if you are modifying a view or a copy. Fix it with .loc on the filtered frame or call .copy() after filtering before mutating.

Cleaning missing values and duplicates

Real data is incomplete. pandas represents missing values as NaN (float columns) or pd.NA (nullable integer and string dtypes introduced in pandas 1.0+).

  • isna() / notna() — boolean masks for filtering or counting nulls per column.
  • dropna(subset=["col"], how="all") — remove rows or columns with missing values; specify thresh to keep rows with at least N non-null fields.
  • fillna(value) / ffill() / bfill() — impute constants or propagate last/next valid observation; dangerous on shuffled data without time order.
  • interpolate() — linear or time-based gap filling for ordered series.

For modeling pipelines, imputation belongs inside sklearn pipelines so validation folds do not leak statistics. In exploratory pandas work, df.isna().mean() quickly ranks columns by missingness. Deduplicate with drop_duplicates(subset=["trip_id"], keep="last") after ingestion — API retries often produce duplicate primary keys.

GroupBy — split, apply, combine

groupby is pandas’ answer to SQL GROUP BY. Split rows by one or more keys, compute aggregations per group, and return a summarized DataFrame or Series.

driver_stats = (
    df.groupby("driver_id", as_index=False)
      .agg(
          trips=("trip_id", "count"),
          total_fare=("fare_usd", "sum"),
          avg_fare=("fare_usd", "mean"),
      )
)

Named aggregations (pandas 0.25+) produce clear column names. Multi-key grouping — df.groupby(["driver_id", df["completed_at"].dt.date]) — builds daily per-driver rollups. transform returns a Series aligned to the original index (useful for per-group z-scores); apply runs arbitrary functions per group but is slower — prefer built-in aggregations when possible.

Pivot tables and crosstabs

pd.pivot_table(df, index="driver_id", columns="weekday", values="fare_usd", aggfunc="sum") reshapes long data into a matrix — the pandas equivalent of spreadsheet pivot tables. pd.crosstab counts categorical co-occurrences for quick contingency analysis.

Merges, joins, and concatenation

Relational data lives in multiple tables. pandas mirrors SQL join semantics:

  • pd.merge(left, right, on="key", how="inner") — inner, left, right, and outer joins on shared columns or index levels.
  • pd.concat([df1, df2], axis=0) — stack tables vertically (union rows) or horizontally (add columns) when indexes align.
  • df.join(other, on="key") — index-based join shorthand.

Always validate join cardinality. A one-to-many merge duplicates left rows — expected when attaching order line items to orders, catastrophic when merging user demographics onto event logs if the right table has duplicate user IDs. Check merged.shape against left.shape and run right.duplicated(subset=["key"]).sum() before merging. For heavy join workloads on warehouse data, push logic to SQL or DuckDB and pull summarized results into pandas.

Time series essentials

Datetime columns unlock resampling, rolling windows, and calendar-aware arithmetic. Parse with pd.to_datetime, set as index if the series is ordered:

ts = df.set_index("completed_at").sort_index()
hourly = ts["fare_usd"].resample("h").sum()
rolling_7d = ts["fare_usd"].rolling("7D").mean()

dt accessor extracts components — .dt.hour, .dt.dayofweek, .dt.tz_convert("America/New_York"). Store timestamps in UTC at ingestion; convert to local zones only for reporting. Shift and lag features for feature engineering use .shift(n) and .diff() on sorted frames — never on shuffled train/test splits without grouping by entity first.

Worked example: Harbor Fleet trip analytics

Harbor Fleet operates a regional ride-hailing service. The ops team exports nightly Parquet files to S3: one row per completed trip with trip_id, driver_id, rider_id, fare_usd, distance_km, duration_min, and completed_at UTC. The weekly analytics notebook must answer: which drivers earn the most per hour, which hours see surge pricing lift, and are there duplicate trip records from webhook retries?

  1. Load with dtypespd.read_parquet("trips_2026-06-01.parquet", columns=[...]) with driver_id as category saves 40% RAM on 2M rows.
  2. Audit qualitydf.duplicated("trip_id").sum() finds 1,842 dupes; drop_duplicates(subset=["trip_id"], keep="last") keeps the final settlement row.
  3. Feature columnsdf["fare_per_km"] = df["fare_usd"] / df["distance_km"]; df["earnings_per_hour"] = df["fare_usd"] / (df["duration_min"] / 60); clip outliers above the 99th percentile before aggregating.
  4. Driver leaderboardgroupby("driver_id").agg(trips=("trip_id","count"), revenue=("fare_usd","sum"), mph=("earnings_per_hour","median")).sort_values("mph", ascending=False).
  5. Hour-of-day demand — extract df["hour"] = df["completed_at"].dt.hour, then groupby("hour")["trip_id"].count() to staff support peaks.
  6. Join driver tenurepd.merge(driver_stats, tenure_df, on="driver_id", how="left") to test whether new drivers have lower mph and need onboarding.
  7. Export for ML — write a cleaned Parquet slice for the fraud team’s scikit-learn pipeline; never pass raw object columns with mixed types.

The notebook runs in four minutes on a laptop. Ops shifts Sunday staffing based on the hour-of-day chart; driver success coaches contact the bottom decile on earnings_per_hour after controlling for market zone.

When to use pandas vs alternatives

Tool Best for Trade-offs
pandas Notebook exploration, medium data on one machine, sklearn preprocessing, quick joins Single-threaded core; easy to write slow apply loops; memory-heavy dtypes
Polars Large single-node tables, lazy query plans, faster groupby and joins Smaller ecosystem; different API; fewer legacy tutorials
SQL / DuckDB Filtering and aggregating warehouse-scale data close to storage Less flexible for iterative plots and ML feature prototyping
PySpark Cluster-scale ETL across terabytes Cluster overhead; slower for small data than pandas
NumPy alone Homogeneous numeric arrays, custom vectorized math No labeled columns, joins, or groupby — build everything yourself

Start with pandas for datasets that fit in memory after sensible dtypes. Move to Polars or push down to SQL when profiling shows groupby or join time dominating. Export a sample to pandas for visualization even when the pipeline runs in Spark.

Common pitfalls

  • Chained indexingdf[df["a"] > 0]["b"] = 1 may not persist; use df.loc[df["a"] > 0, "b"] = 1.
  • Modifying while iteratingfor idx, row in df.iterrows() is slow and error-prone; vectorize or use apply sparingly.
  • Ignoring the index after groupby — reset with as_index=False or .reset_index() before merging.
  • Timezone-naive datetimes — mixing local and UTC causes off-by-hours joins; standardize at ingest.
  • Silent dtype coercion — concatenating int and NaN columns upgrades to float; use nullable Int64 when missing integers are valid.
  • Leaking future data in time splits — sort by time before rolling features; validate with forward-chaining splits, not random train_test_split.
  • Loading entire CSVs repeatedly — cache Parquet intermediates; specify usecols and dtype on read.

Practitioner checklist

  • Run shape, info(), and describe() immediately after every load.
  • Set explicit dtypes and parse dates at read time, not in a second pass.
  • Profile memory with df.memory_usage(deep=True) before complaining about RAM.
  • Validate primary keys with duplicated() before merges and aggregations.
  • Prefer groupby.agg with named aggregations over custom apply.
  • Use .loc for label-based assignment; call .copy() when unsure about views.
  • Store timestamps in UTC; document timezone conversions in reporting layers.
  • Write cleaned Parquet checkpoints between heavy notebook cells.
  • Unit-test non-obvious transforms on small fixtures, not only the full 2M-row frame.
  • Hand off modeling-ready matrices to sklearn pipelines without manual scaling outside CV.

Key takeaways

  • pandas provides labeled Series and DataFrames — the default in-memory table format for Python data work.
  • loc, iloc, and boolean masks are the core selection tools; avoid chained assignment.
  • groupby, merges, and pivot tables cover most SQL-shaped analytics without leaving Python.
  • Dtype discipline and Parquet I/O matter as much as algorithm choice for performance.
  • Clean and validate in pandas; train models in sklearn with pipeline boundaries that prevent leakage.

Related reading