Guide
Pandas fundamentals explained
pandas is the standard Python library for loading, cleaning, reshaping,
and analyzing tabular data. Built on NumPy, it introduces two core structures:
Series (one-dimensional labeled arrays) and DataFrame
(two-dimensional tables with named columns and an index). Data scientists reach for pandas
when CSV exports, SQL query results, API logs, and feature stores need exploration before
modeling — it is the glue between raw data and libraries like
scikit-learn.
Unlike spreadsheet tools, pandas operations are scriptable, reproducible, and scale to
millions of rows on a single machine when you choose efficient dtypes and avoid Python
loops. This guide covers object fundamentals, I/O, indexing with loc and
iloc, dtype hygiene, missing values, groupby aggregations,
merges and joins, time-series basics, a Harbor Fleet trip analytics worked example, a
tooling decision table, common pitfalls, and a practitioner checklist — alongside our
Python fundamentals guide,
machine learning overview,
and
feature engineering guide.
Series and DataFrame — the tabular model
A Series is a single column with an index — think one spreadsheet
column with row labels. A DataFrame is a collection of aligned Series
sharing the same index. Columns can hold different dtypes: integers, floats, strings,
booleans, datetimes, and pandas category for low-cardinality enums.
import pandas as pd
df = pd.DataFrame({
"trip_id": [101, 102, 103],
"driver_id": ["D12", "D07", "D12"],
"fare_usd": [14.50, 22.00, 9.75],
"completed_at": pd.to_datetime(["2026-06-01 08:12", "2026-06-01 09:44", "2026-06-01 11:03"]),
})
df.dtypes # inspect column types before analysis
The index defaults to 0, 1, 2… but can be a datetime, user ID,
or composite key. Setting a meaningful index (e.g. trip_id) speeds lookups
and joins. df.shape returns (rows, columns);
df.info() prints non-null counts and memory; df.describe()
summarizes numeric columns. Always run these three after loading unfamiliar data.
Why dtypes matter
Loading CSVs often infers object dtype for numeric columns with a few bad
strings, or stores ZIP codes as floats. Downcasting saves memory and prevents silent bugs:
use pd.to_numeric(errors="coerce") for messy numbers,
astype("category") for country codes with dozens of unique values, and
pd.to_datetime with an explicit format or utc=True
for timestamps. A 10 GB float64 column becomes ~2.5 GB as
float32 when precision allows.
Loading and exporting data
pandas reads most tabular formats through a consistent API:
pd.read_csv()— delimited text; setparse_dates,dtype, andusecolsto control inference and memory.pd.read_parquet()— columnar binary format; preserves dtypes; preferred for analytics pipelines and S3 archives.pd.read_sql()— pull query results from PostgreSQL or SQLite via SQLAlchemy connections.pd.read_json()— nested JSON withlines=Truefor newline-delimited logs.
Export with symmetric writers: to_csv, to_parquet,
to_sql. For large files, pass chunksize to
read_csv and process incrementally instead of loading everything into RAM.
Parquet with Snappy or Zstd compression is the default interchange format between
ETL jobs and
notebook exploration.
Indexing and selection — loc, iloc, and boolean filters
pandas offers label-based and position-based indexing. Mixing them incorrectly is the most common beginner bug.
loc — label-based
df.loc[row_label, column_label] includes the end point in slices.
Select rows where fare_usd > 15 and only the driver_id column:
high_fares = df.loc[df["fare_usd"] > 15, "driver_id"]
iloc — integer position
df.iloc[0:2, 1:3] selects the first two rows and columns at positions 1 and 2
(half-open interval, like Python slicing). Use iloc when order matters and
labels are irrelevant.
Boolean indexing and query
Combine conditions with &, |, and ~ — wrap each
condition in parentheses. df.query("fare_usd > 10 and driver_id == 'D12'")
reads cleanly for ad hoc filters. For repeated filters in production, prefer explicit
boolean masks stored as variables so logic is testable.
SettingWithCopyWarning appears when you chain filters then assign —
pandas cannot tell if you are modifying a view or a copy. Fix it with
.loc on the filtered frame or call .copy() after filtering
before mutating.
Cleaning missing values and duplicates
Real data is incomplete. pandas represents missing values as NaN (float columns)
or pd.NA (nullable integer and string dtypes introduced in pandas 1.0+).
isna()/notna()— boolean masks for filtering or counting nulls per column.dropna(subset=["col"], how="all")— remove rows or columns with missing values; specifythreshto keep rows with at least N non-null fields.fillna(value)/ffill()/bfill()— impute constants or propagate last/next valid observation; dangerous on shuffled data without time order.interpolate()— linear or time-based gap filling for ordered series.
For modeling pipelines, imputation belongs inside
sklearn pipelines
so validation folds do not leak statistics. In exploratory pandas work,
df.isna().mean() quickly ranks columns by missingness. Deduplicate with
drop_duplicates(subset=["trip_id"], keep="last") after ingestion — API
retries often produce duplicate primary keys.
GroupBy — split, apply, combine
groupby is pandas’ answer to SQL GROUP BY. Split rows by
one or more keys, compute aggregations per group, and return a summarized DataFrame or
Series.
driver_stats = (
df.groupby("driver_id", as_index=False)
.agg(
trips=("trip_id", "count"),
total_fare=("fare_usd", "sum"),
avg_fare=("fare_usd", "mean"),
)
)
Named aggregations (pandas 0.25+) produce clear column names. Multi-key grouping —
df.groupby(["driver_id", df["completed_at"].dt.date]) — builds daily
per-driver rollups. transform returns a Series aligned to the original index
(useful for per-group z-scores); apply runs arbitrary functions per group
but is slower — prefer built-in aggregations when possible.
Pivot tables and crosstabs
pd.pivot_table(df, index="driver_id", columns="weekday", values="fare_usd", aggfunc="sum")
reshapes long data into a matrix — the pandas equivalent of spreadsheet pivot tables.
pd.crosstab counts categorical co-occurrences for quick contingency analysis.
Merges, joins, and concatenation
Relational data lives in multiple tables. pandas mirrors SQL join semantics:
pd.merge(left, right, on="key", how="inner")— inner, left, right, and outer joins on shared columns or index levels.pd.concat([df1, df2], axis=0)— stack tables vertically (union rows) or horizontally (add columns) when indexes align.df.join(other, on="key")— index-based join shorthand.
Always validate join cardinality. A one-to-many merge duplicates left rows — expected
when attaching order line items to orders, catastrophic when merging user demographics
onto event logs if the right table has duplicate user IDs. Check
merged.shape against left.shape and run
right.duplicated(subset=["key"]).sum() before merging. For heavy join
workloads on warehouse data, push logic to SQL or
DuckDB and pull summarized
results into pandas.
Time series essentials
Datetime columns unlock resampling, rolling windows, and calendar-aware arithmetic.
Parse with pd.to_datetime, set as index if the series is ordered:
ts = df.set_index("completed_at").sort_index()
hourly = ts["fare_usd"].resample("h").sum()
rolling_7d = ts["fare_usd"].rolling("7D").mean()
dt accessor extracts components — .dt.hour,
.dt.dayofweek, .dt.tz_convert("America/New_York"). Store
timestamps in UTC at ingestion; convert to local zones only for reporting. Shift and
lag features for
feature engineering
use .shift(n) and .diff() on sorted frames — never on
shuffled train/test splits without grouping by entity first.
Worked example: Harbor Fleet trip analytics
Harbor Fleet operates a regional ride-hailing service. The ops team exports nightly
Parquet files to S3: one row per completed trip with trip_id,
driver_id, rider_id, fare_usd,
distance_km, duration_min, and completed_at UTC.
The weekly analytics notebook must answer: which drivers earn the most per hour, which
hours see surge pricing lift, and are there duplicate trip records from webhook retries?
- Load with dtypes —
pd.read_parquet("trips_2026-06-01.parquet", columns=[...])withdriver_idascategorysaves 40% RAM on 2M rows. - Audit quality —
df.duplicated("trip_id").sum()finds 1,842 dupes;drop_duplicates(subset=["trip_id"], keep="last")keeps the final settlement row. - Feature columns —
df["fare_per_km"] = df["fare_usd"] / df["distance_km"];df["earnings_per_hour"] = df["fare_usd"] / (df["duration_min"] / 60); clip outliers above the 99th percentile before aggregating. - Driver leaderboard —
groupby("driver_id").agg(trips=("trip_id","count"), revenue=("fare_usd","sum"), mph=("earnings_per_hour","median")).sort_values("mph", ascending=False). - Hour-of-day demand — extract
df["hour"] = df["completed_at"].dt.hour, thengroupby("hour")["trip_id"].count()to staff support peaks. - Join driver tenure —
pd.merge(driver_stats, tenure_df, on="driver_id", how="left")to test whether new drivers have lowermphand need onboarding. - Export for ML — write a cleaned Parquet slice for the fraud team’s scikit-learn pipeline; never pass raw object columns with mixed types.
The notebook runs in four minutes on a laptop. Ops shifts Sunday staffing based on the
hour-of-day chart; driver success coaches contact the bottom decile on
earnings_per_hour after controlling for market zone.
When to use pandas vs alternatives
| Tool | Best for | Trade-offs |
|---|---|---|
| pandas | Notebook exploration, medium data on one machine, sklearn preprocessing, quick joins | Single-threaded core; easy to write slow apply loops; memory-heavy dtypes |
| Polars | Large single-node tables, lazy query plans, faster groupby and joins | Smaller ecosystem; different API; fewer legacy tutorials |
| SQL / DuckDB | Filtering and aggregating warehouse-scale data close to storage | Less flexible for iterative plots and ML feature prototyping |
| PySpark | Cluster-scale ETL across terabytes | Cluster overhead; slower for small data than pandas |
| NumPy alone | Homogeneous numeric arrays, custom vectorized math | No labeled columns, joins, or groupby — build everything yourself |
Start with pandas for datasets that fit in memory after sensible dtypes. Move to Polars or push down to SQL when profiling shows groupby or join time dominating. Export a sample to pandas for visualization even when the pipeline runs in Spark.
Common pitfalls
- Chained indexing —
df[df["a"] > 0]["b"] = 1may not persist; usedf.loc[df["a"] > 0, "b"] = 1. - Modifying while iterating —
for idx, row in df.iterrows()is slow and error-prone; vectorize or useapplysparingly. - Ignoring the index after
groupby— reset withas_index=Falseor.reset_index()before merging. - Timezone-naive datetimes — mixing local and UTC causes off-by-hours joins; standardize at ingest.
- Silent dtype coercion — concatenating int and NaN columns upgrades to float; use nullable
Int64when missing integers are valid. - Leaking future data in time splits — sort by time before rolling features; validate with forward-chaining splits, not random
train_test_split. - Loading entire CSVs repeatedly — cache Parquet intermediates; specify
usecolsanddtypeon read.
Practitioner checklist
- Run
shape,info(), anddescribe()immediately after every load. - Set explicit dtypes and parse dates at read time, not in a second pass.
- Profile memory with
df.memory_usage(deep=True)before complaining about RAM. - Validate primary keys with
duplicated()before merges and aggregations. - Prefer
groupby.aggwith named aggregations over customapply. - Use
.locfor label-based assignment; call.copy()when unsure about views. - Store timestamps in UTC; document timezone conversions in reporting layers.
- Write cleaned Parquet checkpoints between heavy notebook cells.
- Unit-test non-obvious transforms on small fixtures, not only the full 2M-row frame.
- Hand off modeling-ready matrices to sklearn pipelines without manual scaling outside CV.
Key takeaways
- pandas provides labeled Series and DataFrames — the default in-memory table format for Python data work.
loc,iloc, and boolean masks are the core selection tools; avoid chained assignment.groupby, merges, and pivot tables cover most SQL-shaped analytics without leaving Python.- Dtype discipline and Parquet I/O matter as much as algorithm choice for performance.
- Clean and validate in pandas; train models in sklearn with pipeline boundaries that prevent leakage.
Related reading
- scikit-learn fundamentals explained — estimators and pipelines on pandas-ready matrices
- Python fundamentals explained — syntax, environments, and the broader ecosystem
- Feature engineering explained — transforms from raw columns to model inputs
- ETL and ELT data pipelines explained — where pandas fits in warehouse workflows