
Knowledge graphs explained

A knowledge graph (KG) stores facts as a network: entities (people, products, genes, cities) connected by typed relationships (works_at, compatible_with, causes). Unlike flat document indexes, graphs make multi-hop reasoning explicit — “Which suppliers of our suppliers are in sanctioned countries?” becomes a traversable path, not a keyword guess. Google’s Knowledge Panel, Wikidata, enterprise master data, and modern GraphRAG pipelines all lean on the same idea: structure beats bag-of-words for relational questions. This guide covers triples and ontologies, RDF vs property-graph models, construction from databases and text, querying with SPARQL and Cypher, integration with semantic search and LLMs, a worked e-commerce catalog example, a technology decision table, common pitfalls, and a production checklist.

Triples: the atomic unit of a knowledge graph

Every fact is a triple: (subject, predicate, object). Example: (Tesla, founded_by, Elon Musk). Subjects and objects are entities (nodes); predicates are relation types (edges). Objects can also be literals with a datatype, as in (Tesla, employee_count, 140000), where 140000 is an integer rather than a node.

Triples are semantic: the predicate name carries a meaning fixed by a shared schema, unlike an opaque foreign-key column. That agreement is what lets disparate data sources merge: your CRM’s acct_mgr and a partner feed’s accountOwner both map to managed_by.
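
A minimal Python sketch of that merge, under stated assumptions: the Triple type, the PREDICATE_MAP table, and the sample records are hypothetical and exist only to show two source fields landing on one shared predicate.

```python
from typing import NamedTuple, Union


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: Union[str, int, float]  # an entity ID or a literal value


# Hypothetical mapping from source-specific field names to shared predicates.
PREDICATE_MAP = {
    "acct_mgr": "managed_by",      # CRM column
    "accountOwner": "managed_by",  # partner feed field
}


def to_triples(subject: str, record: dict) -> list[Triple]:
    """Convert one source record into triples, skipping unmapped fields."""
    return [
        Triple(subject, PREDICATE_MAP[field], value)
        for field, value in record.items()
        if field in PREDICATE_MAP
    ]


crm_row = {"acct_mgr": "alice", "region": "EMEA"}  # sample data
partner_row = {"accountOwner": "alice"}            # sample data

# Both sources yield the same triple, so the set holds a single fact.
merged = set(to_triples("acme", crm_row)) | set(to_triples("acme", partner_row))
```

Because both records produce an identical triple, the merged set contains one fact rather than two, which is the practical payoff of agreeing on predicate names.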

Reification and qualifiers

Simple triples cannot express “Alice employed Bob from 2019 to 2023” without extensions. Patterns include reification (a node representing the employment event with start/end dates) or RDF-star quoted triples. Plan for temporal and provenance metadata early — retrofitting is painful.
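
The reification pattern can be sketched in Python as follows. The event ID scheme and the predicate names (employer, employee, start_year, end_year) are assumptions made for illustration, not a standard vocabulary.

```python
def reify_employment(employer: str, employee: str, start: int, end: int) -> list[tuple]:
    """Represent a time-bounded employment as an event node with its own triples."""
    event = f"employment:{employer}:{employee}:{start}"  # hypothetical ID scheme
    return [
        (event, "type", "Employment"),
        (event, "employer", employer),
        (event, "employee", employee),
        (event, "start_year", start),
        (event, "end_year", end),
    ]


def active_in(triples: list[tuple], year: int) -> list[str]:
    """Return the IDs of employment events whose date range covers the year."""
    facts: dict = {}
    for subject, predicate, obj in triples:
        facts.setdefault(subject, {})[predicate] = obj
    return [
        event
        for event, props in facts.items()
        if props.get("type") == "Employment"
        and props["start_year"] <= year <= props["end_year"]
    ]


triples = reify_employment("alice", "bob", 2019, 2023)
```

The cost is visible here: one plain fact becomes five triples, which is why qualifiers are worth planning before the graph grows.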

Ontologies: schema for meaning

An ontology defines classes (Person, Product), properties (employs, price), domains and ranges (employs links Organization to Person), and constraints (Person and Product are disjoint, so no entity can be both). The TBox is the terminological schema; the ABox is the instance data.
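
As a rough illustration of domains and ranges, the Python sketch below checks a triple against a tiny schema. Note the assumption: it treats domain and range as validation rules, which is closer to SHACL-style checking than to RDFS inference, and the PROPERTIES and TYPES tables are hypothetical.

```python
# Hypothetical TBox: property -> (domain class, range class).
PROPERTIES = {"employs": ("Organization", "Person")}

# Hypothetical ABox typing: entity -> class.
TYPES = {"acme": "Organization", "bob": "Person", "widget": "Product"}


def violations(triple: tuple[str, str, str]) -> list[str]:
    """List the ways a triple breaks the declared domain and range."""
    subject, predicate, obj = triple
    if predicate not in PROPERTIES:
        return [f"unknown property: {predicate}"]
    domain, range_ = PROPERTIES[predicate]
    problems = []
    if TYPES.get(subject) != domain:
        problems.append(f"subject {subject} is {TYPES.get(subject)}, expected {domain}")
    if TYPES.get(obj) != range_:
        problems.append(f"object {obj} is {TYPES.get(obj)}, expected {range_}")
    return problems
```

A valid triple such as ("acme", "employs", "bob") returns an empty list, while ("acme", "employs", "widget") reports that the object has the wrong class.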

Standards matter for interoperability:

  • Schema.org: lightweight vocabulary for web markup, used for SEO rich results.
  • OWL / RDF Schema: formal reasoning, subclass inference, disjointness.
  • Industry ontologies: FIBO (finance), SNOMED CT (medicine), or a vocabulary built for your own domain.

Start with a minimal viable ontology — 20–50 relation types cover most enterprise catalogs. Resist modeling every SQL column as a predicate; aggregate stable business concepts instead.

RDF vs property graphs

Two storage paradigms dominate:

RDF (Resource Description Framework)

Triple stores (Apache Jena, Virtuoso, Amazon Neptune RDF) treat everything as subject-predicate-object with IRIs as identifiers. Query language: SPARQL. Strengths: standards, inference, publishing linked open data. Weaknesses: verbose modeling for heavy attributes on nodes.

Property graphs

Labeled nodes and typed edges with key-value properties (Neo4j, Amazon Neptune openCypher, TigerGraph). Query languages: Cypher, Gremlin, and GSQL (TigerGraph). Strengths: developer ergonomics, fast traversals, flexible attributes on edges. Weaknesses: fewer formal reasoning tools than OWL stacks.

Many teams use property graphs internally and export RDF subsets for partners. Pick based on team skills and whether you need W3C linked-data publishing, not religion.

Building a knowledge graph

Construction is usually incremental across three sources:

Structured lift-and-shift

ETL relational tables into entities: customers, orders, SKUs become nodes; foreign keys become edges. Map column names to ontology predicates in a transformation layer. Keep source-system IDs as external identifiers for sync.
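
A small Python sketch of this lift, assuming two hypothetical tables already loaded as dictionaries. The node ID format and the placed and contains predicates are illustrative choices, not part of any particular ontology.

```python
# Hypothetical rows as they might come from two relational tables.
customers = [{"id": 1, "name": "Ada"}]
orders = [{"id": 10, "customer_id": 1, "sku": "CASE-01"}]


def lift(customers: list[dict], orders: list[dict]) -> tuple[dict, list]:
    """Turn rows into nodes and foreign keys into edges."""
    nodes: dict = {}
    edges: list = []
    for row in customers:
        # Keep the source-system ID on the node so later syncs can find it.
        nodes[f"customer/{row['id']}"] = {
            "label": "Customer", "name": row["name"], "source_id": row["id"],
        }
    for row in orders:
        order_id = f"order/{row['id']}"
        sku_id = f"sku/{row['sku']}"
        nodes[order_id] = {"label": "Order", "source_id": row["id"]}
        nodes.setdefault(sku_id, {"label": "SKU", "source_id": row["sku"]})
        # The customer_id foreign key becomes an edge; so does the SKU reference.
        edges.append((f"customer/{row['customer_id']}", "placed", order_id))
        edges.append((order_id, "contains", sku_id))
    return nodes, edges


nodes, edges = lift(customers, orders)
```

Keying nodes by a deterministic ID means that re-running the lift over the same rows overwrites the same nodes rather than creating duplicates.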

Semi-structured and text extraction

Parse JSON APIs, PDF spec sheets, and support tickets with named entity recognition, relation extraction models, and LLM-assisted tuple mining. Route low-confidence extractions to a human review queue, because garbage triples poison traversal.

Entity linking and deduplication

“Apple Inc.”, “AAPL”, and “Apple Computer” must resolve to one canonical node. Entity resolution uses string similarity, embedding clustering, and blocking rules. Maintain sameAs links to external authorities (Wikidata QIDs, LEI codes).
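
The Python sketch below combines the three ingredients in a very reduced form: a normalization step, a blocking rule (same first token), and a string similarity threshold using the standard library's difflib. The suffix list, the alias table, and the 0.5 threshold are assumptions for illustration; a ticker such as AAPL cannot be linked by string similarity alone, so it goes through a hypothetical lookup.

```python
import re
from difflib import SequenceMatcher

SUFFIXES = {"inc", "corp", "ltd", "llc", "co"}  # assumed legal suffixes to drop
ALIASES = {"aapl": "apple"}                      # hypothetical ticker lookup


def normalize(name: str) -> str:
    """Lowercase, strip punctuation and legal suffixes, then apply aliases."""
    tokens = re.sub(r"[^a-z0-9 ]", "", name.lower()).split()
    key = " ".join(t for t in tokens if t not in SUFFIXES)
    return ALIASES.get(key, key)


def first_token(key: str) -> str:
    return key.split()[0] if key else ""


def resolve(names: list[str], threshold: float = 0.5) -> dict[str, list[str]]:
    """Group names under a canonical key; compare only within the same block."""
    clusters: dict[str, list[str]] = {}
    for name in names:
        key = normalize(name)
        match = next(
            (
                canonical
                for canonical in clusters
                if first_token(canonical) == first_token(key)  # blocking rule
                and SequenceMatcher(None, canonical, key).ratio() >= threshold
            ),
            None,
        )
        clusters.setdefault(match if match is not None else key, []).append(name)
    return clusters


clusters = resolve(["Apple Inc.", "AAPL", "Apple Computer", "Applied Materials"])
```

With these settings the three Apple variants share one canonical key and Applied Materials stays separate. Production resolvers typically add embedding similarity and a review step on top of rules like these.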

Provenance

Attach source, confidence, and last_verified to every edge. When the LLM and the warehouse disagree, provenance decides the winner.
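
One way to let provenance decide, sketched in Python. The Fact fields mirror the three attributes named above; the SOURCE_RANK ordering and the sample facts are hypothetical, and a real deployment would choose its own trust policy.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Fact:
    subject: str
    predicate: str
    obj: str
    source: str
    confidence: float
    last_verified: date


# Hypothetical trust ranking: a lower number means a more authoritative source.
SOURCE_RANK = {"warehouse": 0, "vendor_spec": 1, "llm_extraction": 2}


def pick_winner(candidates: list[Fact]) -> Fact:
    """Prefer the most trusted source, then higher confidence, then recency."""
    return min(
        candidates,
        key=lambda f: (
            SOURCE_RANK.get(f.source, 99),  # unknown sources rank last
            -f.confidence,
            -f.last_verified.toordinal(),
        ),
    )


# Sample conflicting facts about the same subject and predicate.
facts = [
    Fact("case-101", "compatible_with", "iphone-15", "llm_extraction", 0.99, date(2024, 5, 1)),
    Fact("case-101", "compatible_with", "iphone-14", "warehouse", 0.90, date(2024, 1, 1)),
]
winner = pick_winner(facts)
```

Here the warehouse fact wins even though the LLM extraction reports higher confidence, because source rank is compared first.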

Querying and reasoning

Graph queries express patterns — paths, cycles, constraints — that SQL recursive CTEs can approximate but rarely match in clarity.

SPARQL example (RDF)

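# The empty prefix must be declared; this namespace IRI is a placeholder.
PREFIX : <http://example.org/>

SELECT ?competitor ?product WHERE {
  :OurCompany :competes_with ?competitor .
  ?competitor :offers ?product .
  ?product :category :CloudStorage .
}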

Cypher example (property graph)

MATCH (c:Company {name:'Acme'})-[:SUPPLIES]->(p:Part)-[:USED_IN]->(prod:Product)
WHERE p.lead_time_days > 30
RETURN prod.name, collect(p.sku) AS slow_parts

Graph algorithms (PageRank, community detection, shortest path) surface influencers, fraud rings, and supply-chain bottlenecks. Precompute expensive metrics offline; serve hot traversals from memory or indexed adjacency lists.

Reasoners on OWL ontologies infer implicit facts: if Manager is a subclass of Employee, typing someone as Manager implies Employee. Useful in regulated domains; optional for product catalogs.
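
The subclass rule can be approximated in a few lines of Python. This is only a sketch of the inference step, not an OWL reasoner: it assumes each class has at most one parent, and the Employee to Person link is added sample data to show transitivity.

```python
def infer_types(types: set[tuple[str, str]], subclass_of: dict[str, str]) -> set[tuple[str, str]]:
    """Add (entity, ancestor_class) for every ancestor of each asserted class."""
    inferred = set(types)
    for entity, cls in types:
        seen = {cls}  # guards against accidental cycles in the hierarchy
        parent = subclass_of.get(cls)
        while parent is not None and parent not in seen:
            inferred.add((entity, parent))
            seen.add(parent)
            parent = subclass_of.get(parent)
    return inferred


subclass_of = {"Manager": "Employee", "Employee": "Person"}  # sample hierarchy
asserted = {("dana", "Manager")}
closure = infer_types(asserted, subclass_of)
```

Typing dana as Manager is enough for queries over Employee or Person to find her, without storing those facts explicitly.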

Knowledge graphs with LLMs and RAG

Vector-only RAG struggles on questions needing joined facts across documents (“List all drugs that interact with both Drug A and Drug B”). GraphRAG retrieves relevant subgraphs — entities plus 1–2 hop neighbors — and serializes them as context for the LLM.

Typical pipeline:

  1. Parse user question; extract entity mentions and intent (lookup vs path vs aggregation).
  2. Link mentions to graph node IDs (entity linking).
  3. Run bounded graph traversal or SPARQL/Cypher template.
  4. Format triples as natural language or JSON for the prompt.
  5. LLM answers with citation to node IDs; UI renders fact provenance.
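
Steps 1 through 4 can be sketched in Python without any graph database or model call. Everything here is a stand-in: EDGES and NAME_TO_ID are hypothetical sample data, entity linking is a naive substring match, and the prompt wording is illustrative.

```python
from collections import deque

# Hypothetical sample graph as (subject, predicate, object) triples.
EDGES = [
    ("drug:a", "interacts_with", "drug:c"),
    ("drug:b", "interacts_with", "drug:c"),
    ("drug:c", "treats", "disease:x"),
    ("disease:x", "studied_in", "trial:9"),
]
NAME_TO_ID = {"drug a": "drug:a", "drug b": "drug:b"}  # hypothetical alias table


def link_entities(question: str) -> list[str]:
    """Step 2: map mentions to node IDs (naive substring matching)."""
    text = question.lower()
    return [node for name, node in NAME_TO_ID.items() if name in text]


def subgraph(seeds: list[str], max_hops: int = 2) -> list[tuple]:
    """Step 3: collect edges within max_hops of the seeds, in either direction."""
    depth = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    found: list[tuple] = []
    while queue:
        node = queue.popleft()
        if depth[node] >= max_hops:
            continue  # hop limit reached; do not expand this node
        for s, p, o in EDGES:
            if node in (s, o):
                if (s, p, o) not in found:
                    found.append((s, p, o))
                other = o if node == s else s
                if other not in depth:
                    depth[other] = depth[node] + 1
                    queue.append(other)
    return found


def build_context(question: str) -> str:
    """Step 4: serialize the retrieved triples into prompt text."""
    facts = "\n".join(f"({s}) -[{p}]-> ({o})" for s, p, o in subgraph(link_entities(question)))
    return f"Facts:\n{facts}\n\nQuestion: {question}\nCite node IDs in the answer."


question = "List all drugs that interact with both Drug A and Drug B"
prompt = build_context(question)
```

With a two-hop limit the trial node, three hops from the seeds, never reaches the prompt. That bound is what keeps context size predictable.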

Hybrid retrieval — BM25 over document chunks plus graph expansion — often beats either alone. See hybrid search for score fusion patterns.
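
One common fusion pattern is reciprocal rank fusion (RRF), sketched below in Python. RRF is a choice made for this example rather than something the pipeline above prescribes; the constant k=60 is a widely used default, and the document IDs are sample data.

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each item scores the sum of 1 / (k + rank) across lists."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by ID for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_hits = ["doc3", "doc1", "doc7"]   # sample keyword ranking
graph_hits = ["doc1", "doc9", "doc3"]  # sample ranking from graph expansion
fused = rrf([bm25_hits, graph_hits])
```

RRF uses only ranks, so it sidesteps the problem that BM25 scores and graph relevance scores live on different scales.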

LLM-to-graph construction is trendy but risky: models hallucinate relations. Use LLMs to propose triples; commit only after schema validation and confidence thresholds.
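
A gate of this kind might look like the Python sketch below. The predicate whitelist and both thresholds are assumptions; the point is only that a proposal is committed, queued for review, or rejected before it can reach the graph.

```python
# Hypothetical whitelist of predicates defined in the ontology.
ALLOWED_PREDICATES = {"compatible_with", "manufactured_by", "supports_feature"}


def route(proposal: tuple[str, str, str, float],
          commit_at: float = 0.9, review_at: float = 0.5) -> str:
    """Decide what happens to an LLM-proposed (subject, predicate, object, confidence)."""
    _subject, predicate, _obj, confidence = proposal
    if predicate not in ALLOWED_PREDICATES:
        return "reject"  # relation is not in the schema, however confident
    if confidence >= commit_at:
        return "commit"
    if confidence >= review_at:
        return "review"  # send to a human review queue
    return "reject"
```

For example, route(("case-101", "compatible_with", "iphone-15", 0.6)) returns "review", while a proposal with an unknown predicate is rejected regardless of its confidence.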

Use cases by domain

  • Search and discovery — faceted browse, “people also bought”, compatibility graphs (Will this lens fit my camera?).
  • Fraud and compliance — traverse ownership chains, detect circular trading, sanctions screening.
  • Biomedical — drug–gene–disease networks; clinical trial eligibility.
  • Support and IT — dependency maps: which services break if Redis cluster B fails?
  • Personalization — taste graphs connecting users, genres, and creators without sparse collaborative matrices alone.

Public references: Google Knowledge Graph powers entity cards; Wikidata is the open backbone; LinkedIn Economic Graph models professional relationships at scale.

Worked example: consumer electronics catalog

An online retailer sells phones, cases, and chargers. Goal: answer “Which cases fit iPhone 15 and support MagSafe?” without hand-authored FAQ pages.

  1. Ontology — classes: Product, Brand, Accessory; relations: compatible_with, manufactured_by, supports_feature.
  2. Ingest — PIM database to nodes; supplier CSVs to compatible_with edges with source=vendor_spec.
  3. NER pass — mine compatibility claims from product descriptions; queue low-confidence edges for merchandiser approval.
  4. Canonical IDs — merge “iPhone 15” / “Apple iPhone 15 (128GB)” under one Device node with SKU children.
  5. Query — Cypher: match cases where path (case)-[:compatible_with]->(iphone15) and (case)-[:supports_feature]->(magsafe).
  6. LLM layer — GraphRAG feeds matched products + specs into prompt for natural-language shopping assistant with SKU citations.
  7. Metrics — precision of compatibility edges (returns due to misfit), query latency p95, assistant answer faithfulness audits weekly.
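
The query in step 5 reduces to intersecting two edge lookups. The Python sketch below runs it over an in-memory edge set; the product IDs are made-up sample data, and a real deployment would issue the equivalent Cypher against the graph store.

```python
# Hypothetical sample edges as (subject, predicate, object).
EDGES = {
    ("case-101", "compatible_with", "iphone-15"),
    ("case-101", "supports_feature", "magsafe"),
    ("case-202", "compatible_with", "iphone-15"),
    ("case-303", "compatible_with", "iphone-14"),
    ("case-303", "supports_feature", "magsafe"),
}


def subjects(predicate: str, obj: str) -> set[str]:
    """Return every subject that has the given edge to the given object."""
    return {s for s, p, o in EDGES if p == predicate and o == obj}


# Cases that both fit the iPhone 15 and support MagSafe.
matches = sorted(
    subjects("compatible_with", "iphone-15") & subjects("supports_feature", "magsafe")
)
```

Only case-101 satisfies both conditions: case-202 lacks the MagSafe edge and case-303 fits a different phone.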

Technology decision table

| Need | Prefer knowledge graph | Prefer alternative |
| --- | --- | --- |
| Multi-hop relational questions | Explicit traversals, proven joins | Vector RAG alone may miss bridging entities |
| Long-form prose search | Link entities mentioned in docs | BM25 / dense retrieval on chunks |
| Strict schema, ACID transactions | Property graph + application validation | Relational DB remains source of truth |
| Publish open linked data | RDF + SPARQL endpoint | Proprietary property graph export RDF subset |
| Similarity (“documents like this”) | Entity embeddings on graph | Vector database on text chunks |
| Small static FAQ (<500 facts) | Overkill | Structured JSON or CMS fields |
| Real-time event stream | Incremental graph updates via CDC | Stream processor + materialized graph view |

Common pitfalls

  • Ontology sprawl — hundreds of relation types nobody uses; merge synonyms aggressively.
  • No entity resolution — duplicate nodes break traversals and inflate metrics.
  • Trusting LLM-extracted triples blindly — hallucinated edges propagate to user-facing answers.
  • Ignoring provenance — impossible to debug wrong facts or prioritize authoritative sources.
  • Unbounded traversals — exponential path explosion; always cap hop depth and fan-out.
  • Graph as only store — keep transactional systems of record; graph is often a derived view.
  • Skipping evaluation — measure edge precision/recall on labeled QA sets, not vibes.
  • Wrong tool for tabular aggregates — “total revenue by region” is still SQL’s job.
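
For the evaluation pitfall, edge precision and recall against a labeled set take only a few lines of Python. The sketch assumes edges are comparable tuples and that a gold set exists; the sample edges are invented.

```python
def edge_precision_recall(predicted, gold) -> tuple[float, float]:
    """Compare extracted edges with a hand-labeled gold set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Sample data: the graph holds 4 edges, 3 of which appear in a gold set of 6.
gold = {("c1", "fits", f"phone-{i}") for i in range(6)}
predicted = {("c1", "fits", "phone-0"), ("c1", "fits", "phone-1"),
             ("c1", "fits", "phone-2"), ("c1", "fits", "phone-99")}
precision, recall = edge_precision_recall(predicted, gold)
```

In this sample precision is 0.75 and recall is 0.5: most stored edges are right, but half of the known facts are missing from the graph.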

Production checklist

  • Versioned ontology document with owners and change process.
  • Canonical entity IDs with external sameAs mappings where available.
  • ETL/CDC pipeline from systems of record with idempotent upserts.
  • Human-in-the-loop review for extracted or LLM-proposed edges below confidence threshold.
  • Provenance metadata on every fact (source, timestamp, confidence).
  • Query templates with hop limits and timeout guards.
  • Hybrid retrieval plan if paired with document RAG or semantic search.
  • Monitoring: graph size growth, orphan nodes, query latency, QA regression suite.
  • Access control on sensitive subgraphs (PII, unreleased products).
  • Disaster recovery — graph rebuild procedure from source systems documented.

Key takeaways

  • Knowledge graphs encode entities and typed relationships for multi-hop reasoning.
  • Ontology discipline matters more than database brand — RDF and property graphs both work.
  • Construction blends ETL, extraction, and entity resolution; provenance is non-optional.
  • GraphRAG complements vector search for relational questions in LLM applications.
  • Use graphs where relationships are first-class; do not force every problem into triples.
