Monday, May 25, 2026

PostgreSQL: Diagnosing and Fixing a Sequential Scan with No Filter Predicate

A sequential scan is not always a problem — PostgreSQL's planner often chooses one correctly for small tables or queries that return most of a table's rows. The dangerous variant is the sequential scan with no filter predicate: the plan reads every row in the table and returns all of them, with zero selectivity applied at the scan level. This pattern means the query has no WHERE clause narrowing the result, or the join/filter column has no index backing it. On any large table in a transactional workload it produces unnecessary I/O, longer runtimes, and contention under load. This post shows how to spot it and what to do about it.


1. Capture the execution plan

Run EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT) against the slow query. The BUFFERS option exposes how many shared-buffer pages were read from disk versus cache, making the real I/O cost visible alongside the plan structure.

EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM your_table;

-- The warning sign in the output:
-- Seq Scan on your_table  (cost=0.00..XXXXX rows=NNNNN width=NN)
--                         (actual time=X.XXX..X.XXX rows=NNNNN loops=1)
--   Buffers: shared read=XXXXX
-- (No "Filter:" line beneath the Seq Scan node = no predicate pushing down)
  

2. Visualize and confirm the issue in PlanTrace

Copy the full plan text and paste it into PlanTrace, my free browser-based EXPLAIN visualizer for PostgreSQL and Amazon Redshift. It will render the execution plan as an interactive node graph, flags the costliest nodes with a color-coded cost semaphore, shows buffer usage per node, and surfaces tuning insights automatically, and like this posts, will include the absence of a filter predicate on a sequential scan. 

In the node graph, a Seq Scan node with no child filter annotation and a high shared read buffer count is the fingerprint of this issue. The cost semaphore on that node will typically appear in red or orange on a large table, making it easy to prioritize against other nodes in a complex plan.


3. Confirm the table is large enough to decide your next action

Not every sequential scan needs fixing. If the table has only a few thousand rows, the planner's choice is almost always correct — a sequential scan on a small table is faster than the overhead of an index lookup. Query pg_class to check the row estimate before proceeding.

SELECT
    relname                              AS table_name,
    reltuples::bigint                    AS estimated_rows,
    pg_size_pretty(pg_total_relation_size(oid)) AS total_size
FROM   pg_class
WHERE  relname = 'your_table'
  AND  relkind = 'r';

-- If estimated_rows is under ~1,000, the sequential scan is likely correct. Stop here.
  

4. Identify the missing filter or join predicate

The root cause is almost always one of two things: the query has no WHERE clause at all, or it has one but the filtered column has no supporting index. Check scan counters in pg_stat_user_tables to understand how often the table is being hit with sequential versus index scans, and verify which indexes currently exist.

-- Scan counter ratio
SELECT
    relname        AS table_name,
    seq_scan,
    seq_tup_read,
    idx_scan,
    idx_tup_fetch,
    last_analyze,
    last_autoanalyze
FROM   pg_stat_user_tables
WHERE  relname = 'your_table';

-- Existing indexes
SELECT indexname, indexdef
FROM   pg_indexes
WHERE  tablename = 'your_table';
  

5. Add a WHERE clause and create the right index

The remediation follows directly from the diagnosis. If the query has no filter, add one — most production queries have a natural narrow condition such as a date range, a status value, or a foreign key. If the filter column exists but has no index, create one. Use CONCURRENTLY on production systems to avoid blocking writes during the build, as described in the PostgreSQL CREATE INDEX documentation. For frequently queried subsets, a partial index covers only the qualifying rows, keeping the structure compact. For queries that project only a few columns, an index with INCLUDE enables an index-only scan and removes the need for a heap access entirely.

-- Option A: Standard B-tree on the filter or join key
CREATE INDEX CONCURRENTLY idx_your_table_col
    ON your_table (filter_column);

-- Option B: Partial index — only index the rows you actually query
CREATE INDEX CONCURRENTLY idx_your_table_partial
    ON your_table (filter_column)
    WHERE status = 'active';

-- Option C: Covering index — avoid heap access for narrow SELECT lists
CREATE INDEX CONCURRENTLY idx_your_table_covering
    ON your_table (filter_column)
    INCLUDE (col_a, col_b);
  

6. Keep statistics current so the planner uses the index

An index alone is not enough. PostgreSQL's cost-based planner relies on table statistics to decide whether an index scan is cheaper than a sequential scan. If statistics are stale — because the table grew rapidly or autovacuum thresholds are too loose — the planner's row-count estimates will be wrong and it may still choose a sequential scan even with a valid index in place. Run ANALYZE manually after bulk loads, and tighten per-table autovacuum settings on high-write tables.

-- Refresh statistics after a bulk load or after creating the index
ANALYZE your_table;

-- Tighten autovacuum thresholds on a high-write table
ALTER TABLE your_table SET (
    autovacuum_analyze_scale_factor  = 0.01,
    autovacuum_analyze_threshold     = 500
);
  

7. Re-run the plan and verify the fix in PlanTrace

After creating the index and running ANALYZE, re-execute EXPLAIN (ANALYZE, BUFFERS) on the same query and paste the new plan into PlanTrace. The insight should be gone (or show new ones) — confirming that the optimization took effect.


Wrapping Up

A sequential scan with no filter predicate is a full-table read with no early exit: every page is touched, every row is returned, and nothing is filtered at the scan level. The pattern applies to any table in any schema — the fix is always the same combination: add a selective WHERE clause, back it with the right index type (standard, partial, or covering), and run ANALYZE so the planner has accurate statistics to make the right choice. 
PlanTrace makes both steps easier — paste the plan before and after the change and the node graph shows immediately whether the sequential scan has been replaced and how much the buffer cost dropped.


Thursday, May 21, 2026

Stop Using Pandas for Aggregations — Try DuckDB Instead

If you've ever loaded a 2 GB CSV into pandas just to run a few aggregations — and watched your machine struggle — there's a better tool for the job. DuckDB is a free, open-source, in-process OLAP database that runs SQL directly inside your application, with no server to set up and no external dependencies. Think of it as SQLite for analytics: install it in seconds, query Parquet files, CSVs, JSON, or even pandas DataFrames, and get columnar-speed performance on your laptop.


1. Installing DuckDB

DuckDB is available for Python, R, the CLI, and several other runtimes. The most common way to get started is via pip or the CLI binary from the official installation page.

# Python
pip install duckdb
 
# R
install.packages("duckdb")
 
# macOS CLI via Homebrew
brew install duckdb
 
# Windows CLI via winget
winget install DuckDB.cli
  

2. What "in-process" actually means

Traditional databases run as a separate server process. Your application connects over a network socket, serializes the query, waits for the server to deserialize and execute it, and then serializes the result back. That overhead adds up quickly, especially for analytical workloads. DuckDB runs inside your process — no server, no network round-trip. When you use the Python package, for example, DuckDB can run queries directly on an existing pandas DataFrame without ever copying the data. This makes it extremely fast for ad-hoc analytics and ETL transformations.

# Install
pip install duckdb
 
# Query a pandas DataFrame directly — no data copy
import duckdb, pandas as pd
 
df = pd.read_csv("sales.csv")
 
result = duckdb.sql("""
    SELECT region, SUM(amount) AS total
    FROM df
    GROUP BY region
    ORDER BY total DESC
""").df()
  

3. Querying files directly — no table creation needed

One of DuckDB's most practical features is its ability to query files directly from the FROM clause. You can point it at a local Parquet file, a remote CSV over HTTPS, or even an S3 bucket — without defining a schema or creating a table first. DuckDB auto-detects column names and types, so exploratory work is fast. This is documented in the official DuckDB Guides.

-- Query a local Parquet file
SELECT customer_id, COUNT(*) AS orders
FROM   'orders_2024.parquet'
WHERE  status = 'completed'
GROUP BY customer_id
ORDER BY orders DESC
LIMIT  10;
 
-- Or query a remote CSV over HTTPS (httpfs extension auto-loads)
SELECT COUNT(*) FROM 'https://data.example.com/events.csv';
  

4. Where DuckDB fits in a data engineering workflow

DuckDB is not a replacement for SQL Server, Redshift, or PostgreSQL in transactional or multi-user systems. Its strength is in local analytical processing — the kind of work that typically lands in a Python script or Jupyter notebook. Common patterns include:

  • ETL pre-processing: filter, join, and reshape raw CSV or Parquet files before loading into a data warehouse.
  • Format conversion: read JSON or CSV and write out clean Parquet with a single SQL statement using COPY ... TO.
  • Ad-hoc analytics on exports: query a data warehouse export locally without spinning up another cloud service.
  • Privacy-first analysis: process sensitive data (health records, financial statements) entirely on-premises, with data never leaving the machine.
-- Convert a CSV to Parquet in one shot
COPY (
    SELECT *
    FROM   'raw_data.csv'
    WHERE  event_date >= '2024-01-01'
) TO 'clean_data.parquet' (FORMAT PARQUET);
  

5. Columnar execution — why it's fast

DuckDB uses a columnar-vectorized query engine. Instead of processing one row at a time (like PostgreSQL or MySQL do internally for OLTP), DuckDB processes large batches of column values together in a single CPU operation. This reduces the per-value overhead dramatically, making aggregations and scans over wide tables significantly faster than row-oriented engines. For a data engineer, this means you can comfortably run GROUP BY queries over tens of millions of rows on a standard laptop in seconds.


Wrapping Up

DuckDB fills a gap that pandas alone struggles with: fast, SQL-native, columnar analytics that run in-process with zero infrastructure. It reads Parquet, CSV, and JSON natively; integrates cleanly into Python and R workflows; and handles files too large for memory by spilling to disk automatically. For data engineers doing local transformations, format conversions, or exploratory analysis on raw exports, DuckDB is often the fastest path from raw data to insight. Start with the official DuckDB documentation — the installation takes under a minute, and the CLI is ready to query immediately.


Monday, May 18, 2026

Optimizing Redshift Performance by Configuring WLM Queues

Efficient query performance in Amazon Redshift often comes down to how well you manage workload concurrency. Redshift's Workload Management (WLM) queues enable you to control how queries share resources, helping avoid bottlenecks during peak loads. Properly configuring WLM queues ensures critical queries get the resources they need while maximizing cluster throughput.

This post will guide you through the basics of WLM queue configuration and share actionable steps to optimize query execution and concurrency for your Redshift workloads.


Automatic WLM vs. Manual WLM

Before diving into manual configuration, it's important to note that Amazon recommends Automatic WLM as the default approach for most workloads. With Automatic WLM, Redshift dynamically determines resource utilization as queries arrive and adjusts concurrency and memory allocation on the fly — without requiring manual tuning. It maximizes total throughput and handles unpredictable or changing workloads more efficiently than static manual configuration.

Manual WLM is best suited for scenarios where your workload patterns are predictable, you need to guarantee multiple workload types run simultaneously with fixed resource allocations, or you need to throttle certain query types at specific times of day. If you proceed with manual WLM, be prepared to monitor and tune it regularly as workloads evolve.


Step-by-Step Guide (Manual WLM)

1. Understand Redshift WLM Queues and Slots
WLM enables you to define queues with resource slots. Each query consumes one or more slots depending on queue configuration. Slots allocate memory and concurrency allowing multiple queries to run simultaneously without resource contention.
Learn more about WLM fundamentals here: Redshift WLM Documentation.


2. Analyze Your Query Workload
Identify query types and their resource demands. Separate lengthy ETL or analytical queries from short, high-priority reporting queries. Segmenting workloads allows for assigning appropriate queue configurations matching query resource profiles.


3. Create or Modify Queues Based on Workload Types
For example, create a "high_priority" queue with fewer concurrency slots but higher memory allocation for fast-running reports, and a "default" queue for ad-hoc or batch queries running with more concurrency but less memory per query.


4. Adjust Slots and Memory Allocations
Use the AWS Console or Redshift parameter groups to configure slots via the wlm_json_configuration parameter. An example config snippet:

-- Example JSON for WLM queue definition in parameter group:
[
  {
    "query_group": "high_priority",
    "concurrency": 5,
    "memory_percent_to_use": 30
  },
  {
    "query_group": "default",
    "concurrency": 10,
    "memory_percent_to_use": 20
  }
]

This example balances memory and concurrency to prioritize faster queries effectively.


5. Monitor and Tune Using System Views
Query views like STV_WLM_QUERY_STATE and STL_WLM_QUERY to analyze queue usage, wait times, and query runtimes. Use this insight to iteratively adjust concurrency and memory settings.


Conclusion

Optimizing Redshift WLM queues is key to balancing concurrency and performance in mixed workloads. For most users, Automatic WLM is the recommended starting point. If manual WLM better fits your use case, classifying queries, assigning queues based on priority, and tuning memory and slot allocations will ensure efficient resource use and a better experience for high-priority queries.
Regular monitoring and tuning with Redshift system views complete the feedback loop for continuous improvement.

Official Documentation for more details:
Amazon Redshift Workload Management


Wednesday, May 13, 2026

Data Engineering Books Worth Having on Your Shelf (or your tablet)

Good documentation gets you started. Good books get you deep. After years of working with cloud data platforms, SQL engines, and machine learning pipelines, a handful of titles keep coming up as genuinely useful — not just for beginners, but for practitioners who want to go beyond the basics. Here are five books that cover the tools this blog is built around.

Disclosure: the links below are Amazon affiliate links. If you purchase through them, this site earns a small commission at no extra cost to you.


Generative AI on AWS

Chris Fregly, Antje Barth & Shelbee Eigenbrode — O'Reilly

Covers how to build context-aware, multimodal reasoning applications on AWS using generative AI. Written by AWS practitioners, it bridges the gap between the hype and the actual engineering work — prompt engineering, RAG pipelines, fine-tuning, and deploying models at scale. Useful for data engineers who are being asked to build AI-adjacent infrastructure and want a grounded, AWS-native perspective.


Amazon Redshift: The Definitive Guide

Tyler Scherr & James Tunick — O'Reilly

The most thorough single reference for Redshift available in print. Goes well beyond the documentation: cluster architecture, distribution styles, sort keys, workload management, Redshift Spectrum, and data sharing. Whether you are running a provisioned cluster or Redshift Serverless, this book will save you hours of trial and error on performance and cost optimization.


Data Science on AWS

Chris Fregly & Antje Barth — O'Reilly

A hands-on guide to building end-to-end, continuous AI and ML pipelines on AWS. Covers SageMaker, data ingestion, feature engineering, model training, deployment, and monitoring — with real code throughout. The focus on continuous pipelines rather than one-off notebooks makes this especially relevant for teams trying to productionize their ML work.


Just Use Postgres!

Brandur Leach

The title says it plainly. PostgreSQL is a remarkably capable database that teams routinely underutilize before reaching for more complex infrastructure. This book is written for application developers and database practitioners who want to squeeze more out of Postgres — from indexing strategies and transactions to JSON, full-text search, and advanced concurrency patterns. A practical counter-argument to unnecessary architectural complexity.


Practical SQL, 2nd Edition

Anthony DeBarros — No Starch Press

SQL books tend to be either too shallow or too academic. This one hits the right balance. Written by a journalist and data analyst, it teaches SQL through real datasets and actual analysis problems using PostgreSQL. The focus on finding the story inside data — rather than just memorizing syntax — makes the concepts stick. Concepts transfer directly to MySQL, SQL Server, Redshift, and most other relational engines.


A very personal note on buying technical books

Technical books go out of date, but the fundamentals they cover rarely do.
A well-chosen book on Redshift architecture or SQL query design will still be useful three or more years after purchase — the core concepts around distribution, joins, and execution plans do not change with every minor release.


Monday, May 11, 2026

Quick Tip: Check Most DTU Expensive Queries in Azure SQL Database

Optimizing Azure SQL Database performance often begins with identifying the most resource-intensive queries. Understanding which queries consume the most DTU (Database Transaction Unit) helps prioritize tuning efforts and improve overall efficiency.

Azure SQL Database provides built-in tools to easily surface the queries with the highest DTU usage, making it straightforward for data engineers and DBAs to pinpoint performance bottlenecks.

Step-by-Step Guide

1. Connect to Your Azure SQL Database Using VSCode or SQL Server Management Studio (SSMS).
These tools allow you to run queries directly against your database.

2. Query the sys.dm_db_resource_stats DMV to Get Recent Resource Usage Data.
This dynamic management view shows DTU usage metrics aggregated for your database and is updated every 15 seconds. You can correlate this with queries identified in query stats views.
Example query to find top DTU-consuming queries:

SELECT TOP 10 
    qs.sql_handle,
    qs.plan_handle,
    total_worker_time/1000 AS total_cpu_ms,
    total_logical_reads, 
    execution_count,
    (total_worker_time / execution_count) / 1000 AS avg_cpu_ms,
    SUBSTRING(st.text, (qs.statement_start_offset/2) + 1, 
        ((CASE qs.statement_end_offset 
            WHEN -1 THEN DATALENGTH(st.text) 
            ELSE qs.statement_end_offset 
          END - qs.statement_start_offset)/2) + 1) AS query_text
FROM 
    sys.dm_exec_query_stats AS qs
CROSS APPLY 
    sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY 
    total_cpu_ms DESC;

This query highlights queries with the highest total CPU time, which mostly correlates with high DTU consumption.



3. Use Azure Portal's Query Performance Insight for Visual Analysis.
Navigate to your Azure SQL Database resource in the Azure Portal, select "Query Performance Insight," and explore the top resource-intensive queries by DTU consumption over a configurable time window.
Learn more: Query Performance Insight Documentation

4. Investigate Execution Plans for Identified Queries.
Analyze execution plans to understand why queries consume high DTUs and target specific improvements such as indexing, query rewriting, or parameter sniffing fixes.

Conclusion

By regularly identifying the most DTU expensive queries using DMV queries and Azure Query Performance Insight, you can effectively target optimization efforts on your Azure SQL Database. 

This proactive monitoring helps maintain smooth performance and cost efficiency without guesswork.

Additional tips:
- Automate these checks with scheduled scripts or alerts.
- Combine resource analysis with Azure SQL Intelligent Insights for AI-driven recommendations.

Friday, May 8, 2026

PlanTrace Now Supports PostgreSQL

PlanTrace Now Supports PostgreSQL PostgreSQL

The same plan analysis you know from Redshift, now for PostgreSQL — including EXPLAIN ANALYZE, buffer stats, and JSON format.

When I built PlanTrace, it was a Redshift-only tool. Paste your EXPLAIN output, get an interactive graph, cost heatmap, and tuning insights — all client-side, nothing stored. It worked well, but Redshift is only part of the picture for most teams.

PostgreSQL is where most development happens, where staging environments live, and where a lot of production workloads run. So I extended PlanTrace to support it — same interface, same zero-friction flow, automatic engine detection.


How it works

Paste your plan. PlanTrace detects whether it's PostgreSQL or Redshift automatically — no dropdown to set, no configuration. The engine badge in the top-right corner shows what was detected, and you can override it if needed.

PostgreSQL plans unlock extra data that Redshift plans don't have: actual timing, actual row counts, loop counts, buffer stats, WAL metrics. When that data is present, PlanTrace surfaces it — misestimate badges, buffer usage sections in the node detail panel, duration bubbles on each node.

Supported EXPLAIN variants

All standard PostgreSQL EXPLAIN formats are supported:

-- Estimated plan only
EXPLAIN SELECT ...;

-- Actual timing and row counts (recommended)
EXPLAIN ANALYZE SELECT ...;

-- Buffer and I/O stats
EXPLAIN (ANALYZE, BUFFERS) SELECT ...;

-- Full detail, JSON format
EXPLAIN (ANALYZE, BUFFERS, WAL, VERBOSE, SETTINGS, FORMAT JSON) SELECT ...;

-- Plan shape without cost numbers
EXPLAIN (COSTS OFF) SELECT ...;

Tip: EXPLAIN ANALYZE actually executes your query. To avoid side-effects on write queries, wrap it: BEGIN; EXPLAIN ANALYZE ...; ROLLBACK;

JSON format is auto-detected and parsed identically to text format. The QUERY PLAN header, separator lines, and trailing (N rows) footer are all stripped automatically — paste the raw output as-is.

What the PostgreSQL engine detects

The insight engine runs a set of PostgreSQL-specific rules on your plan:

  • Sequential scans on large tables — with and without filter predicates
  • Sort nodes spilling to disk (external merge sort)
  • Hash Join and HashAggregate spills (Batches > 1) — with batch count and disk size
  • Nested Loop joins with large outer sides
  • High row count misestimates between planner estimates and actual rows
  • Low-reduction aggregations (near 1:1 input/output ratio)
  • Sort inside a subquery feeding a Hash Join — wasted work

Each finding is prioritized (critical / warning / info), tied to a specific node in the plan, and comes with a concrete recommendation — not generic advice.

Try it

plantrace.studyyourdata.com
Paste any PostgreSQL or Redshift EXPLAIN output and analyze it instantly. Free, browser-based, no account required. Your query plans never leave your browser.

If you run into a plan that doesn't parse correctly or a finding that seems off, I'd like to know. The rule engine is actively being improved — feedback from real plans is the best way to sharpen it.

Wednesday, May 6, 2026

PlanTrace: Stop Reading Redshift EXPLAIN Plans. Start Seeing Them

PlanTrace: Stop Reading Redshift EXPLAIN Plans. Start Seeing Them

Introducing PlanTrace — a free, browser-based tool that turns raw execution plans into interactive graphs and actionable tuning insights.

plantrace.studyyourdata.com
PlanTrace graph view showing an interactive execution plan with color-coded cost nodes, arrows, and the insights drawer open

PlanTrace graph view — color-coded cost nodes, execution flow arrows, and the tuning insights drawer, all in one screen.

If you work with Amazon Redshift, you've been there. You run EXPLAIN on a slow query, and you get back something like this:

XN Limit  (cost=2000000000000.00..2000000000125.00 rows=100 width=256)
  ->  XN Merge  (cost=2000000000000.00..2000000000125.00 rows=100 width=256)
        ->  XN Hash Join DS_DIST_BOTH  (cost=384729183746.25..903741293847.50 rows=32000000000 width=256)
              Hash Cond: ("outer".customer_id = "inner".customer_id)
              ->  XN Hash Join DS_BCAST_INNER  (cost=182938471623.00..723481920384.00 rows=28000000000 width=224)
                    ...

Now imagine 80+ operators deep. Nested joins, multiple redistributions, broadcasts stacked on top of each other, costs in the trillions. You're supposed to manually scan this and figure out what's killing performance.

Execution plans are not logs. They are structured graphs of data movement and compute decisions. They deserve to be treated that way.

That's why I built PlanTrace.


What PlanTrace does

PlanTrace takes your raw EXPLAIN output, parses it into a structured tree, and gives you three ways to reason about it — all in your browser, with no data ever leaving your machine.

The graph view

The moment you paste a plan and click Visualize, PlanTrace renders it as an interactive graph — nodes for every operator, arrows showing execution flow, and a minimap for large plans.

Each node is color-coded by cost intensity using a 6-level semaphore — from green (cheap) to red (expensive). You stop reading plans. You see them. Hotspots are obvious in seconds.

Hover over any node to see operator documentation inline, distribution movement notes (DS_BCAST_INNER, DS_DIST_BOTH, etc.), and cost range, estimated rows, and output width. Click a node to pin its detail panel — useful for digging into join conditions without losing your place in the graph.

The table view

PlanTrace table view showing sortable columns with color-coded cost pills for each operator

Table view — sortable by any column, with cost severity pills making hotspots immediately visible.

Not a graph person? The table view flattens all operators into sortable columns — cost, rows, width, distribution code. Sort by total cost to instantly surface the most expensive operators. Toggle tree-order mode to preserve the parent/child structure while keeping sortable columns.

The cost severity pills make it hard to miss the bad stuff — red means it's on fire, and in a real production query, you'll know immediately which joins to attack first.

The chart view

PlanTrace chart view showing operators plotted on a cost vs estimated rows scatter chart with quadrant labels

Chart view — each dot is a plan operator. Top-left quadrant (high cost, few rows) flags overestimated work that needs ANALYZE. Dot size encodes row width.

The chart view plots every operator on a cost × estimated-rows scatter chart (both log scale). Four quadrants tell you immediately what kind of problem you're dealing with:

  • Top-left (Overestimated Work) — high cost, few rows. Run ANALYZE — statistics are likely stale.
  • Top-right (Balanced) — high cost, many rows. Expected for big operations; focus on join strategy.
  • Bottom-left (Ideal) — low cost, few rows. This is what you're aiming for.
  • Bottom-right (Cardinality Risk) — low cost estimate, many rows. Potential runtime spill.

Dot size encodes row width, so you can also spot wide rows that will blow up memory. Hover any dot for the full operator detail.

The operator legend

PlanTrace operator legend modal showing the Join operator documentation

Every operator type has built-in documentation — click the legend to understand exactly what each operator does and when it becomes a problem.

One of the things I wanted PlanTrace to be is educational, not just diagnostic. Every operator type in the legend links to inline documentation explaining what it does, when it's fast, and when it's a problem. You don't need to have the Redshift docs open in a second tab.

The insights engine

This is the part I'm most proud of. PlanTrace runs a rule-based insight engine across your plan and surfaces prioritized, actionable recommendations sorted by severity — as many as the plan warrants. On a complex query, you might get 14 or more distinct findings.

To give you a sense of what it catches, here's what it found on a real monster query — 118 operators, 27 tables, costs reaching 2 trillion:

118Total operators
14Tuning insights
2TMax total cost
27Tables involved

The engine flagged, among other things:

  • A redistribution chain of 6 DS_DIST_BOTH joins in the same branch — at least 6 full-cluster redistributions before any result is produced
  • 14 broadcast joins (DS_BCAST_INNER), several of them on fact-sized tables with hundreds of millions of rows
  • 8 direct fact-to-fact joins without pre-aggregation — a classic source of runaway cost
  • A useless sort inside a subquery feeding directly into a hash join, which discards sort order anyway
  • Scan amplification of 10,000,000× on one table — reading 1 billion rows to return 100
  • 27 tables flagged for ALTER DISTSTYLE AUTO with ready-to-run SQL for each

Every insight is tied to an actual signal in the plan, no generic advice, no "consider your indexes" stuffs.

PDF export

Once you've analyzed a plan, click ↓ PDF in the toolbar. PlanTrace generates a formatted A4 report with an executive summary, all insights with severity color bands, an operator distribution quadrant breakdown, and the top 10 most expensive operators.

It's a ready-to-share document you can drop into a Slack thread, attach to a Jira ticket, or present to your team without having to explain what an EXPLAIN plan is.

Privacy by design

PlanTrace is entirely client-side. Your query plans never leave your browser. No backend, no analytics on your plans, no storage. Paste your plan, get your insights, close the tab — nothing is retained.

This matters because EXPLAIN output often reveals table names, column names, join structures, and data volumes that you probably don't want floating around in some SaaS vendor's logs.


How to use it

  1. Go to plantrace.studyyourdata.com
  2. Run EXPLAIN <your query>; in your Redshift client and copy the output
  3. Paste it into the left panel and click Visualize
  4. Switch between Graph / Table / Chart views
  5. Open the insights drawer for tuning recommendations
  6. Download the PDF if you want a shareable report

It's free. No account required. No install.

What it's not

PlanTrace is deliberately focused. It is not a query runner, not a monitoring dashboard, and not a replacement for runtime telemetry from views like SVL_QUERY_SUMMARY or STL_ALERT_EVENT_LOG. Those require a live connection and runtime execution data that EXPLAIN alone can't provide.

PlanTrace's job is narrower and more immediate: take a plan you already have, make it understandable, and tell you what to look at first.


PlanTrace is free to use — no account, no install, no data leaves your browser.

plantrace.studyyourdata.com →

If you work with Redshift and have a slow query you've been staring at, paste the EXPLAIN output and see what PlanTrace finds. I'd love to hear what you think — drop a comment below or find me on LinkedIn.

PostgreSQL: Diagnosing and Fixing a Sequential Scan with No Filter Predicate

A sequential scan is not always a problem — PostgreSQL's planner often chooses one correctly for small tables or queries that ...