Monday, June 1, 2026

Data Skew in Data Engineering: What It Is and How to Fix It

You kick off a distributed job expecting it to finish in minutes — but one task keeps running while all others have long since completed. The culprit is almost always data skew. It is one of the most common performance killers in distributed data engineering, yet it is often misdiagnosed or ignored until it starts breaking SLAs.

Data skew happens when data is unevenly distributed across partitions or compute nodes. In systems like Apache Spark, Amazon Redshift, or Azure Synapse, the workload is designed to run in parallel. When one partition holds a disproportionate share of the data — because of a high-frequency key or a poor distribution choice — that one node does all the heavy lifting while the rest sit idle. The job cannot complete until every partition finishes, so your entire pipeline is only as fast as the slowest, most overloaded worker.


1. What causes data skew

The root cause is almost always a skewed join or grouping key. If 60% of your orders belong to a single customer_id, or your event log has millions of rows for one status = 'active' value, Spark or your MPP engine will route all of those rows to the same partition during a shuffle. The result: one executor handles millions of rows while ten others process a few hundred.

In Amazon Redshift, the problem manifests at the storage layer. If your DISTKEY column has low cardinality or a heavily concentrated value distribution, Redshift routes too many rows to the same compute node slice, forcing it to do more work than the others during every query that touches that table.


Types of data skew

Data skew is not a single pattern — it shows up in different shapes depending on the root cause. The four most common types in distributed pipelines:

Key skew — hot partition key

One key (e.g. a single customer_id) floods one partition while all others stay near-empty.

Partition skew — uneven data split

Wide size variation across partitions due to a low-cardinality key like status or country.

Join skew — NULL key explosion

A large share of rows with NULL or a default value all land in the same executor during shuffle.

Temporal skew — time-based spike

Event data clusters at specific timestamps (e.g. end-of-day batch), overloading those time-aligned partitions.

Normal partition Skewed / hot partition Ideal even distribution / spike window

2. Issues data skew causes in your pipelines

The consequences go beyond slow queries. Here are the most damaging effects data skew causes in practice:

  • Pipeline bottlenecks: A single overloaded partition stalls the entire stage. Every downstream task waits on it.
  • Out-of-memory errors: The skewed executor receives far more data than it can fit in memory, causing spills to disk or outright failures.
  • Wasted cluster resources: Most nodes finish quickly and remain idle while one struggles. You are paying for compute that does nothing.
  • Unpredictable SLA breaches: The job duration becomes tied to the size of the hot partition rather than total data volume, making runtime estimates unreliable.
  • Cascading failures in multi-step ETL: A skew-induced timeout in one stage can trigger retries, delay downstream jobs, and corrupt incremental watermarks.

3. Detecting skew before it becomes an issue

In Spark (including Azure Synapse Analytics and Databricks), open the Spark UI and look at the task duration histogram under a shuffle stage. A long tail of one task while all others finish quickly is the signature of skew. You can also profile the key distribution directly before running a heavy join:

# Quick skew check in PySpark
df.groupBy("customer_id")
  .count()
  .orderBy(col("count").desc())
  .show(20)
  

In Redshift, the SVV_TABLE_INFO view exposes a skew_rows column that reports the ratio between the most-populated and least-populated node slices. AWS documentation recommends reviewing the distribution key on any table with a skew_rows value of 4.0 or higher — values below that are generally acceptable natural imbalance. The Identifying tables with data skew guide in the AWS docs walks through how to interpret this output.

-- Check skew_rows ratio in Redshift
SELECT "table", diststyle, skew_rows
FROM   svv_table_info
WHERE  skew_rows >= 4.0
ORDER BY skew_rows DESC;
  

4. Technique: Salting to break hot keys

Salting is the most widely used fix for skewed joins and aggregations. The idea is to append a random integer suffix to the skewed key, artificially spreading one large partition across multiple smaller ones. On the smaller (lookup) side of the join, you replicate each row once per salt value so that both sides still match correctly. After the join, you strip the salt column.

The Azure Synapse Spark performance guide explicitly recommends salting the entire key — or only the subset of skewed keys — as the primary remedy for shuffle-induced slowdowns.

from pyspark.sql import functions as F

SALT_BUCKETS = 10

# Salt the large (skewed) table
df_large = df_large.withColumn(
    "salt", (F.rand() * SALT_BUCKETS).cast("int")
)

# Explode the small (lookup) table to match all salt values
df_small = df_small.withColumn(
    "salt", F.explode(F.array([F.lit(i) for i in range(SALT_BUCKETS)]))
)

# Join on both the original key and the salt
result = df_large.join(df_small, ["customer_id", "salt"]).drop("salt")
  

5. Technique: Broadcast joins for small dimension tables

When one side of a join is small enough to fit in executor memory, a broadcast join eliminates the shuffle entirely. Instead of moving data between nodes, Spark sends a full copy of the small table to every executor, so each worker can join locally. This is the fastest fix when the skewed side is the larger table and the lookup side is compact.

from pyspark.sql.functions import broadcast

result = df_large.join(
    broadcast(df_small), "customer_id"
)
  

In Redshift, the equivalent approach is setting DISTSTYLE ALL on small dimension tables. This stores a full copy of the table on every compute node, so join colocation is guaranteed without any redistribution at query time. The Redshift distribution styles documentation recommends this for slow-moving reference tables under a few hundred megabytes.


6. Technique: Adaptive Query Execution (Spark 3.x)

Since Spark 3.0, Adaptive Query Execution (AQE) can detect and automatically split skewed partitions at runtime without any manual intervention. When AQE identifies a task whose partition size significantly exceeds the median, it breaks that partition into smaller sub-tasks and replicates the corresponding data from the other join side to match. You can enable it with two configuration properties:

spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
  

AQE is a strong first line of defence on modern Spark environments. If it does not resolve the skew fully — for example when the hot key is extreme — fall back to manual salting as described in the previous section.


7. Fixing skew in Amazon Redshift: choosing the right DISTKEY

In Redshift, skew is a storage-level problem. The fix is selecting a distribution key with high cardinality and uniform value distribution. Columns like order_id or event_id make far better distribution keys than status or country_code. The AWS prescriptive guidance on Redshift table design recommends validating cardinality before committing to a key column — if you notice skewness, change the distribution key.

For tables used heavily in joins, align the distribution key with the join column used most often. This ensures that matching rows already reside on the same node slice, eliminating costly data redistribution at query time. If automatic table optimization is enabled, Redshift will manage distribution style changes in the background as table sizes evolve.


Wrapping Up

Data skew is a structural problem that compounds over time: as tables grow, hot keys attract more rows, and pipelines that once ran in minutes begin taking hours. The pattern to follow is consistent — profile key distributions early, detect skew via the Spark UI or system views like SVV_TABLE_INFO in Redshift, and apply the right fix for the context: salting for heavy shuffle joins, broadcast joins when the lookup side is small, AQE for automatic handling in Spark 3.x, and a well-chosen DISTKEY for Redshift table design. The most reliable rule: never use a column with low cardinality as a partition or distribution key.


Friday, May 29, 2026

5 SQL Tricks Worth Remembering Before You Close the Laptop

Disclosure: this post may contain links to books as an affiliate link. If you purchase through it, this site may earn a small commission at no extra cost to you.

It is Friday. The sprint is done, the deploys are out, and nobody is scheduling a meeting in the next 30 minutes. This one is short — five SQL tricks that are genuinely useful, easy to forget, and satisfying to rediscover. No deep dives, no architecture decisions. Just good SQL to carry into next week.


1. Fill missing dates in a report with generate_series (PostgreSQL)

Your daily sales query skips dates with zero transactions. Instead of patching this in application code, generate a full date spine in the database and left join your data against it. The generate_series function handles this cleanly — no temp tables, no loops.

SELECT
    d.day::date,
    COALESCE(SUM(s.amount), 0) AS total_sales
FROM
    generate_series(
        '2025-05-01'::timestamp,
        '2025-05-31'::timestamp,
        '1 day'::interval
    ) AS d(day)
LEFT JOIN sales s
    ON s.sale_date = d.day::date
GROUP BY d.day
ORDER BY d.day;
  

2. Get the last day of any month without a calendar lookup (SQL Server)

Month-end reporting always needs the last day of the month, and it changes every month, and February exists. EOMONTH handles leap years, 30-day months, and everything else automatically. The optional second argument lets you offset forward or backward.

-- Last day of the current month
SELECT EOMONTH(GETDATE())  -- 2025-05-31
 
-- Last day of the previous month
SELECT EOMONTH(GETDATE(), -1) -- 2025-04-30
 
-- First day of the current month
SELECT DATEADD(DAY, 1, EOMONTH(GETDATE(), -1)) -- 2025-05-01
  

3. Use a CTE to avoid repeating yourself (PostgreSQL & SQL Server)

If you find yourself copy-pasting the same subquery twice in one statement, a Common Table Expression cleans it up immediately. CTEs are not just for readability — they also prevent you from calculating the same thing twice and getting different results if the underlying data changes mid-query.

WITH monthly_totals AS (
    SELECT
        DATE_TRUNC('month', order_date) AS month,
        SUM(amount)                       AS revenue
    FROM orders
    GROUP BY 1
)
SELECT
    month,
    revenue,
    ROUND(revenue * 100.0 / SUM(revenue) OVER (), 2) AS pct_of_total
FROM monthly_totals
ORDER BY month;
  

4. Spot duplicate rows instantly with GROUP BY and HAVING

Before running a deduplication job, it pays to know exactly which keys are duplicated and how many copies exist. This pattern works identically across PostgreSQL, SQL Server, MySQL, and Redshift — no library needed, just standard SQL.

SELECT
    customer_id,
    email,
    COUNT(*) AS occurrences
FROM customers
GROUP BY
    customer_id,
    email
HAVING COUNT(*) > 1
ORDER BY occurrences DESC;
  

Add more columns to the GROUP BY to tighten the definition of "duplicate" for your specific case. The result set tells you exactly what to target before touching any data.


5. LAG() to compare a row against the previous one (PostgreSQL & SQL Server)

Week-over-week, day-over-day, month-over-month — any time you need to compare a metric against the prior row, LAG() saves you a self-join. It is a window function available in PostgreSQL, SQL Server, MySQL 8+, and Redshift.

SELECT
    sale_date,
    daily_revenue,
    LAG(daily_revenue) OVER (ORDER BY sale_date)
        AS prev_day_revenue,
    ROUND(
        (daily_revenue - LAG(daily_revenue) OVER (ORDER BY sale_date))
        * 100.0
        / NULLIF(LAG(daily_revenue) OVER (ORDER BY sale_date), 0),
    2) AS pct_change
FROM daily_sales
ORDER BY sale_date;
  

The NULLIF(..., 0) wrapper on the denominator prevents a division-by-zero error on days where the previous revenue was zero — a small detail that saves a lot of Friday-afternoon debugging.


Wrapping Up

None of these are exotic — generate_series, EOMONTH, CTEs, HAVING, and LAG() are bread-and-butter SQL. They just tend to get forgotten under deadline pressure and rediscovered on a quiet Friday. Keep them close. The one worth memorizing cold: NULLIF on every division denominator. Division by zero is always the error you find in production.

If this kind of thing clicks for you and you want more of it — the patterns, the thinking behind the queries, the instinct for when to reach for a window function versus a subquery — Practical SQL, 2nd Edition by Anthony DeBarros is one of the best books on the subject. It is written for people who want to actually understand their data, not just run queries.


Monday, May 25, 2026

PostgreSQL: Diagnosing and Fixing a Sequential Scan with No Filter Predicate

A sequential scan is not always a problem — PostgreSQL's planner often chooses one correctly for small tables or queries that return most of a table's rows. The dangerous variant is the sequential scan with no filter predicate: the plan reads every row in the table and returns all of them, with zero selectivity applied at the scan level. This pattern means the query has no WHERE clause narrowing the result, or the join/filter column has no index backing it. On any large table in a transactional workload it produces unnecessary I/O, longer runtimes, and contention under load. This post shows how to spot it and what to do about it.


1. Capture the execution plan

Run EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT) against the slow query. The BUFFERS option exposes how many shared-buffer pages were read from disk versus cache, making the real I/O cost visible alongside the plan structure.

EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM your_table;

-- The warning sign in the output:
-- Seq Scan on your_table  (cost=0.00..XXXXX rows=NNNNN width=NN)
--                         (actual time=X.XXX..X.XXX rows=NNNNN loops=1)
--   Buffers: shared read=XXXXX
-- (No "Filter:" line beneath the Seq Scan node = no predicate pushing down)
  

2. Visualize and confirm the issue in PlanTrace

Copy the full plan text and paste it into PlanTrace, my free browser-based EXPLAIN visualizer for PostgreSQL and Amazon Redshift. It will render the execution plan as an interactive node graph, flags the costliest nodes with a color-coded cost semaphore, shows buffer usage per node, and surfaces tuning insights automatically, and like this posts, will include the absence of a filter predicate on a sequential scan. 

In the node graph, a Seq Scan node with no child filter annotation and a high shared read buffer count is the fingerprint of this issue. The cost semaphore on that node will typically appear in red or orange on a large table, making it easy to prioritize against other nodes in a complex plan.


3. Confirm the table is large enough to decide your next action

Not every sequential scan needs fixing. If the table has only a few thousand rows, the planner's choice is almost always correct — a sequential scan on a small table is faster than the overhead of an index lookup. Query pg_class to check the row estimate before proceeding.

SELECT
    relname                              AS table_name,
    reltuples::bigint                    AS estimated_rows,
    pg_size_pretty(pg_total_relation_size(oid)) AS total_size
FROM   pg_class
WHERE  relname = 'your_table'
  AND  relkind = 'r';

-- If estimated_rows is under ~1,000, the sequential scan is likely correct. Stop here.
  

4. Identify the missing filter or join predicate

The root cause is almost always one of two things: the query has no WHERE clause at all, or it has one but the filtered column has no supporting index. Check scan counters in pg_stat_user_tables to understand how often the table is being hit with sequential versus index scans, and verify which indexes currently exist.

-- Scan counter ratio
SELECT
    relname        AS table_name,
    seq_scan,
    seq_tup_read,
    idx_scan,
    idx_tup_fetch,
    last_analyze,
    last_autoanalyze
FROM   pg_stat_user_tables
WHERE  relname = 'your_table';

-- Existing indexes
SELECT indexname, indexdef
FROM   pg_indexes
WHERE  tablename = 'your_table';
  

5. Add a WHERE clause and create the right index

The remediation follows directly from the diagnosis. If the query has no filter, add one — most production queries have a natural narrow condition such as a date range, a status value, or a foreign key. If the filter column exists but has no index, create one. Use CONCURRENTLY on production systems to avoid blocking writes during the build, as described in the PostgreSQL CREATE INDEX documentation. For frequently queried subsets, a partial index covers only the qualifying rows, keeping the structure compact. For queries that project only a few columns, an index with INCLUDE enables an index-only scan and removes the need for a heap access entirely.

-- Option A: Standard B-tree on the filter or join key
CREATE INDEX CONCURRENTLY idx_your_table_col
    ON your_table (filter_column);

-- Option B: Partial index — only index the rows you actually query
CREATE INDEX CONCURRENTLY idx_your_table_partial
    ON your_table (filter_column)
    WHERE status = 'active';

-- Option C: Covering index — avoid heap access for narrow SELECT lists
CREATE INDEX CONCURRENTLY idx_your_table_covering
    ON your_table (filter_column)
    INCLUDE (col_a, col_b);
  

6. Keep statistics current so the planner uses the index

An index alone is not enough. PostgreSQL's cost-based planner relies on table statistics to decide whether an index scan is cheaper than a sequential scan. If statistics are stale — because the table grew rapidly or autovacuum thresholds are too loose — the planner's row-count estimates will be wrong and it may still choose a sequential scan even with a valid index in place. Run ANALYZE manually after bulk loads, and tighten per-table autovacuum settings on high-write tables.

-- Refresh statistics after a bulk load or after creating the index
ANALYZE your_table;

-- Tighten autovacuum thresholds on a high-write table
ALTER TABLE your_table SET (
    autovacuum_analyze_scale_factor  = 0.01,
    autovacuum_analyze_threshold     = 500
);
  

7. Re-run the plan and verify the fix in PlanTrace

After creating the index and running ANALYZE, re-execute EXPLAIN (ANALYZE, BUFFERS) on the same query and paste the new plan into PlanTrace. The insight should be gone (or show new ones) — confirming that the optimization took effect.


Wrapping Up

A sequential scan with no filter predicate is a full-table read with no early exit: every page is touched, every row is returned, and nothing is filtered at the scan level. The pattern applies to any table in any schema — the fix is always the same combination: add a selective WHERE clause, back it with the right index type (standard, partial, or covering), and run ANALYZE so the planner has accurate statistics to make the right choice. 
PlanTrace makes both steps easier — paste the plan before and after the change and the node graph shows immediately whether the sequential scan has been replaced and how much the buffer cost dropped.


Thursday, May 21, 2026

Stop Using Pandas for Aggregations — Try DuckDB Instead

If you've ever loaded a 2 GB CSV into pandas just to run a few aggregations — and watched your machine struggle — there's a better tool for the job. DuckDB is a free, open-source, in-process OLAP database that runs SQL directly inside your application, with no server to set up and no external dependencies. Think of it as SQLite for analytics: install it in seconds, query Parquet files, CSVs, JSON, or even pandas DataFrames, and get columnar-speed performance on your laptop.


1. Installing DuckDB

DuckDB is available for Python, R, the CLI, and several other runtimes. The most common way to get started is via pip or the CLI binary from the official installation page.

# Python
pip install duckdb
 
# R
install.packages("duckdb")
 
# macOS CLI via Homebrew
brew install duckdb
 
# Windows CLI via winget
winget install DuckDB.cli
  

2. What "in-process" actually means

Traditional databases run as a separate server process. Your application connects over a network socket, serializes the query, waits for the server to deserialize and execute it, and then serializes the result back. That overhead adds up quickly, especially for analytical workloads. DuckDB runs inside your process — no server, no network round-trip. When you use the Python package, for example, DuckDB can run queries directly on an existing pandas DataFrame without ever copying the data. This makes it extremely fast for ad-hoc analytics and ETL transformations.

# Install
pip install duckdb
 
# Query a pandas DataFrame directly — no data copy
import duckdb, pandas as pd
 
df = pd.read_csv("sales.csv")
 
result = duckdb.sql("""
    SELECT region, SUM(amount) AS total
    FROM df
    GROUP BY region
    ORDER BY total DESC
""").df()
  

3. Querying files directly — no table creation needed

One of DuckDB's most practical features is its ability to query files directly from the FROM clause. You can point it at a local Parquet file, a remote CSV over HTTPS, or even an S3 bucket — without defining a schema or creating a table first. DuckDB auto-detects column names and types, so exploratory work is fast. This is documented in the official DuckDB Guides.

-- Query a local Parquet file
SELECT customer_id, COUNT(*) AS orders
FROM   'orders_2024.parquet'
WHERE  status = 'completed'
GROUP BY customer_id
ORDER BY orders DESC
LIMIT  10;
 
-- Or query a remote CSV over HTTPS (httpfs extension auto-loads)
SELECT COUNT(*) FROM 'https://data.example.com/events.csv';
  

4. Where DuckDB fits in a data engineering workflow

DuckDB is not a replacement for SQL Server, Redshift, or PostgreSQL in transactional or multi-user systems. Its strength is in local analytical processing — the kind of work that typically lands in a Python script or Jupyter notebook. Common patterns include:

  • ETL pre-processing: filter, join, and reshape raw CSV or Parquet files before loading into a data warehouse.
  • Format conversion: read JSON or CSV and write out clean Parquet with a single SQL statement using COPY ... TO.
  • Ad-hoc analytics on exports: query a data warehouse export locally without spinning up another cloud service.
  • Privacy-first analysis: process sensitive data (health records, financial statements) entirely on-premises, with data never leaving the machine.
-- Convert a CSV to Parquet in one shot
COPY (
    SELECT *
    FROM   'raw_data.csv'
    WHERE  event_date >= '2024-01-01'
) TO 'clean_data.parquet' (FORMAT PARQUET);
  

5. Columnar execution — why it's fast

DuckDB uses a columnar-vectorized query engine. Instead of processing one row at a time (like PostgreSQL or MySQL do internally for OLTP), DuckDB processes large batches of column values together in a single CPU operation. This reduces the per-value overhead dramatically, making aggregations and scans over wide tables significantly faster than row-oriented engines. For a data engineer, this means you can comfortably run GROUP BY queries over tens of millions of rows on a standard laptop in seconds.


Wrapping Up

DuckDB fills a gap that pandas alone struggles with: fast, SQL-native, columnar analytics that run in-process with zero infrastructure. It reads Parquet, CSV, and JSON natively; integrates cleanly into Python and R workflows; and handles files too large for memory by spilling to disk automatically. For data engineers doing local transformations, format conversions, or exploratory analysis on raw exports, DuckDB is often the fastest path from raw data to insight. Start with the official DuckDB documentation — the installation takes under a minute, and the CLI is ready to query immediately.


Monday, May 18, 2026

Optimizing Redshift Performance by Configuring WLM Queues

Efficient query performance in Amazon Redshift often comes down to how well you manage workload concurrency. Redshift's Workload Management (WLM) queues enable you to control how queries share resources, helping avoid bottlenecks during peak loads. Properly configuring WLM queues ensures critical queries get the resources they need while maximizing cluster throughput.

This post will guide you through the basics of WLM queue configuration and share actionable steps to optimize query execution and concurrency for your Redshift workloads.


Automatic WLM vs. Manual WLM

Before diving into manual configuration, it's important to note that Amazon recommends Automatic WLM as the default approach for most workloads. With Automatic WLM, Redshift dynamically determines resource utilization as queries arrive and adjusts concurrency and memory allocation on the fly — without requiring manual tuning. It maximizes total throughput and handles unpredictable or changing workloads more efficiently than static manual configuration.

Manual WLM is best suited for scenarios where your workload patterns are predictable, you need to guarantee multiple workload types run simultaneously with fixed resource allocations, or you need to throttle certain query types at specific times of day. If you proceed with manual WLM, be prepared to monitor and tune it regularly as workloads evolve.


Step-by-Step Guide (Manual WLM)

1. Understand Redshift WLM Queues and Slots
WLM enables you to define queues with resource slots. Each query consumes one or more slots depending on queue configuration. Slots allocate memory and concurrency allowing multiple queries to run simultaneously without resource contention.
Learn more about WLM fundamentals here: Redshift WLM Documentation.


2. Analyze Your Query Workload
Identify query types and their resource demands. Separate lengthy ETL or analytical queries from short, high-priority reporting queries. Segmenting workloads allows for assigning appropriate queue configurations matching query resource profiles.


3. Create or Modify Queues Based on Workload Types
For example, create a "high_priority" queue with fewer concurrency slots but higher memory allocation for fast-running reports, and a "default" queue for ad-hoc or batch queries running with more concurrency but less memory per query.


4. Adjust Slots and Memory Allocations
Use the AWS Console or Redshift parameter groups to configure slots via the wlm_json_configuration parameter. An example config snippet:

-- Example JSON for WLM queue definition in parameter group:
[
  {
    "query_group": "high_priority",
    "concurrency": 5,
    "memory_percent_to_use": 30
  },
  {
    "query_group": "default",
    "concurrency": 10,
    "memory_percent_to_use": 20
  }
]

This example balances memory and concurrency to prioritize faster queries effectively.


5. Monitor and Tune Using System Views
Query views like STV_WLM_QUERY_STATE and STL_WLM_QUERY to analyze queue usage, wait times, and query runtimes. Use this insight to iteratively adjust concurrency and memory settings.


Conclusion

Optimizing Redshift WLM queues is key to balancing concurrency and performance in mixed workloads. For most users, Automatic WLM is the recommended starting point. If manual WLM better fits your use case, classifying queries, assigning queues based on priority, and tuning memory and slot allocations will ensure efficient resource use and a better experience for high-priority queries.
Regular monitoring and tuning with Redshift system views complete the feedback loop for continuous improvement.

Official Documentation for more details:
Amazon Redshift Workload Management


Wednesday, May 13, 2026

Data Engineering Books Worth Having on Your Shelf (or your tablet)

Good documentation gets you started. Good books get you deep. After years of working with cloud data platforms, SQL engines, and machine learning pipelines, a handful of titles keep coming up as genuinely useful — not just for beginners, but for practitioners who want to go beyond the basics. Here are five books that cover the tools this blog is built around.

Disclosure: the links below are Amazon affiliate links. If you purchase through them, this site earns a small commission at no extra cost to you.


Generative AI on AWS

Chris Fregly, Antje Barth & Shelbee Eigenbrode — O'Reilly

Covers how to build context-aware, multimodal reasoning applications on AWS using generative AI. Written by AWS practitioners, it bridges the gap between the hype and the actual engineering work — prompt engineering, RAG pipelines, fine-tuning, and deploying models at scale. Useful for data engineers who are being asked to build AI-adjacent infrastructure and want a grounded, AWS-native perspective.


Amazon Redshift: The Definitive Guide

Tyler Scherr & James Tunick — O'Reilly

The most thorough single reference for Redshift available in print. Goes well beyond the documentation: cluster architecture, distribution styles, sort keys, workload management, Redshift Spectrum, and data sharing. Whether you are running a provisioned cluster or Redshift Serverless, this book will save you hours of trial and error on performance and cost optimization.


Data Science on AWS

Chris Fregly & Antje Barth — O'Reilly

A hands-on guide to building end-to-end, continuous AI and ML pipelines on AWS. Covers SageMaker, data ingestion, feature engineering, model training, deployment, and monitoring — with real code throughout. The focus on continuous pipelines rather than one-off notebooks makes this especially relevant for teams trying to productionize their ML work.


Just Use Postgres!

Brandur Leach

The title says it plainly. PostgreSQL is a remarkably capable database that teams routinely underutilize before reaching for more complex infrastructure. This book is written for application developers and database practitioners who want to squeeze more out of Postgres — from indexing strategies and transactions to JSON, full-text search, and advanced concurrency patterns. A practical counter-argument to unnecessary architectural complexity.


Practical SQL, 2nd Edition

Anthony DeBarros — No Starch Press

SQL books tend to be either too shallow or too academic. This one hits the right balance. Written by a journalist and data analyst, it teaches SQL through real datasets and actual analysis problems using PostgreSQL. The focus on finding the story inside data — rather than just memorizing syntax — makes the concepts stick. Concepts transfer directly to MySQL, SQL Server, Redshift, and most other relational engines.


A very personal note on buying technical books

Technical books go out of date, but the fundamentals they cover rarely do.
A well-chosen book on Redshift architecture or SQL query design will still be useful three or more years after purchase — the core concepts around distribution, joins, and execution plans do not change with every minor release.


Monday, May 11, 2026

Quick Tip: Check Most DTU Expensive Queries in Azure SQL Database

Optimizing Azure SQL Database performance often begins with identifying the most resource-intensive queries. Understanding which queries consume the most DTU (Database Transaction Unit) helps prioritize tuning efforts and improve overall efficiency.

Azure SQL Database provides built-in tools to easily surface the queries with the highest DTU usage, making it straightforward for data engineers and DBAs to pinpoint performance bottlenecks.

Step-by-Step Guide

1. Connect to Your Azure SQL Database Using VSCode or SQL Server Management Studio (SSMS).
These tools allow you to run queries directly against your database.

2. Query the sys.dm_db_resource_stats DMV to Get Recent Resource Usage Data.
This dynamic management view shows DTU usage metrics aggregated for your database and is updated every 15 seconds. You can correlate this with queries identified in query stats views.
Example query to find top DTU-consuming queries:

SELECT TOP 10 
    qs.sql_handle,
    qs.plan_handle,
    total_worker_time/1000 AS total_cpu_ms,
    total_logical_reads, 
    execution_count,
    (total_worker_time / execution_count) / 1000 AS avg_cpu_ms,
    SUBSTRING(st.text, (qs.statement_start_offset/2) + 1, 
        ((CASE qs.statement_end_offset 
            WHEN -1 THEN DATALENGTH(st.text) 
            ELSE qs.statement_end_offset 
          END - qs.statement_start_offset)/2) + 1) AS query_text
FROM 
    sys.dm_exec_query_stats AS qs
CROSS APPLY 
    sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY 
    total_cpu_ms DESC;

This query highlights queries with the highest total CPU time, which mostly correlates with high DTU consumption.



3. Use Azure Portal's Query Performance Insight for Visual Analysis.
Navigate to your Azure SQL Database resource in the Azure Portal, select "Query Performance Insight," and explore the top resource-intensive queries by DTU consumption over a configurable time window.
Learn more: Query Performance Insight Documentation

4. Investigate Execution Plans for Identified Queries.
Analyze execution plans to understand why queries consume high DTUs and target specific improvements such as indexing, query rewriting, or parameter sniffing fixes.

Conclusion

By regularly identifying the most DTU expensive queries using DMV queries and Azure Query Performance Insight, you can effectively target optimization efforts on your Azure SQL Database. 

This proactive monitoring helps maintain smooth performance and cost efficiency without guesswork.

Additional tips:
- Automate these checks with scheduled scripts or alerts.
- Combine resource analysis with Azure SQL Intelligent Insights for AI-driven recommendations.

Data Skew in Data Engineering: What It Is and How to Fix It

You kick off a distributed job expecting it to finish in minutes — but one task keeps running while all others have long since ...