Thursday, June 25, 2026

Setting Up a Mac for Data Engineering and AI Work

If you work with data pipelines, SQL, notebooks, or machine learning models, a Mac with Apple Silicon is genuinely one of the best machines you can have as a daily driver. The unified memory architecture means your CPU and GPU share the same memory pool, which matters a lot when you are running Docker containers, a local Postgres instance, Jupyter, and an AWS CLI session all at once without the machine breaking a sweat. This post covers which model to pick and how to get the whole stack running from scratch. Some of the product links below are affiliate links, meaning I may earn a small commission if you purchase through them, at no extra cost to you.


1. Picking the right model

The current MacBook line runs entirely on M5 chips. For most data engineers doing cloud-first work — Redshift, Glue, dbt, Airflow — the MacBook Air M5 with 16GB is more than enough. If you run heavier local workloads like training models, spinning up multiple containers, or working with large DataFrames in memory, the MacBook Pro 14" M5 Pro with 24GB is the right step up. The Pro has active cooling, which matters when you push it hard for extended periods. The 36GB M5 Max is only worth it if local deep learning training is a regular part of your day, otherwise you are paying for headroom you will rarely use.


2. First thing: Homebrew and Xcode tools

Everything else in this setup depends on Homebrew, the package manager for macOS. Before installing it, you need Apple's command line developer tools, which also gives you Git.

# Install Xcode command line tools
xcode-select --install

# Install Homebrew
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

# Add Homebrew to your PATH (Apple Silicon path)
echo 'eval "$(/opt/homebrew/bin/brew shellenv)"' >> ~/.zprofile
eval "$(/opt/homebrew/bin/brew shellenv)"

3. Python environment management with uv

Forget installing Python directly or relying on conda for everything. The current best practice on Apple Silicon is uv, a fast Python package and project manager that handles Python versions and virtual environments without polluting your global install. It is significantly faster than pip and plays well with Jupyter.

# Install uv
brew install uv

# Create a new project environment
mkdir my-de-project && cd my-de-project
uv init

# Add packages (same idea as pip install)
uv add pandas sqlalchemy boto3 jupyterlab scikit-learn

# Launch Jupyter inside the project environment
uv run --with jupyter jupyter lab

4. Core tools via Homebrew

These cover the daily stack for data engineering: database clients, cloud CLIs, container runtime, and a better terminal.

# AWS CLI and tools
brew install awscli

# PostgreSQL client (psql without the full server)
brew install libpq
brew link --force libpq

# Docker via OrbStack (lighter than Docker Desktop on Apple Silicon)
brew install orbstack

# Git, jq, and wget
brew install git jq wget

# VS Code
brew install --cask visual-studio-code

5. PyTorch with Metal GPU acceleration

Apple Silicon GPUs run PyTorch through the Metal Performance Shaders (MPS) backend, which gives you real GPU acceleration for model training without needing CUDA. It works natively on macOS 12.3 or later and PyTorch picks it up automatically.

# Install PyTorch (Apple Silicon native build)
uv add torch torchvision torchaudio

# Verify MPS is available in Python
import torch
print(torch.backends.mps.is_available())  # should print True

# Move a model to the MPS device
device = "mps" if torch.backends.mps.is_available() else "cpu"
model = model.to(device)

One thing worth knowing: MPS does not support every PyTorch operation yet. If you hit an unsupported op, set the environment variable PYTORCH_ENABLE_MPS_FALLBACK=1 and PyTorch will silently fall back to CPU for that specific operation while keeping everything else on the GPU.


6. One limitation to consider

Macs do not support CUDA, and that is a real constraint if your production training runs on NVIDIA GPUs or if your team uses CUDA-specific libraries. The practical answer most engineers land on is using the Mac for development, prototyping, and running notebooks, then pushing actual training jobs to AWS SageMaker, Google Colab, or a cloud GPU instance. The unified memory architecture makes the Mac excellent for loading large quantized models locally — a 7B or 13B parameter model fits comfortably in 24GB unified memory — but for serious fine-tuning or multi-GPU training, cloud is still the right call.


A solid machine that gets out of your way

The real reason data engineers gravitate toward Macs is not any single spec — it is the combination of a Unix shell that works the way you expect, excellent battery life, and hardware that handles a full data stack locally without fan noise or thermal throttling on everyday tasks. Getting to a productive environment takes less than an hour with Homebrew, uv, and OrbStack in place. After that, you have PostgreSQL, Docker, AWS CLI, Jupyter, and PyTorch with GPU acceleration all running natively on Apple Silicon, which is a genuinely capable local setup for most data and AI workflows.


Friday, June 19, 2026

Five Ways Redshift Serverless Quietly Eats Your Budget

It is Friday, the queries are running, and nobody is watching the bill. That is the whole charm of Redshift Serverless: you stop thinking about nodes and resizes. It is also exactly how money slips out the door. Below are five habits that run up RPU charges while you look the other way, all backed by the docs.


1. Running with no maximum RPU-hours limit

By default, Serverless scales to meet load and meters you per second. With no ceiling on total consumption, one rough week of ad-hoc queries can scale well past whatever number you had in your head. Set a daily, weekly, or monthly RPU-hours limit on the workgroup and pick an action when it trips: alert, log, or turn off user queries entirely. See maximum RPU hours usage limit for the setup.

aws redshift-serverless create-usage-limit \
  --resource-arn arn:aws:redshift-serverless:us-east-1:123456789012:workgroup/analytics-wg \
  --usage-type serverless-compute \
  --amount 100 \
  --period weekly \
  --breach-action deactivate
  

2. Leaving idle sessions with open transactions

An open transaction keeps compute alive. If a session sits idle with a transaction still open, RPUs keep getting consumed until the session closes, and Serverless will wait up to six hours before ending it for you. The advice in the billing guide is blunt: close your transactions, and resist the urge to extend SESSION TIMEOUT unless a specific use case demands it.


3. Letting connection pools spam health checks

Here is the sneaky one. Serverless counts every incoming query as billable user activity, and that includes the lightweight keepalive pings your connection pool fires on a schedule. A chatty pool with an aggressive validation interval can keep compute warm during hours when no real work is happening. Check your pool's keepalive and validation settings, because those tiny queries add up.


4. Skipping the MaxRPU scaling cap

The RPU-hours limit from item 1 caps your total spend over a period. MaxRPU is a different lever: it caps how high you can scale at any single moment. Without it, one heavy query is free to grab a large slice of compute all at once. Set MaxRPU to the highest burst level you are actually willing to pay for, and Serverless will scale within that line. The mechanics live in the compute capacity docs.


5. Oversizing the base capacity

Base RPU is the floor that is always ready to serve queries, and the default sits at 128. If your steady-state workload only needs 32, you are paying for headroom you rarely touch. Start the base lower and let autoscaling cover the spikes. You can adjust it anywhere from 8 to 512 in steps of 8, at any time, with no impact on running queries.


Wrapping Up

None of these require heroic tuning. Set an RPU-hours limit and a MaxRPU cap, keep transactions short, audit your pool's keepalive, and right-size the base. The one habit worth building today: configure usage limits from day one rather than after the first surprising invoice arrives.

Monday, June 15, 2026

Killing Explicit Sort Steps in Redshift with the Right SORTKEY

You paste a slow Redshift query plan into PlanTrace and one of the tuning insights reads "SORTKEY optimization candidate", pointing at an explicit sort step that's adding cost you don't need to pay. 

A sort step runs at query time, every time, and on a wide table with millions of rows it can dominate the total cost. 

The fix is usually a SORTKEY that matches what the query is sorting on.


What an explicit sort step actually is

When your query has an ORDER BY, a window function, or a merge join that needs ordered input, Redshift has two options. If the data is already physically stored in the order it needs, it just reads it. If not, it sorts the rows on the fly. 

That on the fly sort is the explicit sort step, and it shows up as an XN Sort operation with its own Sort Key line and a cost range. 

Spotting it in raw EXPLAIN text means scanning indentation and matching cost numbers by eye. This is where reading the plan as a graph helps: PlanTrace renders each node with its cost broken out, so the expensive sort is obvious instead of buried, and the insight names the exact table and column behind it. The underlying mechanics are documented in the AWS guide on reviewing query plan steps.


What causes it?

The root cause is simple: the order the query needs does not match the physical order the table is stored in. A table with no SORTKEY, or with a SORTKEY on columns the query doesn't use, forces Redshift to materialize and sort the result set at runtime. The wider the rows and the larger the row count, the more expensive that becomes. In the graph you'll see the cost concentrated on the sort while the scans underneath look comparatively cheap, which is exactly the pattern the SORTKEY insight keys off of.


How to fix it

Define a SORTKEY that matches the column the query orders or joins on. When the table is already sorted that way, Redshift can skip the runtime sort and read rows in order straight from disk. 

Add the column to the SORTKEY definition with ALTER TABLE, which lets you change sort keys on existing tables without recreating them and without blocking concurrent reads or writes, as noted in the Redshift sort key recommendation announcement.

-- Add a compound sort key matching your ORDER BY / join column
ALTER TABLE lineitem
ALTER COMPOUND SORTKEY (l_quantity);

-- Or let Redshift manage it automatically
ALTER TABLE lineitem ALTER SORTKEY AUTO;
  

After the change, sort the existing data so the new key takes effect, then re-run EXPLAIN:

VACUUM SORT ONLY lineitem;
ANALYZE lineitem;

EXPLAIN
SELECT l_quantity, sum(l_extendedprice)
FROM lineitem
GROUP BY l_quantity
ORDER BY l_quantity;
  

Just note that on very large tables, ALTER table and VACUUM can take significant time to complete.  

Paste the new plan back into PlanTrace and compare it against the old one. 

If the SORTKEY matches, the XN Sort node either drops out of the plan or its cost falls sharply, and the candidate insight no longer fires. That before and after check is the only reliable confirmation. 

A SORTKEY that doesn't line up with the query's order changes nothing. Everything runs client side in your browser, so your plans are never stored or sent anywhere.


Wrapping up

An explicit sort step is Redshift telling you the data isn't stored in the order your query wants. Match the SORTKEY to the ORDER BY or join column, vacuum and analyze, then compare the plans to confirm the sort is gone. Letting PlanTrace surface the candidate and verify the result beats squinting at cost numbers in raw text. And remember that a sort key only helps queries that actually filter, join, or order on those columns, so optimize for the patterns that matter most.


Friday, June 12, 2026

5 Starter Projects for Your AI and Data Engineering Portfolio

Reading tutorials is fine. Shipping something is better. If you are trying to break into data engineering or AI, nothing on your resume carries more weight than a GitHub repo with working code and a problem you actually solved. These five projects are designed to give you hands-on experience with real tools while producing portfolio artifacts you can point to in an interview.


Project 1 — Automated ETL Pipeline with Scheduling

Tech Stack: Python, PostgreSQL, Apache Airflow, Docker

Build a pipeline that pulls data from a public API (weather, exchange rates, or any open dataset), transforms it with Python, and loads it into PostgreSQL on a schedule. Airflow handles the orchestration, and Docker keeps the environment reproducible. This project teaches you the core ETL loop and gives you a DAG you can walk through in any data engineering interview. The scheduling angle forces you to think about idempotency and failure handling from day one.

# Minimal Airflow DAG skeleton
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime
 
def extract(): ...
def transform(): ...
def load(): ...
 
with DAG("etl_pipeline", start_date=datetime(2025, 1, 1), schedule="@daily") as dag:
    t1 = PythonOperator(task_id="extract", python_callable=extract)
    t2 = PythonOperator(task_id="transform", python_callable=transform)
    t3 = PythonOperator(task_id="load", python_callable=load)
    t1 >> t2 >> t3
  

Project 2 — Cloud Data Warehouse on AWS

Tech Stack: AWS S3, Amazon Redshift Serverless, Python (boto3), SQL

Upload a dataset to S3, provision a Redshift Serverless workgroup, and load the data using the COPY command. Write a set of analytical queries against it. The goal is not just to run queries — it is to understand how a cloud warehouse differs from a local database: columnar storage, distribution keys, and the cost model. This project demonstrates cloud data skills that show up in virtually every modern data engineering job description. Use the AWS Free Tier to keep costs at zero while learning.

-- Load from S3 into Redshift
COPY sales
FROM 's3://your-bucket/sales.csv'
IAM_ROLE 'arn:aws:iam::123456789:role/RedshiftRole'
FORMAT AS CSV
IGNOREHEADER 1;
  

Project 3 — ETL Pipeline with Data Factory in Microsoft Fabric

Tech Stack: Data Factory in Microsoft Fabric, Microsoft Fabric Lakehouse, OneLake, SQL Server, Python (Fabric Notebook)

Microsoft has consolidated its data integration story into Microsoft Fabric, and Data Factory in Microsoft Fabric is its next-generation replacement for Azure Data Factory. Build a pipeline that ingests data from SQL Server or a flat file, transforms it using a Fabric Notebook (Python/PySpark), and lands the result in a Fabric Lakehouse backed by OneLake. This project exposes you to the unified Fabric workspace model — one platform for pipelines, notebooks, warehouses, and Power BI — which is exactly where Microsoft data engineering is heading. A free Fabric trial requires no credit card and gives you full access to build this end to end.


Project 4 — RAG Chatbot over Your Own Documents

Tech Stack: Amazon Bedrock, Claude or Llama 3 (via Bedrock), pgvector (PostgreSQL), Python, LangChain

Take a set of PDF or text documents you own, chunk them, generate embeddings using a Bedrock foundation model, store the vectors in PostgreSQL with the pgvector extension, and build a simple question-answering interface on top. This is Retrieval Augmented Generation (RAG) in its simplest form. It teaches you the full AI data pipeline: ingestion, embedding, vector search, and prompt construction. RAG is one of the most in-demand AI engineering patterns in production today.

# Store a document embedding in pgvector
INSERT INTO documents (content, embedding)
VALUES (%s, %s::vector);
 
-- Semantic similarity search
SELECT content
FROM   documents
ORDER BY embedding <-> '[query_vector]'::vector
LIMIT 5;
  

Project 5 — End-to-End ML Pipeline with Feature Engineering

Tech Stack: Python, pandas, scikit-learn, SQL Server or PostgreSQL, MLflow

Pick a structured dataset with a clear prediction target (churn, sales forecast, classification). Build the full cycle in Python: data extraction from a relational database, feature engineering, model training, and experiment tracking with MLflow. The data engineering side is the extraction and feature pipeline; the AI side is the model and tracking. Publishing your MLflow experiment results in a public repo gives reviewers something concrete to evaluate. This project bridges the two disciplines in a single deliverable.

import mlflow
from sklearn.ensemble import RandomForestClassifier
 
with mlflow.start_run():
    model = RandomForestClassifier(n_estimators=100)
    model.fit(X_train, y_train)
    mlflow.log_metric("accuracy", model.score(X_test, y_test))
    mlflow.sklearn.log_model(model, "model")
  

Five Projects. Two Cloud Platforms. One GitHub Profile Worth Showing.

Together these projects cover the full surface area hiring managers scan for: pipeline orchestration, cloud warehousing on AWS and Azure, AI integration, and end-to-end ML tracking. Pick the stack you are least afraid of, break it, fix it, and commit the mess. The best time to start was last month. The second best time is now.

If you are curious about what a working portfolio looks like in practice, you can take a look at my own projects.



Monday, June 8, 2026

AI in SQL Server vs PostgreSQL: Vector Search, Embeddings, and RAG Compared

Both SQL Server and PostgreSQL are moving fast into AI territory — and if you manage either (or both), you've probably wondered how they compare when it comes to storing embedding, running vector searches, and building RAG pipelines. The short answer: the capabilities are converging, but the approach is very different. Here's a practical breakdown.


1. The VECTOR data type: built-in vs extension

SQL Server 2025 ships with a native VECTOR data type, part of the core engine — no extra install needed. It stores embedding as optimized binary, surfaced as JSON arrays, with support for up to 1,998 dimensions. See the official type reference for the full syntax.

PostgreSQL takes the extension route via pgvector. You install it once and get a vector type that integrates natively with your existing tables. It's open source, available on RDS, Azure Database for PostgreSQL, Google Cloud SQL, and self-hosted instances.

-- SQL Server 2025
CREATE TABLE articles (
    id        INT PRIMARY KEY,
    title     NVARCHAR(500),
    embedding VECTOR(1536)
);
 
-- PostgreSQL + pgvector
CREATE EXTENSION IF NOT EXISTS vector;
 
CREATE TABLE articles (
    id        SERIAL PRIMARY KEY,
    title     TEXT,
    embedding VECTOR(1536)
);

2. Similarity search: VECTOR_DISTANCE vs the <=> operator

SQL Server 2025 uses the VECTOR_DISTANCE() function with explicit metric names ('cosine', 'euclidean', 'dot'). PostgreSQL with pgvector uses operator syntax: <=> for cosine, <-> for L2 (Euclidean), <#> for negative inner product. Both let you combine similarity with regular WHERE filters in the same query.

-- SQL Server 2025: top 5 similar articles
SELECT TOP 5
    title,
    VECTOR_DISTANCE('cosine', embedding, @query_vector) AS distance
FROM   articles
ORDER BY distance ASC;
 
-- PostgreSQL + pgvector: top 5 similar articles
SELECT
    title,
    embedding <=> $1 AS distance
FROM   articles
ORDER BY distance
LIMIT  5;

3. Approximate nearest neighbor (ANN) indexing

For large datasets, exact KNN scans get expensive. SQL Server 2025 introduces DiskANN-powered vector indexes via CREATE VECTOR INDEX (currently in preview, requires enabling PREVIEW_FEATURES). pgvector supports two ANN index types: ivfflat (IVF with flat quantization) and hnsw (Hierarchical Navigable Small Worlds). HNSW is generally preferred for query latency.

-- SQL Server 2025: DiskANN vector index (preview)
CREATE VECTOR INDEX idx_articles_embedding
    ON articles (embedding)
    WITH (METRIC = 'cosine');
 
-- PostgreSQL: HNSW index via pgvector
CREATE INDEX idx_articles_embedding
    ON articles
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

4. External model integration

SQL Server 2025 introduces EXTERNAL MODEL, which lets you register AI models (Azure OpenAI, OpenAI, Ollama, etc.) as first-class T-SQL objects and call them directly from queries. PostgreSQL doesn't have this built into the core engine — you typically generate embeddings outside the DB (Python, LangChain, application layer) and then insert the resulting vectors. Azure Database for PostgreSQL does offer an azure_ai extension that bridges this gap for Azure-hosted instances.


5. Which one fits your use case?

If your workload is already on SQL Server and you want everything — embeddings, vector search, and model calls — inside the engine with zero extra infrastructure, SQL Server 2025 is compelling. The DiskANN index is designed to scale to billions of vectors without offloading to a dedicated vector store.

If you're on PostgreSQL, pgvector gets you 90% of the way there with far less licensing cost and broader managed service support. HNSW indexes are mature and production-proven. The trade-off is that embedding generation stays outside the database — you own that pipeline.


Bottom line

Both platforms now let you store and query embeddings directly alongside relational data — which is the real win, because it eliminates the data synchronization headache of maintaining a separate vector store. SQL Server 2025 bets on deeper engine integration and managed model calls; PostgreSQL bets on ecosystem flexibility and open-source momentum. The syntax is surprisingly similar. Pick the one you're already running, and start small: create one table with a VECTOR column, generate embeddings for a real dataset, and measure your query latency with and without an ANN index. That benchmark will tell you more than any comparison article.


Friday, June 5, 2026

PostgreSQL 18 Finally Makes BUFFERS the Default. Here Is Why That Matters

You run EXPLAIN ANALYZE on a slow query, stare at the plan, and something still feels off. The estimated rows look reasonable, the node timings add up, but you cannot tell whether the database is reading everything from memory or hammering the disk. That missing piece has always been BUFFERS, and for years you had to remember to type it manually. PostgreSQL 18 finally turns it on by default.


1. What BUFFERS actually tells you

The BUFFERS option reports how many 8 kB blocks were satisfied from PostgreSQL's shared buffer cache (hit) versus read from disk (read). It also tracks blocks that were written back to disk during the query (written) and any temporary blocks used by sort or hash operations. Per the official EXPLAIN documentation, the counts are cumulative: a parent node includes the buffer usage of all its children, so the top-level line gives you the whole-query picture at a glance.


2. The old way vs PostgreSQL 18

Before version 18, BUFFERS was off by default and required the parenthesized syntax to combine it with ANALYZE:

EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM   orders
WHERE  customer_id = 42;

Starting with PostgreSQL 18, EXPLAIN ANALYZE alone is enough, buffer information is included automatically. The PostgreSQL 18 EXPLAIN reference confirms this change: "Buffers information is automatically included when ANALYZE is used."


3. Reading the output

Here is a trimmed example of what you will see under a scan node:

Bitmap Heap Scan on orders
  (actual time=0.012..1.430 rows=980 loops=1)
  Buffers: shared hit=12 read=210

A high read count relative to hit means the data is not in shared buffers and the query is going to disk. The ratio between the two is the fastest signal you have for deciding whether to raise shared_buffers, add caching, or reconsider an index.


4. Spotting temp block usage

Temp blocks appear when a sort or hash operation cannot fit in work_mem and spills to disk. They show up as a separate line in the plan:

Sort  (actual time=45.2..48.9 rows=50000 loops=1)
  Sort Method: external merge  Disk: 7104kB
  Buffers: shared hit=2240 temp read=889 written=889

Any temp read or temp written value above zero is a sign that the operation spilled to disk. Increasing work_mem for that session (or globally if it happens often) is usually the fix. The Using EXPLAIN guide in the PostgreSQL docs walks through how to interpret these lines in context.


Wrapping Up

The BUFFERS option turns EXPLAIN ANALYZE from a timing report into a genuine I/O diagnostic. With PostgreSQL 18 enabling it automatically, there is one fewer thing to remember when chasing down a slow query. If you are still on an older version, make EXPLAIN (ANALYZE, BUFFERS) your default habit. The hit vs read ratio and any temp blocks in the output will often tell you more than the node timings alone.

If reading raw plan text is not your thing, I built PlanTrace exactly for this: a free visualizer that turns EXPLAIN output from PostgreSQL and Redshift into an interactive node graph. Buffer counts, node timings, and scan types all rendered in one color-coded view. Paste your plan and see everything at once, no indentation squinting required.


Wednesday, June 3, 2026

AI and Data Engineering on the Edge: The Good, the Bad, and the Overhyped

Running AI and data pipelines on the edge instead of the cloud has gone from a niche embedded concern to a default option on a lot of architecture diagrams. Some of that is justified. A lot of it is people moving compute to the edge because it sounds modern, then paying for the privilege in operational pain. This post is the honest version: where edge AI and edge data engineering earn their keep, where they don't, and how to tell the difference before you commit hardware to a field you'll have to drive to when it breaks.


The good: latency, privacy, and the egress bill

The strongest cases for the edge are the boring, measurable ones. Processing data where it's generated removes the round-trip to a region thousands of kilometers away, which matters when your latency budget is in the tens of milliseconds. It keeps sensitive data on the device, which is sometimes a compliance requirement rather than a nice-to-have. And it sidesteps egress fees on high-volume sources like 4K video or high-frequency sensor streams, where the recurring cost of moving raw data back to the cloud can quietly dwarf the cost of the inference itself.

In practice this shows up in a few recurring patterns. A computer vision model rejecting defective parts at the end of a production line — the decision has to happen before the conveyor moves, not after a round-trip to a cloud region. An anomaly detector running on vibration and temperature sensors at a remote pump station with no reliable cell coverage — it has to work offline or it doesn't work at all. A document extraction pipeline at a warehouse dock where scanned shipping documents need to be parsed against a local WMS — sending raw images off-site adds latency, cost, and an unnecessary privacy exposure on every transaction. What these have in common is that the constraint is hard, measurable, and present before anyone opens a cloud console.


The bad: you just turned one server into a fleet

Here's what the vendor slides skip. The moment you push inference to the edge, you've traded one managed environment for hundreds of unmanaged ones. Model updates become a deployment problem across heterogeneous hardware. Observability gets harder because the interesting failures happen on a device with no dashboard. Drift goes undetected longer because nobody is watching each node. And large models simply don't fit. A frontier-scale model is ill-suited for edge deployment on memory grounds alone, so "AI on the edge" almost always means a quantized, pruned, smaller model than the one you tested in the cloud. The accuracy you benchmarked is not the accuracy you'll ship.


Where it's not worth the price

If your workload tolerates a second or two of latency, runs occasionally rather than continuously, and produces modest data volumes, the edge is usually the more expensive option once you count the real costs: device provisioning, physical maintenance, security patching across a fleet, and the engineering hours spent debugging hardware you can't SSH into reliably. Centralized cloud compute exists precisely so you don't have to operate a thousand tiny servers. Batch ETL, periodic reporting, model retraining, anything that's fine with a round-trip — keep it in the cloud. Edge economics tip in your favor when volume and latency constraints are real and constant, not when they're hypothetical.


Where it's not worth doing just because

This is the category that burns the most time. Edge for its own sake. Symptoms: the latency requirement was never actually defined, the data was never sensitive, the volume was never high, and the real reason for going edge is that it looked good in the proposal. Before you commit, run the workload through a plain decision check rather than vibes:

Should this workload run on the edge? Your Workload Sub-100ms latency & unreliable network? YES Deploy to Edge NO Data too sensitive to leave the device? YES Deploy to Edge NO Egress cost exceeds amortized device cost? YES Deploy to Edge NO Keep it in the Cloud YES exits to edge deployment. Three NOs means the workload stays central.

If none of the first three branches fire, the edge is a liability you're choosing to maintain. The honest move is to admit that and keep the workload central.


If you do want to experiment

Two hardware tiers are worth knowing about, and they answer very different questions.

At the low end, an ESP32 costs single-digit dollars, draws milliamps in active use, and is capable of running TensorFlow Lite Micro for sensor-based anomaly detection or simple classification — entirely offline, no network dependency whatsoever. It is a microcontroller, not a computer: model size is measured in kilobytes and you are not running vision inference on it. But for always-on, low-power sensing deployed in volume, nothing is more direct or cheaper to maintain in the field.

ESP32 development board

ESP32 — the microcontroller tier. Sensor inference, offline, milliamps.


At the higher end, a Raspberry Pi 5 paired with an AI accelerator HAT runs real-time vision inference locally at single-digit watts. The HAT+ 2, released in early 2026, now pushes into small generative workloads on the same board. The gap between an ESP32 and a Pi 5 is roughly the gap between "flag an abnormal vibration reading" and "classify objects in a live camera feed in real time." Start at the bottom of that stack and move up only when the task genuinely requires it.

Raspberry Pi 5 board

Raspberry Pi 5 — the capable-inference tier. Vision, local AI, still single-digit watts.

(Both hardware links above are Amazon affiliate links — if you buy through them, this blog earns a small commission at no extra cost to you.)


Wrapping up

Edge AI and edge data engineering are excellent tools and a terrible default. They win on hard latency, real privacy constraints, and genuine egress costs, and they lose everywhere those constraints are imaginary. Before you distribute compute across a fleet you'll have to maintain, make the workload prove it needs to be there. If it can't, the cloud is still the cheaper, calmer place to run it.


Setting Up a Mac for Data Engineering and AI Work

If you work with data pipelines, SQL, notebooks, or machine learning models, a Mac with Apple Silicon is genuinely one of the best machine...