Study your Data: 5 Starter Projects for Your AI and Data Engineering Portfolio

Reading tutorials is fine. Shipping something is better. If you are trying to break into data engineering or AI, nothing on your resume carries more weight than a GitHub repo with working code and a problem you actually solved. These five projects are designed to give you hands-on experience with real tools while producing portfolio artifacts you can point to in an interview.

Project 1 — Automated ETL Pipeline with Scheduling

Tech Stack: Python, PostgreSQL, Apache Airflow, Docker

Build a pipeline that pulls data from a public API (weather, exchange rates, or any open dataset), transforms it with Python, and loads it into PostgreSQL on a schedule. Airflow handles the orchestration, and Docker keeps the environment reproducible. This project teaches you the core ETL loop and gives you a DAG you can walk through in any data engineering interview. The scheduling angle forces you to think about idempotency and failure handling from day one.

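# Minimal Airflow DAG skeleton (the schedule argument assumes Airflow 2.4 or later)
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract(): ...    # pull raw records from the API
def transform(): ...  # clean and reshape the records
def load(): ...       # write the result to PostgreSQL

with DAG(
    "etl_pipeline",
    start_date=datetime(2025, 1, 1),
    schedule="@daily",
    catchup=False,  # do not backfill a run for every day since start_date
) as dag:
    t1 = PythonOperator(task_id="extract", python_callable=extract)
    t2 = PythonOperator(task_id="transform", python_callable=transform)
    t3 = PythonOperator(task_id="load", python_callable=load)
    t1 >> t2 >> t3  # run extract, then transform, then load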
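
Idempotency is the part of this project that is easiest to skip, so here is a minimal sketch of what an idempotent load step could look like. It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL so it runs anywhere; the weather table, its columns, and the sample rows are hypothetical. The same INSERT ... ON CONFLICT pattern exists in PostgreSQL, though the placeholder style depends on the driver you use.

```python
import sqlite3


def load(conn, rows):
    """Upsert rows keyed on (city, observed_on) so a rerun does not duplicate data."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS weather (
            city        TEXT NOT NULL,
            observed_on TEXT NOT NULL,
            temp_c      REAL,
            PRIMARY KEY (city, observed_on)
        )
        """
    )
    conn.executemany(
        """
        INSERT INTO weather (city, observed_on, temp_c)
        VALUES (?, ?, ?)
        ON CONFLICT (city, observed_on) DO UPDATE SET temp_c = excluded.temp_c
        """,
        rows,
    )
    conn.commit()


# In-memory database used only for this illustration.
conn = sqlite3.connect(":memory:")
batch = [("Oslo", "2025-01-01", -3.5), ("Lima", "2025-01-01", 24.0)]

load(conn, batch)
load(conn, batch)  # simulate a retry of the same scheduled run

row_count = conn.execute("SELECT COUNT(*) FROM weather").fetchone()[0]
print(row_count)  # 2, because the second load updated rows instead of adding them
```

Because the primary key identifies a reading, a retried or re-triggered run overwrites what it wrote before instead of appending a second copy.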

Project 2 — Cloud Data Warehouse on AWS

Tech Stack: AWS S3, Amazon Redshift Serverless, Python (boto3), SQL

Upload a dataset to S3, provision a Redshift Serverless workgroup, and load the data using the COPY command. Write a set of analytical queries against it. The goal is not just to run queries; it is to understand how a cloud warehouse differs from a local database: columnar storage, distribution keys, and the cost model. This project demonstrates cloud data skills that show up in most modern data engineering job descriptions. Redshift Serverless bills for compute while queries run, so check what the AWS Free Tier and any trial credits currently cover, keep the dataset small, and delete the workgroup when you are done.

-- Load from S3 into Redshift
COPY sales
FROM 's3://your-bucket/sales.csv'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftRole'
FORMAT AS CSV
IGNOREHEADER 1;
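
The boto3 side of this project could be sketched as follows. This is an illustration under assumptions, not a finished loader: the function names, the bucket, and the role ARN are placeholders, and the second function needs AWS credentials plus an existing workgroup before it will run. It uses the S3 client's upload_file call and the Redshift Data API's execute_statement call.

```python
def build_copy_statement(table, s3_uri, iam_role_arn):
    """Return a COPY statement for a CSV file that has one header row."""
    # Plain string formatting: only pass values you control, never user input.
    return (
        f"COPY {table}\n"
        f"FROM '{s3_uri}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS CSV\n"
        "IGNOREHEADER 1;"
    )


def upload_and_copy(local_path, bucket, key, table, iam_role_arn, workgroup, database):
    """Upload a file to S3, then submit the COPY through the Redshift Data API."""
    import boto3  # imported lazily so build_copy_statement works without boto3

    boto3.client("s3").upload_file(local_path, bucket, key)
    sql = build_copy_statement(table, f"s3://{bucket}/{key}", iam_role_arn)
    response = boto3.client("redshift-data").execute_statement(
        WorkgroupName=workgroup, Database=database, Sql=sql
    )
    # The call is asynchronous; poll describe_statement with this id for status.
    return response["Id"]


# Placeholder values that mirror the SQL example above.
statement = build_copy_statement(
    "sales",
    "s3://your-bucket/sales.csv",
    "arn:aws:iam::123456789012:role/RedshiftRole",
)
print(statement)
```

Keeping the SQL builder separate from the AWS calls means you can unit test the statement without touching a real account.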

Project 3 — ETL Pipeline with Data Factory in Microsoft Fabric

Tech Stack: Data Factory in Microsoft Fabric, Microsoft Fabric Lakehouse, OneLake, SQL Server, Python (Fabric Notebook)

Microsoft has consolidated its data integration story into Microsoft Fabric, and it positions Data Factory in Microsoft Fabric as the next generation of Azure Data Factory. Build a pipeline that ingests data from SQL Server or a flat file, transforms it using a Fabric Notebook (Python/PySpark), and lands the result in a Fabric Lakehouse backed by OneLake. This project exposes you to the unified Fabric workspace model, one platform for pipelines, notebooks, warehouses, and Power BI, which is where Microsoft data engineering is heading. A free Fabric trial requires no credit card and gives you full access to build this end to end.
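
The notebook transform step might look something like the sketch below. The row shape (order_id, customer, amount, order_date), the function names, and the orders_clean table name are all hypothetical. The cleaning function is plain Python so you can test it anywhere; the write function assumes it runs inside a Fabric Notebook with a Spark session available and a Lakehouse attached, and it uses standard PySpark DataFrame calls.

```python
from datetime import date


def clean_order(raw):
    """Normalize one source row; return None when it fails basic validation."""
    try:
        cleaned = {
            "order_id": int(raw["order_id"]),
            "customer": raw["customer"].strip().upper(),
            "amount": round(float(raw["amount"]), 2),
            "order_date": date.fromisoformat(raw["order_date"].strip()).isoformat(),
        }
    except (KeyError, ValueError, TypeError, AttributeError):
        return None
    # Reject negative amounts rather than loading them silently.
    return cleaned if cleaned["amount"] >= 0 else None


def write_to_lakehouse(spark, rows, table_name="orders_clean"):
    """Write the valid rows as a Delta table; `spark` is the notebook's session."""
    cleaned = [row for row in map(clean_order, rows) if row is not None]
    if not cleaned:
        return 0  # nothing to write, and Spark cannot infer a schema from no rows
    df = spark.createDataFrame(cleaned)
    df.write.format("delta").mode("overwrite").saveAsTable(table_name)
    return len(cleaned)


# Local check of the cleaning logic, no Spark required.
sample = {"order_id": "7", "customer": " acme ", "amount": "19.5", "order_date": "2025-01-05"}
print(clean_order(sample))
```

Overwrite mode keeps the sketch rerunnable; for larger tables you would more likely merge or append by partition.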

Project 4 — RAG Chatbot over Your Own Documents

Tech Stack: Amazon Bedrock, Claude or Llama 3 (via Bedrock), pgvector (PostgreSQL), Python, LangChain

Take a set of PDF or text documents you own, chunk them, generate embeddings using a Bedrock foundation model, store the vectors in PostgreSQL with the pgvector extension, and build a simple question-answering interface on top. This is Retrieval-Augmented Generation (RAG) in its simplest form. It teaches you the full AI data pipeline: ingestion, embedding, vector search, and prompt construction. RAG is one of the most in-demand AI engineering patterns in production today.
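
Chunking is the first step, and a minimal version fits in a few lines. The sketch below splits on a fixed character count with a small overlap so that a sentence cut at a boundary still appears whole in one of the chunks. The function name and default sizes are assumptions for illustration; LangChain ships its own text splitters that you may prefer in the real project.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reached the end of the text
    return chunks


# Tiny sizes so the overlap is easy to see.
print(chunk_text("abcdefghij", chunk_size=4, overlap=1))
# ['abcd', 'defg', 'ghij']
```

Character counts are a rough proxy for tokens, so treat the sizes as something to tune against your embedding model's limits.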

-- Store a document embedding in pgvector (%s marks a driver parameter placeholder)
INSERT INTO documents (content, embedding)
VALUES (%s, %s::vector);
 
-- Semantic similarity search
SELECT content
FROM   documents
ORDER BY embedding <-> '[query_vector]'::vector
LIMIT 5;
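
In pgvector the <-> operator orders rows by Euclidean (L2) distance, and <=> is the cosine-distance alternative. To make the last two stages concrete, here is a pure-Python sketch of what that ORDER BY ... LIMIT computes and how the returned rows could be folded into a prompt. The two-dimensional vectors are toy values rather than real embeddings, and the function names and prompt wording are assumptions.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vector, documents, k=5):
    """Return the content of the k closest (content, embedding) pairs."""
    ranked = sorted(documents, key=lambda doc: l2_distance(doc[1], query_vector))
    return [content for content, _ in ranked[:k]]


def build_prompt(question, retrieved_chunks):
    """Assemble a grounded prompt from the question and the retrieved chunks."""
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


# Toy data: (content, embedding) pairs with made-up 2-d vectors.
docs = [
    ("Airflow schedules DAG runs.", [0.9, 0.1]),
    ("Redshift stores data by column.", [0.1, 0.9]),
    ("Airflow retries failed tasks.", [0.8, 0.3]),
]
hits = top_k([1.0, 0.0], docs, k=2)
prompt = build_prompt("How does Airflow handle failures?", hits)
print(prompt)
```

In the real project the database does the ranking, and only the prompt assembly stays in Python.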

Project 5 — End-to-End ML Pipeline with Feature Engineering

Tech Stack: Python, pandas, scikit-learn, SQL Server or PostgreSQL, MLflow

Pick a structured dataset with a clear prediction target (churn, sales forecast, classification). Build the full cycle in Python: data extraction from a relational database, feature engineering, model training, and experiment tracking with MLflow. The data engineering side is the extraction and feature pipeline; the AI side is the model and tracking. Publishing your MLflow experiment results in a public repo gives reviewers something concrete to evaluate. This project bridges the two disciplines in a single deliverable.

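import mlflow
import mlflow.sklearn
from sklearn.datasets import load_breast_cancer  # stand-in data; swap in your own features
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run():
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    # Log held-out accuracy and the fitted model to the active run
    mlflow.log_metric("accuracy", model.score(X_test, y_test))
    mlflow.sklearn.log_model(model, "model")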
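
The extraction and feature step that comes before training could be sketched like this. It uses the built-in sqlite3 module as a stand-in for SQL Server or PostgreSQL, and the orders table, its columns, and the chosen aggregates are hypothetical. The idea it illustrates is pushing the per-customer aggregation into SQL and handing Python one row per entity.

```python
import sqlite3

# One row per customer; each aggregate becomes a model feature.
FEATURE_QUERY = """
SELECT customer_id,
       COUNT(*)        AS order_count,
       SUM(amount)     AS total_spend,
       MAX(order_date) AS last_order_date
FROM orders
GROUP BY customer_id
ORDER BY customer_id
"""


def extract_features(conn):
    """Run the feature query and return a list of dicts keyed by column name."""
    cursor = conn.execute(FEATURE_QUERY)
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


# In-memory sample data used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL, order_date TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 20.0, "2025-01-03"), (1, 35.0, "2025-02-10"), (2, 12.5, "2025-01-20")],
)

features = extract_features(conn)
for row in features:
    print(row)
```

With pandas in the stack, pandas.read_sql can run the same query and return a DataFrame that feeds scikit-learn directly.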

Five Projects. Two Cloud Platforms. One GitHub Profile Worth Showing.

Together these projects cover the full surface area hiring managers scan for: pipeline orchestration, cloud data platforms on AWS and Microsoft Fabric, AI integration, and end-to-end ML tracking. Pick the stack you are least afraid of, break it, fix it, and commit the mess. The best time to start was last month. The second best time is now.

If you are curious about what a working portfolio looks like in practice, you can take a look at my own projects.

Friday, June 12, 2026
