Loading…
Loading…
Flagship Data Engineering Project
A scalable PostgreSQL ETL and analytics pipeline for more than 40 million NYC yellow taxi trip records, focused on schema design, indexing, and query performance.
Large trip-level mobility datasets can become difficult to analyze when raw files are queried directly or loaded without a clear schema and indexing strategy. The project focused on converting raw NYC yellow taxi trip data into a structured PostgreSQL analytics workflow that supports efficient time-based, spatial, passenger, and revenue analysis.
Raw Parquet Files
Staged Loading
Cleaning and Type Normalization
PostgreSQL Analytical Tables
Indexes and Query Optimization
SQL Analytics Outputs
Loaded raw taxi trip files into a structured workflow from Parquet sources.
Handled schema consistency, column types, missingness, and analytical readiness.
Created PostgreSQL tables designed for time-filtered and analytical queries.
Added indexing strategies to improve query speed on high-volume filters.
Extracted temporal, spatial, passenger, and revenue insights through optimized SQL.
Before optimization
About 1.2 seconds (time-filtered query)
After schema/indexing optimization
About 0.05 seconds (time-filtered query)
This improvement shows the value of database design and indexing in practical data engineering workflows, especially when working with large analytical datasets.
This project demonstrates large-scale ETL thinking, relational schema design, query optimization, SQL analytics, and the ability to turn raw data into a structured analytical database workflow.
The repository contains the ETL workflow, PostgreSQL schema, SQL analytics, and performance optimization notes.
View GitHub Repository