Flagship Data Engineering Project

NYC Taxi Data Engineering

A scalable PostgreSQL ETL and analytics pipeline for more than 40 million NYC yellow taxi trip records, focused on schema design, indexing, and query performance.

Data Engineering
PostgreSQL
ETL Pipeline
SQL Analytics
Query Optimization


Proof Metrics

40M+: Records processed
0.05s: Optimized query latency
~1.2s → 0.05s: Time-filtered query improvement
PostgreSQL: Analytical database layer

Problem

Large trip-level mobility datasets can become difficult to analyze when raw files are queried directly or loaded without a clear schema and indexing strategy. The project focused on converting raw NYC yellow taxi trip data into a structured PostgreSQL analytics workflow that supports efficient time-based, spatial, passenger, and revenue analysis.

Data

NYC yellow taxi trip records at trip-level granularity
More than 40 million records in the analytical pipeline
Raw Parquet files as the ingestion source
Cleaned and transformed PostgreSQL analytical tables
Time, location, passenger, and fare/revenue-related fields
PostgreSQL as the primary analytical database

Architecture

Raw Parquet Files → Staged Loading → Cleaning and Type Normalization → PostgreSQL Analytical Tables → Indexes and Query Optimization → SQL Analytics Outputs

Methodology

01 Data ingestion: Loaded raw taxi trip files into a structured workflow from Parquet sources.
02 Data cleaning: Handled schema consistency, column types, missingness, and analytical readiness.
03 Database modeling: Created PostgreSQL tables designed for time-filtered and analytical queries.
04 Indexing: Added indexing strategies to improve query speed on high-volume filters.
05 SQL analytics: Extracted temporal, spatial, passenger, and revenue insights through optimized SQL.
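The cleaning step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name clean_trip and the field names (pickup_datetime, dropoff_datetime, passenger_count, fare_amount) are assumptions chosen for the example, and the repository's real columns and rules may differ.

```python
from datetime import datetime
from typing import Optional


def clean_trip(raw: dict) -> Optional[dict]:
    """Return a typed trip record, or None if the row fails basic checks.

    Field names are hypothetical; adjust them to the real schema.
    """
    try:
        pickup = datetime.fromisoformat(raw["pickup_datetime"])
        dropoff = datetime.fromisoformat(raw["dropoff_datetime"])
        fare = float(raw["fare_amount"])
        passengers_raw = raw.get("passenger_count")
        # Passenger count is treated as optional: keep the row, store None.
        passengers = (
            None if passengers_raw in (None, "") else int(float(passengers_raw))
        )
    except (KeyError, TypeError, ValueError):
        return None  # missing or unparseable field

    # Reject rows that are internally inconsistent.
    if dropoff < pickup or fare < 0:
        return None

    return {
        "pickup_datetime": pickup,
        "dropoff_datetime": dropoff,
        "passenger_count": passengers,
        "fare_amount": fare,
    }


example = clean_trip(
    {
        "pickup_datetime": "2023-01-01 08:15:00",
        "dropoff_datetime": "2023-01-01 08:30:00",
        "passenger_count": "1.0",
        "fare_amount": "12.5",
    }
)
```

Returning None for rejected rows keeps the decision explicit, so a loader can count or log discarded records instead of dropping them silently.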

Performance Result

Before optimization: about 1.2 seconds (time-filtered query)

After schema/indexing optimization: about 0.05 seconds (time-filtered query)

This roughly 24-fold reduction in latency for the time-filtered query shows how much schema design and indexing matter in practical data engineering workflows, especially on large analytical datasets.
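The effect of an index on a time-range filter can be illustrated with a small, self-contained sketch. It uses Python's built-in sqlite3 module purely as a stand-in so the example runs anywhere; the project itself uses PostgreSQL, whose planner and plan output differ. The table name trips, the column pickup_ts, and the index name are assumptions made for this example.

```python
import sqlite3


def query_plan(conn: sqlite3.Connection, sql: str, params: tuple) -> str:
    """Return the planner's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column holds the detail


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (pickup_ts TEXT NOT NULL, fare_amount REAL)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?)",
    [(f"2023-01-{day:02d} 08:00:00", 10.0 + day) for day in range(1, 29)],
)

# A half-open time window: pickups on 2023-01-10 only.
QUERY = (
    "SELECT COUNT(*), SUM(fare_amount) FROM trips "
    "WHERE pickup_ts >= ? AND pickup_ts < ?"
)
WINDOW = ("2023-01-10 00:00:00", "2023-01-11 00:00:00")

plan_before = query_plan(conn, QUERY, WINDOW)  # full table scan
conn.execute("CREATE INDEX idx_trips_pickup_ts ON trips (pickup_ts)")
plan_after = query_plan(conn, QUERY, WINDOW)   # range search on the index
result = conn.execute(QUERY, WINDOW).fetchone()

print("before:", plan_before)
print("after: ", plan_after)
```

In PostgreSQL the equivalent check would be to compare EXPLAIN ANALYZE output for the same query before and after creating the index.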

Technical Decisions

Used PostgreSQL instead of flat-file querying for repeatable analytics
Converted raw Parquet data into cleaned analytical tables
Applied indexing for time-filtered queries
Focused on reproducible SQL workflows
Measured query latency before and after optimization
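One way to measure latency before and after an optimization is sketched below. It is an assumption about method, not a description of how the reported figures were obtained: the helper names median_latency and speedup are hypothetical, and run_query stands for any zero-argument callable that executes the query under test.

```python
import statistics
import time
from typing import Callable


def median_latency(run_query: Callable[[], object], repeats: int = 5) -> float:
    """Run the query several times and return the median wall-clock seconds.

    The median is less sensitive than the mean to a single cold-cache run.
    """
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def speedup(before_s: float, after_s: float) -> float:
    """Ratio of the old latency to the new latency."""
    return before_s / after_s


# Applied to the figures reported on this page (about 1.2 s and 0.05 s).
reported_speedup = speedup(1.2, 0.05)
print(f"reported speedup: about {reported_speedup:.0f}x")
```

Repeating the query and reporting the median, along with whether caches were warm, makes a before/after comparison easier to reproduce.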

SQL Analytics Areas

Temporal trip patterns
Location-based trip behavior
Passenger count patterns
Fare/revenue analysis
Query performance testing
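As an illustration of the temporal and revenue areas above, the sketch below groups trips by pickup hour. It again uses sqlite3 as a runnable stand-in for PostgreSQL, with a hypothetical trips table and three made-up rows that exist only to make the example executable; they are not project data. In PostgreSQL, the hour would typically be derived with EXTRACT or date_trunc rather than strftime.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trips (pickup_ts TEXT, passenger_count INTEGER, total_amount REAL)"
)
# Illustrative rows only.
conn.executemany(
    "INSERT INTO trips VALUES (?, ?, ?)",
    [
        ("2023-01-01 08:05:00", 1, 12.0),
        ("2023-01-01 08:40:00", 2, 18.0),
        ("2023-01-01 17:10:00", 1, 30.0),
    ],
)

# Trips and revenue per pickup hour.
hourly = conn.execute(
    """
    SELECT CAST(strftime('%H', pickup_ts) AS INTEGER) AS pickup_hour,
           COUNT(*)                                   AS trips,
           SUM(total_amount)                          AS revenue
    FROM trips
    GROUP BY pickup_hour
    ORDER BY pickup_hour
    """
).fetchall()

for pickup_hour, trips, revenue in hourly:
    print(pickup_hour, trips, revenue)
```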

What This Demonstrates

This project demonstrates large-scale ETL thinking, relational schema design, query optimization, SQL analytics, and the ability to turn raw data into a structured analytical database workflow.

Limitations

Analysis depends on source data consistency
Larger production workflows would need orchestration and monitoring
Additional geospatial enrichment could improve spatial analysis
Further optimization could include partitioning or materialized views

Explore the repository

The repository contains the ETL workflow, PostgreSQL schema, SQL analytics, and performance optimization notes.

