About

Senior Data Engineer | AWS/GCP & ETL/ELT Specialist | Tech Mentor

Hello there! I’m an accomplished Senior Data Engineer with 7 years of experience, dedicated to transforming complex data into actionable insights. With a focus on AWS and GCP services and ETL/ELT processes, I build robust data pipelines that power intelligent decision-making.

  • Programming Languages: Python, SQL, Bash
  • Cloud Platforms: AWS, GCP
  • AWS Services: S3, Glue, Lambda, Redshift
  • GCP Services: BigQuery, Dataflow, Cloud Storage, Pub/Sub
  • Databases & Data Warehouses: PostgreSQL, MySQL, BigQuery, Snowflake, SQLite
  • Data Processing & Orchestration: Databricks, Airflow, Prefect, Dagster, Spark, PySpark
  • Data Modeling & Transformation: dbt, Trino, Iceberg, Delta, DuckDB
  • Data Quality & Metadata Management: OpenMetadata
  • DevOps & Version Control: Git, GitHub, Docker, Docker Compose, Dev Containers
  • Business Intelligence & Visualization: Power BI, Tableau
  • Methodologies: SAFe Scrum, Agile

Let's connect! I'm eager to contribute my technical expertise to a team focused on progress and innovation. If you're looking for a Senior Data Engineer, let's chat!

Pokémon -
Delta Lakehouse

End-to-end batch data pipeline on Pokémon Data with Delta Lakehouse Architecture.


OVERVIEW

The PokéAPI Data Pipeline project builds a data pipeline to extract, load, and transform data from the PokéAPI into a Delta Lakehouse. The project uses Python, DuckDB, MinIO, Docker, Docker Compose, Dev Containers, and Poetry.

About PokéAPI

The PokéAPI is a RESTful API that provides data about the Pokémon games, including Pokémon species, abilities, moves, types, and more.

Pokémon is a media franchise created by Satoshi Tajiri and Ken Sugimori and managed by The Pokémon Company, a collaboration between Nintendo, Game Freak, and Creatures. The franchise is centered on fictional creatures called "Pokémon", which humans, known as Pokémon Trainers, catch and train to battle each other for sport.

The goal of the PokéAPI Data Pipeline project is to create a Delta Lakehouse with the data provided by the PokéAPI.

Technologies

The PokéAPI Data Pipeline project uses the following technologies: Python, DuckDB, MinIO, Docker, Docker Compose, Dev Containers, and Poetry.

Problem Statement

Data

The data selected for this project is the Pokémon data provided by the PokéAPI. It includes details for all the Pokémon available in the API.

The data descriptions are available in the PokéAPI documentation.

Data Pipeline Overview

This is a batch data pipeline that extracts data from the PokéAPI, transforms it into Delta format, and loads it into a Delta Lakehouse.

The ELT steps are as follows (a minimal sketch of these steps follows the list):

  1. Extract: Extract the data from the PokéAPI in its raw format.
  2. Load: Load the raw data into the staging area.
  3. Transform: Transform the data into Delta format and load it into the Delta Lakehouse.
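
As an illustration of these steps, here is a minimal sketch assuming the requests, pandas, and deltalake (delta-rs) packages; the endpoint parameters, local staging path, and lakehouse path are illustrative and not the project's actual layout.

import json
import os
import requests
import pandas as pd
from deltalake import write_deltalake

POKEAPI_URL = "https://pokeapi.co/api/v2/pokemon"  # public PokéAPI endpoint

# 1. Extract: pull a page of Pokémon and keep the raw JSON payload.
raw = requests.get(POKEAPI_URL, params={"limit": 100, "offset": 0}, timeout=30).json()

# 2. Load: persist the raw payload to the staging area (a local folder here;
#    in the project this would be a MinIO/S3 prefix).
os.makedirs("staging", exist_ok=True)
with open("staging/pokemon_raw.json", "w") as f:
    json.dump(raw, f)

# 3. Transform: normalise the payload and write it as a Delta table.
df = pd.json_normalize(raw["results"])  # columns: name, url
write_deltalake("lakehouse/bronze/pokemon", df, mode="overwrite")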

Medallion Architecture

The Medallion Architecture is a data architecture organized into the following layers (a minimal sketch of promoting data between layers follows the list):

  1. Raw: stores the raw data exactly as extracted from the source.
  2. Bronze: stores the transformed data in Delta format.
  3. Silver: stores the curated data in Delta format.
  4. Gold: stores the business-level data in Delta format.
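
To make the layers concrete, here is a minimal sketch of promoting data between them, assuming the deltalake (delta-rs) package; the paths and column names are illustrative, not the project's actual transformation logic.

import pandas as pd
from deltalake import DeltaTable, write_deltalake

# Bronze: read the ingested Delta table.
bronze = DeltaTable("lakehouse/bronze/pokemon").to_pandas()

# Silver: light curation (deduplicate, normalise names), rewritten as Delta.
silver = bronze.drop_duplicates(subset=["name"]).assign(name=lambda d: d["name"].str.lower())
write_deltalake("lakehouse/silver/pokemon", silver, mode="overwrite")

# Gold: a small business-level summary table, also stored as Delta.
gold = pd.DataFrame({"n_pokemon": [len(silver)]})
write_deltalake("lakehouse/gold/pokemon_summary", gold, mode="overwrite")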

Pipeline Architecture

The architecture of the data pipeline is as follows:

Results

Delta Lakehouse

Orchestration

Data Visualization

Conclusion

The PokéAPI Data Pipeline project delivers a data pipeline that extracts, loads, and transforms data from the PokéAPI into a Delta Lakehouse, built with Python, DuckDB, MinIO, Docker, Docker Compose, Dev Containers, and Poetry.

The full project can be viewed on my GitHub:

Pokémon - Delta Lakehouse

Modern Data Stack -
Architecture

Modern Data Stack with Kubernetes.


OVERVIEW

The Modern Data Stack with Kubernetes is a project that aims to create a complete data environment following current industry best practices. The project is structured in three parts:

  1. Infra: All the resources needed to create a Kubernetes cluster, as well as the components needed for the data environment, managed with GitOps.
  2. Apps: Development of an application that generates JSON or Parquet files and places them in the landing zone of a data lake, in this case MinIO (S3-compatible); see the sketch after this list.
  3. Data: Creation of a data pipeline using Apache Airflow, Trino, and dbt-Core to build a complete end-to-end data environment.
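
As a sketch of the Apps part, the following writes a Parquet file to the MinIO landing zone; it assumes the boto3 and pandas/pyarrow packages, and the endpoint, credentials, bucket, and key shown are illustrative placeholders.

import io
import boto3
import pandas as pd

# Generate a small sample dataset (a stand-in for the application's real output).
df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})

# Serialize to Parquet in memory (requires pyarrow or fastparquet).
buffer = io.BytesIO()
df.to_parquet(buffer, index=False)

# MinIO exposes an S3-compatible API, so a regular S3 client works.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",   # illustrative MinIO endpoint
    aws_access_key_id="minioadmin",         # illustrative credentials
    aws_secret_access_key="minioadmin",
)
s3.put_object(Bucket="datalake", Key="landing/sample.parquet", Body=buffer.getvalue())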

Results

Kubernetes

ArgoCD

MinIO

Trino & Iceberg

Apache Airflow & dbt-Core

Metabase

Conclusion

The Modern Data Stack with Kubernetes project delivers a complete data environment built on current industry best practices. It is structured in three parts: Infra, Apps, and Data, and uses the following technologies: Kubernetes, ArgoCD, MinIO, Trino, Iceberg, Apache Airflow, dbt-Core, and Metabase.

The full project can be viewed on my GitHub:

Modern Data Stack with Kubernetes

Magic: The Gathering -
Data Pipeline

End-to-end batch data pipeline on Magic: The Gathering.


OVERVIEW

Magic: The Gathering (colloquially known as Magic or MTG) is a tabletop and digital collectible card game created by Richard Garfield. Released in 1993 by Wizards of the Coast, Magic was the first trading card game and had approximately fifty million players as of February 2023. Over twenty billion Magic cards were produced in the period from 2008 to 2016, during which time it grew in popularity. As of the 2022 fiscal year, Magic generates over $1 billion in revenue annually - Wikipedia

The goal of this project is to build an end-to-end batch data pipeline on the Magic: The Gathering data available from Scryfall, and to perform ELT (Extract, Load, Transform) daily in order to analyze card information from historical data up to the present.

Problem Statement

Data

The data selected for this project is the Magic: The Gathering card data obtained from Scryfall. It includes the latest card information available, with data going back to 1993, and is extracted via the Scryfall API.

Proposed Solution

This project aims to extract the data from the source via the API and build a batch ELT pipeline that is scheduled to run daily and updates the connected Dashboard for daily Analytics & Reporting.

Data Pipeline Overview

This is a batch pipeline that performs ELT every day at 09:00 am.

The ELT steps include:

  • Extract the dataset from Scryfall via the API and load the data into the Datalake
  • Clean the data and load it into the Datalake
  • Load the data from Datalake into external tables in the Data Warehouse
  • Transform the data in the Data Warehouse
  • Visualize the data by creating a Dashboard

Data Pipeline Architecture

  • RAW: where the raw data is placed as soon as it is collected
  • BRONZE: treated data, ready to be consumed
  • SILVER: processed data that can be consumed easily
  • GOLD: data produced by analyses and models, ready to be consumed by BI or DataViz tools

Technologies

  • Cloud: AWS
  • Infrastructure as code (IaC): Terraform
  • Workflow orchestration: Astronomer + Airflow
  • Data Warehouse: MotherDuck
  • Batch processing: DuckDB
  • Data Transformation: dbt-core
  • DataViz: Preset
  • Virtual Environment: Poetry
  • CI/CD: Git

Architecture

ELT Steps

Steps in the ELT are as follows:

  1. A Project is created on GitHub
  2. Infrastructure for the Project is created using Terraform which creates the following:
    • Datalake: S3 Bucket where the raw and cleaned data will be stored
  3. The Pipeline for ELT is created and scheduled for daily execution. It is orchestrated via Astronomer + Airflow, which performs the following tasks (a minimal DAG sketch follows this list):
    • Extracts raw data from source via Scryfall API
    • Loads raw data as json file to S3 Bucket
    • Cleans the raw data using DuckDb
    • Loads the cleaned data as parquet files to S3
    • Creates External tables in the Datasets in MotherDuck by pulling data from S3.
    • Transforms Data from S3 using dbt-core and creates the following in the dev/prod Dataset (along with Tests and Documentation)
      • The view: "stg_cards"
      • The fact table: "fact_cards"
  4. Transformed Data from MotherDuck is used for Reporting and Visualization using Preset to produce Dashboards
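
As an illustration of step 3, here is a minimal daily DAG sketch using Airflow's TaskFlow API (Airflow 2.4+); the Scryfall bulk-data endpoint is public, but the local landing path and task breakdown are simplified placeholders rather than the project's actual DAG.

import json
import requests
import pendulum
from airflow.decorators import dag, task

SCRYFALL_BULK_URL = "https://api.scryfall.com/bulk-data"  # Scryfall bulk-data listing

@dag(schedule="0 9 * * *", start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def mtg_daily_elt():

    @task
    def extract() -> dict:
        # Extract the bulk-data manifest from the Scryfall API.
        resp = requests.get(SCRYFALL_BULK_URL, timeout=60)
        resp.raise_for_status()
        return resp.json()

    @task
    def load_raw(payload: dict) -> str:
        # In the project this would land as JSON in the S3 bucket created by
        # Terraform; a local file keeps the sketch self-contained.
        path = "/tmp/scryfall_bulk_raw.json"
        with open(path, "w") as f:
            json.dump(payload, f)
        return path

    load_raw(extract())

mtg_daily_elt()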

Dashboard

Conclusion

Through this project we were able to successfully build an end-to-end ELT pipeline that is scheduled to run daily. As a result, we have a daily-updated MTG cards dataset that can be visualized via the Dashboard on Preset, which helps us gain useful insights into the latest card information.

The full project can be viewed on my GitHub:

Magic: The Gathering Pipeline

Esports Earnings -
Data Pipeline

End-to-end batch data pipeline on Esports Earnings Data.


OVERVIEW

The goal of this project is to build an end-to-end batch data pipeline on the Esports Earnings data available from Esports Earnings, and to perform ELT (Extract, Load, Transform) monthly in order to analyze esports earnings patterns from historical data up to the present.

Problem Statement

Data

The data selected for this project is the esports earnings data obtained from Esports Earnings, a community-driven competitive gaming resource based on freely available public information; the accuracy of the site depends on user contributions. The data goes back to 1998 and is extracted via the Esports Earnings API.

Proposed Solution

This project aims to extract the data from the source via the API and build a batch ELT pipeline that is scheduled to run monthly and updates the connected Dashboard for monthly Analytics & Reporting.

Data Pipeline Overview

This is a batch pipeline that performs ELT on the 1st of every month at 07:00 am.

The ELT steps include:

  • Extract the dataset from Esports Earnings via the API and load the data into the Datalake
  • Clean the data and load it into the Datalake
  • Load the data from Datalake into external tables in the Data Warehouse
  • Transform the data in the Data Warehouse
  • Visualize the data by creating a Dashboard

Data Pipeline with Medallion Architecture

  • Bronze layer: raw data - parquet format
  • Silver layer: cleansed and conformed data - delta format
  • Gold layer: curated business-level tables - delta format
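
As a sketch of the bronze-to-silver step, the following uses PySpark with the delta-spark package; the GCS paths and column names (TournamentId, TotalUsdPrize) are illustrative assumptions, not the project's exact schema.

from pyspark.sql import SparkSession, functions as F

# Assumes the delta-spark package is installed and its JARs are on the classpath.
spark = (
    SparkSession.builder.appName("esports-bronze-to-silver")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Bronze layer: raw data stored as Parquet.
bronze = spark.read.parquet("gs://esports-datalake/bronze/tournaments/")

# Silver layer: cleansed and conformed data, stored as Delta.
silver = (
    bronze.dropDuplicates(["TournamentId"])                       # illustrative key column
          .withColumn("TotalUsdPrize", F.col("TotalUsdPrize").cast("double"))
)
silver.write.format("delta").mode("overwrite").save("gs://esports-datalake/silver/tournaments/")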

Technologies

  • Cloud: GCP
  • Infrastructure as code (IaC): Terraform
  • Workflow orchestration: Prefect
  • Containerization: Docker
  • Data Warehouse: BigQuery
  • Batch processing: PySpark
  • Data Transformation: dbt-core
  • DataViz: Looker Studio
  • Virtual Environment: Anaconda
  • CI/CD: Git

Architecture

ELT Steps

Steps in the ELT are as follows:

  1. A Project is created on GCP
  2. Esports Earnings API key is obtained by creating an account on Esports Earnings, which will be used to extract the data from the source.
  3. Infrastructure for the Project is created using Terraform which creates the following:
    • Datalake: Google Cloud Storage Bucket where the raw and cleaned data will be stored
    • Data Warehouse: Three Datasets on BigQuery namely "esports_bronze", "esports_silver" and "esports_gold" are created in order to store the tables/views during different stages of ELT
    • Artifact Registry: A place to manage the container (Docker) images
  4. Prefect Cloud API is obtained by creating an account on Prefect Cloud
  5. The Pipeline for ELT is created on the Docker image and is scheduled for monthly execution. It is orchestrated via Prefect Cloud, which performs the following tasks (a minimal flow sketch follows this list):
    • Extracts raw data from source via Esports Earnings Data API
    • Loads raw data as parquet files to GCS Bucket
    • Cleans the raw data using PySpark
    • Loads the cleaned data as delta files to GCS
    • Creates External tables in the Datasets in BigQuery by pulling data from GCS
    • Transforms Data from BigQuery using dbt-core and creates the following in the dev/prod Dataset (along with Tests and Documentation)
      • The views "stg_esports_tournaments" and "stg_esports_games_awarding_prize_money"
      • Fact tables "fact_esports_teams_tournaments" and "fact_esports_individuals_tournaments"
  6. Transformed Data from BigQuery is used for Reporting and Visualization using Looker Studio to produce Dashboards
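
As an illustration of step 5, here is a minimal monthly flow sketch using Prefect 2's @flow/@task API; the Esports Earnings endpoint path, the local landing file, and the API-key handling are illustrative placeholders rather than the project's actual flow.

import json
import requests
from prefect import flow, task

# Illustrative Esports Earnings API endpoint; the real flow would read the key
# from a secret store rather than a parameter default.
API_URL = "https://api.esportsearnings.com/v0/LookupRecentTournaments"

@task(retries=3, retry_delay_seconds=60)
def extract(api_key: str) -> list:
    resp = requests.get(API_URL, params={"apikey": api_key}, timeout=60)
    resp.raise_for_status()
    return resp.json()

@task
def load_raw(records: list) -> str:
    # In the project this lands as Parquet in the GCS bronze layer;
    # a local JSON file keeps the sketch self-contained.
    path = "/tmp/esports_raw.json"
    with open(path, "w") as f:
        json.dump(records, f)
    return path

@flow(name="esports-monthly-elt")
def esports_elt(api_key: str = "YOUR_API_KEY"):
    load_raw(extract(api_key))

if __name__ == "__main__":
    esports_elt()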

Dashboard

Conclusion

Through this project we were able to successfully build an end-to-end ELT pipeline that is scheduled to run monthly. As a result, we have a monthly-updated Esports Earnings dataset that can be visualized via the Dashboard on Looker Studio, which helps us gain useful insights into the latest esports trends and patterns.

The dashboard can be viewed on my Looker Studio page:

Esports Earnings Report

The full project can be viewed on my GitHub:

Esports Earnings Pipeline

Intro

My name is Arthur Pitta,

I am a Data Engineer and have worked as a Business Intelligence Analyst, delivering tangible improvements and significant cost savings for the organization. I have a degree in Physical Education, and in my master's and doctorate I used MATLAB to process data. I never imagined that while studying Physical Education I would end up programming. I confess that at first I was a little scared, because I do not come from the exact sciences or technology, but I enjoyed programming from my first contact with this new area of knowledge.

Before starting to learn Python, I had worked with MATLAB. Because of the "similarity" between these tools for working with data, I chose Python. I started studying Python and, instead of just liking it, I fell in love with programming. I decided to continue into web development, but, coincidence or not, I returned to the data area.

I also have a vocation for teaching, so I decided to create an Instagram account and post articles on Medium to share knowledge, content, questions, difficulties, and the reality of those who are learning and working with Python.

