Data Lake Development
Project Goals
The project’s goal was to provide a scalable platform that unified diverse data sources into curated data sets for analytics and reporting use throughout the company. Additionally, the new platform replaced legacy reports that relied on technology being phased out to reduce cost.
Work Summary
I worked with a team of data engineers, systems administrators, stakeholders, and analysts on the data lake development project. My role consisted of building the transformations, data models, and data processing pipelines used to integrate satellite imagery catalog and customer order fulfillment data. I also worked with analysts, producing analytic datasets for use in reports and dashboards.
Project Technologies
The data lake implementation included the following tools and technologies:
Trino: A scalable SQL query engine that performs analytics across a variety of database technologies connected in a high-performance, federated fashion. Its most distinctive feature from a developer’s perspective is the ability to join tables from completely different databases in a single SQL statement (see the federated query sketch after this list).
Apache Iceberg: An open-source table format that enables processing of tables holding petabytes of data. It was integrated with Trino to provide SQL access, including the standard relational database operations.
AWS S3 buckets: Used to store the Iceberg table data as Parquet files.
AWS RDS (PostgreSQL, Oracle): These databases were the primary storage for operational data as well as legacy data mart data. They were connected to the Trino cluster in a read-only configuration.
dbt (Data Build Tool): An open-source tool for building data models from complex transformations written in SQL and the Jinja templating language. Key strengths include the ability to generate SQL statements dynamically, pre- and post-hooks for running SQL before and after a model builds, and an elegant method of performing tests to build data quality into the process (a model sketch follows this list).
Apache Superset: Open-source Business Intelligence software that is able to connect to a wide variety of data sources, including Trino.
Apache Airflow: An open-source platform used to create, orchestrate, and monitor sequences of tasks as workflows such as data processing pipelines. Workflows are coded in Python and follow the directed acyclic graph (DAG) concept, which ensures that each task is executed once per workflow run, prevents infinite loops, and defines clear dependencies among tasks.
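As a minimal sketch of how Trino, Iceberg, and S3 fit together, the statement below creates a curated Iceberg table stored as Parquet and populates it by joining tables from two different connected databases in a single query. The catalog, schema, and table names (postgresql, oracle, iceberg, sales.orders, and so on) are illustrative placeholders, not the project’s actual objects.

    -- Create a curated Iceberg table stored as Parquet on S3 and load it
    -- with a cross-catalog join (all names are illustrative placeholders)
    CREATE TABLE iceberg.curated.order_summary
    WITH (format = 'PARQUET')
    AS
    SELECT o.order_id,
           o.customer_id,
           c.region_name,
           o.order_total
    FROM postgresql.sales.orders AS o              -- table in AWS RDS PostgreSQL
    JOIN oracle.legacy_mart.customer_regions AS c  -- table in AWS RDS Oracle
      ON c.customer_id = o.customer_id;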
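The following dbt model is a minimal sketch of the Jinja templating and hook features mentioned above; the model name, source, and audit table are hypothetical and assume a matching source definition exists in the project.

    -- models/curated/daily_order_counts.sql (illustrative dbt model)
    {{ config(
        materialized = 'table',
        pre_hook  = "DELETE FROM datalake.audit.model_runs WHERE model_name = 'daily_order_counts'",
        post_hook = "INSERT INTO datalake.audit.model_runs VALUES ('daily_order_counts', current_timestamp)"
    ) }}

    {% set regions = ['AMER', 'EMEA', 'APAC'] %}

    SELECT
        order_date,
        -- Jinja generates one aggregate column per region at compile time
        {% for region in regions %}
        count_if(region = '{{ region }}') AS {{ region | lower }}_orders{{ "," if not loop.last }}
        {% endfor %}
    FROM {{ source('sales', 'orders') }}
    GROUP BY order_date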
Challenges and Design Considerations
Building a data lake presented many challenges to the entire team.
JSON Data: Some tables in the source databases contained JSON columns that required decomposition, and in some cases normalization, to make the data understandable to other users in the organization. Understanding the structure and meaning of the embedded objects, keys, and values also required detailed collaboration with external teams (a decomposition sketch follows this list).
Data quality: As with any project built on databases that have evolved over decades, the fitness of the source data required special handling. In this case, there were gaps in the data caused by source system failures during processing and other outages. I detected some of these issues with the dbt test feature, raising errors and reporting them for upstream action. Additionally, I developed separate processes to evaluate the accuracy of the data.
Learning curve for dbt: dbt is a command-line tool and is most conveniently used from a Linux-based SSH terminal, but some developers in the organization were more comfortable on MS Windows desktops. I documented a method for performing dbt model development and testing with the PyCharm Professional IDE in conjunction with the terminal, covering the coding, testing, and Git repository activities.
Maintenance of Iceberg tables: Tables in Trino that use the Iceberg connector can be maintained with most of the standard SQL data definition and data manipulation statements expected of a relational database. Where MERGE, UPDATE, and DELETE statements were used, we found that the tables needed to be optimized periodically. I built steps into the Airflow workflows to execute this maintenance on a schedule (see the maintenance sketch after this list).
Query Performance in Trino: Some dbt models required complex and expensive joins that stressed the memory and CPU capacity of the Trino cluster. In these cases, I worked around the problem by passing the SQL directly to the source database using the connector’s query() table function (see the pass-through sketch after this list).
Obscure SAP table names: Some dbt transformations required Oracle-based table data from an external SAP R/3 platform. SAP table names and schema structures are notoriously unintuitive to developers who do not work with SAP, which required frequent outreach to and collaboration with external teams.
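To illustrate the JSON decomposition described above, here is a minimal sketch of a Trino query that flattens an embedded array of objects into relational rows. The payload column, JSON paths, and field names are assumptions for the example, not the actual source structures.

    -- Flatten a JSON array of line items into one row per item
    -- (column names and JSON structure are illustrative)
    SELECT o.order_id,
           json_extract_scalar(o.payload, '$.status') AS order_status,
           item.sku,
           item.qty
    FROM postgresql.sales.orders AS o
    CROSS JOIN UNNEST(
        CAST(json_extract(o.payload, '$.line_items')
             AS ARRAY(ROW(sku VARCHAR, qty INTEGER)))
    ) AS item (sku, qty)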
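For the Iceberg maintenance challenge, the statements below sketch the kind of periodic maintenance the Trino Iceberg connector supports; the table name and retention threshold are placeholders.

    -- Compact small files produced by frequent MERGE/UPDATE/DELETE activity
    ALTER TABLE iceberg.curated.order_summary EXECUTE optimize;

    -- Drop old snapshots and clean up unreferenced files
    -- (the retention threshold shown here is only an example)
    ALTER TABLE iceberg.curated.order_summary
        EXECUTE expire_snapshots(retention_threshold => '7d');
    ALTER TABLE iceberg.curated.order_summary
        EXECUTE remove_orphan_files(retention_threshold => '7d');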
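And for the query-performance workaround, this sketch uses the connector pass-through table function, which sends the SQL text to the source database so the expensive join is executed there rather than in the Trino cluster; the catalog name and query text are illustrative.

    -- Push an expensive join down to the source PostgreSQL database
    SELECT *
    FROM TABLE(
        postgresql.system.query(
            query => 'SELECT o.order_id,
                             sum(i.qty * i.unit_price) AS order_total
                      FROM sales.orders o
                      JOIN sales.order_items i ON i.order_id = o.order_id
                      GROUP BY o.order_id'
        )
    )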
Development Tasks
All of the source and target systems were connected through the Trino distributed SQL query engine, so there was no need for separate extraction processes; any of the connected data sources could be joined using SQL. The transformations were performed exclusively with the open-source dbt Core software.
My main activities on this project included:
Creation of Replicas: One of the goals of the data lake was to provide a replica of a set of tables representing Order, Order History, and Inventory entities from an external operational database. Many of the tables used JSON data types with embedded arrays and multidimensional objects, and a requirement was to decompose this data and present it in a form that was easier for other teams to analyze. Additionally, the data in the operational database was frequently updated, so it was necessary to keep the targets current. For this replica, I implemented a change-data-capture technique using the appropriate materializations and incremental strategies provided by dbt, together with the SQL MERGE support provided by Trino and Iceberg (a model sketch follows this list).
Airflow workflows: Following development and testing of the dbt models and the required table maintenance routines, I created the Python-based Airflow DAGs (directed acyclic graphs) to schedule and orchestrate the tasks.
Analytic Dataset Development: My development efforts here included collaboration with subject matter experts who were familiar with the reporting requirements. This was part of a broader effort to replace legacy reports with a new series of dashboards and reports built on Apache Superset and the data lake. The intent was to provide report and dashboard creators with consolidated, tested, and up-to-date datasets that satisfied complex requirements.
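A minimal sketch of the change-data-capture pattern used for the replicas, assuming the dbt-trino adapter’s merge incremental strategy and an updated_at change-tracking column; the model, source, and column names are illustrative.

    -- models/replica/orders.sql (illustrative incremental replica model)
    {{ config(
        materialized         = 'incremental',
        incremental_strategy = 'merge',
        unique_key           = 'order_id'
    ) }}

    SELECT
        order_id,
        customer_id,
        order_status,
        updated_at
    FROM {{ source('operational', 'orders') }}
    {% if is_incremental() %}
      -- Only select rows changed since the last successful run; dbt issues a
      -- MERGE against the Iceberg target keyed on order_id
      WHERE updated_at > (SELECT max(updated_at) FROM {{ this }})
    {% endif %}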
Validation and Monitoring
The dbt tool provides an elegant, built-in framework for testing transformations. It can perform schema-level tests to enforce referential integrity, unique-key rules, and NOT NULL column requirements, and it also supports custom SQL tests to verify that business rules are adhered to. I used all of these features to detect problems and raise errors (a test sketch follows below).
When a test failed, the exception was visible on the Airflow console, so it was easy to identify and troubleshoot the issue.
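As an illustration of the custom SQL testing mentioned above, a dbt singular test is simply a SELECT that returns the rows violating a rule; dbt marks the test as failed when any rows come back. The file name, model name, and business rule below are hypothetical.

    -- tests/assert_shipped_orders_have_ship_date.sql (illustrative singular test)
    -- Any rows returned here represent a business-rule violation
    SELECT order_id
    FROM {{ ref('orders') }}
    WHERE order_status = 'SHIPPED'
      AND shipped_date IS NULL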