In this article, we answer why table formats are useful, what problems they solve, and demonstrate basic usage based on the Apache Iceberg table format.
In an ever-growing world of data, traditional Data Lakes were assumed to be immutable – once written, data remained unchanged. This assumption came from their main use case: storing huge amounts of raw data for historical analysis and reporting.
As businesses continue to grow, some use cases have become hard (or even too expensive) to implement with the existing toolsets. In particular, the need for handling data mutations at Big Data scale, such as updates and deletes, has grown. This is where the Data Lakehouse steps in. By combining the capabilities of Data Lakes and Data Warehouses, Lakehouses allow for transactional operations, like updates and deletes, similar to what we see in Data Warehouses. A Lakehouse is designed to handle both structured and unstructured data, provides support for various data types, and offers the flexibility to run different types of analytics – from machine learning to business intelligence.
Data Lake + Data Warehouse = Data Lakehouse
In the architecture of a Data Lakehouse (often called a Transactional Data Lake) there's an essential component - a table format such as Apache Iceberg, Apache Hudi, or Delta Lake. These technologies provide transactional capabilities, data versioning, rollback, time travel, and upsert features, making them crucial for getting the most out of your Data Lakehouse and unlocking the true potential of your data.
Apache Iceberg is an open-source table format for large-scale data processing, initially developed at Netflix, with early contributions from companies such as Apple. Iceberg was created to address the limitations and challenges of existing ways of organizing data in a Data Lake, such as the Hive table layout over Apache Parquet files. Apache Iceberg adds ACID (atomicity, consistency, isolation, and durability) transactions, snapshot isolation, time travel, schema evolution, and more. It is designed to provide efficient and scalable data storage and analytics capabilities, particularly for big data workloads.
Iceberg provides a table format abstraction that allows users to work with data using familiar SQL-like semantics. It captures rich metadata about the dataset as individual data files are created. Iceberg tables consist of three layers: the Iceberg catalog, the metadata layer, and the data layer, which leverages immutable file formats like Parquet, Avro, and ORC.
Some key features of Apache Iceberg include:
But to fully unlock the potential of Apache Iceberg, we need to place it in a well-integrated environment. This is where AWS steps in.
Apache Iceberg works with data frameworks like Apache Spark, Flink, Hive, and Presto, and with AWS services like Amazon Athena, Amazon EMR, and AWS Glue. These AWS services, combined with Iceberg, support a Data Lakehouse architecture with the data stored in an Amazon S3 Bucket and metadata in the AWS Glue Data Catalog.
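To give a concrete idea of that integration, here is a minimal PySpark configuration sketch that registers an Iceberg catalog backed by the AWS Glue Data Catalog, with table data on Amazon S3. The catalog name glue_catalog and the bucket name are our assumptions, not fixed values:

```python
from pyspark.sql import SparkSession

# Minimal sketch: an Iceberg catalog named "glue_catalog" backed by the
# AWS Glue Data Catalog, storing table data and metadata on Amazon S3.
spark = (
    SparkSession.builder
    # Enable Iceberg's SQL extensions (MERGE INTO, CALL procedures, time travel)
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue_catalog",
            "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.io-impl",
            "org.apache.iceberg.aws.s3.S3FileIO")
    # Bucket name is illustrative
    .config("spark.sql.catalog.glue_catalog.warehouse",
            "s3://my-lakehouse-bucket/warehouse/")
    .getOrCreate()
)
```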
The following diagram (figure 1) demonstrates how we can approach it on AWS.
The process starts with a Data Lake that functions as a primary repository for raw, unprocessed data. This initial stage contains three integral components:
Following this, arrows signify data transfer from the Data Lake to one of the three processing components. We can use any of these tools to gather our data and put it into the next stage:
All three tools perform data transformation, cleaning, and loading operations and are capable of saving processed data in Apache Iceberg table format.
The last stage is the final Data Lakehouse, which is effectively the same infrastructure stack as the initial Data Lake (Amazon S3, AWS Glue Data Catalog, AWS Lake Formation), but with data stored in the optimized Apache Iceberg table format. Such a Lakehouse can then be consumed by downstream services, such as:
After this brief introduction, we will now demonstrate the basic features of a Data Lakehouse architecture on AWS using Apache Iceberg.
Please note that this section is neither an end-to-end tutorial nor a deep dive into specific Iceberg features. Our focus here is to provide a basic walkthrough of Iceberg features and give you a sense of the core components used in a Data Lakehouse on AWS. We skip some boilerplate details, letting you focus only on the important parts.
We start with ingesting data into the Apache Iceberg realm. Then, we’ll show how you can query that data from Amazon Athena while using the time travel feature. Lastly, we will demonstrate how to generate a table with differences between each version of your datasets using the changelog mechanism of Apache Iceberg and Apache Spark, also known as Change Data Capture.
We will now focus on the first part of the process (figure 3), which is mainly handled by an AWS Glue Job. Its main purpose is to collect raw, unprocessed data and transform it into the Apache Iceberg format, which serves as the core of the Data Lakehouse.
Let’s assume we already have some data in our Raw Data Zone Amazon S3 Bucket (figure 3) that we want to process and analyze. For storage purposes, we utilize the JSON-Lines format, which is a popular option for raw data. The data describes products and looks as follows (listing 1):
<span class="code-info">Listing 1. Sample products records are stored in the Raw Data Zone under /day=01 partition</span>
In our Raw Data Zone, we partition the data by the ingestion date, so each day we put the ingestion results under a specific prefix in Amazon S3, which looks like this (listing 2):
<span class="code-info">Listing 2. Amazon S3 Bucket structure under Raw Data Zone</span>
Now, let's transfer our data into what's called a Curated Data Zone (figure 3), a Data Lakehouse where we'll store it in the Apache Iceberg table format, ready for further analytics. For that, we are going to use an AWS Glue Job written in Python that uses PySpark (Apache Spark) to read and write the data. AWS Glue needs to be configured to load the necessary libraries for Apache Iceberg, which can be done in three different ways:
The following script (listing 3) is used to load data from Amazon S3 for a specific day, perform data transformation tasks, and then either merge the results into an existing Apache Iceberg table or create a new one, depending on whether the table is already present in the AWS Glue Data Catalog. Besides the Amazon S3 Bucket, this script also requires the creation of an AWS Glue Data Catalog database, which in our case is called 'apache_iceberg_showcase'. As a result, the job will save the data in Amazon S3 and update the AWS Glue Data Catalog table with its schema.
<span class="code-info">Listing 3. Sample Python script in AWS Glue Job leverages Apache Spark to transform JSON data from the Raw Data Zone into Apache Iceberg format in the Curated Data Zone, simultaneously updating the AWS Glue Data Catalog</span>
After the AWS Glue Job is executed successfully, our data is stored in an Amazon S3 Bucket, and it's ready to be queried in Amazon Athena as an Apache Iceberg table listed in the AWS Glue Data Catalog.
The picture below (figure 4) illustrates how the products table schema appears in the AWS Glue Data Catalog table after successful AWS Glue Job execution.
This is the way Apache Iceberg keeps data in the Amazon S3 Bucket (listing 4), leveraging immutable file formats like Parquet and Avro. This is why we call Apache Iceberg a table format: it does all of this while keeping your data in open-source file formats in an Amazon S3 Bucket.
<span class="code-info">Listing 4. Amazon S3 Bucket structure under Curated Data Zone</span>
The data directory (data layer) works in conjunction with the metadata directory (metadata layer). While the data directory contains the raw data, the metadata directory contains information about the table layout, the schema, and the partitioning config, as well as the snapshots of the table's contents. The metadata allows for efficient data querying and supports the time travel feature, making it easy to query the table at different points in time. The Iceberg catalog, in turn, stores the pointer to the current table metadata file; in our case, this role is played by the AWS Glue Data Catalog table (figure 5).
Once we've stored the data in the Apache Iceberg format, we can begin to query it. We'll start with a basic query in Amazon Athena. This query uses the products table from the AWS Glue Data Catalog (figure 4).
In this instance, we've simply selected all records. This appears similar to a standard query made in Amazon Athena (figure 6).
But hold on, there's more. The Apache Iceberg table format allows us to easily update these tables, a feature that wasn't nearly as straightforward in the traditional Data Lake architecture.
Updating Apache Iceberg table data
Suppose we want to revise the price of Product A (figure 5). Below is the record in our dataset that we're planning to modify. We are going to put this record in the next partition under the Raw Data Zone Amazon S3 Bucket and re-run the same AWS Glue Job shown before (figure 3).
<span class="code-info">Listing 4. Updated products record ofProduct A stored in the Raw Data Zone under /day=02 partition.</span>
So, after we've executed the AWS Glue Job once again on the data added under the /day=02 partition in the Raw Data Zone of our Amazon S3 Bucket, we are ready to query. If we run the same query again, the products table's current state reflects the price increase (figure 7).
Now, we want to check the past price of products. Apache Iceberg makes this really easy. Because of its integration with Amazon Athena, we can use simple SQL commands to look back in time (figure 8).
In the image below (figure 9), there is a query used in Amazon Athena. This query targets an Apache Iceberg table. A special suffix, $history, is added to the table name to query its metadata. This allows us to see the history of actions performed on the table over time.
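For example, the same information can be read from Spark, where it is exposed as the Iceberg history metadata table; this is a sketch, and in Athena you would instead quote the name as "products$history":

```python
# Inspect the table history: one row per snapshot that became current.
spark.sql(
    "SELECT made_current_at, snapshot_id, is_current_ancestor "
    "FROM glue_catalog.apache_iceberg_showcase.products.history"
).show(truncate=False)
```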
Once we know the exact timestamp when the table was modified, let's perform a time travel query. The image below (figure 10) displays the original state of the table before we increased the price of Product A, moving us back in time.
By adjusting the timestamp value, we can also take a look at how the table appears after we've updated the price (figure 11).
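Here is a hedged sketch of such time travel queries in Spark SQL; in Amazon Athena the equivalent clause is FOR TIMESTAMP AS OF TIMESTAMP '... UTC'. The timestamps are illustrative and should be taken from the $history output:

```python
# Table state before the update (illustrative timestamp).
spark.sql("""
    SELECT * FROM glue_catalog.apache_iceberg_showcase.products
    TIMESTAMP AS OF '2023-06-01 12:00:00'
""").show()

# Moving the timestamp past the second Glue Job run returns the updated price.
spark.sql("""
    SELECT * FROM glue_catalog.apache_iceberg_showcase.products
    TIMESTAMP AS OF '2023-06-02 12:00:00'
""").show()
```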
As demonstrated, we can easily specify any given point in time to view the state of the table at that particular moment.
But what if we want to track when certain changes were made throughout the course of time?
Let's now turn to the AWS Glue Job that creates the changelog table (figure 12). Here, we are going to need another AWS Glue Job that uses the same job configuration as the one we mentioned before. What differs is the code it runs.
The code below (listing 6) generates an Apache Iceberg changelog view and saves it to an Amazon S3 Bucket, registered via the AWS Glue Data Catalog as the products_changelog table. Note that this time we are saving the data in plain Parquet (without the Apache Iceberg table format over it) for ad-hoc querying. The AWS Glue Job execution allows us to compute the data changelog between specific snapshots (taken from figure 9) or exact timestamps.
<span class="code-info">Listing 5. Sample Python script in the AWS Glue Job that utilizes Apache Spark to run an Apache Iceberg procedure, creating a changelog table on Amazon S3 and updating the products_changelog table in the AWS Glue Data Catalog</span>
After a successful AWS Glue Job execution, we can query the table from Amazon Athena and get a changelog of a specific record through time (figure 13). We can see the history of Product A, which was modified earlier, along with the commit timestamps. We also see the state of the record before and after a particular change was applied, indicated by the _change_type column.
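Illustratively, the changelog rows could look like this (values assumed; depending on the procedure options, Iceberg reports an update either as an UPDATE_BEFORE/UPDATE_AFTER pair or as a DELETE plus INSERT):

```
product_id | name      | price | _change_type  | _commit_snapshot_id
1          | Product A | 100.0 | UPDATE_BEFORE | 2222222222222222222
1          | Product A | 120.0 | UPDATE_AFTER  | 2222222222222222222
```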
We've traveled back and forth in time with Apache Iceberg, taken a tour of the Data Lakehouse on AWS, and learned why this is such a big deal for the data game.
Traditional Data Lakes are big stores of data that don't change. However, modern businesses often need to update their data more frequently. To address this, we have Data Lakehouses. These blend features from both Data Lakes and Data Warehouses, allowing changes to be made to the data, such as adding or removing information.
Apache Iceberg is a tool that helps manage large amounts of data better. It can track changes over time, fix data without rewriting everything, and make finding and accessing data easier. It also works well with AWS services, allowing data to be stored in a flexible and efficient way. This unlocks several features like time travel, seamless handling of updates, and incremental data processing for data stored on Amazon S3.