Unlocking Efficiency and Flexibility: Exploring Different Operations on Delta Tables

By Poornima Peiris

--

Delta Lake has transformed how we handle big data, offering Delta tables that bring reliable, ACID-compliant storage to data lakes. These tables support a wide range of operations that make data management easier. Basic operations like inserting and updating data are simple to understand, while advanced features such as merge statements and partitioning add flexibility. Delta tables even let us travel back in time to view past versions of the data. By learning how these operations work, we can optimize our data workflows and ensure data reliability, which makes Delta tables essential for anyone working with big data.

First, let’s understand the difference between the “append”, “overwrite” and “merge” write modes of a Delta table. The “append” mode adds new data to the existing Delta table, preserving the existing data. On the other hand, the “overwrite” mode replaces the entire content of the Delta table with the new data, essentially starting fresh. The “merge” statement is used to perform a combination of INSERT, UPDATE, and DELETE operations based on a specified condition.

To demonstrate this, let’s generate a random dataset and write it into a sample Delta table. When you write a Delta table without specifying a database or schema, it is created in the default location.
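The original code was shown as a screenshot, so here is a minimal PySpark sketch of what this step might look like. The table name sample_delta_table and the column layout are illustrative assumptions, not taken from the original.

```python
from pyspark.sql import functions as F

# `spark` is the SparkSession that Databricks notebooks provide automatically.
# Generate 100 random sample records (the column layout is illustrative).
df1 = (
    spark.range(0, 100)
    .withColumn("city", F.when(F.col("id") % 2 == 0, F.lit("Houston")).otherwise(F.lit("Dallas")))
    .withColumn("amount", (F.rand() * 1000).cast("int"))
    .withColumn("record_date", F.lit("2024-03-29").cast("date"))
)

# Write the DataFrame as a managed Delta table. Because no database or schema
# is specified, the table is created in the default database and location.
df1.write.format("delta").mode("overwrite").saveAsTable("sample_delta_table")
```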

Let’s check the number of records.

Below, we can observe our sample records.
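A quick check might look like the following, reusing the assumed table name from the sketch above:

```python
# Count the rows in the Delta table -- 100 at this point
spark.table("sample_delta_table").count()

# Take a look at a few sample records
spark.table("sample_delta_table").show(10)
```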

Now, let’s generate another dataset containing 50 records and save it in a second DataFrame.
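Continuing the same sketch, a second DataFrame with 50 records could be built like this (again with an assumed schema):

```python
# A second DataFrame with 50 records, using the same illustrative schema
df2 = (
    spark.range(100, 150)
    .withColumn("city", F.when(F.col("id") % 2 == 0, F.lit("Chicago")).otherwise(F.lit("Houston")))
    .withColumn("amount", (F.rand() * 1000).cast("int"))
    .withColumn("record_date", F.lit("2024-03-29").cast("date"))
)

df2.count()  # 50
```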

With our DataFrame and Delta table in place, we’re ready to walk through each operation.

Append

• The “append operation” involves adding new data to an existing dataset or table without altering existing data.

• It is commonly used to add records to a dataset over time, such as appending new rows to a database table.

• This operation is particularly useful for continuously updating datasets or logging new events without losing previous data.

• It ensures data integrity by preserving existing records and only adding new entries.

• The append operation does not modify or delete any existing data; it simply extends the dataset with additional information.

Our Delta table currently holds 100 records, while the DataFrame contains 50 records. We’re about to append an additional 50 records to the Delta table.

Next, we’ll append the data into the Delta table and verify the record count after the operation.
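With the assumed names from the earlier sketches, the append and the follow-up count might look like this:

```python
# Append the 50 records in df2 to the existing Delta table
df2.write.format("delta").mode("append").saveAsTable("sample_delta_table")

# Verify the record count after the append: 100 + 50 = 150
spark.table("sample_delta_table").count()
```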

As we can now see, we have a total of 150 records.

Overwrite

The “overwrite” operation replaces existing data with new data in a dataset or table, ensuring that only the latest information is retained.

• Completely replaces the contents of the dataset or table.

• Useful for refreshing datasets with updated or corrected information.

• Eliminates any previous records, ensuring data consistency.

• Resets the dataset to the state defined by the new data.

• Ensures that only the most recent data is retained, discarding any prior entries.

Next, let’s overwrite the Delta table with DataFrame 02 and observe the impact on our dataset.
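Using the same assumed names, the overwrite could look like this:

```python
# Overwrite the Delta table with the contents of DataFrame 02
df2.write.format("delta").mode("overwrite").saveAsTable("sample_delta_table")

# Only the 50 records from df2 remain
spark.table("sample_delta_table").count()
```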

After checking the record count, it’s evident that our dataset has been replaced by DataFrame 02.

Merge Statement

The “merge statement” in Delta tables combines data from different sources based on specified conditions, enabling operations like insert, update, and delete within a single statement.

• Combining Data: It merges data from source and target tables based on matching criteria.

• Insert Operation: If a record exists in the source but not the target, it inserts it into the target.

• Update Operation: If a record exists in both source and target and meets certain conditions, it updates the target with values from the source.

• Delete Operation: If a record exists in the target but not the source, it can be deleted from the target.

• Conditional Logic: The merge statement allows for conditional logic to determine the actions to take based on specific conditions.

• Efficiency: It optimizes data synchronization tasks by minimizing the need for separate insert, update, and delete operations.

In essence, the merge statement streamlines the process of synchronizing and updating data in Delta tables, enhancing efficiency and data integrity.

We have our sample Delta table containing records dated 2024-03-29.

Now, let’s create another sample DataFrame with records dated 2024-03-30.

Let’s retrieve the records dated 2024-03-29 from the Delta table and load them into a third DataFrame. Next, we’ll combine DataFrames 2 and 3 into a new DataFrame called “result_df” and remove duplicates.
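A sketch of these steps, reusing the assumed schema and table name from earlier (the DataFrame names are also illustrative):

```python
# A new DataFrame holding records dated 2024-03-30 (illustrative values)
df_mar30 = (
    spark.range(0, 50)
    .withColumn("city", F.lit("Houston"))
    .withColumn("amount", (F.rand() * 1000).cast("int"))
    .withColumn("record_date", F.lit("2024-03-30").cast("date"))
)

# A third DataFrame: the existing records dated 2024-03-29 from the Delta table
df_mar29 = spark.table("sample_delta_table").filter(F.col("record_date") == "2024-03-29")

# Combine the two DataFrames and drop duplicate rows
result_df = df_mar30.unionByName(df_mar29).dropDuplicates()
result_df.count()
```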

After removing duplicates, the DataFrame now contains the following number of records.

Now, let’s merge the result_df DataFrame into the Delta table.
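Below is a sketch of the merge using the Delta Lake Python API. The join key id, the update condition, and the column list are assumptions made for illustration rather than the original code.

```python
from delta.tables import DeltaTable

delta_table = DeltaTable.forName(spark, "sample_delta_table")

(
    delta_table.alias("target")
    # Match source and target rows on the illustrative key column `id`
    .merge(result_df.alias("source"), "target.id = source.id")
    # When a match is found and the condition holds, update the target
    # with values from the source
    .whenMatchedUpdate(
        condition="target.record_date = source.record_date",
        set={"city": "source.city", "amount": "source.amount"},
    )
    # When no match is found, insert the source row into the target
    .whenNotMatchedInsertAll()
    # Run the merge with the conditions defined above
    .execute()
)
```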

Here’s a brief description of how a sample Delta table merge operation works with the provided parameters:

• Merge: This initiates the merge operation on the Delta table.

• WhenMatchedUpdate: This specifies the condition to be met when a match is found between the source and target data. If the condition is satisfied, it updates the target table with values from the source.

• Condition: This defines the condition to be met for the update operation to take place. It can be based on one or more columns in the dataset.

• WhenNotMatched: This determines the action to be taken when no match is found between the source and target data.

• Execute: This executes the merge operation based on the specified conditions, updating the target Delta table accordingly.

Now, let’s execute a query to view the records in our target Delta table.
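For example, with the assumed table name:

```python
# View the records in the target Delta table after the merge
spark.table("sample_delta_table").orderBy("record_date").show()
```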

As observed, while our merged DataFrame contains values from both 2024-03-29 and 2024-03-30, only the data from 2024-03-30 was inserted into the table. Additionally, the existing data for 2024-03-29 was updated.

Furthermore, by examining the history of the merge operation, we can determine the number of records inserted, updated, and deleted as follows:
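One way to inspect this is the Delta Lake history API, using the table handle from the sketch above:

```python
# The MERGE entry in the table history reports how many rows were
# inserted, updated, and deleted in its operationMetrics column
(
    delta_table.history(5)
    .select("version", "operation", "operationMetrics")
    .show(truncate=False)
)
```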

Update Delta Table

Updating a Delta table involves modifying existing records within the table based on specified conditions. This can include changing values within certain columns, updating rows that meet specific criteria, or applying transformations to the data. In essence, updating a Delta table allows for the modification and refinement of existing data to reflect changes or corrections in the underlying dataset.
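The update was originally shown as a notebook cell; reconstructed from the description that follows, it would look roughly like this (delta_table is a placeholder name, as noted below):

```sql
%sql
UPDATE delta_table
SET city = 'California'
WHERE city = 'Houston'
```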

This code uses the %sql magic command to execute SQL statements directly in a Databricks notebook. It updates the delta_table by setting the city column to ‘California’ for rows where the city is ‘Houston’. Note that you need to replace delta_table with the actual name of your Delta table.

Other Valuable Operations on Delta Tables

Checking if a Table is a Delta Table

You can verify whether a table is a Delta table by using the following command:
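Two common ways to do this are sketched below; the table name and warehouse path are illustrative assumptions. DeltaTable.isDeltaTable returns a simple True/False for a storage path, while DESCRIBE DETAIL returns the table’s format along with richer metadata.

```python
from delta.tables import DeltaTable

# Returns True/False for a given storage path (the path shown is illustrative)
DeltaTable.isDeltaTable(spark, "dbfs:/user/hive/warehouse/sample_delta_table")

# DESCRIBE DETAIL reports the table format along with metadata such as
# location, partition columns, number of files, and size in bytes
spark.sql("DESCRIBE DETAIL sample_delta_table").show(truncate=False)
```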

This provides you with a lot of information about the table, such as its format, location, partition columns, and size.

How to View the History of a Delta Table

Viewing the history of a Delta table allows you to track the changes made to the table over time. It provides insights into the operations performed on the table, such as inserts, updates, deletes, and metadata changes. This history enables you to understand the evolution of the data within the table and track any modifications made to it.
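For example, using the assumed table name:

```python
# DESCRIBE HISTORY lists every commit on the table: version, timestamp,
# user, operation, operation parameters, and operation metrics
spark.sql("DESCRIBE HISTORY sample_delta_table").show(truncate=False)
```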

In conclusion, exploring the various operations available on Delta tables opens new avenues for efficiency and flexibility in data management. By leveraging features such as append, overwrite, merge, and more, data engineers can streamline their workflows and ensure data integrity within their Delta Lake architecture. As we continue to delve deeper into the realm of data engineering, mastering these operations empowers us to extract maximum value from our data lakes.

We hope this exploration has provided valuable insights into the potential of Delta tables and inspired you to optimize your data engineering practices. As you embark on your journey, remember to stay curious, experiment with new techniques, and share your learnings with the community.

OCTAVE — John Keells Group

OCTAVE, the John Keells Group Centre of Excellence for Data and Advanced Analytics, is the cornerstone of the Group’s data-driven decision making.