_delta_log) and deletion vectors to mark rows as invalid without physically rewriting the entire Parquet file. Letās walk through an example to understand what happens step by step.
Scenario
-
You have a Delta Table with the following data stored in two Parquet files:
1.parquet: Contains rows(1, "Arun")and(2, "Bala").2.parquet: Contains row(3, "Raj").
-
The transaction log (
_delta_log) has two entries:0.json: Records the creation of1.parquet.1.json: Records the addition of2.parquet.
-
You want to delete the row with
id=2.
Step-by-Step Process
1. Initial State
- The table currently looks like this:
| id | name |
|---|---|
| 1 | Arun |
| 2 | Bala |
| 3 | Raj |
- The transaction log reflects the current state of the table:
0.json:1.parquetis added.1.json:2.parquetis added.
2. Deleting the Row
When you delete the row withid=2, the following happens:
Delete operation on a delta table
-
Identify the File Containing the Row: The row with
id=2is located in1.parquet. -
Mark the Row as Invalid: Delta Tables use a deletion vector to mark the row with
id=2as invalid in1.parquet. This avoids rewriting the entire file. -
Update the Transaction Log:
- A new transaction log entry (
2.json) is created to record the changes:1.parquetis āremovedā (marked as invalid for the row withid=2).1.parquetis āre-addedā with a deletion vector specifying that the row withid=2should be excluded.
- A new transaction log entry (
2.json
3. Final State
After the delete operation:-
The table still has two Parquet files:
1.parquet: Contains the original rows(1, "Arun")and(2, "Bala"), but the row withid=2is marked as invalid.2.parquet: Contains the original row(3, "Raj").
-
The transaction log (
_delta_log) now has three entries:0.json: Records the creation of1.parquet.1.json: Records the addition of2.parquet.2.json: Records the deletion (removal and re-addition of1.parquetwith a deletion vector).
How Queries Work
When you query the table after the delete operation:- The transaction log is consulted to determine which Parquet files are valid and which rows are excluded (via deletion vectors).
- The query reads:
- All valid rows from
1.parquet(excluding the row withid=2). - All rows from
2.parquet.
- All valid rows from
- The result is a combined view of the data:
| id | name |
|---|---|
| 1 | Arun |
| 3 | Raj |
Why Not Physically Delete the Row?
You might wonder why Delta Tables donāt physically remove the row from1.parquet. Hereās why:
-
Immutability of Parquet Files:
- Parquet files are immutable, meaning they cannot be modified once written. Physically deleting a row would require rewriting the entire file, which is inefficient.
-
Efficiency:
- Rewriting an entire Parquet file for a single row deletion would be extremely inefficient, especially for large files.
- By using deletion vectors, Delta Tables minimize the amount of data that needs to be rewritten.
-
Concurrency and Consistency:
- Delta Tables are designed to handle concurrent reads and writes. The transaction log ensures that readers always see a consistent snapshot of the data, even while deletions are happening.
- Physically deleting rows would make it much harder to maintain this consistency.
-
Time Travel:
- Delta Tables support time travel, allowing users to query previous versions of the table. This is possible because the transaction log keeps track of all changes over time.
- Physically deleting rows would make it difficult to maintain this history.
Summary
When you delete a row in a Delta Table:- The row is marked as invalid using a deletion vector.
- The transaction log is updated to reflect this change.
- Queries exclude the invalid row when reading data, providing a consistent view of the table.