Best Practices for Deleting Data

Deleting Data from Vertica

Vertica is a high-performance column-store analytic database. For deleting data, Vertica differs from traditional databases in two ways:

The DELETE statement does not actually delete data from the disk storage; it marks rows as deleted so that they can be retrieved by historical queries.
The UPDATE statement performs two tasks: It writes the new data and marks the old data for deletion.

The DELETE statement does not delete the data from the disk storage. Instead, the statement creates a delete vector that records the position of the deleted record and the epoch when the delete was committed.

Reasons for Deleting Data

Common reasons for deleting data are as follows:

You need to delete historical data at regular intervals.
You need to update and/or delete data loaded by mistake.
You need to delete staging tables.

Different Types of Delete

	Recommended Load	Rollback Possibility	Performance	Use
Single Row Delete	WOS	Yes	Depends on projection design	Recommended in WOS so that deleted rows are combined in one delete vector when data is moved out.
Trickle Load	WOS	Yes	Depends on Projection Design	Recommended for small batches that happen at frequent intervals. Recommended in WOS so that deleted rows are combined in one delete vector when data is moved out.
Bulk Delete	Direct ROS	Yes	Depends on projection design	Recommended because it creates one delete vector for each ROS that has data to be marked as deleted.
Drop Partition	N/A	No	Fast with catalog removal and storage is removed in the background.	Recommended to clean historic data. It forces the moveout operation before executing to move data inserted in ROS that belongs to partition to be removed.
Truncate	N/A	No	Fast with catalog object changes and storage is removed in the background.	Removes all storage associated with a table while preserving the table definition. Recommended when you need to clean the table content.

Removing Deleted Data Permanently

To make the disk space reusable, Vertica allows you to permanently remove the deleted data from the physical storage of the database. The deleted data is permanently removed from the ROS container. You can control the deletion of data in the physical storage of your database by enabling the Tuple Mover to remove the data automatically or removing the data manually.

The Ancient History Mark (AHM) decides when the data is removed permanently. The AHM is an epoch that represents the time until which the history is retained. Any history older than the AHM is eligible for permanent removal.

To determine which data is eligible for automatic removal by the Tuple Mover, set a purge policy. Best practices for setting a purge policy are as follows:

Use the HistoryRetentionTime configuration parameter to specify the time in seconds for which deleted data is saved. This practice is the preferred method for determining which deleted data can be purged. To disable, set the HistoryRetentionTime configuration parameter to -1.

=> SELECT SET_CONFIG_PARAMETER('HistoryRetentionTime', '{ <seconds> | -1 }' );

Specify the number of historical epochs that are saved. To use this method, set the HistoryRetentionTime configuration parameter to -1 and set the number of historical epochs to be saved.

=> SELECT SET_CONFIG_PARAMETER('HistoryRetentionTime', ' -1 ' );
=> SELECT SET_CONFIG_PARAMETER('HistoryRetentionEpochs', '{<num_epochs>}');

Specify the percentage of deleted rows that need to be reached in order to permanently remove the data. In the following example, you set the percentage of deleted data to be purged as 20% of the total data.

=> SELECT SET_CONFIG_PARAMETER('PurgeMergeoutPercent', '{ Percent| 20 }' );

If a table is not partitioned, the Mergeout operation permanently removes the data of all the ROS containers that meet the PurgeMergeoutPercent of rows deleted in particular ROS.

If a table is partitioned, the Mergeout operation permanently removes the data of only the inactive partitions that meet the PurgeMergeoutPercent value.

Approaches for Managing Delete Vectors

Considering the overhead involved in permanently removing the deleted data, You need to manage the number of delete vectors. You can manage the created delete vectors by looking for alternatives to avoid using the DELETE function or by performing bulk deletes, as described in this section.

Alternatives for Removing Data

Alternative ways of removing data without using the DELETE function are as follows:

To remove all the data from a table, use the TRUNCATE TABLE statement. Truncating a table removes all the storage associated with that table and the table’s projections. With the TRUNCATE TABLE, you can preserve the projection definition. Truncating a table removes dependencies in the catalog and removes storage containers in the background.
To delete historical data, partition the data by the date field and use the DROP PARTITION function. DROP PARTITION removes dependencies in the catalog files and removes storage containers in the background without affecting Vertica transactions. Vertica executes a Moveout operation before executing DROP PARTITION. If the data to be deleted is for a partial partition, follow these steps:
1. Create staging tables with the same table definition as the original table.
2. Move the data to be preserved in the partition to the staging table.
3. After the data has been moved, swap the partition using the SWAP_PARTITIONS_BETWEEN_TABLES function. This function is a catalog operation with no data movement.

For example, suppose you want to reload last week’s data due to an error in a table that is partitioned by month. To delete the data from last week and to insert new data, follow these steps:

Create a staging table.

=> CREATE TABLE store.staging_store_orders_fact like store.store_orders_fact including projections;

Remove the data from 2005-11-20 to 2005-11-27 by moving that data to the staging table.

=> INSERT /*+ direct */ into store.staging_store_orders_fact select * from store.store_orders_fact where date_ordered between '2005-11-01' and '2005-11-19' or date_ordered between '2005-11-28' and '2005-11-30';

Swap the partition.

=> SELECT SWAP_PARTITIONS_BETWEEN_TABLES('store.staging_store_orders_fact',200511,200511,'store.store_orders_fact');

Truncate or drop the staging table.

Perform Bulk Delete

If you cannot avoid using the DELETE function, if possible, perform a bulk delete. Use bulk delete instead of multiple single deletes. You should commit in WOS deleting multiple single rows of data. Failing to commit creates multiple delete vectors for each statement.

Best practices for performing a bulk delete are as follows:

Load the delete predicate in a temporary table.
Delete rows in one statement by joining the temporary table with the delete predicates and the table with data to be deleted.

For example, suppose you want to delete records (Employee ID) for several employees. You can delete the records in a single delete statement, which creates one delete vector per statement. You can also create one bulk delete statement that creates one delete vector per ROS container that contains deleted data.

The following code snippet illustrates how to perform a bulk delete.

=> CREATE LOCAL TEMP TABLE data_to_delete (emp_id INT);
CREATE TABLE
=> COPY data_to_delete FROM '/tmp/employee_to_delete.txt' ;
Rows Loaded
-------------
          15
(1 row)
 
=> DELETE /*+ direct */ FROM store.store_orders_fact WHERE employee_key IN (SELECT * FROM data_to_delete );
OUTPUT
--------
   1740
(1 row)
=> DROP TABLE data_to_delete ;
DROP TABLE

Managing Delete Vectors

If you need to delete data, you can choose either to perform a bulk delete by creating one delete vector or perform multiple single deletes. Performing multiple single deletes results in multiple delete vectors.

The Tuple Mover mergeout operation removes the delete vectors when:

The delete occurs before the AHM.
The deleted rows per ROS containers are greater than PurgeMergeoutPercent (default is 20%) of the total ROS rows.

In partitioned tables, the mergeout operation purges the data of inactive partitions, but not active partitions.

If automatic mergeout has not permanently removed the data, you need to manually manage the delete vectors. This section explains the manual operation.

To check the delete vectors and the deleted rows, use the following query. This query filters the data to the ROS containers in the initiator nodes. By querying the local data, you avoid putting load on the system tables.

=> SELECT schema_name
          ,projection_name
          ,count(*) num_ros
          ,sum(total_row_count) num_rows
          ,sum(deleted_row_count) num_deld_rows
          ,sum(delete_vector_count) Num_dv
          ,(sum(deleted_row_count) / sum(total_row_count) * 100)::INT por_del_rows
FROM storage_containers
WHERE node_name = ( SELECT local_node_name())
GROUP BY 1, 2
HAVING sum(deleted_row_count) > 0
ORDER BY 5 DESC;
 
schema_name | projection_name      | num_ros | num_rows | num_deld_rows | Num_dv |?column?
------------+----------------------+---------+----------+---------------+--------+--------
store       | store_orders_fact_b1 | 60      | 200044   |  5636         | 62     |  3
store       | store_orders_fact_b0 | 60      | 200210   |  5618         | 62     |  3

If the table is partitioned, you can view the deleted rows per partition using the following query.

=> SELECT p.node_name
          ,p.Table_schema
          ,p.projection_name
          ,p.partition_key
          ,count(DISTINCT p.ros_id) num_ros -- Number of ROS containers in projection
          ,sum(p.ros_size_bytes) used_bytes -- Used bytes by the projecton
          ,sum(p.ros_row_count) num_rows -- Number of rows in projection
          ,sum(p.deleted_row_count) num_del -- Number of deleted rows in the table
          ,sum(delete_vector_count) cdv-- Number deleted vectors
          ,(sum(sc.deleted_row_count)/sum(p.ros_row_count)*100)::int por_del_rows -- Percentage of deleted rows per partition.
FROM partitions p inner join storage_containers sc ON ros_id = storage_oid
WHERE p.node_name = (SELECT local_node_name()) AND sc.node_name = (SELECT local_node_name())
GROUP BY 1,2,3,4
Having SM(delete_vector_count) > 0
ORDER BY 10 DESC,2,3,4;

node_name           |Table_schema|projection_name       | partition_key |num_ros  |used_bytes|num_rows|num_del|cdv|por_del_rows
--------------------+------------+----------------------+---------------+---------+----------+--------+-------+---+------------
v_utn_demo_node0001 | store      | store_orders_fact_b0 | 200511        | 1       | 244467   | 9978   | 1664  | 9 | 50
v_utn_demo_node0001 | store      | store_orders_fact_b1 | 200511        | 1       | 243471   | 9936   | 1650  | 9 | 50
v_utn_demo_node0001 | store      | store_orders_fact_b0 | 200311        | 1       | 82143    | 3354   | 84    | 1 | 3
v_utn_demo_node0001 | store      | store_orders_fact_b0 | 200407        | 1       | 82799    | 3380   | 90    | 1 | 3
v_utn_demo_node0001 | store      | store_orders_fact_b0 | 200601        | 1       | 79787    | 3258   | 82    | 1 | 3
v_utn_demo_node0001 | store      | store_orders_fact_b0 | 200703        | 1       | 83788    | 3422   | 88    | 1 | 3
v_utn_demo_node0001 | store      | store_orders_fact_b0 | 200712        | 1       | 81095    | 3316   | 92    | 1 | 3

Based on the results of the preceding query, you have three options:

Mergeout of the delete vectors: If you have too many delete vectors, but not too many deleted rows and no partitions, you should merge the delete vectors.
Purge the partition: Purge the partitioned data (also called a partial purge).
Purge the table: Manually purge the data by rewriting the entire table.

Mergeout Delete Vectors

If there are multiple delete vectors but the percentage of deleted rows is less compared to the total rows of the table, merge the delete vectors into one delete vector. Merging the delete vectors is better than purging the entire table. Purging a table rewrites the data set of ROS containers without the deleted rows. If the number of deleted rows is small in comparison with the total rows of the table, do not purge the table. To avoid a ROS pushback, reduce the number of delete vectors.

To avoid performance degradation and ROS pushback, mergeout consolidates multiple ROS containers and purges the deleted records. The Tuple Mover performs an automatic mergeout by combining two or more ROS containers into a single container without the deleted rows. However, if the number of delete vectors per ROS containers is less than PurgeMergeoutPercent (the default is 20%), a mergeout operation does not purge the deleted records. Using many DELETE statements to delete multiple rows creates many small containers to hold the deletion marks. Each container consumes resources and impacts performance. If there are mergeout cycles for merging, the Tuple Mover merges the deletion marker containers into a single large container.

To merge the deleted vectors into one delete vector, use the DO_TM_TASK as follows:

=> SELECT DO_TM_TASK('dvmergeout');

Purging Partitions

If there are more number of deleted rows in one partition than the other partition of the table, purge the specific partition with higher number of deleted rows. Purge a specific partition instead of purging the whole table. Purging the whole table results in rewriting the ROS containers with only a few deleted rows. Recall the SELECT schema_name query mentioned earlier in this document that returns the location of the deleted data in partitions. If cdv and por_del_rows are higher for a particular partition, purge just that partition. The steps to purge the partition are as follows:

Move the AHM to make the delete markers older than AHM. Moving the AHM makes the delete markers eligible for removal.

=> SELECT MAKE_AHM_NOW();

Purge the partition with the partition key.

=> SELECT PURGE_PARTITION('store.store_orders_fact',200511);

Purging Table Data

If the deleted data is distributed between all the partitions or the table is not partitioned but there are too many deleted rows, purging the entire table is the last option.

Move the AHM to make the delete markers older than the AHM. Moving the AHM makes the delete markers eligible for removal.

=> SELECT MAKE_AHM_NOW();

Purge the table or purge the projection.

=> SELECT PURGE('store.store_orders_fact');
OR
=> SELECT PURGE_PARTITION ('store_orders_fact_b0');

Optimizing Vertica for Delete Operation

You can optimize your database for deletion by performing the following tasks:

Create a better projection design by using a high-cardinality column at the end of the sort order or including the delete predicate columns in all the projections.
Delete the data in batches by inserting the data into temp tables.
Purge the data after 20% percent or more rows are deleted in a particular table.
Use partitioning to chunk data into groups so that you can delete the data together with DROP_PARTITION.
Truncate tables instead of using DELETE statement when you want to empty a table.

For more information, see Optimizing DELETEs and UPDATEs for Performance in the Vertica documentation.