Photo by Matthew Henry on Unsplash
Azure SQL Database query ideas - handling duplicates
Pick the right record(s)!
This post is part of the blog series created primarily for database engineers, with the hope that it might help you get a head start on your development tasks. With that being said, let us dive in...
Remove numerous copies of almost duplicate records from a table
The purpose of this blog post is to demonstrate a manual approach to cleanup data for an exceptional situation. Let us say we have a table with certain business columns (e.g. id, date, item, order, amount) and non-business columns (e.g. surrogate key, create_date_time, etc.) and for some reason, the process that persists data in this table has malfunctioned and stored numerous duplicate copies of business records with different surrogate key and create_date_time. We are asked to remove duplicates and keep only a single copy of such records.
Identify and persist the unique records
One way to clean your table is by using a temporary table to store the unique records, remove all duplicate copies from the original table and restore the unique records from the temporary table in the original table.
-- create a temporary table
select top 1 * into temp_table_with_unique_records from original_table_with_duplicate_records where 1=2;
-- insert unique records in temporary table
insert into temp_table_with_unique_records
select
min([surrogate_key_id]),
[id] ,
[date] ,
[order_number] ,
min([create_date_time]) ,
[item] ,
[amount]
from original_table_with_duplicate_records
group by
[id] ,
[date] ,
[order_number] ,
[item] ,
[amount]
In this example, we have used the "group by" clause to group all the duplicates and the min function to select the minimum value out of the non-duplicate values, but practically the logic to be used might be different and more complex. Consider this other variation, when we could insert unique records using a distinct clause, with dummy surrogate_key_id, and a fixed value for create_date_time.
insert into temp_table_with_unique_records
select distinct
'1111-aaaa-1111' as [surrogate_key_id], -- dummy value
[id] ,
[date] ,
[order_number] ,
'2021-01-11 00:00:00.000' as [create_date_time] ,
[item] ,
[amount]
from original_table_with_duplicate_records
Next, we clear the numerous copies of almost duplicate records from the original table. The approach would depend on the practical situation. You could truncate the table, or if the table is partitioned and you need to remove entire partitions, you could truncate partitions. Truncating the table or partitions is faster than the "delete" statement.
Clean up duplicate records
truncate table original_table_with_duplicate_records with (
PARTITIONS(1,2,3)
);
If truncate is not an option, then we would need to issue a delete statement with appropriate filters.
Restore unique records
Next, we restore the unique records in the original table.
insert into original_table_with_duplicate_records
select * from temp_table_with_unique_records;
It helps to note the record counts at each step for verification.
select count (distinct id) from temp_table_with_unique_records;
Note the use of count and distinct in the same query. In the Azure SQL database, this works only for getting a count of distinct values from one column. If you need to count distinct combinations of many columns, here is a different workaround.
-- select count of distinct combinations
select count(*) from
(
select distinct id, date, order_number from temp_table_with_unique_records
) a
Finally, you can remove the temporary table.
drop table if exists temp_table_with_unique_records;
The "if exists" clause helps in avoiding errors (while trying to drop a table that does not exist) in the case of running a series of DML statements and retrying during the development and testing phase.
In the series of posts titled "Azure SQL Database query ideas", I will cover some of the topics that I usually encounter as part of my projects.