Handling NULL values in join condition
Avoid unexpected filtering out of data in the output of your join condition...
This post is part of the blog series created primarily for database engineers, with the hope that it might help you get a head start on your development tasks. With that being said, let us dive in...
When using Azure SQL database, if you are joining on certain columns and those columns may have NULL values, you may get unexpected filtering out of data in the output if NULL values are not handled properly. Let us look at an example.
Consider that you have a table TBL_AGGREGATE that retains the aggregated sales totals per key combination of {day, location, item, delivery_mode (sold at a location or delivered to a customer), delivery_vendor (e.g. Instacart)}. This is the table that retains the output of the sales aggregation process. Consider that for the data related to items sold at the store location, the value of the delivery_vendor field is NULL.
Consider another table called LATE_TRANSACTIONS that receives late transactions from prior days. These late transactions were not delivered timely due to any technical reason (network failure, etc.) and hence were not aggregated during the daily aggregation process. The requirement is to add the sales from these late transactions to the already aggregated data for those prior dates. Hence we need to "refresh" the prior day aggregated data using these new late transactions that did not arrive when we aggregated that data. Here is a way to effectively handle possible NULL values for the delivery_vendor data.
with cte_aggregate as (
select
locationid,
transaction_date,
itemid,
mode,
delivery_vendor,
sum(sold_amount) sold_amount,
from late_transactions
group by
locationid,
transaction_date,
itemid,
mode,
delivery_vendor
)
MERGE tbl_aggregate AS t
USING (
SELECT * FROM cte_aggregate
) AS s
ON t.transaction_date = s.transaction_date
and t.locationid = s.locationid
and t.itemid = s.itemid
and t.mode = s.mode
AND (t.delivery_vendor = s.delivery_vendor OR
(t.delivery_vendor IS NULL AND s.delivery_vendor IS NULL))
WHEN MATCHED THEN
UPDATE SET
t.sold_amount = t.sold_amount + s.sold_amount,
t.update_date_time = ''', @update_date_time, '''
WHEN NOT MATCHED THEN
INSERT (
transaction_date,
create_date_time,
update_date_time,
locationid,
itemid,
sold_amount,
mode,
delivery_vendor
)
VALUES (
s.transaction_date,
''', @create_date_time, ''',
''', @update_date_time, ''',
s.locationid,
s.itemid,
s.sold_amount,
s.mode,
s.delivery_vendor
);
Specifically, focus on the join condition below, which basically states that either the value on both sides is NULL or the value on both sides is equal.
t.delivery_vendor = s.delivery_vendor OR
(t.delivery_vendor IS NULL AND s.delivery_vendor IS NULL)
There are various approaches to handling null values, but this approach seems one of the best ones. There are several articles already published on this rather simple topic, so let me provide some good and helpful references:
This article goes into much detail with simple examples to illustrate the challenges and solutions.
This blog does a good analysis of "in" clause vs "inner join".
In the series of posts titled "Azure SQL Database query ideas", I will cover some of the topics that I usually encounter as part of my projects.