Thursday, November 06, 2008

Comparing the Equality of two Queries

I often find the need to compare two different queries for equality in Microsoft SQL Server. By equality, I mean that the two queries return the same data. This does not imply any other types of equality such as execution plan, client statistics, etc.

To use the "template" below, just replace the two queries in the [Start/End Source Query] section and [Start/End Comparison Query] section, respectively.

SQL Unit Test

In my contrived example, I'm using the Orders table of the MS example Northwind database. Since the Northwind Orders table doesn't have any OrderIDs less than 10000, the example queries below are equivalent. In essence, this is a SQL unit test to prove that the added where clause "refactoring" didn't cause a change in the resulting data.

Technical Details

This technique uses the SQL Server functions BINARY_CHECKSUM and CHECKSUM_AGG. Using a checksum, you can summarize each row's data as a single number. Then, these row data numbers can be summarized (again via a checksum) into a final query-wide single number.

BINARY_CHECKSUM

Returns the binary checksum value computed over a row of a table or over a list of expressions. Ignores columns of noncomparable data types in its computation. Noncomparable data types include text, ntext, image, cursor, xml, and noncomparable common language runtime (CLR) user-defined types.

CHECKSUM_AGG

Returns the checksum of the values in a group. Aggregates the row values into a single comprehensive checksum

Notes

  • The order of the rows doesn't affect the comparison, however, the order of the columns does.
  • Columns of noncomparable data types are ignored. Noncomparable data types include text, ntext, image, cursor, xml, and noncomparable common language runtime (CLR) user-defined types.
  • CHECKSUM can be used in place of BINARY_CHECKSUM. BINARY_CHECKSUM tends to be more restrictive regarding string comparison issues such as locale or case-sensitivity. Ultimately, how less restrictive CHECKSUM is depends on the database server collation settings, etc.
  • There is a rare chance for hash collision where two rows differ in such a coincidental way that both checksums are the same. If this genuinely occurs, drop what you are doing and go buy a lottery ticket. Now. (Honestly, I don't know if it is as high as lottery-level odds.)

Show Me the Code!

DECLARE @Query1Checksum bigint
DECLARE @Query2Checksum bigint
 
-- Get checksum from source query
Select @Query1Checksum = CHECKSUM_AGG(BINARY_CHECKSUM(*))
FROM
(
    -- [Start Source Query]
    SELECT *
    FROM Orders
    -- [End Source Query]
)
AS Source
 
-- Get checksum from comparison query
Select @Query2Checksum = CHECKSUM_AGG(BINARY_CHECKSUM(*))
FROM
(
    -- [Start Comparison Query]
    SELECT *
    FROM Orders
    WHERE OrderID >= 10000
    -- [End Comparison Query]
)
As Comparison
 
-- Compare CheckSums to determine equality
IF @Query1Checksum = @Query2Checksum
    PRINT 'Queries are Equal'
ELSE
    PRINT 'Queries are NOT Equal'
If you found this post helpful, please "Kick" it so others can find it too:
kick it on DotNetKicks.com