How to check data regression with Delta Lake and SQL?

You just changed the value of one column in your processing logic and want to check whether there wasn't other impact. You can do that by launching the query like this:

SELECT x.id, COUNT(*) FROM (
  SELECT STRUCT(col1, col2, col3, col4) AS id  FROM waitingforcode.dev.data VERSION AS OF 1
  UNION ALL
  SELECT STRUCT(col1, col2, col3, col4) AS id FROM waitingforcode.dev.data
) AS x
GROUP BY x.id
HAVING COUNT(*) < 2;

The query returns all rows without the match between the previous table version (1 in our example) and the new table version that was created after the changes in our processing logic. The query comments the impacted column which is col1. If the result is empty, you can now issue a sligthly modified query:

SELECT x.id, COUNT(*) FROM (
  SELECT STRUCT(col1, col2, col3, col4) AS id  FROM waitingforcode.dev.data VERSION AS OF 1
  UNION ALL
  SELECT STRUCT(col1, col2, col3, col4) AS id FROM waitingforcode.dev.data
) AS x
GROUP BY x.id
HAVING COUNT(*) < 2;

This time you should be expecting the results to contain some rows, depending on the expected impact on the col1.

Please notice that the query can also return groups greater than 2. It might happen if the columns combination won't be unique in the table.