Thursday, February 2, 2017

NULLs and the T-SQL EXCEPT Clause

NULL is the absence of a value, which makes it an unknown value. But it is not unknown in the way that “x” is unknown in algebra: “x” stands for an actual value, and so x = x. NULL is not merely unknown; it is the absence of a value altogether. Consequently, NULL does not equal NULL (the comparison evaluates to UNKNOWN), and any calculation involving NULL yields NULL.

So…
2 + 2 + x is equal to 4 + x

But…
2 + 2 + NULL is not equal to 4 + NULL

And T-SQL expressions align to this.
2 + 2 = 4
2 + 2 != 5
And any comparison involving a NULL evaluates to UNKNOWN, which is treated as not true.
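This can be seen directly in a CASE expression; the NULL comparison evaluates to UNKNOWN, so the ELSE branch is taken.

```sql
SELECT
     CASE WHEN 2 + 2 = 4   THEN 'true' ELSE 'not true' END AS FourEqualsFour
    ,CASE WHEN NULL = NULL THEN 'true' ELSE 'not true' END AS NullEqualsNull;
-- FourEqualsFour: true; NullEqualsNull: not true (the comparison is UNKNOWN)
```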


All is well in the world so far.


Enter EXCEPT, which outputs the distinct rows from a first dataset that are unmatched by any row in a second dataset. A simple example is below. 2 is in the top dataset but not in the bottom dataset, and so it is returned as the result of the EXCEPT clause. Even though the top dataset has two 2s, only one 2 is returned because the EXCEPT clause inherently produces a distinct result.
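The original screenshot is not available, so here is a minimal version of the example using table value constructors:

```sql
SELECT n FROM (VALUES (1), (2), (2), (3)) AS TopSet(n)
EXCEPT
SELECT n FROM (VALUES (1), (3)) AS BottomSet(n);
-- returns a single row: 2
```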


Now we will replace the 2s in the top dataset with NULLs. Notice that although T-SQL comparisons treat NULLs as not equal to each other, EXCEPT returns a single NULL; both NULLs are taken to be "equal".
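Reconstructed with the same constructor syntax as above:

```sql
SELECT n FROM (VALUES (1), (NULL), (NULL), (3)) AS TopSet(n)
EXCEPT
SELECT n FROM (VALUES (1), (3)) AS BottomSet(n);
-- returns a single row: NULL
```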



This conforms to the general functionality of DISTINCT.
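DISTINCT shows the same collapsing of NULLs into one row:

```sql
SELECT DISTINCT n
FROM (VALUES (1), (NULL), (NULL), (3)) AS TopSet(n);
-- the two NULLs collapse to a single NULL row, just as with EXCEPT
```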



Finally, moving one of the NULLs to the bottom set, the result of the EXCEPT is empty: NULL in the top set (which could be the result of 2 + 2 + NULL) is treated as matching NULL in the bottom set (which could be the result of 5 + NULL).
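In the same reconstructed form:

```sql
SELECT n FROM (VALUES (1), (NULL), (3)) AS TopSet(n)
EXCEPT
SELECT n FROM (VALUES (1), (NULL), (3)) AS BottomSet(n);
-- empty result: the NULL in the top set matches the NULL in the bottom set
```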


The MSDN article goes so far as to describe the two NULL values as “equal” although the SQL standard uses the language “IS NOT DISTINCT FROM”.


So although T-SQL comparisons treat NULLs as not equal to each other, the T-SQL EXCEPT clause treats NULLs as equal. In the case below, each SELECT separately conforms to the logic that anything plus a missing value yields a missing value, but EXCEPT then treats those resulting missing values as matching. The result of the EXCEPT clause is empty.
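A minimal version of the case described:

```sql
SELECT 2 + 2 + NULL
EXCEPT
SELECT 5 + NULL;
-- empty result: both SELECTs evaluate to NULL, and EXCEPT treats those NULLs as matching
```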



Might 2 + 2 actually equal 5?

Happy trails....


Monday, August 1, 2016

SELECT Clause Aliases Affect ORDER BY clause

If a SELECT clause transforms a column and aliases the result with the same name as the original column in the base table, the ORDER BY clause is affected and cannot make use of the sort order of the base table. The SELECT clause alias changes what the ORDER BY clause references.

For an example, I am using a simple two-column table with an auto-incrementing clustered PK and a string column.
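The screenshots from the original post are not available, so here is an assumed setup matching the description (the table name and string column name are hypothetical; the PK column is 'n' per the discussion that follows):

```sql
-- Hypothetical table: auto-incrementing clustered PK (n) and a string column (s)
CREATE TABLE dbo.TestTable
(
     n INT IDENTITY(1,1) NOT NULL CONSTRAINT PK_TestTable PRIMARY KEY CLUSTERED
    ,s VARCHAR(50) NOT NULL
);
```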

  



CASTing the PK as a VARCHAR and aliasing it with a different name (n1), an ORDER BY on the base table PK column performs an Index Seek and leverages the base table sort order.
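A sketch of this query, using the assumed table above:

```sql
SELECT
     CAST(n AS VARCHAR(10)) AS n1
    ,s
FROM
     dbo.TestTable
ORDER BY
     n;  -- n resolves to the base table INT column; the clustered index order is reused
```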



But CASTing the PK as a VARCHAR and aliasing it with the base table column name, the same ORDER BY performs an Index Seek but must sort the data, because the data is now ordered as a VARCHAR (so '10' sorts before '2').
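The same sketch with the alias shadowing the base column name:

```sql
SELECT
     CAST(n AS VARCHAR(10)) AS n
    ,s
FROM
     dbo.TestTable
ORDER BY
     n;  -- n now resolves to the VARCHAR alias, so a Sort operator is required
```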

 


The column 'n' has its meaning changed between execution of the FROM and WHERE clauses and execution of the ORDER BY clause: it is the transformed data that gets ordered.



To return transformed data named the same as the base table column, yet ordered as it is in the base table PK (which was a weird requirement I recently had), you need to fully qualify the column reference in the ORDER BY clause.
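The fully qualified version, again using the assumed table:

```sql
SELECT
     CAST(n AS VARCHAR(10)) AS n
    ,s
FROM
     dbo.TestTable
ORDER BY
     dbo.TestTable.n;  -- fully qualified: sorts on the base INT column, no Sort operator
```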







Saturday, December 27, 2014

Database Tables Without a Primary Key or Cluster Key

Here is a simple script to report all tables in a database with indicators for those without PKs or CKs.


WITH DataCTE
AS
(
SELECT
        SCHEMA_NAME(schema_id)           AS SchemaName
       ,name                                    AS TableName
       ,'PK'                                    AS KeyType
       ,CASE
              WHEN OBJECTPROPERTY(OBJECT_ID,'TableHasPrimaryKey') = 0 THEN 'X'
              ELSE ''
       END AS MissingIndicator
FROM
       sys.tables
UNION ALL
SELECT
        SCHEMA_NAME(schema_id)           AS SchemaName
       ,name                                    AS TableName
       ,'CK'                                    AS KeyType
       ,CASE
              WHEN OBJECTPROPERTY(OBJECT_ID,'TableHasClustIndex') = 0 THEN 'X'
              ELSE ''
       END AS MissingIndicator
FROM
       sys.tables
)

SELECT
       SchemaName
       ,TableName
       ,ISNULL([CK],'') AS MissingCK
       ,ISNULL([PK],'') AS MissingPK
FROM
       DataCTE
PIVOT
       (
       MIN(MissingIndicator)
       FOR KeyType IN
              (
              [CK],[PK]
              )
       ) pvtData
ORDER BY
       SchemaName
       ,TableName


Here is a sample execution of it against the AdventureWorks 2014 database. The ProductProductPhoto table has no CK in the database as provided on CodePlex. I removed the PK on ProductSubcategory (which is also the CK) to test that bit of the script.





Wednesday, October 16, 2013

Returning a Default Row When No Data Would Be Returned


Recently, I had to tune a stored procedure that had an odd behavior: if no data was found matching the parameters passed in, it still returned a row with empty string values. The sproc did this by first checking for the existence of data for the parameters passed in; if data was there, it issued the SELECT statement, otherwise a SELECT of empty string values.

IF NOT EXISTS (SELECT * FROM dbo.TestData WHERE Col1 = @Col1 AND Col2 = @Col2)
       SELECT
               '' AS Col1
              ,'' AS Col2
              ,'' AS Col3
              ,'' AS Col4
ELSE
       SELECT
               Col1
              ,Col2
              ,Col3
              ,Col4
       FROM
              dbo.TestData
       WHERE
              Col1 = @Col1
       AND Col2 = @Col2

Naturally, this sort of logic should not be in the database tier. But the contract with the application was established, and the application call to the database was not sufficiently encapsulated – and so I had to work with this and change only the implementation of the sproc while preserving this contract.

So the obvious problem with the solution above is that when there is a match, there are two requests for the data, as the execution plan below shows.



So the challenge was to preserve the default row contract with the application while eliminating the extra requests to check for the existence of the data. I will present three solutions, the third being the best I know of. But naturally, let me know if you have a better solution.

 (1)  UNION ALL the default row.

SELECT
        Col1
       ,Col2
       ,Col3
       ,Col4
FROM
       dbo.TestData
WHERE
       Col1 = @Col1
AND Col2 = @Col2
UNION ALL
SELECT
        '' AS Col1
       ,'' AS Col2
       ,'' AS Col3
       ,'' AS Col4

The problem here is that the default row is returned even when the data exists.


So for this solution, additional logic is needed to return either the records matching the parameters or the default row, but not both – something like this.

 WITH DataCTE
AS
(
SELECT
        Col1
       ,Col2
       ,Col3
       ,Col4
       ,0 AS DefaultRowInd
FROM
       dbo.TestData
WHERE
       Col1 = @Col1
AND Col2 = @Col2
UNION ALL
SELECT
        '' AS Col1
       ,'' AS Col2
       ,'' AS Col3
       ,'' AS Col4
       ,1 AS DefaultRowInd
)

SELECT
        Col1
       ,Col2
       ,Col3
       ,Col4
FROM
       DataCTE
WHERE
       DefaultRowInd = (SELECT MIN(DefaultRowInd) FROM DataCTE)

But the CTE does not shield us from two requests for the data. The WHERE clause at the end referencing DataCTE ultimately is a separate request for the data against the base table, and so, in this regard, it is no different from the original sproc implementation.



(2)  Table variable solution. This solution reads the data once into a table variable and then checks there for the existence of the data.

DECLARE @Data TABLE (Col1 VARCHAR(50),Col2 VARCHAR(50),Col3 VARCHAR(50),Col4 VARCHAR(50))
INSERT
INTO
       @Data
(
        Col1
       ,Col2
       ,Col3
       ,Col4
)
SELECT
        Col1
       ,Col2
       ,Col3
       ,Col4
FROM
       dbo.TestData
WHERE
       Col1 = @Col1
AND Col2 = @Col2

IF (SELECT COUNT(*) FROM @Data) = 0
       SELECT
               '' AS Col1
              ,'' AS Col2
              ,'' AS Col3
              ,'' AS Col4
ELSE
       SELECT
               Col1
              ,Col2
              ,Col3
              ,Col4
       FROM
              @Data

Although in a heavy-handed way this avoids a separate existence-check hit against the base table, the sproc still has to do the same sort of processing against the table variable, and it is actually more expensive. The execution plan comparison is roughly 25% (original) vs. 75% (table variable solution).

(3)  SELECT NULL and LEFT OUTER JOIN. This is the best solution I was able to work out.

SELECT
        ISNULL(Col1,'') AS Col1
       ,ISNULL(Col2,'') AS Col2
       ,ISNULL(Col3,'') AS Col3
       ,ISNULL(Col4,'') AS Col4
FROM
       (SELECT NULL AS DefaultVal) DefaultRow
       LEFT OUTER JOIN dbo.TestData
              ON  Col1 = @Col1
              AND Col2 = @Col2

This solution guarantees a record is returned, thanks to the SELECT NULL, and displays data from the base table only if it exists; otherwise the ISNULL functions return empty strings. NOTE: In order for this to work, it is important to move the filter criteria from the WHERE clause to the LEFT OUTER JOIN criteria, so it is applied only to the base table and not to the entire query. If the criteria are left in a WHERE clause, the query will not return the default row.

Below are the result sets for both conditions confirming the required results.


Comparing the execution plans, we see the tuned version has an execution plan cost equal to half that of the original sproc, which makes sense because we have eliminated one database hit without introducing costly processing.



Hope this is of some benefit.

Later

 -Regan

Wednesday, February 20, 2013

Types of User-Defined Functions And Performance Considerations

Microsoft SQL Server supports user-defined functions (UDFs), which provide for code encapsulation and reuse but come with design considerations and very significant performance considerations. Application developers new to T-SQL like UDFs because they align to a procedural programming design pattern familiar to that skill set. However, unless understood and used properly, UDFs will greatly reduce database performance.

 In this post I will discuss the following:

1.    Types of UDFs and examples
2.    Performance comparisons and explanations

1. Types of UDFs and examples

UDFs in SQL Server can return Tables or Scalar values. Additionally, UDFs can be Multi-Statement or Single-Statement (labeled In-Line). So logically there are four types, although Scalar In-Line UDFs currently are not supported.

                     Multi-Statement    In-Line
Scalar-Valued        Supported          Not Supported
Table-Valued         Supported          Supported

In the SSMS GUI you can see the UDFs in the Programmability -> Functions section.









For our examples, we will be using a simple City, State database. We will be reviewing In-Line vs. Multi-Statement versions of a function to return the first city (alphabetically) for a given state and letter.
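The database itself is not included in the post, so here is an assumed minimal schema consistent with the UDF code that follows (column names are taken from the UDFs; the table and index names match those referenced later in the performance test):

```sql
-- Hypothetical schema inferred from the UDF code and the index named in the test
CREATE TABLE dbo.GeoData
(
     IDGeoData INT IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED
    ,CDState   VARCHAR(2)   NOT NULL
    ,TXCity    VARCHAR(100) NOT NULL
);

CREATE NONCLUSTERED INDEX IX_StateAndCity ON dbo.GeoData (CDState, TXCity);
```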

Although Scalar functions typically are not explicitly qualified as Multi-Statement, I have named the function with “Multi” in this post to make clear that in this regard Scalar functions align to Multi-Statement Table-Valued functions. As will be explained below, the difference between Multi-Statement and Single-Statement gets to the essential difference among the various types of UDFs and the corresponding performance considerations. Below are code samples for the three UDFs.


CREATE FUNCTION [dbo].[GetFirstCity_Scalar_Multi](
        @CDState            VARCHAR(2)
       ,@CityFirstLetter    VARCHAR(1)
)

RETURNS VARCHAR(100)
AS

BEGIN
       DECLARE @TXCity VARCHAR(100)

            --other T-SQL statements allowed

       SELECT TOP 1 
              @TXCity = TXCity
       FROM  
              dbo.GeoData
       WHERE
              CDState = @CDState
       AND    TXCity LIKE @CityFirstLetter + '%'
       ORDER BY
              TXCity

            --other T-SQL statements allowed

       RETURN @TXCity
END

CREATE FUNCTION [dbo].[GetFirstCity_Table_Multi](
        @CDState            VARCHAR(2)
       ,@CityFirstLetter    VARCHAR(1)
)

RETURNS @City TABLE
    (
        TXCity       VARCHAR(100)
    )
AS

BEGIN
   --other T-SQL statements allowed

    INSERT
       INTO
              @City
                     SELECT TOP 1 
                            TXCity
                     FROM  
                           dbo.GeoData
                     WHERE
                           CDState = @CDState
                     AND    TXCity LIKE @CityFirstLetter + '%'
                     ORDER BY
                            TXCity

   --other T-SQL statements allowed
    RETURN
END

CREATE FUNCTION  [dbo].[GetFirstCity_Table_InLine](
        @CDState            VARCHAR(2)
       ,@CityFirstLetter    VARCHAR(1)
) RETURNS TABLE

AS
RETURN
      --only one SELECT allowed
       SELECT TOP 1 
               TXCity
       FROM  
              dbo.GeoData
       WHERE
              CDState = @CDState
       AND    TXCity LIKE @CityFirstLetter + '%'
       ORDER BY
               TXCity

Note that the essential work of each UDF – the data access SQL – is identical. What distinguishes each is “only” the interface and implementation wrapping it. But we will see next that this has significant performance implications.

2. Performance comparisons and explanations

Something I always do when comparing various syntax and techniques in T-SQL is to run all options in a single request and compare the Execution Plans and relative costs. Often this is a good approach, but not always. Taking the relative costs at face value, the Scalar UDF would be by far the best and the two Table-Valued versions equal.

However, taking a closer look, the first thing to notice is that only the Table_InLine version shows any actual data access. The Scalar and Table_Multi versions show only scans of data somehow already available. This gets to the essence of the matter. The Scalar and Table_Multi UDFs are black boxes to the caller and to the query optimizer; the data access inside these two UDFs is not visible to the caller. They behave much like a VB or C# function, where the caller simply passes parameters and gets a result, knowing nothing of how the result was produced. Likewise, the optimizer does not have visibility into these functions, so their cost is not reported in the Execution Plans.

Looking at the I/O stats, we see something similar: only the Table_InLine version shows the real data access.

Table_Multi
Table '#ABDEF001'. Scan count 1, logical reads 1

Table_InLine
Table 'GEOData'. Scan count 1, logical reads 2

Scalar
(1 row(s) affected)


Additionally, running each UDF a single time shows no real difference in wall-clock execution time. So I created a performance test which ran each UDF for a distinct list of states, for each of the 26 letters of the alphabet, and iterated this 200 times. Note that I turned on the Discard Results option to better compare just the processing time. The results are below.

Table_Multi     0:25
Scalar              0:07
Table_InLine   0:04
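The exact test harness is not shown in the post; a sketch of the timing loop for one of the UDFs might look like this (everything apart from the UDF and table names is an assumption):

```sql
-- Hedged sketch of the timing loop for the In-Line UDF
DECLARE @Iteration INT = 1;

WHILE @Iteration <= 200
BEGIN
    SELECT
         States.CDState
        ,FirstCity.TXCity
    FROM
         (SELECT DISTINCT CDState FROM dbo.GeoData) States
         CROSS JOIN (VALUES ('A'),('B'),('C') /* ... through 'Z' */) Letters(Letter)
         CROSS APPLY dbo.GetFirstCity_Table_InLine(States.CDState, Letters.Letter) FirstCity;

    SET @Iteration += 1;
END
```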

This demonstrates and quantifies that the black-box nature of the Scalar and Multi-Statement Table-Valued UDFs, which keeps their data access out of execution plans, also causes performance issues. To state the obvious, if the optimizer cannot see the SQL, the SQL cannot be optimized as part of the overall request.

Additionally, use of Multi-Statement UDFs (Table-Valued or Scalar) compounds other inefficiencies. For example, if we remove the IX_StateAndCity index (which results in table scans) and repeat the test above, the results are as follows.

Table_Multi     31:49 (7,546% Increase)
Scalar              31:09 (26,600% Increase)
Table_InLine   1:04 (1,500% Increase)

In terms of percent increase, the In-Line UDF incurred the least impact, and in terms of absolute execution time, it runs in a 30th of the time of the other UDFs. Multi-Statement UDFs magnify other inefficiencies.

What this all underscores is that coding T-SQL properly is a different skill set from procedural application design patterns. Except in rare and atypical scenarios, set-based T-SQL will outperform an iterative solution. Multi-Statement UDFs cannot be substituted into the query calling them and so force the optimizer to iterate, calling the UDF for each row. In the T-SQL world, this is referred to as RBAR (Row By Agonizing Row), a term coined by Jeff Moden.

By contrast, In-Line UDFs use a form of substitution whereby the UDF code is visible to the caller and considered as part of the Execution Plan, hence the term In-Line. Below are two queries: (1) calls the In-Line UDF and (2) explicitly duplicates the UDF code in the query. The highlighted sections show the different implementations of the CROSS APPLY.

Query #1
SELECT
        CDState
       ,FirstCityForLetterApply.TXCity AS FirstCityForLetter
FROM
       StatesCTE
       CROSS JOIN Letters
       CROSS APPLY [dbo].[GetFirstCity_Table_InLine](CDState,Letter) FirstCityForLetterApply

Query #2
SELECT
        CDState
       ,FirstCityForLetterApply.TXCity AS FirstCityForLetter
FROM
       StatesCTE
       CROSS JOIN Letters
       CROSS APPLY
              (
              SELECT TOP 1 
                      TXCity
              FROM  
                     dbo.GeoData GeoDataApply
              WHERE
                     GeoDataApply.CDState = StatesCTE.CDState
              AND    GeoDataApply.TXCity LIKE Letters.Letter  + '%'
              ORDER BY
                      TXCity
              ) FirstCityForLetterApply

And we can see that the Execution Plans are identical. So with an In-Line UDF we are able to benefit from code encapsulation and reuse while retaining the performance of a fully optimized query with all parts visible to the optimizer.









Finally, Multi-Statement UDFs impact the ability to parallelize query plans. For more information on this, I refer you to this December 2011 post by Paul White: http://sqlblog.com/blogs/paul_white/archive/2011/12/23/forcing-a-parallel-query-execution-plan.aspx

Later

-Regan