In this post I will discuss the following:
1.
Types
of UDFs and examples
2.
Performance
comparisons and explanations
1. Types
of UDFs and examples
UDFs in
SQL Server can return Tables or Scalar values. Additionally UDFs can be Multi-Statement
or Single-Statement (labeled In-Line). So logically there are four types –
although Scalar In-Line UDFs currently are not supported.
Multi-Statement In-Line
Scalar-Valued Supported Not Supported
Table-Valued Supported Supported
In the
SSMS GUI you can see the UDFs in the Programmability -> Functions section.
Although Scalar functions typically are not explicitly qualified
as Multi-Statement, I have named the function with “Multi” in this post to be
clear that in this regard Scalar functions align to Multi-Statement Table-Valued
functions. As will be explained below, the difference between Multi-Statement
and Single-Statement gets to the essential difference among the various types
of UDFs and corresponding performance considerations. Below are code samples
for the three UDFs.
CREATE FUNCTION [dbo].[GetFirstCity_Scalar_Multi](
@CDState VARCHAR(2)
,@CityFirstLetter VARCHAR(1)
)
RETURNS VARCHAR(100)
AS
BEGIN
DECLARE @TXCity VARCHAR(100)
--other T-SQL statements allowed
SELECT TOP 1
@TXCity = TXCity
FROM
dbo.GeoData
WHERE
CDState = @CDState
AND TXCity LIKE @CityFirstLetter + '%'
ORDER BY
TXCity
--other T-SQL statements allowed
RETURN @TXCity
END
CREATE FUNCTION [dbo].[GetFirstCity_Table_Multi](
@CDState VARCHAR(2)
,@CityFirstLetter VARCHAR(1)
)
RETURNS @City TABLE
(
TXCity VARCHAR(100)
)
AS
BEGIN
--other T-SQL statements allowed
--other T-SQL statements allowed
INSERT
INTO
@City
SELECT TOP 1
TXCity
FROM
dbo.GeoData
WHERE
CDState = @CDState
AND TXCity LIKE @CityFirstLetter + '%'
ORDER BY
TXCity
--other T-SQL statements allowed
--other T-SQL statements allowed
RETURN
END
CREATE FUNCTION [dbo].[GetFirstCity_Table_InLine](
@CDState VARCHAR(2)
,@CityFirstLetter VARCHAR(1)
) RETURNS TABLE
AS
RETURN
--only one SELECT allowed
--only one SELECT allowed
SELECT TOP 1
TXCity
FROM
dbo.GeoData
WHERE
CDState = @CDState
AND TXCity LIKE @CityFirstLetter + '%'
ORDER BY
TXCity
Note that the
essential work of each UDF – the data access SQL – is identical. What
distinguishes each is “only” the interface and implementation wrapping it. But
we will see next that this has significant performance implications.
2.
Performance comparisons and explanations
Something
I always do when comparing various syntax and techniques in T-SQL is to run all
options in a single request and compare the Execution Plans and relative costs.
Often times this is a good approach but not always. Taking this at face value,
the Scalar UDF would be by far the best and the two Table-Valued versions
equal.
However
taking a closer look, the first thing to notice is that only the Table_InLine
version shows any actual data access. The Scalar and Table_Multi show only
scans of data - somehow already available. This gets to the essence of the
matter. The Scalar and Table_Multi UDFs are black boxes to the caller and query
optimizer; the data access in these two UDFs is not transparent to the caller.
These behave much like a VB or C# function where the caller simply passes
parameters and gets a result – knowing nothing of the implementation of how the
result was produced. Likewise, the optimizer does not have visibility into
these functions so the cost of these functions is not reported in the Execution
Plans.
Looking to
the io stats, we see something similar. Only the Table_InLine version shows the
real data access.
Table_Multi
Table '#ABDEF001'. Scan count 1,
logical reads 1,
Table_InLine
Table 'GEOData'. Scan count 1, logical
reads 2
Scalar
(1 row(s) affected)
Additionally
running each UDF a single time shows no real difference in wall clock execution
time. So I created a performance test which ran each UDF for a distinct list of
states, for each for the 26 letters of the alphabet - and iterated this 200 times. Note
that I turned on the Discard Results option to better compare just processing
time. The results are below.
Table_Multi 0:25
Scalar 0:07
Table_InLine 0:04
This
demonstrates and quantifies that the black box nature of the Scalar and Table-Valued
Multi UDFs – causing the data access not to show in execution plans – also
causes performance issues. To state the obvious, if the optimizer cannot see
the SQL, it cannot be optimized for the overall request.
Additionally,
use of Multi-Statement UDFs (Table-Valued or Scalar) compounds other
inefficiencies. For example, if we remove the IX_StateAndCity index (which will
result in table scans), and repeat the test above, the results are as follows.
Table_Multi 31:49 (7,546% Increase)
Scalar 31:09 (26,600% Increase)
Table_InLine 1:04 (1,500% Increase)
In terms
of percent increase, the In-Line UDF incurred the least impact and in terms of
absolute execution time, it runs in a 30th of the time of the other
UDFs. Multi-Statement UDFs
magnify other inefficiencies.
What this
all underscores is that coding T-SQL properly is a different skill set from
procedural application design patterns. Except in rare and atypical scenarios,
set-based T-SQL will outperform an iterative solution. Multi-Statement UDFs cannot be substituted into the query calling them and so force
the optimizer to use iteration – calling the UDF for each row. In T-SQL world,
this is referred to as RBAR (Row By Agonizing Row) which was coined by Jeff Moden.
By
contract InLine UDFs use a form of substitution whereby the UDF code is transparent
to the caller and considered as part of the Execution Plan - hence the term In-Line. Below are two
queries which (1) call the InLine UDF and (2) explicitly dups the UDF code in
the query. The highlighted sections show the different implementations of the
CROSS APPLY.
Query #1
SELECT
CDState
,FirstCityForLetterApply.TXCity AS FirstCityForLetter
FROM
StatesCTE
CROSS JOIN Letters
CROSS APPLY
[dbo].[GetFirstCity_Table_InLine](CDState,Letter) FirstCityForLetterApply
Query #2
SELECT
CDState
,FirstCityForLetterApply.TXCity AS FirstCityForLetter
FROM
StatesCTE
CROSS JOIN Letters
CROSS APPLY
(
SELECT TOP
1
TXCity
FROM
dbo.GeoData GeoDataApply
WHERE
GeoDataApply.CDState = StatesCTE.CDState
AND GeoDataApply.TXCity LIKE Letters.Letter + '%'
ORDER
BY
TXCity
) FirstCityForLetterApply
And we can
see that the Execution Plans are identical. So with an In-Line UDF we are able to benefit from code encapsulation
and reuse while retaining the performance of a fully optimized query with all
parts visible to the optimizer.
Finally,
Multi-Statement UDFs impact the ability to parallelize query plans. For more
information on this, I refer you to this Dec 2011 post of Paul White: http://sqlblog.com/blogs/paul_white/archive/2011/12/23/forcing-a-parallel-query-execution-plan.aspx
Later
-Regan