sql Records: Types of User-Defined Functions And Performance Considerations

Microsoft SQL Server supports user-defined functions (UDFs) which provide for code encapsulation and reuse, but come with design and very significant performance considerations. Application developers new to T-SQL like UDFs because they align to a procedural programming design pattern familiar to that skill set. However unless understood and used properly, UDFs will greatly reduce database performance.

In this post I will discuss the following:

1. Types of UDFs and examples

2. Performance comparisons and explanations

1. Types of UDFs and examples

UDFs in SQL Server can return Tables or Scalar values. Additionally UDFs can be Multi-Statement or Single-Statement (labeled In-Line). So logically there are four types – although Scalar In-Line UDFs currently are not supported.

Multi-Statement In-Line

Scalar-Valued Supported Not Supported

Table-Valued Supported Supported

In the SSMS GUI you can see the UDFs in the Programmability -> Functions section.

For our examples, we will be using a simple City, State database. We will be reviewing In-Line vs. Multi-Statement versions of a function to return the first city (alphabetically) for a given state and letter.

Although Scalar functions typically are not explicitly qualified as Multi-Statement, I have named the function with “Multi” in this post to be clear that in this regard Scalar functions align to Multi-Statement Table-Valued functions. As will be explained below, the difference between Multi-Statement and Single-Statement gets to the essential difference among the various types of UDFs and corresponding performance considerations. Below are code samples for the three UDFs.

CREATE FUNCTION [dbo].[GetFirstCity_Scalar_Multi](

@CDState VARCHAR(2)

,@CityFirstLetter VARCHAR(1)

)

RETURNS VARCHAR(100)

BEGIN

DECLARE @TXCity VARCHAR(100)

--other T-SQL statements allowed

SELECT TOP 1

@TXCity = TXCity

FROM

dbo.GeoData

WHERE

CDState = @CDState

AND TXCity LIKE @CityFirstLetter + '%'

ORDER BY

TXCity

--other T-SQL statements allowed

RETURN @TXCity

END

CREATE FUNCTION [dbo].[GetFirstCity_Table_Multi](

@CDState VARCHAR(2)

,@CityFirstLetter VARCHAR(1)

)

RETURNS @City TABLE

(

TXCity VARCHAR(100)

)

BEGIN
--other T-SQL statements allowed

INSERT

INTO

@City

SELECT TOP 1

TXCity

FROM

dbo.GeoData

WHERE

CDState = @CDState

AND TXCity LIKE @CityFirstLetter + '%'

ORDER BY

TXCity

--other T-SQL statements allowed

RETURN

END

CREATE FUNCTION [dbo].[GetFirstCity_Table_InLine](

@CDState VARCHAR(2)

,@CityFirstLetter VARCHAR(1)

) RETURNS TABLE

RETURN
--only one SELECT allowed

SELECT TOP 1

TXCity

FROM

dbo.GeoData

WHERE

CDState = @CDState

AND TXCity LIKE @CityFirstLetter + '%'

ORDER BY

TXCity

Note that the essential work of each UDF – the data access SQL – is identical. What distinguishes each is “only” the interface and implementation wrapping it. But we will see next that this has significant performance implications.

2. Performance comparisons and explanations

Something I always do when comparing various syntax and techniques in T-SQL is to run all options in a single request and compare the Execution Plans and relative costs. Often times this is a good approach but not always. Taking this at face value, the Scalar UDF would be by far the best and the two Table-Valued versions equal.

However taking a closer look, the first thing to notice is that only the Table_InLine version shows any actual data access. The Scalar and Table_Multi show only scans of data - somehow already available. This gets to the essence of the matter. The Scalar and Table_Multi UDFs are black boxes to the caller and query optimizer; the data access in these two UDFs is not transparent to the caller. These behave much like a VB or C# function where the caller simply passes parameters and gets a result – knowing nothing of the implementation of how the result was produced. Likewise, the optimizer does not have visibility into these functions so the cost of these functions is not reported in the Execution Plans.

Looking to the io stats, we see something similar. Only the Table_InLine version shows the real data access.

Table_Multi

Table '#ABDEF001'. Scan count 1, logical reads 1,

Table_InLine

Table 'GEOData'. Scan count 1, logical reads 2

Scalar

(1 row(s) affected)

Additionally running each UDF a single time shows no real difference in wall clock execution time. So I created a performance test which ran each UDF for a distinct list of states, for each for the 26 letters of the alphabet - and iterated this 200 times. Note that I turned on the Discard Results option to better compare just processing time. The results are below.

Table_Multi 0:25

Scalar 0:07

Table_InLine 0:04

This demonstrates and quantifies that the black box nature of the Scalar and Table-Valued Multi UDFs – causing the data access not to show in execution plans – also causes performance issues. To state the obvious, if the optimizer cannot see the SQL, it cannot be optimized for the overall request.

Additionally, use of Multi-Statement UDFs (Table-Valued or Scalar) compounds other inefficiencies. For example, if we remove the IX_StateAndCity index (which will result in table scans), and repeat the test above, the results are as follows.

Table_Multi 31:49 (7,546% Increase)

Scalar 31:09 (26,600% Increase)

Table_InLine 1:04 (1,500% Increase)

In terms of percent increase, the In-Line UDF incurred the least impact and in terms of absolute execution time, it runs in a 30^th of the time of the other UDFs. Multi-Statement UDFs magnify other inefficiencies.

What this all underscores is that coding T-SQL properly is a different skill set from procedural application design patterns. Except in rare and atypical scenarios, set-based T-SQL will outperform an iterative solution. Multi-Statement UDFs cannot be substituted into the query calling them and so force the optimizer to use iteration – calling the UDF for each row. In T-SQL world, this is referred to as RBAR (Row By Agonizing Row) which was coined by Jeff Moden.

By contract InLine UDFs use a form of substitution whereby the UDF code is transparent to the caller and considered as part of the Execution Plan - hence the term In-Line. Below are two queries which (1) call the InLine UDF and (2) explicitly dups the UDF code in the query. The highlighted sections show the different implementations of the CROSS APPLY.

Query #1

SELECT

CDState

,FirstCityForLetterApply.TXCity AS FirstCityForLetter

FROM

StatesCTE

CROSS JOIN Letters

CROSS APPLY [dbo].[GetFirstCity_Table_InLine](CDState,Letter) FirstCityForLetterApply

Query #2

SELECT

CDState

,FirstCityForLetterApply.TXCity AS FirstCityForLetter

FROM

StatesCTE

CROSS JOIN Letters

CROSS APPLY

(

SELECT TOP 1

TXCity

FROM

dbo.GeoData GeoDataApply

WHERE

GeoDataApply.CDState = StatesCTE.CDState

AND GeoDataApply.TXCity LIKE Letters.Letter + '%'

ORDER BY

TXCity

) FirstCityForLetterApply

And we can see that the Execution Plans are identical. So with an In-Line UDF we are able to benefit from code encapsulation and reuse while retaining the performance of a fully optimized query with all parts visible to the optimizer.

Finally, Multi-Statement UDFs impact the ability to parallelize query plans. For more information on this, I refer you to this Dec 2011 post of Paul White: http://sqlblog.com/blogs/paul_white/archive/2011/12/23/forcing-a-parallel-query-execution-plan.aspx

Later

-Regan

sql Records

Wednesday, February 20, 2013

Types of User-Defined Functions And Performance Considerations

No comments:

Post a Comment

About Me