Friday, November 16, 2012

Converting a delimited list to a table using T-SQL

There is a common problem of converting a delimited list of values to a table format to join to other tables within a database. For example, a session database might have a stored procedure that accepts two parameters: (a) a SessionID and (b) a comma-delimited list of attributes to retrieve. Below are 6 techniques to accomplish this within SQL Server and a 7th which essentially pushes the processing to the application tier.
We will work through concrete stored procedure examples using the following table. The stored procedures each accept the same parameters – a single SessionID and a deliminted list of attributes requested for the given SessionID.
CREATE TABLE [dbo].[Sessions](
       [SessionID] [bigint] NOT NULL,
       [AttributeID] [bigint] NOT NULL,
       [Value] [varchar](50) NOT NULL,
 CONSTRAINT [PK_Sessions] PRIMARY KEY CLUSTERED
(
       [SessionID] ASC,
       [AttributeID] ASC
) ON [PRIMARY];
GO
New-comers to T-SQL often try the SQL below - not realizing that the parameter is an atomic value and does not materialize into the SQL statement.
--solution 0 - intuitive, but not supported
CREATE PROCEDURE uspGetAttributes
        @SessionID BIGINT
       ,@AttributeIDs VARCHAR(1000)
AS
SELECT
        SessionID
       ,AttributeID
       ,Value
FROM
       dbo.Sessions
WHERE
       SessionID = @SessionID
AND AttributeID IN (@AttributeIDs);
GO
Solution 1 below works, but is using an iterative parsing approach – adopted from application development but foreign to set-based T-SQL. It is unnecessarily complex and is not recommended.
--solution 1 - iterative split to table variable
CREATE PROCEDURE uspGetAttributes
        @SessionID BIGINT
       ,@AttributeIDs VARCHAR(1000)
AS
DECLARE @SplitValues TABLE (SplitVal VARCHAR(100))
DECLARE @pos INT;
DECLARE @tmpval VARCHAR(20);
SET @pos = CHARINDEX(',', @AttributeIDs);
WHILE @pos > 0
BEGIN
       SET @tmpval = LEFT(@AttributeIDs, CHARINDEX(',', @AttributeIDs) - 1)
       INSERT @SplitValues (SplitVal) VALUES (@tmpval)
       SET @AttributeIDs = SUBSTRING(@AttributeIDs, @pos + 1, DATALENGTH(@AttributeIDs))
       SET @pos = CHARINDEX(',', @AttributeIDs)
END
INSERT @SplitValues (SplitVal) VALUES (@AttributeIDs);
SELECT
        SessionID
       ,AttributeID
       ,Value
FROM
       dbo.Sessions
       INNER JOIN @SplitValues AttributeSplit
              ON     AttributeSplit.SplitVal = dbo.Sessions.AttributeID
WHERE
       SessionID = @SessionID;
GO
The second solution uses a recursive Common Table Expression (CTE). Although by some definitions technically set-based, this essentially still is an iterative approach to parse out the values into a derived table. This is interesting to understand but there are better solutions.
--solution 2 - Recursive CTE
CREATE PROCEDURE uspGetAttributes
        @SessionID BIGINT
       ,@AttributeIDs VARCHAR(1000)
AS
SET @AttributeIDs += ',';
WITH SplitValsCTE
AS
(
       SELECT
               SUBSTRING(@AttributeIDs,1,CHARINDEX(',', @AttributeIDs)-1)  AS SplitVal
              ,SUBSTRING(@AttributeIDs,CHARINDEX(',', @AttributeIDs)+1, DATALENGTH(@AttributeIDs))  AS RemainingString
       UNION ALL
       SELECT
               SUBSTRING(RemainingString,1,CHARINDEX(',', RemainingString)-1)
              ,SUBSTRING(RemainingString,CHARINDEX(',', RemainingString )+1, DATALENGTH(RemainingString))
       FROM
              SplitValsCTE
       WHERE
              CHARINDEX(',', RemainingString) > 0
       AND DATALENGTH(SplitVal) > 0
)
SELECT
        SessionID
       ,AttributeID
       ,Value
FROM
       dbo.Sessions
       INNER JOIN SplitValsCTE
              ON     SplitValsCTE.SplitVal = dbo.Sessions.AttributeID
WHERE
       SessionID = @SessionID;
GO
The third solution uses a numbers or tally table. This uses a table of integers along with string functions to parse out the values. If the length of delimited list string exceeds 2,047, you would need to implement a different numbers table – either a static base table or a virtual table. See Itzik Ben-Gan for this: http://www.sqlmag.com/article/sql-server/virtual-auxiliary-table-of-numbers
--solution 3 - numbers table
CREATE PROCEDURE uspGetAttributes
        @SessionID BIGINT
       ,@AttributeIDs VARCHAR(1000)
AS
SET @AttributeIDs = ',' + @AttributeIDs + ',';
SELECT
        SessionID
       ,AttributeID
       ,Value
FROM  
       [master].[dbo].[spt_values] Nums
       INNER JOIN dbo.Sessions
              ON     SUBSTRING(@AttributeIDs,(Nums.number+1),CHARINDEX(',',@AttributeIDs,(Nums.number+1))-Nums.number-1) = dbo.Sessions.AttributeID
WHERE
       SessionID = @SessionID
AND    type = 'P'
AND    Nums.number BETWEEN 1 AND DATALENGTH(@AttributeIDs)
AND    SUBSTRING(@AttributeIDs,Nums.number,1) = ','
AND    CHARINDEX(',',@AttributeIDs,(Nums.number+1)) > Nums.number;
GO
The fourth solution converts the input string to XML which then allows for using XML functions to parse out the values.
--solution 4 - xml
CREATE PROCEDURE uspGetAttributes
        @SessionID BIGINT
       ,@AttributeIDs VARCHAR(1000)
AS
DECLARE @AttributeIDsXML AS XML
SET @AttributeIDs = LEFT(@AttributeIDs, DATALENGTH(@AttributeIDs)-1) ;
SET @AttributeIDsXML = CAST('<root><x>' + REPLACE(@AttributeIDs,',','</x><x>') + '</x></root>' AS XML) ;
SELECT
        dbo.Sessions.SessionID
       ,dbo.Sessions.AttributeID
       ,dbo.Sessions.Value
FROM
       dbo.Sessions
       INNER JOIN
              (
              SELECT
                     root.x.value('.', 'BIGINT') AS AttributeID
              FROM
                     @AttributeIDsXML.nodes('(/root/x)') as root(x)
              ) AttributeIDs
              ON     AttributeIDs.AttributeID = dbo.Sessions.AttributeID
WHERE
       SessionID = @SessionID;
GO
The fifth option is dynamic SQL. This is a heavy-handed way of forcing what originally was tried in the non-workable solution 0. It dynamically builds the SQL statement into a string variable - materializing the list into the statement string; and then EXECUTEs that statement. It works but is not recommended for performance reasons as well as possible SQL injection exposure.
--solution 5 dynamic SQL
CREATE PROCEDURE uspGetAttributes
        @SessionID BIGINT
       ,@AttributeIDs VARCHAR(1000)
AS
SET @AttributeIDs = LEFT(@AttributeIDs, DATALENGTH(@AttributeIDs)-1)
SET @AttributeIDs = REPLACE(@AttributeIDs, '''', '''''')
EXECUTE
('
SELECT
        SessionID
       ,AttributeID
       ,Value
FROM
       dbo.Sessions
WHERE
       SessionID = ''' + @SessionID + '''
AND AttributeID IN (' + @AttributeIDs + ')
');
GO
The final T-SQL option performs a counter-intuitive reverse LIKE operation where the logic compares each AttributeID from the base table to see if it is LIKE the list passed in. This works and is efficient but depends on there being a relatively low volume of AttributeID values for each SessionID. This is because it is not SARGable and so within the SessionID values it must scan all AttributeID values. From the execution plan you can see it is scanning a range looking for matches based on the predicate.

  
--solution 6 reverse LIKE
CREATE PROCEDURE uspGetAttributes
        @SessionID BIGINT
       ,@AttributeIDs VARCHAR(1000)
AS
SET @AttributeIDs = ',' + @AttributeIDs + ',';
SELECT
        SessionID
       ,AttributeID
       ,Value
FROM
       dbo.Sessions
WHERE
       SessionID = @SessionID
AND @AttributeIDs LIKE '%,' + CAST(AttributeID AS VARCHAR(20)) + ',%';
GO
If you are at .NET 4.0 for the application tier and SQL Server 2008+ you can take advantage of Table-Valued Parameters (TVP).  For this you create a table type object – essentially just a schema for a table variable and then you can have the application load a table object instead of a delimited list and pass that object to the stored procedure. This is not a T-SQL solution to convert a delimited list to a table, but I wanted to include it because it is an excellent solution to this type of problem. It pushes the conversion to the application tier – or more than likely would never use a delimited list to begin with. It also should be noted that even at SQL Server 2005, it is possible to take a similar approach by passing XML instead of the delimited list and then processing it in like manner to solution 4.
--solution 7 table-valued parameter (TVP) - requires interface change to sproc
CREATE TYPE [dbo].[AttributesType] AS TABLE(
       [AttributeID] [bigint] NOT NULL,
       PRIMARY KEY CLUSTERED
(
       [AttributeID] ASC
)WITH (IGNORE_DUP_KEY = OFF)
);
GO
CREATE PROCEDURE uspGetAttributes
        @SessionID   BIGINT
       ,@AttributeIDsTVP dbo.AttributesType READONLY
AS
SELECT
        dbo.Sessions.SessionID
       ,dbo.Sessions.AttributeID
       ,dbo.Sessions.Value
FROM
       dbo.Sessions
       INNER JOIN @AttributeIDsTVP AttributeIDsTVP
              ON     AttributeIDsTVP.AttributeID = dbo.Sessions.AttributeID 
WHERE
       SessionID = @SessionID;
GO
  
Regarding which solution is best to implement, it depends on the characteristics of your data. If you know that there will always be a relatively low volume of values for the column associated with the delimited list, and there is an index within which these can be scanned – e.g.  low volume of AttributeIDs for each SessionID – then the reverse LIKE works well and avoids some hits to the database by way of index searches. Below is index search data for a 30 minute load test of the various solutions. Reverse LIKE is lowest by a factor of 8 compared to all of the solutions except CTE, and a factor of 34 for CTE. But again, this solution should be considered only in cases where the characteristics of the data are known to be compatible.


 Probably the most widely-effective solution for a variety of data distributions is the numbers table solution – because it uses a separate process to extract the individual values and in that regard does not depend on characteristics of the data to be joined to. It also avoids the unnecessary conversion of the input to XML and the iterative characteristics of recursive CTEs.
A further consideration is code isolation and re-use. The numbers-table solution is well-suited to abstraction to an in-line table UDF.  This is a topic unto itself, but for the present purpose it is sufficient to emphasize that the UDF must be in-line so that it is materialized into the calling object and thereby available to the optimizer. For a UDF to be in-line, it must be comprised of a single query.  Below is an example of abstracting the Delimited-List-To-Table processing to an in-line UDF. You could also implement this with the CTE and XML solutions, but not with the other solutions including reverse LIKE.
  
CREATE FUNCTION  dbo.ParseValues_InLine(
       @AttributeIDs VARCHAR(1000)
) RETURNS TABLE
AS
RETURN
       SELECT
               SUBSTRING(',' + @AttributeIDs + ',',(Nums.number+1),CHARINDEX(',',',' + @AttributeIDs + ',',(Nums.number+1))-Nums.number-1) AS Item
       FROM  
              [master].[dbo].[spt_values] Nums
       WHERE
              Type = 'P'
       AND    Nums.number BETWEEN 1 AND DATALENGTH(',' + @AttributeIDs + ',')
       AND    SUBSTRING(',' + @AttributeIDs + ',',Nums.number,1) = ','
       AND    CHARINDEX(',',',' + @AttributeIDs + ',',(Nums.number+1)) > Nums.number;
GO
--solution 3a - numbers table using in-line UDF
CREATE PROCEDURE uspGetAttributes
        @SessionID          BIGINT
       ,@AttributeIDs       VARCHAR(1000)
AS
SELECT
        SessionID
       ,AttributeID
       ,Value
FROM  
       dbo.Sessions
       INNER JOIN dbo.ParseValues_InLine(@AttributeIDs) AttributeIDsTable
              ON     AttributeIDsTable.Item = dbo.Sessions.AttributeID
WHERE
       SessionID = @SessionID;
GO
Comparing the original numbers table solution to the updated version using an in-line UDF, the execution plans are identical. So we were able to abstract the Delimited-List-To-Table processing to a function avoiding code duplications, and without loss of performance. Naturally, the UDF could be expanded to pass a delimiter to make it more flexible.
Original stored procedure using the numbers table solution
Updated stored procedure using the numbers table solution within an in-line UDF

12/28/2012 Update:
I just read a great article by Jeff Moden which explains further tuning opportunities of the Tally Table solution avoiding scale issues resulting from the concationation of delimiters.
http://www.sqlservercentral.com/articles/Tally+Table/72993/

Based heavily on Moden's article this is a good encapsulated function for this need. It uses a materialized numbers table. 

CREATE FUNCTION  [dbo].[fnListToTable8K](
        @List               VARCHAR(8000)
       ,@Delimiter          CHAR(1)
) RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
      --this UNIONs a "1" (first element) with n+1 values (subsequent elements) relative to delimiter positions
       --eliminates need to prepend a delimiter
       WITH ElementStartCTE
       AS
       (
       SELECT 1 AS Start
       UNION ALL
       SELECT
              n+1
       FROM
              dbo.Numbers WITH (NOLOCK)
       WHERE
              n <= DATALENGTH(@List)
       AND    SUBSTRING(@List,n,1) = @Delimiter
       )

       --ISNULL\NULLIF logic handles last element eliminating need to append delimiter
       ,ElementStartAndLengthCTE
       AS
       (
       SELECT
              Start
              ,ISNULL(NULLIF(CHARINDEX(@Delimiter,@List,Start),0)-Start,DATALENGTH(@List)) AS Length
       FROM
              ElementStartCTE
       )

       SELECT
              SUBSTRING(@List,Start,Length) AS Item
       FROM
              ElementStartAndLengthCTE;
Later
-Regan