Tuesday, 5 March 2013

Don't fear the Bulk Copy

While searching for a way to optimise parts of a particular system today, I had managed to get a write to SQL Server (2005, non-local) down to about 0.5 seconds for 121 rows. Not great, but I was prepared to believe that it was working hard as it is a fat table with a dodgy looking schema, with the less said the better about the network in-between.

My boss pointed me at this page: http://msdn.microsoft.com/en-us/library/1y8tb169.aspx which has details on the SqlBulkCopy object in the System.Data library.

To be honest, I was a bit cagey about this. My belief has always been that there is a lower limit of rows before the pros of using a bulk copy operation outweigh the cons. Perhaps that is true, but certainly for a mere 121 rows, it is not an issue at all.

Previously, I had implemented this insert operation using a variety of methods:

  • Injection via an XML parameter, processed by use of @xml.nodes style queries. 
  • Dynamically generated SQL, creating a large INSERT statement with multiple rows using SELECT and UNION ALL. 
  • Individual parameterised INSERT statements. 

Of these, the XML method had the worst performance; SQL Server may have tools to navigate and process XML, but they are definitely not quick and this was to be expected. This process is *marginally* quicker than doing individual insert statements, but only after the number of rows has increased past, say, 50.

The dynamically generated query looked a bit like this:

INSERT INTO [dbo].[tblData] ( [UserId], [CreateDate], [Value] )
SELECT 1001, '2013-03-04 00:14:30.00', 25
UNION ALL SELECT 1023, '2013-03-04 00:14:30.15', 67
UNION ALL SELECT 1038, '2013-03-04 00:14:30.32', 21
Examining the execution plan for this yielded that it spent most of its time doing a clustered index insert; about 98% of its time to be exact. Although performance for this increased between runs (eventually down to 21 milliseconds for the writing of 121 rows), there was still room for improvement. Time for 1000 users in a Parallel.ForEach loop of the entire operation (which included this write) was 00:02:30

It was hoped that the use of parameterised INSERT statements would allow SQL Server to cache an execution plan and use it. In tests, it does perform faster. Time for 1000 users, as above, was 00:01:45

So... on to an implementation using SqlBulkCopy.

If you are throwing arbitrary data at it then this entails a little bit of set up. The WriteToServer method accepts a DataTable object and this must be tailored exactly to the schema for the bulk copy to work.

(Please note that this is an extremely cut down / Noddy version of the table for brevity.)
public override void Put(int userID, List; myData)
    DataTable dt = new DataTable();
    dt.Columns.Add(new DataColumn("DataId", typeof (System.Int32)));
    dt.Columns.Add(new DataColumn("UserId", typeof(System.Int32)));
    dt.Columns.Add(new DataColumn("CreateDate",typeof(System.DateTime)));
    dt.Columns.Add(new DataColumn("Value",typeof(System.Int32)));
After this, you add the data, row by row into the table. If you have a nullable field, then test for HasValue and use either the Value or enter DBNull.Value for the assignment.
foreach (MyData row in myData) {
    DataRow dr = dt.NewRow();
    dr["DataId"] = DBNull.Value;
    dr["UserId"] = row.UserId;
    dr["CreateDate"] = row.CreateDate;
    dr["Value"] = row.Value;
Now we set up the SqlBulkCopy object. I've added two option flags for it - KeepNulls and KeepIdentity. KeepNulls so it will honour the DBNull.Value encountered on some fields, and KeepIdentity so that it leaves the destination table in control of the assignment of row identity. I have included the row ID in the DataTable's columns, set it to DBNull.Value in the rows themselves, but I shall now make sure that it is removed from the column mappings by clearing them and re-adding the columns I require.

The KeepIdentity flag does NOT do that. This code only works because I do not include the identity column in the ColumnMappings collection.

using (
    SqlBulkCopy bulkCopy = new SqlBulkCopy(
        SqlBulkCopyOptions.KeepNulls | SqlBulkCopyOptions.KeepIdentity))

        bulkCopy.DestinationTableName = "dbo.tblData";
        bulkCopy.ColumnMappings.Add("UserId", "UserId");
        bulkCopy.ColumnMappings.Add("CreateDate", "CreateDate");
        bulkCopy.ColumnMappings.Add("Value", "Value");
Then I can perform the write.
        catch (Exception)
The performance of the write, at 121 rows, was 0.025 seconds, instead of 0.5 seconds. Time for 1000 users, in the parallel test mentioned above, was 00:00:45.

This technique is *lightning* fast and totally worth doing for much smaller numbers of rows than I originally thought.

Don't fear the Bulk Copy.

It should be noted that the default behaviour of the WriteToServer command is that the following apply:
  • Table constraints will not be enforced
  • Insert triggers will not fire
  • The operation will use Row locks
This behaviour can be tailored using the SqlBulkCopyOptions enumeration, as detailed here: http://msdn.microsoft.com/en-gb/library/system.data.sqlclient.sqlbulkcopyoptions.aspx