Performance and scalability improvements on UPDATE and DELETE SQL statements

Invantive SQL is used by many companies to exchange data between (cloud) platforms and to maintain data, for instance synchronizing back office systems with mid office or front office systems. Release 20.1.393 includes features to improve the scalability of the DML operations update and delete during non-time-critical maintenance of high-volume environments by restricting their run time using the limit clause.

These improvements hold across all available platforms that support update and/or delete, such as Visma.net Financials, ActiveCampaign and Exact Online, as well as ANSI SQL-based platforms such as SQL Server.

The insert operation has no new features, but thanks to the bulk insert support on platforms such as SQL Server, PostgreSQL and Oracle it already provides great performance. Note that MySQL databases do not offer native bulk insert support.
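As an illustration, a set-based insert that benefits from such bulk insert support might look as follows; the table names orders and orders_staging are hypothetical:

--
-- Set-based insert; on platforms such as SQL Server, PostgreSQL
-- and Oracle the rows are transferred using bulk insert.
-- Table names are hypothetical.
--
insert into orders@sqlserver
( id
, description
)
select id
,      description
from   orders_staging@sqlserver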

Limit clause

The limit clause has been available for years on all select statements, making it possible to restrict the number of rows retrieved from one or more data containers in a query. The limit clause introduces a step in the query’s execution plan, cutting off data retrieval after the indicated number of rows has been handled.

For example, the following query will retrieve only a small number of rows before finishing, even when millions of rows are present:

select *
from   devices@sqlserver
limit  1

The query can also be written with top 1 to improve interoperability with queries originally written for platforms such as SQL Server.
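For example, the query above rewritten with top:

select top 1 *
from   devices@sqlserver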

Batches of data

The limit can be placed before or after an order by clause. The order by clause requires collecting the full result set and sorting it completely before returning rows to the next step in the query’s execution plan.

Often, the limit step can be moved from after to before the order by step, improving performance in terms of speed and data processed thanks to the streaming nature of Invantive SQL row retrieval.
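A sketch of both placements, on a hypothetical transactions table with an amount column and assuming the grammar accepts the limit clause in both positions as described above. Note that the two forms are not equivalent: with the limit last, the full set is sorted and the top rows are returned, while with the limit first, only the rows retrieved before the cut are sorted:

--
-- Limit after order by: the full result set is retrieved
-- and sorted before the first 10 rows are returned.
--
select *
from   transactions@sqlserver
order by amount desc
limit  10

--
-- Limit before order by: retrieval stops after 10 rows,
-- and only those 10 rows are sorted.
--
select *
from   transactions@sqlserver
limit  10
order by amount desc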

Performance limitation of limit on PostgreSQL

Specifically and solely on PostgreSQL, the PostgreSQL ADO.NET provider Npgsql used by Invantive SQL has a strong limitation in streaming data. By default, the Npgsql provider needs to complete the full retrieval of a result set before being able to continue. Even when just a few rows are needed, the Npgsql provider will retrieve all rows from PostgreSQL before continuing.

Since Invantive SQL currently does not forward limit or group operations such as max to the data container, as described in the forum topic “Query Tool MAX(), Order By ... statements speed question with SQL Server”, this introduces significant overhead on PostgreSQL. This overhead is especially measurable on tables that are very large in terms of the amount of data stored.
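A sketch of a statement that suffers from this behaviour; the table name eventlog and the data container alias postgresql are hypothetical:

--
-- Even with limit 1, the Npgsql provider retrieves the complete
-- result set from PostgreSQL before the first row is available,
-- so the full table is transferred.
--
select *
from   eventlog@postgresql
limit  1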

Parallel execution

When a query statement is executed in parallel across multiple partitions, such as across Exact Online companies (“divisions”), a number of individual partitions start streaming data into the output stream in parallel. Once the requested number of rows has reached the limit clause, the evaluation of all streams ends, as in:

--
-- Select for instance 250 companies.
--
use all@eol

--
-- Retrieve the first 5,000 transaction lines
-- across 250 companies.
--
select *
from   TransactionLinesBulk@eol
limit  5000

Both single-threaded (single partition) and parallel execution use a look-ahead buffer of typically “one large screen filled with rows” per partition. The number of rows in this look-ahead buffer varies depending on the properties of the data container, but a value of 100 rows is an easy-to-remember approximation.

In the sample given, there might be 16 parallel executions, each fetching 1,000 rows given the paging size of TransactionLinesBulk, before the 5,000 rows have been handled: 16,000 rows fetched, leading to 11,000 rows being discarded. Even worse, when some partitions run significantly faster, some of them might fetch 2,000 rows while others might still be waiting for data.

Although it may seem wasteful, the algorithm has been optimized over the years to improve the performance of first-row retrieval and query execution time with a limited percentage of rows going unused.

Limit with update and delete

In many scenarios, the synchronization of data across systems or the maintenance of data is not time critical. Business runs fine as long as the data state converges within a reasonable time frame. However, the data volumes to be handled can fluctuate considerably. The most evident example is a Black Friday scenario: an order of magnitude more transaction data arriving within a short time slot, all of it to be handled.

On both update and delete a limit clause can be added. The limit restricts the maximum number of rows being updated or deleted. For example, the following statement will update a limited number of rows, even when more rows match the restrictions:

update exactonlinerest..tasks@exactonline
set   description = description || ' [DONE]'
where  handled = true
and    description not like '% [DONE]'
limit  100

When the same statement is run over and over again across a number of days, all handled tasks will ultimately have their description changed. However, on a single run no more than 100 tasks will be updated, and in general not many more than 100 rows will be retrieved to determine their primary keys for the update. In fact, the internal workflow of Invantive SQL is:

  • Retrieve the rows matching the criteria of the where-clause, limited to the specified maximum number of rows.
  • For each row, determine the primary key value and for update also the new row values.
  • Update or delete the row using an API call.

As a welcome side effect of running much faster, this approach also makes it easier to stay within the limitations on the number of API calls allowed.
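The same capped approach works for delete. A minimal sketch, reusing the hypothetical tasks scenario from above, that removes at most 100 handled tasks per run:

--
-- Delete at most 100 handled tasks per run; repeated runs
-- eventually remove all matching rows.
--
delete exactonlinerest..tasks@exactonline
where  handled = true
limit  100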

Filter forwarding on update and delete

However, sometimes a limit does not enhance scalability or performance. Invantive SQL can forward many simple filters contained in the where-clause of a delete or update to the data container. For instance, the following simple delete removes all rows matching the where-clause:

delete logging@sqlserver
where  date_created < trunc(sysdate) - 7

This applies especially, and almost always, to ANSI SQL-based platforms such as SQL Server. Few cloud platforms have an API that allows multiple rows to be deleted or updated with a single API call.

In general, performance is better when not using limit on a platform that can forward simple filters, provided your update or delete statement uses a combination of simple filters that can all be forwarded.
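As a contrast, a sketch of the same delete with a limit added; following the internal workflow described earlier, the matching rows would then first be retrieved and removed one by one instead of through a single forwarded statement:

--
-- The limit forces row-by-row processing: rows are retrieved
-- to determine their primary keys and then deleted individually.
--
delete logging@sqlserver
where  date_created < trunc(sysdate) - 7
limit  1000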

The easiest way to determine the recommended approach is to measure actual system behaviour. For many purposes, the contents of the data dictionary view SessionIOs@DataDictionary provide a great starting point for further tuning and optimization. The view enables you to make sure that the actual statement execution behaviour matches your expectations. You can also read more on various performance-related topics on this community.
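For example, a minimal query to inspect the I/O actually caused by your statements:

--
-- Show the I/O of the sessions, such as the number of rows
-- and requests, to verify actual statement execution behaviour.
--
select *
from   SessionIOs@DataDictionary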