How to optimize the synchronize tables SQL statement to achieve one-way watermark incremental logic

I am working on a new watermark-based incremental loading script in a stored procedure that should achieve fast synchronization with low traffic consumption, on tables containing 500K (and growing…) records of data.

I have a source_table@container1 that contains an ID (PRIMARY KEY with AUTO_INCREMENT) and a write_date column.

Simply doing a

synchronize source_table@container1
into        dest_table@container2
with        insert
            or update
identified  by id

to synchronize the table data just works, very nicely and quite fast (approximately 140 seconds for 500K records).

I’ve found out that the Invantive synchronize statement:

  • downloads all records from both source_table@container1 and dest_table@container2,
  • then compares the data,
  • then pushes the necessary inserts or updates to dest_table@container2.

In my case, out of over 500K records, only very few rows have changed. Applying only the incremental changes would (if well-coded) take a few ms. I have 40+ data tables to synchronize, if possible at small and regular intervals.

In my script logic I have done the following for now:

I leverage a new table sync_table@container2 that contains script run information, like job_start.

select top 1 job_start
from   sync_table@container2
order  by id desc

I store the result into a last_job_start variable.

Then I get only the created or updated records from container1:

create or replace table temp@inmemorystorage
as
select *
from   source_table@container1
where  write_date > last_job_start

This table holds a very small amount of updated or created records since the last sync, enabling incremental loading.
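The watermark logic above can be sketched in Python (an illustrative model only, not Invantive SQL; the row layout and dates are hypothetical examples):

```python
from datetime import datetime

def changed_rows(source_rows, last_job_start):
    """Keep only rows created or updated after the watermark.

    source_rows: iterable of dicts with at least 'id' and 'write_date'.
    last_job_start: datetime of the previous job start (the watermark).
    """
    return [r for r in source_rows if r["write_date"] > last_job_start]

# Hypothetical sample: only the row written after the watermark survives.
rows = [
    {"id": 1, "write_date": datetime(2023, 1, 1)},
    {"id": 2, "write_date": datetime(2023, 3, 1)},
]
delta = changed_rows(rows, datetime(2023, 2, 1))
```

The delta then typically contains a handful of rows instead of 500K, which is what makes the incremental load cheap on the source side.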

But then I’m stuck, because I’m using the nice synchronize statement that enables me (quite nicely) to skip writing long insert + update logic:

synchronize temp@inmemorystorage 
into        dest_table@container2
with        insert
            or update
identified  by id

The synchronize still loads the full data set from container2… so I have only saved half of the script execution time.

Is there a workaround (or new Invantive synchronize logic to come?) to avoid the full load of the destination container?

When correctly understood, both tables have a similar size in terms of row count, like 500.000 rows of data. The approach depicted above, first creating a smaller subset to load and merge, indeed saves a large amount of data retrieval on one side of the equation, typically halving execution time for the user.

When I understand correctly, temp@InMemoryStorage contains a new and somewhat overlapping data set compared to the other side, like the following tables with data at synchronization start:


which should, after applying the action on the first table, change the first table into:


The non-optimized approach (without the incremental technique) would have been:



Is the first pre/post the intended incremental behaviour?

If so, then it is an incremental data loading scenario we have seen on various occasions, since most replication scenarios typically involve unidirectional synchronization of tables combined with solely inserts and/or updates of values.

Also, the platform storing the target table often has an efficient implementation of server-side filtering, such as allowing a where clause in a query on the target table. All ANSI SQL-based platforms offer server-side filtering, and typical OData platforms also support it.

A possible optimization that is being considered for inclusion in Invantive SQL is for this frequently occurring edge case of table synchronization. The internal logic of table synchronization would be extended to something matching the following pseudo-code:

if unidirectional synchronization from small source table to large target table
  if solely insert or update on target table
    if platform of target table supports server-side filtering on column values
      search small source table data for columns with a low variance on value
      (such as few values on enumerables, or all within a small range on non-enumerables)
      select a limited number (like 5) of candidates from these to add as a server-side filter to target table upon retrieval
      apply traditional logic between small source table and filtered target table
    end if;
  end if;
end if;

This should reduce the runtime for this synchronization edge case from scaling linearly with the number of rows stored in the two tables to scaling with the number of changed/new rows between the tables.
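The candidate-selection step of that pseudo-code can be sketched in Python (all column names, data, and thresholds below are hypothetical; the real logic would live inside synchronize):

```python
def low_variance_columns(rows, max_distinct=5):
    """Find columns in the small source delta with few distinct values;
    these are good candidates for a server-side filter on the target."""
    candidates = {}
    if not rows:
        return candidates
    for col in rows[0]:
        distinct = {r[col] for r in rows}
        if len(distinct) <= max_distinct:
            candidates[col] = sorted(distinct)
    return candidates

def build_filter(candidates, limit=5):
    """Combine up to `limit` candidate columns into an IN-list predicate
    to push down to the target platform upon retrieval."""
    clauses = []
    for col, values in list(candidates.items())[:limit]:
        in_list = ", ".join(str(v) for v in values)
        clauses.append(f"{col} in ({in_list})")
    return " and ".join(clauses)

# Hypothetical delta: 'division' has low variance, 'id' does not.
delta = [
    {"id": 101, "division": 10},
    {"id": 102, "division": 10},
    {"id": 103, "division": 20},
]
where = build_filter(low_variance_columns(delta, max_distinct=2))
```

The resulting predicate restricts the target retrieval to rows that could possibly match the delta, so only a fraction of the 500K target rows has to travel over the wire.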

This optimization would require no changes to existing SQL code.

What are your thoughts on this?

Then I am at this step:

select a limited number (like 5) of candidates from these to add as a server-side filter to target table upon retrieval

I’ll need help with that one:

apply traditional logic between small source table on filtered target table

But I have an appointment next week with my preferred Invantive consultant. I’ll share results right after. Stay tuned.

Thanks for sharing. The intention is to fully integrate the whole algorithm into synchronize. To reduce the chance of an unnecessary consult, I will first consult development regarding this improvement.

To include:

  • Retrieve all from source
  • Calculate n
  • Retrieve 2n streaming from target
  • Decide on approach
  • Determine selectiveness by comparing source and target samples.
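One way to read the "determine selectiveness by comparing source and target samples" step, sketched in Python (the sample data and the 0.5 threshold are assumptions for illustration, not the actual internals):

```python
def estimated_selectivity(filter_values, target_sample):
    """Estimate the fraction of the target table a candidate filter would
    retain, using a small streamed sample of target column values."""
    if not target_sample:
        return 1.0
    hits = sum(1 for value in target_sample if value in filter_values)
    return hits / len(target_sample)

# Hypothetical 2n streamed values from the target column.
sample = [10, 10, 20, 30, 40, 50, 60, 70, 80, 90]
selectivity = estimated_selectivity({10, 20}, sample)

# Push the filter server-side only when it is selective enough.
use_filtered_retrieval = selectivity < 0.5
```

If the candidate filter would retain most of the target anyway, the traditional full compare is cheaper than the extra round trips.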

We have a similar scenario and see the same problem as the OP, described in this thread (in Dutch).

In our case, the target table contains 100M rows (55 GB of on-disk data). We also only synchronize with insert or update. We’re first dumping OData data (from Twinfield) into a staging table, so deletes cannot be detected anyway. That’s a common limitation of batch ETL processing (streaming CDC processing could detect them, but that’s an entirely different chapter).

A full table scan on the target table cannot help us here: it would take a very significant amount of time and would require significantly more resources (probably more than we have available) on the machine running the Invantive Query Tool.

I am now working on a staging script that does the incremental work by finding newly added rows in source_table and inserting them into dest_table; it then also updates the values that need to be updated.

By doing a bulk insert I can achieve high-speed insertions, and loop statements perform the necessary updates.
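The bulk-insert-plus-update-loop logic can be sketched as follows (a simplified in-memory model with hypothetical rows; the actual script issues one bulk insert and looped update statements against dest_table@container2):

```python
def incremental_merge(source_delta, target):
    """Merge a source delta into the target: insert rows whose id is new,
    update rows whose id already exists. target is a dict keyed by id."""
    inserts = [r for r in source_delta if r["id"] not in target]
    updates = [r for r in source_delta if r["id"] in target]
    for r in inserts:   # one bulk insert statement in the real script
        target[r["id"]] = dict(r)
    for r in updates:   # looped update statements in the real script
        target[r["id"]].update(r)
    return len(inserts), len(updates)

# Hypothetical rows: id 1 is updated, id 2 is newly inserted.
target = {1: {"id": 1, "amount": 100}}
delta = [{"id": 1, "amount": 150}, {"id": 2, "amount": 75}]
n_inserted, n_updated = incremental_merge(delta, target)
```

Splitting the delta by key existence is what lets the inserts go through a single fast bulk path while only the genuinely changed rows pay the per-row update cost.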

It seems to work pretty well.

Of course, my business case is simple here (no deletions on source_table that have to be propagated to dest_table, etc.), but it shows an approach for such applicable cases. Of course, this script is tailored to the table structure and is not a “generic” script like what synchronize does…

I’ll share my full script once it leaves staging status…
