I am working on a new incremental loading script, implemented as a stored procedure, that should achieve fast execution and low traffic consumption on tables containing 500K (and growing…) records of data.
I have a source_table@container1 that contains an ID column (PRIMARY KEY with AUTO_INCREMENT) and a write_date column.
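For reference, this is roughly the shape of the source table as I understand it (MySQL-style DDL; the exact column types and the remaining payload columns are assumptions on my side):

-- Sketch of the assumed source table; types and extra columns are assumptions.
create table source_table
( id         int auto_increment primary key
, write_date datetime not null
  -- ...plus the actual data columns
);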
Simply doing a

synchronize source_table@container1
into dest_table@container2
with insert or update
identified by id

to synchronize the table data just works, very nicely and quite fast (approximately 140 seconds for 500K records).
I’ve found out that the Invantive synchronize statement:
- downloads all records from both tables, source_table@container1 and dest_table@container2,
- then compares the data,
- then pushes the necessary insert or update statements to dest_table@container2.
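Conceptually (this is only my own illustration of the manual alternative, not how Invantive implements synchronize internally, and I have not tested it), that compare-and-push replaces hand-written logic along these lines:

-- Conceptual sketch only: insert the rows whose id does not yet exist in the destination.
insert into dest_table@container2
select s.*
from   source_table@container1 s
where  s.id not in
       ( select d.id
         from   dest_table@container2 d
       )
-- ...plus a matching bulk update for existing rows whose column values changed.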
In my case, out of over 500K records, only very few rows have changed values. Applying only the incremental changes would take (if well coded) a few ms. I have 40+ data tables to synchronize, if possible at short and regular intervals.
In my script logic I have done the following for now:
I leverage a new table sync_table@container2 that contains script run information, like job_start. First I fetch the start of the last run:

select top 1 job_start from sync_data order by id desc

and store the result into a last_job_start variable.
Then I get only the records created or updated in container1 since then:

create or replace temp@inmemorystorage
as
select *
from   source_table@container1
where  write_date > last_job_start

This table holds only the small set of records created or updated since the last sync, enabling incremental loading.
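Put together, what I have in mind for this step is roughly the block below. This is only a sketch of my intent: the datetime type, the select … into syntax, referencing the variable directly in the where clause, using max(job_start) instead of my top-1-by-id query, and running the create statement inside the block are all assumptions on my side.

declare
  last_job_start datetime;
begin
  -- Fetch the start time of the previous run (sketch; assumes select ... into is available).
  select max(job_start)
  into   last_job_start
  from   sync_data@container2;
  --
  -- Stage only the records created or changed since that moment in an in-memory table.
  create or replace table temp@inmemorystorage
  as
  select *
  from   source_table@container1
  where  write_date > last_job_start;
end;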
But then I am stuck, because I would like to keep using the nice synchronize statement that (quite nicely) lets me skip writing long insert + update logic:

synchronize temp@inmemorystorage
into dest_table@container2
with insert or update
identified by id
The synchronize still loads the full data from container2… so I have only saved half of the script execution time.
Is there a workaround (or new Invantive synchronize logic to come?) to avoid the full load of the destination container?