Combinations of a data sets can be ranked according to criteria establishing a probability of a combination referring to the same real-world object. An Invantive SQL statement is presented to calculate the match probability based upon multiple criteria.
Choosing the Right Data Set on a Real-world Object or Person
A certain group of users of Exact Globe value the ease of online web use of Exact Online above the superior features of Exact Globe. A continuous stream of users flows from Globe to Online and many remain on the Online platform.
Despite being engineered within the same company, most engineering of Exact Online was done years later and alternative choices were made. Therefor, Exact Online has a company-wide list of counter parties, named “Accounts”, whereas it seems that in Globe two sets exist, namely Customers and Suppliers, where the same number may refer to either a customer, a supplier or both.
The default conversion translates these numbers into company-wide unique Exact Online account numbers. However, when loading additional years of financial history from XML Audit Files, it is necessary to translate the customer and supplier number into their respective Exact Online account number.
By matching on multiple match probability aspects, a mapping can be made that even across large data sets correlates the parties across Globe and Exact Online.
The probability is assessed for a party in Exact Online and Globe to be the same based upon a combination of the following criteria:
- Exact Online account name ends with the account name of Globe.
- Exact Online number ends with the number of Globe.
- Exact Online behavior as a customer or supplier matches the group in Globe.
The more of these criteria are matched, the higher the probability of being an identical company.
Probability are multiplied in this scenario, but non-linear approaches or correlated weights can also easily be introduced.
Query to Determine Probability
The following query determines for each combination between the Globe and Exact Online account whether they refer to the same counter party.
A value of match probability is assigned to each criteria. The reverse probabilities (so non-match probability) are all multiplied and then reversed again. A combination scoring on all criteria will therefore return a value closer to 100% than another combination.
create or replace table mapcustsupcandidates@inmemorystorage as select distinct cpy.custsupid , act.code act_code , 1 - ( -- -- Chance of wrong account name. -- ( 1 - case when act.name like '%' || cpy.custsupname then 0.4 else 0 end ) * -- -- Chance of wrong account code. -- ( 1 - case when act.code like '%' || trim(cpy.custsupid) then 0.6 when trim(cpy.custsupid) like '%' || act.code then 0.6 else 0 end ) * -- -- Chance of wrong account properties. -- ( 1 - case when trim(cpy.custsupid) like 'D%' then case when act.status = 'C' then 0.4 else 0 end when trim(cpy.custsupid) like 'C%' then case when act.issupplier then 0.4 else 0 end else 0.2 end ) ) chance_right from CounterParties@xaf join exactonlinerest..accounts@eol act -- -- When a Creditor and Debtors with the same number after -- the initial C or D, they -- are split into two different numbers in Exact Online. -- Exact Online conversion then adds a '10' to one of them -- but maintains the original number in the name of the account. -- on act.name like '%' || cpy.custsupname and case when cpy.custsupid like 'D%' then case when act.status = 'C' then true else false end when cpy.custsupid like 'C%' then act.issupplier else true end = true
List Most Probable Matches
A simple query reports the most likely match:
select custsupid , first(act_code) accountcode from ( select custsupid , act_code , chance_right from mapcustsupcandidates@inmemorystorage order by custsupid , chance_right desc ) group by custsupid
Please note that the ‘first()’ group function with an ‘order by’ could also have been replaced by using the Invantive SQL ‘union distinct’ set operator since there are at most two groups of counter parties.