Time synchronization: a need for real-time data warehouses

Understanding of the entire business process of a data warehouse

Data warehouses collect data of all parts of the organization. In this way they give an understanding of the entire business process and not just about those segments that happen to be registered by a software package. Traditional data warehouses are not immediately supplied with new data. Often new data appears whenever the positions are checked on a monthly or daily basis, such as in ‘Turnover October 2012’. This allows you to accurately answer a number of predetermined questions over a longer period.

If it is not just the positions that are registered, but also the individual transactions, then questions can be answered that do not always have to be defined in advance. If transactions are registered, then it is also useful to check whether it is an option to switch from star schemas and similar schemas to a more relational storage model, like data vault. As an aside: data vault is a modeling technique to be used primarily for data warehouses with a normalized model as basis. In this respect data vault and Invantive Producer share the same concepts.

Real-time: new findings arrive with every tick of the clock

Some business processes have a relatively short life cycle. An example would be the trading in derivatives or the online selling of books. If you desire updated data from the data warehouse in such a process, then a daily or monthly update of the data warehouse no longer suffices.

In that case you would like updated data from your data warehouse. I personally think this is a great moment. The need to have updated data in your data warehouse usually means that the data warehouse is seen as a factor with a lot of added value.

Flawless real-time and flying trackballs

It is nearly always the case that data is not ‘flawless’ real-time, but instead the data warehouse has a short delay compared to the actual situation in the company. Depending on processing speed, technical capabilities and desired delay this can be very low (few seconds) or quite a bit longer (one day).

Sometimes a self-imposed minimum can be placed. You do not always want the latest data, but rather a ‘subdued’ version of it. For instance, this subdued version is reached by halting the largely automated trading of certain financial products whenever the dynamic balance is disrupted.

Why would you want a minimum of delay in the updating of your data warehouse for a subdued version? My favorite example is a trackball with dynamic power feedback. During my graduation I was allowed to run tests with it. On this trackball with a diameter of about 5 centimeters there were two electromotors. The direction of rotation and torque of the electromotors could be affected with software, but for the more hardware-minded there was also a potentiometer. If I turned that myself, then the dynamic feedback system of the trackball would enhance itself. The trackball would literally fly meters through the room after only a slight push. I imposed a minimum on myself through the settings of the potentiometer then.

Because certain business processes are sometimes connected through the data warehouse, you do not want them to enhance each other. It is of course possible to make a system analysis in respect to the interactions, but that is no mean feat when it comes to complex business processes. If you are starting with data warehousing it is smarter to subdue unexpected influences by processing the data less quickly.

Integrity of data, also with real-time data warehousing

But this data needs to be correct. The correlation between the data needs to be correct. Therefore, do not register an order under customer number 1234 first, and only afterwards register that customer. That is because the total turnover per customer will not match with the total turnover The data needs to become available to the data warehouse in the proper order. That will usually work out; most modern systems put the entered data in a neat database right away. Processing the data in the proper order will work out just fine as well as long as the data is all sorted on the basis of a time stamp, such as:

  • Time 15:13:43, system Customers, customer 1234 entered
  • Time 15:13:44, system Orders, the first order of customer 1234 entered

To make this possible, all delivered transactions need to be provided with a time stamp like the one in the example.

Beware: if the delivery of one system is temporarily halted while the rest continues to supply, then the real-time character of your data warehouse deteriorates. Sometimes this is used as an argument to compromise the integrity of the data. I personally think this is a bad solution. If every user of the data warehouse needs to keep track of these kinds of data errors, then the data warehouse is demanding too much of the user and offers insufficient added value. If you use the data warehouse for external financial reporting you will not be able to escape having to add additional measures in the control framework in order to compensate for this shortcoming.

Time stamp for a real-time data warehouse

The time stamp needs to meet a number of properties:

  • Sorted according to reality: you need to able to sort the transactions, independent of the supplying system, on the basis of the time stamp to make sure they are in the same order as they have really been registered in.

  • Traceable to the reality of the user: you need to be able to see from a time stamp to which reporting date it belongs to much like a user can see the same on a clock on the wall.

A time stamp that meets the first demand could be a company-wide unique transaction counter for example. A counter like that, regardless of the technical problems, is hard to be traced to the reality of the user (the second demand).

A time stamp taking the form of a date and time does not fully meet the first demand. If multiple transactions take place within the greatest accuracy of the time stamp, then it is impossible to see in which order they have taken place. However, a time stamp with date and time does meet the second demand. Most users work with the accuracy of a second at most themselves. The agreements regarding worldwide time zones make it possible that you can easily convert times in both directions.

Within Invantive Producer we use a combination of the date/timestamp (for example h_datum_start) and a company-wide transaction counter (for example h_id of transactie_bijgewerkt) for these reasons.
A company-wide transaction counter is sometimes not practically achievable. Its need can sometimes be resolved in a different way. If you can limit yourself to a required accuracy of for example 50 milliseconds for reports, then a company-wide counter is not necessary. Instead, you can suffice with the self-organizing of all transactions in such a time period. This ordering can be done automatically on the basis of a normalized data model, based on metadata like in an Invantive Producer repository for example. In the example above, you would first enter all customer data into the data warehouse that have been added in the time period and only then the order. The approach of ‘keep trying until all processed’ is also possible.

The last problem you encounter is the establishing of a company-wide date/time stamp. If the accuracy of about 50 milliseconds suffices here as well, then the NTP protocol (NTP: Network Time Protocol) should work just fine. Possibly this could be combined with similar concepts specific to certain operating systems such as Windows Time Service.

NTP: Network Time Protocol

With NTP you can synchronize time with other references. Usually this happens at operating system level so that all applications running on a server enjoy this pleasure. The synchronization is accurate to an infallible degree, for example a few tens of milliseconds. Such a reference for the right time can be an ‘ultimate’ source of time for the entire company or just for a colleague-server which you are trying to come to a consensus to in terms of joint time. But normally you try to link with world time.

An atomic clock is, according to NTP, the ultimate source for the correct world time. Depending on the number of steps that your reference is to the atomic clock, your number of steps is equal to its number of steps plus 1. Every step introduces a small measuring error. The number of steps until the ultimate reference is therefore an indication of the accuracy of the time. With NTP the number of steps is called ‘stratum’. In practice you usually have a stratum somewhere between 3 and 5, a stratum between 10 and the maximum 16 generally leads to a moderate to poor synchronization of the time. A detailed explanation about NTP is available at http://www.ntp.org and at Wikipedia.