6IT

Herodotus writes history... in binary and to Database

[[This is Chapter 20(g) from "beta" Volume VI of the upcoming book "Development&Deployment of Multiplayer Online Games", which is currently being beta-tested. Beta-testing is intended to improve the quality of the book, and provides free e-copy of the "release" book to those who help with improving; for further details see "Book Beta Testing". All the content published during Beta Testing, is subject to change before the book is published.

To navigate through the book, you may want to use Development&Deployment of MOG: Table of Contents.]]

When speaking about OLTP databases and reporting/analytics built on top of them, there is one issue that is frequently ignored – it is the issue of historical data.

One observation which stands for pretty much any real-world system, is that

99% of reporting requests and 99.9% of analytics is purely historical

Moreover, for most of OLTP databases out there, 99%+ of their size is history. In other words – unless we’re speaking about rather exotic stuff (such as high-frequency trading) – it is all about history.

Audit Tables

From practical standpoint, it means that for pretty much every transaction modifying something in your DB, you need to add a record to an audit table – explaining what exactly was done – and why. For example, a row in the audit table for modification in USERS adding the USER money tends to have information such as “what was done” (USERID, DELTAMONEY, and probably RESULTINGMONEY), “when it was done” (TIMESTAMP), “who has done it” (for example – ADMINID or NULL/USERID).

One all-important property of audit tables (at least for those tables/fields which you care about) is that

Information within the audit table should be sufficient to validate/justify current state.

What to Audit?

From the statement above, it follows that if you have MONEY field in your USERS table (corresponding to real dollars/euros/etc.) – your audit table should provide enough information to see (and to show your player/user) how it happened that she has exactly this amount of money in this field. The same goes for ANY critical information, including artifacts (the ones which may cost $20k on eBay) etc. etc.

Regardless of what you’re doing – there will be times when your players will ask you “hey, where is my {money|artifact|…}?!” – and you’d better have a good explanation.

The rationale for this is simple. Regardless of what you’re doing – there will be times when your players will ask you “hey, where is my {money|artifact|…}?!” – and you’d better have a good explanation. The explanation is necessary to see that (a) you don’t have a bug, and (b) to convince your player/user that you neither have a bug nor deliberately cheating them. And if you don’t do proper audit (where “proper” = ”implemented according to the statement above”) – then, well, both these tasks become pretty much impossible :-( (and no, saying Google-style “we do no evil” won’t fly – unless, of course, you’re Google 😥 ).

On the other hand, if speaking about auditing RPG player inventories - it MAY become too expensive to audit all of the movements of all the items within the player's inventory. On the other hand, it is not really necessary too - chances that you will be asked to prove how the mundane item such as usual healing potion has "disappeared" from the inventory, are very slim (and even if you're asked - public complaints about it "disappearing" won't make too big waves in the community almost for sure). Still, if we're speaking about unique artifacts (sold on eBay for thousands of real dollars) - having audit trail for them is necessary. In practice, it means that you MAY need to resort to a compromise, auditing only those ultra-expensive artifacts (and skipping audit for mundane items).

Another typical example of what needs to be saved in the audit table – is information about player and admin logins. While (as a rule of thumb) login audits are NOT going to be too interesting to your players (unless they are hacked) – your security team will LOVE login audits for sure; it is an immensely important tool for figuring out relations between supposedly independent players, and it helps to identify bypassing bans / collusions / ToC violations A LOT. At this point, we're not interested in how to use this information (we'll discuss it in Vol. 3, tentatively Chapter XXXIII on Anti-Cheating) - but are interested just in how to collect it.

One Central AUDIT Table or Multiple Audit Tables?

One all-important question which arises when designing your audit tables, is the following. As you’re going to save LOTS of different audits, we need to decide “whether it is better to have separate audit tables (such as “one for auditing USERS, another one for auditing LOGINS”) – or one all-encompassing AUDIT table?” Actually, as noted in Stackoverflow.AuditTables – there are at least FOUR different options in this regard (plus "Event Sourcing" as a fifth one). Let’s discuss them one by one:

No audit table, just “last change” fields in the main table.
- TBH, I don’t know of real-world scenarios when this approach is viable. It is not really an audit (and cannot be used for the purposes mentioned above) – plain and simple.
One single AUDIT table. Usually contains fields such as “TABLE_ID”, “FIELD_ID”, and “NEW_VALUE” / “OLD_VALUE”.
- This will work, but often will be not-too-optimal.
  This will work, but often will be not-too-optimal. Single AUDIT table tends to have increased contention (which is not too good by itself) – and finding of stuff within such AUDIT is (while certainly possible) is not too easy either (both SQLs are cumbersome and performance on retrieval is hit due to an extra index scan involved; while the latter is usually a non-issue for OLTP, it may be an issue for reporting replicas).
An audit table for each table to be audited. These audit tables can be either “clones” of the respective tables-to-be-audited (i.e. the table to be audited plus audit fields such as “who”, “when”, and “why”) – or can be implemented along the lines of single AUDIT table above (i.e. recording just the changes).
- This approach is perfectly viable – and quite optimal too.
- Keep in mind though that to establish a global order of events in the system, we need to have a global audit ID [[TODO:harmonize with section on IDs below]]
- Note that at least for some types of audit tables (in particular, when auditing money movements) I prefer to have significantly extended "clones" (or to avoid "clones" completely). What is important from my perspective is to have app-level fields such as “REQUESTED_OP” and “DELTA” for each audit record (for example, as a part of "audit fields" added to the "clone"). The reason for it goes along the following lines:
- - if we just have a “bare minimum clone” (without DELTA and REQUESTED_OP) - we effectively hide the real operation approved by whoever-initiated-it, and are merely assuming that the system has handled it correctly. It may lead to potential questions of “was it a really $10K transfer which this person in support has inappropriately authorized – or it was $10 authorized as he says, and it was a bug leading it to look as $10K in the audit?”; these questions are extremely unpleasant in concurrent multi-write-connection environments (where demonstrating that there is no possible race which can lead to such effects, can be extremely time-consuming), but even for single-write-connection architectures, I feel MUCH safer if I have a very straightforward field which directly corresponds to the intention of the transaction – before I say it was indeed $10K requested (and effectively accusing that person who authorized it, of wrongdoing).
  - note that in a certain sense, these "REQUESTED_OP" and "DELTA" fields can be seen as an implementation of "Event Sourcing" pattern (though if you want to use the goodies provided by this pattern - make sure to see more detailed discussion of it below).
- Another consideration when answering eternal "to clone or not to clone" question, is that if we're updating a single field - the whole row may be not easily available. While it is always possible to get the whole row after the update – it may incur substantial additional cost (and in multi-connection environments - make sure not to forget about transaction isolation levels).
- This is an interesting trick, as it eliminates quite a few UPDATES – and this (obviously) tends to improve write performance. However there are some drawbacks too...
  An extension of clone-style (or "extended clone-style" as described above) audit tables is to have the cloned audit table similar to the one described above, but without the original table-to-be-audited 🙂 . For example, instead of usual USERS and USERS_AUDIT, we’ll have just USERS_AUDIT (and not USERS). This is an interesting trick, as it eliminates quite a few UPDATES – and this (obviously) tends to improve write performance. However:
- - While logically this model is perfectly equivalent to having two tables, the cost of retrieval of the “current value” when we don’t have USERS table and user USERS_AUDIT instead, goes up. For example, if we’re speaking about MONEY field in USERS table – and are using classical USERS plus USERS_AUDIT, then to retrieve MONEY for specific USERID (which we need all the time in OLTP) we’ll need to issue simple “SELECT money from USERS where USERID=?”, which will result in a “point” index access (and for “point” indexes we can use “hash” indexes, which have roughly O(1) complexity and pretty low costs). On the other hand, if we’re using USERS_AUDIT without USERS, then to retrieve MONEY for USERID, we’ll need to issue SQL such as “SELECT money FROM users_audit where USER_ID=? ORDER BY AUDIT_ID DESCENDING LIMIT 1(or equivalent). This will normally result¹ into an execution plan which uses index scan over the two-field index on (USER_ID,AUDIT_ID) – and reads only one record out of it, aborting the index scan after the very first record is read (due to LIMIT clause or equivalent). However: (a) this index needs to be a ranged index (such as btree index)² as AUDIT_IDs in the request represent a range; and (b) search in a btree index (by USER_ID, before actual index scan even starts) will take O(log(N)) rather than O(1). As a result, such requests tend to suffer in performance (compared to straightforward SELECT money FROM users).
  - - On the other hand, if we’re using app-level caches (enabled by single-DB-connections, and I’m arguing for both these techniques) – app-level caches will eliminate this “cost of reading” problem almost entirely.
  - IMO even more importantly, this audit-only model doesn’t allow for easy truncation of the audit tables (actually – moving them out of OLTP DB, see discussion below), and relying on truncation/moving AUDIT records was Damn Important for quite a few serious OLTP DBs I’ve seen. In short – having your time-critical OLTP DB growing infinitely is going to hurt its performance, so we’ll likely want to truncate AUDITS (or more precisely – to move them into super-replicas as discussed below).
  - - On the other hand, it IS possible to truncate AUDITS table without affecting SELECTs of the last value – it is just more cumbersome. However, while reasonable execution plan does exist for such selective truncation³ – I am not sure whether it is easy/possible for any particular DB to write an SQL which will result in this execution plan (from what I’ve seen, this kind of trickery usually lies on Extremely Difficult part of spectrum – or even beyond :-( ). As a result – if you want to go the way of AUDIT-only tables and do need truncation - make 100% sure that such selective truncation is possible on your RDBMS without killing the whole thing.
- Yet another option is to use "Event Sourcing" pretty much as described in Stackoverflow.AuditTables. The idea is to record the whole history of the events incoming to our DB Server (while in usual audit we're recording results of the changes), and then we'll be able to rely on supposedly deterministic nature of our system to get all kinds of deterministic goodies (including Temporal Queries and Complete Rebuild). As I am a big fan of everything deterministic 🙂 , I am quite interested in this kind of stuff. However, there are a few issues to be kept in mind when implementing about Event Sourcing in practice:
- - for event sourcing to work, it is absolutely necessary to have your processing perfectly deterministic; otherwise, all hell breaks loose (in particular, if on one run you've calculated one winner of pretty much anything, and on the replay there will be a different one - you will regret the very moment when you decided to go for Event Sourcing). Worse - in presence of the arbitrary concurrent access, determinism is not possible (even Serializable transaction isolation is not sufficient to achieve determinism: Serializable just guarantees that some order exists between transactions, and does NOT guarantee that this order will be exactly the same on multiple runs). Therefore, at the moment I see only three ways to implement "event sourcing":
  - - using single-modifying-DB-connection architecture as discussed in [[TODO]] section above. Well, I am a fan of single-modifying-DB connections too, so I have no problems with it 🙂.
    - with certain app-level restrictions on concurrent transactions, it is possible to make system deterministic. However, I'm arguing for a long while that pushing synchronisation problems to the app level is a Really Bad Idea - and usually do NOT recommend this route.⁴
    - [[TODO! - deterministic replay over non-deterministic recording]]
  - [[TODO!: dangers of code changes if ever discarding state]]
  - I'd argue for having "events" written at the same transaction as the results of their processing; separating these two things, while possible, won't speed things up (and also special mechanisms to restore consistency after the crash will become necessary).
  - truncation of EVENTS table will be a problem (and replaying the whole history of the billions of events to restore current state is not going to be practical either). As a result - snapshots (such as online backups) will be necessary, and replay will need to work from the snapshot (NB: there is a direct analogy with Circular Logging for deterministic replay of Reactors, as was discussed in Chapter V).
  - "Event Sourcing" does NOT prevent from using usual audit tables - and the latter can be more usable for reporting too. On the other hand - for OLTP DB writing both incoming events and audit can be seen as excessive. The balance between the two depends a LOT on the specifics of your application.

Overall, all these solutions (except for the "last change" fields without any history) will work in a more or less reasonable manner. Personally, however, for Really Important Tables I tend to prefer per-table "extended clone audit tables" - with information in audit fields of these "extended clones" effectively representing event (in a sense used by Event Sourcing) which has lead to this change (and if events are referenced in more than one row/table - they can be moved to one or more of separate EVENTS tables with EVENTID referenced in audit); this IMO makes a quite a nice hybrid between classical audits and event sourcing - and it allows to check the hypothesis of being correct too (including possible violations of deterministic behaviour).

¹That is, if the table is long enough and your SQL compiler is smart enough; also we should have created that two-field index in advance

²Well, in theory it is possible to have “hybrid” hash+btree index, but I don’t know if such thing exists in practice

³start scanning the same index as the one used for SELECT, and delete all the records on the way EXCEPT the first one

⁴that is, unless it is a one-off very obvious restriction

On Audit IDs

The next all-important thing about the audits – is Audit_ID. In this regard, a few things need to be observed:

You DO need an Audit_ID for each of your audit records; it MUST be monotonic too. This is necessary because in addition to being a way to simply identify audit records - Audit_ID needs to define a 100% reliable ordering for them (as the final state of the system can easily depend on the ordering of incoming events).
- in real-world current computer time (whatever it is) can happen to go backwards
  Using TIMESTAMPs in lieu of Audit_ID doesn’t really cut it – first, TIMESTAMP is not guaranteed to be unique,⁵ and second – in real-world current computer time (whatever it is) can happen to go backwards 🙁 .
- - Of course, I am not arguing for abolishing TIMESTAMPs in audit records. Rather, I am arguing for having BOTH Audit_ID (as a counter strictly guaranteed to be monotonic) AND some kind of TIMESTAMP.
- Using DB-provided transaction IDs (see, for example, Stackoverflow.AuditTables or Stackoverflow.AuditTables or Stackoverflow.AuditTables) is better than TIMESTAMPs, but in case of multiple connections by not too much☹️ . This happens because DB-provided transaction IDs will be usually ordered according to beginning of the respective transactions, and not according to the commit of transactions - and in presence of multiple concurrent transactions these orders can be different 🙁 .
- - Still, transaction_IDs such as those above may be useful to match audit records in different audit tables (if per-table Audit_ID is used).
- One way to implement Audit_ID - is to use an (auto-)incremented field; however, in presence of concurrent transactions special care should be taken to ensure that reading/increment of this field is guaranteed to happen in exactly the same order as the changes to be audited, occur; this is heavily dependent on transaction isolation levels you're using (and on their implementation within your specific RDBMS), so figuring it out can be tricky ☹️.⁶
To be able to reconstruct the whole history as it happened (and to find bugs if there are any) – Audit_ID SHOULD be incremented for each new audit record (or for each new ACID transaction) in your whole database – even if you’re using several separate audit tables; in other words – I’m advocating against having per-table Audit_IDs.
- This rule is often ignored in real-world DBs (and it is not fatal either), but it can lead to very unpleasant situations when the problem becomes untrackable due to the order of the Audit_IDs being different from the actual order of transactions. Worse, probability of it happening grows with the load☹️.
- If using single-writing-DB-connection – such globally incremented Audit_ID is trivially achieved (by reading and caching Audit_ID as max() on all audit tables on the app start, and merely incrementing it in-memory each time audit record is made). Otherwise – achieving single Audit_ID incremented over all the audit tables may be (depending on your RBDMS) not so trivial, and/or can lead to performance hits. One possible way to bypass it is to use DB-provided transaction ID (see above) alongside your per-table Audit_ID; per-table Audit_ID will supposedly guarantee correct order of Audit_IDs compared to other changes (though depending on isolation levels - see above) - and DB-provided transaction ID will provide correct matching between the different Audit_IDs.
- - Bottom line: for classical multi-connection DBs – and if your DB doesn’t allow for fast retrieval of current transaction ID – it is up to you whether to use such global Audit_ID; otherwise – it is a Very Nice To Have feature.

⁵NB: on some RDBMS it may be possible to achieve almost-unique TIMESTAMPS (more specifically – “unique provided that system time doesn’t go back” – which happens ☹️).

⁶of course, if you're using single-writing-DB-connection - such problems do not exist at all 🙂

Truncating/Moving Audit Records

One more thing typical for audit/historical records in heavily-loaded OLTP systems, is related to truncation of the audit records. Motivation for it goes as follows:

As historical/audit records grow linearly with time – it is only a matter of time until they take any given size
As historical/audit records grow linearly with time – it is only a matter of time until they take any given size. In practice – if kept unchecked, they will take 99%+ of the DB size very soon. This leads to decreased performance (for example, it might become no longer possible to keep the whole OLTP DB in DB caches).
On the other hand, most of audit records, while being necessary for reporting purposes, tend to be unnecessary for OLTP decision-making.

Armed with these two observations, the following approach has been seen to be highly efficient:

At some point of operation of the heavy-loaded OLTP system, an asynchronous reporting replica is created (see “Stage 3” in [[TODO]] section above) – to take some load off OLTP into read-only async replica
However, while read-only replicas take away load from OLTP DB – OLTP DB still continues to grow in size (causing gradual performance degradation)
At this point (see “Stage 3a” in [[TODO]] above) – it is possible to start truncating historical data from OLTP (keeping the data in reporting replica(s)).
- This has been seen to arrest growth of OLTP DB, which in turn allows to keep OLTP as lean-and-mean as possible
- - In particular, it allowed to keep OLTP DB size for a game serving hundreds of thousands of simultaneous players, within 100G (probably smaller size if special measures are taken)
  - - This in turn allowed to keep 100% of OLTP DB in RAM – which clearly helps with performance (and eliminates performance spikes)
  - It should be mentioned that truncating unused historical tables is a rather heavy process from the performance point of view; a few tips in this regard:
  - - Truncation SHOULD be done in off-peak time
    - Truncation is usually best to be done in relatively small bunches (of the order of 100-500 rows) per transaction; much smaller bunches (~1-5 rows) will have too much overhead, and much larger ones (~100-500K) will cause too long locks and too much stress on the DB log.
    - Even if your architecture is generally “single-writing-connection” one (which I am advocating for), truncation still can be usually done from the second “writing” DB connection. This is possible because if only “very old” data (and unused by OLTP itself) is truncated, this second connection:
    - - Cannot possibly affect things such as app-level cache coherence (that is, as long as OLTP doesn’t use this historical information at all)
      - isn’t likely to create any contentions (except for inevitable contention over DB log) with the first writing DB connection

It should be noted that in case of OLTP history truncation and moving it to replicas, “reporting replicas” are no longer really “replicas”, but are instead primary sources of otherwise-irrecoverable information.

It should be noted that in case of OLTP history truncation and moving it to replicas, “reporting replicas” are no longer really “replicas” (which can be restored from the original), but are instead primary sources of otherwise-irrecoverable information. This hasn’t been a big practical problem, but has some implications (in particular, backups of these “super-replicas” become necessary).

On the other hand, truncating OLTP history is not without its own drawbacks; in particular –AUDIT tables without the tables to be audited (described above, and allowing INSERT-only processing) becomes much more difficult (if possible at all); still, IMO for majority of game-like processing, I’d choose truncation over INSERT-only approach (though if these too approaches can be combined – that would be even better 😉).

If we’re going to truncate the data anyway (and are not going to use it within OLTP DB) – a logical question of “do we really need to write audit in the first place?” arises. I cannot tell that I’m a big fan of not writing audits into OLTP (pushing the audit records to the replicas from the very beginning instead) – because of the potential loss of the audit records in such scenarios. On the other hand, for some audit records (for example, for not-so-critical stuff such as logins etc.) – it might be a viable approach, especially if you’re using multiple replicas (as such losses, while pretty much inevitable if you're not writing your audit to the DB, will be very few and far between).

[[TODO: pre-audit, post-audit, full-audit]]

[[To Be Continued...

This concludes beta Chapter 20(g) from the upcoming book "Development and Deployment of Multiplayer Online Games (from social games to MMOFPS, with social games in between)". Stay tuned for beta Chapter 20(h), where we'll discuss "compiling" your SQL bindings (which are useful for quite a few reasons)]]

References

[Stackoverflow.AuditTables] "Audit tables: Each field for table or one table"

[Stolze] Knut Stolze, "Transaction IDs in DB2"

[Kyte] Tom Kyte, "How to get unique transaction id of the current transaction?"

[MSDN.tran_current_transaction] "sys.dm_tran_current_transaction (Transact-SQL)"

[Fowler] Martin Fowler, "Event Sourcing"

6IT Hare

Historical Data in Databases. Audit Tables. Event Sourcing