C++ for Games: Performance. Allocations and Data Locality

C++ performance

[[This is Chapter 16(a) from "beta" Volume V of the upcoming book "Development&Deployment of Multiplayer Online Games", which is currently being beta-tested. Beta-testing is intended to improve the quality of the book, and provides free e-copy of the "release" book to those who help with improving; for further details see "Book Beta Testing". All the content published during Beta Testing, is subject to change before the book is published.

To navigate through the book, you may want to use Development&Deployment of MOG: Table of Contents.]]

[[TODO: cross-platform C++, with a real-world example going across PC/Mac/NDK]]

Using C++ in the context of games implies quite a few considerations, many of them being performance-related. As usual with performance, there are many things which are subjective and/or depend on the specifics, so it is important to take everything you hear about it (this book included) with a good pinch of salt. Still, I’ll try to provide a few pointers and my personal feelings in this regard.

On Performance in General

One thing which is very important when we’re speaking about performance, is so-called 90/10 law:

90% of the time is spent in 10% of the code, and only 10% of the time in the remaining 90% of the code

This law tends to stand for games too (ok, for some games the ratio might be as small as 70-30, but it is still quite uneven, as there are corner cases and corner cases of corner cases, and so on).

This observation means that

Not all the code needs to be absolutely optimized

In particular, the main question we’re facing during the development, is the one “whether this code is bad enough performance-wise to be rewritten right away, or it can wait until we run into performance problems and we can use profiler to identify the problem instead of guessing?”

My approach in this regard usually goes along the following lines:

Using the code which is orders-of-magnitude inefficient (and is executed more frequently than once-in-a-blue-moon) usually qualifies as a Pretty Bad Idea even for not-so-frequently-executed code. It becomes an even worse idea if we’re speaking about inefficiencies comparable to O(N), where N is total number of your players (or even number of players-per-server).
- if your interfaces are forcing your code to be Really Inefficient – they generally should be changed at the earliest opportunity (as later rewrite can easily amount to rewriting The Whole Damn Thing).
  It becomes even more important to make sure that all the things which dictate the code to be that inefficient, and are used in many places across the code, are fixed as soon as possible. In other words, if your interfaces are forcing your code to be inefficient – they generally should be changed at the earliest opportunity (as later rewrite can easily amount to rewriting The Whole Damn Thing).
- On the other hand, if the inefficiency in question is very local and can be fixed without affecting anything else, it can be allowed to live longer.
It is usually good to use some practices which allow you to optimize “for free”. In other words, if all it takes is to use “++it” instead of “it++” (see also [[TODO]] section below for more examples) – it is usually a Good Idea to do it.
It is Really Important to have Very Clean separation between different parts of your code, at the very least between your infrastructure-level code and game-logic-level code (but different parts of game-logic-level code usually need to be separated too - for example, as separate Reactors, see Chapter V for details). Such separation has numerous benefits, but from optimization point of view - such Very Clean separation opens the door for optimizing your code later, when it becomes necessary (and when it becomes apparent which pieces of code become bottlenecks in real-world scenarios)

As you can see, despite providing some (hopefully useful) hints, the approach outlined above still leaves LOTS of room for interpretation. Which, in turn, brings us to the most important statement of all:

Finding “what to optimize and what to leave as is for now” doesn’t not really qualify as science. It is an art.

On “noisy neighbor” code and “Resource Hogs”

One further thing to keep in mind with regards to 90-10 (or 70-30) rule is that even if only performance of 10% of the code matters,

the rest of the code can still affect performance of critical 10% in a Pretty Bad Way 😔

Moreover, this effect will be observed even if we dedicate 100% of one of our cores to our critical code

It is related to the fact that LOTS of CPU (and more generally - computer) resources are de-facto shared between different pieces of code. Moreover, this effect will be observed even if we dedicate 100% of one of our cores to our critical code¹ 😔. Such sharing happens because at least some of the CPU caches are shared between different pieces of code (and if we don’t dedicate 100% of a given core to a single task, even per-core caches will be effectively shared due to context switching), memory is shared too, and even some blocks of CPU are often shared between threads concurrently-run-on-the-same-code due to hyper threading a.k.a. simultaneous multithreading a.k.a. SMT.

And as soon as we have a resource which is shared between non-critical piece of code and critical piece of code – well, it can happen that non-critical piece of code will overuse this resource to the point that it will start affecting the critical code 😔. For example, if our non-critical code reads LOTS of RAM – it can easily cause CPU to start kicking data-which-belongs-to-critical-code from L3 cache (an effect known as cache pollution), which in turn will affect performance of our critical code.

These observations are usually not that bad in practice, but they lead us to one important consequence:

Even being non-critical doesn’t give a right to be a Resource Hog

¹ for example, via affinity masks

On Server Performance

In certain circles of gamedevs (especially some of those developers coming from client-side) there is a belief that the performance only matters on the client. In extreme cases, it may result in statements along the lines of “[performance is] not an issue on the backend, where scaling up a service can mean nothing more than dragging a slider to the right” GlazerMadhav.

Well, there is only one objection against this statement, and it is that the statement is deadly wrong. First of all, linear scalability should not be taken for granted (hint: scaling DB in linear manner is very difficult at least, see also Chapter [[TODO]]). Second, even when you do have linear scalability,

“dragging a slider to the right” is going to cost your company.

Let’s consider a hypothetical example first. Current industry standard for multi-player games revolves around 1000 players/2S-workhorse-server (or 100 players/core, which is more or less the same thing). Now let’s consider a scenario when you didn’t care about server performance at all, and wrote your game in a way so it runs only 10 players/server. And then your game got Really Successful and grew to a million of simultaneous players, which means that you need to run 100’000 of the servers. Not only finding 100’000 of servers to rent will be quite a task by itself, but also even at $250/server/month you’ll need $25M/month just with server costs (effectively requiring you each of your players to pay at least $25/month just for servers, and that’s without any profit or even money to pay your salary). Things become even worse when we’re speaking about costs for Free-to-Play (a.k.a. F2P) games, where most of the players are not paying; for F2P even 100 players/server MAY easily put your game at risk (and also each and every bit of extra cost will put you in a competitive disadvantage compared to your better-performing competition).

performance differences between different implementations CAN be REALLY HUGE.

And depending on many factors, performance differences between different implementations CAN be REALLY HUGE. Coming from hypothetical clouds back to real-world, let’s consider another example. Once upon a time, I’ve seen two games. One has had 400K simultaneous players, and another one – around 100K, and from the player's perspective the game was 99%-the-same. The former company, however, was running its game from mere 50 servers, while the latter one took 400 servers to run. It means that the first one has managed to get 8'000 players/server, while the second one was at 250, a whopping 32x difference in number of players per server (and therefore in server costs per player). And the difference was pretty much mirrored by the size of admin team, too, ouch. Now you’re in a good position to make a guess - which of these two companies has had more money for advertising?²

One last note in this regard – I do NOT mean that you should aim to absolutely-optimize performance of your Servers³ (and I do NOT rule out other-languages-than-C++ from server side too). All I want to say is that

Server-Side Performance Does Matter Too

And when going beyond this statement – see above about optimization being an art rather than science; which level of inefficiency to consider as “unacceptable from the very beginning”, is one of those choices which you’ll need to make for yourself.

For the rest of this Chapter, whenever speaking about best practices performance-wise, I’ll mean that they apply BOTH to Client-Side and to Server-Side (that is, unless it is explicitly stated otherwise).

² Yep, you guessed it right

³ Neither you should aim to absolutely-optimize performance of your Clients. While this may sound as an heresy to some of the Client-Side gamedevs out there, I stand for this statement

Allocations are Evil. Even More So Because of Locality

In the context of performance, first of all, let’s discuss one thing which IMHO tends to affect performance of C++ programs the-third-most⁴ – it is allocations.

The first important thing to remember for all allocations (whether it is C malloc() or C++ new) is that

Allocations are Expensive

Sure, there were quite a few improvements with allocation during last 10 years (including so-called thread-caching in tcmalloc and per-thread arenas in ptmalloc2), but well – it is still a rather slow operation (give or take, we’re speaking about the order of magnitude of 200 CPU clocks per allocation/deallocation pair, and more if you’re not so lucky with your allocator).

BTW, when it comes to discussions of "smart pointers are evil" - actually it is not smart pointers per se, it is allocations-inevitably-arising-from-smart-pointers, which affect performance of your code; on the other hand, if you replace smart pointer with a naked-pointer-allocated-via-malloc - it is very unlikely that you'll see any performance gain. More on it in [[TODO]] section below.

Allocations and Data Locality

In addition to allocations being relatively slow, even worse is the fact that

Allocations tend to reduce data locality

In many (most?) cases, poor data locality caused by allocation, incurs slowdown which is worse than just the cost of the allocation itself 😔. However, the concept of data locality is not that obvious (nor it can be easily separated from other effects), so let's see what is it all about.

During the times of Ancient Gamers (also known as times of Atari, ZX Spectrum, and Commodore64), CPUs were running at clock speed of a single-MHz (and one operation usually took several clocks). At the same time, memory run at about the speed comparable to that of CPUs.

During the times of Ancient Gamers (also known as times of Atari, ZX Spectrum, and Commodore64), CPUs were running at clock speed of a single-MHz (and one operation usually took several clocks). At the same time, memory run at about the speed comparable to that of CPUs. For example, on Z80 CPU⁵ one register-register op took 4 CPU clocks, and the same operation but register-memory, took 6 clocks.

Since that time, however, things have changed considerably. A register-register operation currently routinely takes less than 1 CPU clock (don’t ask me how it is possible 😉⁶), but reading from (main) memory takes 100+ CPU clocks. In other words, while memory latencies have been reduced since ancient times by a factor of 10x,⁷ over the same time frame CPU's pure number crunching speed has improved by a factor of 1000x+. This discrepancy was addressed by all kinds of caches (Z80 didn’t have any cache at all, and in recent years, it is common to have at least three levels of CPU caches, known as L1-L3, with an occasional L4). Not really surprisingly, these caches and their mechanics tend to have profound impact on the performance.

I don’t want to go into too much discussion of caches etc. (for this, you can read Drepper07, though beware of some outdated concepts such as Northbridge and Front-Side-Bus which are currently replaced with modern NUMA-related stuff, which is discussed, for example, in Lameter13 and GaudEtAl15[[TODO: if you know an up-to-date all-around reference covering BOTH usual RAM stuff and NUMA, please LMK]]).

It means that as soon as you’ve accessed any single byte in a cache line, all the other 63 bytes are already in L1 cache

However, I will make a few observations which are applicable to caches and programming:⁸

At physical level, all the work with memory is usually made via so-called “cache lines”. On x86/x64, cache line is 64 bytes for many years now
- It means that as soon as you’ve accessed any single byte in a cache line, all the other 63 bytes are already in L1 cache
- In addition, if your program exhibits a pattern of sequential reading, CPU usually does so-called “prefetch”, getting you the next-in-pattern cache line just in case you MIGHT need it.
Register-register operation is around 1 CPU clock
L1 read is like 4 CPU clocks these days (typical L1 cache data size is around 32K). Usually L1 is per-core
L2 read is around 10-12 clocks (typical L2 cache size is around 256K). L2 is usually per-core these days
L3 read is 30-50 clocks (typical L3 cache size is around 4-12M). Unlike L1-L2, L3 is usually shared between cores and is per-socket(!)
Main RAM access is around 100-150 CPU clocks
NUMA
Non-uniform memory access (NUMA) is a computer memory design used in multiprocessing, where the memory access time depends on the memory location relative to the processor. Under NUMA, a processor can access its own local memory faster than non-local memory (memory local to another processor or memory shared between processors).
— Wikipedia —
NUMA accesses. NUMA applies only to multi-socket configurations, i.e. to servers and (almost-)only to servers:
- With NUMA, each CPU has its own RAM connected to it (and only to it); there is no more SMP/FSB-style “RAM directly connected to all the CPUs”
- - Whenever CPU A needs to access RAM connected to CPU B, CPU A issues a request to CPU B which goes over ultra-fast CPU-to-CPU connection (for example, Hypertransport for AMD, or Quick Path Interconnect a.k.a. QPI for Intel)
- NUMA access to another socket, with data happening to be in other-socket L3, costs 100-200 CPU clocks
- NUMA access to another socket, with data happening to be in RAM, costs 200-300 CPU clocks

Armed with this information, let’s take a look at a classical in-memory binary tree structure (the one used by STL’s map<> etc.); let’s assume that our tree/map has a million elements. Now, let’s see what happens if we want to look for something within the tree: even if our tree is reasonably well-balanced (which it is with STL), we’ll need to go through 20 nodes of the tree (as 2^20 ~= 1e6). And if each of our elements is allocated in heap, it means that we’ll need to make about 20 jumps in memory. In practice, some of these tree nodes will happen to be cached, but some won’t (causing so-called cache miss when we're trying to access them), so from a certain point in our lookup we’ll start incurring that 100+-clock penalty for each of these jumps(!).

Which leads us to a very important conclusion:

Data Locality DOES matter.⁹

In this regards, let’s first note that some of the allocators¹⁰ use some (though usually limited) trickery to improve data locality. This is a Good Idea, and in quite a few scenarios having such trickery MIGHT more than compensate for the allocator formally exhibiting a bit less performance than competition. Moral of the story: do NOT trust 3^rd-party measurements regarding performance of allocators, and compare such performance exactly within the context of your game.

For such ad-hoc tree-like structures, solution is not THAT easy, though if you run into problems, at least some mitigating approaches DO exist.

Let’s further note that while the data locality problem with STL-like trees can usually be addressed by switching to hash tables¹¹ a.k.a. std::unordered_*<> containers (though they do have their drawbacks in the worst-case scenarios, so you need to be careful), most of our programs have their own ad-hoc trees. One example is a list-of-players consisting of Player objects, each of Players containing player-name, and a list of Friends, and so on, and so forth; even more obvious example is a classical Scene Graph routinely used for 3D rendering (see, for example, Chapter 16 from McShaffryGraham). For such ad-hoc tree-like structures, solution is not THAT easy 😔, though if you run into problems, at least some mitigating approaches DO exist 🙂 .

⁴ after (a) using-O(N^2)-algo instead of O(N) one, and probably after (b) using blocking code instead of non-blocking one

⁵ or did it stand only for Z80’s predecessor i8080, and was improved for Z80 a bit? Doesn’t matter much for our purposes anyway

⁶ Ok, if you’re Really Interested – you may want to take a look at Wikipedia.MicroprocessorDesign as a starting point

⁷ Note that here we’re speaking about “RAM access as it is seen from CPU”, not about “CAS latency”, and these two, while being of the same order of magnitude, happen to be quite different

⁸ the numbers below are for modern-as-of-2016 x64

⁹ actually, both spatial locality and temporal locality matter, but we’ll speak mostly about spatial locality here

¹⁰ at least ptmalloc2, though make sure to check your favourite allocator – it might doing it too

¹¹ with a hash table and no collisions, you have at most 2 jumps over RAM, but with 1e6-element tree you can get up to 20, and this can make LOTS of difference

Lack of Locality. How Bad Can it be In Practice?

how bad the impact of bad data locality can be in practice?

You may feel that the above is more like a horror story and/or FUD; after all, everybody uses allocations and dynamic data structures (and all non-C/C++ programming languages are doing it at even larger scale). So, a reasonable question arises: how bad the impact of bad data locality can be in practice? I’ve also had such a question for quite a while, until I’ve got a very enlightening experience in this regard.

Once upon a time, there was a real-world program which performed certain calculations on a large in-memory ad-hoc-tree-like data structure: there were structures and collections of other structures within these structures, and so on, and so on. And it worked, but at certain point it started to use a bit too much memory (several Gigabytes), and we decided to “flatten” the structure – not to improve speed, but to save some RAM (by eliminating not-really-necessary extra pointers, and there were LOTS of them). So, I’ve rewritten the data structure, “flattening” it (more specifically, some of the structures were "flattened" into their immutable versions right after they were fully created; see more discussion on flattening in "Mitigation: Flattening" section below).

And yes, as a result of the rewriting we’ve got that about-2x reduction in memory size, which we were planning for. What we didn’t plan for, was that

The program as a whole has started to work 5x faster!

I should note that program was an almost-ideal case for speed-up due to flattening (it had small tree-like structures which flatten very nicely and into small-size blocks too), and 5x result should be seen as the one close to the maximum real-world speed improvement due to flattening. On the other hand, it does illustrate that gains due to flattening MIGHT be VERY compelling not only on paper, but in real world too.

Almost-Natural Locality: Reactors

Each of Reactors has its own state which is not accessed by anybody else

As we’ve discussed in Chapters V, VI, and VII, I suggest to implement as much as possible of your game (both on Server-Side and Client-Side) as Reactors (a.k.a. event-processing objects). We’ve already discussed many Very Significant Benefits of Reactors in those Chapters (including, but not limited to, Replay Testing and Production Post-Mortem), but from our current point of view, there is one additional benefit of Reactors which matters: it is that Reactors tend to have relatively good locality themselves.

Indeed, each of Reactors has its own state which is not accessed by anybody else. And as the size of state is generally limited - yes, it tends to help with locality a bit (though the larger the state of one single Reactor is – the more you may need to resort to other techniques to improve locality).

However, there is one thing which you need to do to exploit this property of Reactors; it is that

You need to use a separate allocator for each of your Reactors

When speed is an issue, I prefer to have explicitly separated allocators for separate Reactors (or at least to have an ability to do it this way)

Otherwise, if you're running dozens of Reactors within the same process, all of them using the same (usually default) allocator - then all the data from your different Reactors will be spread over one large default allocator, which will pretty much kill better-Reactor-locality. This negative effect is mitigated to certain extent by allocators exhibiting some thread affinity themselves (such as tcmalloc and ptmalloc2), but when speed is an issue, I still prefer to have explicitly separated allocators for separate Reactors (or at least to have an ability to do it this way). As a side benefit, such per-Reactor allocators can usually be made thread-agnostic (avoiding all the locks completely) – at the cost of using a different (global and thread-safe) allocator for message composing/parsing.

To make sure that you’re using “correct” allocators, you will probably need to wrap all the allocating data structures (such as std::basic_string<> and std::vector<>) into your own classes using your own allocators, and use these "your own" classes instead of usual std::string, std::vector etc. IMHO, having such wrappers qualifies as a Good Idea even if you end up using default std::allocator for your data later (at least you will be sure that you’ve tried everything and work in the best possible mode), but most of the time you'll be able to see some gains in this regard.

Mitigation: Reducing number of Allocations

The most obvious way to reduce the number of allocations and improve spatial locality is, well, avoiding allocations wherever possible. Three most common sources of allocations in most of the programs are (a) strings, (b) collections, and (c) "owning" pointers (which can be implemented either via smart pointers, or via manual allocations).

With strings, it is often possible to:

reduce number of strings, or at least reduce amount of strings actively used during time-critical processing
reduce number of strings, or at least reduce amount of strings actively used during time-critical processing. One example is to use integer IDs instead of player names for all internal purposes (especially for search keys); you still MAY have that std::string member within your Player structure, as long as you do NOT use it most of the time. You still may need to use it for rendering on the Client-Side (which is not a problem performance-wise), and for state sync on the Server-Side (in this case, “Flattening as an Intermediate Stage” may help, see below).
rely on short string optimisation (a.k.a. SSO) if your std:: library does it. See StackOverflow.StdString for details.
use eastl::fixed_string (for more on eastl, see also [[TODO]] section below)
use fixed-size char s[SZ]; instead of std::string (DON'T forget to encapsulate char array¹² and enforce all the appropriate bound checks to avoid nasty buffer-overrun bugs). If you know that your strings are up to 8 bytes in size, there is usually little sense to use std::string (if your std::string does support SSO, then you'll end up with 24+ bytes instead of 8, otherwise - you'll end up with allocation, which is even worse); even for 20-30 byte strings, using some extra memory to avoid allocation is often worth the trouble.
- On the other hand, making a decision “whether to use char[] or std::string in this particular case” is not always easy, so I suggest to have your own class MyString<max_size> for the use within the long-term-saved structs, with MyString deciding itself whether it uses char[] or std::string depending on max_size. This way, you will be able to adjust the balance between memory usage and locality (i.e. speed) without any changes to your app-level code, via changing MyString. Alternatively, MyString MAY use fixed_part_size approach described below for vectors. In a sense, this MyString is very similar to short string optimisation mentioned above, but is more flexible (also we can see MyString as a DIY version of eastl::fixed_string).

With collections, one very common case which can be optimized, is vectors which (usually) happen to be very small (like up to 5-10, though YMMV). With such vectors, it is often possible to replace them with fixed-size arrays. On the other hand, if outright restriction on array size is too harsh (or is too large), I often use my own class MyVector<T,fixed_part_size>; this class has T[fixed_part_size] as an array, and also some other fields related to potential allocation. Then, if the vector size is smaller than fixed_part_size – MyVector<> stores everything within the array, and avoids allocation completely; if the vector grows larger than fixed_part_size – it goes into allocation (for example, into a usual std::vector). MyVector<> can even have an interface which is identical to std::vector<> (though dealing with iterators may be not-so-trivial). BTW, similar thing is available in EASTL as eastl::fixed_vector<> (and they have eastl::fixed_string too; more on EASTL in [[TODO]] section below).

To avoid allocation in this case, it is often possible to 'guess' the maximum size (and maximum required alignment) of all your derivative classes...

For owning pointers, you CAN try to avoid them too. One common example for owning pointers is an owning pointer to a polymorphic base class, with actual object pointed to, being an object of derivative class (and then you can benefit of all the virtualisation goodies). To avoid allocation in this case, it is often possible to "guess" the maximum size (and maximum required alignment) of all your derivative classes, and then to have something like alignas(GUESS_MAX_ALIGN) uint8_t buf[GUESS_MAX_SZ]; instead of your "owning pointer", using "placement new" to construct your objects within buf. NB: use of asserts (or even better - static_asserts) to check that ALL your subclasses fit into this space (and are properly aligned too), is ABSOLUTELY REQUIRED! NB2: yes, it does expose your code to the knowledge of all the subclasses; whether performance is worth this complication - is up to you; as a rule of thumb - do NOT use this trick until you're sure that you need it. NB3: this technique MAY be used alone, or with the offsets as discussed in "Mitigation: Flattening" section right below.

¹² probably within a template class

Mitigation: Flattening

As I’ve mentioned above, flattening can provide very substantial performance benefits for complicated data structures. Let’s take a closer look on what flattening looks like.

Let’s say that we have an on-heap struct which has a string, vector of ints, and some other fields. Normally, such struct uses three allocated blocks: first one for the struct itself, second one for a string, and the third one for vector. To “flatten” it, we want to put all this data into one single allocation block.To do it, we can:

allocate one block of sufficient size
place struct in the beginning of the block
then, place string-data
then, place vector-data

Pointers to string and vector-data, MAY stay as pointers (pointing to appropriate places within the same allocation block), or you MAY replace them with offsets (or, for string-data – avoid offset altogether, as it is constant by design).

Bingo! We’ve got out flattened data structure, and can traverse it too.

Bingo! We’ve got out flattened data structure, and can traverse it too. As discussed above, you can expect data traversing over such a flattened structure to be significantly faster than traversing over original data structure, due to significantly better data locality. However, this comes at a price: modifying such “flattened” structures, while possible, is more complicated and expensive 😔. In other words: “flattened” structures are optimized for reading (and non-size-changing modifications), while usual C++ structures-with-ad-hoc-allocations are more like all-around jacks-of-all-trades (not too bad for modifications, not too good for reading). If you keep this observation in mind, it MIGHT be possible to achieve significant optimizations without rewriting too much of your program 😉 .

Now let's consider several flattening scenarios which are common for games, and see how they work with regards to this observation.

Flattening Modifiable Reactor State (such as Game World State)

One not-so-common-but-perfectly-feasible approach is to flatten the whole state of your Reactor (as one example, you may want to flatten your Game World State).

you generally SHOULD NOT try to completely flatten your whole state into one single allocation block (otherwise most of modifications will become hopelessly long).

In this case, you’ll need an ability to modify your flattened state. As a result, you generally SHOULD NOT try to completely flatten your whole state into one single allocation block (otherwise most of modifications will become hopelessly long).

Instead, it is better to reduce the number of allocated blocks, while keeping the size of each allocated block limited. This way, you’ll be able to modify flattened-structure-within-allocated-block at a limited cost (even if block modification requires reallocation-and-complete rewrite of the block, the cost stays bound as block size is limited).

If flattening of Game World State (the version which you normally work with) is your intention, then usually it is better to split your Game World State into reasonable-size objects (approx. 64-1024 bytes in size each), allocating each of them separately. This way, you will still keep an ability to modify your flattened structures at not-so-high cost, but will still save a lot on locality issues. You will still need to do some jumps in RAM to traverse your Game World State, but you’ll need MUCH less of them compared to traditional unflattened C++ allocation-based data structures.

Flattening as an Intermediate Stage: Client-Side

In games, there are at least two specific scenarios when flattening can be used without the need to modify flattened data structures at all.

The first such scenario occurs on the Client-Side, when we need to render our Client-Side game world state for the next (Client-Side) tick/frame. Then, it MIGHT be a good idea to create a flattened version of our game world state; after creating this flattened version – two Big Client-Side Tasks we have, can work perfectly in parallel.

The first Big Client-Side Task we have is rendering of the current-world-state; this is done over essentially-constant flattened structure, so all the traversings are very fast. The second Big Client-Side Task is modifying of the current game world state to figure out the world state for the next tick; this is done over usual C++ tree-like structure, which is easily modifiable.

Flattening as an Intermediate Stage: Server-Side

[[TODO: a case for Server-Side flattening suggested by Glenn Fiedler]]

Flattening using FlatBuffers

Indeed, FlatBuffers serialized data looks very much like the “flattened” structure which I’ve described above

One interesting trick with regards to flattening, is to use FlatBuffers (even when you do NOT want to send anything across the network). Indeed, FlatBuffers serialized data looks very much like the “flattened” structure which I’ve described above – and they already have all the automated code generation stuff to make it happen without much efforts from your side.

So, if you take your data structure, and serialize it using FlatBuffers – you will be able to use this FlatBuffers-serialized-version as a “better faster” version of your data structures. Using serialized data as a faster version to access the data might sound crazy – but with FlatBuffers most of the time it will indeed work exactly this way.

The only significant drawback in this regard is that last time I’ve checked, the only way to modify FlatBuffer (known as mutation in FlatBuffer-speak), didn’t support size-changing mutations at all. Other than that – while I didn’t try it myself, I love the idea of using FlatBuffers for flattening of the data structures stored as part of state of your Reactor (for example, a state of your Game World).

Oh, and keep in mind that if you want to use FlatBuffers to represent Game World State in a “Flattening Modifiable Reactor State” scenario – recommendation to keep size of allocated blocks reasonably small (usually of the order of 64-1024 bytes) still stands. In case of modification – you can either use FlatBuffer mutation, or (when Flatbuffer-provided mutation doesn’t fly) – build new FlatBuffer from the existing one, adding or removing fields as necessary.

On Data Locality and Hyperthreading

As you most likely know, most of x86/x64 CPUs have a feature known as Hyperthreading (HT).¹³ Actually, one of the Biggest Ideas behind HT is to have CPU’s ALUs etc. something to do while it waits for data from RAM 😉. Among other things it means that if locality of your data is Really Good, your program MAY perform better without HT.¹⁴

While on the Client we do NOT normally control HT, on the Server I suggest to try running your program both with and without HT, and see which one performs better.

¹³ actually, HT is a reincarnation of well-known-in-non-x86-world SMT as “Simultaneous MultiThreading”

¹⁴ and yes, it has been confirmed by some practical observations too, though YMMV

On Prefetch and _mm_prefetch()

One further thing which has significant impact on performance in the context of data locality, is so-called “prefetch”. In short – if you read memory in sequence, modern CPUs recognize the pattern, and get you the next cache line “just in case” if you might need it. I don’t want to go into lengthy discussions on “how prefetch is implemented” etc., but will concentrate on “what we can/should do about prefetch from the developer’s point of view” instead.

Most of the time, prefetch just silently works behind the scenes, and I didn’t see cases when messing with prefetch at application-level would be worth the trouble.

Most of the time, prefetch just silently works behind the scenes, and I didn’t see cases when messing with prefetch at application-level would be worth the trouble. At least in theory, however, such cases do exist.

If you still want to play with manual prefetch (also known as “software prefetch”), on x86/x64 you MAY want to take a look at the _mm_prefetch() intrinsic function (available both with GCC and MSVC). The idea of this function (actually an asm instruction from x86/x64 instruction set) is to inform CPU that you’re about to need certain memory location, so it can start getting corresponding cache line into cache (that is, if the line is not in cache yet). The idea is that when you’re done with some other stuff which happened between _mm_prefetch() and actual memory access, you will get your data from L1 cache. BTW, it implies that _mm_prefetch() is useless for writing; writing pretty-much-never-blocks anyway (and doesn’t require prefetching).¹⁵

Once again – do NOT expect miracles from _mm_prefetch(); moreover (unlike some cases of flattening), it is very unlikely to be worth the trouble (except, maybe, for very specific very tight algorithms already identified as Big Fat Bottlenecks). If you still feel you do need manual prefetch – make sure to read LeeEtAl. BTW, on ARM CPUs there is a somewhat similar PLD instruction (and at least on GCC, you have __builtin_prefetch() intrinsic).¹⁶

¹⁵ for modern CPUs, all the writes are buffered at CPU level one way or another, so you don’t need to wait until it is really written to RAM

¹⁶ technically, PLD is available only for ARMv7 and better, but this includes “pretty much any smartphone/tablet CPU out there”

On Locality within the Cloud

You might think that "hey, who cares about these locality things in the cloud age!". However, it is a bit more complicated than that.

First of all, let's note that locality DOES apply to cloud-hosted apps. While for cloud-hosted apps locality effects will indeed be somewhat diluted by "noisy neighbours" and you will indeed get less control, better-locality apps will still perform better than worse-locality ones (in particular, because characteristic times for locality effects to exhibit themselves are several orders of magnitude shorter than characteristic times of VM/hypervisor).

Now, let's note that whether it is worth to spend time on Server-Side locality issues (or it is better to pay extra for hosting) - is an open question. Ideally, I'd try to have a set of APIs which allow to write without caring about locality (allowing to release your game earlier a.k.a. ensuring better time-to-market) - but will allow to work on better locality later, saving you on server costs when they start to matter.

On writing Your Own Allocator

The fact that allocations are expensive, is rather well-known among gamedevs. As a result, it is fairly common to write their own allocators. I won’t say whether it is a good idea or bad idea (it Really Depends), but will note a few important observations instead:

unless your allocator is Very Specific to your case, it is Very Difficult to beat good modern general-purpose allocator algorithm-wise
unless your allocator is Very Specific to your case (like in “all my allocations have exactly the same size”), it is Very Difficult to beat good modern general-purpose allocator algorithm-wise
- At the very least, make sure to read ptmalloc2 to see what ptmalloc2 is doing, and realize which of their stuff you want and which you don’t
- If developing your own allocation algorithm, DO take locality into account. For example, LIFO allocators tend to work significantly better than FIFO ones due to better temporal locality.
That being said, not all the platforms have built-in allocators which are that good
- Therefore, there MIGHT be merit in using the same library-based allocator across all your Client platforms. In this case, ptmalloc2 and/or tcmalloc are two rather usual candidates for porting
While it is difficult to beat existing allocators algorithm-wise in a general case, it IS possible to beat them if you can restrict allocator functionality and create specific allocators (such as "no-lock thread-agnostic allocator", or "LIFO-only allocator")
Usually, I am arguing for having several different allocators for different purposes. At least, having one-instance-of-allocator-per-Reactor (as mentioned above in "Almost-Natural Locality: Reactors" section) is usually a Good Idea
- In particular, such per-Reactor allocator can be made thread-agnostic (avoiding all the nasty locks). NB: if you're using this technique, you usually need to make sure that a different thread-safe allocator is used for the inter-Reactor messages.

[[TODO: optimizing data structures depending on the way they're going to be processed, example vector-of-structs vs struct-of-vectors]]

[[To Be Continued...

This concludes beta Chapter 16(a) from the upcoming book "Development and Deployment of Multiplayer Online Games (from social games to MMOFPS, with social games in between)". Stay tuned for beta Chapter 16(b), continuing discussion on performance of C++ game code beyond allocations and data locality.]]

References

[GlazerMadhav] Joshua Glazer, Sanjay Madhav, "Multiplayer Game Programming: Architecting Networked Games (Game Design)"

[Wikipedia.MicroprocessorDesign] "Microprocessor Design", Wikipedia

[Drepper07] Ulrich Drepper, "What Every Programmer Should Know About Memory", 2007

[Lameter13] Christoph Lameter, "NUMA (Non-Uniform Memory Access): An Overview"

[GaudEtAl15] Fabien Gaud, Baptiste Lepers, Justin Funston, Mohammad Dashti, Alexandra Fedorova, Vivien Quéma, Renaud Lachaize, Mark Roth, "Challenges of Memory Management on Modern NUMA System"

[ptmalloc2] sploitfun, "Understanding glib malloc"

[McShaffryGraham] Mike McShaffry, David Graham, "Game Coding Complete, Fourth Edition"

[StackOverflow.StdString] "Why does libc++'s implementation of std::string take up 3x memory as libstdc++?"

[FlatBuffers] "FlatBuffers"

[LeeEtAl] JAEKYU LEE, HYESOON KIM, and RICHARD VUDUC, "When Prefetching Works, When It Doesn’t, and Why"