04 October 2020
There has been a lot of discussion coming to my attention recently regarding data oriented design (DOD) or data oriented architecture (DOA). It is something very close to my heart because I love optimisation; I have to do it for all the products I work on as a software engineer and so I am always thinking about the data first. Being data oriented is thinking about the data you have and the transformations you need to make, to get the output you desire on the hardware you are targeting.
When it comes to DOD a lot of the discussion (sometimes the only discussion) is about entity component systems (ECS) using structure of arrays (SoA) to improve cache utilisation. Every man and his dog now knows about “The Cache” and I hear them throw around the term “Cache Coherency” incorrectly (when it actually describes a different problem present in CPU architectures). I have worked with a number coders who continually try bringing cache efficiency optimisation into discussion at the wrong place and the wrong time, having never actually applied any DOD themselves. This is very frustrating because it is not something that has a one size fits all solution, they just picked up a few buzzwords thinking they know it all and some people have spent many, many sleepless nights grinding out those final CPU cycles to hit a target frame rate.
My interest in DOD all started when I was optimising CPU code for PS3 and Xbox360 games, I was deep in crunch and trying to grind out a 60hz fixed timestep update on a single PowerPC Unit (PPU) where having to shave off even a few milliseconds became increasingly difficult. We had PC ports of the projects which ran smooth as butter and there was a time I did not know the inherent differences in CPU architectures of PPU vs x86/64. But then I found great talks and blog posts from other devs explaining what was happening and how DOD was a way to solve the issues. This talk from Mike Acton was the canonical one of the time.
The PPU that I was optimising for was only capable of in-order execution or incapable of out-of-order (OoO) execution. This meant that CPU stalls while waiting for L2 cache-misses were one of the biggest penalties you were likely to see (~600 CPU cycles). The profiling tools for various platforms made this abundantly clear by highlighting cache-misses, but if you didn’t know about the problem of in-order execution and tried to optimise code at the instruction level, it would make little difference. The performance cost here was not as a result of the PPU doing too much work that needed reducing, it was because the PPU would be idle while waiting for data and unable to process anything else. CPU stalls can occur for a number of reasons; we may be reading from disk, we may miss the instruction cache or data cache, and even at the instruction level we have to wait some amount of CPU cycles for the result to complete. DOD helped to minimise the stalls and maximise data throughput. Compilers schedule operations to hide instruction latency, loop unrolling is an example of this and is effective because we can issue and interleave operations while we wait for latency in the pipeline. Modern, more advanced CPU’s can do more effective things with OoO hiding even greater latencies.
CPU pipelines these days are typically much more complex than the older PPU’s so the same techniques which worked on a PS3 might not be applicable or necessary today. PPU had costly penalties for branching, especially when SIMD was involved you could perform additional redundant ALU operations, which got masked out instead of taking a branch. These branch evasion strategies were also crucial in shader code on GPU’s of the same era. Modern x86 CPU’s have good branch predictors and I have found that taking a branch can be more efficient than doing more ALU and masking a result, but again it all depends on the hardware and what you are trying to do.
DOD doesn’t have to be (and shouldn’t be) all about SoA like some people seem to think. My first optimisations with my new found insights were quick and dirty because we were at the end of a development cycle. I made changes which split out transforms and matrices into linear arrays from the ~400 byte monolithic
GameObject base class and generated matrices from transforms, did hierarchical matrix multiplications, and performed frustum culling on a more cache efficient set of data. I also generated a list of active physics objects to lookup and update only active objects, instead of dereferencing a polymorphic object and drilling through a call stack only to early out when the object was inactive. This list was updated each frame as a byproduct of another update operation so was almost for free. These were simple optimisations that improved data cache utilisation, but also just reduced general overhead of calling functions and pointer chasing. They were by no means the most efficient solution, but they were certainly way more efficient than the previous implementation and they helped to ship a game with a fixed timestep at 60hz for every level of the game.
There are a lot of things that I consider to be data oriented optimisations. Loading binary files that have been prepared offline into a runtime ready layout so they can just be loaded straight into RAM with no unpacking or computational work is data oriented. Preparing textures that are already in a tiled GPU specific memory layout so it can just go straight into GPU would be an example of this, however if you have a large amount of data, and a platform with small storage, you may need to use precious CPU cycles to perform work to uncompress it so that it can fit onto flash storage, this is also DOD. You need to fit your data into the constraints of the hardware, runtime performance is not the only goal, memory usage might be a limitation, bandwidth might be an issue… and so on.
The last game I shipped is a live action (FMV) game with a huge amount of video data. The video data needs to be streamed from the internet, buffered onto HDD, loaded into RAM, and decoded on the GPU. We can playback videos and take many different storyline branches with video frames ready to render at literally zero latency because of prefetching, buffering and some lockless CPU/GPU synchronisation. The tools and build processes to manipulate data into forms that make it possible to achieve these goals are also data oriented. It isn’t just the runtime; from the initial conception and authoring of content until it is transmitted into a consumers eye balls is just one big long data oriented process.
Flippantly people say “we have larger CPU caches and clever branch predictors now… we don’t need DOD anymore”, do not discount the number of memory or performance limited embedded systems around, and in use today. Do not underestimate the difference a few microseconds saving inside a hot loop can save money (cloud services) or increase revenue (trading systems). Until you have a concrete application you cannot begin real data oriented design… The real process begins when you understand your architecture, profile your code, measure the results and find out what works best in that scenario, be it performance, data size or whatever else you may need to achieve.
Having said all of this, I do have my own game engine which has a SoA ECS, I will write a post about this, the history behind it and some of the decisions I made, if anyone cares to listen.
At the end of the day DOD is just all about the data. The king is the data, long live the data.