SIMD, memory locality, vectorization, and branch point prediction

The title of this post lists the three most important considerations for performant code these days (late 2023).

SIMD

GPUs can do a lot of compute in parallel. High end ($15K to $30K) GPUs like they use in big tech perform thousands of operations in parallel (50 teraflops for the H100). The catch is that they want all of those operations to be done in lockstep on different data. This is called single-instruction, multiple data (SIMD). Matrix operations, as used by today’s neural networks, are easy to code with SIMD.

GPUs cannot do 1000s of different things at once. This makes it challenging to write recursive algorithms like the no-U-turn sampler (NUTS) and is one of the reason people like Matt Hoffman (developer of NUTS) have turned to generalized HMC.

GPUs can do different things in sequence if you keep memory on the GPU (in kernel). This is how deep nets can sequence, feedforward, convolution, attention, activation, and GLM layers. Steve Bronder is working on keeping generated Stan GPU code in kernel.

Memory locality

Memory is slow. Really slow. The time it takes to fetch a fresh value from random access memory (RAM) is on the order of 100–200 times slower than the time it takes to execute a primitive arithmetic operation. On a CPU. The problem just gets worse with GPUs.

Modern CPUs are equipped with levels of cache. For instance, a consumer grade 8-core CPU like an Intel i9 might have a shared 32 MB L3 cache among all the cores, a 1MB L2 cache, and an 80KB L1 cache. When memory is read in from RAM, it gets read in blocks, with not only the value you requested, but other values near it. This gets pulled first into L3 cache, then into L2 cache, then into L1 cache, then it gets into registers on the CPU. This means if you have laid an array x out contiguously in memory and you have read x[n], then it is really cheap to read x[n + 1] because it’s either in your L1 cache already or being pipelined there. If your code accesses non-local pieces of memory, then you wind up getting cache misses. The higher up the cache miss (L1, L2, L3, or RAM), the more time the CPU has to wait to get the values it needs.

One way to see this in practice is to consider matrix layout in memory. If we use column major order, then each column is contiguous in memory and the columns are laid out one after the other. This makes it much more efficient to traverse the matrix by first looping over the columns, then looping over the rows. Getting the order wrong can be an order of magnitude penalty or more. Matrix libraries will do more efficient block-based transposes so this doesn’t bite users writing naive code.

The bottom line here is that even if you have 8 cores, your can’t run 8 parallel chains of MCMC as fast as you can run 1. On my Xeon desktop with 8 cores, I can run 4 chains in parallel, followed by another 4 in parallel in the same amount of time as I can run 8 in parallel. As a bonus, my fan doesn’t whine as loudly. The reason for the slowdown with 8 parallel chains is due not due to the CPUs being busy, it’s because the asynchronous execution causes a bottleneck in the cache. This can be overcome with hard work by restructuring parallel code to be more cache sensitive, but it’s a deep dive.

Performant code often recomputes the value of a function if its operands are in cache in order to reduce the memory pressure that would arise from storing the value. Or it reorders operations to be lazy explicitly to support recompilation. Stan does this to prioritize scalability over efficiency (i.e., it recomputes values which means fewer memory fetches but more operations).

Vectorization

Modern CPUs pipeline their operations, for example using AVX and SSE instructions on Intel chips. C++ compilers at high levels of optimization will exploit this if the right flags are enabled. This way, CPUs can do on the order of 8 simultaneous arithmetic operations. Writing loops so that they are in blocks of 8 so that they can exploit CPU vectorization is critical for performant code. The good news is that calling underlying matrix libraries like Eigen or BLAS will do that for you. The bad news is that if you write your own loops, they are going to be slow compared to loops optimized using vectorization. You have to do it yourself in C++ if you want performant code.

Another unexpected property of modern CPUs for numerical computing is that integer operations are pretty much free. The CPUs have separate integer and floating point units and with most numerical computing, there is far less pressure on integer arithmetic. So you can see things like adding integer arithmetic to a loop without slowing it down.

Branch-point prediction

When the CPU executes a conditional such as the compiled form of

if (A) then B else C;

it will predict whether A will evaluate to true or false. If it predicts true, then the operations in B will begin to execute “optimistically” at the same time as A. If A does evaluate to true, we have a head start. If A evaluates to false, then we have a branch-point misprediction. We have to backtrack, flush the results from optimistic evaluation of B, fetch the instructions for C, then continue. This is very very slow because of memory contention and because it breaks the data pipeline. And it’s double trouble for GPUs. Stan includes suggestions (pragmas) to the compiler as to which branch is more likely in our tight memory management code for automatic differentiation.

Conclusion

The takeaway message is that for efficient code, our main concerns are memory locality, branch-point prediction, and vectorization. With GPUs, we further have to worry about SIMD. Good luck!

21 thoughts on “SIMD, memory locality, vectorization, and branch point prediction

  1. I know Intel’s ill fated Itanium architecture gets lots of shit (much deservedly) but I think the basic idea that the compiler should do alot more to communicate to the hardware the data dependencies and what addresses need to be loaded in the cache.

    It’s crazy that compilers spend so much time and effort writing instructions that pretend to be executed in a naive sequential fashion and then for the hardware to spend so much effort prefetching and speculatively executing.

    I get that this is needed to hide architectural details but given the success of things like Rosetta seems to open up the possibility of a more hardware aware true machine code with software distributed using higher level VM instructions.

    • Already tried with dataflow architectures, and failed. Those dataflow architecture are technically what are inside modern CPU : register renamming and OOO execution are just a way to transform a sequential machine code to a dataflow-like machine code.

  2. Bob:

    Thanks for posting this explanation. These things are so mysterious to me. As a user, what I want to do is plug a GPU into my USB port and then Stan will just run faster. That is, I think of hardware the way that a researcher will typically think of statistics or the way that a person will typically think of his heart: it’s an incredibly useful tool that, if it works well, will not be noticed.

    • One, your notebook’s CPU isn’t up to the task, and two, you can’t get the GPU close enough or provide enough cooling. We’re talking about clusters that cost millions of dollars (or at least one well-cooled desktop in the $60K range).

      • Well, he could use a thunderbolt eGPU, which should be compatible with most mac USB C ports. As far as I know, stuff like the Razer Core X should support CUDA, OpenCL, and anything else nVidia. But the slower transfer rate probably will bottleneck anything that needs frequent communication between the CPU and GPU.

        Some people also do silly nonsense like this

        https://youtu.be/Cw2zn_H1WOA?si=BGhDInTaCys8886O

        where you liquid nitrogen cool a chip and pump the clock speed way up. Before long though, most of those cycles would be blocking on data transfer or memory access anyways though.

      • ” one well-cooled desktop in the $60K range).”

        Huh??? Arent you off by a factor of TEN there??? The last time I checked, top-of-the-line gaming peecees were 3 to 4K. That was with everything water cooled. Of course, to keep prices under control you might have to build it yourself. Or put up with higher prices for a name brand with better service.

        Obviously, if you need to use multiple top-of-the-line server class GPUs (N100/A100, whatever the current model is), that’s another thing, but a latest/greatest Intel I9 plus GTX4090 really ought not be more than 5K$. The first two i9 13900/4090 peecees that appear on a search on Amazon Japan are under 600,000 yen, which is US$4,000. Albeit from companies you’ve never heard of.

        My general thoughts on using cloud services is that if it’s not coming out of your employer’s (or grant’s) pocket no questions asked, you shouldn’t be doing that. Cloud CPUs really can’t provide more computation _per thread_ than a 6K$ desktop (although, again, the server GPUs are noticeably better). If you know how to radically parallelize the non-GPU parts of your program, then server farms might the way to go. (The last sever CPUs I looked at had like 64 processors, gobs and gobs and gobs of local cache, and seriously zippy communications. But the Intel i9 and Apple M3 performance cores are no slouches, and there are getting to be more and more of them per chip at the high end.)

        • Both CPUs and GPUs used for scientific computing are ridiculously overpriced costing more than 10k each. Scientific hardware is way more powerful than consumer hardware and has other useful features, like fast higher precision floating point arithmetic. And you wouldn’t be ok with just one card, right? 60k is already cheap, specialized hardware or signal processing FPGAs used, for example, in military applications can cost more by a factor of 10.

        • I thought you were going to object to me lowballing the price!

          I was talking about something with a nearly state of the art GPU, CPUs, and 1TB of fast memory.

          To get something that can do research-grade transformer training, you’re probably looking at an order of magnitude more because you need to be able to chain together four or eight GPUs to get enough GPU memory. At least that’s the scale of computing people are using here (Flatiron Institute) to generate results for their NeurIPs submissions.

        • “To get something that can do research-grade transformer training,”

          Uh, I doubt that’s what Andrew’s doing. He wants something faster than his notebook.

          FWIW, from experience, a new higher-end notebook with some sort of notebook GPU is terrible for running “AI”, i.e. current NN Go programs. Even a 3080 in an older i7 is like 3 orders of magnitude faster (tens of nodes vs. tens of thousands of nodes).

          So twice the money gets you 1000 times the performance, and another 10 to 20 times that money gets you another order of magnitude of performance. It’s clear where the curve bends.

          But, ROFL. Actually purchasing server-grade hardware never crossed my mind. But that’d be wackier than purchasing cloud services. Maintenance, power, cooling. Peecees are already in the 1 KW power dissipation range, and something multiply times more powerful would draw that many times more power. Not for the generic academic’s office. Running a multiple KW heater in the summer wouldn’t be fun.

          Meanwhile, for getting to the point where it makes sense to purchase cloud services, an M3 or i9 + high-end consumer GPU gets you way more computation than a notebook.

        • > FWIW, from experience, a new higher-end notebook with some sort of notebook GPU is terrible for running “AI”, i.e. current NN Go programs. Even a 3080 in an older i7 is like 3 orders of magnitude faster (tens of nodes vs. tens of thousands of nodes).

          I don’t know about Go programs but three orders of magnitude is a lot. That’s the difference between ten minutes and one week. For other “AI” things like running LLMs a two-year old MacBook Pro is well within an order of magnitude of a 3080 – and actually faster if the model is larger than the 10-12Gb of memory in the 3080.

  3. One thing that’s exciting about the Apple M series of chips (other than fat L2 caches) is unified memory. They just have one big block of memory that’s fully addressable by the CPU and GPU simultaneously, while other desktop designs require an expensive copy. I’m wondering if this can allow for significant speedups of code with smaller parallel chunks. Like, use a big matrix multiplication on the GPU with serial symplectic integrator steps. I suppose with multiple parallel chains in stan this could bottleneck. And maybe GPU vs CPU mat mul has like a row-major/column-major mixup or something else I’m not seeing. As far as I can tell, this hasn’t been used; I think OpenCL apps like stan will still copy the data unnecessarily.

    One day, I will be a serious enough person to learn OpenCL, CUDA and the like. One day…

    • The ARM architecture is actually pretty good with Stan because it is good with ad hoc memory access. It’s also good at multi-threading for the same reason. You can see this running Stable Diffusion through DiffusionBee (get the 2nd edition). But it’s not as fast as running the same thing on a $1.5K GPU and that can’t compete with a $15K GPU (which even more importantly scales GPU memory).

    • There’s also the on-demand GPU “dynamic caching” that Apple announced a few days ago, eg: https://www.apple.com/newsroom/2023/10/apple-unveils-m3-m3-pro-and-m3-max-the-most-advanced-chips-for-a-personal-computer/

      > It features Dynamic Caching that, unlike traditional GPUs, allocates the use of local memory in hardware in real time. With Dynamic Caching, only the exact amount of memory needed is used for each task. This is an industry first, transparent to developers, and the cornerstone of the new GPU architecture. It dramatically increases the average utilization of the GPU, which significantly increases performance for the most demanding pro apps and games.

      Patent here, I think: https://patents.google.com/patent/US20210271606A1/en

      > Techniques are disclosed relating to dynamically allocating and mapping private memory for requesting circuitry. Disclosed circuitry may receive a private address and translate the private address to a virtual address (which an MMU may then translate to physical address to actually access a storage element). In some embodiments, private memory allocation circuitry is configured to generate page table information and map private memory pages for requests if the page table information is not already setup. In various embodiments, this may advantageously allow dynamic private memory allocation, e.g., to efficiently allocate memory for graphics shaders with different types of workloads. Disclosed caching techniques for page table information may improve performance relative to traditional techniques. Further, disclosed embodiments may facilitate memory consolidation across a device such as a graphics processor.

      Another layer in the “memory locality” discussion would be storage, eg using a fast multi-TB NVMe as a cache drive, or having to read in multiple terabytes of data — similar considerations apply (sequential I/O operations being much faster than random ones).

    • The shared memory of Apple M serie is rarely a big advantage : you remove some copîes (generally unfrequent) but have to share the memory bandwitch to both CPU and GPU. And scientific code with a lots of matrices are very memory-bandwitch intensive. And also : it screw up caching, because there is no clear cache coherency between CPU and GPU (GPU don’t even have cache coherency for their cache). Bad idea in general, even if it save monetary cost.

    • The GPUs of the Apple silicon chips are sufficiently fast for many natural language processing tasks with billion parameter models; the trick is to compile and structure architectures amenable to the operations optimized for the chips. The CPU itself is also quite fast, with lower level operations accessible via the higher-level language of Swift via Accelerate/BNNS/etc., and writing asynchronous code is about as easy as it can get, given Swift’s async/await approach. The end result is that it’s possible to build quite useful neural-based applications entirely on-device. That then removes a lot of complications for end-users (with the hardware details fading into the background, as per Andrew’s comment above)—no more messing with CUDA/etc.; no more sending to the cloud; responsive graphical interfaces instead of command lines: just open your laptop and start. That’s what we’ve done at my startup, Reexpress AI.

      (Incidentally, our natural language data analysis software may be of interest to readers of this blog, as it provides a means of adding uncertainty quantification to neural language models.)

Leave a Reply

Your email address will not be published. Required fields are marked *