AMD vs NV Drivers: A Brief History and Understanding Scheduling & CPU Overhead

Hi Guys, today’s topic is one that I’ve been wanting to do for a long time now. I figured it should be covered before Vega arrives. So what is it about? Well, you often hear these mantras repeated: that in DX11, “AMD has more driver overhead”, or that somehow “NV’s drivers are much more efficient”, while in DX12 or Vulkan we often hear the reverse, that “AMD is better at next-gen APIs”. That’s the observed effect, yet there have been few in-depth attempts at explaining why.

It is a very complex topic to delve into. It took me a while to think about how best to explain it in simpler terms, in a way that allows gamers to better appreciate the journey of both NVIDIA & AMD in recent times. Before I start, let me make this disclaimer: I’m just a software guy, unaffiliated with any of these corporations. What I present is based on my own understanding, and I could be wrong; it is ultimately up to you, if you are curious enough, to do your own research if you have doubts about what I say.

In order to tell this tale and do it justice, we have to go on a short trip back in history. Let’s start with Fermi, the architecture behind the GTX 480, the really hot, loud and power hungry beast from NVIDIA back in 2010. The biggest reason it was so power hungry was that it actually had a very powerful Hardware Scheduler, with advanced features that pre-date AMD’s GCN architecture. Fermi had very fast Context Switching, allowing it to interleave graphics & compute workloads quickly. It also supported concurrent execution of multiple different kernels, allowing PhysX or CUDA to run in parallel with graphics rendering. Do these things sound familiar? They should, because that feature set is DX12 & Vulkan capable; NV just never enabled it in their drivers. The other problem with Fermi was that TSMC’s early 40nm process was horrible, and it exacerbated the power usage.

NV did tame Fermi with a redesigned chip in the GTX 580, but that initial hot & hungry problem badly hurt NVIDIA’s pride. It would revolutionize the way NVIDIA designs GPU architectures afterwards, with everything focused on efficiency. With Kepler, NVIDIA decided to remove a major part of the Hardware Scheduler, making the GPU purely dependent on software, i.e. the drivers, to process and optimally distribute workloads, and this led to a good efficiency increase. Interestingly, just as NV moved away from Hardware Scheduling, AMD moved towards it in their new GCN architecture. The reason is multi-faceted, but it involves the Fermi-based Teslas dominating High Performance Computing & Super Computing workloads, where having a good Hardware Scheduler is a big advantage.

AMD wanted to chase after this lucrative market, so they had to have a powerful & dynamic GPU compute architecture. Secondly, AMD worked with Sony to develop GCN, and on the consoles they have APIs that are vastly different from, and superior to, those on Windows PCs. Now, with that short hardware history out of the way, let’s move on to the inner details of how AMD & NV’s drivers currently operate with these different APIs.

Let’s start with a typical DX11 game, light on overall thread utilization but heavy on the primary thread. With AMD up to now, and NV prior to the last few years, draw call processing in DX11 is restricted to the main rendering thread. Even if a game utilizes only 1-2 threads, with the rest idling, that game can become CPU bound really quickly. DX11 actually has a way to reduce this single thread bottleneck and better use multiple cores, called Command Lists or Deferred Contexts. But at the time, nobody bothered with it; being single thread bound was the status quo back in the day, as game devs relied on higher clocked & higher IPC CPU architectures to remove the bottleneck. It took a few years after DX11 came out for a game to use its multi-threaded Command List feature. That game was Civilization 5, with an engine developed by some of the guys now at Oxide, like Dan Baker, the creators of the Nitrous Engine & Ashes of the Singularity.

A well written summary that covers this topic, from Anandtech’s editor Ryan Smith, has been available on their forums since way back in 2011. DX11 multi-threading works by breaking down draw call submission into an Immediate Context and Deferred Contexts. Within a Deferred Context, you pile up your draw calls, and when it’s time to send them to the GPU, you assemble or batch them into a Command List and submit that. The final submission still happens on the primary thread, but now the extra cores or threads of the CPU can assist in batching up draw calls, a halfway approach to multi-threaded rendering.
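To make the mechanism concrete, here is a minimal sketch of what that looks like against the D3D11 API. It’s not from any real engine; the function name, the indexCount parameter, and the assumption that a device and immediate context already exist are mine purely for illustration, and all error handling is omitted.

```cpp
#include <windows.h>
#include <d3d11.h>

// Minimal sketch (not a full renderer): record draw calls on a deferred
// context, then submit the resulting command list on the immediate context.
// `device` and `immediateCtx` are assumed to come from D3D11CreateDevice,
// and `indexCount` from whatever mesh the game is drawing.
void RecordAndSubmit(ID3D11Device* device, ID3D11DeviceContext* immediateCtx, UINT indexCount)
{
    ID3D11DeviceContext* deferredCtx = nullptr;
    device->CreateDeferredContext(0, &deferredCtx);

    // Draw calls recorded here don't touch the GPU yet; they are only batched.
    deferredCtx->DrawIndexed(indexCount, 0, 0);
    // ... more state changes and draw calls ...

    // Close the recording into a command list.
    ID3D11CommandList* commandList = nullptr;
    deferredCtx->FinishCommandList(FALSE, &commandList);

    // Only the immediate context actually submits to the GPU, which is why
    // final submission still hangs off the primary rendering thread.
    immediateCtx->ExecuteCommandList(commandList, FALSE);

    commandList->Release();
    deferredCtx->Release();
}
```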

Suddenly NV GPUs were twice as fast in Civilization 5 at low resolutions, where the game is CPU bound. Even AMD’s Richard Huddy at the time admitted that such an approach could yield twice the performance. We’ll come back to this figure later. But it’s important to note that this improvement requires both the drivers and the game developers to code in a specific way to take advantage of it. To better explain this, I’ve made a diagram which simplifies DX11 multi-threading.

Instead of having the usual main thread maxing out from running heavy game logic as well as all the draw call processing and submission, DX11 can split up that draw call processing, offloading some of the work to worker threads on the other cores. When the work is completed, the data is assembled into a Command List for submission. This process can improve performance in situations where the main thread is heavily loaded. But there’s a cost: some CPU cycles are required to split up the workloads, and more to combine them into a Command List, so there is added CPU overhead to this approach. Back in 2011, though, most games rarely maxed out the second core onwards, so this approach was the perfect solution. Returning to the 2x increase, why not 3x or 4x with more CPU cores? It’s down to the nature of the process itself: splitting up draw calls into batches cannot scale indefinitely.
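For those who prefer code to diagrams, here is a rough sketch of that fan-out pattern, the way a game engine (not NV’s driver) might drive it. It assumes the scene has already been split into slices per worker; workerCount, the function name and the omitted recording step are all illustrative, not taken from any real engine, and error handling and state setup are skipped.

```cpp
#include <windows.h>
#include <d3d11.h>
#include <thread>
#include <vector>

// Each worker thread records its slice of the scene into its own deferred
// context; the primary thread pays the extra cost of spawning/joining the
// workers and executing the finished command lists in order.
void RenderFrameMultithreaded(ID3D11Device* device,
                              ID3D11DeviceContext* immediateCtx,
                              unsigned workerCount)
{
    std::vector<ID3D11CommandList*> commandLists(workerCount, nullptr);
    std::vector<std::thread> workers;

    for (unsigned i = 0; i < workerCount; ++i)
    {
        workers.emplace_back([=, &commandLists]
        {
            ID3D11DeviceContext* deferredCtx = nullptr;
            device->CreateDeferredContext(0, &deferredCtx);

            // Record this worker's share of the draw calls (omitted), e.g.
            // deferredCtx->DrawIndexed(...);

            deferredCtx->FinishCommandList(FALSE, &commandLists[i]);
            deferredCtx->Release();
        });
    }

    // The recombination step: wait for every worker, then submit serially on
    // the immediate context. This join + serial submit is part of the CPU
    // overhead that limits how far the approach scales.
    for (auto& w : workers)
        w.join();

    for (auto* cl : commandLists)
    {
        immediateCtx->ExecuteCommandList(cl, FALSE);
        cl->Release();
    }
}
```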

There are dependencies, and complexity increases the more cores you scale rendering to, as everything eventually has to recombine on the primary thread for submission. Typically a return of 2x draw call throughput from DX11 multi-threading is a good result.

Now, here’s the kicker, the real magic of NV’s driver team. After NV saw the huge benefit from Civilization 5’s use of DX11 Command Lists, they worked behind the scenes on a way to bypass the need for developers to multi-thread DX11 games themselves, a way to use more cores regardless of how the game is coded. Their solution is a brilliant one, rarely discussed in public, and the simplest way to explain it is that NV’s driver has an active “Server” thread that monitors draw calls.

If work is not sent via a Deferred Context for Command List preparation, the Server process intercepts it, slices the workload up, and sends it to worker threads to process and eventually assemble into a Command List. It sounds easy, but trust me, it is not; this feature is a marvel of software engineering. With the new multi-threaded DX11 driver, NV GPUs suffer much less from primary CPU thread bottlenecks. The driver essentially only needs a small portion of the primary thread to run the Server process, with batching handled by worker threads on the other CPU cores.

Games that are single threaded suddenly run great on NV GPUs; titles like World of Tanks or Arma 3, both single thread bound, run much faster on NV hardware. Even 3DMark’s DX11 API overhead test runs almost identically on NV GPUs in single-threaded and multi-threaded mode. As a side effect, intensive CPU PhysX in games has almost no performance impact on NV’s GPUs, again because they are largely immune to primary thread bottlenecks. It’s been smooth sailing for NVIDIA in DX11 due to this secret sauce. At this point, you have to ask: if the DX11 Command List is so good, why can NV implement it while AMD cannot? It comes down to one major difference: NV’s GPUs use a Software Scheduler, i.e. the driver. After the draw calls are sent to the Scheduler, it decides how to distribute the workloads in an optimal manner across the Compute Units or SMX.

Note that this compiling or scheduling has an inherent CPU cost associated with it, so it’s not a free lunch, but in many low-thread games lots of CPU resources are left untapped, so this software scheduling overhead is hidden. Because the main Scheduler for NV is software based, it can accept many Command Lists, even very large ones, as they are buffered for execution. In contrast, AMD GPUs have a Hardware Scheduler; both the Global and Warp Schedulers are present on chip. The driver does not schedule; it just passes the work to the GPU for the HWS to handle. By design, AMD’s HWS relies more on the immediate context, a constant stream of draw calls coming in, not big packets or a Command List that has to be stored prior to execution. Basically, AMD’s HWS lacks the capability or mechanism to properly support constant usage of DX11 Command Lists.

So when AMD claims they did not focus on multi-threading DX11, and instead focused on breaking through the draw call ceiling directly with a new next-gen API like Mantle, you can understand why. It’s not that they can but won’t; it’s simply that GCN is incapable of multi-threading well under DX11’s restrictions. This isn’t to say that AMD’s DX11 is awful; games can be coded to run really well on AMD under DX11. They just need to spread the game logic across many threads and leave the primary rendering thread mainly to handle draw calls & submission to the GPU. As an example, you may have heard of the Total War series.

Previous entries like Attila and Empire were single threaded and ran very badly on AMD GPUs. With the recent Total War: Warhammer, AMD had a partnership with Creative Assembly to focus on both multi-threading their engine and adding DX12. The result is that AMD GPUs run this game well in both DX11 and DX12, and that’s simply down to proper game thread usage. However, the reverse is also true: games can be coded in ways that really hurt AMD performance. It’s as simple as loading more game logic onto the primary thread, like CPU intensive effects or PhysX. With the main thread fully loaded, draw calls to AMD GPUs get stalled, and GPU utilization and performance drop.

This happens in some games, and it has led to the incorrect belief that AMD has worse DX11 driver overhead. Contrary to what the mantra suggests, as games become more threaded it can benefit AMD, provided the game logic is spread across all the threads, keeping the rendering thread free for draw calls. The inverse can also be true: in games that are well threaded and CPU intensive, NV’s GPUs can suffer both from the overhead of DX11 Command Lists and from the Software Scheduler. In simple terms, when a game pushes all threads, the extra overhead reduces the overall CPU cycles available for game logic, and the frame rate drops as a consequence when there’s not enough CPU power. Or, if the CPU is powerful enough, it results in higher CPU usage for a similar frame rate.

We can see this typically in the recent generation of multi-threaded console ports, such as Call of Duty: Black Ops 3, benchmarked here on a GTX 970 and an RX 480. Note the 970 is fully maxing the 4 CPU cores while outputting lower frame rates than the RX 480, which still has CPU resources in reserve. Again in Crysis 3, on a quad core, similar frame rates, but much higher CPU utilization for the GTX 970 and 1060. This is the DX11 Command List & software scheduling overhead; it takes CPU cycles to perform these tasks, and this cost is often hidden because most games do not stress all CPU threads so heavily. The same can be seen in The Witcher 3 in Novigrad, where the game becomes CPU bound on a quad core. Again, similar frame rates for all 3 GPUs, but note the CPU usage is maxed for NVIDIA while the RX 480 still has ample in reserve. It is a specific phenomenon, occurring when games are both CPU intensive and multi-threaded yet still leave the rendering thread free enough for draw calls, which is what AMD GPUs need to function well.

When you actually think about it, it is NVIDIA’s driver that has higher CPU overhead, as it needs more CPU cycles to function. The reason they benefit from this approach is that games have been very slow to push multi-threading; think back to 2012 through 2015, when most titles barely used 2 threads, leaving a lot of CPU threads idle. Meanwhile, during those years, AMD GPUs suffered in some DX11 games and people falsely assumed they had higher driver overhead. Those games did not come close to pushing the draw call ceiling, even for AMD’s GPUs; the problem all along was that they fully loaded the main rendering thread. The correct statement is that NVIDIA’s driver has higher CPU overhead, it uses more CPU cycles, but it can potentially push much higher DX11 draw call throughput if there are idling CPU resources available. Inversely, AMD’s driver is more direct to the GPU, but it is very vulnerable to rendering thread bottlenecks.

As for the situation with DX12 & Vulkan, adoption has been slow, and many studios design their games for both DX11 and these new APIs. That is not an approach that yields the best result for either DX12 or Vulkan, as these APIs require a ground-up engine redesign to truly benefit.

The reason DX12 & Vulkan benefit AMD’s GPUs is that the GCN architecture was designed for an API that can fully tap into its Hardware Scheduler. DX12 and Vulkan natively allow draw call processing and submission on all CPU threads. On AMD, this goes directly to the Hardware Scheduler, which can have 64 independent queues for distributing work to the Compute Units. In some ways, these new APIs are removing the shackles that AMD GPUs have been carrying through the DX11 era. As for NVIDIA’s DX12/Vulkan, because of its Software Scheduler there is some overhead when separate draw call submissions come in directly across many threads, as the streams have to be reassembled. This is not going to be a big impact, considering the draw call potential of DX12 or Vulkan, but when these games do not stress draw calls and are not coded to scale as well as NV’s DX11 multi-threading, performance can regress.
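To illustrate what “submission on all CPU threads” looks like in practice, here is a bare-bones D3D12 sketch. Again this is purely illustrative: the function and variable names are mine, and pipeline state, resource barriers, fences and cleanup are all omitted.

```cpp
#include <windows.h>
#include <d3d12.h>
#include <thread>
#include <vector>

// Every thread owns its own command allocator and command list, records
// independently, and the results are handed to the queue in one batch.
// `device` and `queue` are assumed to come from the usual device/queue
// creation calls.
void RecordFrameOnAllThreads(ID3D12Device* device,
                             ID3D12CommandQueue* queue,
                             unsigned threadCount)
{
    std::vector<ID3D12GraphicsCommandList*> lists(threadCount, nullptr);
    std::vector<ID3D12CommandAllocator*> allocators(threadCount, nullptr);
    std::vector<std::thread> workers;

    for (unsigned i = 0; i < threadCount; ++i)
    {
        workers.emplace_back([=, &lists, &allocators]
        {
            device->CreateCommandAllocator(D3D12_COMMAND_LIST_TYPE_DIRECT,
                                           IID_PPV_ARGS(&allocators[i]));
            device->CreateCommandList(0, D3D12_COMMAND_LIST_TYPE_DIRECT,
                                      allocators[i], nullptr,
                                      IID_PPV_ARGS(&lists[i]));

            // Record this thread's draw calls here (omitted), then close.
            lists[i]->Close();
        });
    }

    for (auto& w : workers)
        w.join();

    // One submission of everything the threads recorded. Unlike DX11, the
    // API itself is built around this per-thread recording model.
    std::vector<ID3D12CommandList*> submit(lists.begin(), lists.end());
    queue->ExecuteCommandLists(static_cast<UINT>(submit.size()), submit.data());
}
```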

However, done right by a good developer, both DX12 & Vulkan have big potential in future games that will push scene complexity much higher than today. We are still really only on the verge of truly meaningful performance gains with these APIs, because game complexity has somewhat stagnated, with the current console generation limiting game design complexity. Knowing these differences between AMD & NV’s strategies, can we conclude which approach is better? I think we can; clearly NVIDIA has been winning and is still winning, in market share, revenue and profits. However, it puts AMD’s approach in perspective. As the underdog, they had to gamble with Mantle to spur on real next-gen APIs that could fully tap into their GCN architecture as well as their high core count FX CPUs. If it had never been done or pushed, DX11 would have lived on for much longer, and that is simply not in AMD’s interest.

NVIDIA, meanwhile, moved towards Software Scheduling, which benefits from an API that works best with more driver intervention, and they invested heavily in maximizing their DX11 capabilities. It is therefore in their interest to ensure DX11 remains the target API for as long as possible. Moving forward, what is the optimal strategy for AMD & NVIDIA? Note that their strategies do not align; what is good for one is bad for the other. AMD needs Vulkan & DX12 to be adopted faster, with more multi-threaded game engines, and this is particularly important now with Ryzen CPUs on the market. AMD’s main advantage here is that their 8 core Jaguar CPUs in the major consoles dictate game studio optimizations. However, NVIDIA has more money; simply put, they can dictate how games are optimized for the PC port. In this kind of competitive battle, NVIDIA has a huge advantage because money matters here: it allows more engineers on site with game devs, and it allows more studio sponsorships. This ultimately gives NVIDIA a major say in how the PC port is coded, and assuredly, next-gen API or not, they will make it run better on their GPUs.

Therefore, AMD has no choice but to design their hardware to be more resilient against NVIDIA’s strategy. They started with Polaris: the Primitive Discard Accelerator nullifies 32x and 64x tessellation bottlenecks, while the improved HWS & instruction pre-fetch slightly alleviate primary thread bottlenecks. Vega must take this further, vastly improving its HWS & shader efficiency, because DX11 is still going to be important for the foreseeable future. It’s going to be a very interesting few years ahead as we witness these great tech companies battle it out, this we can be sure of. If you’ve stuck around for this long, things will start to add up, and perhaps something clicks in your mind regarding Ryzen’s odd gaming performance. Because of NVIDIA’s Software Scheduler and DX11 multi-threading, if the driver is not optimized for a new CPU architecture like Ryzen, with its unique CCX design, it could negatively impact performance in a few ways.

First, under DX12, with each thread submitting draw calls, the Software Scheduler’s re-organization and optimization of these multiple sources has high cross-thread data dependencies. In DX11, the automatic multi-threading feature constantly splits up draw calls across multiple worker threads, then reassembles them into a Command List; this again is a task with high thread dependencies. If these trips across the CCXs to share or sync thread data happen often, there’s extra latency. This is exacerbated when reviewers test at low resolutions and very high frame rates: each frame now has to be prepared by the CPU much faster, so the latency penalty kicks in.

With AMD’s GPUs, in both DX11 and DX12, draw call submission is thread independent, as the draw calls get sent straight to the GPU’s HWS. This is not to say there aren’t other issues with Ryzen, such as game specific optimizations or Windows thread management, but there’s no multi-threaded, thread dependent GPU driver that can compound the problem. I’ve been waiting to see whether any tech journalist would go down this route, and as expected, most of the tech press aren’t interested in delving deep; there’s an alarming lack of curiosity for a profession that demands it. However, it is refreshing to see that a certain YouTuber, Jim @ AdoredTV, has scratched the surface of this issue. You guys have no doubt seen it; if you have not, I will link it in the description. It’s a brilliant piece of tech journalism. This video has been in the back of my mind for a long while as I thought about how best to present it; hopefully this has been satisfactory and insightful for you guys. And btw, sorry for the voice, I’ve got a nasty cold.

Until next time…
