RPCS3 Inside Look: A Deep-Dive into Hardware and Performance Scaling!

https://rpcs3.net/blog/2020/08/21/hardware-performance-scaling/
Published Fri, 21 Aug 2020 on the RPCS3 blog (https://rpcs3.net/blog). RPCS3 is an open-source PlayStation 3 emulator and debugger written in C++, developed for Windows, Linux, macOS and FreeBSD.

Hey everyone! Today we’re going to talk about something a little different from what you are used to in the progress reports. We’ll be going in-depth on how certain hardware and software configurations can significantly affect your performance in RPCS3.

Several factors can keep RPCS3 from performing as well as it should, and memory speed is one of them. RPCS3 stresses memory performance in several ways:

  1. Cell emulation: SPU access to main memory goes through DMA, which is a beastly exercise to emulate all on its own.
  2. RSX emulation.

RSX memory operations fall into two major categories: Upload and Download. Upload operations include transfer of textures, shaders, and shader data (vertex buffers and other register configuration tables) from the host CPU to the host GPU. This process is usually optimized by the GPU driver to occur asynchronously and with heavy use of batching. It is bandwidth heavy, as the sets of data are rather large and transport has to go through PCI-E. We do a lot to hide this issue, and for the most part it works well, but if your memory is too slow or if you are stuck on an older PCI-E revision, the transfer lag can have a huge performance impact, especially if a GPU sync is required.

Download operations include, for instance, the transfer of textures and arbitrary data from the host GPU to the host CPU. This category has very serious implications for performance because we can’t really hide the memory latency of the transfer. Most of the time the memory in question will be accessed by Cell without warning, which means we have to stop everything until the GPU has processed the information we need, and then read all that data back over PCI-E while our CPU thread is blocked. It is for this very reason that we have the ‘buffer options’ disabled by default: doing so reduces the penalty of this hard stop, as most games may trample on older GPU-resident data without really needing to read it back later, in which case we can just pretend nothing existed for that memory block. It also means that running RPCS3 with your GPU usage maxed out, or close to it, is not advisable, as your GPU will not be quick enough to respond to these random synchronization requests. There is still a lot of optimization that could be done in this area, however, such as a very good predictor that guesses with high accuracy whether a memory block will soon be accessed by the CPU and starts queueing up the GPU instructions before that happens.


Let’s begin by showing the differences between the memory speeds that are going to be used and compared throughout the post. These performance stats were gathered by AIDA64 so the readers can have a better understanding of what manually tweaked memory looks like when compared to stock.

AIDA64 Stock Profile

AIDA64 Tweaked Profile

Manually tweaking the memory, or even doing something as simple as enabling XMP (Extreme Memory Profile), has a lot of benefits not only in RPCS3, but pretty much everywhere. It lowers latency and can significantly increase memory bandwidth, which definitely helps RPCS3 as you can observe below (pay attention to the min/max FPS graph):

Tweaked Memory Configuration

Stock Memory Configuration

The performance loss seen here will be even more pronounced in RSX-bottlenecked games, since those move a lot of data from the host CPU to the host GPU.
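For a rough sense of how much headroom faster memory adds, theoretical peak bandwidth can be estimated from the transfer rate, the bus width and the number of channels. The sketch below is only back-of-the-envelope arithmetic: the DDR4-2133 and DDR4-3200 figures are assumed example speeds, not the kits measured in this post, and real-world results such as the AIDA64 numbers will always be lower than the theoretical peak.

```python
def peak_memory_bandwidth_gb_s(transfer_rate_mt_s, channels=2, bus_width_bits=64):
    """Theoretical peak bandwidth in GB/s (1 GB = 10^9 bytes).

    transfer_rate_mt_s: memory speed in mega-transfers per second
                        (e.g. 3200 for DDR4-3200).
    channels:           number of populated memory channels.
    bus_width_bits:     width of one channel (64 bits for standard DDR4 DIMMs).
    """
    bytes_per_transfer = bus_width_bits / 8
    return transfer_rate_mt_s * bytes_per_transfer * channels / 1000


# Assumed example speeds, NOT the configurations benchmarked in this post.
for speed in (2133, 3200):
    bandwidth = peak_memory_bandwidth_gb_s(speed)
    print(f"DDR4-{speed} dual channel: {bandwidth:.1f} GB/s theoretical peak")
```

Note that this only covers bandwidth. Latency, which tightened timings also improve, does not show up in this calculation at all.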

PCI-E Bottlenecks

Providing help in our official Discord server is something we are quite proud of, and in the past we have diagnosed issues there that significantly impacted performance for some users. One of the biggest issues that some might be unaware of is the PCI-E revision their GPU is running at. We’ve shown how RAM speeds alone can impact performance, but how is performance affected when you’re running an older PCI-E revision, or fewer lanes?

Performance comparison in NieR showcasing the impact of a PCI-E bottleneck

PCI-E 1.1 x16

PCI-E 2.0 x16

PCI-E 3.0 x16

By making sure your GPU is using the latest supported PCI-E revision with all 16 lanes, you ensure that not only RPCS3 but also some native PC games perform to their full potential. Most recently, the PC port of Horizon Zero Dawn was found to be really sensitive to this, so that’s the perfect excuse to check how your GPU is operating.
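To put the three revisions compared above into numbers, the theoretical link bandwidth can be worked out from the per-lane signalling rate and the line encoding defined by each PCI-E generation. This is a minimal sketch of that arithmetic, not a measurement. The function and table names are made up for illustration, and achievable throughput is lower because of protocol overhead.

```python
# Per-lane signalling rate (GT/s) and line-encoding efficiency for each
# PCI-E generation compared in this post.
PCIE_GENERATIONS = {
    "1.1": (2.5, 8 / 10),     # 8b/10b encoding
    "2.0": (5.0, 8 / 10),     # 8b/10b encoding
    "3.0": (8.0, 128 / 130),  # 128b/130b encoding
}


def pcie_bandwidth_gb_s(generation, lanes):
    """Theoretical one-direction bandwidth in GB/s (1 GB = 10^9 bytes)."""
    gigatransfers, efficiency = PCIE_GENERATIONS[generation]
    # Each transfer carries one bit per lane; divide by 8 to get bytes.
    return gigatransfers * efficiency * lanes / 8


for gen in PCIE_GENERATIONS:
    for lanes in (8, 16):
        print(f"PCI-E {gen} x{lanes}: {pcie_bandwidth_gb_s(gen, lanes):.2f} GB/s")
```

The same function shows why lane count matters as much as the revision: a 3.0 link that has dropped to x8 offers roughly the bandwidth of a 2.0 x16 link.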

Overclocking

Pushing systems to their limits is something RPCS3 is very well known for, and as expected, it benefits greatly from overclocking. Something some users might not be familiar with, however, is cache/uncore/ringbus overclocking on Intel-based systems. On the AMD side, the closest equivalent would be Infinity Fabric overclocking, but this won’t benefit RPCS3 as much. Typically, people overclock their CPU core frequency and, after getting it stable, forget or don’t know about increasing the ringbus (cache) frequency, which can further improve performance in RPCS3 as you can see below (pay attention to the min/max FPS graph):

48X Cache

40X Cache

It is also worth mentioning that RPCS3 takes advantage of the AVX (Advanced Vector Extensions) instruction set, so users who have overclocked their CPUs with an AVX offset are going to take a hit, since the offset lowers the core clock whenever AVX code is running.

Mitigations

In early 2018 the hardware world was shocked when some features widely used to increase the performance of modern PCs turned out to be exploitable through security vulnerabilities such as Spectre and Meltdown. Those names should ring a bell for those who follow hardware news, and even for those who don’t. The vulnerabilities drew a quick response from manufacturers to keep users safe, which was great, but unfortunately the fixes cost performance. We therefore investigated how much the mitigations against these attacks affect RPCS3, which you can see below:

Impact to performance due to mitigations in Skate 3 running on an Intel i7-4790

Mitigations enabled

Mitigations disabled

Keep in mind that Ryzen and newer Intel-based systems were only affected by one of the aforementioned vulnerabilities. These newer CPUs are unlikely to see the performance gains that an older CPU lacking hardware fixes would. It also goes without saying that the RPCS3 team does not recommend disabling these security fixes, but we felt it was worth mentioning for those who seek that very last bit of performance in demanding titles.
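If you want to see which mitigations are active before drawing conclusions from your own numbers, Linux hosts expose this through sysfs. The sketch below reads that directory and is purely informational: it assumes a Linux kernel recent enough to provide /sys/devices/system/cpu/vulnerabilities, the helper name is made up for illustration, and it does not apply to the Windows systems used for the benchmarks in this post.

```python
from pathlib import Path

# Directory where Linux kernels report CPU vulnerability status.
SYSFS_VULNERABILITIES = Path("/sys/devices/system/cpu/vulnerabilities")


def read_mitigation_status(directory=SYSFS_VULNERABILITIES):
    """Return {vulnerability name: kernel status string}.

    Returns an empty dict when the directory does not exist
    (non-Linux host or a kernel that predates this interface).
    """
    directory = Path(directory)
    if not directory.is_dir():
        return {}
    return {
        entry.name: entry.read_text().strip()
        for entry in sorted(directory.iterdir())
        if entry.is_file()
    }


status = read_mitigation_status()
if not status:
    print("No vulnerability information available on this system.")
for name, state in status.items():
    print(f"{name}: {state}")
```

Each entry typically reads "Not affected", "Vulnerable" or "Mitigation: ..." followed by the method in use, which makes it easy to tell whether a given CPU is paying the mitigation cost at all.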

So what does all this look like when compared side by side? Looking at the images below, you can see the cumulative impact of leaving memory and cache at stock settings with mitigations enabled (pay attention to the min/max FPS graph):

Optimal performance

Worst case scenario performance

With all this data we believe you should be able to figure out for yourself whether it’s worth tweaking your system or not.

Upgrading your CPU

Now that we’ve covered the tweaks that could improve your RPCS3 performance without actually replacing your hardware, if your CPU is still letting you down, we would like to help by addressing one of the questions we get asked the most: what CPU should I buy? Well, that question is already answered in our Quickstart guide, which has recommended specifications that should work very well for all the games with the Playable status in the compatibility list. But what if you’re looking to brute-force games that aren’t there yet performance-wise into a playable experience? What are the things that you should consider? In this case, there are 3 main points to keep in mind when looking for a new CPU:

  1. How many threads does it have?
  2. How fast is the single threaded performance?
  3. Do the games you want to play require TSX?

The first thing to consider is that RPCS3 can heavily utilize up to 16 CPU threads, and once you go past that it’s very likely that you won’t see improvements. This means that once you have a CPU with 16 threads, you should invest in faster single-core performance instead. Keep in mind that you definitely won’t need 16 threads for all titles: RDR and a few other titles, for instance, won’t care if you go from 8C/8T to 8C/16T.

Moving on, what is TSX exactly? TSX (Transactional Synchronization Extensions) is an instruction set extension that adds hardware transactional memory support, which makes it a very good feature to look for when choosing a new CPU for RPCS3. To put it simply, TSX allows a thread to touch a piece of memory and then proceed with a task, but if any other thread touches that exact piece of memory, all the speculative work done is aborted. In the absence of TSX, the thread has to use a mutex to lock the memory for its sole use, which is significantly slower.
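The difference between the two approaches can be illustrated with a small software analogy. The sketch below is not TSX and not RPCS3 code: it mimics the "work speculatively, abort on conflict" idea with a version stamp, next to a plain lock-based update. The class and function names are hypothetical, and the internal commit lock merely stands in for the atomic commit that real hardware provides.

```python
import threading


class VersionedCell:
    """A value plus a version stamp, standing in for one piece of memory."""

    def __init__(self, value=0):
        self.value = value
        self.version = 0
        # Models the atomic commit step that hardware would perform.
        self._commit_lock = threading.Lock()

    def try_transaction(self, compute):
        """Run compute() speculatively; commit only if nobody wrote meanwhile.

        Returns True on commit, False if the work had to be thrown away.
        """
        start_version = self.version
        new_value = compute(self.value)        # speculative work, no lock held
        with self._commit_lock:
            if self.version != start_version:  # another writer touched the cell
                return False                   # abort: discard speculative work
            self.value = new_value
            self.version += 1
            return True


def locked_update(cell, lock, compute):
    """Mutex-style update: exclusive access for the whole computation."""
    with lock:
        cell.value = compute(cell.value)
        cell.version += 1


cell = VersionedCell(10)
print("uncontended commit:", cell.try_transaction(lambda v: v + 1))


def work_with_conflict(value):
    # Simulate another thread writing to the same cell mid-transaction.
    cell.try_transaction(lambda v: v + 100)
    return value + 1


print("contended commit:", cell.try_transaction(work_with_conflict))
print("final value:", cell.value)
```

In the uncontended case the speculative path never blocks, which is where the speed advantage comes from. In the contended case the outer update is discarded and would have to be retried, which is why the abort rate matters so much.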

Although TSX does look very good, it has its downsides because it is affected by mitigations. Every CPU with TSX support that has the microcode update will show TSX-FA, which stands for Force Abort, in the second line of the emulator’s log. As explained above, TSX lets a thread touch a piece of memory and proceed with a task, aborting all the work done if any other thread touches that exact piece of memory. With the microcode in place, however, the abort rate increases massively, which cripples performance. This issue was mostly alleviated by Nekotekina’s Fallback Path, where the emulator switches to the non-TSX path when it detects a high abort rate on the TSX path; this is much better than the crippled behavior, but not as good as the original TSX. Obviously, you can still roll back the microcode and get the original TSX working just fine, but doing so will also leave you vulnerable to the attacks the microcode mitigates.
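The general idea of falling back when aborts become too frequent can be sketched as follows. This is a hypothetical illustration of the concept only, not Nekotekina’s implementation: the class name, the 50% threshold and the 100-attempt window are all assumptions chosen for the example.

```python
class AdaptivePathSelector:
    """Picks the transactional path until its abort rate gets too high."""

    def __init__(self, abort_rate_limit=0.5, window=100):
        self.abort_rate_limit = abort_rate_limit  # assumed threshold
        self.window = window                      # attempts per evaluation
        self.attempts = 0
        self.aborts = 0
        self.fallback_active = False

    def record_attempt(self, aborted):
        """Record one transactional attempt; re-evaluate once per window."""
        self.attempts += 1
        if aborted:
            self.aborts += 1
        if self.attempts >= self.window:
            abort_rate = self.aborts / self.attempts
            self.fallback_active = abort_rate > self.abort_rate_limit
            # Start a fresh measurement window.
            self.attempts = 0
            self.aborts = 0

    def use_transactional_path(self):
        """True while the transactional path is still considered worthwhile."""
        return not self.fallback_active


selector = AdaptivePathSelector(abort_rate_limit=0.5, window=10)
for aborted in [True] * 8 + [False] * 2:  # 80% of attempts abort
    selector.record_attempt(aborted)
print("use transactional path:", selector.use_transactional_path())
```

A real implementation would also need some way to probe the transactional path again later, since no new abort statistics are gathered while the fallback is active.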

Unfortunately, transactional memory support was only implemented on Intel; AMD had plans for an equivalent instruction set called ASF, but it was never implemented. While TSX used to be almost a must back in the day, things have changed thanks to our developer elad335: CPUs without TSX aren’t as far behind in performance as they used to be, which shifts the TSX use case primarily to stability, with performance being only an ancillary benefit. The vast majority of titles won’t require TSX, but the ones that do are very sought after; these include God of War III, God of War: Ascension, Uncharted 2/3, and The Last of Us. These titles are going to hang or crash frequently without TSX, which gives users who lack said instructions a hard time.

We are aware that all of this sounds really bad for CPUs without TSX, but thankfully there’s a way around it with the “Accurate RSX Reservations” option found in the emulator’s Advanced settings tab. While it does help mitigate crashes and will significantly improve stability, the current performance with accurate RSX reservations enabled is subpar, since it coarsely emulates in software what TSX does in hardware. However, we are thrilled to announce today that this will be changing very soon with kd-11’s work. These changes are currently being tested and worked on behind the scenes and already deliver performance very comparable to TSX, as you can see below:

Performance comparison in God of War: Ascension using various RSX reservation methods

TSX Off (Legacy reservation system)

TSX Off (New reservation system)

TSX On

This is huge for CPUs without TSX, especially since Intel has removed it from the 10th generation of their CPUs. TSX may return in Intel’s 12th gen, as Intel recently added new instructions to the TSX instruction set, although there’s no official confirmation yet. With these improvements, however, we won’t have to worry about it in the future.

Benchmarks

With all that said, we’ve gathered data in our Discord server across a variety of CPUs that people may consider for RPCS3, to answer once and for all some of the questions about which CPU is best for RPCS3. So, here’s a graph with some actual numbers in Red Dead Redemption, which we believe to be one of the most demanding titles at the moment:

The testing was performed on Windows 10 2004 with the following settings differing from the default ones: SPU Loop Detection: Disabled; SPU Block Size: Mega; Anti-Aliasing: Disabled; Render: Vulkan; Relaxed ZCull: Enabled; Sleep Timers Accuracy: As Host; VBlank: 120Hz (to go above the 30FPS lock).

One of the first questions one may ask after seeing the graph is how a 3800X can perform better than a 3950X, even though the latter has twice the cores and cache. The answer is the increased latency from the 3950X’s multi-chiplet design. While the 3800X only has to communicate across two 4-core CCXes, the 3950X takes it a step further and has two chiplets, each with two 4-core CCXes, that it has to communicate across.

Unlike in most other software, RPCS3’s PPU & SPU threads need to communicate constantly, so with all the data moving around, the CPU hits a major bottleneck whenever these threads are split across multiple CCXes / chiplets. This is why we do not recommend Ryzen CPUs unless they have a 3 or 4 core CCX design (6-8 core Ryzen CPUs, or a 4 core Ryzen APU). A 4 core CCX design is ideal, as RPCS3 can fit all the PPU & SPU threads onto a single CCX, allowing users to bypass the inter-CCX latency bottleneck entirely, provided the PPU & SPU threads are scheduled properly so that they are placed on a single CCX.
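Whether a set of threads can stay within one CCX is simple arithmetic, sketched below. The function name is made up for illustration, the thread count of 8 is an assumed example rather than a figure from RPCS3, and counting both SMT siblings of a core as usable slots is also an assumption. Actual placement is up to the operating system’s scheduler.

```python
import math


def min_ccx_span(thread_count, cores_per_ccx, smt=True):
    """Smallest number of CCXes needed to host thread_count busy threads.

    cores_per_ccx: physical cores in one CCX (3 or 4 on the CPUs discussed).
    smt:           whether each core exposes two hardware threads.
    """
    logical_per_ccx = cores_per_ccx * (2 if smt else 1)
    return math.ceil(thread_count / logical_per_ccx)


EXAMPLE_THREADS = 8  # assumed number of busy PPU & SPU threads, for illustration

for cores in (3, 4):
    span = min_ccx_span(EXAMPLE_THREADS, cores)
    print(f"{cores}-core CCX with SMT: threads span at least {span} CCX(es)")
```

Any result above 1 means some of the constantly communicating threads must talk across CCXes, which is exactly the latency penalty described above.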

While later Ryzen generations have greatly improved latency, it’s still a major bottleneck for RPCS3 if all the PPU & SPU threads cannot be placed on a single CCX. Another thing to note is that Ryzen users should definitely update to Windows 10 1903 or later, as Microsoft improved their scheduler which helps to avoid this bottleneck as well.

Intel CPUs, on the other hand, are quite the opposite. They do not suffer from the latency issues explained above thanks to their monolithic design, and when you combine that with faster single-core performance, you will notice that they often perform better than their AMD equivalents.

That being said, we would also like to address why the 9900KS@5.1GHz is performing 5 FPS higher than the 9900K@4.9GHz. Although most of that gain comes from the frequency alone, because RDR really takes advantage of it, the higher cache frequency and faster memory speeds account for the rest.

Closing Words

This concludes our in-depth hardware report. Based on the feedback from the community, we may publish more such reports. We hope you liked it and look forward to the next one. Thanks for reading!

If you would like to contribute to the project, you can do so by contributing code, helping the community or becoming a patron. RPCS3 has two full-time developers working on it who greatly benefit from the continued support of the many generous patrons. In exchange, patrons get special support on our Discord server and access to early updates directly from our lead developers. If you are interested in supporting us, consider visiting our Patreon page at the link below and becoming a patron, or join our Discord server to learn about other ways to contribute.

This report was written by Yahfz.
