Myth Debugging: Is the Wii More Demanding to Emulate than the GameCube?

On the Dolphin Forums, one of the more common questions that comes up is "How come I can emulate this Wii game just fine but this GameCube game is slow?" While those more knowledgeable about the intricacies of emulation may roll their eyes, it does warrant some explanation. Usually when stepping down from a newer console emulator to an older console emulator, the minimum requirements for emulation drop significantly. While there are some exceptions when dealing with exceptionally obtuse hardware, that concern doesn't hold up here: the GameCube and Wii are nearly identical hardware-wise! The Wii at its core is really just a supercharged GameCube with a few extra pieces bolted on.

Common knowledge would say that games running on stronger hardware should be harder to emulate, but that isn't necessarily true, even in the case of incredibly similar hardware. In order to explain further, let's look at what makes each of these consoles tick.

CPU Emulation - Gekko V.S. Broadway


"Gekko" By Baz1521 - ja.wikipedia.org, CC BY-SA 3.0


The GameCube and Wii have very similar CPUs. The biggest difference between the two is that the GameCube's Gekko PowerPC processor, a modified PPC750CXe, runs at only 486MHz while the Wii's Broadway PowerPC processor, a close to stock PPC750CL, runs at 729MHz. That pegs the Wii processor at roughly 50% more powerful than the GameCube's! That means that Dolphin's CPU thread should be much more stressed out by Wii games, right? Not necessarily. In fact, the Wii's extra power can actually play out in Dolphin's favor!
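The clock ratio can be worked out directly from the two speeds quoted above. This is only arithmetic on the clock rates, and clock rate is a rough proxy for real-world performance at best; the function name is just for illustration.

```python
# Clock speeds quoted in the article, in MHz.
GEKKO_MHZ = 486      # GameCube CPU
BROADWAY_MHZ = 729   # Wii CPU


def percent_faster(new_mhz: float, old_mhz: float) -> float:
    """Return how much faster new_mhz is than old_mhz, as a percentage."""
    return (new_mhz / old_mhz - 1.0) * 100.0


cpu_gain = percent_faster(BROADWAY_MHZ, GEKKO_MHZ)
print(f"Broadway runs {cpu_gain:.0f}% faster than Gekko by clock speed")
```

The same calculation applies to the GPUs discussed later, since 243MHz over 162MHz is the same 1.5x ratio.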

The GameCube's slower processor may have proven to be a bottleneck for many developers. Plenty of GameCube games have some pretty creative solutions that rely on rather obscure behaviors of the processor which are more expensive to emulate. On the other hand, once the Wii was released and developers had a little more breathing room, a lot of those tricks suddenly stopped happening! Most of the power went toward more math calculations which map incredibly well to Dolphin's JIT. On that note, many Wii games actually are easier on Dolphin thanks to the processor being stronger.

That is not to say that a Wii game can't be more demanding than a GameCube game, though. Ambitious developers could get everything and more out of the Wii that they could out of the GameCube. Overall though, the Wii's stronger processor doesn't do much to make emulation harder.

GPU Emulation - Flipper V.S. Hollywood

Hollywood

Much like with the processor, the Wii's GPU is a supercharged version of what was in the GameCube. It features a 243MHz fixed-function GPU (Hollywood), compared to the GameCube's 162MHz GPU (Flipper), which is nearly identical apart from the clock rate.

The main bottleneck for emulating their GPUs comes from Dolphin having to flush the pipeline cache in situations where Flipper and Hollywood didn't! They were extremely robust and could handle tons of state changes without flushing. The fact that Hollywood can draw more pixels and render more triangles is of very little concern to Dolphin, as the host GPU in a modern PC combined with the CPU Vertex Loader JIT is more than enough to handle it. While there may be some extreme examples here and there of a game taking advantage of the Wii's extra GPU power to make Dolphin stumble, the bump in clock for the Wii GPU doesn't make a major difference in most cases.

Sound Emulation - The GameCube and Wii DSPs

There isn't that much to say here except for a few minor differences. Even then, if you only care about performance, High Level Emulation (HLE) trivializes the performance impact of emulating the DSP in modern Dolphin. The main difference between the two DSPs is that the Wii processes 3ms of data at a time vs the GameCube, which processes five 1ms pieces of data at a time. While the Wii does have more channels than the GameCube, its workflow is slightly more efficient to emulate. The result? Both DSPs are roughly the same when it comes to emulation.

Wii Specific Hardware - Starlet and Friends

For a quick aside, on top of having faster variants of the GameCube hardware, the Wii also had a bunch of new hardware related to Bluetooth, Wi-Fi, the SD Slot, etc. The PPC can't directly access the new hardware and instead talks to a small ARM coprocessor (Starlet) which manages access. Because Dolphin uses HLE for Starlet, it doesn't impact performance in most cases.

The only time IOS-HLE noticeably impacts Dolphin is on various asynchronous events (such as some parts of online functionality) that Dolphin handles synchronously, resulting in some stuttering here and there. This is mostly noticeable when connecting to servers in games.

RAM Watch - GameCube ARAM V.S. Wii MEM2

It shouldn't be any surprise that the Wii also saw an upgrade in RAM over the GameCube. Both consoles feature a 24MB MEM1 region that is more or less identical for emulation purposes.

The main change is that the GameCube had 16MB of slower A(udio)RAM that wasn't directly mappable to the processor. The Wii dropped this ARAM in favor of adding a 64MB MEM2 region that games could directly map without having to use paging. This is the first hardware change that greatly impacts emulation.

ARAM Flips The Script!

While ARAM could be used for audio processing, some ambitious developers thought it would be a waste to leave that 16MB of RAM sitting there just for audio when they only have 24MB of regular RAM. They couldn't directly map this 16MB of RAM, but they could use it as virtual memory by paging data in and out via Direct Memory Access!

Games commonly achieved this through accessing currently unpaged memory and triggering an exception. Then their specially programmed exception handler would map in memory from the ARAM into the page they were trying to access while paging something else out (which if accessed later on, could be paged back in!) When the exception handler returned program execution back to where it was, the fetch/load/store would go through as if the memory was mapped all along. Some of the most demanding games on the GameCube resorted to this to get an edge over their competition by essentially having an extra 16MB of virtual memory to work with.
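The paging scheme described above can be sketched as a toy model. This is only an illustration and not code from any game or from Dolphin: the page size, the least-recently-used eviction policy, and every name here are assumptions made for the example. On real hardware the page-in and page-out steps would be DMA transfers to and from ARAM.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # hypothetical page size for this sketch


class PagedMemory:
    """Toy model: a few resident pages backed by a larger 'ARAM' store."""

    def __init__(self, resident_pages: int):
        self.resident_pages = resident_pages
        self.resident = OrderedDict()  # page number -> bytearray, oldest first
        self.aram = {}                 # page number -> bytearray (paged out)
        self.faults = 0

    def _handle_fault(self, page: int) -> None:
        """Stand-in for the game's exception handler."""
        self.faults += 1
        if len(self.resident) >= self.resident_pages:
            # Page out the least recently used page (assumed policy).
            victim, data = self.resident.popitem(last=False)
            self.aram[victim] = data
        # Page in the requested page, or start with zeroes if it is new.
        data = self.aram.pop(page, None)
        if data is None:
            data = bytearray(PAGE_SIZE)
        self.resident[page] = data

    def _page_for(self, addr: int) -> bytearray:
        page = addr // PAGE_SIZE
        if page not in self.resident:
            self._handle_fault(page)     # fault, then the access is retried
        self.resident.move_to_end(page)  # mark as most recently used
        return self.resident[page]

    def read(self, addr: int) -> int:
        return self._page_for(addr)[addr % PAGE_SIZE]

    def write(self, addr: int, value: int) -> None:
        self._page_for(addr)[addr % PAGE_SIZE] = value


mem = PagedMemory(resident_pages=2)
mem.write(0 * PAGE_SIZE, 11)      # first touch of page 0 faults
mem.write(1 * PAGE_SIZE, 22)      # first touch of page 1 faults
mem.write(2 * PAGE_SIZE, 33)      # faults and pages out page 0
value = mem.read(0 * PAGE_SIZE)   # faults again and pages page 0 back in
print(value, mem.faults)
```

From the point of view of the code calling `read` and `write`, the memory looks like it was mapped all along; only the fault counter reveals the extra work.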

Virtual Memory
Click the image to go to our blog article that deals with MMU emulation in more detail.

The early libraries provided by Nintendo didn't provide for using the ARAM as virtual memory, so most of the early games (except for Rogue Squadron II, of course) didn't use the ARAM for virtual memory. This behavior became much more widespread when Nintendo started providing a library to help developers page memory into and out of ARAM. While there are a few custom implementations, a majority of games that make use of the ARAM as virtual memory use Nintendo's libraries.

Dolphin didn't actually receive its initial implementation of full MMU emulation until July of 2010, so instead of leaving these games completely broken, a more evil solution was devised. The games using Nintendo's library were very predictable in how they set up their page tables. By simply marking those regions of memory as valid, the games would be able to read from and write to virtual memory without issues. This bypassed the whole song and dance of memchecks, the exception handler, and paging data in and out of the ARAM!

This didn't quite work for games with custom solutions. For them, Full MMU was devised. By emulating memchecks and the rest of the process of paging data in and out of ARAM, Dolphin could finally boot those games... at a cost. Full MMU was incredibly slow and left most of these games nearly unplayable despite the emulator finally being able to boot them. A few notable examples include, but are not limited to: Star Wars: Rogue Squadron II, Star Wars: Rogue Squadron III, Star Wars: The Clone Wars, and Spider-Man 2. A special mention has to go to Star Wars: The Clone Wars, as its requirement of Dynamic Block Address Translation forced a rewrite of Dolphin's MMU emulation that slowed down the other full MMU games thanks to the added emulation overhead.

If you're wondering why Full MMU games are so much slower, it's because Dolphin can no longer assume the memory the game is reading from and writing to is actually valid. Instead it has to enable expensive memchecks and make sure that memory is valid before letting the game access it. This is slow, very slow. In recent years, many, many optimizations to the JIT have made some Full MMU titles run fairly well. They're still extremely demanding, but many of them can run on modern hardware without slowdown.
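To make the difference concrete, here is a rough sketch contrasting an access that assumes the address is valid with one that has to check first. It is not Dolphin's code: the page table contents, page size, and function names are all hypothetical, and a real JIT does this in generated machine code rather than Python.

```python
PAGE_SIZE = 4096                 # hypothetical page size
ram = bytearray(4 * PAGE_SIZE)   # toy physical memory
page_table = {0x10: 0, 0x11: 1}  # virtual page -> physical page (made up)


class PageFault(Exception):
    """Raised when an access touches a page that is not currently mapped."""


def read_fast(phys_addr: int) -> int:
    """Fast path: assume the address is valid and index straight into RAM."""
    return ram[phys_addr]


def read_checked(virt_addr: int) -> int:
    """Checked path: translate and validate every access before touching RAM."""
    vpage, offset = divmod(virt_addr, PAGE_SIZE)
    if vpage not in page_table:
        # On console, the game's exception handler would run at this point.
        raise PageFault(hex(virt_addr))
    return ram[page_table[vpage] * PAGE_SIZE + offset]


ram[PAGE_SIZE + 4] = 99
print(read_checked(0x11 * PAGE_SIZE + 4))  # mapped page, the read goes through

try:
    read_checked(0x12 * PAGE_SIZE)         # unmapped page
except PageFault as fault:
    print("fault at", fault)
```

The checked path does a lookup and a branch on every single access, which is the kind of per-access overhead that adds up quickly when a game performs millions of loads and stores per second.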

Wii games don't have to do any of this - they have full access to the 64MB of MEM2. This means that Dolphin doesn't have to worry about MMU emulation in general. As with any great rule, there are some exceptions, such as games that are purposefully trying to break Dolphin. Those games will be covered in another article as they seem to particularly target Dolphin's weaknesses, and thus some of their behaviors don't actually make sense for the source hardware.

Other Demanding Behaviors

Some games are very lightweight to emulate while others are problematic. This is because of the various features these games require. Full MMU emulation is just one of many; here are some of the other common behaviors that affect emulator performance.

Reading/Writing Embedded/External FrameBuffer Copies

If a game reads the framebuffer, Dolphin has to send what the host GPU is rendering to the CPU. By default, Dolphin turns this feature off because it's very slow, instead opting to store it only as a texture on the GPU. This works for most effects as long as the GameCube/Wii CPU doesn't need to use it for anything in particular. With Store EFB/XFB Copies to RAM, we faithfully copy the framebuffer off of the host GPU and into RAM. This is much slower on modern computers than it is on the GameCube and Wii for several reasons, including the fact that the GameCube/Wii CPU and GPU share RAM. Dolphin has to copy the memory across PCIe, send everything to the GPU, and then wait for it to finish doing its work before continuing. Because GPUs work in batches and Dolphin doesn't know when to flush in advance, this is very inefficient.


Super Mario Sunshine
Mario is unaffected (and rather unimpressed) by the goop in Super Mario Sunshine with Store EFB Copies to Texture Only enabled.


CPU Reading/Writing directly to the Embedded FrameBuffer

While the CPU and GPU shared RAM on the console, sometimes developers wanted to access the framebuffer directly. The framebuffer is embedded in the GPU (EFB), but it is mapped to the CPU and can be accessed. The option to enable this behavior in Dolphin is known as EFB Access from CPU. Games like Super Mario Galaxy and Twilight Princess (on Wii) do this to help see the depth of what you're pointing at when aiming various weapons with the Wii pointer. By disabling this with Skip EFB Access from CPU in Dolphin, you'll end up missing a lot due to the game not being able to tell how far your target is from where you're aiming. This will make for some interesting shot trajectories.


Super Mario Galaxy
Pull Stars and aiming both require EFB Access from CPU in order to function in Dolphin.


It should be noted that Dolphin is extremely good at minimizing the impact of certain access patterns, and has an EFB peek cache available to OpenGL and Vulkan. An example of a game that is slow strictly because of EFB Access is Monster Hunter Tri, which uses it for selecting custom colors from a palette and controlling bloom intensity.
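The general idea behind a peek cache can be sketched as follows. This is a guess at the concept rather than a description of Dolphin's implementation: the tile size, the invalidation rule, and all names are assumptions, and `fake_gpu_readback` merely stands in for an expensive GPU-to-CPU transfer.

```python
TILE = 64          # hypothetical tile size in pixels
EFB_WIDTH = 640    # width used by the fake readback below


def fake_gpu_readback(tile_x: int, tile_y: int):
    """Stand-in for an expensive readback of one tile from the host GPU."""
    return [
        [(tile_y * TILE + j) * EFB_WIDTH + (tile_x * TILE + i) for i in range(TILE)]
        for j in range(TILE)
    ]


class PeekCache:
    """Cache whole tiles so nearby peeks share a single readback."""

    def __init__(self, readback):
        self.readback = readback
        self.tiles = {}
        self.readbacks = 0

    def peek(self, x: int, y: int) -> int:
        key = (x // TILE, y // TILE)
        if key not in self.tiles:
            self.readbacks += 1
            self.tiles[key] = self.readback(*key)
        return self.tiles[key][y % TILE][x % TILE]

    def invalidate(self) -> None:
        """Drop cached tiles once the emulated GPU may have changed the EFB."""
        self.tiles.clear()


cache = PeekCache(fake_gpu_readback)
a = cache.peek(10, 10)
b = cache.peek(11, 10)   # same tile, served from the cache
cache.invalidate()       # the emulated GPU drew something
c = cache.peek(10, 10)   # has to go back to the GPU
print(a, b, c, cache.readbacks)
```

A cache like this helps most when a game peeks at many neighboring pixels between draws, and helps least when peeks and draws alternate.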

Lots of State Changes

As mentioned above in the GPU section, one particularly difficult thing about emulating the GameCube/Wii is that Flipper/Hollywood can pipeline a lot of state changes that require modern PC GPUs to flush the pipeline. These games may be doing something that is tolerable for their target hardware but is incredibly inefficient to do on a modern GPU! Two prominent examples are the snow in Flanoir in Tales of Symphonia and the minimap in Twilight Princess.


The Legend of Zelda: Twilight Princess
The fully rendered minimap can require a beast of a computer.


When properties are varying per-batch, modern GPUs struggle to keep up and Dolphin does little to help currently. The one thing to note is that Vulkan helps a lot in draw call bottlenecked situations thanks to more efficient handling of draw calls and far fewer API calls. This is because Vulkan has a single immutable pipeline switch and can be potentially cut down to just three API calls, compared to tens in OpenGL. In the case of The Legend of Zelda: Twilight Princess, the performance difference is so huge that maximum speed can be doubled just by using Vulkan in some areas.
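A small sketch shows why state that varies per batch is so costly. It assumes a simplified model in which consecutive draws sharing the same state can be merged into one batch; the draw records and the `state` field are hypothetical and do not reflect Dolphin's internal structures.

```python
from itertools import groupby


def count_batches(draws) -> int:
    """Count batches when consecutive draws with identical state are merged."""
    return sum(1 for _ in groupby(draws, key=lambda draw: draw["state"]))


# Six draws that all share one state collapse into a single batch.
steady = [{"state": "A", "verts": 3} for _ in range(6)]

# Six draws that alternate state cannot be merged at all.
varying = [{"state": s, "verts": 3} for s in "ABABAB"]

print(count_batches(steady), count_batches(varying))
```

Both lists submit the same amount of geometry, but the alternating one needs six times as many batches, and each batch carries its own fixed overhead on the host GPU and driver.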

Other Notable Demanding Behaviors

Some other examples include games like F-Zero GX, Super Monkey Ball, and Dragon Ball: Revenge of King Piccolo, which rely on CPU behaviors that require more exact emulation of certain CPU instructions. Suffice to say, games have a lot of ways to make things hard on Dolphin.

A Look at the Numbers

Now it's time to put everything into practice and get some test results. The test machine for the following is an i7-6700K running Windows 10, using the OpenGL graphics backend unless specified otherwise.

Settings:

  • Single Core
  • Native Resolution
  • EFB/XFB Copies specified by GameINI
  • MMU Emulation Enabled
  • Per-game speedhacks disabled
  • All other settings default unless specified otherwise


Base Analysis

Despite the small sample size, some trends start to show up. In general, GameCube games did run at a higher framerate than Wii games in this test. But that alone doesn't tell the whole story! Many of the more demanding GameCube games have higher requirements than most of the Wii games. Star Wars: Rogue Squadron III, a GameCube game, had the lowest performance out of all of the games. Another interesting note is that Metroid Prime 1 (GC) is more demanding than Super Smash Bros. Brawl and Super Mario Galaxy within the confines of this test. In order to understand why, deeper analysis into the games is required.

Metal Gear Solid: The Twin Snakes

This game uses a feature known as Line-Width to render the minimap, which is fairly slow to emulate thanks to requiring a geometry shader. Certain areas also use Line-Width's sister feature, Point-Size, in order to render dots of snow during some of the outdoor segments. It should be noted that modern GPUs do have an implementation of Line-Width/Point-Size, but differences in how they work cause the effects to render incorrectly. In order to get an accurate reproduction of what the GameCube/Wii rendered, Dolphin must use a geometry shader. Depending on the GPU, geometry shaders can become a bottleneck, so it's worth noting when a game uses these two features.

The game also optionally requires Store XFB Copies to Texture and RAM in order to render the codec screen correctly and does use ARAM as virtual memory. While Store XFB Copies to Texture and RAM greatly limits the max potential framerate, it didn't push the framerate lower in more demanding areas. The scene used in the test did not require Store XFB Copies to Texture and RAM.

Mario Party 2

Nintendo 64 Virtual Console games use a MIPS JIT that runs on PPC. Emulating a JIT within a JIT is not the most efficient task, but sometimes that's the task you're given. With N64 Virtual Console games not being much of a challenge graphically, all of the challenge comes from trying to JIT their JIT fast enough. Thanks to many optimizations to previously expensive instructions, games like Mario Party 2 can be emulated without too much difficulty.

This wasn't always the case - even Super Mario 64 was a struggle back in the days of Dolphin 3.5, and many other Virtual Console games didn't run correctly, were missing audio, or even required ridiculously precise depth emulation. It's easy to say that these games are trivial to emulate because emulation is nearly perfect now, but getting here was a rather bumpy ride.

Metroid Prime and Metroid Prime Trilogy - Metroid Prime

NOTE: Metroid Prime 1 on GC has a speedhack available to assist Dolphin with the detection of the idle loop. For the purposes of this test, it was not used.

Metroid Prime Trilogy's version of Metroid Prime requires Store EFB Copies to Texture and RAM for the scanner (much like the other Prime games) but the original GameCube version does not. Metroid Prime 2 does require Store EFB Copies to Texture and RAM for the scanner and runs a bit slower because of that, but not as slow as the Wii version.

Metroid Prime Trilogy
Metroid Prime Trilogy is among the most demanding games in Dolphin.

Another factor that slows down the Metroid Prime games is their insistence on drawing their 3D map screens primarily using the Line-Width feature. Some of their more complicated maps can push this feature so hard that it can bottleneck Dolphin by itself! The test for the article was in one of the areas not affected by the usage of Line-Width. On a final note, Metroid Prime Trilogy is one of the games where performance just seems to stay relatively static regardless of how much is going on: there is very little fluctuation with a ton of action, and it doesn't speed up when nothing is going on. This is partially due to Dolphin not detecting the idle loop and partially due to a low performance ceiling thanks to expensive features like Store EFB Copies to Texture and RAM being required.

Super Smash Bros. Melee and Brawl

Both of these games use fairly generic settings and don't do anything that's all that slow in Dolphin. Brawl pushes more vertices than Melee and has more sophisticated physics, making it a little bit slower to emulate overall. There are some menus in Brawl that use EFB access for something but that was easily solved with the EFB peek cache.

While Melee is lightweight overall, there is a particular level that is slower than the rest by a non-trivial margin - Fountain of Dreams.

Super Smash Bros. Melee
Fountain of Dreams ends up being Melee's most demanding stage due to the starry backdrop.

While a lot of people have attributed the slowdown to the reflection in the water, the true culprit is actually the backdrop and fountain sprays! To render the many stars and the spraying water, Melee uses Point-Size. Just like with Metal Gear Solid: The Twin Snakes, Dolphin needs to use a geometry shader in order to emulate the effect properly, resulting in the stage being more demanding. Running the performance test on this stage would have made Super Smash Bros. Melee slower than Super Smash Bros. Brawl, which would have been misleading.

Super Mario Galaxy and Super Mario Sunshine

Turning radius tightens when Mario can't be seen!

As mentioned above, Super Mario Galaxy needs EFB Access from CPU for seeing what the Wii pointer is pointing at on screen. Without it enabled, star bits fly toward the lower left corner and pull stars cease to function whatsoever. Super Mario Galaxy also pushes a lot of polygons and can easily bottleneck weaker computers with the number of draw calls it has in the main hub areas. Surprisingly enough, many of the actual levels are easier to emulate than the Comet Observatory.

Super Mario Sunshine is a bit less demanding thanks to being a 30 FPS game, likely due to the limitations of the GameCube. Despite needing Store EFB Copies to Texture and RAM in order to clear the goop and EFB Access from CPU to tell if Mario is behind objects, Super Mario Sunshine ended up less demanding in the performance test.

Spider-Man 2

As a full MMU title, Spider-Man 2 has had a spotty record of running well in Dolphin. Nowadays it runs as well as it ever has, minus some lost performance thanks to Dynamic BATs. It also requires Store EFB Copies to Texture and RAM for various reflections to render correctly. Despite the MMU pedigree, Spider-Man 2 is nowhere near as slow as the two Rogue Squadron games on the GameCube and easily stays above full speed on the i7-6700K.

Spiderman 2
Spider-Man 2's webswinging was its main draw and remains robust.

The Legend of Zelda: Twilight Princess (GC and Wii)

The Legend of Zelda: Twilight Princess represents four bars on the graph above because it manages to highlight several issues all at once. It shows how a particular bottleneck can even the odds between a GameCube and Wii game, and it shows a performance difference when that bottleneck is no longer the primary factor! Unlike OpenGL, Vulkan is able to power through the minimap's many draw calls. Both the GameCube version and the Wii version are bottlenecked by the minimap in OpenGL and run at about the same speed. But when you're no longer bottlenecked on draw calls, it becomes clear that the Wii version is more demanding than the GameCube version. This is due to heavier use of EFB Access from CPU.

Star Wars: Rogue Squadron III

Last but not least, Rogue Squadron III stands alone as the slowest game tested by far. The actual reason for why it's slow is very predictable:

...What else did you expect? There isn't a single reason why this game is slow - the reason it's difficult to emulate is everything it does.

To start off with the basics, Star Wars: Rogue Squadron III is a very high polygon GameCube game, hitting over 120,000 polygons in land levels like Hoth! This is a tremendously high count for a 60 FPS GameCube game and can stand tall against many Wii games. For comparison, in the hub area of Super Mario Galaxy, the game will push out between 160,000 and 180,000 polygons depending on where characters are moving. Rogue Squadron III also pushed the GameCube CPU as hard as it would go, sometimes even lagging on console during intense battle scenes. Increasing Dolphin's emulated CPU clock rate can smooth out this lag - assuming you have a PC capable of running this game at full speed.

Transition with Store XFB Copies to Texture Only

Rogue Squadron III also uses features that are slow for Dolphin to emulate on top of just being demanding in general. Unlike Rogue Squadron II, it requires Store XFB Copies to Texture and RAM for menu transitions, as they take a capture of what was on screen and fade it out rather than actually fading out what's on screen. Store EFB Copies to Texture and RAM is needed in several stages where the game pulls out some fairly fancy framebuffer effects. Those two alone would be bad enough, but Rogue Squadron III also needs EFB Access from CPU for various occlusion effects, even though Dolphin can't render them correctly anyway. The game has self-shadowing models and a lot of dynamic lighting effects that cause tons of shader generation when not using Ubershaders. Other notable features worth mentioning are Rogue Squadron III's very high resolution textures for the GameCube era and its fake HDR bloom.

But we're not done yet! The game also uses obscure features like zFreeze for its skyboxes and Line-Width for rendering wireframes of weapons and ships. While it's not particularly demanding about either of these features, they are yet more things that poor Dolphin has to emulate on top of everything else.

They used every feature...

Star Wars: Rogue Squadron III is, of course, a full MMU game much like its launch title older brother. While Rogue Squadron II started booting in 2010 alongside full MMU emulation, Rogue Squadron III was much more stubborn about it, requiring hands-on debugging to uncover its new trick of storing/reading data across pages.

If that wasn't bad enough, as of the writing of this article, Rogue Squadron III doesn't actually run in Dolphin. Instead, two separate test builds were used to gather framerate data, the most recent of which was Stenzek's pull request that should allow the game to run in single core again until a more drastic rewrite can be done to Dolphin's FIFO emulation. It should also be noted that while Rogue Squadron III had the lowest framerate reading in the test, there's really more to it than that. The amount of code it generated forced many JIT cache flushes (which result in a nifty stutter), and there were particular areas in the game where Dolphin would drop to single digit framerates and the game's lighting effects would begin flickering. With all due respect to Star Wars: The Clone Wars and its Dynamic BATs, Rogue Squadron III is the ultimate game to emulate on the GameCube.

Optimizing Dolphin For These Cases

Just because a game is slow now doesn't mean it will always be slow in Dolphin (except for Rogue Squadron III). Throughout Dolphin's past, various features have been merged targeting individual bottlenecks, doubling or tripling the max performance of a game overnight. One recent example of this is Silent Hill: Shattered Memories, which bottlenecked on PE Tokens that were previously sent one at a time. By simply batching them together, the game went from being one of the slowest in Dolphin to merely very demanding.

Silent Hill Chart
Silent Hill is a lot faster these days.

Is there hope for these cases to have a similar turn around?

Draw Call Bottleneck

The draw-call bottleneck is something that's been looked at in the past and a plan has even been drawn up. By having a second level FIFO, Dolphin could better analyze the most efficient way to render a game by tracking the state and batching tasks for the GPU in a more efficient manner. The major problem with this is that it would require a sizable rewrite to Dolphin's GPU emulation core.

CPU and MMU Bottlenecked Titles

While not the death knell it once was, the Enable MMU titles are still slow to emulate. All hope isn't lost, as some of the remaining performance hit could be mitigated by inlining the page table cache. The other potential way to make these games faster is for the JIT to be faster in general, as the CPU thread is where they're getting stopped up. While Dolphin's JIT pushes its current design pretty hard, it's relatively simple compared to many modern efforts. With a full redesign, it's possible that Dolphin's JIT could be made far more efficient for some of the games that currently trip it up.

Store EFB/XFB Copies to Texture and RAM Titles

A lot of effort has gone into trying to find ways to make storing EFB/XFB copies to RAM faster over the years, but it's been rough going to solve the problem. It was previously thought that locking could potentially lessen the impact of these features, but various limitations have currently grounded the experiment. The concept is that Dolphin can lock the EFB/XFB copy regions in memory and turn off storing to RAM unless the game actually tries to access those addresses! That way, Dolphin could enable the feature only when needed, and if it were on for only a fraction of the time, the performance hit would be smaller in most games.
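The locking concept can be modeled in a few lines. This is a toy illustration only: a real implementation would rely on operating system page protection to trap the access, which this sketch replaces with an explicit method call, and the class and method names are invented for the example.

```python
class LazyCopyRegion:
    """Toy model: defer an expensive GPU-to-RAM copy until the CPU reads it."""

    def __init__(self, download):
        self.download = download  # stand-in for the expensive readback
        self.data = None          # None means the RAM copy is stale ("locked")
        self.downloads = 0

    def gpu_wrote(self) -> None:
        """The GPU produced a new copy; keep it on the GPU and lock the region."""
        self.data = None

    def cpu_read(self, offset: int) -> int:
        """An access to the locked region triggers the copy, then proceeds."""
        if self.data is None:
            self.data = self.download()
            self.downloads += 1
        return self.data[offset]


region = LazyCopyRegion(download=lambda: bytes([7] * 16))

for frame in range(60):
    region.gpu_wrote()        # a new copy is made every frame
    if frame in (10, 40):     # but the game only reads it on two frames
        region.cpu_read(0)

print(region.downloads)
```

In this model the copy happens twice instead of sixty times. What the model leaves out is the cost of setting up and tearing down the lock itself, which in practice is paid on every copy.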

The problem for this one is twofold. Dolphin does not get the granularity it wants when locking memory, and those calls to the kernel to actually lock the memory end up slower than waiting for the GPU in previous attempts. There is the potential that locking could be optimized to be a net gain in the future, but for now efforts have been put on pause.

In Conclusion

Overall, whether a game was released on the GameCube or Wii doesn't matter all that much to how demanding it is to run on Dolphin. It's how the game makes use of what resources are available that determines how difficult a game is to run. Whether it's a Wii game using that extra clockrate for tough to emulate instructions or a GameCube game making use of ARAM as virtual memory, it all comes down to what the game itself was doing. Dolphin may have lucked out a bit in how the Wii was designed because most games don't rely on features that make games more strenuous to emulate.

There is one more question to pose: what if a Wii game pushed Dolphin as hard as Star Wars: Rogue Squadron III? That'd be a nightmare! And Factor 5 planned to do just that with Star Wars: Rogue Leaders for the Nintendo Wii. This port of all three Rogue Squadron games to Wii with additional features looked like it would be the ultimate test for Dolphin!

Star Wars: Rogue Leaders on Wii

Unfortunately, it never got released. On a personal note, we'd like to think if it were released and could run on Dolphin, it'd be incredibly slow and missing a bunch of effects.
