Performance Report

I profiled the game with AMDuProf for 120 seconds, since performance is still an issue, and the results clearly point to data-side TLB pressure.
I am not a game developer, but as far as I know, lib_burst_generated.dll is related more to the simulation itself than to graphics.

City: ~900 year 8,4k Pop.
CPU: AMD Ryzen 9 9950X3D
OS: Windows 11
Game: v1.1.0P4

raw_data:

Functions,Modules,L1_ITLB_MISSES_L2_HITS,ITLB_Reload_from_Page_Table_walk.Coalesced_4k.walk_1G.walk_2M.walk_4K,L2_DTLB_MISSES,L1_DTLB_MISSES,ALL_TLB_FLUSHES,L2_ITLB_MISSES
PublicKeyToken=null({5}),lib_burst_generated.dll,0,0,43.072.750.000,54.644.250.000,0,0

PID-17148 : farthest frontier.exe,0,0,43.072.750.000,54.644.250.000,0,0
TID-7512 : Thread-7512,0,0,0,1.750.000,0,0
TID-9316 : Thread-9316,0,0,0,157.000.000,0,0
TID-9480 : Thread-9480,0,0,0,3.250.000,0,0
TID-10724 : Thread-10724,0,0,0,30.250.000,0,0
TID-11728 : Thread-11728,0,0,0,29.500.000,0,0
TID-14764 : Thread-14764,0,0,43.072.750.000,54.321.250.000,0,0
TID-15028 : Thread-15028,0,0,0,62.500.000,0,0
TID-15872 : Thread-15872,0,0,0,5.750.000,0,0
TID-16284 : Thread-16284,0,0,0,33.000.000,0,0

  • L2 DTLB miss ratio among L1 DTLB misses (TID-14764): 43.072.750.000 / 54.321.250.000 ≈ 79.3%
  • L2 DTLB hit ratio among L1 DTLB misses (TID-14764): ~20.7%

So roughly 80% of L1 DTLB misses also miss the second-level TLB and require a page walk. Each walk has to fetch page-table entries through the cache hierarchy, so translation work competes with the simulation's own data for cache capacity and bandwidth.
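The ratios above can be reproduced from the raw counters. This is a minimal sketch, assuming that L2_DTLB_MISSES is a subset of L1_DTLB_MISSES (every L2 DTLB miss was first an L1 DTLB miss); the helper `parse_count` is my own name and only handles the dot-separated number format of the export.

```python
def parse_count(text: str) -> int:
    """Parse a count that uses '.' as a thousands separator, e.g. '43.072.750.000'."""
    return int(text.replace(".", ""))


# Counter values for TID-14764, copied from the raw data above.
l1_dtlb_misses = parse_count("54.321.250.000")
l2_dtlb_misses = parse_count("43.072.750.000")

# Assumption: every L2 DTLB miss is also counted as an L1 DTLB miss,
# so the quotient is the share of L1 misses that end in a page walk.
l2_miss_ratio = l2_dtlb_misses / l1_dtlb_misses
l2_hit_ratio = 1.0 - l2_miss_ratio

print(f"L2 DTLB miss ratio among L1 DTLB misses: {l2_miss_ratio:.1%}")
print(f"L2 DTLB hit ratio among L1 DTLB misses:  {l2_hit_ratio:.1%}")
```

Using the process-wide L1 DTLB miss count (54.644.250.000) instead gives a slightly lower miss ratio of about 78.8%, so the conclusion does not depend on which row is used.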

That is about 43 billion L2 DTLB misses in a 120-second capture, which is a very high amount of data-side translation churn, especially for a single thread.
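To put the totals into perspective, they can be converted into per-second rates. This is a rough sketch, assuming the counters are totals for the whole 120-second capture and that the thread was busy for the entire time; the profiler values are sampled estimates, so the result is an order of magnitude rather than an exact figure.

```python
PROFILE_SECONDS = 120  # capture duration stated in the report

# Counter values for TID-14764, copied from the raw data.
l1_dtlb_misses = 54_321_250_000
l2_dtlb_misses = 43_072_750_000

# Assumption: counts cover the full capture, so dividing by its length
# gives an average rate for the hot thread.
l1_misses_per_second = l1_dtlb_misses / PROFILE_SECONDS
walks_per_second = l2_dtlb_misses / PROFILE_SECONDS

print(f"L1 DTLB misses: ~{l1_misses_per_second / 1e6:.0f} million per second")
print(f"L2 DTLB misses: ~{walks_per_second / 1e6:.0f} million per second")
```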

A comparatively straightforward mitigation could be to back the hot memory with huge pages (called large pages on Windows, typically 2 MiB on x64), because with this level of DTLB pressure, 4 KiB pages appear to be a significant bottleneck.
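The reason larger pages help is TLB reach: the amount of memory a TLB can map is the number of entries times the page size. The sketch below illustrates this with an assumed entry count; `ASSUMED_L2_DTLB_ENTRIES` is a placeholder of mine, not a value from the profile, and the real number (and which page sizes each TLB level can hold) should be checked in the processor documentation.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

# Illustrative assumption only, not taken from the profile or a datasheet.
ASSUMED_L2_DTLB_ENTRIES = 4096


def tlb_reach(entries: int, page_size: int) -> int:
    """Bytes of address space that can be mapped without a page walk."""
    return entries * page_size


reach_4k = tlb_reach(ASSUMED_L2_DTLB_ENTRIES, 4 * KIB)
reach_2m = tlb_reach(ASSUMED_L2_DTLB_ENTRIES, 2 * MIB)

print(f"Reach with 4 KiB pages: {reach_4k / MIB:.0f} MiB")
print(f"Reach with 2 MiB pages: {reach_2m / GIB:.0f} GiB")
print(f"Improvement factor:     {reach_2m // reach_4k}x")
```

Whatever the real entry count is, the factor between 4 KiB and 2 MiB pages is 512, so a working set that overflows the TLB with small pages may fit comfortably with large ones.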

It is also notable that this work appears to be handled by just a single thread: TID-14764 accounts for all of the recorded L2 DTLB misses and about 99.4% of the L1 DTLB misses.
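The single-thread concentration can be checked directly against the per-thread rows. This is a small sketch using the L1_DTLB_MISSES column from the raw data; the dictionary name is my own.

```python
# L1_DTLB_MISSES per thread ID, copied from the raw data.
l1_dtlb_misses_by_tid = {
    7512: 1_750_000,
    9316: 157_000_000,
    9480: 3_250_000,
    10724: 30_250_000,
    11728: 29_500_000,
    14764: 54_321_250_000,
    15028: 62_500_000,
    15872: 5_750_000,
    16284: 33_000_000,
}

total = sum(l1_dtlb_misses_by_tid.values())
hot_tid = max(l1_dtlb_misses_by_tid, key=l1_dtlb_misses_by_tid.get)
hot_share = l1_dtlb_misses_by_tid[hot_tid] / total

print(f"Process total L1 DTLB misses: {total:,}")
print(f"Hottest thread: TID-{hot_tid} with {hot_share:.1%} of all L1 DTLB misses")
```

The per-thread values add up to the process row (54.644.250.000), so no thread is missing from the export.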