We need the same kind of perf output for Radeon that was recently added for Intel; it would help find these bottlenecks.

radeontop can help a bit, but it can't tell you what's going on inside the driver, only whether you're CPU or GPU limited and which units inside the GPU are getting used the most.