However, the main reason for developing the upgradeable rwlocks was not just to
create more critical sections that other CPUs can have read access to. Ultimately
I had a pipe dream that it could be used to create multiple runqueues as you
have done in your patch. However, what I didn't want to do was to create a
multi runqueue design that then needed a load balancer, as that would take away
one of the advantages of BFS: needing no balancer and keeping latency as low as
possible.
I've never put up a post about my solution to this problem because the
logistics of actually creating it, and the work required, kept putting me off:
it would require many hours, and I really hate to push vapourware. Code speaks
louder than rhetoric. However, since you are headed down the path of creating
multi runqueue code, perhaps you might want to consider it.
What I had in mind was to create varying numbers of runqueues in a
hierarchical fashion. Whenever possible, the global runqueue could be grabbed
in order to find the best possible task to schedule on that CPU from the entire
pool. If there was contention on the global runqueue, however, it could step
down in the hierarchy and just grab a runqueue effective for a NUMA node and
schedule the best task from that. If there was contention on that it could
step down and schedule the best task from a physical package, and then shared
cache, then shared threads, and only if all that failed would it just grab a
local CPU runqueue. The reason for doing this is that it would create a load
balancer by sheer virtue of the locking mechanism itself, rather than there
actually being a load balancer at all. It would thereby keep the benefits of
the BFS approach (minimising latency, finding the best global task, not
requiring a load balancer) while at the same time benefiting from multiple
runqueues to avoid lock contention, and in fact use that lock contention as a
means to an end.
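To make the step-down idea concrete, here is a rough userspace sketch of the
lock selection only. It is not kernel code and not an implementation of the
design: every name in it (rq_level, cpu_rqs, grab_widest_rq and so on) is made
up for illustration, a plain C11 atomic_flag stands in for whatever lock the
kernel would really use (it is not an upgradeable rwlock), and it says nothing
about how tasks would be shared between the levels.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical hierarchy levels, widest scope first. */
enum rq_level {
    RQ_GLOBAL, RQ_NUMA, RQ_PACKAGE, RQ_CACHE, RQ_SMT, RQ_CPU, RQ_LEVELS
};

/* Only the lock is modelled; the queue of tasks is left out on purpose. */
struct runqueue {
    atomic_flag locked;
};

/* chain[level] is the runqueue that covers this CPU at that level. */
struct cpu_rqs {
    struct runqueue *chain[RQ_LEVELS];
};

static void rq_init(struct runqueue *rq)
{
    atomic_flag_clear(&rq->locked);
}

/* Non-blocking attempt: true if we took the lock, false on contention. */
static bool rq_trylock(struct runqueue *rq)
{
    return !atomic_flag_test_and_set_explicit(&rq->locked,
                                              memory_order_acquire);
}

/* Blocking acquire, as a simple spin. */
static void rq_lock(struct runqueue *rq)
{
    while (!rq_trylock(rq))
        ;
}

static void rq_unlock(struct runqueue *rq)
{
    atomic_flag_clear_explicit(&rq->locked, memory_order_release);
}

/*
 * Walk from the global runqueue down towards the local CPU runqueue and
 * return the first one whose lock is free. Contention at a level is what
 * pushes the CPU to the next, narrower level. Only the local CPU runqueue,
 * the last resort, is waited for. The level obtained is stored in *got.
 */
static struct runqueue *grab_widest_rq(struct cpu_rqs *c, enum rq_level *got)
{
    assert(c != NULL && got != NULL);

    for (int lvl = RQ_GLOBAL; lvl < RQ_CPU; lvl++) {
        struct runqueue *rq = c->chain[lvl];

        if (rq_trylock(rq)) {
            *got = (enum rq_level)lvl;
            return rq;
        }
    }

    rq_lock(c->chain[RQ_CPU]);
    *got = RQ_CPU;
    return c->chain[RQ_CPU];
}
```

The caller would pick the best task from whichever runqueue comes back and
then rq_unlock() it. With no contention every CPU ends up choosing from the
global pool; as contention rises, CPUs fall through to narrower runqueues
without any separate balancing pass.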
Alas, to implement it myself I'd have to be employed full time for months
working on just this to get it working...