As the developer mentioned in their posts, -Os has traditionally been the fastest option because most large programs have a ton of cold code that hardly ever gets executed, and the smaller code that -Os produces caches better.

It sounds like they've narrowed the issue down to the way gcc is selecting what code to inline when given the -Os flag. Apparently it's not inlining some code even when doing so would result in smaller output.
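
To illustrate how inlining can shrink code rather than grow it, here's a minimal sketch. The names (counter, counter_get, sum_inlined and so on) are made up for the example and have nothing to do with the project being discussed. The idea is that when a function body is smaller than the call sequence needed to reach it, inlining it usually makes the caller smaller, though the exact sizes depend on the compiler version and target.

```c
#include <stdint.h>

/* Hypothetical type, used only for this illustration. */
struct counter {
    uint32_t value;
};

/* Tiny accessor: the body is a single load, which is typically
   cheaper in bytes than setting up and making a call. */
static inline uint32_t counter_get(const struct counter *c)
{
    return c->value;
}

/* The same accessor forced out of line (GCC/Clang attribute),
   so the two callers below can be compared. */
__attribute__((noinline))
uint32_t counter_get_outlined(const struct counter *c)
{
    return c->value;
}

/* Caller where the compiler is free to inline the accessor. */
uint32_t sum_inlined(const struct counter *a, const struct counter *b)
{
    return counter_get(a) + counter_get(b);
}

/* Caller that has to emit two real calls. */
uint32_t sum_outlined(const struct counter *a, const struct counter *b)
{
    return counter_get_outlined(a) + counter_get_outlined(b);
}
```

Compiling this with something like `gcc -Os -S` and comparing the assembly for sum_inlined and sum_outlined should show the difference on most targets, but I haven't measured it for any particular gcc version, so treat it as a sketch of the principle and not a reproduction of the actual issue.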