
5-Level Paging Work Heads Into Linux 4.12


  • 5-Level Paging Work Heads Into Linux 4.12

    Phoronix: 5-Level Paging Work Heads Into Linux 4.12

    More of Intel's enablement work for supporting five-level paging is being sent into the Linux 4.12 kernel...


  • #2
    Yeah, I really needed this in my HTPC. I'm sick of the OOM killer killing my programs.

    /sarcasm



    • #3
      4 PiB ought to be enough for everybody... Let's see how long it lasts!




      • #4
        5-level? Would this mean that resolving a physical memory address from a virtual one requires 5 look-ups?



        • #5
          Originally posted by quikee View Post
          5-level? Would this mean that resolving a physical memory address from a virtual one requires 5 look-ups?
          There are full docs for the feature here: https://software.intel.com/sites/def...hite_paper.pdf

          It seems so, and it is an evolution of the 4-level paging scheme already in use (on systems that need that much RAM anyway). So it is "only" adding one level on top of the existing 4.
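
          To make the "one more level" concrete, here is a minimal sketch in plain C (my own illustration, not kernel code) of how a virtual address splits up with 4 KiB pages: 9 index bits per table level plus a 12-bit page offset, so 4 levels cover 48-bit addresses and 5 levels cover 57-bit ones. The level names follow Linux's pgd/p4d/pud/pmd/pte convention (p4d is the level added for 5-level paging); the example address is arbitrary.

          #include <stdio.h>
          #include <stdint.h>

          /* 4 KiB pages: 12-bit offset, 9-bit index per table level.
           * 5 levels * 9 + 12 = 57-bit virtual addresses (vs 48-bit with 4 levels). */
          #define LEVEL_BITS 9
          #define PAGE_SHIFT 12
          #define LEVEL_MASK ((1u << LEVEL_BITS) - 1)

          int main(void)
          {
              uint64_t va = 0x00ffee00dd00cc0bULL; /* arbitrary example address */

              unsigned pgd = (va >> (PAGE_SHIFT + 4 * LEVEL_BITS)) & LEVEL_MASK; /* the new 5th level */
              unsigned p4d = (va >> (PAGE_SHIFT + 3 * LEVEL_BITS)) & LEVEL_MASK;
              unsigned pud = (va >> (PAGE_SHIFT + 2 * LEVEL_BITS)) & LEVEL_MASK;
              unsigned pmd = (va >> (PAGE_SHIFT + 1 * LEVEL_BITS)) & LEVEL_MASK;
              unsigned pte = (va >> PAGE_SHIFT) & LEVEL_MASK;
              unsigned off = va & ((1u << PAGE_SHIFT) - 1);

              printf("pgd=%u p4d=%u pud=%u pmd=%u pte=%u offset=%u\n",
                     pgd, p4d, pud, pmd, pte, off);
              return 0;
          }

          Resolving the address means indexing one table with each of those five fields in turn, which is where the "5 look-ups" comes from.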



          • #6
            I think it's time for Intel to just make CPUs that address all 64 bits of memory directly. By the time this heads to market, there will be a need for it within 10 years.



            • #7
              Originally posted by quikee View Post
              5-level? Would this mean that resolving a physical memory address from a virtual one requires 5 look-ups?
              Yes; however, as starshipeleven said, existing CPUs already typically require 4 lookups. If you have heard the term "Translation Lookaside Buffer" or TLB, that is what caches looked-up translations so that most memory accesses do not have to repeat the lookups. Most devices also cache information from the intermediate table accesses, so that when TLB misses do happen the walker can often avoid having to fetch all of the table levels.
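
              To illustrate the caching idea, here is a toy model in C: a direct-mapped "TLB" in front of a stubbed page-table walk. It is purely illustrative (real TLBs are set-associative hardware, and the stub stands in for the 4-5 dependent memory reads), but it shows why a hit rate near 100% makes the extra level cheap.

              #include <stdint.h>
              #include <stdbool.h>
              #include <stdio.h>

              #define TLB_ENTRIES 64
              #define PAGE_SHIFT  12

              struct tlb_entry { uint64_t vpn, pfn; bool valid; };
              static struct tlb_entry tlb[TLB_ENTRIES];
              static unsigned long walks;  /* how often we pay for a full walk */

              /* Stand-in for the real thing: in hardware this is 4 (or 5)
               * dependent memory reads. Identity-maps for the demo. */
              static uint64_t walk_page_tables(uint64_t vpn)
              {
                  walks++;
                  return vpn;
              }

              static uint64_t translate(uint64_t va)
              {
                  uint64_t vpn = va >> PAGE_SHIFT;
                  struct tlb_entry *e = &tlb[vpn % TLB_ENTRIES];

                  if (!e->valid || e->vpn != vpn) {   /* miss: do the full walk */
                      e->vpn = vpn;
                      e->pfn = walk_page_tables(vpn);
                      e->valid = true;
                  }
                  /* hit (the common case): zero table accesses */
                  return (e->pfn << PAGE_SHIFT) | (va & ((1ULL << PAGE_SHIFT) - 1));
              }

              int main(void)
              {
                  for (uint64_t i = 0; i < 1000000; i++)   /* stay inside one page */
                      translate(0x400000 + (i % 4096));
                  printf("accesses: 1000000, page-table walks: %lu\n", walks);
                  return 0;
              }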

              That said, when you start dealing with random accesses across very large address spaces, even a CPU or GPU with a lot of TLBs starts to encounter a lot of misses, and performance can drop significantly when that happens. The most common solution is larger page sizes (hugepages in Linux, typically 2 MB rather than 4 KB), which allow each TLB entry to cover a larger range of virtual addresses.
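
              As a concrete example, on Linux a program can ask for 2 MB pages explicitly with mmap's MAP_HUGETLB flag. The sketch below assumes hugepages have been reserved on the system (e.g. via /proc/sys/vm/nr_hugepages); transparent hugepages via madvise(MADV_HUGEPAGE) are the automatic alternative.

              #include <stdio.h>
              #include <string.h>
              #include <sys/mman.h>

              #define LEN (2UL * 1024 * 1024)   /* one 2 MB hugepage */

              int main(void)
              {
                  /* MAP_HUGETLB requests hugepages explicitly; this fails if
                   * none are reserved (e.g. echo 20 > /proc/sys/vm/nr_hugepages). */
                  void *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
                  if (p == MAP_FAILED) {
                      perror("mmap(MAP_HUGETLB)");
                      return 1;
                  }
                  memset(p, 0, LEN);  /* one TLB entry now covers the whole 2 MB */
                  munmap(p, LEN);
                  return 0;
              }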

              GPUs also use multi-level page tables and TLBs to cache translations. The MMU in AMD GPUs has a "fragment" mechanism which allows a single TLB entry to cover multiple GPUVM pages if the translation is the same for all of them, i.e. if a range of contiguous pages in virtual address space is backed by contiguous physical memory.
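
              As a rough sketch of the arithmetic (my reading of the mechanism; treat the exact encoding as an assumption): if the fragment field holds the log2 of the fragment size in 4 KiB pages, each increment doubles the range a single TLB entry can cover.

              #include <stdio.h>

              /* Assumed encoding: fragment = log2(size / 4 KiB), so coverage
               * doubles with each increment. */
              int main(void)
              {
                  for (unsigned frag = 0; frag <= 9; frag++)
                      printf("fragment=%u -> one TLB entry covers %lu KiB\n",
                             frag, 4UL << frag);
                  return 0;
              }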

              In an AMD APU the ATC (Address Translation Cache) block contains an additional set of TLBs, used for accessing system memory under HSA/ROC. On a TLB miss the ATC logic initiates an ATS (Address Translation Services) PCIe request. The IOMMU block then walks the CPU page tables (this is what allows CPU and GPU to share virtual addresses and PASIDs) and returns the translated address to the ATC for use in subsequent memory accesses.
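
              Here is a conceptual model of that miss path in C; every name is made up for illustration, this is not a real driver or hardware interface.

              #include <stdio.h>
              #include <stdint.h>

              /* One-entry "ATC" for the demo. */
              static uint64_t cached_va, cached_pa;
              static int cached_valid;

              /* Stand-in for the IOMMUv2 walking the CPU page tables for a
               * PASID; identity-maps for the demo. */
              static uint64_t iommu_walk_cpu_page_tables(uint64_t va, int pasid)
              {
                  printf("ATS request (pasid %d) for va=0x%llx\n",
                         pasid, (unsigned long long)va);
                  return va;
              }

              static uint64_t gpu_translate(uint64_t va, int pasid)
              {
                  if (cached_valid && cached_va == va)
                      return cached_pa;            /* ATC hit: no bus traffic */

                  /* ATC miss: issue the ATS request; the IOMMU walks the *CPU*
                   * page tables, which is what lets CPU and GPU share virtual
                   * addresses. */
                  cached_pa = iommu_walk_cpu_page_tables(va, pasid);
                  cached_va = va;
                  cached_valid = 1;
                  return cached_pa;
              }

              int main(void)
              {
                  gpu_translate(0x1000, 1);   /* miss: goes out to the IOMMU */
                  gpu_translate(0x1000, 1);   /* hit: served from the ATC */
                  return 0;
              }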

              So yes, lots of lookups but maybe 99% of them are avoided via translation caches.



              • #8
                Originally posted by bridgman View Post
                On a TLB miss the ATC logic initiates an ATS (Address Translation Services) PCIe request. The IOMMU block then walks the CPU page tables (this is what allows CPU and GPU to share virtual addresses and PASIDs) and returns the translated address to the ATC for use in subsequent memory accesses.
                That sounds impractically slow for most purposes. So, even though my CPU and GPU can share VM addresses, I'm probably still restricting shared data structures to a narrow address range (which probably means a fair bit of copying by the CPU). At that point, the benefits over having to use physical addresses would seem to become rather limited.

                So, can the CPU reach back into the GPU's MMU to invalidate cache entries? Does this slow down any change to the CPU's paging tables, in order to ensure the change is synchronized with the GPU? I imagine the CPU could even model the GPU's cache, in order to know whether this is necessary.



                • #9
                  Originally posted by coder View Post
                  That sounds impractically slow for most purposes. So, even though my CPU and GPU can share VM addresses, I'm probably still restricting shared data structures to a narrow address range (which probably means a fair bit of copying by the CPU). At that point, the benefits over having to use physical addresses would seem to become rather limited.
                  Why would it be slow? If everything was going through a real PCIe bus there would be TLP overhead for the ATS request/response, but APUs have a shorter path to the IOMMUv2. The rest of the process (e.g. reading page tables) is no different from a CPU, or from a GPU going through its own MMU (what we call GPUVM). The page tables are in system memory rather than VRAM, but reads to most of the levels hit in caches anyway.

                  Remember that using physical addresses brings its own overhead, since the memory now has to be pinned in order to maintain a consistent PA, which brings overhead for the pinning operation and limits your ability to manage memory efficiently. Accessing via ATS/PRI allows the GPU to reliably access unpinned memory.
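
                  For a feel of what pinning costs, the closest userspace analogue is mlock(2); real DMA pinning happens in-kernel (e.g. via get_user_pages), but the trade-off being described looks roughly like this sketch.

                  #include <stdio.h>
                  #include <stdlib.h>
                  #include <sys/mman.h>

                  #define LEN (4UL * 1024 * 1024)

                  int main(void)
                  {
                      void *buf = malloc(LEN);
                      if (!buf)
                          return 1;

                      /* "Pinning": fault the pages in and lock them so they keep
                       * a stable physical address. The call itself costs time,
                       * and locked pages can't be swapped out, so the kernel
                       * loses flexibility (RLIMIT_MEMLOCK also caps how much a
                       * process may pin). */
                      if (mlock(buf, LEN) != 0) {
                          perror("mlock");
                          return 1;
                      }

                      /* ... hand the buffer to a device that needs stable PAs ... */

                      munlock(buf, LEN);
                      free(buf);
                      return 0;
                  }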

                  Originally posted by coder View Post
                  So, can the CPU reach back into the GPU's MMU to invalidate cache entries? Does this slow down any change to the CPU's paging tables, in order to ensure the change is synchronized with the GPU? I imagine the CPU could even model the GPU's cache, in order to know whether this is necessary.
                  Yes, CPU invalidations also extend to the IOMMUv2, then IOMMUv2 sends an invalidation command to the GPU's ATC block. This is all wrapped in MMU notifiers on Linux.
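
                  For the curious, the kernel side looks roughly like this: a driver registers an mmu_notifier on the process address space and gets called back before the CPU page tables change. This is a sketch against the ~4.x-era API (signatures changed in later kernels), and my_gpu_flush_atc_range() is a hypothetical driver call.

                  #include <linux/mmu_notifier.h>

                  static void my_invalidate_range_start(struct mmu_notifier *mn,
                                                        struct mm_struct *mm,
                                                        unsigned long start,
                                                        unsigned long end)
                  {
                      /* Drop GPU ATC/TLB entries for [start, end) *before* the
                       * CPU page tables change, so the GPU can't keep using a
                       * stale translation. Hypothetical call: */
                      my_gpu_flush_atc_range(start, end);
                  }

                  static const struct mmu_notifier_ops my_ops = {
                      .invalidate_range_start = my_invalidate_range_start,
                  };

                  static struct mmu_notifier my_mn = { .ops = &my_ops };

                  /* Registered once per address space the GPU shares:
                   *     mmu_notifier_register(&my_mn, current->mm);
                   */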



                  • #10
                    Originally posted by bridgman View Post
                    Why would it be slow?
                    Because one memory transaction now turns into two. It should be worse for writes, which would otherwise be faster than reads but now have to block on a synchronous translation read.

                    Originally posted by bridgman View Post
                    APUs have a shorter path to the IOMMUv2.
                    I'm less concerned about APUs. Partly for the reason you mention, but also because discrete GPUs are obviously where the horsepower resides.

                    Originally posted by bridgman View Post
                    Remember that using physical addresses brings its own overhead, since the memory now has to be pinned in order to maintain a consistent PA, which brings overhead for the pinning operation and limits your ability to manage memory efficiently.
                    Yeah, but if you're restricting your address range to minimize thrashing the GPUVM, then it doesn't seem much different than using pinned memory.

                    Originally posted by bridgman View Post
                    CPU invalidations also extend to the IOMMUv2, then IOMMUv2 sends an invalidation command to the GPU's ATC block. This is all wrapped in MMU notifiers on Linux.
                    The bad part about this is that it should block until the transaction to notify the GPU has completed. Seems like it could take a while, if the PCIe bus is particularly busy.

                    IMO, the idea of cache-coherent (for HSA-compliant systems, IIRC) shared memory sounds good, until you realize exactly what it entails. This stuff ain't free, and naive users will shoot themselves quite easily.

