IOMMUFD Submitted For Linux 6.2 To Overhaul IOMMU Handling

Written by Michael Larabel in Virtualization on 12 December 2022 at 03:20 PM EST.
After being discussed in various forms since 2017, IOMMUFD has been submitted for the Linux 6.2 kernel, laying the groundwork for an overhaul of IOMMU handling by QEMU and virtual machines on Linux.

NVIDIA's Jason Gunthorpe sent out the IOMMUFD pull request today for the Linux 6.2 merge window. Here's how he explains it:
This is the first PR for a new char device called 'iommufd'. This has been in discussions for about two years now, while attempts to implement the needed features have been tried since as far back as 2017. In January a group of us sat down and actually made it.

It may seem strange that a lowlevel, seemingly internal, function like the iommu would have a user API. However, for performance, it has become quite popular to setup virtual machines so that HW (eg networking/storage devices) can have direct DMA into the virtual machine memory itself. To achieve this requires special programming of the IOMMU HW.

Further, we have advanced PCI features like Process Address Space ID (PASID) and Page Request Interface (PRI) that rely on the IOMMU HW to implement them. In particular PASID & PRI are used to create something called Shared Virtual Addressing (SVA or SVM) where DMA from a device can be directly delivered to a process virtual memory address by having the IOMMU HW directly walk the CPU's page table for the process, and trigger faults for DMA to non-present pages.

Naturally people would like to have virtualized versions of all of this. A vIOMMU driver that can implement vPASID and vPRI to achieve vSVA within a VM.

Thus, we get to iommufd. It is a uAPI to allow something like qemu to have some direct control over the IOMMU to implement, in userspace, a vIOMMU device emulation that can provide all these services to the IOMMU driver running in the VM. Like KVM this is done in a general way where we delegate functionality to userspace and if the userspace is something like qemu then it will compose that functionality into an emulated vIOMMU device. iommufd itself doesn't interact with virtualization or KVM.

This PR is the starting point, it just gets all the infrastructure setup so that it is as good as VFIO is right now. We see a broad need for extended features, some being highly IOMMU device specific:

- Binding iommu_domain's to PASID/SSID
- Userspace IO page tables, for ARM, x86 and S390
- Kernel bypassed invalidation of user page tables
- Re-use of the KVM page table in the IOMMU
- Dirty page tracking in the IOMMU
- Runtime Increase/Decrease of IOPTE size
- PRI support with faults resolved in userspace

More details on IOMMUFD and its current status can be found via the pull request.
IOMMUFD is the user API to control the IOMMU subsystem as it relates to managing IO page tables from userspace using file descriptors. It intends to be general and consumable by any driver that wants to expose DMA to userspace. These drivers are eventually expected to deprecate any internal IOMMU logic they may already/historically implement (e.g. vfio_iommu_type1.c).

At minimum iommufd provides universal support of managing I/O address spaces and I/O page tables for all IOMMUs, with room in the design to add non-generic features to cater to specific hardware functionality.

In this context the capital letter (IOMMUFD) refers to the subsystem while the small letter (iommufd) refers to the file descriptors created via /dev/iommu for use by userspace.
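
To make the file-descriptor model above more concrete, here is a minimal sketch of how userspace might open /dev/iommu, allocate an I/O address space (IOAS) and destroy it again. It is an illustration under stated assumptions, not code from the pull request: the ioctl type and command numbers and the two struct layouts are transcribed from the kernel's include/uapi/linux/iommufd.h as I understand it and should be checked against the headers of the running kernel, and the helper names (alloc_ioas, destroy_object) are hypothetical. It needs a kernel built with IOMMUFD and permission to open the device; otherwise it just reports the error.

```python
import fcntl
import os
import struct

# Assumed values, transcribed from include/uapi/linux/iommufd.h.
# Verify them against the headers of the kernel you are running.
IOMMUFD_TYPE = ord(";")
IOMMUFD_CMD_DESTROY = 0x80
IOMMUFD_CMD_IOAS_ALLOC = 0x81


def _io(ioc_type: int, nr: int) -> int:
    """Mimic the kernel's _IO() macro (no direction bits, zero size).

    This encoding holds on x86 and arm64; a few architectures encode
    "no direction" differently, so treat it as an assumption.
    """
    return (ioc_type << 8) | nr


IOMMU_DESTROY = _io(IOMMUFD_TYPE, IOMMUFD_CMD_DESTROY)
IOMMU_IOAS_ALLOC = _io(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_ALLOC)

# struct iommu_ioas_alloc { __u32 size; __u32 flags; __u32 out_ioas_id; };
IOAS_ALLOC_FMT = "=III"
# struct iommu_destroy { __u32 size; __u32 id; };
DESTROY_FMT = "=II"


def alloc_ioas(fd: int) -> int:
    """Ask the kernel for a new, empty I/O address space; return its ID."""
    size = struct.calcsize(IOAS_ALLOC_FMT)
    arg = bytearray(struct.pack(IOAS_ALLOC_FMT, size, 0, 0))
    fcntl.ioctl(fd, IOMMU_IOAS_ALLOC, arg)  # out_ioas_id is written in place
    return struct.unpack(IOAS_ALLOC_FMT, arg)[2]


def destroy_object(fd: int, obj_id: int) -> None:
    """Release an object previously created on this iommufd."""
    size = struct.calcsize(DESTROY_FMT)
    fcntl.ioctl(fd, IOMMU_DESTROY, struct.pack(DESTROY_FMT, size, obj_id))


def main() -> None:
    try:
        fd = os.open("/dev/iommu", os.O_RDWR)
    except OSError as exc:
        print(f"cannot open /dev/iommu: {exc}")
        return
    try:
        ioas_id = alloc_ioas(fd)
        print(f"allocated IOAS id {ioas_id}")
        destroy_object(fd, ioas_id)
    except OSError as exc:
        print(f"iommufd ioctl failed: {exc}")
    finally:
        os.close(fd)


if __name__ == "__main__":
    main()
```

A real consumer such as QEMU would go on to bind a device to the iommufd and map guest memory into the IOAS; those steps are left out here to keep the sketch small.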

The kernel documentation also covers IOMMUFD in more detail.


Barring any objections from Linus Torvalds, the initial IOMMUFD code should make it into Linux 6.2, while further feature work is already being tackled for future kernel cycles.
