Open-Source Linux Driver Published For Habana Labs' "Goya" AI Processor

Written by Michael Larabel in Hardware on 22 January 2019 at 07:36 PM EST. 9 Comments

Habana Labs is one of the companies working on an "AI" processor for speeding up deep learning inference and training workloads. Their initial product is the Goya processor, which is already production-qualified. Today they published initial open-source Linux kernel driver patches for review, with the aim of getting them into the mainline kernel.

The Habana Labs start-up has published quite compelling AI benchmarks that, for popular inference workloads, put Goya's performance ahead of the likes of the NVIDIA Tesla T4, Intel Cascade Lake, Xilinx Alveo, and other competing platforms. They claim this AI processor can achieve 15,000 images per second on ResNet-50. The Goya HL-1000 is geared primarily toward inference workloads, while for training they will also be releasing the Gaudi HL-2000, which is expected to begin sampling next quarter.

The AI processor consists of multiple fully-programmable Tensor Processing Cores, five separate DMA channels, a PCIe 4.0 x16 interface, and up to 16GB of DDR4 memory. Those wanting to learn more about the Goya hardware can visit Habana.ai.

Coming as a surprise this evening is the initial Habana Labs kernel driver for Linux, with initial support for the Goya processor. They intend to extend it for Gaudi in the months ahead once that hardware is available to customers. Published today was a set of 15 patches amounting to over 99,600 lines of code, a lot of it being header files.

The patch message does explain how work is offloaded to this AI processor:

The driver currently exposes a total of five IOCTLs. One IOCTL allows the application to submit workloads to the device, and another to wait on completion of submitted workloads. The other three IOCTLs are used for memory management, command buffer creation and information/status retrieval.

In addition, the driver exposes several sensors through the hwmon subsystem and provides various system-level information in sysfs for system administrators.
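As an illustration that is not part of the patch message: sensors exported through hwmon appear under /sys/class/hwmon, where each chip directory has a name file and temperature inputs report millidegrees Celsius. Below is a minimal Python sketch of reading them. The chip name "example_chip" and the fake directory tree in demo() are assumptions made so the sketch runs anywhere; which sensors the Habana driver actually exports is not stated here.

```python
import tempfile
from pathlib import Path


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return {chip_name: {sensor_file: degrees_celsius}} for each hwmon chip."""
    readings = {}
    for chip in sorted(Path(root).glob("hwmon*")):
        name_file = chip / "name"
        if not name_file.is_file():
            continue
        chip_name = name_file.read_text().strip()
        temps = {}
        for sensor in sorted(chip.glob("temp*_input")):
            try:
                # hwmon reports temperatures in millidegrees Celsius.
                temps[sensor.name] = int(sensor.read_text().strip()) / 1000.0
            except (OSError, ValueError):
                continue  # sensor unreadable or not numeric; skip it
        readings[chip_name] = temps
    return readings


def demo():
    """Build a fake hwmon tree (hypothetical chip) and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        chip = Path(tmp) / "hwmon0"
        chip.mkdir()
        (chip / "name").write_text("example_chip\n")
        (chip / "temp1_input").write_text("45000\n")
        return read_hwmon_temps(tmp)


if __name__ == "__main__":
    print(demo())
```

On a real system, calling read_hwmon_temps() with no argument walks whatever hwmon chips the kernel has registered.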

The first step for an application process is to open the correct hlX device it wants to work with. Calls to open create a new "context" for that application in the driver's internal structures and a unique ASID is assigned to that context. The context object lives until the process releases the file descriptor AND its command submissions have finished executing on the device.
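The lifetime rule described above (the context survives until the file descriptor is released and all submitted work has finished) can be modeled in a few lines. This is a toy Python model rather than the driver's code: the Device and Context classes, their method names, and the sequential ASID allocation are all hypothetical.

```python
class Context:
    """Toy per-process context; not the driver's real data structure."""

    def __init__(self, asid):
        self.asid = asid
        self.fd_open = True
        self.in_flight = 0      # submissions still executing on the device
        self.destroyed = False

    def submit(self):
        if not self.fd_open:
            raise RuntimeError("cannot submit after the descriptor is closed")
        self.in_flight += 1

    def complete(self):
        if self.in_flight == 0:
            raise RuntimeError("no submission in flight")
        self.in_flight -= 1
        self._maybe_destroy()

    def close(self):
        self.fd_open = False
        self._maybe_destroy()

    def _maybe_destroy(self):
        # Both conditions must hold: descriptor released AND no pending work.
        if not self.fd_open and self.in_flight == 0:
            self.destroyed = True


class Device:
    """Toy device that hands out one context, with a unique ASID, per open."""

    def __init__(self):
        self._next_asid = 1     # assumed allocation scheme, for illustration

    def open(self):
        ctx = Context(self._next_asid)
        self._next_asid += 1
        return ctx
```

In this model, closing the descriptor while work is still in flight leaves the context alive, and it is only torn down when the last submission completes.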

Next step is for the application to request information about the device, such as amount of DDR4 memory. The application then can go on to create command buffers for its command submissions and allocate and map device or host memory (host memory can only be mapped) to the internal device's MMU subsystem.

At this point the application can load various deep learning topologies to the device DDR memory. After that, it can start to submit inference workloads using those topologies. For each workload, the application receives a sequence number that represents the workload. The application can then query the driver regarding the status of the workload using that sequence number.
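To make the submit-then-query pattern concrete, here is a small Python simulation. It does not call the real driver and does not reproduce its IOCTL interface: ToyDevice, Status, and the finish() method (which stands in for the hardware completing a job) are assumptions for illustration only.

```python
from enum import Enum


class Status(Enum):
    BUSY = "busy"
    DONE = "done"


class ToyDevice:
    """Simulated device that tracks workloads by sequence number."""

    def __init__(self):
        self._next_seq = 1
        self._jobs = {}         # sequence number -> (workload, Status)

    def submit(self, workload):
        """Queue a workload and return the sequence number identifying it."""
        seq = self._next_seq
        self._next_seq += 1
        self._jobs[seq] = (workload, Status.BUSY)
        return seq

    def finish(self, seq):
        """Stand-in for the hardware signalling that a workload completed."""
        workload, _ = self._jobs[seq]
        self._jobs[seq] = (workload, Status.DONE)

    def query(self, seq):
        """Report the status of a previously submitted workload."""
        return self._jobs[seq][1]


if __name__ == "__main__":
    dev = ToyDevice()
    first = dev.submit("resnet50-batch-0")   # hypothetical workload labels
    second = dev.submit("resnet50-batch-1")
    dev.finish(first)
    print(first, dev.query(first))
    print(second, dev.query(second))
```

The point of the sequence number is that the application never holds a pointer into driver state; it just presents the number back to the driver whenever it wants to check on or wait for a given workload.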

Getting the code reviewed and merged into the mainline kernel will be a big undertaking, but it's great that Habana Labs has already open-sourced this Linux driver code. The kernel driver effort is being led by Oded Gabbay, the former Red Hat developer and maintainer of the AMDKFD kernel compute driver. Given his experience working with the upstream kernel community, this "habanalabs" driver stands a good chance of eventually landing in the mainline kernel.
