Intel Prepares Linux Kernel Support For Advanced Matrix Extensions (AMX)


Phoronix: Intel Prepares Linux Kernel Support For Advanced Matrix Extensions (AMX)

Following this summer's announcement of Intel Advanced Matrix Extensions (AMX) as an exciting feature coming to Sapphire Rapids Xeon CPUs next year, Intel's open-source engineers quickly began posting patches for AMX support in LLVM and the GNU toolchain. Now Intel engineers have sent out their initial patches preparing the Linux kernel for AMX...


• #2
  Meanwhile @ Intel HQ.
  -"This Leenux guy. The one complaining about AVX being shit...?"
  "Yes sir?"
  -"Let's piss him off!"
  -"Do something equally stupid and call it... AMX! Yes! AMX! That'll teach him!"



• #3
  Originally posted by milkylainen View Post
  Meanwhile @ Intel HQ.
  -"This Leenux guy. The one complaining about AVX being shit...?"
  "Yes sir?"
  -"Let's piss him off!"
  -"Do something equally stupid and call it... AMX! Yes! AMX! That'll teach him!"

  It'll be hard to beat AVX-512, though.



• #4
  Wow, an entire new instruction set and registers, just to do fast dot-product multiplications... How... CISC.



• #5
  Intel will never do Cray-style vectors, because that would thwart their ability to segment the market through ISA extensions.

  By the way, expect this (and AVX-512) to become an "industry standard" when Zen 3 puts the final nail in their coffin.



• #6
  Seems like yet another almost-never-to-be-used instruction set, like AVX-512.
  The kernel would probably have to save yet another set of registers on each context switch?



• #7
  Originally posted by carewolf View Post
  Wow, an entire new instruction set and registers, just to do fast dot-product multiplications... How... CISC.

  If you think it's about computing dot products, you're missing the point. This is really about optimizing data movement, quite obviously for deep learning, as the initial data types (int8 and BFloat16) aren't good for much else (e.g. HPC) that will run on Sapphire Rapids CPUs.

  You can read more about it here:
  My hunch is that they do some further optimizations to reuse register contents when loading a tile from an overlapping position, as one does in convolutions.
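
  (Side note: the sub-features are enumerated separately in CPUID. A minimal C sketch of checking them, with bit positions per Intel's AMX ISA extensions reference; illustration only, not vetted detection code:)
  Code:
  #include <cpuid.h>
  #include <stdio.h>

  int main(void)
  {
      unsigned int eax, ebx, ecx, edx;

      /* CPUID leaf 7, sub-leaf 0: EDX bit 24 = AMX-TILE,
       * bit 25 = AMX-INT8, bit 22 = AMX-BF16. */
      if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
          return 1;

      printf("AMX-TILE: %s\n", (edx >> 24) & 1 ? "yes" : "no");
      printf("AMX-INT8: %s\n", (edx >> 25) & 1 ? "yes" : "no");
      printf("AMX-BF16: %s\n", (edx >> 22) & 1 ? "yes" : "no");
      return 0;
  }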
  Last edited by coder; 05 October 2020, 03:07 AM.



• #8
  Originally posted by Alex/AT View Post
  Seems like yet another almost-never-to-be-used instruction set, like AVX-512.

  Not to cast myself as an AMX proponent, but it's more constructive to think of it like the crypto-acceleration extensions. In both cases, they're very specialized instructions that only need to be supported by a few key libraries in order to reap the benefits. AVX is far more general than at least AMX's initial incarnation.

  Originally posted by Alex/AT View Post
  The kernel would probably have to save yet another set of registers on each context switch?

  Yes, on CPUs with the feature enabled, the context would bloat by over 8 KB (8 registers * 1024 bytes, plus configuration). However, I think TILERELEASE might allow saving/restoring of the AMX state to be skipped? It'd be nice if you only had to pay that penalty for threads currently using AMX, which might be the case.
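
  (The exact sizes are enumerable: CPUID leaf 0xD describes each XSAVE state component, the same walk the kernel does when sizing task FPU state. A quick C sketch, assuming GCC's cpuid.h; sub-leaf 18 = XTILEDATA per the AMX spec:)
  Code:
  #include <cpuid.h>
  #include <stdio.h>

  int main(void)
  {
      unsigned int size, offset, ecx, edx;

      /* CPUID.(EAX=0xD, ECX=18): EAX = byte size of the XTILEDATA
       * component (the eight 1 KB tile registers), EBX = its offset
       * in a standard XSAVE area. Component 17 is the 64-byte
       * XTILECFG. */
      if (!__get_cpuid_count(0x0D, 18, &size, &offset, &ecx, &edx) || size == 0) {
          puts("XTILEDATA not enumerated (no AMX)");
          return 0;
      }
      printf("XTILEDATA: %u bytes at XSAVE offset %u\n", size, offset);
      return 0;
  }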

  From what I can see, this really could've been a separate functional unit. Its registers are inaccessible to the CPU's other instructions, so it doesn't benefit much from sharing the CPU's execution pipeline. I think they'd probably have done better to extend their GPU with this functionality and add an iGPU block to some of their server CPUs.
  Last edited by coder; 05 October 2020, 03:16 AM.



• #9
  Originally posted by coder View Post
  If you think it's about computing dot products, you're missing the point. This is really about optimizing data movement, quite obviously for deep learning, as the initial data types (int8 and BFloat16) aren't good for much else (e.g. HPC) that will run on Sapphire Rapids CPUs.

  Note, however, that it only has two operations: TDPBF16PS and TDPB[XX]D, both dot products. You don't do data movement faster by moving it through tiles; it is already limited by memory speed. The only data "movement" that benefits from being done in tiles is rotation, and it doesn't do that.
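
  (For concreteness, here is roughly what one TDPBSSD step looks like through the intrinsics Intel added to GCC/LLVM: a minimal sketch assuming full 16x64 int8 tiles, illustrative strides, and an OS that has enabled the AMX state.)
  Code:
  /* Compile with: gcc -O2 -mamx-tile -mamx-int8 */
  #include <immintrin.h>
  #include <stdint.h>
  #include <string.h>

  /* The 64-byte tile-configuration blob consumed by LDTILECFG. */
  struct tilecfg {
      uint8_t  palette_id;
      uint8_t  start_row;
      uint8_t  reserved[14];
      uint16_t colsb[16];   /* bytes per row, per tile */
      uint8_t  rows[16];    /* row count, per tile     */
  };

  /* C[16][16] (int32) += A[16][64] (int8) . B (int8, VNNI-style layout) */
  void tile_dot(int32_t *c, const int8_t *a, const int8_t *b)
  {
      struct tilecfg cfg;
      memset(&cfg, 0, sizeof(cfg));
      cfg.palette_id = 1;
      cfg.rows[0] = 16; cfg.colsb[0] = 64;  /* tmm0: accumulator */
      cfg.rows[1] = 16; cfg.colsb[1] = 64;  /* tmm1: A           */
      cfg.rows[2] = 16; cfg.colsb[2] = 64;  /* tmm2: B           */
      _tile_loadconfig(&cfg);

      _tile_loadd(0, c, 64);   /* load accumulator tile       */
      _tile_loadd(1, a, 64);   /* load A                      */
      _tile_loadd(2, b, 64);   /* load B                      */
      _tile_dpbssd(0, 1, 2);   /* TDPBSSD: int8 dot products  */
      _tile_stored(0, c, 64);  /* write back the 16x16 int32s */
      _tile_release();         /* drop tile state (kernel can
                                  then skip saving it)        */
  }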



• #10
  Originally posted by carewolf View Post
  Note, however, that it only has two operations: TDPBF16PS and TDPB[XX]D, both dot products.

  I didn't say it doesn't do dot products, just that the raw computation isn't the key point of it.

  Originally posted by carewolf View Post
  You don't do data movement faster by moving it through tiles;

  Unless what you actually need is a tile arrangement!

  Originally posted by carewolf View Post
  it is already limited by memory speed.

  And why do you think that's not one of the problems they're trying to solve? One thing you're missing is the optimizations they can do behind the scenes. The hardware can track which memory region was loaded into a tile, and it can potentially shift the tile in place when you offset the load by one, so only the leading row or column needs to be fetched.

  It seems like you don't understand the specific problem they're trying to solve. Without that, I don't see how we can hope to have a meaningful discussion about their solution.

  Look at the types of computations it accelerates, boil those down to AVX-512 operations, and you'll see my point. AVX2/AVX-512 suck at small, non-separable 2D convolutions, and it's all the data-movement overhead that really hurts. Data movement, however, is cheap to do in hardware.
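
  (To make the overlap concrete, a plain scalar sketch, not AMX code: with a 3x3 kernel at stride 1, adjacent output pixels share six of their nine inputs, which is exactly the reuse a shiftable tile could exploit.)
  Code:
  /* Naive 3x3 convolution over a w x h single-channel image. */
  void conv3x3(const float *in, float *out, const float k[9], int w, int h)
  {
      for (int y = 0; y < h - 2; y++)
          for (int x = 0; x < w - 2; x++) {
              float acc = 0.0f;
              for (int ky = 0; ky < 3; ky++)
                  for (int kx = 0; kx < 3; kx++)
                      /* The windows at x and x+1 overlap in 6 of these
                       * 9 loads; a SIMD version burns shuffle work
                       * re-aligning them, whereas a tile shifted in
                       * place would fetch only the one new column. */
                      acc += in[(y + ky) * w + (x + kx)] * k[ky * 3 + kx];
              out[y * (w - 2) + x] = acc;
          }
  }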

  Although I take issue with these being implemented as CPU instructions, I get what Intel is trying to do here. It's somewhat analogous to what Nvidia did with the tensor "cores" in their GPUs, with similar potential benefits.
