
Thread: Improving The Linux Kernel's Memory Performance

  1. #21
    Join Date
    Feb 2011
    Location
    France
    Posts
    214

    Quote Originally Posted by Shining Arcanine
    SSE4A is AMD's variant of Intel's SSE4 extensions.
    Not at all.
    http://en.wikipedia.org/wiki/SSE4

  2. #22

    Quote Originally Posted by RealNC
    No, that's not what -mtune is doing. It does not generate any instructions that would not run on other CPUs. It just applies changes that will work everywhere, but are known to result in faster execution. From the docs:

    "-mtune=cpu-type - Tune to cpu-type everything applicable about the generated code, except for the ABI and the set of available instructions."

    So no multiple code paths or anything. That's what the Intel compiler does. GCC doesn't provide that functionality.
    You are right, but my initial point about him having to recompile his kernel is correct. Since GCC is not generating multiple code paths, getting a kernel that uses SSE3 requires compiling it for SSE3.

    Quote Originally Posted by whitecat
    You say that, but your reference agrees with me:

    AMD currently supports only 4 instructions from the SSE4 instruction set, but have also added four new SSE instructions, naming the group SSE4a. These instructions are not found in Intel's processors supporting SSE4.1 and alternatively AMD processors aren't supporting Intel's SSE4.1. Support was added for SSE4a for unaligned SSE load-operation instructions (which formerly required 16-byte alignment).
    The general story is that some of the instructions Intel was implementing were hard to implement in the K8 architecture, so they picked the easy ones and added a few. It is AMD's version of SSE4.

  3. #23
    Join Date
    Feb 2011
    Location
    France
    Posts
    214

    Quote Originally Posted by Shining Arcanine
    You are right, but my initial point about him having to recompile his kernel is correct. Since GCC is not generating multiple code paths, getting a kernel that uses SSE3 requires compiling it for SSE3.
    A program that uses SSE3 to manipulate data and a program that is compiled with SSE3 are two different things.
    You don't have to compile the kernel with "sse3" to enable (i.e. "use") SSE3 in your program or kernel. I'm no specialist, but given the purpose of SSE3, I think that if you compile the kernel with SSE3, very few lines of code will actually turn into useful SSE3 instructions. Also, if the devs use SSE3 optimizations to speed up some functions, you don't have to compile your kernel with SSE3.


    Quote Originally Posted by Shining Arcanine
    You say that, but your reference agrees with me:
    SSE4 is:
    Intel -> SSE4.1 : 47 instructions implemented in Intel CPU.
    Intel -> SSE4.2 : 7 instructions implemented in Intel CPU.
    AMD -> SSE4a : 4 "intel" instructions (don't know which ones precisely) + 4 exclusive AMD instructions (not found in Intel implementation).


    Quote Originally Posted by Shining Arcanine
    It is AMD's version of SSE4.
    A very light version, then.
    AMD: 8 instructions (4 "intel" + 4 "amd exclusive")
    Intel: 54 instructions

  4. #24

    Quote Originally Posted by whitecat
    A program that uses SSE3 to manipulate data and a program that is compiled with SSE3 are two different things.
    You don't have to compile the kernel with "sse3" to enable (i.e. "use") SSE3 in your program or kernel. I'm no specialist, but given the purpose of SSE3, I think that if you compile the kernel with SSE3, very few lines of code will actually turn into useful SSE3 instructions. Also, if the devs use SSE3 optimizations to speed up some functions, you don't have to compile your kernel with SSE3.
    I am not a Linux kernel developer, but I have enough experience in software development that I have a decent idea of how the kernel works. If the processor family is not set to an SSE3-capable processor at the time it is compiled, then SSE3 should not be used in the kernel. The only exception to this should be if the compiler could generate multiple code paths, which I thought GCC could do, but RealNC demonstrated that I was wrong in thinking that.

    It is possible to write code for less capable x86 processor families that will automatically detect more capable x86 processor families and adjust the path to make things faster, but that is something that developers would rely on a compiler to do. It is difficult to maintain kernel code if you do your compiler's job for it. The reason for this is that the Linux kernel supports more than just x86. It doesn't make sense for the kernel developers to write hacks that second guess a specific processor family unless they do it in a way that benefits all architectures. I would be surprised if Linus Torvalds committed code that did this while only being relevant to a single architecture.

    Quote Originally Posted by whitecat

    SSE4 is:
    Intel -> SSE4.1 : 47 instructions implemented in Intel CPU.
    Intel -> SSE4.2 : 7 instructions implemented in Intel CPU.
    AMD -> SSE4a : 4 "intel" instructions (don't know which ones precisely) + 4 exclusive AMD instructions (not found in Intel implementation).



    A very light version, then.
    AMD: 8 instructions (4 "intel" + 4 "amd exclusive")
    Intel: 54 instructions
    Whatever it is called by either of us is pointless as it has no effect on how actual CPUs function. I still call it AMD's version, as I understand it to be a derivative extension.
    Last edited by Shining Arcanine; 08-17-2011 at 02:15 PM.

  5. #25
    Join Date
    Oct 2010
    Posts
    322

    Quote Originally Posted by Shining Arcanine
    It is possible to write code for less capable x86 processor families that will automatically detect more capable x86 processor families and adjust the path to make things faster, but that is something that developers would rely on a compiler to do. It is difficult to maintain kernel code if you do your compiler's job for it. The reason for this is that the Linux kernel supports more than just x86. It doesn't make sense for the kernel developers to write hacks that second guess a specific processor family unless they do it in a way that benefits all architectures. I would be surprised if Linus Torvalds committed code that did this while only being relevant to a single architecture.
    You are wrong about this; a classic example is software RAID. Run dmesg and grep for raid6 (assuming you have software RAID support built in) and you will get something that looks like this:

    raid6: int64x1 1715 MB/s
    raid6: int64x2 2401 MB/s
    raid6: int64x4 1701 MB/s
    raid6: int64x8 1626 MB/s
    raid6: sse2x1 2894 MB/s
    raid6: sse2x2 4924 MB/s
    raid6: sse2x4 5496 MB/s
    raid6: using algorithm sse2x4 (5496 MB/s)
    md: raid6 personality registered for level 6

    As you can see, the kernel developers did implement code that works only on x86 machines, and they also wrote a mini-benchmark to figure out which version is best for the CPU it is running on. If the SSE3 version of memcpy proves fast enough to justify the effort, they will do the same.
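
A minimal sketch of the pattern described above (several interchangeable implementations plus a start-up mini-benchmark that keeps the fastest), written in plain portable C. This is not the kernel's code: the names `xor_impl`, `xor_bytes`, `xor_words` and `pick_fastest` are hypothetical, and the buffer size and repeat count are arbitrary assumptions.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

typedef void (*xor_fn)(uint8_t *dst, const uint8_t *src, size_t n);

/* Candidate 1: one byte at a time. */
static void xor_bytes(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] ^= src[i];
}

/* Candidate 2: eight bytes at a time, byte-wise tail. */
static void xor_words(uint8_t *dst, const uint8_t *src, size_t n)
{
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        uint64_t a, b;
        memcpy(&a, dst + i, 8);   /* memcpy avoids alignment/aliasing issues */
        memcpy(&b, src + i, 8);
        a ^= b;
        memcpy(dst + i, &a, 8);
    }
    for (; i < n; i++)
        dst[i] ^= src[i];
}

/* Hypothetical table of candidates, each with a printable name. */
struct xor_impl {
    const char *name;
    xor_fn fn;
};

static const struct xor_impl impls[] = {
    { "bytes", xor_bytes },
    { "words", xor_words },
};

/* Time every candidate on the same buffers and return the fastest one. */
const struct xor_impl *pick_fastest(void)
{
    static uint8_t a[4096], b[4096];
    const struct xor_impl *best = &impls[0];
    clock_t best_t = 0;

    for (size_t k = 0; k < sizeof impls / sizeof impls[0]; k++) {
        clock_t t0 = clock();
        for (int r = 0; r < 2000; r++)
            impls[k].fn(a, b, sizeof a);
        clock_t t = clock() - t0;
        if (k == 0 || t < best_t) {
            best_t = t;
            best = &impls[k];
        }
    }
    return best;
}
```

Calling `pick_fastest()` once at start-up and storing the returned pointer gives every later caller the winning routine. An architecture-specific candidate would simply be one more row in the table.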

  6. #26
    Join Date
    Feb 2011
    Location
    France
    Posts
    214

    Quote Originally Posted by Shining Arcanine
    I am not a Linux kernel developer, but I have enough experience in software development that I have a decent idea of how the kernel works. If the processor family is not set to an SSE3-capable processor at the time it is compiled, then SSE3 should not be used in the kernel. The only exception to this should be if the compiler could generate multiple code paths, which I thought GCC could do, but RealNC demonstrated that I was wrong in thinking that.
    I only said that I can write lines of SSE3 assembly in my program, compile it with no special options (plain amd64, i.e. SSE2, for instance), and it will work well, with optimized SSE3 paths. This program can be the kernel or anything else. The only thing I have to do, obviously, is check at runtime whether SSE3 is available.
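
A minimal user-space sketch of that idea, assuming a reasonably recent GCC or Clang on x86. The SSE3 routine gets a per-function `target` attribute (intrinsics stand in here for hand-written assembly), so the file itself needs no `-msse3`, and `__builtin_cpu_supports` performs the runtime check. The function names are hypothetical, and kernel code would rely on its own CPU feature tests instead of this builtin.

```c
#include <stddef.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define HAVE_X86_DISPATCH 1
#include <pmmintrin.h>   /* SSE3 intrinsics such as _mm_hadd_ps */

/* Built for SSE3 regardless of the command-line -march/-m flags. */
__attribute__((target("sse3")))
static float hsum_sse3(const float *v)
{
    __m128 x = _mm_loadu_ps(v);   /* [a, b, c, d] */
    x = _mm_hadd_ps(x, x);        /* [a+b, c+d, a+b, c+d] */
    x = _mm_hadd_ps(x, x);        /* every lane holds a+b+c+d */
    return _mm_cvtss_f32(x);
}
#endif

/* Generic fallback that runs on any CPU. */
static float hsum_generic(const float *v)
{
    return v[0] + v[1] + v[2] + v[3];
}

/* Sum four floats, taking the SSE3 path only if the running CPU has it. */
float hsum4(const float *v)
{
#ifdef HAVE_X86_DISPATCH
    if (__builtin_cpu_supports("sse3"))
        return hsum_sse3(v);
#endif
    return hsum_generic(v);
}
```

The binary still starts on a CPU without SSE3, because the SSE3 instructions sit in a function that is only entered after the check succeeds.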

    Quote Originally Posted by Shining Arcanine
    Whatever it is called by either of us is pointless as it has no effect on how actual CPUs function. I still call it AMD's version, as I understand it to be a derivative extension.
    Google tells me SSE4a is not a derivative. Otherwise, explain to me how AMD engineers implemented 54 Intel instructions in only 8.

  7. #27
    Join Date
    Oct 2008
    Posts
    3,174

    Quote Originally Posted by Shining Arcanine
    It is possible to write code for less capable x86 processor families that will automatically detect more capable x86 processor families and adjust the path to make things faster, but that is something that developers would rely on a compiler to do. It is difficult to maintain kernel code if you do your compiler's job for it. The reason for this is that the Linux kernel supports more than just x86. It doesn't make sense for the kernel developers to write hacks that second guess a specific processor family unless they do it in a way that benefits all architectures. I would be surprised if Linus Torvalds committed code that did this while only being relevant to a single architecture.
    The kernel has to include a decent amount of architecture-specific assembly code. C and C++ have no way to express many of the instructions required simply to boot an OS and manage the CPU. Because a lot of this code is already present, kernel developers tend to be OK with adding more assembly for specific hotspots in the kernel or drivers, at least if it can be shown that such code makes a significant difference. Typically you just have to make sure there is a generic C path; then you can add optimizations for specific architectures without having to worry about keeping them portable.

    In fact, I believe the kernel's memcpy function is already written directly in assembly, likely just to ensure that everything was being done in the best way possible for such an important function. While using a compiler is generally better for large chunks of code, humans can often still beat it at optimizing small pieces and that can be important. So simply adding a new SSE3 case to the existing x86 assembly probably isn't a big deal.
    Last edited by smitty3268; 08-17-2011 at 04:02 PM.

  8. #28
    Join Date
    Oct 2008
    Posts
    3,174

    Quote Originally Posted by whitecat
    Google tells me SSE4a is not a derivative. Otherwise, explain to me how AMD engineers implemented 54 Intel instructions in only 8.
    SSE4a is not an SSE4 derivative at all. That's like calling 3DNow an SSE derivative; the only difference is that in the former case AMD took the SSE name rather than creating their own, something Intel wasn't very happy about, IIRC.

  9. #29

    Quote Originally Posted by smitty3268
    SSE4a is not an SSE4 derivative at all. That's like calling 3DNow an SSE derivative; the only difference is that in the former case AMD took the SSE name rather than creating their own, something Intel wasn't very happy about, IIRC.
    How is SSE4a not an SSE4 derivative if half of its instructions match SSE4 instructions in opcode, name and functionality?

    SSE was made after 3DNow, while SSE4a was made after Intel published its SSE4 extensions. The instructions provided by SSE and 3DNow do not intersect.

    I feel like these arguments that SSE4a is not an SSE4 derivative stem from the following rather than from any actual technical reason:

    http://arstechnica.com/science/news/...self-image.ars
    Last edited by Shining Arcanine; 08-18-2011 at 09:22 AM.

  10. #30

    Quote Originally Posted by Ansla
    You are wrong about this; a classic example is software RAID. Run dmesg and grep for raid6 (assuming you have software RAID support built in) and you will get something that looks like this:

    raid6: int64x1 1715 MB/s
    raid6: int64x2 2401 MB/s
    raid6: int64x4 1701 MB/s
    raid6: int64x8 1626 MB/s
    raid6: sse2x1 2894 MB/s
    raid6: sse2x2 4924 MB/s
    raid6: sse2x4 5496 MB/s
    raid6: using algorithm sse2x4 (5496 MB/s)
    md: raid6 personality registered for level 6

    As you can see, the kernel developers did implement code that works only on x86 machines, and they also wrote a mini-benchmark to figure out which version is best for the CPU it is running on. If the SSE3 version of memcpy proves fast enough to justify the effort, they will do the same.
    Thanks for that. I didn't realize that they had something like this in the code. Still, there are two things I said here. The first was that it was unlikely that the kernel developers would second guess the processor architecture. The second was that if the kernel developers implemented something like this, it would be done in a way that benefits all architectures. I was wrong about the first, but I think I am quite right about the second. Those messages suggest to me that they modularized the code that does this, so on other architectures, code for supporting similar extensions can be put in its place.

    My original point about having to do recompilation is not wrong though. The compiler won't generate its own SSE3 assembly unless it is told to do it by the build system, so strictly speaking, he would need to recompile his kernel to get SSE3 instructions into areas where the kernel developers did not do this manually.
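
A small sketch of the compile-time side of this point. GCC defines the `__SSE3__` macro only when SSE3 code generation is enabled (for example with `-msse3`, or with an `-march` value that implies it), so code guarded this way picks its path when it is built, not when it runs. The function name `built_path` is hypothetical.

```c
/* Report which path the compiler was allowed to build into this binary. */
const char *built_path(void)
{
#if defined(__SSE3__)
    return "sse3";      /* SSE3 code generation was enabled at build time */
#else
    return "generic";   /* baseline build: the compiler emits no SSE3 here */
#endif
}
```

Building the same file with and without `-msse3` changes the returned string, whereas `-mtune` on its own leaves `__SSE3__` undefined because it does not change the set of available instructions.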
    Last edited by Shining Arcanine; 08-18-2011 at 09:17 AM.
