Announcement

Collapse
No announcement yet.

LPython Is The Latest Python Implementation Aiming To Be Very Fast, Multiple Backends

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • LPython Is The Latest Python Implementation Aiming To Be Very Fast, Multiple Backends

    Phoronix: LPython Is The Latest Python Implementation Aiming To Be Very Fast, Multiple Backends

    LPython is the latest open-source Python implementation aiming to be a very performant version of Python among other interesting features...

    Phoronix, Linux Hardware Reviews, Linux hardware benchmarks, Linux server benchmarks, Linux benchmarking, Desktop Linux, Linux performance, Open Source graphics, Linux How To, Ubuntu benchmarks, Ubuntu hardware, Phoronix Test Suite

  • #2
    I wonder if "dropping in" LPython or Mojo as a replacement for the python interpreter would ever be practical? Maybe even the default for performance focused distros like CachyOS and Clear?

    I ran Pyston Lite in CachyOS before the 3.11 migration, and the only thing that broke was Pytorch's triton compilation (which is quite understandable).


    ​​​​I am hoping for something similar in Java land, with GraalVM (or Azul's Falcon JIT if it ever goes open source) replacing the default JDK.
    Last edited by brucethemoose; 29 July 2023, 01:29 PM.

    Comment


    • #3
      Well...
      You do not have permission to view this gallery.
      This gallery has 1 photos.

      Comment


      • #4
        Originally posted by brucethemoose View Post
        I wonder if "dropping in" LPython or Mojo as a replacement for the python interpreter would ever be practical? .
        TL;DR: No, you cannot use it as a dropping in interpreter also it is a fast implementation from their own benchmarks.

        It is quite interesting that they showed their AoT Python Complier generated faster code than clang++/g++ on a Dijkstra implementation. But to archive this you are writing vendor specific python because you have to do something like
        Code:
        from lpython import i32
        to explicitly set the data type. So the performance benefit does require modification of code and your python code has to be type-annotated. This could never be a drop-in replacement for traditional python interpreters.

        A question: How many python implementation do we have nowadays? Has anyone ever counted it?

        Comment


        • #5
          That sounds similar to Cython.

          Comment


          • #6
            Originally posted by gnattu View Post

            TL;DR: No, you cannot use it as a dropping in interpreter also it is a fast implementation from their own benchmarks.

            It is quite interesting that they showed their AoT Python Complier generated faster code than clang++/g++ on a Dijkstra implementation. But to archive this you are writing vendor specific python because you have to do something like
            Code:
            from lpython import i32
            to explicitly set the data type. So the performance benefit does require modification of code and your python code has to be type-annotated. This could never be a drop-in replacement for traditional python interpreters.

            A question: How many python implementation do we have nowadays? Has anyone ever counted it?
            Never ever found a code snippet that would be faster when done in Python than in C/C++, unless the algorithms (and implementations) used were at different level of sophistication. If anyone has a real one under which it really occurs, I'll, gladly, take a look.

            Comment


            • #7
              Originally posted by acobar View Post

              Never ever found a code snippet that would be faster when done in Python than in C/C++, unless the algorithms (and implementations) used were at different level of sophistication. If anyone has a real one under which it really occurs, I'll, gladly, take a look.
              The following is from their official website:

              The lpython implementation:
              Code:
              from lpython import i32
              
              def dijkstra_shortest_path(n: i32, source: i32) -> i32:
                  i: i32; j: i32; v: i32; u: i32; mindist: i32; alt: i32; dummy: i32; uidx: i32
                  dist_sum: i32;
                  graph: dict[i32, i32] = {}
                  dist: dict[i32, i32] = {}
                  prev: dict[i32, i32] = {}
                  visited: dict[i32, bool] = {}
                  Q: list[i32] = []
              
                  for i in range(n):
                      for j in range(n):
                          graph[n * i + j] = abs(i - j)
              
                  for v in range(n):
                      dist[v] = 2147483647
                      prev[v] = -1
                      Q.append(v)
                      visited[v] = False
                  dist[source] = 0
              
                  while len(Q) > 0:
                      u = -1
                      mindist = 2147483647
                      for i in range(len(Q)):
                          if mindist > dist[Q[i]]:
                              mindist = dist[Q[i]]
                              u = Q[i]
                              uidx = i
                      dummy = Q.pop(uidx)
                      visited[u] = True
              
                      for v in range(n):
                          if v != u and not visited[v]:
                              alt = dist[u] + graph[n * u + v]
              
                              if alt < dist[v]:
                                  dist[v] = alt
                                  prev[v] = u
              
                  dist_sum = 0
                  for i in range(n):
                      dist_sum += dist[i]
                  return dist_sum
              
              
              def test():
                  n: i32 = 4000
                  print(dijkstra_shortest_path(n, 0))
              
              test()​
              And the C++ implementation:

              Code:
              #include <iostream>
              #include <unordered_map>
              #include <vector>
              
              int32_t dijkstra_shortest_path(int32_t n, int32_t source) {
                  int32_t i, j, v, u, mindist, alt, dummy, uidx;
                  std::unordered_map<int32_t, int32_t> graph, dist, prev;
                  std::unordered_map<int32_t, bool> visited;
                  std::vector<int32_t> Q;
              
                  for(i = 0; i < n; i++) {
                      for(j = 0; j < n; j++) {
                          graph[n * i + j] = std::abs(i - j);
                      }
                  }
              
                  for(v = 0; v < n; v++) {
                      dist[v] = 2147483647;
                      prev[v] = -1;
                      Q.push_back(v);
                      visited[v] = false;
                  }
                  dist[source] = 0;
              
                  while(Q.size() > 0) {
                      u = -1;
                      mindist = 2147483647;
                      for(i = 0; i < Q.size(); i++) {
                          if( mindist > dist[Q[i]] ) {
                              mindist = dist[Q[i]];
                              u = Q[i];
                              uidx = i;
                          }
                      }
                      Q.erase(Q.begin() + uidx);
                      visited[u] = true;
              
                      for(v = 0; v < n; v++) {
                          if( v != u and not visited[v] ) {
                              alt = dist[u] + graph[n * u + v];
              
                              if( alt < dist[v] ) {
                                  dist[v] = alt;
                                  prev[v] = u;
                              }
                          }
                      }
                  }
              
                  int32_t dist_sum = 0;
                  for(i = 0; i < n; i++) {
                      dist_sum += dist[i];
                  }
                  return dist_sum;
              }
              
              
              int main() {
                  int32_t n = 4000;
                  int32_t dist_sum = dijkstra_shortest_path(n, 0);
                  std::cout<<dist_sum<<std::endl;
                  return 0;
              }​
              ​
              The Lpython complier used --fast flag and g++/clang++ used -ffast-math -funroll-loops -O3 flag. The Lpython generated code is faster on both x86-64(tested with AMD Ryzen 2500U) and arm64 (tested with M1 Pro).

              Comment


              • #8
                Originally posted by gnattu View Post

                The following is from their official website:

                The lpython implementation:
                Code:
                from lpython import i32
                
                def dijkstra_shortest_path(n: i32, source: i32) -> i32:
                i: i32; j: i32; v: i32; u: i32; mindist: i32; alt: i32; dummy: i32; uidx: i32
                dist_sum: i32;
                graph: dict[i32, i32] = {}
                dist: dict[i32, i32] = {}
                prev: dict[i32, i32] = {}
                visited: dict[i32, bool] = {}
                Q: list[i32] = []
                
                for i in range(n):
                for j in range(n):
                graph[n * i + j] = abs(i - j)
                
                for v in range(n):
                dist[v] = 2147483647
                prev[v] = -1
                Q.append(v)
                visited[v] = False
                dist[source] = 0
                
                while len(Q) > 0:
                u = -1
                mindist = 2147483647
                for i in range(len(Q)):
                if mindist > dist[Q[i]]:
                mindist = dist[Q[i]]
                u = Q[i]
                uidx = i
                dummy = Q.pop(uidx)
                visited[u] = True
                
                for v in range(n):
                if v != u and not visited[v]:
                alt = dist[u] + graph[n * u + v]
                
                if alt < dist[v]:
                dist[v] = alt
                prev[v] = u
                
                dist_sum = 0
                for i in range(n):
                dist_sum += dist[i]
                return dist_sum
                
                
                def test():
                n: i32 = 4000
                print(dijkstra_shortest_path(n, 0))
                
                test()​
                And the C++ implementation:

                Code:
                #include <iostream>
                #include <unordered_map>
                #include <vector>
                
                int32_t dijkstra_shortest_path(int32_t n, int32_t source) {
                int32_t i, j, v, u, mindist, alt, dummy, uidx;
                std::unordered_map<int32_t, int32_t> graph, dist, prev;
                std::unordered_map<int32_t, bool> visited;
                std::vector<int32_t> Q;
                
                for(i = 0; i < n; i++) {
                for(j = 0; j < n; j++) {
                graph[n * i + j] = std::abs(i - j);
                }
                }
                
                for(v = 0; v < n; v++) {
                dist[v] = 2147483647;
                prev[v] = -1;
                Q.push_back(v);
                visited[v] = false;
                }
                dist[source] = 0;
                
                while(Q.size() > 0) {
                u = -1;
                mindist = 2147483647;
                for(i = 0; i < Q.size(); i++) {
                if( mindist > dist[Q[i]] ) {
                mindist = dist[Q[i]];
                u = Q[i];
                uidx = i;
                }
                }
                Q.erase(Q.begin() + uidx);
                visited[u] = true;
                
                for(v = 0; v < n; v++) {
                if( v != u and not visited[v] ) {
                alt = dist[u] + graph[n * u + v];
                
                if( alt < dist[v] ) {
                dist[v] = alt;
                prev[v] = u;
                }
                }
                }
                }
                
                int32_t dist_sum = 0;
                for(i = 0; i < n; i++) {
                dist_sum += dist[i];
                }
                return dist_sum;
                }
                
                
                int main() {
                int32_t n = 4000;
                int32_t dist_sum = dijkstra_shortest_path(n, 0);
                std::cout<<dist_sum<<std::endl;
                return 0;
                }​
                ​
                The Lpython complier used --fast flag and g++/clang++ used -ffast-math -funroll-loops -O3 flag. The Lpython generated code is faster on both x86-64(tested with AMD Ryzen 2500U) and arm64 (tested with M1 Pro).
                Will try their implementation and compare the generated code. It will be a nice surprise if their claims end up to be true. I'm kind of skeptical, though, as I saw, many times, these kinds of claims, be debunked.

                Comment


                • #9
                  Nice with a Python implementation that uses a standard open source license, unlike the official Python which uses a vanity license.

                  Comment


                  • #10
                    Originally posted by gnattu View Post

                    The following is from their official website:

                    [ ... ]

                    The Lpython complier used --fast flag and g++/clang++ used -ffast-math -funroll-loops -O3 flag. The Lpython generated code is faster on both x86-64(tested with AMD Ryzen 2500U) and arm64 (tested with M1 Pro).
                    That is ... very bad C++. Replace

                    Code:
                    std::unordered_map<int32_t, int32_t> graph, dist, prev;
                    with
                    Code:
                    std::unordered_map<int32_t, int32_t> dist, prev;
                    std::vector<int32_t> graph;
                    graph.resize(n*n);
                    for a 3x speedup (intel core i5 m460) . Next to try would be the map of bools, but I don't think that would have much effect. And that is before I even look at the algorithm itself. Let alone start optimizing for cache lines...
                    unordered_map has specific uses, this is not one of them...

                    So, yeah, for a quick & dirty PoC that is actually usably fast, lpython might do fine.

                    edit: another 3x speedup comes simply from making dist & prev vectors of int32_t, and resizing to n. This is C++ as a python programmer would write. Their use of the word "equivalent" on the blog is to be taken literally: not only equivalent algorithms, but equivalent code constructs. Even it they're totally inappropriate in the other language.
                    Last edited by Serafean; 29 July 2023, 03:24 PM.

                    Comment

                    Working...
                    X