Announcement

**brucethemoose** · 29 July 2023, 01:27 PM

I wonder if "dropping in" LPython or Mojo as a replacement for the python interpreter would ever be practical? Maybe even the default for performance focused distros like CachyOS and Clear?

I ran Pyston Lite in CachyOS before the 3.11 migration, and the only thing that broke was Pytorch's triton compilation (which is quite understandable).

I am hoping for something similar in Java land, with GraalVM (or Azul's Falcon JIT if it ever goes open source) replacing the default JDK.

**raystriker** · 29 July 2023, 01:28 PM

Well...

**gnattu** · 29 July 2023, 01:55 PM

Originally posted by brucethemoose View Post

I wonder if "dropping in" LPython or Mojo as a replacement for the python interpreter would ever be practical? .

TL;DR: No, you cannot use it as a dropping in interpreter also it is a fast implementation from their own benchmarks.

It is quite interesting that they showed their AoT Python Complier generated faster code than clang++/g++ on a Dijkstra implementation. But to archive this you are writing vendor specific python because you have to do something like

Code:

from lpython import i32

to explicitly set the data type. So the performance benefit does require modification of code and your python code has to be type-annotated. This could never be a drop-in replacement for traditional python interpreters.

A question: How many python implementation do we have nowadays? Has anyone ever counted it?

**Lycanthropist** · 29 July 2023, 01:58 PM

That sounds similar to Cython.

**acobar** · 29 July 2023, 02:17 PM

Originally posted by gnattu View Post

TL;DR: No, you cannot use it as a dropping in interpreter also it is a fast implementation from their own benchmarks.

It is quite interesting that they showed their AoT Python Complier generated faster code than clang++/g++ on a Dijkstra implementation. But to archive this you are writing vendor specific python because you have to do something like

Code:

from lpython import i32

to explicitly set the data type. So the performance benefit does require modification of code and your python code has to be type-annotated. This could never be a drop-in replacement for traditional python interpreters.

A question: How many python implementation do we have nowadays? Has anyone ever counted it?

Never ever found a code snippet that would be faster when done in Python than in C/C++, unless the algorithms (and implementations) used were at different level of sophistication. If anyone has a real one under which it really occurs, I'll, gladly, take a look.

**gnattu** · 29 July 2023, 02:31 PM

Originally posted by acobar View Post

Never ever found a code snippet that would be faster when done in Python than in C/C++, unless the algorithms (and implementations) used were at different level of sophistication. If anyone has a real one under which it really occurs, I'll, gladly, take a look.

The following is from their official website:

The lpython implementation:

Code:

from lpython import i32

def dijkstra_shortest_path(n: i32, source: i32) -> i32:
    i: i32; j: i32; v: i32; u: i32; mindist: i32; alt: i32; dummy: i32; uidx: i32
    dist_sum: i32;
    graph: dict[i32, i32] = {}
    dist: dict[i32, i32] = {}
    prev: dict[i32, i32] = {}
    visited: dict[i32, bool] = {}
    Q: list[i32] = []

    for i in range(n):
        for j in range(n):
            graph[n * i + j] = abs(i - j)

    for v in range(n):
        dist[v] = 2147483647
        prev[v] = -1
        Q.append(v)
        visited[v] = False
    dist[source] = 0

    while len(Q) > 0:
        u = -1
        mindist = 2147483647
        for i in range(len(Q)):
            if mindist > dist[Q[i]]:
                mindist = dist[Q[i]]
                u = Q[i]
                uidx = i
        dummy = Q.pop(uidx)
        visited[u] = True

        for v in range(n):
            if v != u and not visited[v]:
                alt = dist[u] + graph[n * u + v]

                if alt < dist[v]:
                    dist[v] = alt
                    prev[v] = u

    dist_sum = 0
    for i in range(n):
        dist_sum += dist[i]
    return dist_sum


def test():
    n: i32 = 4000
    print(dijkstra_shortest_path(n, 0))

test()

And the C++ implementation:

Code:

#include <iostream>
#include <unordered_map>
#include <vector>

int32_t dijkstra_shortest_path(int32_t n, int32_t source) {
    int32_t i, j, v, u, mindist, alt, dummy, uidx;
    std::unordered_map<int32_t, int32_t> graph, dist, prev;
    std::unordered_map<int32_t, bool> visited;
    std::vector<int32_t> Q;

    for(i = 0; i < n; i++) {
        for(j = 0; j < n; j++) {
            graph[n * i + j] = std::abs(i - j);
        }
    }

    for(v = 0; v < n; v++) {
        dist[v] = 2147483647;
        prev[v] = -1;
        Q.push_back(v);
        visited[v] = false;
    }
    dist[source] = 0;

    while(Q.size() > 0) {
        u = -1;
        mindist = 2147483647;
        for(i = 0; i < Q.size(); i++) {
            if( mindist > dist[Q[i]] ) {
                mindist = dist[Q[i]];
                u = Q[i];
                uidx = i;
            }
        }
        Q.erase(Q.begin() + uidx);
        visited[u] = true;

        for(v = 0; v < n; v++) {
            if( v != u and not visited[v] ) {
                alt = dist[u] + graph[n * u + v];

                if( alt < dist[v] ) {
                    dist[v] = alt;
                    prev[v] = u;
                }
            }
        }
    }

    int32_t dist_sum = 0;
    for(i = 0; i < n; i++) {
        dist_sum += dist[i];
    }
    return dist_sum;
}


int main() {
    int32_t n = 4000;
    int32_t dist_sum = dijkstra_shortest_path(n, 0);
    std::cout<<dist_sum<<std::endl;
    return 0;
}

The Lpython complier used --fast flag and g++/clang++ used -ffast-math -funroll-loops -O3 flag. The Lpython generated code is faster on both x86-64(tested with AMD Ryzen 2500U) and arm64 (tested with M1 Pro).

**acobar** · 29 July 2023, 02:46 PM

Originally posted by gnattu View Post

The following is from their official website:

The lpython implementation:

Code:

from lpython import i32

def dijkstra_shortest_path(n: i32, source: i32) -> i32:
i: i32; j: i32; v: i32; u: i32; mindist: i32; alt: i32; dummy: i32; uidx: i32
dist_sum: i32;
graph: dict[i32, i32] = {}
dist: dict[i32, i32] = {}
prev: dict[i32, i32] = {}
visited: dict[i32, bool] = {}
Q: list[i32] = []

for i in range(n):
for j in range(n):
graph[n * i + j] = abs(i - j)

for v in range(n):
dist[v] = 2147483647
prev[v] = -1
Q.append(v)
visited[v] = False
dist[source] = 0

while len(Q) > 0:
u = -1
mindist = 2147483647
for i in range(len(Q)):
if mindist > dist[Q[i]]:
mindist = dist[Q[i]]
u = Q[i]
uidx = i
dummy = Q.pop(uidx)
visited[u] = True

for v in range(n):
if v != u and not visited[v]:
alt = dist[u] + graph[n * u + v]

if alt < dist[v]:
dist[v] = alt
prev[v] = u

dist_sum = 0
for i in range(n):
dist_sum += dist[i]
return dist_sum


def test():
n: i32 = 4000
print(dijkstra_shortest_path(n, 0))

test()

And the C++ implementation:

Code:

#include <iostream>
#include <unordered_map>
#include <vector>

int32_t dijkstra_shortest_path(int32_t n, int32_t source) {
int32_t i, j, v, u, mindist, alt, dummy, uidx;
std::unordered_map<int32_t, int32_t> graph, dist, prev;
std::unordered_map<int32_t, bool> visited;
std::vector<int32_t> Q;

for(i = 0; i < n; i++) {
for(j = 0; j < n; j++) {
graph[n * i + j] = std::abs(i - j);
}
}

for(v = 0; v < n; v++) {
dist[v] = 2147483647;
prev[v] = -1;
Q.push_back(v);
visited[v] = false;
}
dist[source] = 0;

while(Q.size() > 0) {
u = -1;
mindist = 2147483647;
for(i = 0; i < Q.size(); i++) {
if( mindist > dist[Q[i]] ) {
mindist = dist[Q[i]];
u = Q[i];
uidx = i;
}
}
Q.erase(Q.begin() + uidx);
visited[u] = true;

for(v = 0; v < n; v++) {
if( v != u and not visited[v] ) {
alt = dist[u] + graph[n * u + v];

if( alt < dist[v] ) {
dist[v] = alt;
prev[v] = u;
}
}
}
}

int32_t dist_sum = 0;
for(i = 0; i < n; i++) {
dist_sum += dist[i];
}
return dist_sum;
}


int main() {
int32_t n = 4000;
int32_t dist_sum = dijkstra_shortest_path(n, 0);
std::cout<<dist_sum<<std::endl;
return 0;
}

The Lpython complier used --fast flag and g++/clang++ used -ffast-math -funroll-loops -O3 flag. The Lpython generated code is faster on both x86-64(tested with AMD Ryzen 2500U) and arm64 (tested with M1 Pro).

Will try their implementation and compare the generated code. It will be a nice surprise if their claims end up to be true. I'm kind of skeptical, though, as I saw, many times, these kinds of claims, be debunked.

**uid313** · 29 July 2023, 02:52 PM

Nice with a Python implementation that uses a standard open source license, unlike the official Python which uses a vanity license.

**Serafean** · 29 July 2023, 03:07 PM

Originally posted by gnattu View Post

The following is from their official website:

[ ... ]

The Lpython complier used --fast flag and g++/clang++ used -ffast-math -funroll-loops -O3 flag. The Lpython generated code is faster on both x86-64(tested with AMD Ryzen 2500U) and arm64 (tested with M1 Pro).

That is ... very bad C++. Replace

Code:

std::unordered_map<int32_t, int32_t> graph, dist, prev;

with

Code:

std::unordered_map<int32_t, int32_t> dist, prev;
std::vector<int32_t> graph;
graph.resize(n*n);

for a 3x speedup (intel core i5 m460) . Next to try would be the map of bools, but I don't think that would have much effect. And that is before I even look at the algorithm itself. Let alone start optimizing for cache lines...
unordered_map has specific uses, this is not one of them...

So, yeah, for a quick & dirty PoC that is actually usably fast, lpython might do fine.

edit: another 3x speedup comes simply from making dist & prev vectors of int32_t, and resizing to n. This is C++ as a python programmer would write. Their use of the word "equivalent" on the blog is to be taken literally: not only equivalent algorithms, but equivalent code constructs. Even it they're totally inappropriate in the other language.

Announcement

LPython Is The Latest Python Implementation Aiming To Be Very Fast, Multiple Backends

LPython Is The Latest Python Implementation Aiming To Be Very Fast, Multiple Backends

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment