But if the code was generated by LLVM, wouldn't it then be possible to recover some info through decompilation?
No. At most, it could recover a more accurate representation of the control flow (applying the inverse of how LLVM lowers it should suffice, reducing the amount of heuristics needed), but local variable names are lost at compile time, whichever compiler is used. Only debug builds retain those.
Decompilation mostly serves to give you something to analyze. Rebuilding will likely not work without some analysis, as the decompiler probably can't guess which headers were included, and for bugfixing you need to know what you are looking for and to understand the code.
I should also note that last I checked, in the EU you're legally allowed to RE for interoperability purposes using any means necessary. So if my memory is correct, clean room is not needed here.
Indeed. But other jurisdictions might require clean room.
So the usual workflow is:
- take developers in an RE-friendly country (Russia is an example)
- RE the fuck out of your target by any means possible (a decompiler such as libbeauty, for example)
- use this code to analyse the workflow
- use pieces of the code to compile small proof-of-concept tests (see the few open-source Skype projects in Russia)
- document how these tests work.
- take a second team of developers.
- have these developers read the documentation produced by the first team (but not the actual decompiled code, to avoid tainting)
- have these developers code their own re-implementation of the same functionality from scratch.
(add in some exchange between the two teams to clear up points where the documentation is lacking, unclear or ambiguous)
- now you have your own new implementation, which you can release worldwide as GPL or BSD licensed code.
(Note: beware of patents. If need be, try implementing the same functionality using a different approach. It helps if the patented algorithm is a special case of a more generic approach which wasn't patented, e.g. the patented arithmetic coding is a special type of the unpatented range coding, where the range is defined as 0:1 using real numbers.)
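To make the arithmetic/range coding remark a bit more concrete, here is a toy sketch of the interval-narrowing step both schemes share. It is only an illustration under simplifying assumptions: the function name `narrow` and the frequency table are made up, there is no renormalisation or bit output, and the integer range is chosen so that every division is exact. The only difference between the two calls is whether the starting range is an integer interval or 0:1 in exact fractions.

```python
from fractions import Fraction

# Hypothetical symbol frequencies, for illustration only.
FREQS = {"a": 2, "b": 1, "c": 1}

def narrow(low, width, message, freqs):
    """Narrow the interval [low, low + width) once per symbol."""
    total = sum(freqs.values())
    # Cumulative start of each symbol's slice, in a fixed (sorted) order.
    starts, acc = {}, 0
    for sym in sorted(freqs):
        starts[sym] = acc
        acc += freqs[sym]
    for sym in message:
        # Integer division for the integer range, exact division for fractions.
        step = width / total if isinstance(width, Fraction) else width // total
        low += step * starts[sym]
        width = step * freqs[sym]
    return low, width

# Range-coding flavour: integer interval [0, 4**4).
print(narrow(0, 4**4, "abca", FREQS))                      # (88, 4)
# Arithmetic-coding flavour: the same steps on the interval 0:1.
print(narrow(Fraction(0), Fraction(1), "abca", FREQS))     # (Fraction(11, 32), Fraction(1, 64))
```

Dividing the integer result by the starting range (88/256 and 4/256) gives exactly the fractional result, which is the sense in which the 0:1 version is one particular choice of range. Whether any given implementation steers clear of a given patent is a separate question this sketch says nothing about.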
Speaking of decompilation, this brings back fond memories of the old-school assembly decompilers of the DOS era, which not only tried to put assembler mnemonics to machine code (like any debugger does) but also tried to track memory locations, and even tracked the exact APIs used (INT calls, like INT 21h, INT 10h, INT 13h, and such) and the ports used (recognising quite a bunch of hardware components), and tried to add useful comments and meaningful variable names. I managed to learn how to output WAV samples to the PC speaker simply by analyzing such RE output (the comments were that useful).
Perhaps, with some API tracking, libbeauty could manage some of the same.
(recognise some variables from the API calls where they are used, e.g. "char *format" instead of "char *str_1398" if that string pointer is used as the format in subsequent fprintf calls).
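A rough sketch of what I mean, in Python just to keep it short. This is not how libbeauty works (I have no idea whether it does anything like this); the table `KNOWN_PARAMS`, the placeholder pattern and the function `suggest_names` are all made up for illustration. The idea: walk over the calls the decompiler found, and whenever a placeholder-named variable is passed to a known API function, suggest the conventional name of that parameter.

```python
import re

# Hypothetical table: conventional parameter names for a few known API functions.
KNOWN_PARAMS = {
    "fprintf": ["stream", "format"],
    "printf": ["format"],
    "fopen": ["path", "mode"],
}

# Hypothetical pattern for auto-generated names such as str_1398 or var_12.
PLACEHOLDER = re.compile(r"^(str|var|local)_\d+$")

def suggest_names(calls):
    """calls: iterable of (function_name, [argument variable names]).

    Returns a dict mapping placeholder names to suggested names.
    The first API use of a variable wins.
    """
    suggestions = {}
    for func, args in calls:
        params = KNOWN_PARAMS.get(func, [])
        # zip stops at the shorter list, so variadic extras are ignored.
        for arg, param in zip(args, params):
            if PLACEHOLDER.match(arg) and arg not in suggestions:
                suggestions[arg] = param
    return suggestions

calls = [
    ("fprintf", ["var_12", "str_1398", "var_40"]),
    ("fopen", ["str_77", "str_78"]),
]
print(suggest_names(calls))
# {'var_12': 'stream', 'str_1398': 'format', 'str_77': 'path', 'str_78': 'mode'}
```

A real tool would need to deal with name clashes (two variables both wanting to be "format") and with variables used by several APIs, which this sketch ignores.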