
Originally Posted by
ldesnogu
I think you are wrong: the L/S instructions have their own pipe, and issue can send instructions both to that pipe and to NEON pipes. Did you try it?

Yes, of course. I learned long ago that nobody can be trusted (neither random dudes on the Internet nor the people I actually consider to be quite knowledgeable). Documentation can't be trusted without verification either (not to mention that it is often incomplete or vague). It goes without saying that I can't be trusted either.
Long ago I ran into the sad fact that, in practice, Cortex-A9 is unable to dual-issue NEON instructions with any L/S instructions (both ARM and NEON). The Cortex-A9 NEON Media Processing Engine Technical Reference Manual says "with the exception of simultaneous loads and stores, the processor can execute VFP and Advanced SIMD instructions in parallel with ARM or Thumb instructions", which is admittedly not very clear. But there is no need for guessing and misinterpreting, because we can easily run a simple benchmark program:
Code:
        .text
        .arch armv7-a
        .fpu neon
        .global main
#ifndef CPU_CLOCK_FREQUENCY
#error CPU_CLOCK_FREQUENCY must be defined
#endif
#define LOOP_UNROLL_FACTOR 20
        .func main
main:
        push {r4-r12, lr}
        /* iterations * LOOP_UNROLL_FACTOR = CPU_CLOCK_FREQUENCY blocks in total */
        ldr ip, =(CPU_CLOCK_FREQUENCY / LOOP_UNROLL_FACTOR)
        b 1f
        .balign 64
1:
        .rept LOOP_UNROLL_FACTOR
        /* one block: 8 NEON instructions, optionally with an ARM load in the middle */
        vorr d30, d30, d30
        vorr d31, d31, d31
        vorr d30, d30, d30
        vorr d31, d31, d31
#ifdef DO_ARM_LDR
        ldr r0, [sp]
#endif
        vorr d30, d30, d30
        vorr d31, d31, d31
        vorr d30, d30, d30
        vorr d31, d31, d31
2:
        .endr
        subs ip, ip, #1
        bne 1b
        mov r0, #0
        pop {r4-r12, pc}
        .endfunc
Cortex-A9:
Code:
$ gcc -DCPU_CLOCK_FREQUENCY=1200000000 bench_mixed_ldr_neon.S && time ./a.out
real 0m8.093s
user 0m8.080s
sys 0m0.000s
$ gcc -DCPU_CLOCK_FREQUENCY=1200000000 -DDO_ARM_LDR=1 bench_mixed_ldr_neon.S && time ./a.out
real 0m9.048s
user 0m9.035s
sys 0m0.000s
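On Cortex-A9, adding the LDR instruction costs an extra cycle per block of eight NEON instructions (roughly 9 seconds instead of 8), so the load is not dual-issued with the NEON instructions.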
Cortex-A8:
Code:
$ gcc -DCPU_CLOCK_FREQUENCY=1000000000 bench_mixed_ldr_neon.S && time ./a.out
real 0m8.018s
user 0m8.016s
sys 0m0.000s
$ gcc -DCPU_CLOCK_FREQUENCY=1000000000 -DDO_ARM_LDR=1 bench_mixed_ldr_neon.S && time ./a.out
real 0m8.019s
user 0m8.000s
sys 0m0.008s
Cortex-A8 can dual-issue L/S instructions with NEON arithmetic instructions perfectly fine.
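For anyone who wants to check the arithmetic behind these numbers, here is a rough sketch in Python (not part of the benchmark itself). It converts the "real" times above into cycles per unrolled block. It assumes the CPU really runs at the frequency passed in CPU_CLOCK_FREQUENCY and it ignores the subs/bne loop overhead; the helper name cycles_per_block is made up for illustration.

```python
# Rough sketch: convert the measured wall-clock times into cycles per block.
# Assumptions: the CPU runs at exactly the stated clock, and the subs/bne
# loop overhead is ignored. cycles_per_block is a hypothetical helper name.

LOOP_UNROLL_FACTOR = 20


def cycles_per_block(seconds, clock_hz, unroll=LOOP_UNROLL_FACTOR):
    """Return the average number of CPU cycles spent on one unrolled block."""
    iterations = clock_hz // unroll      # value loaded into ip by the benchmark
    blocks = iterations * unroll         # total blocks executed by the loop
    total_cycles = seconds * clock_hz    # cycles elapsed at the assumed clock
    return total_cycles / blocks


# (label, "real" time in seconds, assumed clock frequency in Hz)
results = [
    ("Cortex-A9, NEON only", 8.093, 1_200_000_000),
    ("Cortex-A9, NEON + LDR", 9.048, 1_200_000_000),
    ("Cortex-A8, NEON only", 8.018, 1_000_000_000),
    ("Cortex-A8, NEON + LDR", 8.019, 1_000_000_000),
]

for label, seconds, clock_hz in results:
    print(f"{label}: {cycles_per_block(seconds, clock_hz):.2f} cycles per block")
```

Because the loop executes CPU_CLOCK_FREQUENCY blocks in total, the run time in seconds is numerically about the same as the cycle count per block: roughly 8 and 9 cycles on Cortex-A9, and roughly 8 cycles in both cases on Cortex-A8.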