You can do several things faster in assembler than in C for one main reason: you are not constrained by C's rules. For example, in x86 assembler you can determine which of four values you have with only one compare. On just about all processor architectures you can return more than one value without using a pointer. In assembly you can also return a result without using a single register or memory location, by using the condition flags. Try doing any of these in C.
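To make the single-compare point concrete, here is a rough sketch of the C side. The function name `three_way` is made up for illustration, and the assembly in the comment is a hand-written sketch of the idea (Intel syntax, signed operands assumed), not compiler output. It shows only a three-outcome case; telling four values apart from one compare relies on further flags.

```c
/*
 * In x86 assembly a single cmp sets all the condition flags at once,
 * so one compare can feed several conditional jumps:
 *
 *     cmp  eax, ebx
 *     jl   is_less      ; signed less-than
 *     je   is_equal     ; equal
 *                       ; fall through: greater
 *
 * The C source below has to spell out two comparisons. An optimizing
 * compiler may still merge them, but the language does not promise it.
 */
static int three_way(int a, int b)
{
    if (a < b)
        return -1;
    if (a == b)
        return 0;
    return 1;
}
```

In assembly the three outcomes come from one flag-setting instruction; in C you state the comparisons separately and leave the merging to the compiler.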

You don't see much assembly in programs today. CS courses today emphasize a single operation per function. With everything split up into tiny functions, you can't leverage assembly's real power, and whatever gains you do get are overshadowed by function call overhead. One reason people don't bother to optimize their code today is that the performance bottleneck is this very philosophy of one operation per function. In other words, if a program spends less than 1% of its time in any one function, the most you can gain by optimizing a single function is less than 1%.
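The 1% limit is just Amdahl's law. Here is a small sketch of the arithmetic; the function name `overall_speedup` and its parameters are made up for this example.

```c
/*
 * Amdahl's law: if a fraction p of the run time is made s times faster,
 * the whole program speeds up by 1 / ((1 - p) + p / s).
 */
static double overall_speedup(double p, double s)
{
    return 1.0 / ((1.0 - p) + p / s);
}
```

With `p = 0.01`, even an enormous `s` gives an overall speedup of only about 1.0101, so a function that accounts for 1% of the run time can never buy you more than roughly 1%.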

This problem is so bad that people don't even try to write fast, tight code, even in video drivers.

Example of video driver code (changed to protect the identity of the guilty):

static int somechip_interp_flat(struct somechip_shader_ctx *ctx, int input)
{
    int i, r;
    struct some_gpu_bytecode_alu alu;

    for (i = 0; i < 4; i++) {
        /* The whole struct is cleared and refilled on every pass. */
        memset(&alu, 0, sizeof(struct some_gpu_bytecode_alu));

        alu.inst = SOME_ALU_INSTRUCTION_INTERP_LOAD_P0;

        alu.dst.sel = ctx->shader->input[input].gpr;
        alu.dst.write = 1;

        alu.dst.chan = i;

        alu.src[0].sel = SOME_ALU_SRC_PARAM_BASE + ctx->shader->input[input].lds_pos;
        alu.src[0].chan = i;

        if (i == 3)
            alu.last = 1;
        r = some_alu_bytecode_add_alu(ctx->bc, &alu);
        if (r)
            return r;
    }
    return 0;
}

How it should be written for performance:

static int somechip_interp_flat(struct somechip_shader_ctx *ctx, int input)
{
    int i, r;
    struct some_gpu_bytecode_alu alu;

    /* Loop-invariant setup: clear the struct and fill the fixed fields once. */
    memset(&alu, 0, sizeof(struct some_gpu_bytecode_alu));

    alu.inst = SOME_ALU_INSTRUCTION_INTERP_LOAD_P0;

    alu.dst.sel = ctx->shader->input[input].gpr;
    alu.dst.write = 1;

    alu.src[0].sel = SOME_ALU_SRC_PARAM_BASE + ctx->shader->input[input].lds_pos;

    for (i = 0; i < 4; i++) {
        /* Only the channel fields change from one pass to the next. */
        alu.dst.chan = i;
        alu.src[0].chan = i;

        if (i == 3)
            alu.last = 1;
        r = some_alu_bytecode_add_alu(ctx->bc, &alu);
        /* unlikely() is a branch hint macro, as used in the Linux kernel. */
        if (unlikely(r))
            break;
    }
    return r;
}
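If you want to convince yourself that hoisting the setup out of the loop does not change what gets emitted, a stand-alone sketch like the following can help. Everything driver-specific here is a hypothetical stand-in: the struct layout, the constant values, and `stub_add_alu`, which is assumed to copy the instruction and never modify the caller's struct. If the real add function wrote to the struct, the hoisted version would not be safe.

```c
#include <string.h>

/* Hypothetical stand-ins for the driver's types and constants. */
enum { SOME_ALU_INSTRUCTION_INTERP_LOAD_P0 = 1, SOME_ALU_SRC_PARAM_BASE = 0x100 };

struct alu_operand { int sel, chan, write; };
struct some_gpu_bytecode_alu {
    int inst;
    struct alu_operand dst;
    struct alu_operand src[3];
    int last;
};

/* Branch hint as commonly defined for GCC/Clang; plain fallback elsewhere. */
#if defined(__GNUC__)
#define unlikely(x) __builtin_expect(!!(x), 0)
#else
#define unlikely(x) (x)
#endif

/* Stub emitter (assumption): copies the instruction, never modifies *alu. */
struct emit_log { struct some_gpu_bytecode_alu insts[4]; int count; };

static int stub_add_alu(struct emit_log *log, const struct some_gpu_bytecode_alu *alu)
{
    if (log->count >= 4)
        return -1;
    log->insts[log->count++] = *alu;
    return 0;
}

/* Original shape: clear and refill the whole struct on every iteration. */
static int emit_per_iteration(struct emit_log *log, int gpr, int lds_pos)
{
    struct some_gpu_bytecode_alu alu;
    int i, r;

    for (i = 0; i < 4; i++) {
        memset(&alu, 0, sizeof(alu));
        alu.inst = SOME_ALU_INSTRUCTION_INTERP_LOAD_P0;
        alu.dst.sel = gpr;
        alu.dst.write = 1;
        alu.dst.chan = i;
        alu.src[0].sel = SOME_ALU_SRC_PARAM_BASE + lds_pos;
        alu.src[0].chan = i;
        if (i == 3)
            alu.last = 1;
        r = stub_add_alu(log, &alu);
        if (r)
            return r;
    }
    return 0;
}

/* Rewritten shape: loop-invariant fields are set once, outside the loop. */
static int emit_hoisted(struct emit_log *log, int gpr, int lds_pos)
{
    struct some_gpu_bytecode_alu alu;
    int i, r = 0;

    memset(&alu, 0, sizeof(alu));
    alu.inst = SOME_ALU_INSTRUCTION_INTERP_LOAD_P0;
    alu.dst.sel = gpr;
    alu.dst.write = 1;
    alu.src[0].sel = SOME_ALU_SRC_PARAM_BASE + lds_pos;

    for (i = 0; i < 4; i++) {
        alu.dst.chan = i;
        alu.src[0].chan = i;
        if (i == 3)
            alu.last = 1;
        r = stub_add_alu(log, &alu);
        if (unlikely(r))
            break;
    }
    return r;
}

/* Returns 1 when both versions log identical instruction sequences. */
static int versions_agree(int gpr, int lds_pos)
{
    struct emit_log a, b;

    memset(&a, 0, sizeof(a));
    memset(&b, 0, sizeof(b));
    if (emit_per_iteration(&a, gpr, lds_pos) || emit_hoisted(&b, gpr, lds_pos))
        return 0;
    /* All members are int, so there is no padding and memcmp is safe here. */
    return a.count == b.count && memcmp(a.insts, b.insts, sizeof(a.insts)) == 0;
}
```

Calling `versions_agree(5, 2)` with any register and position values should return 1 under these assumptions: the hoisted version does one `memset` and one round of fixed-field stores instead of four, yet logs the same four instructions.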

One might argue that it's the generated shader code that is the important part. However, slow CPU code and unneeded memory writes do delay the issue of the shader code to the GPU.