I was thinking about alloca() lately, so I spent some time investigating how I could pull it off. The short answer is: I can't. The longer answer is: I shouldn't try.
alloca is notorious for buggy implementations, with lots of tricky edge cases that are not handled well. Beyond this, there's the fact that we're tight on stack space as it is. Opening that space up for potential abuse is just asking for trouble.
I suppose I could come up with a scheme to pull it off anyway, but the stack frame I'm currently using is not friendly to such a thing. I'd have to either make an alternate stack frame for functions which use alloca or switch to a different frame layout altogether. Since there are lingering problems with malformed stacks, I don't think it would be wise to make things any more complicated than they already are. At least for now.
That being said, I was looking at the prologue and epilogue code currently in use. There are opportunities for optimization I should look at. The basic idea is that we increment the stack pointer as we restore registers. If local stack space is used, there is a final adjustment to correct for that.
There is a four-cycle cost for each of those increments. The current code attempts to fold the final increment into the final stack frame adjustment, but only actually does this in rare circumstances.
There was also a pointless "ai r10, 0" instruction emitted in the prologue if no local stack space was used.
That's all easy to fix, but I'm wondering if it's better to not do the folding for small stack sizes. Let's consider the case where one register is saved, and we have two bytes for local usage.
Case 1: Fold increment into stack adjustment
mov *r10, r11 # 4+14+4=22 clocks
ai r10, 4 # 4+14+4=22 clocks
Total: 44 clocks, 6 bytes
Case 2: Preserve increment
mov *r10+, r11 # 4+14+8=26 clocks
inct r10 # 4+10=14 clocks
Total: 40 clocks, 4 bytes
OK, that's a pretty clear win for not folding. I'll get on that.
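As a sanity check on the arithmetic above, here is a small Python sketch that tallies both cases. The clock counts are the ones quoted in this post, not independently measured. The per-instruction byte sizes are my assumption from the TMS9900 encoding (one word for mov and inct, two words for ai with its immediate), which agrees with the totals given. The COSTS table and the tally() helper are hypothetical names used only for illustration.

```python
# Per-instruction (clocks, bytes).
# Clocks are taken from the figures quoted in the post.
# Byte sizes are assumed: one 16-bit word for mov/inct, two words for ai.
COSTS = {
    "mov *r10, r11":  (22, 2),
    "ai r10, 4":      (22, 4),
    "mov *r10+, r11": (26, 2),
    "inct r10":       (14, 2),
}

def tally(instructions):
    """Return (total clocks, total bytes) for a list of instructions."""
    clocks = sum(COSTS[insn][0] for insn in instructions)
    size = sum(COSTS[insn][1] for insn in instructions)
    return clocks, size

# Case 1: fold the increment into the stack adjustment.
folded = tally(["mov *r10, r11", "ai r10, 4"])

# Case 2: keep the post-increment and adjust with inct.
preserved = tally(["mov *r10+, r11", "inct r10"])

print("folded:    %d clocks, %d bytes" % folded)
print("preserved: %d clocks, %d bytes" % preserved)
print("savings:   %d clocks, %d bytes" % (folded[0] - preserved[0],
                                          folded[1] - preserved[1]))
```

This only covers the one-register, two-byte case discussed here. Other frame sizes would need their own instruction sequences and timings before drawing any conclusion.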