You will need to place a limit on how big the integral term can get, otherwise it can build up and swamp the others. I didn't use interrupts for the encoders - just polled them at 40 kHz and read them all at once. The problem I saw was that if an encoder was sitting right on the edge of a transition you could get a continuous stream of encoder interrupts and not get any time to do anything else. To keep the speed up I kept an 8 bit count for the step/dir interrupt and the encoder polling, and then transferred them to a 32 bit counter periodically before they could overflow.
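As a rough illustration of the integral limit, here is a minimal sketch in C. The function name, the parameter names and the limit value used in the example are made up for illustration, not taken from my code, and the right limit depends on your loop gains and output range.

```c
#include <stdint.h>

/* Hypothetical helper: add the current error to the integral term and
   clamp the result to +/- limit so it cannot wind up without bound.
   The sum is done in 64 bits so the addition itself cannot overflow. */
int32_t integrate_clamped(int32_t integral, int32_t error, int32_t limit)
{
    int64_t sum = (int64_t)integral + (int64_t)error;

    if (sum > limit) {
        sum = limit;
    } else if (sum < -(int64_t)limit) {
        sum = -(int64_t)limit;
    }
    return (int32_t)sum;
}
```

Each pass through the servo loop you would call something like integral = integrate_clamped(integral, error, 10000); and then use integral in the I term as usual.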

Are you writing the whole thing in assembler or just the interrupts? After looking at the code the C compiler generated I decided that I couldn't really save much by using asm.
The encoder polling routine was 103 instructions (for all 3 encoders) and the step/dir interrupt 23 instructions. Assuming 2 clocks per instruction, the total for 3 step interrupts and an encoder poll is about 11 µs with a 16 MHz clock. At 40 kHz, which is as fast as most driving software will go, 44% of the processor time is being used, so there is a bit to spare.
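For anyone wanting to try the polling approach, here is a minimal sketch in C of one way to do it: a table-driven quadrature decode into an 8 bit count that gets moved into a 32 bit position every so often. This is not my actual routine. The names (encoder_t, encoder_poll, encoder_flush), the table layout and the sign convention are all assumptions for illustration, and reading the A/B pins is left to the caller because it depends on the hardware.

```c
#include <stdint.h>

/* Transition table indexed by (previous_state << 2) | new_state, where
   state = (A << 1) | B. Valid single steps give +1 or -1; no change or an
   invalid double step gives 0. Which direction counts as positive is an
   arbitrary choice here. */
static const int8_t quad_table[16] = {
     0, +1, -1,  0,
    -1,  0,  0, +1,
    +1,  0,  0, -1,
     0, -1, +1,  0
};

typedef struct {
    uint8_t prev_state;     /* last sampled (A << 1) | B */
    volatile int8_t delta;  /* small count updated by the fast poll */
    int32_t position;       /* full-width position, updated less often */
} encoder_t;

/* Called at the polling rate with the current A/B bits in the low two
   bits of ab. The work per call is fixed, so an encoder dithering on an
   edge just adds +1 and -1 alternately and cannot hog the processor. */
void encoder_poll(encoder_t *enc, uint8_t ab)
{
    ab &= 0x03;
    enc->delta = (int8_t)(enc->delta +
                          quad_table[(uint8_t)((enc->prev_state << 2) | ab)]);
    enc->prev_state = ab;
}

/* Called periodically, often enough that delta cannot reach +/-127
   between calls. On real hardware the polling interrupt should be
   masked around the read and clear of delta. */
void encoder_flush(encoder_t *enc)
{
    int8_t d = enc->delta;
    enc->delta = 0;
    enc->position += d;
}
```

At a 40 kHz poll rate the 8 bit count can move by at most 127 counts in about 3 ms, so in this sketch encoder_flush would need to run at least that often, for example once per servo loop pass.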