Can "swap nybble" and "byte mask" tricks do multi byte logical shift by 4 much faster than the naive method of using bit shift chains

I'm writing a fixed-point (16Q16) algo that does division using the Newton–Raphson method outlined on Wikipedia. (Related SE question HERE.)

The first step requires logical right-shift by anywhere from 1-16 bits. My CPU is an 8-bit microcontroller, so does not have a barrel shifter; the hardware can only support shifting by 1 bit. (Related SE question HERE.) To shift by more than 1 bit, one can use the single shift instruction n times, however, this has significant worst case timing. This bad worst case becomes abysmal when compounded by the multi-byte nature of the shift. If it takes 1 cycle to shift 1 byte 1 time, we can quickly imagine the problem when needing to shift 16 bytes 16 times. Remember, this is just step 1.

The obvious optimization is to divide and conquer; manually compute for all powers of 2. The first, trivial case is shifts by 16 and 8, since this is just changing the memory index to the fixed-point number. Doing this it only takes ~4 cycles/instructions to shift 16 bytes by 16 bits, a 64x speed improvement over "single shift chaining."

The problem I'm having is pesky shifts by 4.

My intuition, as well as a few snippets and posts here and there, tells me that their exists an efficient way to combine nybble swap instructions with bit-masks and logical ops to create a sort of "nybble swap zipper effect" that could optimally shift an arbitrary length array by 4 bits. (Related SE question HERE. Links to answer 3)

I KNOW it can at least fundamentally be done, because I have a prof of concept using only SWAP, XOR, and AND instructions. I also know it can be done FASTER because what I have is one cycle faster (lol, yes, one) than a simple shift chain method. (see code box below)

What I do not know, is can this be done with a time complexity much closer to one cycle per byte than to one cycle per bit using any kind of nybble swap byte mask trick?

Note: This is PIC18 ASM, but it's pretty obvious what's going on. Any suggestions on how this can be majorly improved would be an answer to this question. YES! I realize that it may very well be close to optimal already, but realize shifting by 4 is a recurring hot spot in a few pieces of code. I'm expecting to cut at least one instruction from each block. Cutting out two would be amazing.

; Shift the denominator -> 4 (19 cyc)
;------------------------------------------
        SWAPF   Denom+3, W, BANKED
        MOVWF   Denom+3, BANKED
        ANDLW   0xF0
        XORWF   Denom+3, F, BANKED

        SWAPF   Denom+2, F, BANKED
        XORWF   Denom+2, F, BANKED
        XORWF   Denom+2, W, BANKED
        ANDLW   0xF0
        XORWF   Denom+2, F, BANKED

        SWAPF   Denom+1, F, BANKED
        XORWF   Denom+1, F, BANKED
        XORWF   Denom+1, W, BANKED
        ANDLW   0xF0
        XORWF   Denom+1, F, BANKED

        SWAPF   Denom+0, F, BANKED
        XORWF   Denom+0, F, BANKED
        XORWF   Denom+0, W, BANKED
        ANDLW   0xF0
        XORWF   Denom+0, F, BANKED
;------------------------------------------


Read more here: https://stackoverflow.com/questions/66274164/can-swap-nybble-and-byte-mask-tricks-do-multi-byte-logical-shift-by-4-much-f

Content Attribution

This content was originally published by Charlie at Recent Questions - Stack Overflow, and is syndicated here via their RSS feed. You can read the original post over there.

%d bloggers like this: