We use the delay slot a lot, in our course at least. One requirement of our assignment is to use forwarding and to choose between the delay slot and a BTB (branch target buffer). The latter is really two predictors in one: one for whether the branch will be taken, and one for the branch's target address.
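To make the "two predictors in one" point concrete, here is a toy sketch in Python. It is only an illustration under my own assumptions: a 2-bit saturating counter per branch plus the last seen target, with made-up names (`BranchTargetBuffer`, `count_mispredictions`) and made-up addresses. It is not necessarily how the BTB in our course tool works internally.

```python
class BranchTargetBuffer:
    """Toy BTB (assumed design): branch PC -> [2-bit counter, last seen target]."""

    def __init__(self):
        self.entries = {}  # pc -> [counter in 0..3, target address]

    def predict(self, pc):
        """Return the predicted target, or None for 'fall through'.

        An unknown branch, or a counter below 2, predicts not taken.
        """
        entry = self.entries.get(pc)
        if entry is None or entry[0] < 2:
            return None
        return entry[1]

    def update(self, pc, taken, target):
        """Train the entry with the real outcome of the branch."""
        counter, _ = self.entries.get(pc, [1, target])  # new entries start weakly not taken
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
        self.entries[pc] = [counter, target]


def count_mispredictions(btb, pc, target, outcomes):
    """Run one branch through the BTB and count wrong next-PC predictions."""
    misses = 0
    for taken in outcomes:
        actual = target if taken else None  # None stands for the fall-through path
        if btb.predict(pc) != actual:
            misses += 1
        btb.update(pc, taken, target)
    return misses


# A loop branch that is taken 9 times and then falls through (hypothetical addresses).
loop_misses = count_mispredictions(BranchTargetBuffer(), 0x40, 0x10, [True] * 9 + [False])
```

In this toy model the loop branch is mispredicted only on its first iteration (entry not yet trained) and on the final exit, which matches the intuition that a couple of well-behaved branches do not cost much.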
However, since we have managed to avoid most branches, we use only two: one for the loop and one for the check, so the loss is not big. All this applies to my current code (the one with the loop unrolling), which takes 933 cycles. I have BTB enabled for now. The delay slot gives slightly more cycles, but I guess it could give slightly fewer if I make some rearrangements.
But since there are just two branches, I feel the main loss we have now is that we use a count array and then have to load its values into r17-r21. Nomimal thought of something that uses slti and could get us around this. That idea is where we should focus now, I would say. :)
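I do not know the details of that idea, so the following is only a guess at the general shape, written in Python as a model rather than as MIPS code. The assumption is that `slti` produces a 0/1 flag that can be added straight into a register counter, so the counts never have to live in a memory array. The function names, the thresholds, and the mapping of counters to r17-r21 are all hypothetical.

```python
def slti(rs, imm):
    """Model of MIPS slti: result is 1 if rs < imm (signed compare), else 0."""
    return 1 if rs < imm else 0


def count_below_thresholds(values, thresholds):
    """Keep one running counter per threshold, as if each lived in its own register.

    The thresholds are hypothetical; the real comparison depends on the assignment.
    """
    counters = [0] * len(thresholds)  # stand-ins for r17..r21
    for v in values:
        for i, t in enumerate(thresholds):
            # Add the 0/1 flag directly: no branch and no count array in memory.
            counters[i] += slti(v, t)
    return counters


# Made-up data: three values checked against five made-up thresholds.
example_counts = count_below_thresholds([1, 5, 9], [2, 6, 10, 0, 100])
```

If something like this works for our case, the final loads into r17-r21 would disappear, because the counters would already be in registers when the loop ends.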