Any reason you're concerned about what's probably going to amount to a relatively small speedup? A naive scan like this is roughly O(strlen(src) * strlen(find)) in the worst case, so you're squeezing cycles out of it, not orders of magnitude. I'm sure you're aware of this, but I'm wondering whether the speedup will justify the effort and whether simply cranking up the compiler's optimization settings won't be your best option.
The only advice I have is: don't write your own strcmp/memcmp. It's not likely to be any faster than the standard library versions, which are typically highly optimized for the architecture. They will "unroll" the comparisons for you, checking a whole word at a time where possible (provided you're comparing enough bytes and the alignment/word boundaries work out), instead of just one byte at a time as your code does. The compiler might also inline memcmp and memcpy if it deems that beneficial.
Code:
size_t find_len = strlen(find);
size_t rep_len = strlen(rep);

while (*src) {
    /* Note: memcmp may read up to find_len bytes starting at src, which can
       run past the terminator near the end of the string; use strncmp if
       that's a concern, since it stops at the '\0'. */
    if (memcmp(src, find, find_len) == 0) {
        memcpy(dst, rep, rep_len);   /* write the replacement text */
        src += find_len;             /* skip over the matched text */
        dst += rep_len;              /* advance by the replacement length */
    }
    else {
        *dst++ = *src++;             /* no match: copy one byte and move on */
    }
}
*dst = '\0';
I didn't test that, but that's the general idea.
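If it helps, here's the same idea wrapped into a complete, compilable sketch with a small test driver. The function name replace_all is just something I made up, and it assumes dst is large enough for the result and that find is non-empty; I've also used strncmp rather than memcmp here so the comparison can't read past the end of src.

Code:
#include <stdio.h>
#include <string.h>

/* Sketch only: replaces every occurrence of find in src, writing the
   result to dst. Assumes dst is large enough and find is non-empty. */
static void replace_all(char *dst, const char *src,
                        const char *find, const char *rep)
{
    size_t find_len = strlen(find);
    size_t rep_len = strlen(rep);

    while (*src) {
        /* strncmp stops at src's terminator, so it never reads past it */
        if (strncmp(src, find, find_len) == 0) {
            memcpy(dst, rep, rep_len);
            src += find_len;
            dst += rep_len;
        }
        else {
            *dst++ = *src++;
        }
    }
    *dst = '\0';
}

int main(void)
{
    char out[64];
    replace_all(out, "one two two three", "two", "2");
    printf("%s\n", out);   /* prints: one 2 2 three */
    return 0;
}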