Interview Question

**nndhawan** · 02-17-2011

Hey Guys,

I am new to C board and wish I wasn't since it seems like there is a ton of good information presented here. In any case, I just finished an interview today and I was asked a particular c problem that stumped me and when I thought I was right, I was wrong. But still the interviewers explanation of why he was right does not convince me and that's why to further my understanding and stay true to the C programming language I want to get more experienced C programmer's input as to what is really going on.

So basically the premise of the question:
Compare the two code snippets below. What is it that they do and why is one faster than the other?

Code:

foo(char *_a, const char *_b, int _count)
{
do{
*_a++ = *_b++;
}while(--_count > 0);
}

Code:

foo2(char *_a, const char *_b, int _count)
{
int n = (_count + 7)/8;
switch(_count % 8) {
case 0:  do { *_a++ = *_b++;
case 7: *_a++ = *_b++;
case 6: *_a++ = *_b++;
case 5: *_a++ = *_b++;
case 4: *_a++ = *_b++;
case 3: *_a++ = *_b++;
case 2: *_a++ = *_b++;
case 1: *_a++ = *_b++;
}while(--n> 0);
}
}

My reasoning:
Both functions are essentially doing the same thing: the pointer a is pointing to data stored into data pointed by b.

The first one is faster because it has a lot less statements and will branch a lot less. Where as the second one deals with a lot of comparisons in the switch statements before it does any of its data manipulation. Therefore the second one implements comparisons/branch instructions and loops before it copies data and the first one just has one while loop without any branching thus is more efficient.

The interviewers said the second one is much better because it reduces the loop by an order of n/8, where n is the number of loops. I don't really see how this works, I can see how the idea is more efficient, but to me the it seems the second code is working harder for no reason at all and I don't get how it achieves that

If there is something I am missing here please let me know, cause I still don't see it. Thanks.

**GReaper** · 02-17-2011

In a Pentium CPU, we really don't know which one will be faster. Because, while the first does 8 times more iterations, the second does 8 times more comparisons. In worst cases though, when the branch is lost 50% of the time, the latter will be much much slower!

**Bayint Naung** · 02-17-2011

Duff's device - Wikipedia, the free encyclopedia
Edit: Tom Duff on Duff's Device

**~~CommonTater~~** · 02-17-2011

I agree with Sipher... that second setup is just plain silly.

Doing modulus is one heck of a lot slower than counting.
Using case statements is slower still.

Plus underscores on variable names ir a bad idea, it risks colision with the decoration used in librarys and object files.

Your interviewer needs to take some C programming lessons.

EDIT: Thanks Bayint... I haven't seen Duffs Device in years. I thought it was finally gone...

**nonoob** · 02-17-2011

The second one is faster because it does not have to decrease count for each byte copied.
Every eight copy operations it does do one decrement of n but the rest of the time it does no other work than what it has to. Dropping through case statements is zero overhead.

**GReaper** · 02-17-2011

... Inc or Dec are fast, really fast, you know.

**GReaper** · 02-17-2011

Yes, i forgot about jump tables.

But still, the second version creates a long dependance chain that stalls the execution just as much.

**Bayint Naung** · 02-17-2011

The second version may *not* be faster.
But that illustrates 2 features of C.
switch fallthrough and
switch() allows statements;

**~~CommonTater~~** · 02-17-2011

Not to mention the risk of underrunning allocated memory if Count isn't a multiple of 8.

**GReaper** · 02-17-2011

Originally Posted by CommonTater

Not to mention the risk of underrunning allocated memory if Count isn't a multiple of 8.

No, that is encountered by placing the switches backwards.

EDIT: But, it is valid C? Can cases be nested inside other repeat structures?

**brewbuck** · 02-17-2011

Originally Posted by Sipher

No, that is encountered by placing the switches backwards.

EDIT: But, it is valid C? Can cases be nested inside other repeat structures?

Yes, it is valid. As was mentioned, it's called Duff's device, and it used to be a significant optimization. It's just a loop that's been unrolled by hand instead of the compiler. I say "used to be" because most modern compilers will do this optimization for you automatically.

I wouldn't challenge somebody on it during an interview, but I'd be surprised if a decent compiler didn't produce almost the exact same code for the two versions.

EDIT: Another use of Duff's device is to create a loop with more than one entry point. It's good to have the option, but in most cases there are other ways to write it to avoid doing this weird stuff.

**nonoob** · 02-17-2011

Originally Posted by CommonTater

Doing modulus is one heck of a lot slower than counting.
Using case statements is slower still.

Modulo is super fast. Probably 1 clock cycle like most integer operations. It is only done once. Especially since _count % 8 essentially masks off the lower 3 bits. So it doesn't even require a divide if the compiler optimizes. Whereas if you decrement a counter per byte copy it will impact performance that many times... potentially thousands or millions depending on the amount of data.

**~~CommonTater~~** · 02-17-2011

Of course you can nest cases inside loops of all kinds... even inside each other (that one's done in windows all the time).

**~~CommonTater~~** · 02-17-2011

Originally Posted by nonoob

Modulo is super fast. Probably 1 clock cycle like most integer operations. It is only done once. Whereas if you need to decrement a counter per byte copy it will impact performance that many times... potentially thousands or millions depending on the amount of data.

inc and dec are 1 clock cycle CPU instructions as well... part of the microcode, whereas modulo has to be calculated.

A++ pretty much amounts to INC EAX; in assembler.

**brewbuck** · 02-17-2011

Originally Posted by nonoob

Modulo is super fast. Probably 1 clock cycle like most integer operations. It is only done once. Whereas if you need to decrement a counter per byte copy it will impact performance that many times... potentially thousands or millions depending on the amount of data.

Modulo by a power of two, if the compiler can tell it's a power of two, is super fast. In general, modulo is one of the slowest arithmetic operations if not the slowest.

This is the stuff people shouldn't spend time worrying about. A good compiler will see this code, say "Hey, this is a memcpy" and just call memcpy() for you. memcpy() itself has all these optimizations already built into it.

Thread: Interview Question

Thread Tools

Search Thread

Display

Interview Question

Similar Threads

c Interview question.

SDL buffer channels question

Newbie question, C #

another do while question

Question...