Which way to loop throught an array is the fastest one?

**rempas** · 06-27-2022

So, we're having the following array:

Code:

int numbers[5] = { 10, 20, 30, 40, 50 };
int len = 5;

We can loop through them using two ways (or are there any more I'm not aware of?).

1:

Code:

    for (int i = 0; i < len; i++) { printf("%d\n", numbers[i]); }

2:

Code:

  int* ptr = numbers;
int len_cpy = len;
for (; len_cpy > 0; --len_cpy) { printf("%d\n", *ptr++); }

Which one will result in the fastest runtime performance? I tried to measurement them but the times are always inaccurate and very inconsistent. Let's see how I look at them with logic (with my current knowledge).

In the first case, we create a variable "i" so 1 move instruction. Then then condition check is 1 "cmp" and then 2 "jmp" instructions. Then I'm not sure how "numbers[i]" in machine language. I suppose the generated code gets a copy of the original pointer, increments it by "i" and then, de-references the pointer. So 3 instructions? And finally, we increment "i" so 1 instruction. So, we have a total of 8 instructions!

In the second case, we get a pointer to the beginning of the array and we copy the variable that holds the length of the array (as we don't want to modify the original variable that holds the length) so 2 instructions. Then we have the condition. However, it now compares against an immediate value and from what I heard, working with immediate values is faster (but how much faster?) so I suppose that even tho we have the same instructions here, it will execute faster in that case. Then, we deference the pointer and then we increment it, so 2 instructions. But in this case, we increment it by "1" which is an immediate value compared to the previous case where we increment it by the value of a variable so I suppose that incrementing will also work faster here. Finally, we decrement "len_copy" which is also 1 instruction but "subtraction" is slower than "addition" so it will be slower in that case (but how much slower?). In the end, we have 8 instructions in this case as well (but 1 less inside the loop)!

I don't know if what I'm saying is right or wrong but it's all based on my current knowledge. That's why I'm asking for help in the first place. Thank you all for your time reading this!

**hamster_nz** · 06-27-2022

It is very hard to give you a useful answer. There are too many unknowns, and it also depends on the compiler and it's optimizations. If you are using printf() it doesn't really matter as printf() is slow.

However, in general because modern CPUs are usually superscalar, using two variables (one for the addressing, and one for the counting, is usually preferred, as both can be processed in parallel.

The other option is to calculate the end address and then loop until you reach it, saving one math operation inside the loop. Might not make it any faster on a superscalar CPU, but could be on a low end CPU/microcontroller.

Code:

int* ptr = numbers;
for(int* end_ptr = numbers + len;  ptr != end_ptr; ptr++) {
  printf("%d\n", *ptr);  // Should be a very simple operation for the performance of the loop to make a difference
}

**rempas** · 06-27-2022

Originally Posted by hamster_nz

It is very hard to give you a useful answer. There are too many unknowns, and it also depends on the compiler and it's optimizations. If you are using printf() it doesn't really matter as printf() is slow.

However, in general because modern CPUs are usually superscalar, using two variables (one for the addressing, and one for the counting, is usually preferred, as both can be processed in parallel.

The other option is to calculate the end address and then loop until you reach it, saving one math operation inside the loop. Might not make it any faster on a superscalar CPU, but could be on a low end CPU/microcontroller.

Code:

int* ptr = numbers;
for(int* end_ptr = numbers + len;  ptr != end_ptr; ptr++) {
  printf("%d\n", *ptr);  // Should be a very simple operation for the performance of the loop to make a difference
}

Tons of things that I have to learn! Thank you for making my life easier my friend!

**Salem** · 06-27-2022

First off, trying to time anything calling printf is a fools errand.
You're trying to measure an ant-fart in a hurricane.

Second, it's basically the same code whichever way you write it.

Rather than trying to rationalise the instruction count, just count them directly.

Code:

#include <stdio.h>

void foo ( int numbers[], int len ) {
    for (int i = 0; i < len; i++) {
        printf("%d\n", numbers[i]);
    }
}

void bar ( int numbers[], int len ) {
    for (; len > 0; --len) { printf("%d\n", *numbers++); }
}

Then compile with -S and other flag(s) of your choice.

Code:

$ gcc -S foo.c
$ less foo.s

$ gcc -S -O2 foo.c
$ less foo.s

By the time the optimiser has had a go at them both, the loops are essentially the same.

The code you write should follow what you would naturally write.
In that respect, for (int i = 0; i < len; i++) has far fewer surprises than for (; len_cpy > 0; --len_cpy)

If you try to be too clever, you might just end up confusing the optimiser and coming out worse than if you'd just stuck to the plain and simple.

Focus your efforts on choosing the best algorithms and data structures for the problem at hand, and leave the micro-management to the compiler.
No amount of loop reformatting or compiler optimisation flags will save you if you choose 'bubblesort' instead of 'quicksort'.

**rempas** · 06-27-2022

Originally Posted by Salem

First off, trying to time anything calling printf is a fools errand.
You're trying to measure an ant-fart in a hurricane.

Second, it's basically the same code whichever way you write it.

Rather than trying to rationalise the instruction count, just count them directly.

Code:

#include <stdio.h>

void foo ( int numbers[], int len ) {
    for (int i = 0; i < len; i++) {
        printf("%d\n", numbers[i]);
    }
}

void bar ( int numbers[], int len ) {
    for (; len > 0; --len) { printf("%d\n", *numbers++); }
}

Then compile with -S and other flag(s) of your choice.

Code:

$ gcc -S foo.c
$ less foo.s

$ gcc -S -O2 foo.c
$ less foo.s

By the time the optimiser has had a go at them both, the loops are essentially the same.

The code you write should follow what you would naturally write.
In that respect, for (int i = 0; i < len; i++) has far fewer surprises than for (; len_cpy > 0; --len_cpy)

If you try to be too clever, you might just end up confusing the optimiser and coming out worse than if you'd just stuck to the plain and simple.

Focus your efforts on choosing the best algorithms and data structures for the problem at hand, and leave the micro-management to the compiler.
No amount of loop reformatting or compiler optimisation flags will save you if you choose 'bubblesort' instead of 'quicksort'.

Thank you! Yeah, sorry for not making it clear. I don't care about "printf", I just used it to make the example. Of course the "numbers[i]" and
"*nubers++" is what I care for. In the end, I get your point however, what if you want to write a compiler which is what I want to do? Then you
are going to create the optimizer and you'll have to know how low level details work. This is why I'm asking. Have a nice day my friend!

**Salem** · 06-28-2022

> what if you want to write a compiler which is what I want to do?
Have you actually read any books or courses on compiler construction?

Just making it generate correct code will be achievement enough in itself.

K&R's first C compiler was just a dumb translator.
Hence they had compiler hint keywords like 'register' built into the language.

Modern compilers with their range of optimisation options are the distilled experience of 1000's of people with decades of programming experience.

Having read your previous topics, it seems like we're back to where we started.

**rempas** · 06-28-2022

Originally Posted by Salem

> what if you want to write a compiler which is what I want to do?
Have you actually read any books or courses on compiler construction?

Just making it generate correct code will be achievement enough in itself.

K&R's first C compiler was just a dumb translator.
Hence they had compiler hint keywords like 'register' built into the language.

Modern compilers with their range of optimisation options are the distilled experience of 1000's of people with decades of programming experience.

Having read your previous topics, it seems like we're back to where we started.

WHAT DO YOU MEAN... no no, I'm just kidding, XD!

But seriously, I can confidently say that I have made a HUGE progress not only in theory but in practice too (and my mental health has drastically improved
to the point that I'm not getting stressed with all the stuff that I have to thing and learn). I don't remember the last time I posted something regardless of that
topic but my frontend work sees progress. Well slowly because I'm lazy af but still, it's progress after all. It'll take time as I'm designing the compile to be optimized
(at least as much as I can think) from the start and I'm also not using libc so I'm writing everything I need myself. And I have lost (almost) all hopes to use a ready
backend because I find nothing that fits my needs without adding at least one very negative thing so I'll Assembly and output that instead. This is why I'm making
these questions. It's not about the ".01%" runtime gain but about properly learn how things work. I know I have a lot of work to do and a lot of things to still learn
but I'm willing to do whatever it takes. I can see small backends (that have been written by single individuals) that can get 60-70% of `GCC -O2` so I guess it's
not impossible! And it will be explicitly said that my compiler will not transform any code and do the aggressive optimizations GCC does. If you write bad code,
you will get bad output as a result. If you (or you know anyone that does) have a better idea for the backend tho, let me know!

When it comes to books, I don't remember if I asked it here but I have come to the conclusion to read the book called "modern compiler design 2nd edition". I've read
it (with caution) until a point but then I realize something. The book explains how a compiler works but it has tons of theory on that topic. Yes there are some small code
examples but there are nothing. I then realized that there is no book that will show me how to "design" and "implement" a compiler. All the book will do is show me how
compilers work (which you can find 10 million articles online and I could think of myself anyway) and the different "general designs" we can have. But no detailed explanation
because in the end, it's up to you! So I thought about save my time and start thinking and working myself! It's more fun this way anyway

However, I plan on going back in the book (and on reading anyway) at some point because I may see and learn things that I wouldn't have thought of. And because, It's very
hard for reading and learning from others and reading other code for me and I plan on getting more discipline and learning how to relax and take my time to properly read
and learn. Thank you for your time reading this my brother!

**awsdert** · 06-28-2022

I've not read through the posts here but I'll add my tidbit. I remember watching a vid at some point that mentioned going backwards through an array is usually faster than going forwards, something to do with where in RAM/cache the address starts off at. Secondly for multi-dimensional arrays it's better to go from the final dimension than from the 1st so that the CPU is just loading one block per top level loop, example:

Code:

#include <stdlib.h>
#include <stdio.h>
#include <time.h>

#define X_HAS 10
#define Y_HAS 10
#define Z_HAS 10
int zyx[Z_HAS][Y_HAS][X_HAS];

typedef void (*callback)();

void slow();
void fast();
void timed( callback cb );

int main()
{
	timed( slow );
	timed( fast );
	return 0;
}

void timed( callback cb )
{
	long int started, stopped;
	memset( (void*)zyx, 0, sizeof(zyx) );
	started = clock(); cb(); stopped = clock();
	printf( "Ticks taken: %ld\n", stopped - started );
}

void slow()
{
	int z, y, x;
	for ( x = 0; x < X_HAS; ++x )
	{
		for ( y = 0; y < Y_HAS; ++y )
		{
			for ( z = 0; z < Z_HAS; ++z )
				zyx[z][y][x] = z * y * x;
		}
	}
}

void fast()
{
	int z, y, x;
	for ( z = Z_HAS; z; )
	{
		--z;
		for ( y = Y_HAS; y; )
		{
			--y;
			for ( x = X_HAS; x; )
			{
				--x;
				zyx[z][y][x] = z * y * x;
			}
		}
	}
}

I wrote that up just now so you might need to check for bugs or pick a faster ticker but it should get the point across

**rempas** · 06-28-2022

Originally Posted by awsdert

I've not read through the posts here but I'll add my tidbit. I remember watching a vid at some point that mentioned going backwards through an array is usually faster than going forwards, something to do with where in RAM/cache the address starts off at. Secondly for multi-dimensional arrays it's better to go from the final dimension than from the 1st so that the CPU is just loading one block per top level loop, example:

Code:

#include <stdlib.h>
#include <stdio.h>
#include <time.h>

#define X_HAS 10
#define Y_HAS 10
#define Z_HAS 10
int zyx[Z_HAS][Y_HAS][X_HAS];

typedef void (*callback)();

void slow();
void fast();
void timed( callback cb );

int main()
{
    timed( slow );
    timed( fast );
    return 0;
}

void timed( callback cb )
{
    long int started, stopped;
    memset( (void*)zyx, 0, sizeof(zyx) );
    started = clock(); cb(); stopped = clock();
    printf( "Ticks taken: %ld\n", stopped - started );
}

void slow()
{
    int z, y, x;
    for ( x = 0; x < X_HAS; ++x )
    {
        for ( y = 0; y < Y_HAS; ++y )
        {
            for ( z = 0; z < Z_HAS; ++z )
                zyx[z][y][x] = z * y * x;
        }
    }
}

void fast()
{
    int z, y, x;
    for ( z = Z_HAS; z; )
    {
        --z;
        for ( y = Y_HAS; y; )
        {
            --y;
            for ( x = X_HAS; x; )
            {
                --x;
                zyx[z][y][x] = z * y * x;
            }
        }
    }
}

I wrote that up just now so you might need to check for bugs or pick a faster ticker but it should get the point across

HOLY!!!! That's truly advanced stuff! Thank you for the info!

**john.c** · 06-28-2022

That's truly advanced stuff!

I have no idea what you are referring to here. If you actually think this is "advanced" then you are many years away from "writing a compiler".

Obviously going through the dimensions of a muti-dimensional array from the inside out is going to be faster. For modern machines the most important thing is to limit cache misses. That's part of what Salem means about choosing the right algorithm. And looping backwards in that situation is also obviously useful since in that case when one inner loop ends you will be right next to the end of the next inner loop. This is really basic cache stuff.

However, none of this was part of the original question.

**rempas** · 06-28-2022

Originally Posted by john.c

I have no idea what you are referring to here. If you actually think this is "advanced" then you are many years away from "writing a compiler".

Obviously going through the dimensions of a muti-dimensional array from the inside out is going to be faster. For modern machines the most important thing is to limit cache misses. That's part of what Salem means about choosing the right algorithm. And looping backwards in that situation is also obviously useful since in that case when one inner loop ends you will be right next to the end of the next inner loop. This is really basic cache stuff.

However, none of this was part of the original question.

I'm sorry but memory optimizations are consider "advanced stuff" to me compared to a lot of other "cool" things and small optimizations. It may be the opposite for you, it's fine.

Also, why am I years away of writing a compiler if I don't know these stuff? Don't we all start from zero and learn as we go? Should I wait until I know EVERYTHING until I start
writing my compiler? Also, this has to do with optimizations. Know knowing that stuff doesn't mean that I cannot make something that works (even if not the best way). Like I said
to my friend @Salem, If you have any better ideas for a backend that will solve me hands, please let me know

**awsdert** · 06-28-2022

Originally Posted by john.c

However, none of this was part of the original question.

*looks at thread title* "Which way to loop throught an array is the fastest one?"
I believe you're incorrect there, unless someone changed the title before I saw the thread.

**john.c** · 06-28-2022

Some people are simply not worth helping.

**awsdert** · 06-28-2022

Originally Posted by john.c

Some people are simply not worth helping.

What makes you say that? And of whom are you referring? Even when I look at the initial post of the thread I see nothing that really deviates form the title, might've confounded the question a bit with talk of assembly instructions though.

**rempas** · 06-28-2022

Originally Posted by john.c

Some people are simply not worth helping.

I really do get hostile vibes from your replies and I don't understand why. It would really be nice if you wanted to address what
made you feel this way so we can fix it. Also, I'm really grateful that you and anyone else tries to help and thank you! However,
I will not stand here and feel bad while my behavior towards you and everyone else here has been nothing but the friendliest possible!
In the end, nobody forced you to help neither me or anyone else. If you think that my questions (or me in general) are "too stupid"
and that I'm unworthy of your time, then your free to pass!

Thread: Which way to loop throught an array is the fastest one?

Thread Tools

Search Thread

Display

Which way to loop throught an array is the fastest one?

Similar Threads

Fastest way to check if the bytes contained in an array are all x00

Help making manipulable variable throught another file

Fastest way to search the table ( or array )

I'm looking for the fastest!

Spreading throught the network

Tags for this Thread