1. ## Convolution / cross-correlation issue

Hi everybody.

I don't know if this is the right section, since I'm using C to implement my project, but it's a pretty technical question about algorithms.

I'm not so skilled in C, and part of my project is about implementing convolution/cross-correlation between two signals (implemented as vectors of fixed size).

I read on a lot of websites that the fast Fourier transform is preferred over direct convolution because the complexity goes from O(n^2) to O(n log n), so it's worth the effort of understanding it.

A lot of libraries offer C implementations of the FFT, so I found the one I think is best for my case (my vectors have power-of-two lengths, which matters for this operation), and I'm trying to understand the Cooley-Tukey algorithm to use it in my project.

In math terms, I expect a C function that takes my vector as a parameter and gives me back the FFT-transformed vector, but all the functions I found take more parameters that I don't understand very well. So, if you can help me: I'll post here a well-known C implementation of the FFT, comment on what I understood about it, and ask some questions so that more skilled people can help me understand the code better, and eventually adapt it to fit what I need. (Obviously I'm not asking you to explain all of the code, or to do the job for me, but if you want, you could help me understand it better.)

So, here we are.

That's the code I found...

(next post, wait a sec)

2. Code:
```<<complex number datatype>>=
typedef struct complex_t {
    double re;
    double im;
} complex;

<<complex number prototypes>>=
/* note: the definitions need <math.h> for cos, sin and sqrt */
complex complex_from_polar(double r, double theta_radians);
double  complex_magnitude(complex c);
complex complex_add(complex left, complex right);
complex complex_sub(complex left, complex right);
complex complex_mult(complex left, complex right);

<<complex number function definitions>>=
complex complex_from_polar(double r, double theta_radians) {
    complex result;
    result.re = r * cos(theta_radians);
    result.im = r * sin(theta_radians);
    return result;
}

double complex_magnitude(complex c) {
    return sqrt(c.re*c.re + c.im*c.im);
}

complex complex_add(complex left, complex right) {
    complex result;
    result.re = left.re + right.re;
    result.im = left.im + right.im;
    return result;
}

complex complex_sub(complex left, complex right) {
    complex result;
    result.re = left.re - right.re;
    result.im = left.im - right.im;
    return result;
}

complex complex_mult(complex left, complex right) {
    complex result;
    result.re = left.re*right.re - left.im*right.im;
    result.im = left.re*right.im + left.im*right.re;
    return result;
}```
That part I find pretty easy: this should be a header that defines the complex type and its operations (though here complex numbers use double for both the real and imaginary parts; I think I can convert all of this to float). Here is the code for the FFT itself:
Code:
```/* needs <stdlib.h>, <math.h> and a definition of PI,
   e.g. #define PI 3.14159265358979323846 */
complex* FFT_simple(complex* x, int N /* must be a power of 2 */) {
    complex* X = (complex*) malloc(sizeof(struct complex_t) * N);
    complex * d, * e, * D, * E;
    int k;

    if (N == 1) {
        X[0] = x[0];
        return X;
    }

    e = (complex*) malloc(sizeof(struct complex_t) * N/2);
    d = (complex*) malloc(sizeof(struct complex_t) * N/2);
    for(k = 0; k < N/2; k++) {
        e[k] = x[2*k];
        d[k] = x[2*k + 1];
    }

    E = FFT_simple(e, N/2);
    D = FFT_simple(d, N/2);

    free(e);
    free(d);

    for(k = 0; k < N/2; k++) {
        /* Multiply entries of D by the twiddle factors e^(-2*pi*i/N * k) */
        D[k] = complex_mult(complex_from_polar(1, -2.0*PI*k/N), D[k]);
    }

    for(k = 0; k < N/2; k++) {
        X[k]       = complex_add(E[k], D[k]);
        X[k + N/2] = complex_sub(E[k], D[k]);
    }

    free(D);
    free(E);
    return X;
}```
Up to here I think the code is pretty straightforward, but I have some questions. The function receives my vector to transform and an int N that is said to be a power of 2. What is N? The size of the vector? Or the sampling frequency I want to use (usually that should be twice the number of samples in the vector)?

This implementation is said to be a little crude and improvable (it reportedly allocates about 3x the necessary memory, and performance degrades for large inputs, like my vectors of 1024 elements). The guide proposes a new implementation:
Code:
```void FFT_calculate(complex* x, int N /* must be a power of 2 */, int skip,
                   complex* X, complex* D, complex* twiddles) {
    complex * E = D + N/2;
    int k;

    if (N == 1) {
        X[0] = x[0];
        return;
    }

    /* for now we can use X as a scratch buffer */
    FFT_calculate(x, N/2, skip*2, E, X, twiddles);
    FFT_calculate(x + skip, N/2, skip*2, D, X, twiddles);

    for(k = 0; k < N/2; k++) {
        /* Multiply entries of D by the twiddle factors e^(-2*pi*i/N * k) */
        D[k] = complex_mult(twiddles[k*skip], D[k]);
    }

    for(k = 0; k < N/2; k++) {
        X[k]       = complex_add(E[k], D[k]);
        X[k + N/2] = complex_sub(E[k], D[k]);
    }
}

complex* FFT_get_twiddle_factors(int N) {
    complex* twiddles = (complex*) malloc(sizeof(struct complex_t) * N/2);
    int k;
    for (k = 0; k != N/2; ++k) {
        twiddles[k] = complex_from_polar(1.0, -2.0*PI*k/N);
    }
    return twiddles;
}

complex* FFT_simple(complex* x, int N /* must be a power of 2 */) {
    complex* out = (complex*) malloc(sizeof(struct complex_t) * N);
    complex* scratch = (complex*) malloc(sizeof(struct complex_t) * N);
    complex* twiddles = FFT_get_twiddle_factors(N);
    FFT_calculate(x, N, 1, out, scratch, twiddles);
    free(twiddles);
    free(scratch);
    return out;
}```
It's said to be the best implementation of the algorithm, but it takes a lot of parameters I can't really understand. What are twiddles? What is the twiddle array? What is int skip? What are the X and D arrays? For now these are my questions; without answers I really can't understand the code.

For now I could use the first implementation too, if the best one becomes too difficult to understand, but I need to know at least what the parameter N means; and if someone is a bit of an expert in this kind of operation, whether they know more efficient or easier ways to implement the FFT. Thank you all for your attention and help; I'm giving my best effort to understand this code. I usually try to implement software by myself, but in this case it's clearly better to study famous, well-tested code written by "authorities", I think...

3. What is N?
N is the size of both the input and output vectors.

[Regarding the second chunk of FFT code:]
First, note that you are supposed to call FFT_simple, which in turn calls FFT_calculate, so you don't need to supply the extra parameters (they are set up in FFT_simple).

So you call it (and even the other one) something like this:
Code:
```#define SIZE 1024
complex in[SIZE], *out;
/* First fill in vector, then... */
out = FFT_simple(in, SIZE);```

4. Originally Posted by oogabooga
N is the size of both the input and output vectors.
Ok, perfect, so when I call the function I do FFT_simple(signalArray, lengthSignalArray)?
I wasn't so sure, because it's pretty easy to calculate the size of an array, so I thought it could be unnecessary to pass it as a parameter and calculate it inside the function instead.

Originally Posted by oogabooga
First, note that you are supposed to call FFT_simple, which in turn calls FFT_calculate, so you don't need to supply the extra parameters (they are set up in FFT_simple).

So you call it (and even the other one) something like this:
Code:
```#define SIZE 1024
complex in[SIZE], *out;
/* First fill in vector, then... */
out = FFT_simple(in, SIZE);```
Oh ok, I was confused by the fact that FFT_calculate was defined first...
Now it's a little clearer, even if the mechanics of how the FFT is calculated are still not so clear to me.

By the way, do you think it will work if I change the defines from double to float? My whole project is structured around arrays of complex with both real and imaginary parts as float...

Other question: are the definitions of complex and its operations to be saved in a .h file, or should I write a .c file? If I've learned anything here, I would put all that in a header file.

I wasn't so sure, because it's pretty easy to calculate the size of an array, so I thought it could be unnecessary to pass it as a parameter and calculate it inside the function instead.
Actually it's one of the oddities of C that it isn't possible for a function to know the length of an array passed to it since it "decays" into a pointer when passed. So you generally have to pass the length of arrays (although an alternative is to provide a sentinel value to mark the end, as is done with standard C strings where the sentinel is zero).

do you think it will work if I change the defines from double to float? My whole project is structured around arrays of complex with both real and imaginary parts as float.
It looks like if you change everywhere it says double to float it should be okay. You may need to cast the return values of sin, cos, and sqrt, since they are double. I've done this below.

Other question. Defines of complex and relative operation are to save in a .h file or i Should write a .c file? If I learned something here I would put all that in a header file
If your whole program is not that big (if it can comfortably fit into one file) then you can put everything into that one file and it will be a .c file (not a header file).

If on the other hand you want to split your project into multiple files, then you would put type definitions (like structs) and function prototypes (but not the function definitions which include the function bodies) in a header file.

So you might make a header file called complex.h containing this (the ifndef and define are to ensure that the header file is not included more than once) :
Code:
```#ifndef COMPLEX_H
#define COMPLEX_H

typedef struct complex_t {
    float re;
    float im;
} complex;

complex complex_from_polar(float r, float theta_radians);
float   complex_magnitude(complex c);
complex complex_add(complex left, complex right);
complex complex_sub(complex left, complex right);
complex complex_mult(complex left, complex right);

#endif```
You'd also make a code file called complex.c containing this:
Code:
```#include <math.h>
#include "complex.h"

complex complex_from_polar(float r, float theta_radians) {
    complex result;
    result.re = r * (float)cos(theta_radians);
    result.im = r * (float)sin(theta_radians);
    return result;
}

float complex_magnitude(complex c) {
    return (float)sqrt(c.re*c.re + c.im*c.im);
}

complex complex_add(complex left, complex right) {
    complex result;
    result.re = left.re + right.re;
    result.im = left.im + right.im;
    return result;
}

complex complex_sub(complex left, complex right) {
    complex result;
    result.re = left.re - right.re;
    result.im = left.im - right.im;
    return result;
}

complex complex_mult(complex left, complex right) {
    complex result;
    result.re = left.re*right.re - left.im*right.im;
    result.im = left.re*right.im + left.im*right.re;
    return result;
}```
Then in .c files where you need complex numbers, you would include complex.h.

6. Originally Posted by oogabooga
Actually it's one of the oddities of C that it isn't possible for a function to know the length of an array passed to it since it "decays" into a pointer when passed. So you generally have to pass the length of arrays (although an alternative is to provide a sentinel value to mark the end, as is done with standard C strings where the sentinel is zero).
Ok I understood the issue.
When I pass an array to a function, it's degraded to a pointer, so it becomes the address of the first element of the array, and there's no way to tell when the array ends, since I'd just keep scanning memory addresses past the end of the array. Right?

Originally Posted by oogabooga
It looks like if you change everywhere it says double to float it should be okay. You may need to cast the return values of sin, cos, and sqrt, since they are double. I've done this below.

[the rest of the quoted post, including the complex.h and complex.c listings, snipped; see post 5 above]
Thank you very much, I'm really learning a lot here from nice and available people like you. Thank you again, it's a pleasure to be here.

7. Although it's true that FFT-based convolution is O(n log n) and direct convolution is O(n^2), be aware that the FFT has a rather large constant factor lurking, which makes direct convolution more efficient for smaller kernel sizes. For autocorrelation or cross-correlation it is USUALLY a win to use the FFT, but you should really test it and find out before committing to FFT.

8. Originally Posted by brewbuck
Although it's true that FFT-based convolution is O(n log n) and direct convolution is O(n^2), be aware that the FFT has a rather large constant factor lurking, which makes direct convolution more efficient for smaller kernel sizes. For autocorrelation or cross-correlation it is USUALLY a win to use the FFT, but you should really test it and find out before committing to FFT.
True, although I'd use the word "often" rather than "usually".

O(n log n) is the complexity of the best known FFT algorithms; it has never been proven that this is a tight lower bound for computing the DFT, but no algorithm has been found that does better. FFT algorithms also don't play well with the CPU pipelining or cache optimisation techniques used by most processors from the last couple of decades. Some so-called "approximate FFT" algorithms aside, no FFT algorithm has been found that is amenable to parallelisation (eg parts distributed across processors), so can give unpleasant performance surprises if they are used blindly.

9. Ok, got it.
To cover both options, I decided to implement both the FFT method and standard time-domain convolution (I need the FFT to go to the frequency domain, simply multiply the transformed signals, and then transform the result back to the time domain).

So, my idea is a function that takes as a parameter an int deciding which method I want to use to calculate my convolution, so it should be something like:
Code:
```complex* convolute(complex* vector1, complex* vector2, int lenA, int lenB, int choice) {
    complex* result;   /* an array can't be assigned to, so use a pointer */

    if (choice == 1)
        result = convoluteByFFT(vector1, vector2, lenA, lenB);
    else
        result = stdConvolution(vector1, vector2, lenA, lenB);

    return result;
}```
I decided to do something like this (if it's correct) because I read that the FFT approach has some compatibility issues: it's an approximation of the real convolution and not very accurate, especially at the borders of the vector, so I decided to put in a little more effort and implement both.

To implement standard convolution, the steps should be:

• Reverse one of the vectors (method already written)
• Slide the reversed vector along the other one, and for each value of a counter i compute the value of the convolution at that point
• Save that value in convArray
• Return a pointer to the first element of convArray

Right?

Ok, I don't think I should hit many issues in my case (because the vectors are the same length), but to be as general as possible, I searched for source code that handles all situations.

I found this:

Code:
```//convolution algorithm
float *conv(float *A, float *B, int lenA, int lenB, int *lenC)
{
    int nconv;
    int i, j, i1;
    float tmp;
    float *C;

    //allocate convolution array
    nconv = lenA + lenB - 1;
    C = (float*) calloc(nconv, sizeof(float));

    //convolution process
    for (i = 0; i < nconv; i++)
    {
        i1 = i;
        tmp = 0.0;
        for (j = 0; j < lenB; j++)
        {
            if (i1 >= 0 && i1 < lenA)
                tmp = tmp + (A[i1] * B[j]);
            i1 = i1 - 1;
        }
        C[i] = tmp;   /* moved out of the inner loop; same result */
    }

    //get length of convolution array
    (*lenC) = nconv;

    //return convolution array
    return (C);
}```
It seems pretty straightforward, but I'm having trouble understanding some parts.

I don't understand what int* lenC is. It can't be the size of the convolved vector itself, because it's a pointer, and because, as far as I can see, the length of the result is calculated as nconv from the lengths of the input vectors (that part is pretty clear). At the end it sets *lenC to nconv, so I guess you pass the address of an int that will hold the length of the convolved vector at the end. Is it done so that I can iterate over the convolved vector outside the function, since there's no other way to get its length out there?

The implementation with the loops is not entirely clear: I don't understand the i1 >= 0 && i1 < lenA checks, and I can't see how to adapt it to the complex numbers I'm using.

Any hint?

10. UPDATE: I worked on the code by myself and I got it working!!
For now I've just made one try with real-part-only vectors and arrays of the same length; now I'll try it on different cases!

I'll post my code so someone can look at it, give me hints, or find bugs, and maybe it will be useful for someone looking for something like this!

Code:
```complex* stdConvolution(complex* A, complex* B, int lenA, int lenB, int* lenC)
{
    int nconv;
    int i, j, i1;
    complex tmp;
    complex *C;

    //allocate convolution array
    nconv = lenA + lenB - 1;
    C = (complex*) calloc(nconv, sizeof(complex));

    //convolution process
    for (i = 0; i < nconv; i++)
    {
        i1 = i;
        tmp.real = 0.0;
        tmp.img  = 0.0;

        for (j = 0; j < lenB; j++)
        {
            if (i1 >= 0 && i1 < lenA)
            {
                //complex multiply-accumulate: tmp += A[i1] * B[j]
                tmp.real += A[i1].real*B[j].real - A[i1].img*B[j].img;
                tmp.img  += A[i1].real*B[j].img  + A[i1].img*B[j].real;
            }
            i1 = i1 - 1;
        }
        C[i] = tmp;
    }

    //get length of convolution array
    (*lenC) = nconv;

    //return convolution array
    return (C);
}```
The main part is:

Code:
```complex vect1[8];
complex vect2[8];
complex* vettoreRes;
int x=0;
int* lenRes;
lenRes=&x;```
Then I call stdConvolution(vect1, vect2, 8, 8, lenRes), and when I print x and the values of the array elements it works fine!

UPDATE: it seems to work with imaginary-only vectors of the same length, too.

11. Originally Posted by grumpy
True, although I'd use the word "often" rather than "usually".

O(n log n) is the complexity of the best known FFT algorithms; it has never been proven that this is a tight lower bound for computing the DFT, but no algorithm has been found that does better. FFT algorithms also don't play well with the CPU pipelining or cache optimisation techniques used by most processors from the last couple of decades. Some so-called "approximate FFT" algorithms aside, no FFT algorithm has been found that is amenable to parallelisation (eg parts distributed across processors), so can give unpleasant performance surprises if they are used blindly.
They are not amenable to linearly scalable implementations, sure. In one dimension. They are embarrassingly parallel in multi-dimensional cases.

12. Originally Posted by brewbuck
They are not amenable to linearly scalable implementations, sure. In one dimension. They are embarrassingly parallel in multi-dimensional cases.
Sure, but the Cooley-Tukey algorithm (which the OP is trying to understand) is not a multi-dimensional algorithm.

Multi-dimensional FFTs usually break down to applying a linear FFT ("one dimension" in your speak) repeatedly on a set of distinct inputs. Applying multiple instances of a process on distinct inputs (say, one-dimensional FFT to each column of a matrix) is inherently a parallel operation. But the execution of the individual FFTs (for each "one dimension") are not further parallelisable.

13. Originally Posted by grumpy
Sure, but the Cooley-Tukey algorithm (which the OP is trying to understand) is not a multi-dimensional algorithm.

Multi-dimensional FFTs usually break down to applying a linear FFT ("one dimension" in your speak) repeatedly on a set of distinct inputs. Applying multiple instances of a process on distinct inputs (say, one-dimensional FFT to each column of a matrix) is inherently a parallel operation. But the execution of the individual FFTs (for each "one dimension") are not further parallelisable.
Nobody has parallelized an FFT to get linear scalability (I doubt it's possible even in theory), but you can certainly parallelize it. It's also rare that you do just one FFT, so you can do multiple cases in parallel. I get what you're saying, it's just not a huge catastrophe.

14. Originally Posted by grumpy
Some so-called "approximate FFT" algorithms aside, no FFT algorithm has been found that is amenable to parallelisation (eg parts distributed across processors), so can give unpleasant performance surprises if they are used blindly.
Quite the opposite, actually. Each power-of-two sub-convolution is done in its own stage (hence, "butterfly"), so in fact the individual terms can be split up, calculated in tandem, and then combined later. Incidentally, a slight variation on the FFT makes for a very fast, parallelizable multiplier, too!

Anyway, my suggestion to the OP would be to first partition the data with a suitable basis function before doing an FFT on the subsamples. Better resolution and less overall "noise"...