# Thread: How to use DFT/FFT (Conceptual)

1. So I have two pieces of audio and I would like to see how similar they are, i.e. determine whether a person is saying the same thing in each piece of audio. I've heard from many places that generating a DFT can help; however, I'm not 100% certain what the FFT outputs, or how I can use that output to help me. Any advice is appreciated.

Thanks.

2. I don't think a DFT/FFT by itself would help you much; you would still have a bunch of data to interpret and classify, which strikes me as the core of the problem, and that would likely involve some machine learning. I also think the time domain would be just as important, since it's not only timbre that differs between two spoken phrases.

3. The outputs of an FFT are complex values that indicate the magnitude and phase of the various frequency components of a signal. In speech analysis, many FFTs are performed on windows that slide through the signal; this is called the STFT, or short-time Fourier transform. The result of this (after some further processing, which is problem-specific) is called a spectrogram. The spectrogram shows the contribution of different frequencies at different points in time.
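To make the windowing idea concrete, here is a minimal sketch of an STFT-style magnitude spectrogram (my own illustration, not code from this thread; class and method names are made up). It slides a Hann-windowed frame through the signal and computes a magnitude spectrum per frame. A naive direct DFT is used for clarity; real code would call an FFT library instead.

```java
// Sketch of a short-time Fourier transform (STFT) magnitude spectrogram.
public class Stft {
    // Magnitude spectrum of one frame via the direct DFT definition.
    // For a real-valued input only bins 0..n/2 carry unique information.
    static double[] dftMagnitude(double[] frame) {
        int n = frame.length;
        double[] mag = new double[n / 2 + 1];
        for (int k = 0; k < mag.length; k++) {
            double re = 0, im = 0;
            for (int t = 0; t < n; t++) {
                double angle = -2 * Math.PI * k * t / n;
                re += frame[t] * Math.cos(angle);
                im += frame[t] * Math.sin(angle);
            }
            mag[k] = Math.sqrt(re * re + im * im);
        }
        return mag;
    }

    // Slide a window through the signal; each row of the result is one
    // frame's magnitude spectrum, i.e. one column of the spectrogram.
    static double[][] spectrogram(double[] signal, int windowSize, int hop) {
        int frames = (signal.length - windowSize) / hop + 1;
        double[][] spec = new double[frames][];
        for (int f = 0; f < frames; f++) {
            double[] frame = new double[windowSize];
            for (int t = 0; t < windowSize; t++) {
                // Hann window reduces spectral leakage at the frame edges.
                double hann = 0.5 - 0.5 * Math.cos(2 * Math.PI * t / (windowSize - 1));
                frame[t] = signal[f * hop + t] * hann;
            }
            spec[f] = dftMagnitude(frame);
        }
        return spec;
    }
}
```

Feeding this a pure tone should produce a single dominant bin at the tone's frequency, which is a quick sanity check on any spectrogram code.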

A spectrogram will show both the pitch and formant structure of the human voice, and can certainly be used to compare the voices of two people, though this is far beyond the scope of anything we could meaningfully explain on this board.

4. ...and that's to say nothing of the many, many filters you may need to process the signal.

Soma

5. If you are trying to do something simple, like voice recognition of a handful of commands, from a single speaker, then something like cross-correlation could be useful.
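As a sketch of that simple approach (my illustration, with hypothetical method names): a normalized cross-correlation scores two signals between -1 and +1, and trying a small range of lags absorbs minor timing offsets between the two recordings.

```java
// Sketch of normalized cross-correlation between two signals.
// The score at each lag is a Pearson correlation over the overlapping
// samples; taking the maximum over a range of lags handles small offsets.
public class CrossCorrelation {
    // Pearson correlation of a with b shifted by `lag` samples.
    static double correlationAtLag(double[] a, double[] b, int lag) {
        double sumA = 0, sumB = 0, sumAB = 0, sumA2 = 0, sumB2 = 0;
        int n = 0;
        for (int i = 0; i < a.length; i++) {
            int j = i + lag;
            if (j < 0 || j >= b.length) continue; // skip non-overlapping samples
            sumA += a[i]; sumB += b[j];
            sumAB += a[i] * b[j];
            sumA2 += a[i] * a[i]; sumB2 += b[j] * b[j];
            n++;
        }
        double num = n * sumAB - sumA * sumB;
        double den = Math.sqrt(n * sumA2 - sumA * sumA)
                   * Math.sqrt(n * sumB2 - sumB * sumB);
        return den == 0 ? 0 : num / den;
    }

    // Best correlation over a window of lags; near +1 means "very similar".
    static double bestCorrelation(double[] a, double[] b, int maxLag) {
        double best = -1;
        for (int lag = -maxLag; lag <= maxLag; lag++)
            best = Math.max(best, correlationAtLag(a, b, lag));
        return best;
    }
}
```

In practice you would threshold `bestCorrelation` (say, above 0.8 counts as a match), tuning the threshold on known examples.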

If you need something more complex than that, where you must deal with any number of people who may speak, longer speech samples (perhaps a dozen words or more), and all the variances in human speech, then cross-correlation probably won't cut it.

By "variances", I mean two people not only have different timbres, they may have a different accent or a different rhythm when they speak, or they may have a cold that affects speech (e.g. nasalized sounds, a bigger issue in some languages than others). Many people use words like "um" as a sort of filler, which you may want to disregard, or they stretch their syllables out to give their brain more time to assemble the next thing they are going to say. If you're dealing with a tonal language (Mandarin, Cantonese, Vietnamese), you have that to account for too. Even in English, we use a rising tone at the end of a phrase to denote a question. Compare "I went to the store." and "I went to the store?"

If you need that more complex analysis, you are basically doing speech-to-text. You will need to turn each speech sample into a string of phonemes, and break that string of phonemes into chunks that each represent a word. Do that for both pieces, and if they match (or are sufficiently close), then they are the same.

Note, for either of these cases, there are probably some third-party tools/libraries out there that can help you. And you will always have to settle for a "close enough" match; the only way to guarantee an exact match is to compare something against itself.

As others have pointed out, this is a very complex task. You should do some serious research, thinking and planning before you even think about writing any code.

6. Originally Posted by anduril462
You should do some serious research, thinking and planning before you even think about writing any code.
This is pretty much an always thing.

Soma

7. Originally Posted by anduril462
If you are trying to do something simple, like voice recognition of a handful of commands, from a single speaker, then something like cross-correlation could be useful.
I'm trying to find cases where a speaker repeats themselves, so it will be the same speaker recorded with the same hardware, etc., though there might be slight fluctuations in how they say the word each time. I've already cut up the audio and isolated the different utterances, so I just need to compare neighbouring utterances to one another to see whether the speaker repeated themselves.

Do you know how quick this process would be? A typical audio file might have roughly 100 utterances in it, so I'd like to compare ~ 100 utterances in a respectable time... perhaps less than 10 seconds?

For now I've implemented my own method of comparing audio: I simplify the waveform, then compare the number of peaks and the slope of each peak against the other waveform. It's only about 70% accurate, so I'm hoping something like FFT/cross-correlation will improve on that.

8. There are several things to consider here.

When a speaker's voice rises in pitch, all of the harmonic tones of the voice shift, but the formants remain in the same locations. So it is helpful to first isolate the formants from the pitch harmonics so they can be compared separately. There are entire books on just this element of the problem.

Second, even with the same pitch and formant structure, a sound may be drawn out in time (speaker speaking slower/faster than usual).
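One standard way to handle that time-stretching (not named in this thread, but worth knowing) is dynamic time warping: it aligns two feature sequences while letting individual frames stretch or compress. A minimal sketch, assuming each utterance has already been turned into a sequence of per-frame feature vectors (e.g. the spectrogram columns discussed above):

```java
// Sketch of dynamic time warping (DTW) between two feature sequences.
// Lower distance means more similar; a perfectly aligned pair scores 0.
public class Dtw {
    static double distance(double[][] a, double[][] b) {
        int n = a.length, m = b.length;
        double[][] cost = new double[n + 1][m + 1];
        for (double[] row : cost) java.util.Arrays.fill(row, Double.POSITIVE_INFINITY);
        cost[0][0] = 0;
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                double d = euclidean(a[i - 1], b[j - 1]);
                // Each cell extends the cheapest of: match both frames,
                // stretch sequence a, or stretch sequence b.
                cost[i][j] = d + Math.min(cost[i - 1][j - 1],
                                 Math.min(cost[i - 1][j], cost[i][j - 1]));
            }
        }
        return cost[n][m];
    }

    // Euclidean distance between two equal-length feature vectors.
    static double euclidean(double[] x, double[] y) {
        double s = 0;
        for (int k = 0; k < x.length; k++) s += (x[k] - y[k]) * (x[k] - y[k]);
        return Math.sqrt(s);
    }
}
```

The key property is that an utterance and a slower copy of the same utterance still score near zero, because the alignment path is allowed to repeat frames.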

Some of these computations are best performed directly in signal space (e.g. prefiltering), some of them are best in the overall frequency domain, and some are almost like image-processing operations that are performed on the spectrogram.

I can help you with very specific questions, but I'm not really able to write out an entire book on voice signal processing in a few minutes.

9. I attempted to do cross-correlation by following the instructions here: fft - How do I implement cross-correlation to prove two audio files are similar? - Signal Processing Beta - Stack Exchange

However, I might not understand it correctly, as the results I'm getting are nowhere between 0 and 1 like it says in that first response. Here is what I am doing:

I realize this is a C board, but I ended up writing this in Java; I figured the syntax was close enough. Before this code snippet runs, I have generated an FFT for each of my two audio pieces, called fft1 and fft2.

Code:
```
float stdev1;
float stdev2;

stdev1 = find_stdev(fft1);
stdev2 = find_stdev(fft2);

// calculate covariance
double covariance = 0;
for (int i = 0; i < fft1.length; i += 1) {
    covariance += (double) fft1[i] * fft2[i];
}
System.out.println("CORRELATION = " + (covariance / (stdev1 * stdev2)));
```
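For reference, here is a self-contained version of this computation (the `find_stdev` helper isn't shown in the post, so the one below is an assumption about what it does). Note two things the snippet above is missing: the means must be subtracted, and the covariance must be divided by the sample count; with both, the result is a Pearson correlation, which does land between -1 and 1.

```java
// Self-contained normalized (Pearson) correlation of two spectra.
// The mean/stdev helpers stand in for the unshown find_stdev.
public class SpectrumCorrelation {
    static double mean(float[] x) {
        double s = 0;
        for (float v : x) s += v;
        return s / x.length;
    }

    // Population standard deviation.
    static double stdev(float[] x) {
        double m = mean(x), s = 0;
        for (float v : x) s += (v - m) * (v - m);
        return Math.sqrt(s / x.length);
    }

    // Pearson correlation of two equal-length magnitude spectra.
    static double correlation(float[] fft1, float[] fft2) {
        double m1 = mean(fft1), m2 = mean(fft2);
        double covariance = 0;
        for (int i = 0; i < fft1.length; i++)
            covariance += (fft1[i] - m1) * (fft2[i] - m2);
        covariance /= fft1.length; // average, not a raw sum of products
        return covariance / (stdev(fft1) * stdev(fft2));
    }
}
```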
