Thread: How to use DFT/FFT (Conceptual)

  1. #1
    Registered User
    Join Date
    Jul 2011
    Posts
    6

    How to use DFT/FFT (Conceptual)

    So I have two pieces of audio and I would like to see how similar they are... i.e. determine if a person is saying the same thing in each piece of audio. I've heard from many places that generating a DFT can help, however I'm not 100% certain what the FFT outputs, and how I can use that to help me. Any advice is appreciated.

    Thanks.

  2. #2
    Registered User
    Join Date
    Jan 2009
    Posts
    1,485
    I don't think a DFT/FFT by itself would help you much. You would still have a bunch of data to interpret and classify, which strikes me as the core of the problem and would likely involve some machine learning. I also think the time domain would be just as important, since it's not only timbre that differs between two spoken phrases.

  3. #3
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    The outputs of an FFT are complex values which indicate the magnitude and phase of the various frequency components of a signal. In speech analysis, many FFTs are performed on windows which slide through the signal; this is called the STFT, or short-time Fourier transform. The result of this (after some further processing which is problem-specific) is called a spectrogram. The spectrogram shows the contribution of different frequencies at different points in time.
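
    To make "complex values which indicate magnitude and phase" concrete, here is a minimal sketch using a naive O(N^2) DFT rather than a real FFT library. The class and method names (DftSketch, dft, peakBin) are made up for illustration, and the test signal is a pure sine that completes exactly 5 cycles in the buffer, which is an assumption that keeps the energy in a single bin.

```java
class DftSketch {
    // Naive O(N^2) DFT. Returns {re, im}, each of length x.length.
    static double[][] dft(double[] x) {
        int n = x.length;
        double[] re = new double[n];
        double[] im = new double[n];
        for (int k = 0; k < n; k++) {
            for (int t = 0; t < n; t++) {
                double angle = -2.0 * Math.PI * k * t / n;
                re[k] += x[t] * Math.cos(angle);
                im[k] += x[t] * Math.sin(angle);
            }
        }
        return new double[][] { re, im };
    }

    // Magnitude of one complex bin: how strong that frequency is.
    static double magnitude(double re, double im) {
        return Math.hypot(re, im);
    }

    // Phase of one complex bin: where in its cycle that frequency starts.
    static double phase(double re, double im) {
        return Math.atan2(im, re);
    }

    // Index of the strongest bin in 0..N/2 (for real input the upper half mirrors it).
    static int peakBin(double[] x) {
        double[][] s = dft(x);
        int best = 0;
        for (int k = 1; k <= x.length / 2; k++) {
            if (magnitude(s[0][k], s[1][k]) > magnitude(s[0][best], s[1][best])) {
                best = k;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        int n = 64;
        double[] x = new double[n];
        for (int t = 0; t < n; t++) {
            x[t] = Math.sin(2 * Math.PI * 5 * t / n); // 5 cycles per buffer
        }
        double[][] s = dft(x);
        int k = peakBin(x);
        System.out.println("peak bin  = " + k);
        System.out.println("magnitude = " + magnitude(s[0][k], s[1][k]));
        System.out.println("phase     = " + phase(s[0][k], s[1][k]));
    }
}
```

    Bin k corresponds to k cycles per buffer, so the frequency in Hz is k * sampleRate / N. A real FFT gives the same numbers, only faster.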

    A spectrogram will show both the pitch and formant structure of the human voice, and can certainly be used to compare the voices of two people, though this is far beyond the scope of anything we could meaningfully explain on this board.
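
    As a rough illustration of the sliding-window idea, the sketch below builds a magnitude spectrogram with a Hann window and a naive DFT per frame. Everything here is an assumption for the sake of the example: the names (SpectrogramSketch, spectrogram, loudestBin), the frame size of 64, the hop of 32, and the two-tone test signal. Real speech work would use an FFT library and add the problem-specific processing mentioned above.

```java
class SpectrogramSketch {
    // Magnitude spectrogram: result[frame][bin], with bins 0..frameSize/2.
    static double[][] spectrogram(double[] signal, int frameSize, int hop) {
        int frames = signal.length < frameSize ? 0 : 1 + (signal.length - frameSize) / hop;
        int bins = frameSize / 2 + 1;
        double[][] out = new double[frames][bins];
        for (int f = 0; f < frames; f++) {
            int start = f * hop;
            for (int k = 0; k < bins; k++) {
                double re = 0, im = 0;
                for (int t = 0; t < frameSize; t++) {
                    // Hann window tapers the frame edges to reduce spectral leakage.
                    double w = 0.5 - 0.5 * Math.cos(2 * Math.PI * t / (frameSize - 1));
                    double angle = -2 * Math.PI * k * t / frameSize;
                    re += w * signal[start + t] * Math.cos(angle);
                    im += w * signal[start + t] * Math.sin(angle);
                }
                out[f][k] = Math.hypot(re, im);
            }
        }
        return out;
    }

    // Index of the strongest bin within one frame.
    static int loudestBin(double[] frame) {
        int best = 0;
        for (int k = 1; k < frame.length; k++) {
            if (frame[k] > frame[best]) best = k;
        }
        return best;
    }

    public static void main(String[] args) {
        // First half: 4 cycles per 64 samples. Second half: 12 cycles per 64 samples.
        double[] signal = new double[512];
        for (int t = 0; t < signal.length; t++) {
            int cycles = t < 256 ? 4 : 12;
            signal[t] = Math.sin(2 * Math.PI * cycles * t / 64.0);
        }
        double[][] s = spectrogram(signal, 64, 32);
        for (int f = 0; f < s.length; f++) {
            System.out.println("frame " + f + ": loudest bin " + loudestBin(s[f]));
        }
    }
}
```

    Each row is one point in time and each column one frequency band, which is why later stages can treat the result almost like an image.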
    Code:
    //try
    //{
    	if (a) do { f( b); } while(1);
    	else   do { f(!b); } while(1);
    //}

  4. #4
    Master Apprentice phantomotap's Avatar
    Join Date
    Jan 2008
    Posts
    5,108
    ...and that's to say nothing of the many, many filters you may need to process the signal.

    Soma

  5. #5
    Registered User
    Join Date
    Nov 2010
    Location
    Long Beach, CA
    Posts
    5,909
    If you are trying to do something simple, like voice recognition of a handful of commands, from a single speaker, then something like cross-correlation could be useful.
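
    A minimal sketch of what cross-correlation could look like for this simple case: slide one signal against the other, compute a normalized score at each lag, and keep the best one. The names (CrossCorrelationSketch, correlationAtLag, bestAlignment) and the maximum lag are made up for illustration, and the code assumes both signals are already roughly zero-mean, as raw audio samples usually are.

```java
class CrossCorrelationSketch {
    // Normalized correlation of a[i] against b[i - lag] over the overlapping samples.
    // The result lies in [-1, 1]; 1 means the overlap is identical up to a positive scale.
    static double correlationAtLag(double[] a, double[] b, int lag) {
        double sumAB = 0, sumAA = 0, sumBB = 0;
        for (int i = 0; i < a.length; i++) {
            int j = i - lag;
            if (j < 0 || j >= b.length) continue; // outside the overlap
            sumAB += a[i] * b[j];
            sumAA += a[i] * a[i];
            sumBB += b[j] * b[j];
        }
        double denom = Math.sqrt(sumAA * sumBB);
        return denom == 0 ? 0 : sumAB / denom;
    }

    // Searches lags in [-maxLag, maxLag] and returns {bestLag, bestScore}.
    static double[] bestAlignment(double[] a, double[] b, int maxLag) {
        int bestLag = 0;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int lag = -maxLag; lag <= maxLag; lag++) {
            double score = correlationAtLag(a, b, lag);
            if (score > bestScore) {
                bestScore = score;
                bestLag = lag;
            }
        }
        return new double[] { bestLag, bestScore };
    }

    public static void main(String[] args) {
        // b is an arbitrary non-periodic test signal; a is b delayed by 3 samples.
        double[] b = new double[200];
        for (int i = 0; i < b.length; i++) b[i] = Math.sin(0.05 * i * i);
        double[] a = new double[200];
        for (int i = 3; i < a.length; i++) a[i] = b[i - 3];

        double[] r = bestAlignment(a, b, 10);
        System.out.println("best lag = " + (int) r[0] + ", score = " + r[1]);
    }
}
```

    You would then pick a threshold on the best score to decide "same" or "different". The threshold has to be tuned on real recordings; there is no universal value.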


    If you need something more complex than that, where you must deal with any number of people who may speak, longer speech samples (perhaps a dozen words or more) and you have to consider all the variances in human speech, then cross-correlation probably won't cut it. By "variances", I mean two people not only have different timbres, they may have a different accent or different rhythm when they speak, or they may have a cold that affects speech (e.g. nasalized sounds -- a bigger issue in some languages than others). Many people use words like "um" as a sort of filler, which you may want to disregard, or they stretch their syllables out to give their brain more time to assemble the next thing they are going to say. If you're dealing with a tonal language (Mandarin, Cantonese, Vietnamese), you have that to account for too. Even in English, we use a rising tone at the end of a phrase to denote a question. Compare "I went to the store." and "I went to the store?" If you need that more complex analysis, you are basically doing speech-to-text. You will need to turn that speech sample into a string of phonemes, and break that string of phonemes into chunks that each represent a word. Do that for both pieces, and if they match (or are sufficiently close), then they are the same.
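
    The "sufficiently close" comparison of two phoneme strings could be done with an edit distance. The sketch below uses plain Levenshtein distance over arrays of phoneme symbols; that choice, the names (PhonemeDistanceSketch, editDistance, similarity) and the ARPAbet-style symbols are all assumptions for illustration. Getting from audio to phonemes is the hard part and is not shown.

```java
class PhonemeDistanceSketch {
    // Levenshtein distance: minimum number of insertions, deletions and
    // substitutions needed to turn phoneme sequence a into sequence b.
    static int editDistance(String[] a, String[] b) {
        int[][] d = new int[a.length + 1][b.length + 1];
        for (int i = 0; i <= a.length; i++) d[i][0] = i;
        for (int j = 0; j <= b.length; j++) d[0][j] = j;
        for (int i = 1; i <= a.length; i++) {
            for (int j = 1; j <= b.length; j++) {
                int sub = a[i - 1].equals(b[j - 1]) ? 0 : 1;
                d[i][j] = Math.min(d[i - 1][j - 1] + sub,
                          Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1));
            }
        }
        return d[a.length][b.length];
    }

    // Similarity in [0, 1]: 1 means identical sequences.
    static double similarity(String[] a, String[] b) {
        int longest = Math.max(a.length, b.length);
        return longest == 0 ? 1.0 : 1.0 - (double) editDistance(a, b) / longest;
    }

    public static void main(String[] args) {
        // Hypothetical transcriptions of "store" from two recordings.
        String[] first = { "S", "T", "AO", "R" };
        String[] second = { "S", "T", "AA", "R" };
        System.out.println("distance   = " + editDistance(first, second));
        System.out.println("similarity = " + similarity(first, second));
    }
}
```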


    Note that for either of these cases, there are probably some third-party tools/libraries out there that can help you. You will also always have to settle for a "close enough" match; the only way to guarantee an exact match is to compare something against itself.

    As others have pointed out, this is a very complex task. You should do some serious research, thinking and planning before you even think about writing any code.

  6. #6
    Master Apprentice phantomotap's Avatar
    Join Date
    Jan 2008
    Posts
    5,108
    Quote Originally Posted by anduril462
    You should do some serious research, thinking and planning before you even think about writing any code.
    This is pretty much an always thing.

    Soma

  7. #7
    Registered User
    Join Date
    Jul 2011
    Posts
    6
    Quote Originally Posted by anduril462
    If you are trying to do something simple, like voice recognition of a handful of commands, from a single speaker, then something like cross-correlation could be useful.
    I'm trying to find cases where a speaker repeats themselves, so it will be the same speaker recorded with the same hardware, etc. There might be slight fluctuations in how they said the word, though. I've already cut up the audio and isolated the individual utterances, so I just need to compare adjacent utterances to one another to see if the speaker repeated themselves.

    Do you know how quick this process would be? A typical audio file might have roughly 100 utterances in it, so I'd like to compare ~ 100 utterances in a respectable time... perhaps less than 10 seconds?

    For now I've implemented my own method of comparing audio, by simplifying the waveform and comparing the number of peaks in the waveform as well as the slope of each peak to another waveform. It's only about 70% accurate, so I'm hoping something like FFT/cross-correlation will improve upon that.

  8. #8
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    There are several things to consider here.

    When a speaker's voice rises in pitch, there is a shift in all of the harmonic tones of the voice, but the formants remain in the same locations. So it is helpful to first isolate the formants from the pitch harmonics so they can be compared separately. There are entire books on just this element of the problem.

    Second, even with the same pitch and formant structure, a sound may be drawn out in time (speaker speaking slower/faster than usual).
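
    One common way to compare sequences where one is stretched in time relative to the other is dynamic time warping (DTW). That is a technique I am naming here for illustration, not something prescribed above. The sketch works on plain arrays of numbers; for speech you would more likely feed it one feature vector per spectrogram frame. The names (DtwSketch, dtwDistance) are made up, and both inputs are assumed to be non-empty.

```java
class DtwSketch {
    // Classic DTW with absolute difference as the local cost.
    // Returns the total cost of the cheapest alignment of a onto b;
    // 0 means b is a (possibly time-stretched) copy of a.
    static double dtwDistance(double[] a, double[] b) {
        int n = a.length, m = b.length;
        double[][] cost = new double[n + 1][m + 1];
        for (double[] row : cost) {
            java.util.Arrays.fill(row, Double.POSITIVE_INFINITY);
        }
        cost[0][0] = 0;
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                double d = Math.abs(a[i - 1] - b[j - 1]);
                // Extend the cheapest of: stay on a[i-1], stay on b[j-1], or advance both.
                double best = Math.min(cost[i - 1][j - 1],
                              Math.min(cost[i - 1][j], cost[i][j - 1]));
                cost[i][j] = d + best;
            }
        }
        return cost[n][m];
    }

    public static void main(String[] args) {
        double[] normal = { 1, 2, 3, 2, 1 };
        double[] slow = { 1, 1, 2, 2, 3, 3, 2, 2, 1, 1 }; // same shape, spoken "slower"
        double[] other = { 5, 5, 5 };
        System.out.println("normal vs slow  = " + dtwDistance(normal, slow));
        System.out.println("normal vs other = " + dtwDistance(normal, other));
    }
}
```

    The cost table is n by m, so this is quadratic in the sequence lengths. For short utterances that is usually tolerable, but it is worth measuring before comparing many pairs.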

    Some of these computations are best performed directly in signal space (e.g. prefiltering), some of them are best in the overall frequency domain, and some are almost like image-processing operations that are performed on the spectrogram.

    I can help you with very specific questions, but I'm not really able to write out an entire book on voice signal processing in a few minutes.

  9. #9
    Registered User
    Join Date
    Jul 2011
    Posts
    6
    I attempted to do cross correlation by following these instructions here: fft - How do I implement cross-correlation to prove two audio files are similar? - Signal Processing Beta - Stack Exchange

    However, I might not understand it correctly, as the results I'm getting are nowhere near the range between 0 and 1 that the first response describes. Here is what I am doing:

    I realize this is a C board, but I ended up writing this in Java; I figured the syntax was close enough. In this code snippet I have generated an FFT for each of my two audio pieces, called fft1 and fft2.

    Code:
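    // Assumes fft1 and fft2 are float[] of equal length and that
    // find_stdev returns the population standard deviation.
    float stdev1 = find_stdev(fft1);
    float stdev2 = find_stdev(fft2);

    // compute the means so they can be subtracted out
    double mean1 = 0;
    double mean2 = 0;
    for (int i = 0; i < fft1.length; i += 1) {
        mean1 += fft1[i];
        mean2 += fft2[i];
    }
    mean1 /= fft1.length;
    mean2 /= fft1.length;

    // calculate covariance: average product of the mean-removed values
    double covariance = 0;
    for (int i = 0; i < fft1.length; i += 1) {
        covariance += (fft1[i] - mean1) * (fft2[i] - mean2);
    }
    covariance /= fft1.length;

    // dividing by both standard deviations keeps the result within [-1, 1]
    System.out.println("CORRELATION = " + (covariance / (stdev1 * stdev2)));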
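
    For comparison, here is a self-contained sketch of a Pearson-style correlation coefficient, which always lands in [-1, 1] as long as both inputs have non-zero spread. It subtracts the mean of each array, averages the products, and divides by both standard deviations. The names (PearsonSketch, mean, stdev, correlation) are made up; stdev here is a hypothetical stand-in for find_stdev and assumes the population form (divide by N).

```java
class PearsonSketch {
    static double mean(float[] v) {
        double sum = 0;
        for (float x : v) sum += x;
        return sum / v.length;
    }

    // Population standard deviation (divides by N, not N - 1).
    static double stdev(float[] v) {
        double m = mean(v);
        double sum = 0;
        for (float x : v) sum += (x - m) * (x - m);
        return Math.sqrt(sum / v.length);
    }

    // Pearson correlation coefficient of two equal-length arrays.
    static double correlation(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        double ma = mean(a);
        double mb = mean(b);
        double covariance = 0;
        for (int i = 0; i < a.length; i++) {
            covariance += (a[i] - ma) * (b[i] - mb);
        }
        covariance /= a.length;
        return covariance / (stdev(a) * stdev(b));
    }

    public static void main(String[] args) {
        float[] a = { 1, 2, 3, 4 };
        float[] scaled = { 2, 4, 6, 8 };
        float[] reversed = { 8, 6, 4, 2 };
        System.out.println("a vs scaled   = " + correlation(a, scaled));
        System.out.println("a vs reversed = " + correlation(a, reversed));
    }
}
```

    If the inputs are raw FFT outputs, it probably makes sense to correlate the magnitudes of the bins rather than interleaved real and imaginary parts, and both arrays need to cover the same number of bins.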
