I've implemented the DCT (Type-II and Type-I) as well as the DFT. What I like about the DFT is the quality of the results. I suppose the nature of the DCT and DST plays into that. I don't fully understand yet the consequences of the even/real vs. odd/imaginary symmetry of the DCT/DST (the DCT assumes an even extension of the signal and produces a purely real spectrum, the DST an odd extension, while the DFT makes no symmetry assumption and gives you the full complex spectrum with phase), but I understand enough from exploring their results that it affects how 'well' they define the spectral components of a given audio sample. Well, I shouldn't say that too strongly; I'm gauging it from what I can see and the few statistics I'm computing on the results.
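For reference, here's a minimal sketch of the two transforms I'm comparing, written as naive O(N^2) loops (my own formulation and normalization choices, not anything special). It makes the real-vs-complex difference concrete: the DCT-II's output is all real, while the DFT hands back magnitude and phase together.

```cpp
#include <cmath>
#include <complex>
#include <vector>

const double PI = 3.14159265358979323846;

// DFT: no symmetry assumption, full complex spectrum (magnitude + phase).
std::vector<std::complex<double>> dft(const std::vector<double>& x) {
    const std::size_t N = x.size();
    std::vector<std::complex<double>> X(N);
    for (std::size_t k = 0; k < N; ++k) {
        std::complex<double> sum(0.0, 0.0);
        for (std::size_t n = 0; n < N; ++n) {
            const double angle = -2.0 * PI * double(k * n) / double(N);
            sum += x[n] * std::complex<double>(std::cos(angle), std::sin(angle));
        }
        X[k] = sum;
    }
    return X;
}

// DCT-II: equivalent to a DFT of the input mirrored into an even extension,
// so the spectrum collapses into purely real cosine coefficients.
std::vector<double> dct2(const std::vector<double>& x) {
    const std::size_t N = x.size();
    std::vector<double> X(N);
    for (std::size_t k = 0; k < N; ++k) {
        double sum = 0.0;
        for (std::size_t n = 0; n < N; ++n)
            sum += x[n] * std::cos(PI * double(k) * (2.0 * double(n) + 1.0) / (2.0 * double(N)));
        X[k] = sum;
    }
    return X;
}
```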
One of the things that matters most in my application is resolution, not so much precision. I want to get as much information about the signal as possible, even if I take a hit on the numeric precision of the results (if I have to lose something). Approaching this strategically, I'm leaning towards hardware-accelerating the DFT. GPUs are a little sloppy on floating-point quality in the general case, but they're fast, and would perform the 'luxurious' full DFT pretty damn quick regardless. If I used a 16x16 texture, that's 256 samples, which is double the number of samples I was already satisfied with. Most mid-range GPUs these days have at least 8 stream processors, and the newest ones out there now are over 800 for about $200 (ATI HD5880). From general testing I can tell that greater than 32-bit floating-point precision isn't necessary, which is another aspect in favor of GPUs: I could quadruple the amount of data I process by packing samples across the four 32-bit color channels and actually pull off 1024-sample DFTs. (I'd store the results in an associated 16x16 'result' texture.) (Of course even GPUs have performance limits, and I'm doing a lot there already, so I'll have to find a balance.)
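To make the packing idea concrete, here's a sketch of one workable layout, assuming a 16x16 RGBA 32-bit float texture. The scheme (each color channel carries a contiguous 256-sample block) and all the names are my own illustration, not an established format:

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t kTexDim   = 16;                  // 16x16 texture
constexpr std::size_t kTexels   = kTexDim * kTexDim;   // 256 texels
constexpr std::size_t kChannels = 4;                   // R, G, B, A
constexpr std::size_t kSamples  = kTexels * kChannels; // 1024 samples

// Flat float buffer in the interleaved order a typical RGBA float upload
// expects: texel 0 = {R,G,B,A}, texel 1 = {R,G,B,A}, ...
using TexelBuffer = std::array<float, kSamples>;

// Each color channel carries one contiguous 256-sample block, so a single
// 16x16 RGBA texture holds 4 x 256 = 1024 samples. A shader could then
// treat .r/.g/.b/.a as four consecutive windows, or as the four quarters
// of one 1024-sample window.
void packSamples(const float* samples /* kSamples of them */, TexelBuffer& out) {
    for (std::size_t i = 0; i < kSamples; ++i) {
        const std::size_t channel = i / kTexels; // which of R, G, B, A
        const std::size_t texel   = i % kTexels; // 0..255 within the grid
        out[texel * kChannels + channel] = samples[i];
    }
}
```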
I think I'm OK with recommending a cheap-o $30 card like a 9800 Pro, calling it the min spec, and doing all of this with GPU-CPU transfers as the only real limit on CPU load. Hell, maybe I can offload all of my statistics to the GPU too. I think I've already convinced myself to go a whole new route with all of this. (I have *a lot* of other information to track for the real-time visual models, so I need to be as strict as possible about what the CPU is doing.)
I have to say, though, there are some things I can't hardware-accelerate, like the classes I keep trying to optimize that track and hold vertex locations (particle classes). I never thought to consider the role of L1/L2 cache limits when dealing with large enough arrays/objects(?). I think I can apply that in other areas where I'm still reaching for performance, having to track and manage 9k-13k+ vertices. I'm a little excited to think there's room for a decent increase in the number of vertices I can manage within 2-3% CPU use on a ~2.5 GHz C2D, if it turns out I've been a lot sloppier than I realized.
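For what it's worth, the usual cache-friendliness move for this kind of particle tracking is structure-of-arrays over array-of-structures, so a pass that only touches positions pulls only position data through the cache lines. A minimal sketch of the idea (my own illustration, hypothetical names):

```cpp
#include <cstddef>
#include <vector>

// Array-of-structures: every particle's fields sit adjacent in memory, so a
// position-only update still drags velocities (and anything else) through
// the L1/L2 cache along with the data it actually needs.
struct ParticleAoS {
    float x, y, z;
    float vx, vy, vz;
    float life;
};

// Structure-of-arrays: each field is its own contiguous array, so the hot
// loop streams through exactly the bytes it uses -- friendly to the
// hardware prefetcher when iterating 9k-13k+ vertices per frame.
struct ParticlesSoA {
    std::vector<float> x, y, z;
    std::vector<float> vx, vy, vz;

    explicit ParticlesSoA(std::size_t n)
        : x(n), y(n), z(n), vx(n), vy(n), vz(n) {}

    // Tight, cache-linear integration pass.
    void integrate(float dt) {
        const std::size_t n = x.size();
        for (std::size_t i = 0; i < n; ++i) {
            x[i] += vx[i] * dt;
            y[i] += vy[i] * dt;
            z[i] += vz[i] * dt;
        }
    }
};
```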
Good stuff, I got more out of this thread than I thought I would; thinking this out sorta helped me rethink a few things that should have been more obvious to me earlier.