![]() |
| | #1 |
| Registered User Join Date: Nov 2009
Posts: 77
| CPU increase with precomputed tables? The problem is that using precomputed tables results in higher CPU usage and that's confusing me. The routine which creates the table is in fact occurring outside of the applications render loop. It is NOT being recomputed every iteration. So here are the details; I've designed a class which executes some numerical algorithms and I've added some methods which allow user to specify the location to some precomputed tables to speed things up. Code: vector< vector< float > > *TCOS, *TSIN;
//...
void ALG::UseTrigTables(vector< vector< float > > *fsin, vector< vector< float > > *fcos)
{
TCOS = fcos;
TSIN = fsin;
}
Code: vector< vector< float > > tcos(N, vector< float > (N)), tsin(N, vector< float > (N)); // [insert loop here to fill vector-vectors with data]... etc... Blah.UseTrigTables(&fsin, &fcos) Code: float f0;
for (int k = 0; k < N; ++k)
for (int j = 0; j < N; ++j)
{
f0 = fA[k] * fB[j] * (*TSIN)[k][j];
// ...
}
With trig: ~8% With tables: ~10% Where N = 128 (size of tables/loop) My first guess; While I am passing a pointer, maybe I'm missing something critical and the tables are actually being copied to a new location in memory every time i call them?? If not, what exact tools do you recommend I use to profile this applications behavior? Last edited by since; 11-20-2009 at 01:55 PM. |
| since is offline | |
| | #2 |
| C++ Witch Join Date: Oct 2003 Location: Singapore
Posts: 11,345
| Perhaps in your test program you only access the tables once, thus unnecessary computations are not eliminated since they never happen in the first place.
__________________ C + C++ Compiler: MinGW port of GCC Build + Version Control System: SCons + Bazaar Look up a C/C++ Reference and learn How To Ask Questions The Smart Way |
| laserlight is online now | |
| | #3 |
| Registered User Join Date: Nov 2009
Posts: 77
| Accessing tables definitely occurs every iteration, I don't want to get specific because people get carried away on forums, but for arguments sake I'm talking about certain fourier transforms. So the access must occur to get results for each frame that I am processing, another thing to note is that I've removed all math and get a very clear decrease in CPU usage. So I have a baseline that says clearly no math = less CPU intense, which is pretty obvious, but more importantly supports the notion that extra work is happening when both trig and the tables are in place. Interestingly enough, the amount of 'more' work being done by the lookup-tables is higher than just using trig functions. Any other ideas? This is going to drive me mad, Lol. I'll have to spend some time looking at what the good open source profiling tools there are, unless you have any recommendations? ... I have this feeling that passing vector pointers isn't doing what I'm expecting, I hope I'm wrong, trying to get 2d arrays to pass around without vectors has lead me to some painful attempts with little success. Maybe I'll try passing old school arrays wrapped in a class... Last edited by since; 11-20-2009 at 02:14 PM. |
| since is offline | |
| | #4 | |
| C++ Witch Join Date: Oct 2003 Location: Singapore
Posts: 11,345
| Quote:
__________________ C + C++ Compiler: MinGW port of GCC Build + Version Control System: SCons + Bazaar Look up a C/C++ Reference and learn How To Ask Questions The Smart Way | |
| laserlight is online now | |
| | #5 |
| Registered User Join Date: Nov 2009
Posts: 77
| Ooooh. IIII seee. Sorry for misunderstanding. I was hoping that the impact of accessing that many array elements would be insignificant, because... no matter what I do I have to go through NxN operations, I thought it was that fact and not array access itself that was causing the performance impact. I'm actually surprised, if I'm using a pointer to look at a specific place in memory, why would that be less efficient than having to go through a series of operations to capture cos/sin values? It's not like I'm traversing an array to find a value per operation, the array value I'm referencing is literally *defined for every value at NxN. Is there something actually different about accessing a variable which is an array member versus variable which is not within the subset of an array (other then the fact that you have to use an index)? Time to bust out the ASM and learn some more, because I would be extremely disappointed to discover that a CPU can't be told to look straight at any given array index and for some reason have that cost a crap ton of CPU cycles. How does a CPU look at any other type of variable data without killing the cpu, because I'm using a lot of other variables in the same operation. >:\ Last edited by since; 11-20-2009 at 02:40 PM. |
| since is offline | |
| | #6 | |
| C++ Witch Join Date: Oct 2003 Location: Singapore
Posts: 11,345
| Quote:
__________________ C + C++ Compiler: MinGW port of GCC Build + Version Control System: SCons + Bazaar Look up a C/C++ Reference and learn How To Ask Questions The Smart Way | |
| laserlight is online now | |
| | #7 |
| Registered User Join Date: Nov 2009
Posts: 77
| From what I can tell reading around I'm going to have to port this over to Visual C++ to have access to really good profiling and analysis. Looks like I'm stuck with that since it doesn't seem there's any easy tests to identify the exact reason for this behavior. Caching sounds like a good culprit. Well, I guess I have two options, I can learn how to use profiling to troubleshoot this annoyance, and I could just move fft's on to the GPU instead of fighting the CPU for no better than 50% gain even if this did work like I imagined it might. Bleh, I'll try to figure out why this is the way it is, and post back for who ever else that might ever run in to this and search the 'ol google, then just try this all out with shaders. Where are the quantum computers at? I demand better. >:[ Last edited by since; 11-20-2009 at 02:50 PM. |
| since is offline | |
| | #8 |
| C++ Witch Join Date: Oct 2003 Location: Singapore
Posts: 11,345
| What compiler are you currently using? I believe that the profiling tools for Visual C++ are only available with the top end Team System (but if you have that, then great ).
__________________ C + C++ Compiler: MinGW port of GCC Build + Version Control System: SCons + Bazaar Look up a C/C++ Reference and learn How To Ask Questions The Smart Way |
| laserlight is online now | |
| | #9 |
| Registered User Join Date: Nov 2009
Posts: 77
| I'm using GCC/G++ etc... the free stuff with Dev-C++. I'd like to make it all work using just what I have available through opensource projects on principle for this. (edit: I need to really think through my posts before I post, I go through like five edits within the first 10 seconds of all my posts. wow!) Last edited by since; 11-20-2009 at 02:58 PM. |
| since is offline | |
| | #10 |
| 3735928559 Join Date: Mar 2008
Posts: 693
| if cache coherence is a problem, you could preload the cache. also don't forget GPU readback doesn't come for free. though i hear that can be much improved with CUDA |
| m37h0d is offline | |
| | #11 | |
| C++ Witch Join Date: Oct 2003 Location: Singapore
Posts: 11,345
| Quote:
__________________ C + C++ Compiler: MinGW port of GCC Build + Version Control System: SCons + Bazaar Look up a C/C++ Reference and learn How To Ask Questions The Smart Way | |
| laserlight is online now | |
| | #12 |
| Registered User Join Date: Mar 2009
Posts: 44
| If you're saying there's an array entry for each NxN operation, and N=128, then you're definitely killing L1 D-cache hit rates. That's 64K of data assuming only one value with no vector class overhead, and it looks like you've got two, plus other source and dest arrays which is way larger than the d-cache on modern processors. So the biggest delay is likely the time it's taking to service an L1 miss (low tens of cycles?). There's also probably a bit of overhead in doing all of the pointer arithmetic, but that's probably able to be hidden fairly well by the compiler and is only a secondary concern. Can you try cutting down the value of N and see if the calculation vs table performance difference changes? That would give you a hint that you're limited by cache size if the table lookups suddenly get better when N=16 or 32 - at that point they should fit in L1 if I'm doing the math correctly and your 10-20 cycle L2 access time drops to 3-4 cycles for an L1 hit. |
| KCfromNC is offline | |
| | #13 |
| Registered User Join Date: Nov 2009
Posts: 77
| KCfromNC, You might have got it, I had to go all the way down to N=16 before the table lookups won out, that was doing just quick 15 second tests at N=64, 32, 16. At N=32 the CPU hit was about even between trig/tables. Well, that sucks. This is just the base results that I have to work off of with a series of other stochastic processes, not going to work. Figures. Alright, I might have to go an entirely different route, with hw accelleration. I need N=128 at least, I might be able to make this work with N=96 and just have it be a little 'sloppy', but with hw acceleration I might be able to get N=256 with relative ease, we'll see. Thanks for your suggestions. I'm going to look at gprof anyway just so I can use it for other projects possibly. Last edited by since; 11-20-2009 at 03:11 PM. |
| since is offline | |
| | #14 |
| Senior software engineer Join Date: Mar 2007 Location: Portland, OR
Posts: 5,768
| Apart from any cache-related issues, the naive DFT is slower than the FFT by a factor of N / log N (i.e., it's HUGELY slower). Have you considered implementing FFT instead? (I'm assuming you are implementing DFT based on the scant code you've posted so far, but I could be wrong)
__________________ "Congratulations on your purchase. To begin using your quantum computer, set the power switch to both off and on simultaneously." -- raftpeople@slashdot |
| brewbuck is offline | |
| | #15 |
| Registered User Join Date: Mar 2009
Posts: 44
| Any chance you use the tables more than one in the loop? If you could rewrite it like this Code: float f0;
for (int k = 0; k < N/4; ++k)
for (int j = 0; j < N; ++j)
{
f0 = fA[k] * fB[j] * (*TSIN)[k][j];
// ...
}
for (int k = N/4; k < 2*N/4; ++k)
for (int j = 0; j < N; ++j)
{
f0 = fA[k] * fB[j] * (*TSIN)[k][j];
// ...
}
for (int k = 2*N/4; k < 3*N/4; ++k)
for (int j = 0; j < N; ++j)
{
f0 = fA[k] * fB[j] * (*TSIN)[k][j];
// ...
}
for (int k = 3*N/4; k < N; ++k)
for (int j = 0; j < N; ++j)
{
f0 = fA[k] * fB[j] * (*TSIN)[k][j];
// ...
}
Also, here's a strange idea. Make the one or both of the vectors' dimensions sized at N+1. You won't use the extra entry, but it will move the data around a bit in the cache and maybe you'll have fewer conflicts. Still won't help as N gets larger, but in that case you're going to start missing in L2 as well and the penalty will get even larger. |
| KCfromNC is offline | |
![]() |
| Thread Tools | |
| Display Modes | |
|
Similar Threads | ||||
| Thread | Thread Starter | Forum | Replies | Last Post |
| Pthread CPU Affinity | fguy817817 | C Programming | 5 | 11-19-2009 12:39 PM |
| questions on multiple thread programming | lehe | C Programming | 11 | 03-27-2009 07:44 AM |
| Upgrading my old CPU (for another old one!) | foxman | Tech Board | 16 | 01-11-2008 05:41 PM |
| Can you still view the bios screen with a bad CPU? | HyperCreep | Tech Board | 4 | 12-31-2006 06:57 PM |
| CPU temp | PING | Tech Board | 5 | 01-28-2006 06:25 AM |