CPU increase with precomputed tables?

This is a discussion on CPU increase with precomputed tables? within the C++ Programming forums, part of the General Programming Boards category; (edit: I'm done editing now, I promise. Maybe.) The problem is that using precomputed tables results in higher CPU usage ...

  1. #1
    Registered User
    Join Date
    Nov 2009
    Posts
    82

    CPU increase with precomputed tables?

    (edit: I'm done editing now, I promise. Maybe.)

    The problem is that using precomputed tables results in higher CPU usage and that's confusing me. The routine which creates the table is in fact occurring outside of the applications render loop. It is NOT being recomputed every iteration.

    So here are the details;

    I've designed a class which executes some numerical algorithms and I've added some methods which allow user to specify the location to some precomputed tables to speed things up.

    Code:
    vector< vector< float > > *TCOS, *TSIN;
    //...
    void ALG::UseTrigTables(vector< vector< float > > *fsin, vector< vector< float > > *fcos)
    {
    	TCOS = fcos;
    	TSIN = fsin;
    }
    After the object is created I will attach the tables like so;

    Code:
    vector< vector< float > >
     tcos(N, vector< float > (N)),
     tsin(N, vector< float > (N));
    
    // [insert loop here to fill vector-vectors with data]... etc...
    
    Blah.UseTrigTables(&fsin, &fcos)
    Within the class the tables are used like so;

    Code:
    float f0;
    for (int k = 0; k < N; ++k)
    for (int j = 0; j < N; ++j)
    {
    	f0 = fA[k] * fB[j] * (*TSIN)[k][j];
    	// ...
    }
    Why would that increase CPU usage?

    With trig: ~8%
    With tables: ~10%

    Where N = 128 (size of tables/loop)

    My first guess;

    While I am passing a pointer, maybe I'm missing something critical and the tables are actually being copied to a new location in memory every time i call them??

    If not, what exact tools do you recommend I use to profile this applications behavior?
    Last edited by since; 11-20-2009 at 12:55 PM.

  2. #2
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    21,711
    Perhaps in your test program you only access the tables once, thus unnecessary computations are not eliminated since they never happen in the first place.
    C + C++ Compiler: MinGW port of GCC
    Version Control System: Bazaar

    Look up a C++ Reference and learn How To Ask Questions The Smart Way

  3. #3
    Registered User
    Join Date
    Nov 2009
    Posts
    82
    Accessing tables definitely occurs every iteration, I don't want to get specific because people get carried away on forums, but for arguments sake I'm talking about certain fourier transforms.

    So the access must occur to get results for each frame that I am processing, another thing to note is that I've removed all math and get a very clear decrease in CPU usage. So I have a baseline that says clearly no math = less CPU intense, which is pretty obvious, but more importantly supports the notion that extra work is happening when both trig and the tables are in place.

    Interestingly enough, the amount of 'more' work being done by the lookup-tables is higher than just using trig functions.

    Any other ideas? This is going to drive me mad, Lol. I'll have to spend some time looking at what the good open source profiling tools there are, unless you have any recommendations?

    ... I have this feeling that passing vector pointers isn't doing what I'm expecting, I hope I'm wrong, trying to get 2d arrays to pass around without vectors has lead me to some painful attempts with little success. Maybe I'll try passing old school arrays wrapped in a class...
    Last edited by since; 11-20-2009 at 01:14 PM.

  4. #4
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    21,711
    Quote Originally Posted by since
    Accessing tables definitely occurs every iteration
    What I mean is that the access of the entire table happens more than once. If the program accesses each element of the table exactly once, then it would not have done less work than computing each element of the table on the fly, assuming that the values are independent (otherwise you would presumably be using memoisation).
    C + C++ Compiler: MinGW port of GCC
    Version Control System: Bazaar

    Look up a C++ Reference and learn How To Ask Questions The Smart Way

  5. #5
    Registered User
    Join Date
    Nov 2009
    Posts
    82
    Ooooh. IIII seee. Sorry for misunderstanding. I was hoping that the impact of accessing that many array elements would be insignificant, because... no matter what I do I have to go through NxN operations, I thought it was that fact and not array access itself that was causing the performance impact. I'm actually surprised, if I'm using a pointer to look at a specific place in memory, why would that be less efficient than having to go through a series of operations to capture cos/sin values?

    It's not like I'm traversing an array to find a value per operation, the array value I'm referencing is literally *defined for every value at NxN.

    Is there something actually different about accessing a variable which is an array member versus variable which is not within the subset of an array (other then the fact that you have to use an index)?

    Time to bust out the ASM and learn some more, because I would be extremely disappointed to discover that a CPU can't be told to look straight at any given array index and for some reason have that cost a crap ton of CPU cycles. How does a CPU look at any other type of variable data without killing the cpu, because I'm using a lot of other variables in the same operation. >:\
    Last edited by since; 11-20-2009 at 01:40 PM.

  6. #6
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    21,711
    Quote Originally Posted by since
    I'm actually surprised, if I'm using a pointer to look at a specific place in memory, why would that be less efficient than having to go through a series of operations to capture cos/sin values?
    There is a slight cost in dereferencing the pointer, but perhaps the main cost is that of the array itself, e.g., having to load the array into cache if it is too large (though I am not sure what impact this has on CPU usage). Another cost would be the cost of traversing the array during pre-computation (e.g., incrementing array indices).
    C + C++ Compiler: MinGW port of GCC
    Version Control System: Bazaar

    Look up a C++ Reference and learn How To Ask Questions The Smart Way

  7. #7
    Registered User
    Join Date
    Nov 2009
    Posts
    82
    From what I can tell reading around I'm going to have to port this over to Visual C++ to have access to really good profiling and analysis. Looks like I'm stuck with that since it doesn't seem there's any easy tests to identify the exact reason for this behavior. Caching sounds like a good culprit.

    Well, I guess I have two options, I can learn how to use profiling to troubleshoot this annoyance, and I could just move fft's on to the GPU instead of fighting the CPU for no better than 50% gain even if this did work like I imagined it might.

    Bleh, I'll try to figure out why this is the way it is, and post back for who ever else that might ever run in to this and search the 'ol google, then just try this all out with shaders.

    Where are the quantum computers at? I demand better. >:[
    Last edited by since; 11-20-2009 at 01:50 PM.

  8. #8
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    21,711
    What compiler are you currently using? I believe that the profiling tools for Visual C++ are only available with the top end Team System (but if you have that, then great ).
    C + C++ Compiler: MinGW port of GCC
    Version Control System: Bazaar

    Look up a C++ Reference and learn How To Ask Questions The Smart Way

  9. #9
    Registered User
    Join Date
    Nov 2009
    Posts
    82
    I'm using GCC/G++ etc... the free stuff with Dev-C++.

    I'd like to make it all work using just what I have available through opensource projects on principle for this.


    (edit: I need to really think through my posts before I post, I go through like five edits within the first 10 seconds of all my posts. wow!)
    Last edited by since; 11-20-2009 at 01:58 PM.

  10. #10
    3735928559
    Join Date
    Mar 2008
    Location
    RTP
    Posts
    839
    if cache coherence is a problem, you could preload the cache.

    also don't forget GPU readback doesn't come for free. though i hear that can be much improved with CUDA

  11. #11
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    21,711
    Quote Originally Posted by since
    I'm using GCC/G++ etc...
    In that case you could use gprof for profiling.
    C + C++ Compiler: MinGW port of GCC
    Version Control System: Bazaar

    Look up a C++ Reference and learn How To Ask Questions The Smart Way

  12. #12
    Registered User
    Join Date
    Mar 2009
    Posts
    344
    If you're saying there's an array entry for each NxN operation, and N=128, then you're definitely killing L1 D-cache hit rates. That's 64K of data assuming only one value with no vector class overhead, and it looks like you've got two, plus other source and dest arrays which is way larger than the d-cache on modern processors.

    So the biggest delay is likely the time it's taking to service an L1 miss (low tens of cycles?). There's also probably a bit of overhead in doing all of the pointer arithmetic, but that's probably able to be hidden fairly well by the compiler and is only a secondary concern.

    Can you try cutting down the value of N and see if the calculation vs table performance difference changes? That would give you a hint that you're limited by cache size if the table lookups suddenly get better when N=16 or 32 - at that point they should fit in L1 if I'm doing the math correctly and your 10-20 cycle L2 access time drops to 3-4 cycles for an L1 hit.

  13. #13
    Registered User
    Join Date
    Nov 2009
    Posts
    82
    KCfromNC,

    You might have got it, I had to go all the way down to N=16 before the table lookups won out, that was doing just quick 15 second tests at N=64, 32, 16. At N=32 the CPU hit was about even between trig/tables. Well, that sucks.

    This is just the base results that I have to work off of with a series of other stochastic processes, not going to work. Figures.

    Alright, I might have to go an entirely different route, with hw accelleration. I need N=128 at least, I might be able to make this work with N=96 and just have it be a little 'sloppy', but with hw acceleration I might be able to get N=256 with relative ease, we'll see.

    Thanks for your suggestions. I'm going to look at gprof anyway just so I can use it for other projects possibly.
    Last edited by since; 11-20-2009 at 02:11 PM.

  14. #14
    Captain Crash brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,239
    Apart from any cache-related issues, the naive DFT is slower than the FFT by a factor of N / log N (i.e., it's HUGELY slower). Have you considered implementing FFT instead?

    (I'm assuming you are implementing DFT based on the scant code you've posted so far, but I could be wrong)
    Code:
    //try
    //{
    	if (a) do { f( b); } while(1);
    	else   do { f(!b); } while(1);
    //}

  15. #15
    Registered User
    Join Date
    Mar 2009
    Posts
    344
    Any chance you use the tables more than one in the loop? If you could rewrite it like this

    Code:
    float f0;
    for (int k = 0; k < N/4; ++k)
    for (int j = 0; j < N; ++j)
    {
    	f0 = fA[k] * fB[j] * (*TSIN)[k][j];
    	// ...
    }
    for (int k = N/4; k < 2*N/4; ++k)
    for (int j = 0; j < N; ++j)
    {
    	f0 = fA[k] * fB[j] * (*TSIN)[k][j];
    	// ...
    }
    for (int k = 2*N/4; k < 3*N/4; ++k)
    for (int j = 0; j < N; ++j)
    {
    	f0 = fA[k] * fB[j] * (*TSIN)[k][j];
    	// ...
    }
    for (int k = 3*N/4; k < N; ++k)
    for (int j = 0; j < N; ++j)
    {
    	f0 = fA[k] * fB[j] * (*TSIN)[k][j];
    	// ...
    }
    And use TSIN[k][j] more than once per loop, you'd get some benefit. Of course you'd have to play with how much to divide the loop up, but hopefully you get the idea.

    Also, here's a strange idea. Make the one or both of the vectors' dimensions sized at N+1. You won't use the extra entry, but it will move the data around a bit in the cache and maybe you'll have fewer conflicts. Still won't help as N gets larger, but in that case you're going to start missing in L2 as well and the penalty will get even larger.

Page 1 of 2 12 LastLast
Popular pages Recent additions subscribe to a feed

Similar Threads

  1. Pthread CPU Affinity
    By fguy817817 in forum C Programming
    Replies: 5
    Last Post: 11-19-2009, 11:39 AM
  2. questions on multiple thread programming
    By lehe in forum C Programming
    Replies: 11
    Last Post: 03-27-2009, 07:44 AM
  3. Upgrading my old CPU (for another old one!)
    By foxman in forum Tech Board
    Replies: 16
    Last Post: 01-11-2008, 04:41 PM
  4. Can you still view the bios screen with a bad CPU?
    By HyperCreep in forum Tech Board
    Replies: 4
    Last Post: 12-31-2006, 05:57 PM
  5. CPU temp
    By PING in forum Tech Board
    Replies: 5
    Last Post: 01-28-2006, 05:25 AM

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21