Thread: How to improve time performance of these operations

  1. #1
    Registered User
    Join Date
    Jan 2009
    Posts
    159

    How to improve time performance of these operations

    Hi,
    I am using gprof and find that these operations in another loop takes up most of the time:
    Code:
    ... //_xi is precomputed 
    for(each content1) for(each cont2){ //pseudo code, where content1 and content2 are both 1D int arrays of size width * height with each element taking value from 0 to 255.
        for(k = 0; k < width * height ; k++) { //19.39% time. where width and height are both int
          if(content1[k] == 0) l += _xi[content2[k]]; // 69.89% time, where  _xi is a 1D double array of size 256
        }
    }
    I guess the line taking 19.39% time probably could be improved by precompute width and height and store into a new variable and use it as the upper limit on k.
    As to the next line taking 69.89% time, is the checking "if(content1[k] == 0)" or addition " l += _xi[content2[k]];" or both takes more time?
    How can I improve the time performance in this code snippet? Thanks in advance!

  2. #2
    Registered User
    Join Date
    Jan 2009
    Posts
    159
    Sorry, in my first line by "another loop", I mean the loop
    Code:
    for(each content1) for(each cont2){
    ...
    }
    since I wrote it out, so there is no "another loop".
    Thanks!
    Last edited by lehe; 03-27-2009 at 07:17 AM.

  3. #3
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    I don't see any real good way to improve performance. Obviously, (manually?) unrolling the loop may help a bit to perform more operations vs. number of loop steps.

    But the big time-factor is probably to do with indexing into the array _xi[] and content2[] and waiting for that data to be ready to use.

    If you are using something like Intel VTune, AMD CodeAnalyst or oprofile, then you could try to analyze which instruction WITHIN the loop is affecting performance the most. But my guess is that you just have a long dependency chain, and the processor simply can't perform more operations in parallel.

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  4. #4
    Hurry Slowly vart's Avatar
    Join Date
    Oct 2006
    Location
    Rishon LeZion, Israel
    Posts
    6,788
    I would think the problem is the branchmisprediction in the if

    if you can rearrange your code in such way - that if will be outside the internal loop
    something like
    Code:
    for each content1
       for(k=0;k<256;k++)
          if(content1[k] == 0)
             for each content2
                l += x[content2[k]]
    it would decrease number of branches in your code speeding it up...
    VTune would show this type of problems
    All problems in computer science can be solved by another level of indirection,
    except for the problem of too many layers of indirection.
    – David J. Wheeler

  5. #5
    Registered User
    Join Date
    Jan 2009
    Posts
    159
    Thanks vart and matsp!
    vart, your suggestion helps to reduce half of the time! Now the time cost is distributed as
    Code:
    ... //_xi is precomputed 
    size =  width * height;
    for each content1
        for(k = 0; k < size ; k++) { //7.64% time
          if(content1[k] == 0) //  8.73% time
             for each content2 // 8.75% time
                tmp =  _x[content2[k]] // 18.90% time
                l += tmp // 28.53% time
    Re-organize embeded loops could help to reduce unnecessary repeated operations. Thanks!

  6. #6
    Hurry Slowly vart's Avatar
    Join Date
    Oct 2006
    Location
    Rishon LeZion, Israel
    Posts
    6,788
    next step would be to play with type of content array members

    making it int to decrease access time or making it char (if possible) to decrease cashe size (avoiding cache misses)
    All problems in computer science can be solved by another level of indirection,
    except for the problem of too many layers of indirection.
    – David J. Wheeler

  7. #7
    Registered User
    Join Date
    Jan 2009
    Posts
    159
    Since mentioned by matsp and vart, I was wondering are Intel VTune, AMD CodeAnalyst and oprofile freely available? I only found oprofile using my Ubuntu Synaptic.
    Also my cpu is AMD Turion64. Does it mean I can only use AMD CodeAnalyst and oprofile but not Intel VTune?

    What does it mean that "making it int to decrease access time or making it char (if possible) to decrease cashe size (avoiding cache misses)"? Is it that int requires least time to access and char requires least space?

    Thanks!
    Last edited by lehe; 03-28-2009 at 06:48 PM.

  8. #8
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    If you can find a really ancient version of VTune, you'd be able to use that. Modern versions check the type of the CPU and refuse to run on non-Intel processors.

    So, yes, if you are in Windows, CodeAnalyst is your only choice. In linux, you would be running oprofile - CodeAnalyst is available in Linux and uses oprofile for the actual profiling functionality.

    There is a product for Windows called oprofile, but it's something completely different altogether.

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  9. #9
    Algorithm Dissector iMalc's Avatar
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,318
    You post pseudocode to optimise at the algorithmic level, for lower level optimisation you need to post real code.
    I don't think there's enough pseudocode shown here to perform any more algorithmic optimisation and low-level optimisation now requires real code.

    However, what type is _x and how large is that array?
    What are typical values of size, height and width?
    My homepage
    Advice: Take only as directed - If symptoms persist, please see your debugger

    Linus Torvalds: "But it clearly is the only right way. The fact that everybody else does it some other way only means that they are wrong"

  10. #10
    Hurry Slowly vart's Avatar
    Join Date
    Oct 2006
    Location
    Rishon LeZion, Israel
    Posts
    6,788
    Quote Originally Posted by lehe View Post

    What does it mean that "making it int to decrease access time or making it char (if possible) to decrease cashe size (avoiding cache misses)"? Is it that int requires least time to access and char requires least space?

    Thanks!
    yes - char requires less space because int takes 4 or 8 bytes...

    so you (or computer) could put 4 times more elements of the char array into the same cashe space when int array...

    on the other hand - acessing int is one oparation... acessing char could involve additional oparations - like move int into regoster, masking and shifting...

    PS. VTune is for Intel compilers only, and there is free for non-comercial use linux version of it.
    All problems in computer science can be solved by another level of indirection,
    except for the problem of too many layers of indirection.
    – David J. Wheeler

  11. #11
    Registered User
    Join Date
    Jan 2009
    Posts
    159
    Thanks, mstsp, iMalc and vart!

    I have some questions regarding gprof and oprofile:
    1. Is oprofile better than gprof in terms of helping optimizing code time performance? Like checking branch misprediction, etc;
    2. Why does oprofile require root while gprof doesn't?
    I am trying to use oprofile on a server, however I don't have access to root, so here is what I get:
    bash-3.2$ opcontrol --vmlinux=/boot/vmlinux-`uname -r`
    Normal users are limited to '--dump'.
    bash-3.2$ opcontrol --start
    Normal users are limited to '--dump'.
    bash-3.2$ opcontrol --dump
    Unable to complete dump of oprofile data: is the oprofile daemon running?
    bash-3.2$ opreport
    opreport error: No sample file found: try running opcontrol --dump
    or specify a session containing sample files
    It seems to tell me I can use dump only, but actually I cannot even start. Is it no way to use oprofile if I am not root?
    Would it help if I am able to install my oprofile under my $HOME?

    iMalc:
    It's C++ code not completely C.
    Code:
    Vignette_size =   Vignette::width * Vignette::height ; //24 * 24. Vignette is a class.
    for(every possible tmplt){// pseduo-code, since the real code is too long. tmplt is of type Vignette*
       generate the tmplt // pseduo-code, since the real code is too long
       for(int k = 0; k < Vignette_size ; k++) { 
          if(tmplt->content[k] == 0){ // tmplt->content is a 1D int array of 24*24
              for(int i = 0; i < nb_vignettes; i++){
                  double tmp = _xi[vignettes[i].content[k]]; // _xi is a 1D double array of size 256. vignettes  is of type Vignette*
                  log_prob_img_t[i] += tmp;}
          }
        }
    }
    Thanks!

  12. #12
    Algorithm Dissector iMalc's Avatar
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,318
    Quote Originally Posted by lehe View Post
    Code:
    Vignette_size =   Vignette::width * Vignette::height ; //24 * 24. Vignette is a class.
    for(every possible tmplt){// pseduo-code, since the real code is too long. tmplt is of type Vignette*
       generate the tmplt // pseduo-code, since the real code is too long
       for(int k = 0; k < Vignette_size ; k++) { 
          if(tmplt->content[k] == 0){ // tmplt->content is a 1D int array of 24*24
              for(int i = 0; i < nb_vignettes; i++){
                  double tmp = _xi[vignettes[i].content[k]]; // _xi is a 1D double array of size 256. vignettes  is of type Vignette*
                  log_prob_img_t[i] += tmp;}
          }
        }
    }
    What's showing up here on the [vignettes[i].content[k]] ... line is that your access pattern on that line suggests that inverting the order of the for-loops could improve cache coherence. However it also looks likely that it would in some ways be faster this way around. It's a shame you didn't post the real code for the for loop. Doesn't matter if it's a little long. As long as I dont have to scroll my screen it's fine.
    You're also missing all optimisation opportunities with not showing "generate the tmplt". But knowing how big Vignette_size, nb_vignettes are and how many tmplts there typically are would tell us whether it's worth looking at optimising that part more too.
    My homepage
    Advice: Take only as directed - If symptoms persist, please see your debugger

    Linus Torvalds: "But it clearly is the only right way. The fact that everybody else does it some other way only means that they are wrong"

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. How to get current time
    By tsubasa in forum C Programming
    Replies: 3
    Last Post: 05-01-2009, 02:03 AM
  2. benchmarking?
    By cs32 in forum C Programming
    Replies: 5
    Last Post: 02-14-2008, 07:37 AM
  3. Replies: 6
    Last Post: 06-09-2006, 12:44 AM
  4. What is the best way to record a process execution time?
    By hanash in forum Linux Programming
    Replies: 7
    Last Post: 03-15-2006, 07:17 AM
  5. relating date....
    By Prakash in forum C Programming
    Replies: 3
    Last Post: 09-19-2001, 09:08 AM