C Board  

Go Back   C Board > General Programming Boards > C Programming

Reply
 
LinkBack Thread Tools Display Modes
Old 03-27-2009, 06:56 AM   #1
Registered User
 
Join Date: Jan 2009
Posts: 156
How to improve time performance of these operations

Hi,
I am using gprof and find that these operations in another loop takes up most of the time:
Code:
... //_xi is precomputed 
for(each content1) for(each cont2){ //pseudo code, where content1 and content2 are both 1D int arrays of size width * height with each element taking value from 0 to 255.
    for(k = 0; k < width * height ; k++) { //19.39% time. where width and height are both int
      if(content1[k] == 0) l += _xi[content2[k]]; // 69.89% time, where  _xi is a 1D double array of size 256
    }
}
I guess the line taking 19.39% time probably could be improved by precompute width and height and store into a new variable and use it as the upper limit on k.
As to the next line taking 69.89% time, is the checking "if(content1[k] == 0)" or addition " l += _xi[content2[k]];" or both takes more time?
How can I improve the time performance in this code snippet? Thanks in advance!
lehe is offline   Reply With Quote
Old 03-27-2009, 07:02 AM   #2
Registered User
 
Join Date: Jan 2009
Posts: 156
Sorry, in my first line by "another loop", I mean the loop
Code:
for(each content1) for(each cont2){
...
}
since I wrote it out, so there is no "another loop".
Thanks!

Last edited by lehe; 03-27-2009 at 07:17 AM.
lehe is offline   Reply With Quote
Old 03-27-2009, 07:38 AM   #3
Kernel hacker
 
Join Date: Jul 2007
Location: Farncombe, Surrey, England
Posts: 15,686
I don't see any real good way to improve performance. Obviously, (manually?) unrolling the loop may help a bit to perform more operations vs. number of loop steps.

But the big time-factor is probably to do with indexing into the array _xi[] and content2[] and waiting for that data to be ready to use.

If you are using something like Intel VTune, AMD CodeAnalyst or oprofile, then you could try to analyze which instruction WITHIN the loop is affecting performance the most. But my guess is that you just have a long dependency chain, and the processor simply can't perform more operations in parallel.

--
Mats
__________________
Compilers can produce warnings - make the compiler programmers happy: Use them!
Please don't PM me for help - and no, I don't do help over instant messengers.
matsp is offline   Reply With Quote
Old 03-27-2009, 08:04 AM   #4
CSharpener
 
vart's Avatar
 
Join Date: Oct 2006
Posts: 5,242
I would think the problem is the branchmisprediction in the if

if you can rearrange your code in such way - that if will be outside the internal loop
something like
Code:
for each content1
   for(k=0;k<256;k++)
      if(content1[k] == 0)
         for each content2
            l += x[content2[k]]
it would decrease number of branches in your code speeding it up...
VTune would show this type of problems
__________________
If I have eight hours for cutting wood, I spend six sharpening my axe.
vart is offline   Reply With Quote
Old 03-27-2009, 11:52 AM   #5
Registered User
 
Join Date: Jan 2009
Posts: 156
Thanks vart and matsp!
vart, your suggestion helps to reduce half of the time! Now the time cost is distributed as
Code:
... //_xi is precomputed 
size =  width * height;
for each content1
    for(k = 0; k < size ; k++) { //7.64% time
      if(content1[k] == 0) //  8.73% time
         for each content2 // 8.75% time
            tmp =  _x[content2[k]] // 18.90% time
            l += tmp // 28.53% time
Re-organize embeded loops could help to reduce unnecessary repeated operations. Thanks!
lehe is offline   Reply With Quote
Old 03-27-2009, 12:27 PM   #6
CSharpener
 
vart's Avatar
 
Join Date: Oct 2006
Posts: 5,242
next step would be to play with type of content array members

making it int to decrease access time or making it char (if possible) to decrease cashe size (avoiding cache misses)
__________________
If I have eight hours for cutting wood, I spend six sharpening my axe.
vart is offline   Reply With Quote
Old 03-28-2009, 06:44 PM   #7
Registered User
 
Join Date: Jan 2009
Posts: 156
Since mentioned by matsp and vart, I was wondering are Intel VTune, AMD CodeAnalyst and oprofile freely available? I only found oprofile using my Ubuntu Synaptic.
Also my cpu is AMD Turion64. Does it mean I can only use AMD CodeAnalyst and oprofile but not Intel VTune?

What does it mean that "making it int to decrease access time or making it char (if possible) to decrease cashe size (avoiding cache misses)"? Is it that int requires least time to access and char requires least space?

Thanks!

Last edited by lehe; 03-28-2009 at 06:48 PM.
lehe is offline   Reply With Quote
Old 03-28-2009, 06:51 PM   #8
Kernel hacker
 
Join Date: Jul 2007
Location: Farncombe, Surrey, England
Posts: 15,686
If you can find a really ancient version of VTune, you'd be able to use that. Modern versions check the type of the CPU and refuse to run on non-Intel processors.

So, yes, if you are in Windows, CodeAnalyst is your only choice. In linux, you would be running oprofile - CodeAnalyst is available in Linux and uses oprofile for the actual profiling functionality.

There is a product for Windows called oprofile, but it's something completely different altogether.

--
Mats
__________________
Compilers can produce warnings - make the compiler programmers happy: Use them!
Please don't PM me for help - and no, I don't do help over instant messengers.
matsp is offline   Reply With Quote
Old 03-28-2009, 10:28 PM   #9
Algorithm Dissector
 
iMalc's Avatar
 
Join Date: Dec 2005
Location: New Zealand
Posts: 2,476
You post pseudocode to optimise at the algorithmic level, for lower level optimisation you need to post real code.
I don't think there's enough pseudocode shown here to perform any more algorithmic optimisation and low-level optimisation now requires real code.

However, what type is _x and how large is that array?
What are typical values of size, height and width?
__________________
My homepage
Advice: Take only as directed - If symptoms persist, please see your debugger
iMalc is offline   Reply With Quote
Old 03-28-2009, 11:47 PM   #10
CSharpener
 
vart's Avatar
 
Join Date: Oct 2006
Posts: 5,242
Quote:
Originally Posted by lehe View Post

What does it mean that "making it int to decrease access time or making it char (if possible) to decrease cashe size (avoiding cache misses)"? Is it that int requires least time to access and char requires least space?

Thanks!
yes - char requires less space because int takes 4 or 8 bytes...

so you (or computer) could put 4 times more elements of the char array into the same cashe space when int array...

on the other hand - acessing int is one oparation... acessing char could involve additional oparations - like move int into regoster, masking and shifting...

PS. VTune is for Intel compilers only, and there is free for non-comercial use linux version of it.
__________________
If I have eight hours for cutting wood, I spend six sharpening my axe.
vart is offline   Reply With Quote
Old 03-29-2009, 10:09 AM   #11
Registered User
 
Join Date: Jan 2009
Posts: 156
Thanks, mstsp, iMalc and vart!

I have some questions regarding gprof and oprofile:
1. Is oprofile better than gprof in terms of helping optimizing code time performance? Like checking branch misprediction, etc;
2. Why does oprofile require root while gprof doesn't?
I am trying to use oprofile on a server, however I don't have access to root, so here is what I get:
Quote:
bash-3.2$ opcontrol --vmlinux=/boot/vmlinux-`uname -r`
Normal users are limited to '--dump'.
bash-3.2$ opcontrol --start
Normal users are limited to '--dump'.
bash-3.2$ opcontrol --dump
Unable to complete dump of oprofile data: is the oprofile daemon running?
bash-3.2$ opreport
opreport error: No sample file found: try running opcontrol --dump
or specify a session containing sample files
It seems to tell me I can use dump only, but actually I cannot even start. Is it no way to use oprofile if I am not root?
Would it help if I am able to install my oprofile under my $HOME?

iMalc:
It's C++ code not completely C.
Code:
Vignette_size =   Vignette::width * Vignette::height ; //24 * 24. Vignette is a class.
for(every possible tmplt){// pseduo-code, since the real code is too long. tmplt is of type Vignette*
   generate the tmplt // pseduo-code, since the real code is too long
   for(int k = 0; k < Vignette_size ; k++) { 
      if(tmplt->content[k] == 0){ // tmplt->content is a 1D int array of 24*24
          for(int i = 0; i < nb_vignettes; i++){
              double tmp = _xi[vignettes[i].content[k]]; // _xi is a 1D double array of size 256. vignettes  is of type Vignette*
              log_prob_img_t[i] += tmp;}
      }
    }
}
Thanks!
lehe is offline   Reply With Quote
Old 03-29-2009, 12:27 PM   #12
Algorithm Dissector
 
iMalc's Avatar
 
Join Date: Dec 2005
Location: New Zealand
Posts: 2,476
Quote:
Originally Posted by lehe View Post
Code:
Vignette_size =   Vignette::width * Vignette::height ; //24 * 24. Vignette is a class.
for(every possible tmplt){// pseduo-code, since the real code is too long. tmplt is of type Vignette*
   generate the tmplt // pseduo-code, since the real code is too long
   for(int k = 0; k < Vignette_size ; k++) { 
      if(tmplt->content[k] == 0){ // tmplt->content is a 1D int array of 24*24
          for(int i = 0; i < nb_vignettes; i++){
              double tmp = _xi[vignettes[i].content[k]]; // _xi is a 1D double array of size 256. vignettes  is of type Vignette*
              log_prob_img_t[i] += tmp;}
      }
    }
}
What's showing up here on the [vignettes[i].content[k]] ... line is that your access pattern on that line suggests that inverting the order of the for-loops could improve cache coherence. However it also looks likely that it would in some ways be faster this way around. It's a shame you didn't post the real code for the for loop. Doesn't matter if it's a little long. As long as I dont have to scroll my screen it's fine.
You're also missing all optimisation opportunities with not showing "generate the tmplt". But knowing how big Vignette_size, nb_vignettes are and how many tmplts there typically are would tell us whether it's worth looking at optimising that part more too.
__________________
My homepage
Advice: Take only as directed - If symptoms persist, please see your debugger
iMalc is offline   Reply With Quote
Reply

Thread Tools
Display Modes

Forum Jump

Similar Threads
Thread Thread Starter Forum Replies Last Post
How to get current time tsubasa C Programming 3 05-01-2009 02:03 AM
benchmarking? cs32 C Programming 5 02-14-2008 07:37 AM
any ideas to improve search performance in a tree structure George2 C Programming 6 06-09-2006 12:44 AM
What is the best way to record a process execution time? hanash Linux Programming 7 03-15-2006 07:17 AM
relating date.... Prakash C Programming 3 09-19-2001 09:08 AM


All times are GMT -6. The time now is 04:23 AM.


Powered by vBulletin® Version 3.8.1
Copyright ©2000 - 2009, Jelsoft Enterprises Ltd.
Search Engine Optimization by vBSEO 3.3.0 RC2

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22