![]() |
| | #1 |
| Registered User Join Date: Jan 2009
Posts: 156
| How to improve time performance of these operations I am using gprof and find that these operations in another loop takes up most of the time: Code: ... //_xi is precomputed
for(each content1) for(each cont2){ //pseudo code, where content1 and content2 are both 1D int arrays of size width * height with each element taking value from 0 to 255.
for(k = 0; k < width * height ; k++) { //19.39% time. where width and height are both int
if(content1[k] == 0) l += _xi[content2[k]]; // 69.89% time, where _xi is a 1D double array of size 256
}
}
As to the next line taking 69.89% time, is the checking "if(content1[k] == 0)" or addition " l += _xi[content2[k]];" or both takes more time? How can I improve the time performance in this code snippet? Thanks in advance! |
| lehe is offline | |
| | #2 |
| Registered User Join Date: Jan 2009
Posts: 156
| Sorry, in my first line by "another loop", I mean the loop Code: for(each content1) for(each cont2){
...
}
Thanks! Last edited by lehe; 03-27-2009 at 07:17 AM. |
| lehe is offline | |
| | #3 |
| Kernel hacker Join Date: Jul 2007 Location: Farncombe, Surrey, England
Posts: 15,686
| I don't see any real good way to improve performance. Obviously, (manually?) unrolling the loop may help a bit to perform more operations vs. number of loop steps. But the big time-factor is probably to do with indexing into the array _xi[] and content2[] and waiting for that data to be ready to use. If you are using something like Intel VTune, AMD CodeAnalyst or oprofile, then you could try to analyze which instruction WITHIN the loop is affecting performance the most. But my guess is that you just have a long dependency chain, and the processor simply can't perform more operations in parallel. -- Mats
__________________ Compilers can produce warnings - make the compiler programmers happy: Use them! Please don't PM me for help - and no, I don't do help over instant messengers. |
| matsp is offline | |
| | #4 |
| CSharpener Join Date: Oct 2006
Posts: 5,242
| I would think the problem is the branchmisprediction in the if if you can rearrange your code in such way - that if will be outside the internal loop something like Code: for each content1
for(k=0;k<256;k++)
if(content1[k] == 0)
for each content2
l += x[content2[k]]
VTune would show this type of problems
__________________ If I have eight hours for cutting wood, I spend six sharpening my axe. |
| vart is offline | |
| | #5 |
| Registered User Join Date: Jan 2009
Posts: 156
| Thanks vart and matsp! vart, your suggestion helps to reduce half of the time! Now the time cost is distributed as Code: ... //_xi is precomputed
size = width * height;
for each content1
for(k = 0; k < size ; k++) { //7.64% time
if(content1[k] == 0) // 8.73% time
for each content2 // 8.75% time
tmp = _x[content2[k]] // 18.90% time
l += tmp // 28.53% time
|
| lehe is offline | |
| | #6 |
| CSharpener Join Date: Oct 2006
Posts: 5,242
| next step would be to play with type of content array members making it int to decrease access time or making it char (if possible) to decrease cashe size (avoiding cache misses)
__________________ If I have eight hours for cutting wood, I spend six sharpening my axe. |
| vart is offline | |
| | #7 |
| Registered User Join Date: Jan 2009
Posts: 156
| Since mentioned by matsp and vart, I was wondering are Intel VTune, AMD CodeAnalyst and oprofile freely available? I only found oprofile using my Ubuntu Synaptic. Also my cpu is AMD Turion64. Does it mean I can only use AMD CodeAnalyst and oprofile but not Intel VTune? What does it mean that "making it int to decrease access time or making it char (if possible) to decrease cashe size (avoiding cache misses)"? Is it that int requires least time to access and char requires least space? Thanks! Last edited by lehe; 03-28-2009 at 06:48 PM. |
| lehe is offline | |
| | #8 |
| Kernel hacker Join Date: Jul 2007 Location: Farncombe, Surrey, England
Posts: 15,686
| If you can find a really ancient version of VTune, you'd be able to use that. Modern versions check the type of the CPU and refuse to run on non-Intel processors. So, yes, if you are in Windows, CodeAnalyst is your only choice. In linux, you would be running oprofile - CodeAnalyst is available in Linux and uses oprofile for the actual profiling functionality. There is a product for Windows called oprofile, but it's something completely different altogether. -- Mats
__________________ Compilers can produce warnings - make the compiler programmers happy: Use them! Please don't PM me for help - and no, I don't do help over instant messengers. |
| matsp is offline | |
| | #9 |
| Algorithm Dissector Join Date: Dec 2005 Location: New Zealand
Posts: 2,476
| You post pseudocode to optimise at the algorithmic level, for lower level optimisation you need to post real code. I don't think there's enough pseudocode shown here to perform any more algorithmic optimisation and low-level optimisation now requires real code. However, what type is _x and how large is that array? What are typical values of size, height and width?
__________________ My homepage Advice: Take only as directed - If symptoms persist, please see your debugger |
| iMalc is offline | |
| | #10 | |
| CSharpener Join Date: Oct 2006
Posts: 5,242
| Quote:
so you (or computer) could put 4 times more elements of the char array into the same cashe space when int array... on the other hand - acessing int is one oparation... acessing char could involve additional oparations - like move int into regoster, masking and shifting... PS. VTune is for Intel compilers only, and there is free for non-comercial use linux version of it.
__________________ If I have eight hours for cutting wood, I spend six sharpening my axe. | |
| vart is offline | |
| | #11 | |
| Registered User Join Date: Jan 2009
Posts: 156
| Thanks, mstsp, iMalc and vart! I have some questions regarding gprof and oprofile: 1. Is oprofile better than gprof in terms of helping optimizing code time performance? Like checking branch misprediction, etc; 2. Why does oprofile require root while gprof doesn't? I am trying to use oprofile on a server, however I don't have access to root, so here is what I get: Quote:
Would it help if I am able to install my oprofile under my $HOME? iMalc: It's C++ code not completely C. Code: Vignette_size = Vignette::width * Vignette::height ; //24 * 24. Vignette is a class.
for(every possible tmplt){// pseduo-code, since the real code is too long. tmplt is of type Vignette*
generate the tmplt // pseduo-code, since the real code is too long
for(int k = 0; k < Vignette_size ; k++) {
if(tmplt->content[k] == 0){ // tmplt->content is a 1D int array of 24*24
for(int i = 0; i < nb_vignettes; i++){
double tmp = _xi[vignettes[i].content[k]]; // _xi is a 1D double array of size 256. vignettes is of type Vignette*
log_prob_img_t[i] += tmp;}
}
}
}
| |
| lehe is offline | |
| | #12 | |
| Algorithm Dissector Join Date: Dec 2005 Location: New Zealand
Posts: 2,476
| Quote:
You're also missing all optimisation opportunities with not showing "generate the tmplt". But knowing how big Vignette_size, nb_vignettes are and how many tmplts there typically are would tell us whether it's worth looking at optimising that part more too.
__________________ My homepage Advice: Take only as directed - If symptoms persist, please see your debugger | |
| iMalc is offline | |
![]() |
| Thread Tools | |
| Display Modes | |
|
Similar Threads | ||||
| Thread | Thread Starter | Forum | Replies | Last Post |
| How to get current time | tsubasa | C Programming | 3 | 05-01-2009 02:03 AM |
| benchmarking? | cs32 | C Programming | 5 | 02-14-2008 07:37 AM |
| any ideas to improve search performance in a tree structure | George2 | C Programming | 6 | 06-09-2006 12:44 AM |
| What is the best way to record a process execution time? | hanash | Linux Programming | 7 | 03-15-2006 07:17 AM |
| relating date.... | Prakash | C Programming | 3 | 09-19-2001 09:08 AM |