Thread: Profiling and benchmarking

  1. #1
Ugly C Lover audinue
    Join Date
    Jun 2008
    Location
    Indonesia
    Posts
    489

    Profiling and benchmarking

I just tried to learn about the image convolution process.
Not the math, but from a programmer's perspective.
I've learned only the basic concept.

Then I found that this algorithm is very, very slow for large images; it took about <1 second.
This has had me screaming all day.

    Now I need to do some optimizations, including the use of linear pointer arithmetic.

So, do you know what I need to compare to decide which algorithm is the right one to use (including algorithm specialization), before cheating with multiple threads?

    Currently I have:

1. Mean (Average): The test is run N times, then the sum of the elapsed times of all runs is divided by N.
    2. Fastest: The fastest time required to do the process.
    3. Slowest: The slowest time required to do the process.

    Anything else?
Maybe standard deviation or variance (I don't know what they are useful for, though)?
How many times do I need to run the process to get an accurate result?
I currently use between 100 and 1000.
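Something like this is the minimal timing harness I have in mind (just a sketch: the convolve() stub here is a hypothetical stand-in for the real routine, and the timing uses plain clock() from <time.h>):
Code:
#include <stdio.h>
#include <math.h>
#include <time.h>

/* Hypothetical stand-in for the routine being benchmarked. */
static void convolve(void)
{
    volatile double x = 0.0;
    for (long i = 0; i < 1000000L; ++i) x += i * 0.5;
}

int main(void)
{
    enum { RUNS = 100 };
    double t[RUNS], sum = 0.0;

    for (int i = 0; i < RUNS; ++i) {
        clock_t start = clock();
        convolve();
        t[i] = (double)(clock() - start) * 1000.0 / CLOCKS_PER_SEC; /* ms */
        sum += t[i];
    }

    double mean = sum / RUNS, var = 0.0, best = t[0], worst = t[0];
    for (int i = 0; i < RUNS; ++i) {
        if (t[i] < best)  best  = t[i];
        if (t[i] > worst) worst = t[i];
        var += (t[i] - mean) * (t[i] - mean);
    }
    var /= RUNS;

    printf("mean %.3f ms  fastest %.3f ms  slowest %.3f ms  stddev %.3f ms\n",
           mean, best, worst, sqrt(var));
    return 0;
}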

    Thanks in advance.

    EDIT:
I tend to use portable code, so there is no ASM.
    Last edited by audinue; 09-01-2009 at 02:25 AM.
    Just GET it OFF out my mind!!

  2. #2
(?<!re)tired Mario F.
    Join Date
    May 2006
    Location
    Ireland
    Posts
    8,446
Quote Originally Posted by audinue
Then I found that this algorithm is very, very slow for large images; it took about <1 second.
    Oh, the horror. :/

Quote Originally Posted by audinue
Now I need to do some optimizations, including the use of linear pointer arithmetic.
ermm... what's non-linear pointer arithmetic?
Not trying to be funny. I've just never heard of it.

Quote Originally Posted by audinue
Currently I have:

1. Mean (Average): The test is run N times, then the sum of the elapsed times of all runs is divided by N.
    2. Fastest: The fastest time required to do the process.
    3. Slowest: The slowest time required to do the process.

    That should be it.
Unless performance is your one and only concern, you should then compare the results against other variables, like memory requirements or code clarity, in order to make a decision.


Quote Originally Posted by audinue
Anything else?
I could say that less than one second to process a large image (how large, exactly?) is really not a problem, unless this is meant to be used in batch processing, where indeed every microsecond counts. You should, however, run your code on other systems to get a feel for how processor performance will affect the code's performance. This is the best way to determine when you should stop optimizing. My guess: about right now.
    Originally Posted by brewbuck:
    Reimplementing a large system in another language to get a 25% performance boost is nonsense. It would be cheaper to just get a computer which is 25% faster.

  3. #3
    Registered User
    Join Date
    Dec 2006
    Location
    Canada
    Posts
    3,229
    How large is the image?

Are you sure your code is the bottleneck and not the hard drive?

  4. #4
Officially An Architect brewbuck
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    How large is the image and the convolution kernel?
    Code:
    //try
    //{
    	if (a) do { f( b); } while(1);
    	else   do { f(!b); } while(1);
    //}

  5. #5
Officially An Architect brewbuck
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    The best way to speed up convolution is to express it as a frequency multiplication using FFTs. This generates enormous speedups, faster than anything you could ever achieve by writing optimized convolution algorithms.

    Explicit convolution is normally only used for small convolution kernels, where the constant factors involved in FFT convolution outweigh the quadratic cost of explicit convolution. How fast your convolver operates is heavily dependent on two factors:

    1) Memory access pattern. Convolution is primarily memory bound. The multiplications and additions take far less time than actually loading the data into the processor. A poor memory indexing pattern will have a huge impact on the convolver even if the basic calculations are the same. And unless the kernel and the image can both fit into cache simultaneously, you can never go faster than the speed dictated by your memory bus.

    2) Handling of boundary conditions. There are four common ways the boundary is handled. It can be row/column replicated from the nearest valid row/column. It could be cyclic (modular). It could be treated as an implicit zero. Or the convolution kernel itself can be adjusted to only cover "valid" area of the image.

    Generally you do not want to introduce boundary handling logic into your inner loops. That means you need to process the boundary regions separately from the interior.
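A rough sketch of what that separation can look like for a 3x3 kernel (src, dst, w, h and k are assumed names, and the border is simply replicated from the nearest interior result; this is the shape of the idea, not drop-in code):
Code:
/* 3x3 convolution with all boundary handling kept out of the hot loops.
   Single-channel float image, w >= 3 and h >= 3 assumed. */
static void convolve3x3(const float *src, float *dst, int w, int h,
                        const float k[3][3])
{
    /* Interior: no boundary tests inside the inner loops. */
    for (int y = 1; y < h - 1; ++y)
        for (int x = 1; x < w - 1; ++x) {
            float acc = 0.0f;
            for (int i = -1; i <= 1; ++i)
                for (int j = -1; j <= 1; ++j)
                    acc += k[i + 1][j + 1] * src[(y + i) * w + (x + j)];
            dst[y * w + x] = acc;
        }

    /* Border: a separate pass, here replicating the nearest interior
       output (one of the four strategies listed above). */
    for (int y = 1; y < h - 1; ++y) {            /* left and right columns */
        dst[y * w]         = dst[y * w + 1];
        dst[y * w + w - 1] = dst[y * w + w - 2];
    }
    for (int x = 0; x < w; ++x) {                /* top and bottom rows */
        dst[x]               = dst[w + x];
        dst[(h - 1) * w + x] = dst[(h - 2) * w + x];
    }
}
Walking src in row-major order like this also gives the friendly memory access pattern described in point 1.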
    Code:
    //try
    //{
    	if (a) do { f( b); } while(1);
    	else   do { f(!b); } while(1);
    //}

  6. #6
Guest Sebastiani
    Join Date
    Aug 2001
    Location
    Waterloo, Texas
    Posts
    5,708
Quote Originally Posted by brewbuck
    The best way to speed up convolution is to express it as a frequency multiplication using FFTs. This generates enormous speedups, faster than anything you could ever achieve by writing optimized convolution algorithms.

    Explicit convolution is normally only used for small convolution kernels, where the constant factors involved in FFT convolution outweigh the quadratic cost of explicit convolution. How fast your convolver operates is heavily dependent on two factors:

    1) Memory access pattern. Convolution is primarily memory bound. The multiplications and additions take far less time than actually loading the data into the processor. A poor memory indexing pattern will have a huge impact on the convolver even if the basic calculations are the same. And unless the kernel and the image can both fit into cache simultaneously, you can never go faster than the speed dictated by your memory bus.

    2) Handling of boundary conditions. There are four common ways the boundary is handled. It can be row/column replicated from the nearest valid row/column. It could be cyclic (modular). It could be treated as an implicit zero. Or the convolution kernel itself can be adjusted to only cover "valid" area of the image.

    Generally you do not want to introduce boundary handling logic into your inner loops. That means you need to process the boundary regions separately from the interior.
    Hmm. Interesting suggestions - sounds like good advice.

  7. #7
Ugly C Lover audinue
    Join Date
    Jun 2008
    Location
    Indonesia
    Posts
    489
I want to make a viewer for both manga (Japanese comics) and comics.

Almost all of them are scanlations, hence they contain a lot of huge images.

It absolutely degrades your mood if the viewer is as slow as a turtle just to sharpen, smooth, or retouch a page.


    My simple sharpen kernel is just 3x3.

    Currently, I'm interpreting the convolution process as linear array processing.

So it's not:
Code:
for (y = 0; y < h; y++)
 for (x = 0; x < w; x++)
  for (i = 0; i < kh; i++)
   for (j = 0; j < kw; j++)
    /* multiply and accumulate */
The high-level math-to-code mapping above consumes <1 second to process a 2K x 2K image.

While my technique is just:
Code:
while (p != end)
 while (q != kernel_end)  /* inner walk over the kernel taps */
So the boundary check is just as simple as:
Code:
if (p >= begin && p < end)
p is a pointer.
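Roughly, a sketch of that pointer walk (with a typical 3x3 sharpen kernel standing in for mine, and src/dst/w/h as assumed names):
Code:
/* Linear pointer walk over a single-channel 8-bit image with a 3x3
   sharpen kernel.  The begin/end test only guards against stepping
   outside the buffer; at the left/right image edges the taps still wrap
   into the neighbouring row, so the 1-pixel border would normally be
   redone in a separate pass. */
static void sharpen3x3(const unsigned char *src, unsigned char *dst,
                       int w, int h)
{
    static const int kern[9] = {  0, -1,  0,
                                 -1,  5, -1,
                                  0, -1,  0 };
    const int off[9] = { -w - 1, -w, -w + 1,   /* tap offsets relative   */
                             -1,  0,      1,   /* to the centre pixel    */
                          w - 1,  w,  w + 1 };
    const unsigned char *begin = src, *end = src + (long)w * h;

    for (const unsigned char *p = begin; p != end; ++p, ++dst) {
        int acc = 0;
        for (int t = 0; t < 9; ++t) {
            const unsigned char *q = p + off[t];
            if (q >= begin && q < end)         /* the simple bounds guard */
                acc += kern[t] * *q;
        }
        *dst = (unsigned char)(acc < 0 ? 0 : acc > 255 ? 255 : acc);
    }
}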

I've already done some optimizing using all the best techniques I know, such as minimizing function calls and minimizing the use of arrays, and it now gets done in only >400 ms.

About the FFT, I thought it was good for large kernels, not small ones.
    Just GET it OFF out my mind!!

  8. #8
    Registered User
    Join Date
    Dec 2006
    Location
    Canada
    Posts
    3,229
Since people usually read manga page by page, isn't it possible to pre-process pages in the background?

  9. #9
Officially An Architect brewbuck
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    So to make sure I understand, your goal is to scale an image from a high resolution to a resolution suitable for displaying on a computer screen. To accomplish this you are applying an anti-aliasing filter (low pass?) to the data before scaling down in order to avoid aliasing.

    This is absolutely the correct thing to do, but to give more specific advice, I would need to know what the starting and ending resolutions are (in dots per inch) and whether there is a significant amount of high frequency signal present.

    Unless high frequencies dominate, it is usually best to do a crude, nearest-neighbor downsample to an intermediate resolution, followed by an anti-alias filter to the final resolution. This way, the convolution is applied to a smaller image, reducing the cost.

    It is true that downsampling will introduce aliased frequencies, but this is only a problem if those frequencies were strong to begin with. For example, a cartoon drawing at 600 DPI could be nearest-neighbor downsampled to 300 DPI, then anti-alias downsampled to 100 DPI for on-screen display.
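A sketch of the first, crude stage for a single-channel image and an integer factor (the names here are assumptions, not code from this thread); the anti-alias filtered resize then only has to run on this much smaller result:
Code:
/* Nearest-neighbour decimation by an integer factor, e.g. factor = 2
   for 600 DPI -> 300 DPI.  Deliberately crude: no filtering at all. */
static void nn_downsample(const unsigned char *src, int sw, int sh,
                          unsigned char *dst, int factor)
{
    int dw = sw / factor, dh = sh / factor;

    for (int y = 0; y < dh; ++y)
        for (int x = 0; x < dw; ++x)
            dst[y * dw + x] = src[(y * factor) * sw + (x * factor)];
}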

(In my day job I implement image processing pipelines.)

    @cyberfish: Pre-computing images during idle time is a perfectly good approach, as long as memory is available for it.

Another possibility is to apply crude downsampling to quickly get an image on the screen, then perform a higher quality downsample and update the display with it. This gets the image onto the screen quickly, then a high-quality version appears just moments afterward. Studies have shown that users perceive the responsiveness to be greater when such an approach is used.
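In rough code, the flow is something like this (the three helpers are empty stubs standing in for the viewer's real resize and blit routines, not functions from any library):
Code:
/* Hypothetical stubs: placeholders for the real resize and blit calls. */
static void crude_downsample(const unsigned char *src, unsigned char *dst)   { (void)src; (void)dst; }
static void quality_downsample(const unsigned char *src, unsigned char *dst) { (void)src; (void)dst; }
static void show_on_screen(const unsigned char *img)                         { (void)img; }

/* Rough image now, nice image a moment later. */
static void display_page(const unsigned char *page, unsigned char *scratch)
{
    crude_downsample(page, scratch);    /* fast nearest-neighbour pass      */
    show_on_screen(scratch);            /* something is visible immediately */

    quality_downsample(page, scratch);  /* slow anti-aliased pass           */
    show_on_screen(scratch);            /* quietly replaces the rough image */
}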
    Last edited by brewbuck; 09-06-2009 at 11:53 PM.
    Code:
    //try
    //{
    	if (a) do { f( b); } while(1);
    	else   do { f(!b); } while(1);
    //}

  10. #10
Ugly C Lover audinue
    Join Date
    Jun 2008
    Location
    Indonesia
    Posts
    489
Nice shot, brewbuck!

@cyberfish: Btw, I already implemented the prefetch mechanism plus the cache manager.
Now I'm trying to implement some kind of continuous page display mode... like most PDF readers do...
It's a little bit confusing, though, whether to use the cache or mmap...
    Just GET it OFF out my mind!!
