Thread: Profiling and benchmarking

  1. #1
Ugly C Lover audinue
    Join Date
    Jun 2008
    Location
    Indonesia
    Posts
    489

    Profiling and benchmarking

I just tried to learn about the image convolution process.
Not the math, but from a programmer's perspective.
I've learned only the basic concept.

Then I found that this algorithm is very, very slow for large images; it took about <1 second.
This has had me screaming all day.

    Now I need to do some optimizations, including the use of linear pointer arithmetic.

So, do you know what I need to compare to decide which algorithm is the right one to use (including algorithm specialization), before cheating with multiple threads?

    Currently I have:

1. Mean (Average): The test is run N times, then the sum of the elapsed times of all runs is divided by N.
    2. Fastest: The fastest time required to do the process.
    3. Slowest: The slowest time required to do the process.

    Anything else?
Maybe standard deviation or variance (I don't know what they are useful for, though)?
How many times do I need to run the process to get an accurate result?
I currently use between 100 and 1000.
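Something like this is the minimal timing harness I have in mind (just a sketch: the convolve() stub here is a hypothetical stand-in for the real routine, and the timing uses plain clock() from <time.h>):
Code:
#include <stdio.h>
#include <math.h>
#include <time.h>

/* Hypothetical stand-in for the routine being benchmarked. */
static void convolve(void)
{
    volatile double x = 0.0;
    for (long i = 0; i < 1000000L; ++i) x += i * 0.5;
}

int main(void)
{
    enum { RUNS = 100 };
    double t[RUNS], sum = 0.0;

    for (int i = 0; i < RUNS; ++i) {
        clock_t start = clock();
        convolve();
        t[i] = (double)(clock() - start) * 1000.0 / CLOCKS_PER_SEC; /* ms */
        sum += t[i];
    }

    double mean = sum / RUNS, var = 0.0, best = t[0], worst = t[0];
    for (int i = 0; i < RUNS; ++i) {
        if (t[i] < best)  best  = t[i];
        if (t[i] > worst) worst = t[i];
        var += (t[i] - mean) * (t[i] - mean);
    }
    var /= RUNS;

    printf("mean %.3f ms  fastest %.3f ms  slowest %.3f ms  stddev %.3f ms\n",
           mean, best, worst, sqrt(var));
    return 0;
}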

    Thanks in advance.

    EDIT:
I tend to use portable code, so there is no ASM.
    Last edited by audinue; 09-01-2009 at 02:25 AM.
    Just GET it OFF out my mind!!

  2. #2
(?<!re)tired Mario F.
    Join Date
    May 2006
    Location
    Ireland
    Posts
    8,446
Quote Originally Posted by audinue
Then I found that this algorithm is very, very slow for large images; it took about <1 second.
    Oh, the horror. :/

Quote Originally Posted by audinue
Now I need to do some optimizations, including the use of linear pointer arithmetic.
ermm... what's non-linear pointer arithmetic?
Not trying to be funny. I've just never heard of it.

Quote Originally Posted by audinue
Currently I have:

1. Mean (Average): The test is run N times, then the sum of the elapsed times of all runs is divided by N.
    2. Fastest: The fastest time required to do the process.
    3. Slowest: The slowest time required to do the process.

    That should be it.
Unless performance is your one and only concern, you should then compare the results against other variables, like memory requirements or code clarity, in order to make a decision.


Quote Originally Posted by audinue
Anything else?
I could say that less than one second to process a large image (how large, exactly?) is really not a problem, unless this is meant to be used in batch processing, where indeed every microsecond counts. You should, however, run your code on other systems to get a feel for how processor performance will affect the code's performance. This is the best way to determine when you should stop optimizing. My guess: about right now.
    Originally Posted by brewbuck:
    Reimplementing a large system in another language to get a 25% performance boost is nonsense. It would be cheaper to just get a computer which is 25% faster.

  3. #3
    Registered User
    Join Date
    Dec 2006
    Location
    Canada
    Posts
    3,229
    How large is the image?

Are you sure your code is the bottleneck and not the hard drive?

  4. #4
Officially An Architect brewbuck
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    How large is the image and the convolution kernel?
    Code:
    //try
    //{
    	if (a) do { f( b); } while(1);
    	else   do { f(!b); } while(1);
    //}

  5. #5
Officially An Architect brewbuck
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    The best way to speed up convolution is to express it as a frequency multiplication using FFTs. This generates enormous speedups, faster than anything you could ever achieve by writing optimized convolution algorithms.

    Explicit convolution is normally only used for small convolution kernels, where the constant factors involved in FFT convolution outweigh the quadratic cost of explicit convolution. How fast your convolver operates is heavily dependent on two factors:

    1) Memory access pattern. Convolution is primarily memory bound. The multiplications and additions take far less time than actually loading the data into the processor. A poor memory indexing pattern will have a huge impact on the convolver even if the basic calculations are the same. And unless the kernel and the image can both fit into cache simultaneously, you can never go faster than the speed dictated by your memory bus.

    2) Handling of boundary conditions. There are four common ways the boundary is handled. It can be row/column replicated from the nearest valid row/column. It could be cyclic (modular). It could be treated as an implicit zero. Or the convolution kernel itself can be adjusted to only cover "valid" area of the image.

    Generally you do not want to introduce boundary handling logic into your inner loops. That means you need to process the boundary regions separately from the interior.
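A rough sketch of what that separation can look like for a 3x3 kernel (src, dst, w, h and k are assumed names, and the border is simply replicated from the nearest interior result; this is the shape of the idea, not drop-in code):
Code:
/* 3x3 convolution with all boundary handling kept out of the hot loops.
   Single-channel float image, w >= 3 and h >= 3 assumed. */
static void convolve3x3(const float *src, float *dst, int w, int h,
                        const float k[3][3])
{
    /* Interior: no boundary tests inside the inner loops. */
    for (int y = 1; y < h - 1; ++y)
        for (int x = 1; x < w - 1; ++x) {
            float acc = 0.0f;
            for (int i = -1; i <= 1; ++i)
                for (int j = -1; j <= 1; ++j)
                    acc += k[i + 1][j + 1] * src[(y + i) * w + (x + j)];
            dst[y * w + x] = acc;
        }

    /* Border: a separate pass, here replicating the nearest interior
       output (one of the four strategies listed above). */
    for (int y = 1; y < h - 1; ++y) {            /* left and right columns */
        dst[y * w]         = dst[y * w + 1];
        dst[y * w + w - 1] = dst[y * w + w - 2];
    }
    for (int x = 0; x < w; ++x) {                /* top and bottom rows */
        dst[x]               = dst[w + x];
        dst[(h - 1) * w + x] = dst[(h - 2) * w + x];
    }
}
Walking src in row-major order like this also gives the friendly memory access pattern described in point 1.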
    Code:
    //try
    //{
    	if (a) do { f( b); } while(1);
    	else   do { f(!b); } while(1);
    //}

  6. #6
Guest Sebastiani
    Join Date
    Aug 2001
    Location
    Waterloo, Texas
    Posts
    5,708
Quote Originally Posted by brewbuck
    The best way to speed up convolution is to express it as a frequency multiplication using FFTs. This generates enormous speedups, faster than anything you could ever achieve by writing optimized convolution algorithms.

    Explicit convolution is normally only used for small convolution kernels, where the constant factors involved in FFT convolution outweigh the quadratic cost of explicit convolution. How fast your convolver operates is heavily dependent on two factors:

    1) Memory access pattern. Convolution is primarily memory bound. The multiplications and additions take far less time than actually loading the data into the processor. A poor memory indexing pattern will have a huge impact on the convolver even if the basic calculations are the same. And unless the kernel and the image can both fit into cache simultaneously, you can never go faster than the speed dictated by your memory bus.

    2) Handling of boundary conditions. There are four common ways the boundary is handled. It can be row/column replicated from the nearest valid row/column. It could be cyclic (modular). It could be treated as an implicit zero. Or the convolution kernel itself can be adjusted to only cover "valid" area of the image.

    Generally you do not want to introduce boundary handling logic into your inner loops. That means you need to process the boundary regions separately from the interior.
    Hmm. Interesting suggestions - sounds like good advice.

  7. #7
Ugly C Lover audinue
    Join Date
    Jun 2008
    Location
    Indonesia
    Posts
    489
I want to make a viewer for both manga (Japanese comics) and comics.

Almost all of them are scanlations, hence they contain a lot of huge images.

It absolutely degrades your mood if the viewer is as slow as a turtle just to sharpen, smooth, or retouch a page.


    My simple sharpen kernel is just 3x3.

    Currently, I'm interpreting the convolution process as linear array processing.

So it's not:
Code:
for (y = 0; y < h; y++)
 for (x = 0; x < w; x++)
  for (i = 0; i < kh; i++)
   for (j = 0; j < kw; j++)
    /* multiply and accumulate */
The high-level math-to-code mapping above consumes <1 second to process a 2K x 2K image.

While my technique is just:
Code:
while (p != end)
 while (q != kernel_end)  /* inner walk over the kernel taps */
So the boundary check is just as simple as:
Code:
if (p >= begin && p < end)
p is a pointer.
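Roughly, a sketch of that pointer walk (with a typical 3x3 sharpen kernel standing in for mine, and src/dst/w/h as assumed names):
Code:
/* Linear pointer walk over a single-channel 8-bit image with a 3x3
   sharpen kernel.  The begin/end test only guards against stepping
   outside the buffer; at the left/right image edges the taps still wrap
   into the neighbouring row, so the 1-pixel border would normally be
   redone in a separate pass. */
static void sharpen3x3(const unsigned char *src, unsigned char *dst,
                       int w, int h)
{
    static const int kern[9] = {  0, -1,  0,
                                 -1,  5, -1,
                                  0, -1,  0 };
    const int off[9] = { -w - 1, -w, -w + 1,   /* tap offsets relative   */
                             -1,  0,      1,   /* to the centre pixel    */
                          w - 1,  w,  w + 1 };
    const unsigned char *begin = src, *end = src + (long)w * h;

    for (const unsigned char *p = begin; p != end; ++p, ++dst) {
        int acc = 0;
        for (int t = 0; t < 9; ++t) {
            const unsigned char *q = p + off[t];
            if (q >= begin && q < end)         /* the simple bounds guard */
                acc += kern[t] * *q;
        }
        *dst = (unsigned char)(acc < 0 ? 0 : acc > 255 ? 255 : acc);
    }
}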

I've already done some optimizing using all the best techniques I know, such as minimizing function calls and minimizing the use of arrays, and it now gets done in only >400 ms.

About the FFT, I thought it was good for large kernels, not small ones.
    Just GET it OFF out my mind!!

  8. #8
    Registered User
    Join Date
    Dec 2006
    Location
    Canada
    Posts
    3,229
Since people usually read manga page by page, isn't it possible to pre-process pages in the background?

  9. #9
Officially An Architect brewbuck
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    So to make sure I understand, your goal is to scale an image from a high resolution to a resolution suitable for displaying on a computer screen. To accomplish this you are applying an anti-aliasing filter (low pass?) to the data before scaling down in order to avoid aliasing.

    This is absolutely the correct thing to do, but to give more specific advice, I would need to know what the starting and ending resolutions are (in dots per inch) and whether there is a significant amount of high frequency signal present.

    Unless high frequencies dominate, it is usually best to do a crude, nearest-neighbor downsample to an intermediate resolution, followed by an anti-alias filter to the final resolution. This way, the convolution is applied to a smaller image, reducing the cost.

    It is true that downsampling will introduce aliased frequencies, but this is only a problem if those frequencies were strong to begin with. For example, a cartoon drawing at 600 DPI could be nearest-neighbor downsampled to 300 DPI, then anti-alias downsampled to 100 DPI for on-screen display.
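A sketch of the first, crude stage for a single-channel image and an integer factor (the names here are assumptions, not code from this thread); the anti-alias filtered resize then only has to run on this much smaller result:
Code:
/* Nearest-neighbour decimation by an integer factor, e.g. factor = 2
   for 600 DPI -> 300 DPI.  Deliberately crude: no filtering at all. */
static void nn_downsample(const unsigned char *src, int sw, int sh,
                          unsigned char *dst, int factor)
{
    int dw = sw / factor, dh = sh / factor;

    for (int y = 0; y < dh; ++y)
        for (int x = 0; x < dw; ++x)
            dst[y * dw + x] = src[(y * factor) * sw + (x * factor)];
}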

(In my day job I implement image processing pipelines.)

    @cyberfish: Pre-computing images during idle time is a perfectly good approach, as long as memory is available for it.

Another possibility is to apply crude downsampling to quickly get an image on the screen, then perform a higher quality downsample and update the display with it. This gets the image onto the screen quickly, then a high-quality version appears just moments afterward. Studies have shown that users perceive the responsiveness to be greater when such an approach is used.
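In rough code, the flow is something like this (the three helpers are empty stubs standing in for the viewer's real resize and blit routines, not functions from any library):
Code:
/* Hypothetical stubs: placeholders for the real resize and blit calls. */
static void crude_downsample(const unsigned char *src, unsigned char *dst)   { (void)src; (void)dst; }
static void quality_downsample(const unsigned char *src, unsigned char *dst) { (void)src; (void)dst; }
static void show_on_screen(const unsigned char *img)                         { (void)img; }

/* Rough image now, nice image a moment later. */
static void display_page(const unsigned char *page, unsigned char *scratch)
{
    crude_downsample(page, scratch);    /* fast nearest-neighbour pass      */
    show_on_screen(scratch);            /* something is visible immediately */

    quality_downsample(page, scratch);  /* slow anti-aliased pass           */
    show_on_screen(scratch);            /* quietly replaces the rough image */
}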
    Last edited by brewbuck; 09-06-2009 at 11:53 PM.
    Code:
    //try
    //{
    	if (a) do { f( b); } while(1);
    	else   do { f(!b); } while(1);
    //}

  10. #10
Ugly C Lover audinue
    Join Date
    Jun 2008
    Location
    Indonesia
    Posts
    489
Nice shot, brewbuck!

@cyberfish: Btw, I already implemented the prefetch mechanism plus the cache manager.
Now I'm trying to implement some kind of continuous page display mode... like most PDF readers do...
It's a little bit confusing, though, whether to use the cache or mmap...
    Just GET it OFF out my mind!!
