Profiling and benchmarking

I've just been trying to learn about the image convolution process.

Not the math, but from a programmer's perspective.

So far I've only learned the basic concept.

Then I found that this algorithm is very, very slow for large images, taking more than a second.

This had me screaming all day.

Now I need to do some optimizations, including the use of linear pointer arithmetic.
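To illustrate what I mean by linear pointer arithmetic, something like this sketch (the function `convolve3x3` is a made-up name, not existing code): the image is one flat row-major buffer, and the inner loop advances a single pointer instead of recomputing `y * width + x` for every kernel tap.

```c
#include <stddef.h>

/* Sketch: 3x3 convolution over an 8-bit grayscale image stored in one flat
   row-major buffer. `p` tracks the pixel above-left of the output pixel and
   is simply incremented each step; border pixels are skipped for brevity. */
void convolve3x3(const unsigned char *src, unsigned char *dst,
                 size_t width, size_t height, const float kernel[9])
{
    for (size_t y = 1; y + 1 < height; ++y) {
        const unsigned char *p = src + (y - 1) * width; /* top-left tap */
        unsigned char *out = dst + y * width + 1;       /* first interior output */
        for (size_t x = 1; x + 1 < width; ++x, ++p, ++out) {
            float acc =
                kernel[0] * p[0]             + kernel[1] * p[1]             + kernel[2] * p[2] +
                kernel[3] * p[width]         + kernel[4] * p[width + 1]     + kernel[5] * p[width + 2] +
                kernel[6] * p[2 * width]     + kernel[7] * p[2 * width + 1] + kernel[8] * p[2 * width + 2];
            if (acc < 0.0f)   acc = 0.0f;   /* clamp to the 8-bit range */
            if (acc > 255.0f) acc = 255.0f;
            *out = (unsigned char)(acc + 0.5f);
        }
    }
}
```

This is exactly the kind of variant I'd want the benchmark to compare against the naive doubly-indexed version.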

So, what do I need to compare to decide which algorithm is the right one to use (including algorithm specialization) before cheating with multiple threads?

Currently I have:

- Mean (average): run the test N times, then divide the sum of the elapsed times by N.
- Fastest: the shortest time the process took.
- Slowest: the longest time the process took.

Anything else?

Maybe standard deviation or variance (I don't know what they are useful for, though)?

How many times do I need to run the process to get a reliable result?

I currently use between 100 and 1000 runs.

Thanks in advance.

EDIT:

I want to keep the code portable, so there is no ASM.