# Thread: How To Measure "Difference" Between Images?

1. ## How To Measure "Difference" Between Images?

Hello,

So here I am, stuck overnight at work and whiling away the wee hours on another crazy coding spree.

Anyways, I am attempting to reason with the following:-
I have 32 8-bit greyscale images which may or may not be similar;
I want to have 16 images that best represent the original 32.

My strategy:-
Select the 16 most divergent images (the ones that are the least like the others);
Cluster the other 16 around these, combining the clustered images together.

But what's a good metric to determine how "like" two two-dimensional arrays are?

At the moment I'm doing abs(img1[n] - img2[n]). Yes, I'm hopeless. This can't deal with, for example, comparing images that are complements (inversions) of each other. I need something that takes into account the position of each pixel within the image, I think.

I have heard of the chi-square test being used to compare frames of video, but I'm not a math geek, so I'd appreciate someone putting it through the blender for me, if that is what I need.
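For reference, the per-pixel absolute-difference metric I'm using looks roughly like this in Python/NumPy (a sketch, not my actual code; the names are just placeholders):

```python
import numpy as np

def sad(img1, img2):
    """Sum of absolute differences between two same-sized
    8-bit greyscale images (0 = identical; bigger = more different)."""
    # Promote to a signed type so the subtraction can't wrap around.
    a = img1.astype(np.int32)
    b = img2.astype(np.int32)
    return int(np.abs(a - b).sum())

x = np.array([[0, 255], [128, 64]], dtype=np.uint8)
y = 255 - x  # the complement (inverted) image
print(sad(x, x))  # 0
print(sad(x, y))  # 638 -- note that SAD rates a complement as very different
```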

2. Well I guess that'll keep me busy. Thanks

3. Which technique to use depends on the problem requirements. Do you need to identify images which are scaled/rotated copies of each other? Is your inverted-color example just an example or an actual requirement? Etc.

abs(img1[n] - img2[n]) is called the absolute difference method and is the technique used in most video compression algorithms, so I wouldn't call you "hopeless" for adopting such a technique. It's just that it's not robust to certain types of transformations of the image. If you are more specific I can give some better ideas.

EDIT: You can fairly easily change your absolute-value technique into a correlation technique. First, compute the mean and variance of the pixel values in each image, and normalize so that the mean is zero and the variance is one. After doing this to both images, compute sum(img1[n] * img2[n]) over all pixels. The result is the normalized cross-correlation at zero shift. If this value is close to zero, the images are uncorrelated; if it is positive, the images are positively correlated; if it is negative, the images are negatively correlated (inverted colors).

Again, whether this works better than taking absolute differences depends on what you are trying to achieve.
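That recipe, sketched in Python/NumPy (a rough illustration, names are mine; it assumes neither image is a constant colour, so the variance is nonzero):

```python
import numpy as np

def ncc(img1, img2):
    """Normalized cross-correlation at zero shift.
    ~ +1: strongly similar; ~ 0: uncorrelated; ~ -1: inverted."""
    a = img1.astype(np.float64).ravel()
    b = img2.astype(np.float64).ravel()
    # Normalize each image to zero mean and unit variance.
    a = (a - a.mean()) / a.std()
    b = (b - b.mean()) / b.std()
    # Mean of the elementwise product = the correlation coefficient.
    return float((a * b).mean())

x = np.array([[10, 200], [60, 130]], dtype=np.uint8)
print(ncc(x, x))        # ~ +1.0
print(ncc(x, 255 - x))  # ~ -1.0: complements are perfectly anti-correlated
```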

4. Well, I sorta... am trying to reduce an image to up to 256 8x16 tiles... to load into character memory in 80x25 text mode.
Just so you know, I did wet myself a little writing that.

This is better than various ASCII/ANSI art renderers as they use the standard character set and typically need thousands of characters to render hundreds of pixels. My implementation would be 1:1.
The only downside is depending on the image, fidelity is gonna get pretty bad. So having the right algorithm to "fold" different tiles together is crucial.
Identifying rotations probably wouldn't help (I wouldn't want to lump symmetrical tiles together), slight scales/translations might be useful though, to preserve detail.

5. So you are looking for a set of 256 8x16 bitmaps which in some sense "optimally" represents the images you want to draw? Then, given a particular chunk of image, to identify which of those tiles matches that image chunk best? Like, old-school custom font graphics?

Assuming you say yes, the sum of absolute differences is quite a good method for determining which pre-made tile is the best match for a given image part. Determining the optimal set of tiles is more of an open-ended question, but I would take a large sample of the images you will be using (hopefully all of them), divide them into 8x16 chunks, then cluster them into 256 different clusters.

I can propose clustering methods if you confirm that this is indeed what you are doing.
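To make the chunk-then-cluster idea concrete, here is one way it could look with a plain k-means (a sketch only: the tile size, the toy image, k, and all the names are my own assumptions; for the real thing you would feed in all your images and use k = 256):

```python
import numpy as np

def image_to_chunks(img, ch=16, cw=8):
    """Split a greyscale image into (ch x cw) chunks, one row per chunk."""
    h, w = img.shape
    chunks = []
    for y in range(0, h - ch + 1, ch):
        for x in range(0, w - cw + 1, cw):
            chunks.append(img[y:y+ch, x:x+cw].astype(np.float64).ravel())
    return np.array(chunks)

def kmeans(chunks, k, iters=20, seed=0):
    """Plain k-means: returns k centroid tiles and a label per chunk."""
    rng = np.random.default_rng(seed)
    # Start from k randomly chosen chunks.
    centroids = chunks[rng.choice(len(chunks), size=k, replace=False)].copy()
    for _ in range(iters):
        # Assign every chunk to its nearest centroid (squared Euclidean).
        d = ((chunks[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Move each centroid to the mean of its cluster
        # (this is the "combining the clustered images" step).
        for j in range(k):
            members = chunks[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids, labels

# Toy run: a random 64x64 "image" -> 32 chunks -> 8 representative tiles.
img = np.random.default_rng(1).integers(0, 256, (64, 64), dtype=np.uint8)
chunks = image_to_chunks(img)     # 32 chunks of 16x8 = 128 pixels each
tiles, labels = kmeans(chunks, k=8)
print(chunks.shape, tiles.shape)  # (32, 128) (8, 128)
```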

6. Almost got it, except mapping part of the image to a tile will modify the tile to include features from that part. Everywhere else that that tile is used will take on that change.
As I said, likely to get noisy but it's the best way that I can think of doing this. That and keeping the images small and greyscale.
I'm developing the algorithm around a 256x192 test image, 384 tiles of input -> 256 tiles of output.
I am also mulling reserving one tile for the border and for areas of the image that are solid colour or close to it. Mapping parts to this tile would not modify the tile.
Getting there. 8)

7. Originally Posted by brewbuck
I would take a large sample of the images you will be using (hopefully all of them), divide them into 8x16 chunks, then cluster them into 256 different clusters.
Me too, pretty much. Do you know any clever ways to include psychovisual modeling?

(I don't. I do know the brute-force way of first detecting any human-noticeable details (edges, curves, intersections, corners, and dots), using both the list of details and the pixel difference statistics to compare chunks.)

To those who are unfamiliar with the term "psychovisual modeling":

To a human, \ and ╲ are very similar, as are x and × and X and ╳. However, | and ( are quite dissimilar. Applying the way humans perceive visual differences to quantifying differences or similarity, is applying psychovisual modeling.

Psychoacoustic modeling is much more common. For example, when audio data is digitally compressed, the noise generated is often shaped so that the frequency spectrum matches the sensitivity of human hearing; this minimizes the perception of that noise.

8. Originally Posted by Nominal Animal
To a human, \ and ╲ are very similar, as are x and × and X and ╳. However, | and ( are quite dissimilar. Applying the way humans perceive visual differences to quantifying differences or similarity, is applying psychovisual modeling.
Hmm... I recall reading an article about those optical illusions where, say, two lines of the same length are presented in slightly different contexts (due to surrounding lines) and hence perceived as being of different lengths. It mentioned that in certain non-urban cultures, people were not susceptible to the illusion; the theory highlighted by the article is that these people spend most of their time in environments without such straight lines, so their perception is different. If so, wouldn't this mean that "to a human" might be an unwarranted generalisation in your statement, or is psychovisual modeling able to account for such cultural/environmental factors?

9. Originally Posted by laserlight
wouldn't this mean that "to a human" might be an unwarranted generalisation in your statement
Yes and no. Yes, there are cultural and even individual differences as to exactly how humans perceive visual details. No, because the features I mentioned are general enough; they are perceived the same way by practically all humans.

In fact, I do believe most of the research into this kind of visual metrics is actually done in relation to machine vision. I personally suspect that the importance of these features (straight lines, intersections, curves, spots) to perception is roughly common across most non-nocturnal mammals with good stereoscopic vision, not just humans.

Originally Posted by laserlight
is psychovisual modeling able to account for such cultural/environmental factors?
In general, no. The same applies to psychoacoustic modeling, too.

For example, the frequency-specific sensitivity of hearing is very dependent on age, not just cultural/environmental factors. As humans age, the high-frequency end of the spectrum is lost, mostly due to (normal) damage to the hair cells in the organ of Corti. Headphone use, listening to loud music, working in a noisy industrial environment all affect this.

Mostly, the models are quite, quite rough. We're nowhere near the level of detail where optical illusions work.

In my opinion, if optical illusions are the patterns in tree bark, then with psychovisual modeling we are roughly able to tell that there is a forest there, but not even the kinds of trees in it. Fortunately, even such rough models yield useful results compared to, e.g., plain pixel statistics.

A recent Slashdot article mentions how D. Kriesel, a German researcher, has found that in some specific situations, some scanners/photocopiers alter numbers when scanning documents.

It seems that this is a side effect of the JBIG2 compression used internally by the scanners and photocopiers; it basically chunks the page much like the OP in this thread, and uses pixel-based statistics to compare those chunks.

Unfortunately, there is only a small difference between a 6 and an 8, for example. When a chunk happens to sit just right on a single digit, that chunk gets reused as-is elsewhere. That seems to be the reason for the 6 ↔ 8 substitutions in the examples, at least.

This is an excellent example as to why psychovisual modeling should have been taken into account: the pixel statistics alone are just not enough to give a satisfactory result.

I do believe counting the number of "intersections" and "endpoints" (and perhaps discontinuous changes in curvature) in each chunk would be sufficient, and not too difficult or slow to implement. (The dataset a scanner/photocopier works on is rather big, and they don't have very powerful image-manipulation hardware to start with.)
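A crude sketch of that endpoint/junction counting, assuming each chunk has already been thinned to a one-pixel-wide 0/1 skeleton (the simple neighbour-count rule here is my own choice for illustration, not an established OCR feature detector):

```python
import numpy as np

def endpoint_and_junction_counts(skel):
    """Count endpoints (exactly 1 neighbour) and junctions (3+ neighbours)
    in a thinned 0/1 skeleton -- one rough per-chunk signature."""
    h, w = skel.shape
    padded = np.zeros((h + 2, w + 2), dtype=np.uint8)
    padded[1:-1, 1:-1] = skel
    # 8-connected neighbour count at every pixel.
    nbrs = sum(np.roll(np.roll(padded, dy, 0), dx, 1)
               for dy in (-1, 0, 1) for dx in (-1, 0, 1)
               if (dy, dx) != (0, 0))[1:-1, 1:-1]
    on = skel == 1
    endpoints = int((on & (nbrs == 1)).sum())
    junctions = int((on & (nbrs >= 3)).sum())
    return endpoints, junctions

# A 1-pixel-wide horizontal stroke: two line ends, no junctions.
line = np.zeros((3, 5), dtype=np.uint8)
line[1, :] = 1
print(endpoint_and_junction_counts(line))  # (2, 0)
```

Comparing these counts between two chunks (on top of the pixel statistics) would catch cases like 6 vs. 8, which differ by exactly one "hole".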

10. Originally Posted by Nominal Animal
Me too, pretty much. Do you know any clever ways to include psychovisual modeling?
The best-understood areas of human vision, from an image-processing perspective, are the frequency/contrast luminance response curves and chromatic sensitivity. I am not aware of any research specifically into the eye's sensitivity to small-scale variation in textures. Simple measures like SAD, Hamming distance (which is the same as SAD for bitonal images), or correlation are probably the best you're going to get unless someone can find the relevant research. I don't know it.

I can pose this question to one of the human vision scientists here.
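To illustrate the SAD/Hamming equivalence for bitonal (0/1) tiles, a tiny sketch (the tile values are made up):

```python
import numpy as np

def hamming(a, b):
    """Hamming distance between two bitonal (0/1) tiles:
    the number of pixels where they differ."""
    return int((a != b).sum())

t1 = np.array([[1, 0, 1], [0, 1, 1]], dtype=np.uint8)
t2 = np.array([[1, 1, 0], [0, 1, 0]], dtype=np.uint8)
print(hamming(t1, t2))                                      # 3
# For 0/1 pixels, |a - b| is 1 exactly where they differ, so SAD agrees:
print(int(np.abs(t1.astype(int) - t2.astype(int)).sum()))   # 3
```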

11. Originally Posted by brewbuck
I am not aware specifically of any research into the eye's sensitivity to small-scale variation in textures.
I was referring more to the image processing in the human brain.

I've seen some related research (I'm just interested in it, it's not at all my field) in machine vision, stuff like edge detection and estimating distances in single images (i.e. non-stereoscopic) using specific identified details like those I mentioned.

While technically the researchers are loath to say it's psychovisual modeling, it really is what they're doing: they find ways to program (or in some cases teach, if using neural nets or similar) the machine to perceive or extrapolate similar information humans (the researchers themselves) derive from the same images.

Originally Posted by brewbuck
I can pose this question to one of the human vision scientists here.
I suspect machine vision or image processing people might know more interesting techniques.

D'oh! I'm stupid, or getting old, or both.

I just now remembered where I first encountered this stuff: in optical character recognition, specifically feature detection. The Wikipedia article describes pretty well what I have been trying to describe here.

12. While you intellectuals are busy sizing each other up, I thought you might be interested in seeing the test image that I'm using:-
https://ece.uwaterloo.ca/~z70wang/re...s/image024.gif
(Credit: Zhou Wang, University of Waterloo)

I'm banking on the similarity in the roofing to quantize into tiles better than other kinds of detail.

13. Originally Posted by SMurf
I'm banking on the similarity in the roofing to quantize into tiles better than other kinds of detail.
Photographs are usually easier to compress than other kinds of images, and don't need the kind of handling I was talking about.

After all, the scanners and photocopiers use a very similar method (chunking the page and reusing chunks that approximately match), and the severe problems this caused with numeric data seem to have completely surprised the manufacturers.

What I am saying is that when your implementation works fine with photographs, do not assume it will work fine for all other use cases too.