Image hashing

by ducky
For who knows why, I started thinking about image hashes a week ago. MD5 hashes, the kind IB uses, are good for checking the integrity of a file, but if you cut even one pixel out, the result is treated as a whole new file... That’s not good for images. Ideally, you’d want the computer to determine similarity between two images from the shapes and colors, not individual bits.

And it’s fucking HARD. Fuzzy logic is way beyond my coding skills. But I had to try.

My idea was to basically scale each picture down to an 8*8 grid and create a hash from there. I start from the top left corner and move through each pixel of each RGB channel asking this question: is this channel pixel lighter or darker than the one before?

So I get 64 bits (00110110...) for each channel, and from those a numeric string representing the overall changes in each channel.

Something like: 8426430550391263784-7825201099380511380-7825728728327139888 for my latest “Snapshot”... Red-Green-Blue as three 64-bit values.
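
To make the steps above concrete, here is a minimal sketch of that kind of hash. It is not the script from the post. The function names are made up for illustration, and two details are assumptions: the first pixel is compared against the last one (the post doesn't say what "the one before" is for the top left pixel), and "lighter" sets the bit to 1. The optional load_grid helper assumes Pillow is installed and lets the library's default interpolation do the scaling.

```python
def channel_bits(values):
    """Pack one channel's 64 values (row-major, top left first) into an int.

    Each bit answers: is this pixel lighter than the one before it?
    Assumption: the first pixel is compared against the last (wraparound).
    """
    bits = 0
    prev = values[-1]
    for v in values:
        bits = (bits << 1) | (1 if v > prev else 0)
        prev = v
    return bits


def image_hash(pixels):
    """pixels: 64 (r, g, b) tuples from an 8x8 image. Returns (red, green, blue) ints."""
    return tuple(channel_bits([p[c] for p in pixels]) for c in range(3))


def format_hash(h):
    """Join the three channel values into the 'R-G-B' string form."""
    return "-".join(str(x) for x in h)


def load_grid(path, size=8):
    """Hypothetical helper: shrink an image file to size x size RGB pixels with Pillow."""
    from PIL import Image  # imported here so the rest works without Pillow
    img = Image.open(path).convert("RGB").resize((size, size))
    raw = img.tobytes()  # interleaved R, G, B bytes
    return [tuple(raw[i:i + 3]) for i in range(0, len(raw), 3)]
```

With Pillow available, format_hash(image_hash(load_grid("some_image.png"))) would produce a string in the same three-part shape as the one above, though not necessarily the same numbers as the original script.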

I found that I could scale a photo and get a near identical match. (0-3 bits off)
Wholly different pictures had on average about 20-30 bits off. (Snapshot & Jätte bra had 25 bits off)
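
The "bits off" figure is a Hamming distance: XOR the two hashes and count the 1 bits. A small sketch, assuming the hash is the three-part string shown earlier and that the count is summed over all three channels (the post doesn't say whether it is per channel or total). The function names are illustrative.

```python
def parse_hash(text):
    """Turn 'R-G-B' into a tuple of three ints."""
    return tuple(int(part) for part in text.split("-"))


def bits_off(a, b):
    """Count differing bits between two parsed hashes (summed over channels)."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


snapshot = parse_hash("8426430550391263784-7825201099380511380-7825728728327139888")
print(bits_off(snapshot, snapshot))  # identical hashes differ in 0 bits
```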

What’s good about this approach is that it’s relatively simple to calculate, and that the hash is relative to the image itself: it records changes between neighbouring pixels rather than absolute values. What’s still bad about it is that the result depends on the scaling function, which probably isn’t the same from one imaging library to the next. Ideally I’d have calculated the average of the pixels within each region, but for testing purposes I let the imaging library deal with the interpolation.
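
For reference, averaging the pixels within each region could look something like the sketch below. It works on a single channel given as a list of rows, and it assumes the image is at least 8 pixels wide and tall so that no region is empty. box_average is a made-up name, not something from the script or from PIL.

```python
def box_average(rows, size=8):
    """Reduce one channel (list of rows of numbers) to size*size region averages.

    Returns the averages in row-major order, top left first.
    Assumption: the image is at least size pixels in each dimension.
    """
    height, width = len(rows), len(rows[0])
    out = []
    for gy in range(size):
        # Integer region bounds that cover the whole image without overlap.
        y0, y1 = gy * height // size, (gy + 1) * height // size
        for gx in range(size):
            x0, x1 = gx * width // size, (gx + 1) * width // size
            cells = [rows[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            out.append(sum(cells) / len(cells))
    return out
```

Since this only uses plain arithmetic, the 64 values it produces don't depend on which interpolation filter a library happens to use.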

Bugh.. Head spinning.

You can download the .py here. You'll need the PIL imaging library. Just drag and drop an image onto the script, and it'll calculate a hash. Drag two and it'll calculate the difference between them.
Added: 13 years, 3 months ago
 
GreenReaper
13 years, 3 months ago
Heh. I'm probably going to have to do something like that at some point to identify duplicates of different sizes.
ducky
13 years, 3 months ago
The problem I see you tackling, then, is search. Or it depends how fuzzy you want it to be... A simple h1==h2 string comparison would work relatively ok for exact matches, but if you're looking for overall similarity, you'll have to do bit by bit comparisons.
But if the hash also carried a loose width/height ratio as the first tuple, it might help narrow down the hashes to process...
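
A rough sketch of that narrowing-down idea: bucket every image by a coarse aspect ratio and only run the bit comparison within matching buckets. Everything here is an assumption for illustration (the names, the log2 bucketing, the steps=4 granularity), and the ratio is kept as a separate lookup key rather than literally being the first part of the hash string.

```python
import math
from collections import defaultdict


def ratio_bucket(width, height, steps=4):
    """Coarse aspect-ratio bucket: log2(width/height) rounded to 1/steps."""
    return round(math.log2(width / height) * steps)


def build_index(entries):
    """entries: iterable of (name, width, height, hash). Groups them by bucket."""
    index = defaultdict(list)
    for name, width, height, h in entries:
        index[ratio_bucket(width, height)].append((name, h))
    return index


def candidates(index, width, height):
    """Entries whose bucket matches or neighbours the query's bucket.

    Neighbours are included because a ratio near a bucket edge can round
    either way after resizing.
    """
    b = ratio_bucket(width, height)
    return [e for k in (b - 1, b, b + 1) for e in index.get(k, [])]
```

Only the entries returned by candidates() would then need the bit by bit comparison.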

Sorry, just thinking out loud here.
GreenReaper
13 years, 3 months ago
That's a pretty good idea, most people will shrink their pictures rather than cropping to fit. Thanks. :-)

The end application would be identifying identical works from an artist on different galleries, which may have been resized. I'll probably have to download all files at least once, but I don't want to have to compare each image against every other image, and stuff like this will definitely cut down on the number of comparisons I have to do.