For checksumming a large number of files, I'd use existing tools. You can do this easily in Linux, but it should work under WSL just as well:
(Replace "checksum" with your checksumming program, e.g., sha256sum.)

Code:
find DIR -type f -print0 | xargs -0 -n1 -P8 checksum > checksum.txt
That will checksum all of the files under the directory "DIR", up to 8 at a time in parallel, and it will very likely finish sooner than writing a custom program for the job. On my computer, sha256sum takes about 8 seconds to checksum a 2.5 GB file. If your collection is in the neighborhood of 500 GB, that works out to about 1600 seconds (26m 40s) of CPU time, or about 200 seconds (3m 20s) of real time with the checksums computed in parallel on 8 CPUs. Of course, this assumes sha256sum is the bottleneck rather than the disk/flash I/O speed; but regardless, even the non-parallel checksummer could have finished in the time since you first asked.
Wouldn't it be an option to store the checksum in the filename ?
A fast one! Probably Fletcher's checksum, or something similar.
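Fletcher's checksum is simple enough to sketch in a few lines. Here is a minimal Fletcher-16 in Python (the 16-bit variant is an assumption; the poster only said "Fletcher's checksum, or something similar"). Unlike a plain additive sum, it is sensitive to byte order, though it is far weaker than a cryptographic hash:

```python
def fletcher16(data: bytes) -> int:
    """Fletcher-16: two running sums modulo 255.

    sum1 accumulates the bytes; sum2 accumulates sum1, which makes the
    result depend on byte position, not just byte values.
    """
    sum1 = 0
    sum2 = 0
    for byte in data:
        sum1 = (sum1 + byte) % 255
        sum2 = (sum2 + sum1) % 255
    return (sum2 << 8) | sum1

# Transposing two bytes changes the checksum; a plain sum would miss it.
print(hex(fletcher16(b"abcde")))  # 0xc8f0
print(hex(fletcher16(b"abdce")))  # 0xc9f0
```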
Thanks, that looks interesting.
It's looking like some file-types can carry metadata, and others can't.
But I've made the usual discovery. Searching for things on the net often depends on using the right search terms. I've been searching for "file tags" (and similar phrases), with little success. A search I just did on "Windows file metadata" has already thrown up a lot of interesting links, that I can now investigate further. So thank you both, for the search term "metadata"!
My music collection, of .mp3 and .flac files is (according to Windows Explorer 'Properties') "147,393 Files, 12,112 Folders", occupying 1.75 TBytes.
I currently have a system using batch files and Microsoft's checksummer "FCIV.exe". It tries to accumulate results in a common file, but it takes forever, and it crashes regularly. I think the crashes are down to the parallel execution you mention, with multiple instances of something trying to access the common results file?
Whatever the reason, it crashes, and it takes many hours to run. Hence my thought of designing and coding something myself.
The results from FCIV are not great for post-processing either, as different error conditions are reported using different text, and it's difficult to find out what all the variants are.
I'm looking for a solution that works simply, giving a YES/NO answer for each file, and allowing me to follow on with a backup so that the whole thing takes less than a day, not more.
What Windows do you use?
Under Windows 10 you can enable long filenames, with a length of up to 32,767 characters:
Enable long file name support in Windows 10 | IT Pro
Do you already have a backup of all files? If not, what will you verify them against?
Most backup software will do some sort of file verification automatically and will let you roll back to any previous version (useful if you ever back up a corrupted file over a good one).
I'm still left puzzled as to why you want to checksum/hash the files... Are you intending to store them on media that has a high bit error rate? Do you want some short 'digest' of each file, allowing you to locate duplicates? Are you expecting something to deliberately trash your files by writing random data into random files? Or that all your MP3s will be RickRolled?
As you most probably know, for any reasonable storage medium, if you can read all the bytes in a file then you can be pretty sure it is an accurate copy; if that weren't so, computing wouldn't work at all. It is only when the data travels over communication links that there is a risk of it getting corrupted. Even that is exceedingly rare, usually involving bad RAM or software in a router or firewall, since the packets are checksummed while they are on the wire.
More of a worry is unreadable blocks, where the media's ECC has failed to correct errors. In that case having checksums doesn't help because you most likely won't be able to read the block at the OS level.
Once a file is checksummed, you can verify it against its former self by checksumming it again.
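That verify-against-former-self idea can be sketched in a few lines of Python using the standard hashlib module (the file name here is just an illustration):

```python
import hashlib

def sha256_of(path: str) -> str:
    """Hash a file in chunks, so large audio files need not fit in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB chunks
            h.update(chunk)
    return h.hexdigest()

# Record the checksum once...
with open("track.flac", "wb") as f:
    f.write(b"pretend this is FLAC audio")
recorded = sha256_of("track.flac")

# ...and later verify the file against its former self by hashing again.
print("intact" if sha256_of("track.flac") == recorded else "CORRUPT")
```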
Most backup software does NOT do the kind of verification I'm talking about. It only verifies the copy it just made against the file it just copied.
If the 'master' copy of a file was already corrupt, I know of no backup software that will detect this, recognise that an earlier backup still contains the uncorrupted information, and restore it.
I'm not sure if I'm writing this as clearly as I could/should, but it's critical to keeping your files uncorrupted: confirm that the file has not become corrupt since it was last backed-up, BEFORE you overwrite a good copy of the file with a corrupt one.
File corruption happens. It is statistically unavoidable, although we can do a fair amount to reduce the probability of it happening. When it has happened, there is no flag set anywhere to tell you "Hey, this file has become corrupt!", you have to detect it for yourself, by checking.
When my music collection was subject to some corruption, I only found out about it when foobar2000 threw up an error box to tell me it couldn't play the file. I'm still discovering corrupt files, or rather foobar2000 is, that became corrupt at the same time. [My music collection is too large to listen to everything to see if it plays or not.]
No, I don't want to detect duplicates, I only want to keep my files uncorrupted. Checksumming them is the only practical way to allow them to be checked for corruption before they are written over a good copy during backup.
I'm not proposing to double-check my backup software, to confirm that the backup copy it just made is a good copy. I want to detect file corruption that has happened to the 'master' file since the last backup, and (if necessary) to restore a good copy from previous backups.
Obviously, you can only do this if you can catch the corruption before you over-write a good backup copy with a corrupt 'master' copy. The latter is what I did, unknowingly, by blindly backing-up without checking first that the master files were still OK.
Sadly, that isn't so. Any hard disk storage system, even the most sophisticated RAID array, can suffer corruption. All the RAID technology can do is to make it much less likely to occur.
In that case, the checksums aren't needed. You delete the obviously-damaged file, and restore a good copy from your backup. No backup system is worth the trouble if you can't go full circle and restore a good copy when necessary.
Thanks for taking the trouble to comment!
Ah, that makes more sense.
If I were implementing this, I would make a hidden 'checksums' file in each directory, covering just the local files in that directory, rather than inserting the checksums into the file names.
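A sketch of that per-directory layout, assuming SHA-256 and a ".checksums" file name (both the name and the line format here are illustrative choices, not a standard):

```python
import hashlib
import os

def sha256_of(path: str) -> str:
    """Hash a file in chunks to keep memory use bounded."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def write_dir_checksums(root: str) -> None:
    """Drop a '.checksums' file in every directory under root,
    listing only the files local to that directory."""
    for dirpath, _dirnames, filenames in os.walk(root):
        local = [f for f in filenames if f != ".checksums"]
        if not local:
            continue
        with open(os.path.join(dirpath, ".checksums"), "w") as out:
            for name in sorted(local):
                digest = sha256_of(os.path.join(dirpath, name))
                out.write(f"{digest}  {name}\n")
```

Note that on Windows a leading dot does not hide the file by itself; you would set the hidden attribute separately.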
Have you done the numbers for the average data rates you need to sustain to meet your time goals?
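Using the 1.75 TB figure from earlier in the thread, the arithmetic looks roughly like this (the time windows are illustrative):

```python
# Sustained read rate needed to checksum 1.75 TB within a given window.
total_bytes = 1.75e12

for hours in (24, 12, 6):
    rate_mb_s = total_bytes / (hours * 3600) / 1e6
    print(f"{hours:2d} h -> about {rate_mb_s:.0f} MB/s sustained")
```

So the "less than a day" goal needs only about 20 MB/s sustained, which even a single spinning disk can deliver if the reads aren't too scattered.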
Last edited by hamster_nz; 06-05-2021 at 06:55 AM.