For checksumming a large number of files, I'd use existing tools. You can do this easily in Linux, but it should work under WSL just as well:
(Replace "checksum" with your checksumming program, e.g., sha256sum.)

Code:
find DIR -type f -print0 | xargs -0 -n1 -P8 checksum > checksum.txt
That will checksum all of the files under the directory "DIR", up to 8 at a time in parallel, and it will very likely finish sooner than writing a custom program for the job. On my computer, sha256sum takes about 8 seconds to checksum a 2.5 GB file. If your collection is in the neighborhood of 500 GB, that works out to about 1600 seconds (26m 40s) of CPU time, or about 200 seconds (3m 20s) of real time with the checksums computed in parallel on 8 CPUs. Of course, this assumes sha256sum is the bottleneck rather than the disk/flash I/O speed; but regardless, even the non-parallel checksummer could have finished in the time since you first asked.
Wouldn't it be an option to store the checksum in the filename ?
A fast one! Probably Fletcher's checksum, or something similar.
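Fletcher's checksum is simple enough to sketch in a few lines. Here is a minimal Fletcher-16 in Python (the 16-bit variant is an assumption; the poster only said "Fletcher's checksum, or something similar"). Unlike a plain additive sum, it is sensitive to byte order, though it is far weaker than a cryptographic hash:

```python
def fletcher16(data: bytes) -> int:
    """Fletcher-16: two running sums modulo 255.

    sum1 accumulates the bytes; sum2 accumulates sum1, which makes the
    result depend on byte position, not just byte values.
    """
    sum1 = 0
    sum2 = 0
    for byte in data:
        sum1 = (sum1 + byte) % 255
        sum2 = (sum2 + sum1) % 255
    return (sum2 << 8) | sum1

# Transposing two bytes changes the checksum; a plain sum would miss it.
print(hex(fletcher16(b"abcde")))  # 0xc8f0
print(hex(fletcher16(b"abdce")))  # 0xc9f0
```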
Thanks, that looks interesting.
It's looking like some file-types can carry metadata, and others can't.
But I've made the usual discovery. Searching for things on the net often depends on using the right search terms. I've been searching for "file tags" (and similar phrases), with little success. A search I just did on "Windows file metadata" has already thrown up a lot of interesting links, that I can now investigate further. So thank you both, for the search term "metadata"!
My music collection, of .mp3 and .flac files is (according to Windows Explorer 'Properties') "147,393 Files, 12,112 Folders", occupying 1.75 TBytes.
I currently have a system using batch files and Microsoft's checksummer "FCIV.exe". It tries to accumulate results in a common file, but it takes forever, and it crashes regularly. I think the crashes are down to the parallel execution you mention, with multiple instances of something trying to access the common results file?
Whatever the reason, it crashes, and it takes many hours to run. Hence my thought of designing and coding something myself.
The results from FCIV are not great for post-processing either, as different error conditions are reported using different text, and it's difficult to find out what all the variants are.
I'm looking for a solution that works simply, giving a YES/NO answer for each file, and allowing me to follow on with a backup so that the whole thing takes less than a day, not more.
What Windows do you use?
Under Windows 10 you can enable long filenames, with a length of up to 32,767 characters:
Enable long file name support in Windows 10 | IT Pro
Do you already have a backup of all files? If not, what will you verify them against?
Most backup software will do some sort of file verification automatically and will let you roll back to any previous version (useful if you ever back up a corrupted file over a good one).
I'm still left puzzled as to why you want to checksum/hash the files... Are you intending to store them on media that has a high bit error rate? Do you want some short 'digest' of each file, allowing you to locate duplicates? Are you expecting something to deliberately trash your files by writing random data into random files? Or that all your MP3s will be RickRolled?
As you most probably know, for any reasonable storage medium, if you can read all the bytes in a file then you can be pretty sure it is an accurate copy; if that weren't so, computing wouldn't work at all. It is only when the data travels over communication links that there is a risk of it getting corrupted. Even that is exceedingly rare, usually involving bad RAM or software in a router or firewall, since the packets are checksummed while they are on the wire.
More of a worry is unreadable blocks, where the media's ECC has failed to correct errors. In that case having checksums doesn't help because you most likely won't be able to read the block at the OS level.
Once a file is checksummed, you can verify it against its former self by checksumming it again.
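That verify-against-former-self idea can be sketched in a few lines of Python using the standard hashlib module (the file name here is just an illustration):

```python
import hashlib

def sha256_of(path: str) -> str:
    """Hash a file in chunks, so large audio files need not fit in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB chunks
            h.update(chunk)
    return h.hexdigest()

# Record the checksum once...
with open("track.flac", "wb") as f:
    f.write(b"pretend this is FLAC audio")
recorded = sha256_of("track.flac")

# ...and later verify the file against its former self by hashing again.
print("intact" if sha256_of("track.flac") == recorded else "CORRUPT")
```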
Most backup software does NOT do the kind of verification I'm talking about. It only verifies the copy it just made against the file it just copied.
If the 'master' copy of a file was already corrupt, I know of no backup software that will detect this, recognise that an earlier backup still contains the uncorrupted information, and restore it.
I'm not sure if I'm writing this as clearly as I could/should, but it's critical to keeping your files uncorrupted: confirm that the file has not become corrupt since it was last backed-up, BEFORE you overwrite a good copy of the file with a corrupt one.
File corruption happens. It is statistically unavoidable, although we can do a fair amount to reduce the probability of it happening. When it has happened, there is no flag set anywhere to tell you "Hey, this file has become corrupt!", you have to detect it for yourself, by checking.
When my music collection was subject to some corruption, I only found out about it when foobar2000 threw up an error box to tell me it couldn't play the file. I'm still discovering corrupt files, or rather foobar2000 is, that became corrupt at the same time. [My music collection is too large to listen to everything to see if it plays or not.]
No, I don't want to detect duplicates, I only want to keep my files uncorrupted. Checksumming them is the only practical way to allow them to be checked for corruption before they are written over a good copy during backup.
I'm not proposing to double-check my backup software, to confirm that the backup copy it just made is a good copy. I want to detect file corruption that has happened to the 'master' file since the last backup, and (if necessary) to restore a good copy from previous backups.
Obviously, you can only do this if you can catch the corruption before you over-write a good backup copy with a corrupt 'master' copy. The latter is what I did, unknowingly, by blindly backing-up without checking first that the master files were still OK.
Sadly, that isn't so. Any hard disk storage system, even the most sophisticated RAID array, can suffer corruption. All the RAID technology can do is to make it much less likely to occur.
In that case, the checksums aren't needed. You delete the obviously-damaged file, and restore a good copy from your backup. No backup system is worth the trouble if you can't go full circle and restore a good copy when necessary.
Thanks for taking the trouble to comment!
Ah, that makes more sense.
If I were implementing this, I would make a hidden 'checksums' file in each directory, covering just the local files in that directory, rather than inserting the checksums into the file names.
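A sketch of that per-directory layout, assuming SHA-256 and a ".checksums" file name (both the name and the line format here are illustrative choices, not a standard):

```python
import hashlib
import os

def sha256_of(path: str) -> str:
    """Hash a file in chunks to keep memory use bounded."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def write_dir_checksums(root: str) -> None:
    """Drop a '.checksums' file in every directory under root,
    listing only the files local to that directory."""
    for dirpath, _dirnames, filenames in os.walk(root):
        local = [f for f in filenames if f != ".checksums"]
        if not local:
            continue
        with open(os.path.join(dirpath, ".checksums"), "w") as out:
            for name in sorted(local):
                digest = sha256_of(os.path.join(dirpath, name))
                out.write(f"{digest}  {name}\n")
```

Note that on Windows a leading dot does not hide the file by itself; you would set the hidden attribute separately.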
Have you done the numbers for the average data rates you need to sustain to meet your time goals?
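Using the 1.75 TB figure from earlier in the thread, the arithmetic looks roughly like this (the time windows are illustrative):

```python
# Sustained read rate needed to checksum 1.75 TB within a given window.
total_bytes = 1.75e12

for hours in (24, 12, 6):
    rate_mb_s = total_bytes / (hours * 3600) / 1e6
    print(f"{hours:2d} h -> about {rate_mb_s:.0f} MB/s sustained")
```

So the "less than a day" goal needs only about 20 MB/s sustained, which even a single spinning disk can deliver if the reads aren't too scattered.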
Last edited by hamster_nz; 06-05-2021 at 06:55 AM.