Integrity Checking for Password-based Archive Encryption

**Abyssion** · 03-13-2016

Hey all. It seems all I'm doing is posting on the Tech Board these days; apologies!

So, some cryptographers at my university are working on some novel crypto algorithms and are looking for people to write implementations for them (for testing purposes, I guess). I'm looking to incorporate their algorithms into an archive encryption scheme, similar to 7Zip and WinRAR's cryptosystems. So, the basic premise of the final 'product' would be:

1. Prompt the user for the archive's password.
2. Hash the input iteratively (X times) with [novel hash algorithm (NHA)] and compare the result to a value stored at a given offset within the archive.
3. Integrity check the archive to detect corruption or tampering.
4. Decrypt the archive using [novel symmetric-crypto algorithm (NSCA)], feeding the output of the reiteration of NHA(input) (X/2 number of times) to the function as the decryption key.

I was just wondering how 7Zip, WinRAR and other programs accomplish their integrity-checking securely. Perhaps they hash the whole contents of the (encrypted) archive and append that to the beginning of the file during encryption. During integrity checking, they could repeat this process: hash the contents of the archive (minus the message digest previously appended,) and see if it matches the first 256 or 512 bits or so of the file. However, it seems as though this could easily be attacked by tampering with the archive as desired (probably corrupting the internal directory structures,) then manually hashing the tampered archive and replacing the first 32 or 64 bytes with the output of the hash function. Plus, this would mean that brute forcing the password would be unnecessary; attackers could brute force the actual archive itself until they hit a collision. (Now, granted, the archive itself is probably much larger than the password and therefore would take longer to brute force (and there is a small chance that they would hit a false-positive collision which, defensively, isn't a bad thing). However, the number of iterations that the archive is to be hashed with would likely be a lot lower than the number of iterations that the password is hashed with to facilitate speed of the integrity checking process, allowing for a more efficient attack vector.)
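To make the worry above concrete, here is a minimal sketch of the unkeyed scheme, assuming SHA-256 stands in for the unpublished NHA and that the digest is simply prepended to the encrypted archive. The function names are hypothetical and exist only for this illustration.

```python
import hashlib

DIGEST_LEN = 32  # SHA-256 output size in bytes (stand-in for the NHA digest)

def seal_with_plain_hash(ciphertext: bytes) -> bytes:
    """Prepend an unkeyed digest of the ciphertext (the naive scheme)."""
    return hashlib.sha256(ciphertext).digest() + ciphertext

def check_plain_hash(archive: bytes) -> bool:
    """Recompute the digest over the body and compare it with the stored one."""
    stored, body = archive[:DIGEST_LEN], archive[DIGEST_LEN:]
    return hashlib.sha256(body).digest() == stored

archive = seal_with_plain_hash(b"pretend this is ciphertext")

# The attacker flips one bit in the body, then recomputes the digest.
tampered_body = bytearray(archive[DIGEST_LEN:])
tampered_body[0] ^= 0x01
forged = seal_with_plain_hash(bytes(tampered_body))

print(check_plain_hash(archive))  # True
print(check_plain_hash(forged))   # True: the tampering goes undetected
```

Nothing secret goes into the digest, so anyone who can write to the file can also produce a matching digest. Accidental corruption is still caught, but deliberate tampering is not.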

Alternatively, they could perform integrity checking as follows: following encryption, a hard coded value is hashed and appended to the beginning (or end,) of the archive. Then, during integrity checking, this same hard coded value is hashed with the same algorithm, the same number of times, and the output is compared to the beginning (or end,) of the archive file. However, that would mean that the archive could be tampered with undetectably as long as those designated bytes are unaltered.

Another option would be to look for file-type signatures (for instance 'MZ' for .exes etc) following decryption and, if none are found, re-encrypt the archive with the same key. However, this would only work for established file-types and would also give attackers some insight into the contents of the archive. (They could attach a debugger to the integrity checking module, observe that the module is searching for a given file-type and thus have more information about the contents of the archive.)

I can think of a million and one ways of doing this (involving containers and recognition of basic container structures etc), but I'm unsure of the current gold standard practices for this type of thing.

If there's any crypto minded folk out there, any help would be massively appreciated!

Cheers,
Abyssion

**laserlight** · 03-13-2016

I am confused though: what exactly is the "input" that you are referring to in steps 2 and 4? Because of step 1, I think it is the password (i.e., you are preparing the password for use as a secret key), but perhaps I am mistaken. It would be better if you had started off with a preamble explaining that the given steps are for decryption, and instead of "input" use the specific term like "password" or "encrypted archive".

Originally Posted by Abyssion

Perhaps they hash the whole contents of the (encrypted) archive and append that to the beginning of the file during encryption. During integrity checking, they could repeat this process: hash the contents of the archive (minus the message digest previously appended,) and see if it matches the first 256 or 512 bits or so of the file. However, it seems as though this could easily be attacked by tampering with the archive as desired (probably corrupting the internal directory structures,) then manually hashing the tampered archive and replacing the first 32 or 64 bytes with the output of the hash function.

How is the attacker going to correctly hash the tampered encrypted archive without the password? Since you have a password, you would presumably be using a HMAC rather than just hashing the archive alone.
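As a rough illustration of that point, the sketch below uses Python's standard `hmac` module with SHA-256 as a stand-in for the unpublished hash. The random `key` is an assumption that takes the place of whatever key ends up being derived from the password, and `make_tag` and `verify_tag` are hypothetical helper names.

```python
import hashlib
import hmac
import os

def make_tag(key: bytes, ciphertext: bytes) -> bytes:
    """Keyed digest over the ciphertext."""
    return hmac.new(key, ciphertext, hashlib.sha256).digest()

def verify_tag(key: bytes, ciphertext: bytes, tag: bytes) -> bool:
    """Recompute the tag and compare in constant time."""
    return hmac.compare_digest(make_tag(key, ciphertext), tag)

key = os.urandom(32)  # stands in for the password-derived key
ciphertext = b"pretend this is ciphertext"
tag = make_tag(key, ciphertext)

# The attacker can change the bytes, but has to guess at the key.
tampered = b"Pretend this is ciphertext"
attacker_tag = make_tag(b"attacker does not know the key", tampered)

print(verify_tag(key, ciphertext, tag))         # True
print(verify_tag(key, tampered, tag))           # False
print(verify_tag(key, tampered, attacker_tag))  # False
```

Unlike a plain hash, recomputing the tag after tampering requires the key, which the attacker is assumed not to have.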

Originally Posted by Abyssion

Plus, this would mean that brute forcing the password would be unnecessary; attackers could brute force the actual archive itself until they hit a collision. (Now, granted, the archive itself is probably much larger than the password and therefore would take longer to brute force (and there is a small chance that they would hit a false-positive collision which, defensively, isn't a bad thing). However, the number of iterations that the archive is to be hashed with would likely be a lot lower than the number of iterations that the password is hashed with to facilitate speed of the integrity checking process, allowing for a more efficient attack vector.)

This does not make sense to me, but that could be because I am confused as stated earlier. The number of possible plaintext archives of the given size should be so large that trying to brute force it should be computationally infeasible, whereas passwords are often really small in comparison and possibly badly chosen.

**Abyssion** · 03-13-2016

Cheers for the reply laser!

Ahh yes, having reread my post, I realise that I could have been clearer. The given steps were indeed for decryption of an encrypted archive, and you're right about me wanting to use a transformation of the password input by the user as a decryption key. I believe that you were pointing me in the right direction with your reference to a HMAC.

My initial conception of the integrity checking process did not involve a HMAC and relied on simply hashing the archive alone. (This is why I thought attackers might be able to generate a hash of a tampered archive and overwrite the message digest at the beginning or end of the file that would have been used for comparison during the integrity check.) I now gather that using a HMAC is necessary for this type of work. I didn't consider using a HMAC initially, largely because I do not understand what HMACs are and how they are used. This is not due to lack of trying; I have indeed looked up HMACs online and even spent a few days watching cryptography courses on youtube trying to understand them. Unfortunately, time constraints got the better of me and I had to suspend my course watching (and therefore HMAC understanding,) until a later date.

If you could please give me a brief overview of what a HMAC is and how it is used, that would be great! Or if you happened to know of any resources that go over the underlying principles of them, that would also be wonderful. Short of that, it's back to cryptography 101 on YouTube.

I should also point out that the novel algorithms in question are designed specifically to withstand attacks from adversaries with massive amounts of parallel computing resources. (Certain three letter agencies spring to mind!) I realise that brute forcing an archive and comparing it to a value that is used in an integrity check is very infeasible on commercial machines, but I'm very unsure about how hard this would be to achieve on top-of-the-range supercomputing architecture. I thought I'd mention it in my initial post just in case it was a valid concern. Maybe I was after brownie points for putting a bit of thought into my question.

I would outline the algorithms that I am working with, but unfortunately, I have signed an agreement of non-disclosure with my academic establishment.

Hope this clarifies things a little,
Cheers,
Abyssion

P.S. For the given scenario, it is assumed that any attackers have complete, unrestricted access to the binary that performs the encryption and decryption of the archive and can thus disassemble, debug and patch it as they so desire. (It's just occurred to me, does this completely invalidate the use of a HMAC?)

**Abyssion** · 03-13-2016

Ok, so I've done some more reading. From what I can gather HMACs are derived from the message itself AND the encryption key; as such, it is impossible to forge a valid HMAC without the key (or password, in this case). So, assuming that, for simplicity's sake, numerous iterations are included in the definition of the hash function (H(.)), the *encryption* process would look something like this:

(where X is the plaintext archive, Xⁱ is the encrypted archive and E(.) is the encryption algorithm)

1. Prompt user for password (p). (Re-enter to confirm password).
2. Calculate H(p) to derive the encryption key (k).
3. Calculate E_k(X) to receive encrypted archive Xⁱ.
4. Calculate HMAC via H(MAC_k(X)).
5. Append HMAC to Xⁱ to acquire final encrypted archive: HMAC + Xⁱ

Does this look about right? So to verify the password, integrity check and decrypt, we would do the following (with (D(.)) being the decryption algorithm):

1. Prompt user for password (p).
2. Calculate H(p) to derive decryption key (k).
3. Calculate D_k(Xⁱ) to acquire *potential* plaintext archive X.
4. Calculate HMAC via H(MAC_k(X)) from the *potential* plaintext archive X.
5. Compare the newly generated HMAC with the HMAC that was appended to Xⁱ.

- If HMACs match, password is assumed correct and archive integrity is assumed. The decryption process ends.
- If HMACs do not match, password is assumed incorrect OR integrity check has failed. Either way, re-encrypt *potential* plaintext archive X with the key generated from the recently user-inputted password to reacquire Xⁱ. (Undamaged?)

Am I sort of thinking along the right lines here?
(I apologise about my poor sub-standard notation, but hopefully you can follow along.)

Cheers,
Abyssion

**laserlight** · 03-13-2016

Originally Posted by Abyssion

P.S. For the given scenario, it is assumed that any attackers have complete, unrestricted access to the binary that performs the encryption and decryption of the archive and can thus disassemble, debug and patch it as they so desire. (It's just occurred to me, does this completely invalidate the use of a HMAC?)

I presume they have such access to a copy of the binary, in which case the HMAC is still perfectly fine. If they have such access to the actual binary itself, it is game over since they can install code to intercept the password and/or plaintext archive.

Originally Posted by Abyssion

Am I sort of thinking along the right lines here?

Yes, but for encryption:

Originally Posted by Abyssion

4. Calculate HMAC via H(MAC_k(X)).

I would modify that to:
4. Calculate HMAC via H(MAC_k(Xⁱ)).

Then for decryption:

Originally Posted by Abyssion

3. Calculate D_k(Xⁱ) to acquire *potential* plaintext archive X.
4. Calculate HMAC via H(MAC_k(X)) from the *potential* plaintext archive X.

I would modify that correspondingly to:
3. Calculate HMAC via H(MAC_k(Xⁱ)) from the *potential* encrypted archive Xⁱ.
4. Calculate D_k(Xⁱ) to acquire plaintext archive X.
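The revised ordering (encrypt, then MAC the ciphertext; verify the MAC, then decrypt) can be sketched as follows. This is only an illustration under stated assumptions: HMAC-SHA256 stands in for the unpublished hash, and `toy_stream_xor` is a toy stand-in for E_k and D_k that should not be used for real data. The sketch also splits k into separate encryption and MAC subkeys, which is an added assumption (a common precaution) rather than something proposed in the thread.

```python
import hashlib
import hmac
import os

TAG_LEN = 32    # HMAC-SHA256 output size
NONCE_LEN = 16  # per-archive random value, playing the role of an IV

def subkeys(key: bytes):
    """Derive separate encryption and MAC keys from k (an added assumption)."""
    enc_key = hmac.new(key, b"enc", hashlib.sha256).digest()
    mac_key = hmac.new(key, b"mac", hashlib.sha256).digest()
    return enc_key, mac_key

def toy_stream_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Toy stand-in for E_k and D_k: XOR with an HMAC-derived keystream."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        counter = (offset // 32).to_bytes(8, "big")
        block = hmac.new(key, nonce + counter, hashlib.sha256).digest()
        out.extend(a ^ b for a, b in zip(data[offset:offset + 32], block))
    return bytes(out)

def encrypt_archive(key: bytes, plaintext: bytes) -> bytes:
    """Encrypt first, then compute the tag over the ciphertext (nonce included)."""
    enc_key, mac_key = subkeys(key)
    nonce = os.urandom(NONCE_LEN)
    body = nonce + toy_stream_xor(enc_key, nonce, plaintext)
    tag = hmac.new(mac_key, body, hashlib.sha256).digest()
    return tag + body

def decrypt_archive(key: bytes, archive: bytes) -> bytes:
    """Check the tag over the ciphertext first; decrypt only if it matches."""
    enc_key, mac_key = subkeys(key)
    tag, body = archive[:TAG_LEN], archive[TAG_LEN:]
    expected = hmac.new(mac_key, body, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("wrong password, or archive corrupted or tampered with")
    nonce, ciphertext = body[:NONCE_LEN], body[NONCE_LEN:]
    return toy_stream_xor(enc_key, nonce, ciphertext)

key = os.urandom(32)  # stands in for the key derived from the password
archive = encrypt_archive(key, b"archive contents")
print(decrypt_archive(key, archive))  # b'archive contents'
```

Because the tag is checked before any decryption happens, a wrong key or a modified archive is rejected without the decryption routine ever touching the data.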

**Abyssion** · 03-14-2016

Originally Posted by laserlight

I presume they have such access to a copy of the binary, in which case the HMAC is still perfectly fine. If they have such access to the actual binary itself, it is game over since they can install code to intercept the password and/or plaintext archive.

Ahh, right you are! I really need to learn to communicate effectively; yes, it is a copy of the binary that, in this scenario, any attackers would have access to. I should have clarified that initially.

Originally Posted by laserlight

Yes, but for encryption:
4. Calculate HMAC via H(MAC_k(X)).

I would modify that to:
4. Calculate HMAC via H(MAC_k(Xⁱ)).

Then for decryption:
3. Calculate D_k(Xⁱ) to acquire *potential* plaintext archive X.
4. Calculate HMAC via H(MAC_k(X)) from the *potential* plaintext archive X.

I would modify that correspondingly to:
3. Calculate HMAC via H(MAC_k(Xⁱ)) from the *potential* encrypted archive Xⁱ.
4. Calculate D_k(Xⁱ) to acquire plaintext archive X.

Thank you very much! Yes, this way round makes much more sense; it did seem a bit redundant to decrypt the archive with a potential password, only to re-encrypt it again if the HMACs did not match.

So, considering potential attacks:

If someone inputs the wrong password, the HMAC generated would not be correct and the application would not decrypt the archive. If an attacker were to force decryption by manually feeding an incorrect password to the key-derivation function and forcing the module to call the decryption function, the key would not be correct and the resulting "decrypted" data would be incorrect. Am I correct?

If someone were to tamper with the archive, again, the HMAC generated prior to decryption would not match the HMAC stored at the beginning of the archive. Decryption would thus be impossible via the application itself and forcing decryption through binary modification would result in seemingly random data. Is this correct?

From the two scenarios outlined above, I realise that the actions of inputting an incorrect password and inputting either a correct or incorrect password to try to decrypt a tampered-with archive yield the same results (i.e. a non-valid HMAC). Is there any way to distinguish between an incorrect password being provided or the encrypted archive being tampered with?

Also, to derive the encryption key from the password, is it sufficient practice to iterate over the hash function a huge number of times, or should a designated function, say PBKDF2, be used?

Another attack vector occurs to me: adversaries with massive computing power may find it easier to try to generate a valid HMAC from the encrypted archive itself via brute forcing of the key. Something like this:

Code:

1. HMAC_k0(Xⁱ)
2. HMAC_k1(Xⁱ)
3. HMAC_k2(Xⁱ) ... etc etc
4. HMAC_k(n-1)(Xⁱ)
5. HMAC_kn(Xⁱ)           <------------- Valid HMAC generated!

This attack would save having to attack the password itself and thus save repeatedly performing time consuming iterations of hash functions to derive a key for each password tested. I find myself wondering, how fast can HMACs be generated? I'm assuming the attack outlined above would still be less efficient than password brute-forcing or rainbow table attacks due to a huge keyspace? It just seems to me that any key-stretching used is pretty much rendered obsolete by this attack. (That is, unless HMAC computation is relatively time consuming?)

I apologise for so many questions and such long posts; I just want to learn how to do cryptography right, is all.

Many thanks for all your help laser, I am learning a lot!
Abyssion

**Abyssion** · 03-14-2016

Oh, and speaking of rainbow tables, would it be a good idea to salt the password with a per-archive salt?
And store the salt along with the HMAC of the archive?

This would, unless I'm mistaken, be more secure than using a constant hard coded salt for every archive because this latter method would allow rainbow tables to be generated. Am I thinking along the right lines?

Should the salt be appended before the HMAC and not be included in HMAC calculation, or after the HMAC and be included in HMAC calculation? The latter makes more sense to me, as we can determine if the salt itself has been tampered with in addition to the rest of the archive.

**laserlight** · 03-14-2016

Originally Posted by Abyssion

Yes, this way round makes much more sense; it did seem a bit redundant to decrypt the archive with a potential password, only to re-encrypt it again if the HMACs did not match.

I don't quite get the "re-encrypt it again": if the HMAC check fails, you should inform the user that either the password is wrong or the encrypted archive may have been corrupted or tampered with. Encrypting then computing HMAC of the ciphertext is better than encrypting and computing HMAC of the plaintext because the HMAC ensures the integrity of the ciphertext, which is after all what is being transmitted across the insecure channel. You avoid problems where the attacker might try to mess around with the ciphertext, get you to decrypt invalid ciphertext, or attempt to attack the HMAC to get clues about the plaintext.

Originally Posted by Abyssion

If someone inputs the wrong password, the HMAC generated would not be correct and the application would not decrypt the archive. If an attacker were to force decryption by manually feeding an incorrect password to the key-derivation function and forcing the module to call the decryption function, the key would not be correct and the resulting "decrypted" data would be incorrect. Am I correct?

Yes.

Originally Posted by Abyssion

If someone were to tamper with the archive, again, the HMAC generated prior to decryption would not match the HMAC stored at the beginning of the archive. Decryption would thus be impossible via the application itself and forcing decryption through binary modification would result in seemingly random data. Is this correct?

Yes.

Originally Posted by Abyssion

From the two scenarios outlined above, I realise that the actions of inputting an incorrect password and inputting either a correct or incorrect password to try to decrypt a tampered-with archive yield the same results (i.e. a non-valid HMAC). Is there any way to distinguish between an incorrect password being provided or the encrypted archive being tampered with?

Yes, but I think the ways of doing so may reduce the security, e.g., you could pass a cryptographic hash of the password, or pass a cryptographic hash (not HMAC) of the ciphertext.
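One hedged way to picture the trade-off: the sketch below stores a short password check value derived alongside the key, so a failed check can be reported as a wrong password before the HMAC is examined. This is a variation on the "hash of the password" idea, not the exact proposal above. PBKDF2-HMAC-SHA256 stands in for the thread's unpublished key derivation, and `derive`, `diagnose`, the 8-byte check length and the iteration count are all assumptions made for illustration.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # illustrative work factor

def derive(password: str, salt: bytes):
    """Return (key, check): a 32-byte key plus an 8-byte password check value."""
    material = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS, dklen=40
    )
    return material[:32], material[32:]

def diagnose(password: str, salt: bytes, stored_check: bytes,
             ciphertext: bytes, stored_tag: bytes) -> str:
    """Report which of the two checks failed, if any."""
    key, check = derive(password, salt)
    if not hmac.compare_digest(check, stored_check):
        return "wrong password"
    tag = hmac.new(key, ciphertext, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, stored_tag):
        return "archive corrupted or tampered with"
    return "ok"

salt = os.urandom(16)
key, check = derive("s3cret", salt)
ciphertext = b"pretend this is ciphertext"
tag = hmac.new(key, ciphertext, hashlib.sha256).digest()

print(diagnose("s3cret", salt, check, ciphertext, tag))         # ok
print(diagnose("guess", salt, check, ciphertext, tag))          # wrong password
print(diagnose("s3cret", salt, check, ciphertext + b"!", tag))  # archive corrupted or tampered with
```

The extra stored value is one more thing an attacker can test guesses against or overwrite. In this sketch the check value is not itself covered by the tag, so someone who alters it can make tampering look like a wrong password.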

Originally Posted by Abyssion

Also, to derive the encryption key from the password, is it sufficient practice to iterate over the hash function a huge number of times, or should a designated function, say PBKDF2, be used?

I think it may be sufficient, but if you can use algorithms specifically designed to frustrate relevant attacks, why not? I note that there are alternatives to PBKDF2 too, e.g., bcrypt, scrypt, Argon2.
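For reference, deriving a key with PBKDF2 takes only a few lines using `hashlib.pbkdf2_hmac` from the Python standard library. This is a minimal sketch with SHA-256 standing in for the novel hash; the salt length, iteration count and the `derive_key` helper are illustrative assumptions, not recommendations from the thread.

```python
import hashlib
import os

SALT_LEN = 16
ITERATIONS = 200_000  # illustrative; tune to the slowest hardware you must support

def derive_key(password: str, salt: bytes) -> bytes:
    """Stretch a password into a 32-byte key with PBKDF2-HMAC-SHA256."""
    return hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS, dklen=32
    )

salt = os.urandom(SALT_LEN)  # generated once per archive and stored in the clear
key = derive_key("correct horse", salt)
print(len(key))  # 32
```

The same password and salt always give the same key, so the salt has to travel with the archive for decryption to work.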

Originally Posted by Abyssion

Another attack vector occurs to me: adversaries with massive computing power may find it easier to try to generate a valid HMAC from the encrypted archive itself via brute forcing of the key.
(...)
This attack would save having to attack the password itself and thus save repeatedly performing time consuming iterations of hash functions to derive a key for each password tested.
(...)
It just seems to me that any key-stretching used is pretty much rendered obsolete by this attack.

Yes, I would expect this to be a likely attack, though it may be infeasible if the key is sufficiently large. However, this does not make key stretching obsolete because users may choose weak passwords.
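A back-of-the-envelope calculation shows why key size matters here. The sketch assumes a hypothetical attacker who can test 10^18 candidates per second, a figure picked purely for illustration, and compares a weak password space with two key sizes.

```python
import math

SECONDS_PER_YEAR = 365 * 24 * 3600

def years_to_search(space_size: int, guesses_per_second: float) -> float:
    """Expected years to hit the target after searching half the space."""
    return (space_size / 2) / guesses_per_second / SECONDS_PER_YEAR

RATE = 1e18  # hypothetical attacker speed in guesses per second

spaces = {
    "8 lowercase letters": 26 ** 8,
    "128-bit key": 2 ** 128,
    "256-bit key": 2 ** 256,
}
for label, size in spaces.items():
    print(f"{label}: ~2^{math.log2(size):.1f} candidates, "
          f"{years_to_search(size, RATE):.3g} years")
```

Key stretching raises the cost of each password guess but does nothing against a direct search of the key space. That direct search is only out of reach because the key space is so much larger than the space of passwords people actually choose.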

Originally Posted by Abyssion

I find myself wondering, how fast can HMACs be generated? (...) (That is, unless HMAC computation is relatively time consuming?)

They are typically relatively fast. However in post #3, you wrote that the "novel algorithms in question are designed specifically to withstand attacks from adversaries with massive amounts of parallel computing resources", and from what I understand the cryptographic hash function used for the HMAC is one of these primitives. Actually, this may mean that

Originally Posted by Abyssion

I'm assuming the attack outlined above would still be less efficient than password brute-forcing or rainbow table attacks due to a huge keyspace?
(...)
This would, unless I'm mistaken, be more secure than using a constant hard coded salt for every archive because this latter method would allow rainbow tables to be generated. Am I thinking along the right lines?

Rainbow tables are not applicable here: you are not storing/transmitting the passwords in a hashed form such that the attacker could try to obtain the password from the hash. The HMAC involves a hash primitive and a key derived from the password, but you could say that the ciphertext itself acts as a huge salt that will defeat the use of rainbow tables.

That said, a per-archive salt could still be useful to ensure that encrypting the same archive twice with the same password results in different HMACs for the ciphertext. However, I recall that the encryption would already be using initialisation vectors, so this may be unnecessary as they would already result in different ciphertext.

Originally Posted by Abyssion

Oh, and speaking of rainbow tables, would it be a good idea to salt the password with a per-archive salt?
And store the salt along with the HMAC of the archive?

With algorithms like PBKDF2 you will use a salt, so you would need to pass the salt along if it is per-archive.

Originally Posted by Abyssion

Should the salt be appended before the HMAC and not be included in HMAC calculation, or after the HMAC and be included in HMAC calculation? The latter makes more sense to me, as we can determine if the salt itself has been tampered with in addition to the rest of the archive.

The latter, of course.
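A minimal sketch of that layout, assuming the archive is stored as tag || salt || ciphertext with the tag computed over salt || ciphertext. PBKDF2-HMAC-SHA256 and HMAC-SHA256 stand in for the unpublished primitives, the ciphertext is taken as given, and `seal` and `verify` are hypothetical names.

```python
import hashlib
import hmac
import os

SALT_LEN, TAG_LEN = 16, 32
ITERATIONS = 100_000  # illustrative work factor

def derive_key(password: str, salt: bytes) -> bytes:
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)

def seal(password: str, ciphertext: bytes) -> bytes:
    """Layout: tag || salt || ciphertext, with the tag covering salt || ciphertext."""
    salt = os.urandom(SALT_LEN)
    key = derive_key(password, salt)
    tag = hmac.new(key, salt + ciphertext, hashlib.sha256).digest()
    return tag + salt + ciphertext

def verify(password: str, archive: bytes) -> bool:
    """Read the salt, re-derive the key, and check the tag over salt || ciphertext."""
    tag = archive[:TAG_LEN]
    salt = archive[TAG_LEN:TAG_LEN + SALT_LEN]
    ciphertext = archive[TAG_LEN + SALT_LEN:]
    key = derive_key(password, salt)
    expected = hmac.new(key, salt + ciphertext, hashlib.sha256).digest()
    return hmac.compare_digest(tag, expected)

archive = seal("s3cret", b"pretend this is ciphertext")
print(verify("s3cret", archive))  # True

# Flip one bit inside the salt field.
damaged = bytearray(archive)
damaged[TAG_LEN] ^= 0x01
print(verify("s3cret", bytes(damaged)))  # False
```

A modified salt would already change the derived key, so the check fails either way; covering the salt with the tag simply means every stored byte after the tag is authenticated.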

Originally Posted by Abyssion

I apologise for so many questions and such long posts; I just want to learn how to do cryptography right, is all.

No problem, but shouldn't you be posing these questions to the cryptographers at your university? After an introductory module to computer security, I took an undergraduate final year/post-graduate first year module in computer security that included cryptography and cryptographic protocols as a major component, and while I scored an A and A- respectively, these are the kind of people who teach such modules and even higher level post-graduate modules. I have to warn that I could be wrong; these are people who get consulted by the industry on how to do cryptography right, and get interviewed by the media as experts outside of the government agencies.

**Abyssion** · 03-14-2016

Originally Posted by laserlight

I don't quite get the "re-encrypt it again": if the HMAC check fails, you should inform the user that either the password is wrong or the encrypted archive may have been corrupted or tampered with.

Yeah, of course; with your suggestion to generate the HMAC from the key and ciphertext, as opposed to the key and plaintext, this idea becomes void. If the HMAC check fails, the archive will not be decrypted (either correctly or incorrectly,) anyway. Please ignore this idea.

Originally Posted by laserlight

Yes, but I think the ways of doing so may reduce the security, e.g., you could pass a cryptographic hash of the password, or pass a cryptographic hash (not HMAC) of the ciphertext.

Noted; thank you! Yeah, I was thinking along the lines of using a cryptographic hash of the password to determine if the inputted password was incorrect. As you say, though, this probably decreases security unnecessarily, when one could just inform the user that the password was incorrect or the archive has been tampered with and have done with it.

Originally Posted by laserlight

I note that there are alternatives to PBKDF2 too, e.g., bcrypt, scrypt, Argon2.

I shall look into these, cheers for the pointers. Unless I'm mistaken, I don't believe that crypto is your primary field. However, if you happened to have experience with any of these alternative algorithms, do you have a preference?

I also should re-read the specification for PBKDF-2; when I preliminarily looked it up, it seemed to consist of "just hash things a ton of times, instead of once." From my recent readings though, I believe that numerous iterations are an integral part of any cryptographic hash function anyway, so presumably PBKDF-2 brings something new to the table.
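For comparison, here is roughly what PBKDF2 adds on top of "hash it many times", written out longhand for a single 32-byte output block with HMAC-SHA256 as the pseudorandom function. It follows the published PBKDF2 definition (PKCS #5, RFC 8018) as I understand it; the function names are hypothetical, and in practice `hashlib.pbkdf2_hmac` is the thing to call.

```python
import hashlib
import hmac

def naive_iterated_hash(password: bytes, iterations: int) -> bytes:
    """The 'just hash it a ton of times' approach: no salt, no keyed PRF."""
    digest = password
    for _ in range(iterations):
        digest = hashlib.sha256(digest).digest()
    return digest

def pbkdf2_sha256_one_block(password: bytes, salt: bytes, iterations: int) -> bytes:
    """First 32-byte block of PBKDF2-HMAC-SHA256, written out step by step."""
    # U_1 = HMAC(password, salt || INT(1)); the salt and block index enter here.
    u = hmac.new(password, salt + (1).to_bytes(4, "big"), hashlib.sha256).digest()
    result = int.from_bytes(u, "big")
    for _ in range(iterations - 1):
        # U_j = HMAC(password, U_{j-1}); the password keys every round.
        u = hmac.new(password, u, hashlib.sha256).digest()
        # The output is the XOR of all U_j, not only the last one.
        result ^= int.from_bytes(u, "big")
    return result.to_bytes(32, "big")

longhand = pbkdf2_sha256_one_block(b"password", b"salt", 1000)
library = hashlib.pbkdf2_hmac("sha256", b"password", b"salt", 1000, dklen=32)
print(longhand == library)  # True
```

The visible differences from naive iteration are the salt, the use of HMAC keyed by the password in every round, and the XOR accumulation of all intermediate values.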

Originally Posted by laserlight

Yes, I would expect this to be a likely attack, though it may be infeasible if the key is sufficiently large. However, this does not make key stretching obsolete because users may choose weak passwords.

The point raised about users choosing weak passwords: very true. Indeed, as time is not really a factor here as it would be in, say, a VPN service, the keys supplied to the algorithms will be relatively huge.

Originally Posted by laserlight

Rainbow tables are not applicable here...

I think I may have misspoken; perhaps 'rainbow tables' wasn't the right term to use. I was considering that, if a constant hard coded salt is used for all archives, attackers could disassemble the archive or binary to recover the salt, and then build a database of HMACs based on this salt + a list of dictionary words (for example) as passwords. However, you're right, of course: the ciphertext itself acts as a salt-type construct so any HMAC databases built from a given encrypted archive would only be applicable to identical copies of the archive that have the same password. As such please ignore my 'rainbow table' reference also.
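A small experiment may help separate the two things being salted here. The sketch assumes PBKDF2-HMAC-SHA256 as the key derivation and HMAC-SHA256 over the ciphertext as the tag; the constant salt, word list and helper names are hypothetical, and the iteration count is kept low only so the demo runs quickly.

```python
import hashlib
import hmac
import os

ITERATIONS = 1_000  # kept low for the demo; real values are far higher
HARD_CODED_SALT = b"same-in-every-archive"  # hypothetical constant salt

def derive_key(password: str, salt: bytes) -> bytes:
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)

def tag_for(key: bytes, ciphertext: bytes) -> bytes:
    return hmac.new(key, ciphertext, hashlib.sha256).digest()

# The attacker pays the key-stretching cost once per dictionary word...
wordlist = ["letmein", "password", "hunter2"]
precomputed_keys = {word: derive_key(word, HARD_CODED_SALT) for word in wordlist}

def guess_password(ciphertext: bytes, tag: bytes):
    """...and afterwards each guess against any archive costs one fast HMAC."""
    for word, key in precomputed_keys.items():
        if hmac.compare_digest(tag_for(key, ciphertext), tag):
            return word
    return None

# Two different archives protected with the same weak password.
archive_a = os.urandom(64)
archive_b = os.urandom(64)
const_key = derive_key("hunter2", HARD_CODED_SALT)
print(guess_password(archive_a, tag_for(const_key, archive_a)))  # hunter2
print(guess_password(archive_b, tag_for(const_key, archive_b)))  # hunter2

# With a random per-archive salt the precomputed keys no longer match.
salted_key = derive_key("hunter2", os.urandom(16))
print(guess_password(archive_a, tag_for(salted_key, archive_a)))  # None
```

In this model the ciphertext varies the tag from archive to archive, but it does not vary the derived key. A table of precomputed keys therefore still lets the attacker skip the stretching step for every archive that shares the constant salt.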

As you point out, it seems that large pseudorandomised initialisation vectors will mimic the functionality of a per-archive salt, so if it is safe to do so, I will exclude salts from the implementation and have one less thing to worry about

(And not use a salt + password-based key derivation algorithm.)

Originally Posted by laserlight

The latter, of course.

Thought so; thank you for the clarification!

Originally Posted by laserlight

No problem, but shouldn't you be posing these questions to the cryptographers at your university? ...

To be honest, neither computer science, software engineering nor cryptography is my current area of study. I'm primarily a scientist; programming and computer security are just hobbies of mine that have developed over the past couple of years. Considering this, I'm not sure how professors would react to being asked about something that they get paid to teach by a student who, essentially, has nothing to do with their field. I guess some might be pleased that a genuine interest is being shown in their field, but they might not consider it worth a 'freebie' when it comes to sharing information.

The gentlemen who are researching the novel crypto algorithms are PhD students. I put myself forward to help them out because their work sounded interesting (and now that I've seen what they're up to, I'm glad that I did!) I could have brought these questions to them but, in all honesty, I trust the members of this board more because I have been coming here, on and off, for the past 10 years! I know that many members here are extremely knowledgeable IT professionals and, when I see that a prominent figure such as... I don't know, "laserlight", has commented on one of my threads, I get a little bit excited!

Actually, I believe you've helped me with crypto in the past Laser; I was operating under a different username, but that account got suspended (I believe for inactivity, although I never followed it up) so I assumed a new username.

Anyway, bottom line: you've helped me out loads with this and I appreciate the time and effort you have put into it massively!

Thank you very much!
Abyssion
