Cryptographic Accumulator

I’ve been following this thread that mostly boils down to this: if I have the SHA-256 hash of a file, can I use that to get the file from IPFS? The short answer is “not really,” because of the Merkle-DAG.

The slightly-longer answer: the Merkle-DAG exists to allow chunking files and verifying those chunks as they arrive, and SHA-256 has no facility for combining the hashes of two components into the hash of the combined block. So you can’t find a file on IPFS by the hash of the entire file in a way that lets you verify each block belongs to that hash without downloading the entire file. Instead you are limited to a lookup from SHA-256 hash to CID, downloading the entire file, then verifying the hash is correct.

And by “combine hashes”, I mean there is no function f() that could exist such that:

f(SHA-256(A), SHA-256(B)) = SHA-256(A || B)

except in the case that SHA-256 has been broken, at which point it is useless. I won’t go into why here, but if you are both curious and technically inclined, the important part is the “Pre-Processing” step described in this article about SHA-256, and the fact that it includes the length of the input as a parameter.
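
To see the problem concretely in Ruby (the chunk strings are arbitrary examples of mine):

require 'digest'

a = "Hello, "
b = "world!"

puts Digest::SHA256.hexdigest(a + b) # digest of the whole file
puts Digest::SHA256.hexdigest(a)     # digest of the first chunk
puts Digest::SHA256.hexdigest(b)     # digest of the second chunk
# The two chunk digests give you no way to compute, or verify against,
# the whole-file digest without downloading all of the data.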

But there is another cryptographic primitive, the accumulator, that would allow each block to be verified as it is downloaded, while also allowing the file to be chunked into any number of different segments that all “hash” to the same value.

As I am only familiar with the RSA accumulator, that is what I will discuss. Apparently, there are also bilinear-map accumulators. Last time I looked into it, I came to the conclusion that elliptic-curve accumulators are not possible, but I would welcome being proven wrong, as ECC keys are significantly smaller than RSA keys at the same security level.

RSA Accumulators

RSA accumulators are closely related to RSA public-private key cryptography, and take a very similar form:

acc = g^(x_0 * x_1 * ... * x_n) mod N

where N is the product of two secret primes, g is a base value, and the x_i are the prime-number members of the set.

The accumulator value has a nifty property: for any subset of the prime numbers in the set, you can generate a witness value that, together with that subset, proves those primes are in the set. To make things less abstract, I’m going to choose numbers from this list of primes:

p = 2089
q = 2503

# N, the modulus, is the product of two secret primes
N = p * q
puts "# N = #{N}"
# N = 5228767

# m is the base (the g in the equation above)
m = 97

# the five prime members of the set
p_0 = 719
p_1 = 1481
p_2 = 1993
p_3 = 2027
p_4 = 1381

# acc = m^(p_0 * p_1 * p_2 * p_3 * p_4) mod N, via repeated modular exponentiation
acc = [p_0, p_1, p_2, p_3, p_4].reduce(m) {|a,b| ( a ** b ) % N }
puts "# acc = #{acc}"
# acc = 1493320

There are five members of this accumulator, with a value of 1493320. You can verify this yourself by entering the above into irb or an online Ruby interpreter like this one. The above code calculates the following equation:

acc = 97^(719 * 1481 * 1993 * 2027 * 1381) mod 5228767 = 1493320

Calculating the witness value is very similar to calculating the accumulator value, but you leave out the items that are being witnessed:

# everything except 1481 and 2027, the two members being witnessed
witness = [719,1993,1381].reduce(97) {|a,b| ( a ** b ) % N } # 1653509

Which is the same calculation as:

witness = 97^(719 * 1993 * 1381) mod 5228767 = 1653509

To verify the witness is valid, and that the witnessed primes are part of the accumulator, raise the witness to those primes and check that the result equals the accumulator value:

# raising the witness to the witnessed primes reproduces the accumulator value
[1481,2027].reduce(1653509) {|a,b| ( a ** b ) % N } # 1493320

Beyond Primes

The main drawback of RSA accumulators is that they require prime numbers to be secure. A composite member, the product of two or more primes, would overlap the primes it is made of: add 15 = 3 * 5 to the set, and 3 and 5 are effectively members too, so witnesses could be forged for values that were never added.
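
A quick demonstration, reusing the modulus N from earlier:

# If the composite 15 = 3 * 5 is the only member, a witness for 3
# (which was never added) still verifies:
acc15 = 97 ** 15 % N  # accumulator containing just the member 15
w     = 97 ** 5  % N  # forged "witness" for 3
puts w ** 3 % N == acc15 # => true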

Prime numbers are fine for math departments, but are not so useful when we want an accumulator that holds the kind of information you run into when dealing with computers. Like strings:

data = "Hello, world!"

acc(data) # How is this done?!

What we need is a function that we can feed a string, and have it give us a prime number deterministically derived from that string. The primitives we need for this already exist and just have to be assembled.

  • Random prime number generation
  • Cryptographically secure pseudorandom number generation
  • Cryptographic hash function (e.g., SHA-256)

Prime numbers can be found relatively easily by picking a random number, then checking whether it is prime with a test like the Miller-Rabin primality test. If it isn’t, choose another random number and test again, repeating until you get a prime.
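
As a sketch, that pick-and-test loop in Ruby might look like this, using OpenSSL’s BN#prime? (a Miller-Rabin test) as the check:

require 'openssl'

# Pick random odd candidates until one passes the Miller-Rabin test.
def random_prime(bits)
  loop do
    candidate = OpenSSL::BN.rand(bits).to_i | 1 # random candidate, forced odd
    return candidate if OpenSSL::BN.new(candidate.to_s).prime?
  end
end

puts random_prime(64)

(OpenSSL also packages this whole loop up as OpenSSL::BN.generate_prime.)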

This works fine if all you want is any prime number, but we want one that we can get to repeatedly. Thankfully, anywhere a true random number source can be used, we can plug in a pseudorandom generator, which gives the same sequence of results for a given seed value but cannot be distinguished from a truly random source without breaking cryptography or having the seed. I’m a fan of the Blum Blum Shub pseudorandom number generator because it is simple to understand and implement. Here is what it looks like:

x_{n+1} = x_n^2 mod M

where M is the product of two large primes, each congruent to 3 mod 4.

We still need a seed value, an x_0, and for that we circle around to where we started: cryptographic hashes. Feed in a string, and it spits out a string of a fixed size. That fixed-size string also represents a number (two, actually, depending on byte order: big-endian or little-endian). So we end up with a procedure, sketched in code after this list:

  1. Calculate SHA-256 of the input data
  2. Construct Blum Blum Shub pseudorandom number generator
  3. Generate prime with BBS and Miller-Rabin
  4. Add prime to accumulator
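
Here is a minimal, insecure sketch of that procedure in Ruby. The tiny BBS modulus, the 32-bit prime size, and the hash_to_prime name are illustrative choices of mine, not part of any real system:

require 'digest'
require 'openssl'

# Toy Blum Blum Shub modulus: both primes are congruent to 3 mod 4, but far
# too small to be secure. A real system would use large secret primes.
BBS_M = 1000003 * 1000039

def hash_to_prime(data, bits: 32)
  # 1. Seed BBS with the SHA-256 of the input (first 8 bytes, big-endian)
  x = Digest::SHA256.digest(data).unpack1('Q>') % BBS_M
  # 2. BBS: square mod M, emit the low bit of each state
  next_bit = -> { x = (x * x) % BBS_M; x & 1 }
  # 3. Assemble candidates and test with Miller-Rabin (OpenSSL::BN#prime?)
  loop do
    candidate = Array.new(bits) { next_bit.call }.join.to_i(2) | 1 # forced odd
    return candidate if candidate > 2 && OpenSSL::BN.new(candidate.to_s).prime?
  end
end

# 4. The result is what gets multiplied into the accumulator
puts hash_to_prime("Hello, world!")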

Arbitrary File Chunking

The above gives us the ability to take an arbitrary set of strings and add them to a cryptographic accumulator, as well as generate witness values proving that a portion of those strings is part of the whole set. Since we started with file chunking in IPFS, let’s think about the ways a file can be chunked.

One way to create the set of strings to form the accumulator is to create strings that encode each byte with its position in the file, like this:

res = "Hello world".each_char.with_index.map {|ch, i| "#{i}:#{ch}" }

puts res.inspect
# ["0:H", "1:e", "2:l", "3:l", "4:o", "5: ", "6:w", "7:o", "8:r", "9:l", "10:d"]

Instead of the secure version of prime selection, I’m going to use a much-simplified and completely insecure version for illustration. (Note that Ruby’s String#hash is seeded per process, so running this yourself will produce different numbers.) The +1 is there because the list of primes is 1-indexed, while Ruby is 0-indexed.

res.map {|s| "#{s} => #{s.hash%1000 + 1}" }
# ["0:H => 643", "1:e => 694", "2:l => 260", "3:l => 130", "4:o => 35", "5:  => 566",
# "6:w => 235", "7:o => 268", "8:r => 639", "9:l => 587", "10:d => 762"]

Manually looking up primes in the list of primes referenced earlier gives us this:

res_primes = [  [643,4787],  [694,5227],  [260,1657],  [130, 733],  [ 35, 149],
                [566,4111],  [235,1483],  [268,1721],  [639,4733],  [587,4273],
                [762,5807] ]

N = 5228767 # the same modulus as before

# acc = 97^(product of all eleven primes) mod N
acc = res_primes.map {|i| i[1] }.reduce(97) {|a,b| ( a ** b ) % N }
puts "# acc = #{acc}"
# acc = 1620616

I will leave calculating and checking the various witness values to you, the reader (one worked example follows below). Mostly because I’ve already spent two days on this post, and I’m ready for it to be done and published.

With this, we can mix and match which bytes we want to include in each block, and as long as a valid witness is included with the response, the client can prove that each byte is part of the file. Given a copy of each byte along with its offset, the file can be rebuilt. The resulting file can also be passed through this same procedure, and in every case we end up with the same accumulator value.
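
As a quick sketch of what such a check looks like, continuing the snippets above (N, res_primes, and acc are assumed to still be defined), here is a witness for the bytes at offsets 4 and 7, the two “o”s:

# primes for "4:o" (149) and "7:o" (1721), the members being witnessed
claimed = [149, 1721]
rest    = res_primes.map {|i| i[1] } - claimed

# the witness is the accumulator computed over everything *except* the claim
witness = rest.reduce(97) {|a,b| ( a ** b ) % N }

# raising the witness to the claimed primes must reproduce the accumulator
check = claimed.reduce(witness) {|a,b| ( a ** b ) % N }
puts check == acc # => true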

But there is a downside to using byte elements.

It is slow.

Very, very slow.

If you have ever generated an RSA key using GPG, you know it takes a long time to generate the key. It takes that long because it has to find very large prime numbers. With what I just described, you would have to do that for every single byte in the file.

Practical Systems

Any practical system built on this would not use single-byte blocks, but rather a block size of 512 B, 1 KiB, 4 KiB, or another power of 2. A response would contain a number of contiguous blocks along with the first block’s file offset and a witness value.

The first block starts with the number of bytes in the file as an 8-byte header, so files can have sizes other than multiples of the block size. CIDs would be constructed from the accumulator value and the initial block offset, except for the first block, whose CID is the accumulator value alone and which doubles as the starting point of the file.
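
For instance, the header might be packed like this (big-endian is an assumption of mine; nothing above pins down a byte order):

# 8-byte length header prepended to the first block
header = [data.bytesize].pack('Q>') # 64-bit unsigned, big-endian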

Downloading would start by fetching the first block/chunk and reading the file size, then successively calculating the CID of the next chunk from the current offset and the number of blocks in the current chunk, and downloading it. Optionally, a list of chunks could be included to make it possible to download chunks in parallel, with a structure like a skip list.
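
A hypothetical sequential download loop could look like the following sketch, where fetch(cid) is an assumed helper returning entries shaped like the map below, and the 512-byte block size is my own choice:

BLOCK = 512

def download(acc)
  chunk  = fetch(acc.to_s)                # the first chunk doubles as the file start
  size   = chunk[:data].unpack1('Q>')     # 8-byte length header
  data   = chunk[:data].byteslice(8..)
  offset = chunk[:data].bytesize / BLOCK  # blocks consumed so far
  while data.bytesize < size
    chunk   = fetch("#{acc}#{offset}")    # next CID: accumulator value + block offset
    data   << chunk[:data]
    offset += chunk[:data].bytesize / BLOCK
  end
  data.byteslice(0, size)
end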

Once a block is downloaded, sub-blocks can have their witness values calculated from only the data in that block, and those CIDs can be added to the DHT. However, combining blocks or re-chunking requires the entire file. By design, accumulators can’t remove members without knowing the prime factors of the modulus, and knowing those factors allows one to calculate a witness “proving” any value is part of the set, so it is best that nobody knows the factors, or at least as few people as possible.

Put together, the DHT entries for a file might look something like this:

{
  "#{acc}"   => { offset:  0, data: blocks[ 0.. 5].join, chunks:[6,16,18] },
  "#{acc}6"  => { offset:  6, data: blocks[ 6..15].join },
  "#{acc}16" => { offset: 16, data: blocks[16..17].join },
  "#{acc}18" => { offset: 18, data: blocks[18..21].join }
}

References

While writing this article, I used the following references:

  1. http://www.dragonwins.com/domains/getteched/crypto/modular_arithmetic_intro.htm
  2. https://en.wikipedia-on-ipfs.org/wiki/List_of_prime_numbers.html
  3. https://qvault.io/2020/07/08/how-sha-2-works-step-by-step-sha-256/

Side Note

This article uses SVG equations generated by this LaTeX to Image converter. I tried setting it up on my computer, but the Docker app wouldn’t run. (Docker and I do not get along.)
