Interpretation of Arweave Version 17 White Paper (3): The Argument of SPoRes Game
This article is somewhat hardcore, as it will involve a lot of mathematical derivations and proofs, so friends should be prepared. However, the author believes that mathematics, as a fundamental discipline, has a strong precision in logically explaining a phenomenon, so let me try to showcase the beauty of mathematics in the mechanism of @ArweaveEco using simple language.
In the article "White Paper Interpretation (Part 2)," we described how the Succinct Proof of Access ( #SPoA ) game can be used by any participant to prove that they indeed stored some data at a specific location in the dataset. This model can also be used to create a second game, which we call Succinct Proofs of Replications ( #SPoRes ), which will allow the prover to demonstrate to the verifier that they have stored a certain number of data copies with minimal data transmission and verification costs, regardless of the dataset size.
✨✨✨✨✨✨✨✨✨✨✨✨✨✨✨✨✨✨✨✨✨
SPoRes Game
The Succinct Proofs of Replications (SPoRes) game proceeds as follows. Alice claims that she has stored n unique copies of a Merkleized dataset. Bob wants to verify whether Alice is lying. To do this, he gives Alice a difficulty parameter d and a randomly generated seed Seed. Alice uses this Seed to generate a VDF hash chain that emits a checkpoint every second, which can be used to unlock up to k SPoA challenges within the dataset. Whenever she has corresponding packed data blocks for these challenges, Alice can construct the corresponding SPoA proofs. She then hashes these proofs and compares them with Bob's difficulty parameter d. If the proof hash value is greater than d, then Alice has found a valid solution and sends the corresponding proof to Bob. Bob records the time from sending the random seed to receiving Alice's valid proof.
Based on the difficulty d, the expression for the probability p of finding a valid solution in a single attempt is:
Formula annotation: Since the SHA 256-based algorithm is a 256-bit string, there are 2^256 (where " ^ " denotes exponentiation) maximum hash values. To find a valid solution greater than d, there are only 2^256-d hash values available, from which we can calculate the probability of success in a single attempt.
If Alice has n copies and each copy can make k attempts per second, then the probability of her finding a solution in any given second is:
Formula annotation: kN means the total number of hash solution attempts Alice can make in one second, which is proportional to the number of unique copies Alice has stored. For a detailed explanation, please refer to the previous article "Arweave 2.6 Maybe More in Line with Satoshi's Vision." 1-p is the probability of failure in a single attempt. The probability value p2 is the difference between 1 and the probability of failure in kN attempts.
By substituting the above two formulas, we can derive:
Formula annotation: The result obtained by substituting p into the formula for P2.
The time required for Alice to submit a proof can be modeled using a geometric random variable X ~ geom(p2), where p2 is the probability of success. This random variable X depends on d, k, and n. To verify whether Alice is lying, Bob sends a difficulty d such that Alice can submit a proof to Bob every 120 seconds, given that she has n copies. This means that Alice's success probability in 120 seconds is:
Formula annotation: 1/120 is the probability of successfully submitting a proof within 120 seconds.
If Alice can submit the proof within the required time, she is likely to have the correct number of unique copies. However, a single proof is not enough for Bob to be convinced that Alice is not lying. Bob can only reasonably believe that Alice has stored the expected amount of data if she can consistently submit proofs every 120 seconds over a long period.
Next, we attempt to quantify Bob's certainty regarding the number of copies Alice has stored.
Suppose Alice claims to have stored 20 copies and consistently submits proofs at an average of every 120 seconds over a period of 2 weeks (a total of 10,080 proofs submitted). At this point, Bob wonders what the probability is that if Alice only stored 19 (or fewer) copies, she could still submit these proofs. This corresponds to the probability of providing proofs at different seconds:
Formula annotation: Here, 19k represents that Alice has 19 copies, each with k attempts to find a solution.
Over-simplifying:
This probability can be calculated using the d value generated for honest Alice by Bob. If she stored 19 or fewer copies, the probability of her providing a proof in one second can be derived from the following series of derivations. Here, p2 represents the probability of Alice honestly storing 20 copies and generating a proof in one second, while p2* represents the probability of providing a proof in one second if she is lying (storing ≤ 19 copies):
This derivation process may seem complex, but anyone with basic mathematical skills can understand it. The derivation process is simply the substitution process mentioned above.
Thus, p2* = 0.00791832081, corresponding to an expected proof generation time of 126.2894 seconds. Let X* be the time Alice takes to submit a proof derived from p2, which means X ~ geom(p2). This represents the expected (mean) time X that Alice spends submitting proofs when lying—only storing 19 copies:
Formula annotation: E[X*] indicates that the average time for Alice to successfully submit a proof, given that she only stored 19 copies, is 126.2894 seconds. If she stored 20 copies, the average successful submission time would be 120 seconds.
We can use the Central Limit Theorem to estimate the probability that the sample mean is less than 120, which differs from the expected value E[X*] obtained above. This probability is expressed as:
Formula annotation: The expression on the left side of the equation represents the probability that Alice's average submission time for proofs over two weeks is less than or equal to 120 seconds. The right side is a variant of the left side expression.
For a large number of samples, this inequality on the left approaches a normal distribution with a mean of E[X] and a variance of σX^2/n, where σX* is the standard deviation of Xd, and n (here = 10,080) is the sample size. Therefore, the above formula can be transformed into:
We can convert this distribution into standard normal form to obtain the equivalent probability:
Finally, using the standard normal distribution table, we can obtain the probability that Alice is lying during this two-week period:
Thus, in this example, Bob can determine with approximately 99.99997416% certainty that Alice has stored more than 19 copies of the dataset during these two weeks. We note that if Alice stored fewer than 19 copies, the expected proof submission time would exceed 126.3 seconds. We also note that the entire proof system only requires a proof to be transmitted every 120 seconds. This results in an average data transmission rate of 2.389 KB per second (280 KB/120 seconds) between Alice and Bob, comparable to the synchronization overhead of the Bitcoin network.
Finally, these probabilities do not change with arbitrary increases in dataset size, while increasing the sampling period will continue to improve Bob's certainty regarding the number of copies Alice has at a superlinear rate (Figure 1).
Figure 1: In the Succinct Proofs of Replications (SPoRes) game, certainty increases superlinearly with the duration of the sample. After two weeks of sample collection, the probability of maintaining fewer than 19 copies is negligible (P < 0.0000002584).
Thus, a SPoRes game is defined by the following parameters:
Where:
r = The Merkle root of the dataset used for the game.
k = The maximum number of challenges unlocked per second for each copy.
d = The difficulty parameter that determines the success probability for each attempt.
✨✨✨✨✨✨✨✨✨✨✨✨✨✨✨✨✨✨✨✨✨
Through these interesting mathematical derivations, we can confidently determine that as long as miners continuously submit proofs within a reasonable timeframe, we can ascertain whether the data is stored. This forms the core of Arweave's consensus mechanism aimed at decentralized storage. The fun of mathematics is just beginning, and there will be more interesting deductions and proofs in the upcoming articles.