-
Notifications
You must be signed in to change notification settings - Fork 170
[wip] update fault handling and recovery #379
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This wasn't what I was expecting, which was simply removing the collateral penalties from reported and unreported storage fault cases.
Storage collateral and the windows for client arbitration and collateral lock-up do need to be defined, but the words here have left me more confused. See also issues I filed #385 #386.
- `PoStProvingPeriod: UInt = 24 * 60 * 60 / 30 = 2880` (1 day, given `30s` block time) | ||
- `PoStTimeout: UInt` (less than 1 ProvingPeriod) | ||
- `SectorFailureTimout: UInt` (more than 1 Proving Period) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This adds a lot of complexity that I wasn't expecting.
What's the motivation for these additional periods?
Do I understand correctly that there are now five different periods to handle:
- on-time proof
- between proving period end (PPE) and PPE + post timeout
- between PPE + post timeout and PPE + sector failure timeout
- between PPE + sector failure timeout and PPE + generation attack time
- no submission when PPE + generation attack time has expired
How do these timeouts relate to the generation attack time, still referenced below? My understanding was that GAT was expected to be less than a proving period, so by transitivity less than sector failure timeout (and hence the sequence I wrote above doesn't make sense).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The generation attack time will go away entirely in these constructions. It will be only used to reason about these timeouts when defining them, but not appear in the spec in the current construction.
This leaves us with following periods
- on time
- between PPE and PoStTimeout
- between PoStTimeout and SectorFailureTimeout
- no submission/submission after SectorFailureTimeout
- **Condition**: If the miner posts their PoSt after the proving period ends, but before `PoStTimeout` | ||
- **Reporting:** The miner submits their PoSt as usual, but includes the `LateSubmissionFee`. | ||
- **Penalization:** | ||
- *Economic penalization*: To determine the penalty amount, `ComputeLateSubmissionFee(minerPower)` is called. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The actors spec has this as ComputeLateFee(miner.Power, chain.Now()-miner.ProvingPeriodEnd)
, i.e. a function of lateness as well as power. Which is it? To avoid divergence, I propose specifying the type signature in only one place (probably the actors).
Related: are late fees and other FIL-denominated amounts expected to be a function of FIL circulation, in the same way that pledge collateral is? If not, it would help to have a clear statement somewhere to that effect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See the "wip" in the title, not all references have been updated yet. The fees will be a fixed amount only depending on the power, not the delay anymore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Related: are late fees and other FIL-denominated amounts expected to be a function of FIL circulation, in the same way that pledge collateral is? If not, it would help to have a clear statement somewhere to that effect.
I am not sure, I think this is more complex topic that needs deeper thought, when we go about actually setting these fees to a certain value.
- **Reporting:** The miner submits their PoSt as usual, but includes the `LateSubmissionFee`. | ||
- **Penalization:** | ||
- *Economic penalization*: To determine the penalty amount, `ComputeLateSubmissionFee(minerPower)` is called. | ||
- *Power penalization*: The miners' power is reduced to `0`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a big change. Previously, a miner maintained power when late, until the generation attack threshold was exceeded (which made sense to me). But I'm also confused about when it should take effect. This stanza is for when a PoSt is received before the PoStTimeout, but the recovery below states that power is restored matching the submitted PoSt, so ... immediately.
One possible interpretation is that the intention is to implicitly reduce a miner's power to zero after their proving period expires, without the miner or any other actor sending a message to make that so. This is an additional complexity, since the Miner.Power
state variables will no longer define the power table; it will be a function of more complex state. All validating nodes will need to compute this function correctly to verify that a block proposer has the power to win the election. Will the election be constructed to subtract zero-power miners from the lottery total (more implementation complexity to track this for all miners), or allow them to "win" but ignore their blocks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a big change. Previously, a miner maintained power when late, until the generation attack threshold was exceeded (which made sense to me)
I don't think that was the case, but rather it was not well specified.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But I'm also confused about when it should take effect.
See #384 for some details on how this will be enforced in practice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an additional complexity, since the Miner.Power state variables will no longer define the power table; it will be a function of more complex state
This is already the case with the previous definition, as slashed miners would have had their power removed, even without anyone calling SlashStorageMiner
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See #384 for some details on how this will be enforced in practice.
I read #384 and it's still not clear to me when the miner's power is reduced to 0. Should GetMinersPowerAt
return 0 when the head of the chain reaches the height which marks the end of the miner's proving period [and they have yet to submit a PoSt for that proving period]?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can now happily answer that yes it should. That is, if the miner's PoSt has not been included by the block at height ProvingPeriodEnd, the miner's power will read 0 (they will be unable to participate in leader election) until they can prove they are storing the sectors to which they committed.
- **Condition**: If the miner posts their PoSt after the proving period ends, but before `SectorFailureTimeout` | ||
- **Reporting:** The miner submits their PoSt as usual, but includes the `LateSubmissionFee + Lost Storage Collateral`. | ||
- **Penalization:** | ||
- *Economic penalization*: To determine the penalty amount, `ComputeLateSubmissionFee(minerPower)` is called, in addition all storage collateral is lost for the sectors in current `ProvingSet` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does "all storage collateral is lost" mean? Does that mean it's burnt?
The whole point of storage collateral, as I understand it, is that a miner must hold it for remittance to a deal client who's sector has been dropped before the deal completed. It is not burnt. Whether a sector is dropped intentionally ("done"), experiences a fault, or goes AWOL due to no PoSt after some timeout, the way it's lost is irrelevant from the client's point of view. Indeed, I believe the intention is that if a miner loses a sector but is able to recover from a backup and re-commit it, the client will not be entitled to payment (I don't think this is well defined, though).
As I understood (prior) intentions, the only claimant on storage collateral is clients, and after some appropriately long arbitration time the collateral should return to the miner.
Being X hours late in a PoSt (e.g. due to network outage) should not expose an honest miner who is in fact storing all clients' data to instant loss of all storage collateral. (The size of storage collateral is still TBA, so it's possible that it's a small enough matter that this objection is not relevant).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does "all storage collateral is lost" mean? Does that mean it's burnt?
It is lost in the sense that the miner has to replace it, and the storage clients can arbitrate deals with the miner to recover this collateral.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Being X hours late in a PoSt (e.g. due to network outage) should not expose an honest miner who is in fact storing all clients' data to instant loss of all storage collateral.
Two things
- This is a much larger timeout (e.g. x ProvingPeriods) we are talking about here, especially compared to the timeouts before. This means this only gets activated if miners can't provide proofs for more than multiple days.
- I disagree, we want to incentivize miners to store data and keep it available, and so loosing storage collateral for not being able to prove that data is stored is the right thing to do.
- **Condition**: If the miner posts their PoSt after the proving period ends, but before `SectorFailureTimeout` | ||
- **Reporting:** The miner submits their PoSt as usual, but includes the `LateSubmissionFee + Lost Storage Collateral`. | ||
- **Penalization:** | ||
- *Economic penalization*: To determine the penalty amount, `ComputeLateSubmissionFee(minerPower)` is called, in addition all storage collateral is lost for the sectors in current `ProvingSet` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Secondly, is this another implicit thing? There doesn't seem to be a slashing method call here so there will be no state update. In the same vein as the complexity of implicit power, this suggests that a collateral burn (or something) is also implicit, which is going to be complex to track.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes collateral movement will have to happen implicitly
- when arbitrate deal is called by client, things can be updated
- when a new post is submitted, according calculations need to be made
- *Economic*: There is no recovery of the fees. | ||
- *Power*: The miners' power is reset matching the submitted PoSt. | ||
- **After `SectorFailureTimeout`:** | ||
- **Condition**: If the miner posts their PoSt after the proving period ends, or not at all. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All of these fault conditions are after the proving period ends, with the other cases describing exactly how long after the proving period for different penalties to apply. I don't understand how this condition fits in with the others. How is it different to the "unreported storage fault" (i.e. generation attack exceeded) case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unreported storage fault is now only once SectorFailureTimeout
happens, which is a much longer duration than previously
faults.md
Outdated
- *Economic penalization*: All storage collateral is lost. | ||
- *Power penalization*: The miners' power is reduced to `0`. | ||
- **Recovery**: | ||
- Only resubmission of the sectors can lead to recovery. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is recovered? I'm assuming this intends to mean that none of the lost things can be recovered, but that a miner may post new collateral and commit new sectors, i.e. they're not permanently banned from the market.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
correct
faults.md
Outdated
- Note: we could *require* the method be called, as part of the consensus rules (this gets complicated though). In this case, there is a DoS attack where if I make a large number of miners each with a single sector, and fail them all at the same time, the next block miner will be forced to do a very large amount of work. This would either need an extended 'gas limit', or some other method to avoid too long validation times. | ||
- **Check:** The chain checks that the miners last PoSt submission was before the start of their current proving period, and that the current block is after the generation attack threshold for their current proving period. | ||
- **Penalization:** Penalizations are enforced by `SlashStorageFault` on the `storage market` actor. | ||
- *Economic Penalization*: Miner loses all collateral. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was expecting this PR to remove explicit economic penalisation from storage faults, as it's a huge operational risk. Losing all power, and hence being forced to re-seal and commit sectors, is already a huge penalty.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Losing all power, and hence being forced to re-seal and commit sectors, is already a huge penalty.
- This should read
storage collateral
- they are only forced to reseal after the SectorFailureTimeout, this what this is all about, allowing recovery when the failure has not been too long
- Immediate power reductions are critical to keep the consensus safe, and also matches behaviour seen in proof of work chains, where when your operation is offline, you are not able to mine blocks.
faults.md
Outdated
- **Reporting:** The miner can specify some sectors that they failed to prove during the proving period. | ||
- Note: These faults are output by the `ProveStorage` routine, and are posted on-chain when posting the proof. This occurs when the miner (for example) has a disk failure, or other local data corruption. | ||
- **Check:** The chain checks that the proof verifies with the missing sectors. | ||
- **Penalization:** The miner is penalized for collateral and power proportional to the number of missing sectors. The sectors are also removed from the miners proving set. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly I was expecting collateral not to be penalised here.
faults.md
Outdated
- **Condition:** A client who has stored data with a miner, and the miner removes the sector containing that data before the end of the agreed upon time period. | ||
- **Reporting:** The client invokes `ArbitrateDeal` on the offending miner actor with a signed deal from that miner for the storage in question. Note: the reporting must happen within one proving period of the miner removing the storage erroneously. | ||
- **Check:** The chain checks that the deal was correctly signed by the miner in question, that the deal has not yet expired, and that the sector referenced by the deal is no longer in the miners proving set. | ||
- **Penalization:** The miner is penalized an amount proportional to the incorrectly removed sector. This penalty is taken from their pledged collateral . |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You didn't intend to edit this text, but it is nevertheless incorrect. The penalty should be in proportion to the deal size, and taken from storage collateral.
A wrinkle for removing collateral penalties from storage slashing is that there's no reward to pay to incentivise slashers. Maybe that was the justification for the implicit thingies? |
Thanks for your responses, many of the things you said make sense. A lot depends on the relative scale of things – it's totally unclear to me how long the failure timeout periods are, but in a comment you said "many proving periods", which certainly makes a difference. The relative cost of storage collateral matters greatly. If this collateral is significant wrt a miner's investment and return expectations, then losing it all permanently doesn't satisfy my judgement of a miner's operational risk. But if it's pennies, no worries (but not so good for clients). The mechanism as stated will lock up potentially large amounts of FIL forever (without actually burning it). As I noted in #386, this doesn't answer how storage collateral is locked up for potential arbitration in case a miner declares a sector done prematurely, which will likely be a more frequent use case. |
- TODO: write on the spec exact parameters for PoSt Deadline and Gen Attack threshold | ||
- **Unreported storage fault slashing:** | ||
- **Condition:** If the miner does not submit their PoSt by the `generation attack threshold`. | ||
- **Penalization:** All of the miner's pledge collateral and all of their power is irrevocably slashed. This miner can never again produce blocks, even if they attempt to repost their pledge collateral. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
even if they attempt to repost their pledge collateral
What does it mean to "repost their pledge collateral" if the miner's pledge collateral was already slashed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my read here (not having written it) is they will not be allowed to commit to sectors even if they put up collateral to replace the slashed one.
I assume that would just force the miner to create a new identity. @dignifiedquire ? Does the incentive make sense in that way? I guess it forces the extra gas cost on them, but I assume that's small potatoes vs the storage collateral.
|
||
- `PoStProvingPeriod: UInt = 24 * 60 * 60 / 30 = 2880` (1 day, given `30s` block time) | ||
- `PoStTimeout: UInt` (less than 1 ProvingPeriod) | ||
- `SectorFailureTimout: UInt` (more than 1 Proving Period) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo. Should be SectorFailureTimeout
.
- **Reporting:** The miner submits their PoSt as usual, but includes the `LateSubmissionFee`. | ||
- **Penalization:** | ||
- *Economic penalization*: To determine the penalty amount, `ComputeLateSubmissionFee(minerPower)` is called. | ||
- *Power penalization*: The miners' power is reduced to `0`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See #384 for some details on how this will be enforced in practice.
I read #384 and it's still not clear to me when the miner's power is reduced to 0. Should GetMinersPowerAt
return 0 when the head of the chain reaches the height which marks the end of the miner's proving period [and they have yet to submit a PoSt for that proving period]?
faults.md
Outdated
- **Check:** The chain checks that the proof verifies with the missing sectors. | ||
- **Penalization:** The miner is penalized pledge collateral, storage collateral and power proportional to the number of missing sectors. The sectors are also removed from the miners proving set. | ||
- **Recovery**: | ||
- The faulted sectors must be re-submitted as if there were new sectors. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When re-sealing and re-submitting, can a miner reuse the id of a sector which it has permanently lost?
- **Reporting:** The client invokes `ArbitrateDeal` on the offending miner actor with a signed deal from that miner for the storage in question. Note: the reporting must happen within one proving period of the miner removing the storage erroneously. | ||
- **Check:** The chain checks that the deal was correctly signed by the miner in question, that the deal has not yet expired, and that the sector referenced by the deal is no longer in the miners proving set. | ||
- **Penalization:** The miner is penalized an amount proportional to the incorrectly removed sector. This penalty is taken from their storage collateral. | ||
- *Note*: This implies that miners cannot re-seal data into different sectors. We could come up with a protocol where the client gives the miner explicit consent to re-seal, but that is more complicated and can be done later. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This implies that miners cannot re-seal data into different sectors.
How does this imply that miner's cannot re-seal data into different sectors?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mmh I'm also unclear there: couldn't the miner just cache old challenges in order to regenerate an identical seal? How might one tell the difference?
This reverts commit 5e43b27.
@dignifiedquire I believe this is stale after massive Sept-Oct spec & subsequence faults work. Please reopen if you disagree. |
No description provided.