
Conduit releases Degen Chain outage incident report: improved alerts and monitoring for Orbit chains

ChainCatcher news, Conduit has released a post-mortem report on the recent Degen Chain downtime incident. On May 10, Conduit increased the batch size for Degen and Proof of Play Apex to 10MB to reduce costs, which delayed how long it took for data from these networks to be batch-published to their parent chain. On May 12, the configuration was reverted to fix batch publishing. This triggered reorganizations on both networks, because a batch was published after the 24-hour forced-inclusion window: Arbitrum Nitro inserts any delayed inbox messages ahead of the transactions in such a batch and replays those transactions with new timestamps.

After the reorganization, nodes came back with corrupted databases because geth did not handle a reorg of that depth well, which required resynchronizing the data directory from genesis. Resynchronization took more than 40 hours per network, at a replay rate of about 100M gas/s. Once the nodes were resynchronized, Conduit attempted various transaction replay strategies, although not all transactions could be recovered, since some depended on precise timestamps.

After consulting each rollup team, Conduit discussed and pursued several strategies in parallel to bring the networks back online and restore the state from before the reorganization. Degen Chain was back online 54 hours after the outage began. Proof of Play's Apex chain was restored around the same time, but was opened to the public only after an additional recovery solution was implemented.

Conduit stated that it has improved alerting and monitoring for its Orbit chains to cover situations like this and is committed to working with Offchain Labs to improve observability for all Orbit chain operators. The team will continue to invest in and research mechanisms that better simulate mainnet conditions and transaction payloads in a test environment. The Degen Chain Explorer is now displaying the latest state of Degen Chain normally.

Aptos responds to yesterday's "network outage": it was not a transaction load issue but was caused by non-deterministic code, and a fix has been deployed

ChainCatcher news, Aptos has released a report on yesterday's network outage, stating that the Aptos network began experiencing transaction delays around 16:15 PDT on October 18 (07:15 Beijing time on October 19). Transaction load was not the issue in this incident; submitted transactions were not lost, and no forks occurred. The problem was caused by non-deterministic code, and a fix has been deployed. The issue was resolved around 12:30 Beijing time on October 19.

On August 22, performance-focused code changes were merged into the Aptos core codebase, and on October 16 the FeeStatement feature went live, itemizing transaction fees and refunds. The earlier code changes introduced non-determinism that only FeeStatement revealed. Specifically, validators consistently concluded that a transaction's gas budget was insufficient to execute it, but because of the non-determinism introduced in the August changes, they could not reach consensus on how much gas had been used up to that point.

After identifying the actual differences in event output from the non-deterministic transaction execution, the issue was traced in the code to the FeeStatement event and the earlier code changes. Meanwhile, a developer ran transaction simulations against the code changes to reconstruct the mapping changes, executing them repeatedly to confirm consistent results. The offending commit was then reverted, Docker builds for validator operators were started, and a new release was published.