zkSync: Database failure leads to downtime, decentralization is the long-term solution
According to ChainCatcher, zkSync has published on Twitter a detailed account of yesterday's zkSync Era network outage. A failure in the block queue database halted block production. The server API was not affected: transactions continued to be added to the memory pool, and the query service functioned normally. Although all components have comprehensive monitoring, logging, and alerting, no alert was triggered because the API was operating normally. The fix was implemented within 5 minutes. To catch similar issues, zkSync has granted the database monitoring agents a special role that lets them connect to the database and continuously collect metrics. An alert will now be raised whenever a database monitoring agent fails or cannot connect to the database to collect metrics.
Additionally, if an incident escalates to a severe level, the on-call team will be notified immediately through multiple channels. However, the only long-term solution for liveness and availability is decentralization. Decentralized systems are inherently more resilient, and decentralizing the sequencer (and subsequently the prover) is the top priority for the zkSync engineering team.
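The monitoring change described above, alerting when the agent itself goes silent rather than only when a metric looks bad, can be illustrated with a small staleness check. This is a minimal sketch and not zkSync's actual tooling: the names `AgentHeartbeat` and `check_agent` and the 60-second threshold are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AgentHeartbeat:
    """Tracks when a (hypothetical) monitoring agent last collected DB metrics."""
    last_success: Optional[float] = None

    def record_success(self, now: float) -> None:
        # Called only after the agent connected to the database and read metrics.
        self.last_success = now


def check_agent(heartbeat: AgentHeartbeat, now: float,
                max_silence_s: float = 60.0) -> Optional[str]:
    """Return an alert message if the agent has been silent too long, else None.

    The 60-second default is an illustrative assumption, not a published value.
    """
    if heartbeat.last_success is None:
        return "agent has never reported database metrics"
    silence = now - heartbeat.last_success
    if silence > max_silence_s:
        return f"no database metrics for {silence:.0f}s"
    return None
```

The point of the design is that a missing signal is itself treated as an alert condition. In the outage described here, the API looked healthy, so checks that only watched the API stayed quiet. A check like this one fires whenever metric collection from the database stops, whatever the reason.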