In-depth investigation: Why do new public chains frequently experience outages?

ChainCatcher Selection
2022-01-31 15:16:21
Uncontrolled traffic is the root cause of many recent public chain halts, and the common countermeasure adopted by project teams, raising the base gas price, essentially reduces the throughput the system can support.

Author: Richard Lee

Editor: Gong Quanyu

Since the start of January, several public chains and Layer 2 networks, including Solana, Harmony, and Arbitrum, have suffered outages in which block production halted, while the Ethereum sidechain Polygon has faced severe congestion, with users reporting long delays when submitting transactions or withdrawals.
These chains have long touted "high performance" as their main selling point, yet they all went "on strike" at around the same time. Before this, Solana, Arbitrum, BSC, and Fantom had repeatedly run into similar problems.

The collective halts of new public chains reflect a widespread, far-reaching infrastructure crisis. Through interviews with the Harmony team and with professionals from domestic public chains such as Conflux, Chain Catcher attempts to reconstruct how this crisis unfolded and to clarify the issues that deserve attention and reflection.

1. Why is "public chain downtime" worth noting?

Web 3.0 is known for combining the openness of Web 1.0 with the economic benefits of Web 2.0, and it is a term used to describe the next wave of the internet in the crypto space. This old term has become a buzzword again, not only because it legitimizes the crypto economy but more so because it symbolizes the large-scale adoption of blockchain and crypto technology.

The public chain sector saw explosive growth in 2021, and the emergence of Solana was one of the reasons: claiming a throughput of tens of thousands of transactions per second (TPS), it aims to give users a faster and cheaper on-chain experience. Many prominent figures and institutions, including SBF and Bank of America, view Solana as a "gateway" for large-scale crypto adoption.

In a future where on-chain applications are expected to further break out, the security and stability of public chains, as the underlying infrastructure, are crucial. New public chains represented by Solana plan to challenge Ethereum and become the first stop for many new users entering the crypto industry, but they have faced embarrassing situations like outages, reflecting the inherent flaws that have gradually emerged during their rapid development.

If the phenomenon of these public chains being paralyzed for hours is not resolved in a timely manner, it will inevitably lead to a poor user experience and impression for mainstream users entering the market, becoming a significant bottleneck for the large-scale development of the crypto economy. After all, if public chains, as decentralized networks maintained by distributed nodes, frequently go down or lag like platforms based on centralized servers, how can they gain the trust of the mainstream audience?

2. Traffic Out of Control: The Root Cause of New Public Chains' "Halts"

"DDoS attack" is one of the most commonly used terms by project teams when explaining network performance degradation. DDoS stands for "distributed denial-of-service attack," which refers to overwhelming a system's processing capacity by using traffic from multiple sources, making it impossible for real users to access the required network services or resources in a timely manner. Attackers typically achieve this by sending traffic that exceeds the processing capacity of a network card or by sending a number of requests to an application that exceeds its management capability.

According to the blockchain white hat hacker organization Halborn, traditional DDoS methods usually cause fixed single points of failure in the system; for example, if a web server fails, visitors may be unable to access the website it operates. Therefore, resistance to DDoS attacks is often one of the main selling points of blockchain technology—there is no single essential node in a blockchain network, and the offline status of one node does not lead to the paralysis of the entire network.

However, this does not mean that blockchains are immune to DDoS attacks. Halborn points out that attackers can send a large number of spam transactions, flooding the entire network and crowding out the block space available to "legitimate users." In practice, these so-called "attacks" are often not premeditated at all, but the result of real users running automated scripts to gain an edge during popular IDOs, GameFi transactions, or market booms.

So, can continuously adding memory to node servers solve this problem? The answer is no. The reason is a characteristic shared by most blockchain networks: they have a fixed capacity and periodically produce blocks with a specific size limit. When a node packages a block, any transactions that do not fit are held in the memory pool (mempool) until a later block has room for them.
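The fixed-capacity behaviour described above can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: the function name, the per-slot arrival numbers, and the capacity of 100 transactions per block are hypothetical and do not correspond to any particular chain.

```python
from collections import deque


def simulate_backlog(arrivals_per_slot, block_capacity):
    """Return the mempool size after each block slot.

    arrivals_per_slot: number of new transactions arriving in each slot.
    block_capacity: hypothetical fixed number of transactions per block.
    """
    mempool = deque()
    backlog = []
    next_id = 0
    for arrivals in arrivals_per_slot:
        # New transactions first wait in the mempool.
        for _ in range(arrivals):
            mempool.append(next_id)
            next_id += 1
        # The block producer can include at most block_capacity of them.
        for _ in range(min(block_capacity, len(mempool))):
            mempool.popleft()
        backlog.append(len(mempool))
    return backlog


# A burst in the third slot (for example, a popular IDO) leaves a backlog
# that drains only slowly, because each block has the same fixed capacity.
print(simulate_backlog([80, 90, 500, 90, 80, 80], block_capacity=100))
# [0, 0, 400, 390, 370, 350]
```

In this toy model, adding memory only lets the mempool grow larger; it does not make the backlog drain any faster, because the drain rate is set by the block capacity.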

This fundamental property means that every public chain network faces the same problem: under certain circumstances, incoming transaction requests can far exceed what the blocks can absorb, producing a flood.

How to address this challenge and whether the measures taken are effective are important indicators for assessing the recent performance of various networks.

Solana users are probably the most familiar with "transaction flooding." On September 14, 2021, Solana suffered a 17-hour outage during which all on-chain services were unavailable. The official post-incident report attributed it to the heavily subscribed IDO of the decentralized social networking protocol Grape Protocol on the Raydium platform: many users sent large numbers of transactions with scripted bots, which caused validator nodes to run out of memory and crash. As a result, the network could not reach "consensus" and went offline (i.e., it could not produce new blocks).


According to the Solana Status announcement, the congestion phenomenon that has persisted since early December last year is also related to the issues exposed by the "9·14" outage. Solana Status is a Twitter account operated by the Solana Foundation that publishes network performance announcements.

According to analysis by the blockchain company Laine, recent market volatility pushed many leveraged positions in DeFi projects past their liquidation thresholds. Whoever executes a DeFi liquidation receives a reward, and anyone can act as a liquidator. This has created a market in which many participants compete for liquidation bounties, often using self-developed automated programs (commonly called "bots"). To make sure they "win" the race, these bots send dozens or even hundreds of identical transaction requests.

"We saw nearly 2 million transactions (trades or other types of requests) arriving at the same node per second, with over 90% being completely identical duplicates," Solana co-founder Anatoly Yakovenko stated during a Twitter Space event on January 27.

Regarding the cause of the outage, Hu Zhiwei, director of the Boundary Smart Research Institute, further told Chain Catcher that because Solana treats consensus messages as a special type of transaction message transmitted between validating nodes, the large volume of messages clogged the network, preventing consensus messages from being transmitted normally, thus hindering the consensus process.

Composition of Solana's TPS. Source: Solana Beach

"At the same time, some features of Solana were specifically exploited, leading to network downtime. For example, the write-lock for concurrent transaction processing was locked on many important addresses, causing transactions to execute sequentially rather than concurrently, greatly affecting message processing capability; nodes retained possible fork information to handle forks, leading to memory overflow," Hu Zhiwei said.

Wu Ming, CTO of the well-known domestic public chain Conflux, analyzed for Chain Catcher that in the case of excessive transactions causing network congestion in Solana, the delay in block forwarding (broadcasting) would increase, making it easier for the ledger to fork; when the ledger fork situation becomes severe, the pressure on the consensus algorithm increases, and if not handled properly, it could ultimately lead to a complete system crash.
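The write-lock contention that Hu Zhiwei describes can be illustrated with a toy scheduler. This is a minimal sketch and not Solana's actual runtime: it assumes each transaction declares the set of accounts it write-locks, and it greedily groups transactions whose write sets do not overlap. The function name and the account names are hypothetical.

```python
def schedule_batches(txs):
    """Greedily group transactions into batches with disjoint write sets.

    txs: list of sets, each holding the account names one transaction
         write-locks.
    Returns a list of batches (lists of transaction indices). Transactions
    in the same batch could run in parallel; batches run one after another.
    """
    batches = []  # each entry: (accounts locked by the batch, tx indices)
    for i, writes in enumerate(txs):
        for locked, members in batches:
            if locked.isdisjoint(writes):
                locked |= writes  # extend the batch's lock set in place
                members.append(i)
                break
        else:
            # Conflicts with every existing batch: start a new one.
            batches.append((set(writes), [i]))
    return [members for _, members in batches]


independent = [{"a"}, {"b"}, {"c"}, {"d"}]
hot_account = [{"pool", "a"}, {"pool", "b"}, {"pool", "c"}, {"pool", "d"}]

print(schedule_batches(independent))  # [[0, 1, 2, 3]]
print(schedule_batches(hot_account))  # [[0], [1], [2], [3]]
```

When every transaction write-locks the same popular account (here the hypothetical `pool`), each one ends up in its own batch, so execution degrades from concurrent to fully sequential.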

"A very important issue here is that nodes should not indiscriminately forward low-cost spam transactions; Solana should have done a better job in this aspect of flow control." Wu Ming stated.

Anatoly Yakovenko also acknowledged this issue during the aforementioned Twitter Space event. He stated that the main problem lies in the original program design, where "duplicate transaction checks" occur after signature verification, meaning that all duplicate data must first undergo signature verification before being checked for "spam transactions." Additionally, before the upgrade of the node client, the program Solana used to delete duplicate data and clear network redundancy ran very slowly, taking hundreds of microseconds.
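The effect of the ordering Yakovenko describes can be sketched as follows. This is an illustrative toy, not Solana's code: `expensive_verify` is a stand-in that only counts how often signature verification would run, and the transaction payloads are made up.

```python
import hashlib


def expensive_verify(tx: bytes, counter: dict) -> bool:
    """Stand-in for signature verification; it only counts invocations."""
    counter["verifications"] += 1
    return True  # assumption: every transaction in this toy example is valid


def verify_then_dedup(txs):
    """Original ordering: verify every packet, then drop duplicates."""
    counter = {"verifications": 0}
    seen, accepted = set(), []
    for tx in txs:
        if not expensive_verify(tx, counter):
            continue
        digest = hashlib.sha256(tx).digest()
        if digest not in seen:
            seen.add(digest)
            accepted.append(tx)
    return accepted, counter["verifications"]


def dedup_then_verify(txs):
    """Alternative ordering: drop byte-identical packets before verifying."""
    counter = {"verifications": 0}
    seen, accepted = set(), []
    for tx in txs:
        digest = hashlib.sha256(tx).digest()
        if digest in seen:
            continue  # cheap rejection before any signature work
        seen.add(digest)
        if expensive_verify(tx, counter):
            accepted.append(tx)
    return accepted, counter["verifications"]


# Nine identical bot transactions plus one ordinary transfer.
txs = [b"liquidate:position-1"] * 9 + [b"transfer:alice->bob"]
print(verify_then_dedup(txs)[1])  # 10
print(dedup_then_verify(txs)[1])  # 2
```

Both orderings accept the same two transactions, but when over 90% of the traffic is duplicated, checking for duplicates first removes most of the verification work.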

To avoid interference from "bot" transactions during the next market surge, Anatoly Yakovenko stated that they will introduce "actual flow control" in the upcoming 1.9 version of the Solana mainnet beta.

Another popular public chain, Harmony, is facing similar issues. On January 15, the Harmony network was down for several hours, and the official team raised the base gas fee to 30 gwei to increase the threshold for sending spam transactions.

A post-incident analysis released by the Harmony community showed that the network's leader node received a large amount of spam traffic, and that the outdated client run by validator nodes handled high traffic poorly. The "downtime" incident was therefore caused by a combination of external and internal factors.

Harmony CTO Rongjian Lan told Chain Catcher that repeated sending of peer-to-peer (p2p) data packets caused congestion in the p2p network, preventing normal consensus messages from being sent, thus the network could not reach "consensus." The internal reason lies in the fact that the parameters of the Harmony p2p network are not optimized enough and there are potential bugs, leading to the aforementioned phenomenon.
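One generic way to keep a single peer from flooding a p2p network with repeated packets is a per-peer token bucket. The sketch below is a textbook rate limiter, not Harmony's implementation; the class name, the rate of 2 packets per second, and the burst size of 3 are hypothetical, and timestamps are passed in explicitly so the example is deterministic.

```python
class TokenBucket:
    """Classic token bucket: allows short bursts, caps the sustained rate."""

    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec    # tokens added per second
        self.capacity = burst       # maximum tokens the bucket can hold
        self.tokens = float(burst)  # start with a full bucket
        self.last = 0.0             # timestamp of the previous call

    def allow(self, now: float) -> bool:
        """Return True if a packet arriving at time `now` may pass."""
        elapsed = now - self.last
        self.last = now
        # Refill for the elapsed time, never exceeding the capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


bucket = TokenBucket(rate_per_sec=2, burst=3)

# A peer sends 10 packets at once: only the burst allowance gets through.
passed_at_t0 = sum(bucket.allow(0.0) for _ in range(10))
print(passed_at_t0)  # 3

# One second later, two more tokens have been refilled.
passed_at_t1 = sum(bucket.allow(1.0) for _ in range(10))
print(passed_at_t1)  # 2
```

A node would keep one such bucket per peer, so that traffic dropped from a noisy peer does not consume the budget that consensus messages from other peers rely on.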

"The new Web3 infrastructure needs better traffic monitoring and limiting mechanisms to prevent network abuse." Rongjian Lan stated that after optimizing the parameters of the p2p network protocol, Harmony will undertake a long-term system improvement project, optimizing at the consensus, network, and RPC layers.

Additionally, the Ethereum Layer 2 scaling network Arbitrum One experienced network outages on September 14 last year and January 9 this year, but according to official announcements, these were not directly related to traffic control issues, but rather due to the network's intentionally maintained high level of centralization during its testing phase.

It is reported that the cause of the first incident in Arbitrum One was a bug in its Sequencer, while the recent outage was due to hardware failure of the main Sequencer node, with the backup Sequencer failing to activate in time, resulting in the network "striking" for several hours.

"Although we usually have redundancy to allow the backup Sequencer to seamlessly take over, these functions did not work due to ongoing software upgrades. The result was that the Sequencer stopped processing new transactions," Offchain Labs stated.

It is noted that the Sequencer is a full node operated by the Arbitrum development team Offchain Labs. The Sequencer has certain privileges and can control the ordering of each transaction in the inbox to ensure that users' transaction results can be confirmed immediately.

Offchain Labs stated in the announcement that Arbitrum's strongest guarantees will arrive once the network is fully decentralized.

3. Is Raising the "Base Gas Fee" the Ultimate Solution? Where is the Future of Public Chain Stability?

In fact, given sufficient incentives, writing scripts to gain an unfair advantage has long been natural behavior for internet users. As on-chain interactions increase, "transaction flooding" and "bots" will inevitably enter the blockchain space.

Around the same time, the Polygon network also drew complaints about its operational status. In early January, the popularity of the P2E game Sunflower Farmers on Polygon led players to send a large number of transaction requests, and at one point the game's smart contract accounted for 41.8% of gas consumption on the entire Polygon network. Other transactions on Polygon were temporarily stalled, the network became heavily congested, and the average gas price rose nearly sevenfold within a few days.

Average gas price trend of Polygon over the last three months. Source: Polygonscan

Polygon has long been troubled by "transaction flooding," with network congestion occurring from time to time. Previously, in October last year, Polygon had already raised the minimum gas price for node clients by 30 times (from 1 gwei to 30 gwei) to cope with the massive "spam transactions."

This response method is consistent with the emergency measures taken by Harmony. However, raising the base gas price increases the cost for users to "cheat" on one hand, while also impacting user experience on the other.

Regarding this common practice by project teams, Wu Ming analyzed for Chain Catcher that raising the base gas as a form of "flow control" is certainly effective, as this measure essentially reduces the throughput that the system can support.

However, he also pointed out that "if you want to do better, you need to work on the system itself to increase the maximum throughput that the system can support, which will involve improvements in consensus algorithms, network forwarding algorithms, storage, and execution optimizations."
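The trade-off in raising the minimum gas price can be sketched with a simple admission filter. This is an illustration only: the transaction records and sender names are hypothetical, and 21,000 gas is used as the cost of a plain transfer on EVM-compatible chains.

```python
def admit(txs, min_gas_price_gwei):
    """Keep only transactions whose offered gas price meets the node's floor."""
    return [tx for tx in txs if tx["gas_price_gwei"] >= min_gas_price_gwei]


def fee_in_gwei(gas_used, gas_price_gwei):
    """Fee paid for a transaction, expressed in gwei."""
    return gas_used * gas_price_gwei


txs = [
    {"sender": "bot", "gas_price_gwei": 1},
    {"sender": "bot", "gas_price_gwei": 1},
    {"sender": "thrifty_user", "gas_price_gwei": 5},
    {"sender": "user", "gas_price_gwei": 35},
]

# With a 1 gwei floor everything is admitted, including the spam.
print(len(admit(txs, 1)))   # 4

# With a 30 gwei floor the spam is gone, but so is the thrifty user.
print([tx["sender"] for tx in admit(txs, 30)])  # ['user']

# The minimum cost of a simple transfer rises by the same factor of 30.
print(fee_in_gwei(21_000, 1), fee_in_gwei(21_000, 30))  # 21000 630000
```

The filter cannot tell a bot from a price-sensitive user: it simply shrinks the set of transactions the network is willing to carry, which is why the measure works as flow control while also hurting user experience.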

The "flow control" improvements disclosed by Solana co-founder Anatoly Yakovenko involve introducing new protocol mechanisms. He stated that the upcoming upgrade will introduce a stake-weighted QoS (quality of service) flow control mechanism built on the QUIC protocol, which was reportedly developed by Google and has been in use for five to six years. With this protocol, Solana can impose rate limits on senders. The most critical issue for the development team is deciding how to allocate bandwidth between different blocks: validators must receive message flows from the rest of the network and apply quality-of-service prioritization and congestion control according to the stake weight of each message's source.
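The idea of stake-weighted QoS can be sketched as a per-sender packet budget proportional to stake. This is a simplified illustration, not Solana's implementation: it does not model QUIC at all, the function names and stake figures are hypothetical, and it assumes the total stake is positive.

```python
def stake_weighted_budgets(stakes, total_capacity):
    """Split a per-interval packet budget across senders by stake share.

    stakes: mapping of sender name to stake (total must be positive).
    total_capacity: packets the receiver will accept per interval.
    """
    total_stake = sum(stakes.values())
    return {
        sender: total_capacity * stake // total_stake
        for sender, stake in stakes.items()
    }


def apply_qos(packets, budgets):
    """Accept packets in arrival order until each sender's budget is spent."""
    used = {}
    accepted = []
    for sender, payload in packets:
        if used.get(sender, 0) < budgets.get(sender, 0):
            used[sender] = used.get(sender, 0) + 1
            accepted.append((sender, payload))
    return accepted


stakes = {"validator_a": 60, "validator_b": 30, "bot_relay": 10}
budgets = stake_weighted_budgets(stakes, total_capacity=10)
print(budgets)  # {'validator_a': 6, 'validator_b': 3, 'bot_relay': 1}

# The low-stake sender floods first, but can use only its own share.
packets = (
    [("bot_relay", f"spam-{i}") for i in range(8)]
    + [("validator_a", "vote-1"), ("validator_a", "vote-2")]
    + [("validator_b", "vote-1"), ("validator_b", "vote-2")]
)
accepted = apply_qos(packets, budgets)
print(len(accepted))  # 5
```

Because each sender is limited to its own share, a flood from a low-stake source cannot crowd out messages from higher-stake validators. Senders with no stake receive no budget in this sketch; how a real network treats them is a separate design decision.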

Anatoly Yakovenko stated on Twitter that the aforementioned "flow control" measures will be launched in the next 4-5 weeks.

Hu Zhiwei also mentioned that, against traffic attacks, public chains can protect validators' network traffic, for example by using sentinel nodes (i.e., nodes that can switch master and slave roles through a series of mechanisms when a main node fails, providing fault tolerance). To achieve higher TPS, in addition to optimizing within the chain itself, scaling through cross-chain designs and application-specific chains can also be considered.

This is also a solution that BSC is exploring. Recently, BSC officially acknowledged in its annual summary that its operational mechanism faces many challenges, including "network congestion and node operators facing difficulties in managing their full nodes to sync with the latest blocks," which led to multiple short-term outages last year.

In response, BSC stated that this was due to the large block settings requiring validating nodes to have more storage space and time to sync blocks, and it will develop towards multi-chain and cross-chain in 2022, launching BSC application side chains (BAS) and BSC partition chains (BPC) to reduce the data storage requirements of the main chain.

BSC's technical roadmap for this year. Source: BSC Blog

Will technological improvements and increased decentralization ensure the stability of public chain networks?

In response to this question, some netizens, borrowing from the blockchain "impossible triangle" of scalability, have proposed a "transaction quality" trilemma: among resistance to transaction flooding (spam), censorship resistance, and low fees, achieving any two inevitably means giving up the third.


Whether this is the case remains unknown until the project teams implement their improvement measures.

Regardless, the downtime incidents offer one insight: public chains, as underlying infrastructure, are still at an early stage and will remain so for a long time. They face further tests of network stability and ecosystem maturity, and in particular need more measures for coping with transaction surges so that ordinary users' experience is not harmed.

(Loners Liu and Hunter He also contributed to this article)
