On the 30th of September 2022 the Solana network had an outage. This was caused by a bug in the consensus algorithm implementation: a single validator operator had two nodes, running with the same identity, that were producing duplicate blocks. After roughly 24 hours of running in this faulty configuration, a bug in the Solana fork selection code caused consensus to stop working. A code fix was created and the Solana validators conducted a coordinated network restart. Total outage time was somewhere between 6 and 8 hours, depending on how you measure it.
At approximately 22:41 UTC on Friday, Sep. 30, the Solana Mainnet Beta cluster halted when the network was unable to recover from a fork caused by a bug in the consensus algorithm implementation. Block production resumed at approximately 6:57 UTC on Saturday after a coordinated restart, and network operators continued to restore client services over the next several hours. (Solana Outage Report)
So according to the outage report the downtime was 7 hours 44 minutes, while the status page for the incident says 6 hours 19 minutes. Safe to say the true figure is somewhere in the middle.
Watch this on YouTube! Our Second Video 🙂
What technically happened?
The following explanation points were provided in the Solana Outage Report.
- Due to a validator operator’s malfunctioning hot-spare node, which the operator had deployed as part of a high-availability configuration, duplicate blocks were produced at the same slot height.
- Both the primary and spare validators became active at the same time, operating with the same node identity, but proposed blocks of differing composition. This situation persisted for at least 24 hours prior to the outage, with most of the validator’s leader slots producing duplicate blocks which were handled safely by the cluster.
- Initially, duplicate blocks were handled by the network as expected. For example, duplicate blocks were produced in slot 153139220 (220) and the cluster reached consensus on one of those blocks before it continued on to slot 153139221 (221), as should happen for duplicate block conflict resolution.
- However, at the next slot, 221, duplicate blocks were observed again and an edge case was encountered. Even though the correct version of block 221 was confirmed, a bug in the fork selection logic prevented block producers from building on top of it and prevented the cluster from achieving consensus.
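The healthy half of the behaviour described above can be illustrated with a toy sketch: when a cluster sees two blocks at the same slot, a stake-weighted vote picks one winner, and the next leader should build on that winner. To be clear, this is not Solana's actual fork-choice code; the function name, block ids and stake percentages below are invented purely for illustration.

```python
# Toy sketch of duplicate-block conflict resolution. This is NOT
# Solana's real fork-choice implementation; the function name, block
# ids and stake numbers below are invented for illustration only.

def resolve_duplicates(candidates, stake_votes):
    """Pick the candidate block with the most stake-weighted votes."""
    return max(candidates, key=lambda block: stake_votes.get(block, 0))

# Slot 153139220 (220): the misconfigured leader produced two blocks.
duplicates_220 = ["220-A", "220-B"]
stake_votes = {"220-A": 61, "220-B": 12}  # made-up percentages of stake

winner_220 = resolve_duplicates(duplicates_220, stake_votes)
print(winner_220)  # prints 220-A

# In the healthy case the slot-221 leader now builds on winner_220.
# The October 2022 bug was an edge case one step later: even after the
# cluster confirmed the correct version of block 221, the fork-selection
# logic refused to build on top of it, so roots stopped advancing.
```

The key takeaway is that duplicate blocks themselves are an expected, recoverable condition; the outage came from the follow-on step where the winning block should have been extended.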
For those that want to dive into a fantastic technical explanation of what happened, check out Michael from Laine’s YouTube video below!
Outage Report Timeline
Here are the timeline details provided by Solana in the Outage Report.
- 09-30-2022 21:46 UTC: Validators start reporting consensus failure. Voting is still occurring but roots are not advancing.
- 09-30-2022 22:00 UTC: Investigation commences to see if recovery is possible without a restart.
- 09-30-2022 22:41 UTC: The Solana Mainnet Beta cluster halted.
- 09-30-2022 23:09 UTC: It is discovered that a validator had been producing duplicate blocks for its leader slots. Its operator is contacted and the offending nodes are taken offline.
- 10-01-2022 00:08 UTC: Attempts to recover the cluster fail. Restart planning begins.
- 10-01-2022 01:10 UTC: Restart instructions issued.
- 10-01-2022 06:57 UTC: 80% of stake-weight online, roots advancing and network online.
- 10-01-2022 07:30 UTC: The core team identified the likely bug that caused the consensus failure.
- 10-01-2022 09:30 UTC: A fix was proposed and a test was added to reproduce the edge-case bug.
- 10-03-2022 08:30 UTC: After review by the core team, the patch was merged into the master branch and backported to all release branches.
- 10-03-2022 15:00 – 20:00 UTC: New release binaries were built and deployed to canary nodes for testing.
- 10-04-2022 15:30 UTC: An announcement to upgrade was issued to validators, who then began actively upgrading their systems to versions v1.10.40 and v1.13.2.
- 10-07-2022 04:00 UTC: 90% of stake-weight had applied the patch to fix the consensus bug, and the core team determined the risk of the bug to the network to be sufficiently mitigated.
How Are Solana Outages and Incidents Tracked?
Solana uptime, incidents and outages are shown on the Solana Status page, which tracks Partial Outages and Full Outages.
Is it the same Outage as Before?
The outages from September 2021 through May 2022 were caused by IDO or NFT bot spam, whereas the last two outages were caused by software bugs. This suggests the Solana upgrades are doing their job in stopping bot spam, but the Solana codebase is still being impacted by bugs.
- September 2021 – 17 hours 56 minutes outage – Grape IDO on Raydium (IDO Bot Spam)
- January 2022 – Partial outage for 2 weeks (was hell-ish!) (Network Congestion)
- April 2022 – 2 hours 42 minutes outage (NFT Bot Spam)
- May 2022 – 5 hours 31 minutes outage (NFT Bot Spam)
- June 2022 – 4 hours 10 minutes outage (Durable Nonce TX Bug)
- October 2022 – 6 hours 19 minutes outage (Duplicate Block Consensus Bug)
So what happens next for Solana Blockchain Users?
Some users don’t care, like those who have lived through previous Solana outages and hope to see the issue fixed quickly so that Solana gets stronger and more robust in preparation for when real traffic arrives on-chain.
However, if you have invested time, money & credibility in building a Web3 business… a.k.a. are a Buildooor, then you have to mitigate risk. Every outage that lasts for half a day could mean thousands or millions of dollars of lost revenue.
What can Buildooors do?
A lot of buildooors we are in contact with design their products to be multi-chain. This means that they BUIDL their product or service so that it can serve blockchain users operating on multiple chains, including Solana, Ethereum, Polygon, Binance Chain and others. Stay tuned for some upcoming Founder Podcasts where we discuss this!
Upcoming Solana Breakpoint Conference!
Get ready for the Solana Breakpoint Conference, happening in Lisbon, Portugal from the 4th to the 7th of November 2022. See you there!