Vitalik Buterin: Thoughts on Twitter's new feature Community Notes
Original title: "What do I think about Community Notes?"
Author: Vitalik Buterin
Translated and compiled by: 深潮 TechFlow
Over the past two years, Twitter (X) has been tumultuous, to say the least. Last year, Elon Musk purchased the platform for $44 billion, subsequently overhauling the company's staffing, content moderation, business model, and site culture—changes that may have stemmed more from Elon Musk's soft power than specific policy decisions. However, amidst these controversial actions, a new feature on Twitter quickly became significant and seemingly enjoyed bipartisan support: Community Notes.
Community Notes is a fact-checking tool that sometimes attaches contextual annotations to tweets, such as one that was added to a tweet from Elon Musk, serving as a means of fact-checking and countering misinformation. It was initially called Birdwatch and launched as a pilot project in January 2021. Since then, it has gradually expanded, with its fastest growth phase coinciding with Elon Musk's takeover of Twitter last year. Nowadays, Community Notes frequently appears on tweets that attract significant attention on Twitter, including those involving controversial political topics. In my view, and after speaking with many people across the political spectrum, I conclude that these Notes are informative and valuable when they appear.
However, what interests me most about Community Notes is that, although it is not a "crypto project," it may be the closest example of "crypto values" we see in the mainstream world. Community Notes is not authored or curated by a centrally selected group of experts; rather, anyone can write and vote on which Notes are displayed or not, entirely determined by an open-source algorithm. The Twitter site has a detailed and comprehensive guide describing how the algorithm works, allowing you to download data containing published Notes and voting information, run the algorithm locally, and verify that the output matches what is visible on Twitter. While not perfect, it surprisingly approaches the ideal of credible neutrality in quite controversial situations, while also being very useful.
How does the Community Notes algorithm work?
Any Twitter account that meets certain criteria (essentially: active for over 6 months, no violations, and a verified phone number) can register to participate in Community Notes. Currently, participants are being accepted slowly and randomly, but the ultimate plan is to allow any eligible person to join. Once accepted, you can first participate in rating existing Notes, and once your ratings are good enough (measured by how many ratings match the final outcome of that Note), you can also write your own Notes.
When you write a Note, it receives a score based on reviews from other Community Notes members. These reviews can be viewed as votes along three levels: "helpful," "somewhat helpful," and "not helpful," but reviews can also include other tags that play a role in the algorithm. Based on these reviews, Notes receive a score. If a Note's score exceeds 0.40, it will be displayed; otherwise, it will not.
What makes the algorithm unique is how the score is calculated. Unlike simple algorithms that aim to compute some sum or average of user ratings and use that as the final result, the Community Notes scoring algorithm explicitly attempts to prioritize Notes that receive positive evaluations from people with different viewpoints. In other words, if people who typically disagree on ratings ultimately agree on a specific Note, that Note will be highly rated.
Let's delve into how it works. We have a set of users and a set of Notes; we can create a matrix M where the entry M_ij represents how user i rated Note j.
For any given Note, most users have not rated it, so most entries in the matrix will be empty, but that's okay. The goal of the algorithm is to create a four-column model of users and Notes, assigning two statistics to each user, which we can call "friendliness" and "polarity," and two statistics to each Note, which we can call "usefulness" and "polarity." The model attempts to predict the matrix as a function of these values using the following formula:

M_ij ≈ μ + i_u + i_n + f_u · f_n
Note that here I introduce the terminology used in the Birdwatch paper, along with my own terms, to make the meaning of the variables more intuitive without delving into mathematical concepts:
μ is a "public sentiment" parameter that measures how high the ratings given by users generally are.
i_u is the "friendliness" of the user, indicating how likely that user is to give high ratings.
i_n is the "usefulness" of the Note, indicating how likely that Note is to receive high ratings. This is the variable we care about.
f_u and f_n are the "polarity" of the user and the Note respectively, indicating their position along the dominant axis of political polarization. In practice, negative polarity roughly means "left-leaning" and positive polarity means "right-leaning," but note that this axis is derived from analyzing the user and Note data; the concepts of left and right are not hardcoded.
The algorithm uses a fairly basic machine learning model (standard gradient descent) to find the best variable values to predict the matrix values. The usefulness assigned to a specific Note is its final score. If a Note's usefulness is at least +0.4, then that Note will be displayed.
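To make the core model concrete, here is a minimal sketch of the matrix factorization fit described above, written in Python with numpy. The variable names mirror the terms defined earlier (μ, i_u, i_n, f_u, f_n); the loss function, regularization, and hyperparameters are simplified assumptions for illustration and are not the production Community Notes scorer.

```python
# Minimal sketch of the core model: predicted rating = mu + i_u + i_n + f_u * f_n,
# fit by gradient descent over the observed (user, note, rating) entries only.
# NOT the production Community Notes code; hyperparameters are illustrative.
import numpy as np

def fit_model(ratings, n_users, n_notes, steps=3000, lr=0.2, reg=0.03, seed=0):
    """ratings: list of (user_id, note_id, value) triples."""
    rng = np.random.default_rng(seed)
    users = np.array([u for u, _, _ in ratings])
    notes = np.array([n for _, n, _ in ratings])
    vals = np.array([v for _, _, v in ratings], dtype=float)
    N = len(vals)

    mu = 0.0                            # overall "public sentiment"
    i_u = np.zeros(n_users)             # user friendliness
    i_n = np.zeros(n_notes)             # note usefulness (the score we care about)
    f_u = rng.normal(0, 0.1, n_users)   # user polarity
    f_n = rng.normal(0, 0.1, n_notes)   # note polarity

    for _ in range(steps):
        pred = mu + i_u[users] + i_n[notes] + f_u[users] * f_n[notes]
        err = pred - vals               # residuals of the squared-error loss
        # Full-batch gradients, with a small L2 regularizer on each parameter.
        g_mu = 2 * err.mean()
        g_iu = 2 * np.bincount(users, err, minlength=n_users) / N + 2 * reg * i_u
        g_in = 2 * np.bincount(notes, err, minlength=n_notes) / N + 2 * reg * i_n
        g_fu = 2 * np.bincount(users, err * f_n[notes], minlength=n_users) / N + 2 * reg * f_u
        g_fn = 2 * np.bincount(notes, err * f_u[users], minlength=n_notes) / N + 2 * reg * f_n
        mu -= lr * g_mu
        i_u -= lr * g_iu
        i_n -= lr * g_in
        f_u -= lr * g_fu
        f_n -= lr * g_fn

    return mu, i_u, i_n, f_u, f_n
```

A Note would then be displayed if its entry in i_n clears the 0.4 threshold described above.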
The core cleverness here is that "polarity" absorbs the characteristics of a Note that lead it to be liked by some users and disliked by others, while "usefulness" only measures the characteristics of a Note that lead it to be liked by all users. Thus, choosing usefulness can identify Notes that receive cross-tribal recognition and exclude those that are cheered in one tribe but provoke disdain in another.
The above only describes the core part of the algorithm. In reality, there are many additional mechanisms layered on top of it. Fortunately, they are described in the public documentation. These mechanisms include the following:
The algorithm runs multiple times, each time adding some randomly generated extreme "pseudo-votes" to the vote set. This means that the algorithm's true output for each Note is a range of values, and the final result depends on a "lower confidence bound" taken from that range, which is checked against a threshold of 0.32.
If many users (especially users with a polarity similar to the Note's) rate a Note as "not helpful," and they also specify the same "tags" (e.g., "contentious or biased language," "source does not support the Note") as reasons for their ratings, then the usefulness threshold required for the Note to be published increases from 0.4 to 0.5 (this may seem small, but it is very significant in practice).
Once a Note has been accepted, the usefulness threshold it must stay above to remain accepted is 0.01 points lower than the threshold it needed to clear to be accepted in the first place.
The algorithm runs multiple models for more iterations, sometimes boosting Notes with original usefulness scores between 0.3 and 0.4.
In summary, you end up with some fairly complex Python code, totaling 6,282 lines across 22 files. But it’s all open, and you can download the Notes and rating data and run it yourself to see if the output matches what is happening on Twitter.
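To give a flavor of how these layered conditions interact, here is a deliberately simplified decision-logic sketch. The thresholds (0.40, 0.50, 0.32, and the 0.01 hysteresis) are the ones quoted above; the function and parameter names are invented for illustration, and the real scorer has many more branches and coefficients.

```python
# Simplified sketch of how the publication rules described above might combine.
# Thresholds come from the text; names and structure are illustrative only.
def should_show(usefulness: float,
                lower_confidence_bound: float,
                already_shown: bool,
                strong_not_helpful_tag_signal: bool) -> bool:
    threshold = 0.40
    if strong_not_helpful_tag_signal:
        # Many similar-polarity raters marked the Note "not helpful" with
        # matching tags, so the bar for publication is raised.
        threshold = 0.50
    if already_shown:
        # Hysteresis: a Note that is already displayed is only pulled once its
        # score falls 0.01 below the threshold it originally had to clear.
        threshold -= 0.01
    # The pseudo-vote reruns produce a range of scores; the lower confidence
    # bound of that range must also clear its own threshold of 0.32.
    return usefulness >= threshold and lower_confidence_bound >= 0.32
```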
So what does this look like in practice?
The biggest difference between this algorithm and a naive voting algorithm is probably the "polarity" values I described above. The algorithm's documentation calls them f_u and f_n, using f for "factor," since the two terms are multiplied together; the more general name is used because the eventual goal is to make f_u and f_n multidimensional.
Polarity is assigned to users and Notes. The link between user IDs and underlying Twitter accounts is intentionally kept confidential, but the Notes are public. In fact, at least for the English dataset, the polarities generated by the algorithm are closely related to left and right.
Here are some examples of Notes with polarities around -0.8:
Note that I did not cherry-pick these; these are actually the first three rows in the scored_notes.tsv spreadsheet generated when I ran the algorithm locally, with polarity scores (referred to as coreNoteFactor1 in the spreadsheet) less than -0.8.
Now, here are some Notes with polarities around +0.8. It turns out that many of them are either discussions about Brazilian politics in Portuguese or Tesla fans angrily rebutting criticisms of Tesla, so let me pick a few that don’t belong to either category:
Again, I want to emphasize that the "left-right divide" is not hardcoded into the algorithm in any way; it is discovered through computation. This suggests that if you apply this algorithm in other cultural contexts, it can automatically detect their main political divides and bridge those divides.
Meanwhile, the Notes with the highest usefulness look like this. This time, because these Notes are actually displayed on Twitter, I can take a direct screenshot of one:
And another:
For the second Note, it directly involves a highly partisan political topic, but it is a clear, high-quality, and informative Note, thus receiving a high rating. Overall, this algorithm seems effective, and running the code to verify the algorithm's output also appears feasible.
What do I think about the algorithm?
What impresses me most when analyzing this algorithm is its complexity. There is the "academic paper version," which uses gradient descent to find the best fit for a five-term vector and matrix equation, and then there is the real version: a complicated series of execution passes, with many different runs and many arbitrary coefficients along the way.
Even the academic paper version hides considerable complexity underneath. The equation it optimizes is degree 4 (because there is a quadratic f_u · f_n term in the prediction formula, and the cost function measures squared error). While minimizing a quadratic over an arbitrary number of variables almost always has a unique solution that you can compute with fairly basic linear algebra, minimizing a degree-4 function over many variables often has many local minima, so different runs of a gradient descent algorithm may arrive at different answers. Small changes to the input can flip the descent from one local minimum to another, significantly altering the output.
The difference between this algorithm and the ones I am used to working with (like quadratic funding) feels to me like the difference between an economist's algorithm and an engineer's algorithm. An economist's algorithm, at its best, values simplicity, is relatively easy to analyze, and has clear mathematical properties showing why it is optimal (or least bad) for the task at hand, ideally along with proofs of how much damage someone can do by trying to exploit it. An engineer's algorithm, on the other hand, is arrived at through iterative trial and error, seeing what works and what does not in the environment the engineer operates in. An engineer's algorithm is pragmatic and gets the job done; an economist's algorithm does not completely fall apart when confronted with unexpected situations.
Or, as the respected internet philosopher roon (also known as tszzl) said on a related topic:
Of course, I would say that the "theoretical aesthetics" of cryptocurrency are necessary because they can accurately distinguish between protocols that truly do not require trust and those that look good, seem to work well on the surface, but actually require trust in some centralized participants, or worse, may be outright scams.
Deep learning works when it works, but it has inevitable weaknesses against all kinds of adversarial machine learning attacks. Technical traps and highly abstracted ladders, when done well, can be quite resistant to such attacks. So I am left with a question: can we turn Community Notes itself into something more akin to an economist's algorithm?
To understand what this means in practice, let’s explore an algorithm I designed a few years ago for a similar purpose: Pairwise-bounded quadratic funding (a new quadratic funding design).
The goal of Pairwise-bounded quadratic funding is to fill a gap in "regular" quadratic funding, where even if two participants collude, they can contribute very high amounts to a fake project, return the funds to themselves, and receive large subsidies that deplete the entire funding pool. In Pairwise-bounded quadratic funding, we assign a limited budget M for each pair of participants. The algorithm iterates through all possible pairs of participants, and if the algorithm decides to add a subsidy to a project P because both participant A and participant B support it, that subsidy is deducted from the budget allocated to that pair (A, B). Therefore, even if k participants collude, the maximum amount they can steal from the mechanism is k(k-1)M.
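To illustrate the idea, here is a minimal sketch in Python. It uses the standard decomposition of the quadratic funding matching amount into per-pair terms sqrt(c_i · c_j), and caps each pair's total subsidy across all projects at M. The hard cap below is a simplification for illustration; it is not necessarily the exact damping formula used in production pairwise-bounded implementations.

```python
# Minimal sketch of pairwise-bounded matching: each pair of contributors has a
# budget M, and the per-pair subsidy terms sqrt(c_i * c_j) are scaled down so
# that no pair generates more than M of subsidy in total. Illustrative only.
from itertools import combinations
from math import sqrt

def pairwise_bounded_subsidies(contributions, M):
    """contributions: dict mapping project -> dict mapping participant -> amount."""
    # Total sqrt(c_i * c_j) each pair generates across all projects.
    pair_totals = {}
    for contribs in contributions.values():
        for (a, ca), (b, cb) in combinations(sorted(contribs.items()), 2):
            pair_totals[(a, b)] = pair_totals.get((a, b), 0.0) + sqrt(ca * cb)

    # Scale each pair's terms so its total subsidy never exceeds its budget M.
    subsidies = {p: 0.0 for p in contributions}
    for project, contribs in contributions.items():
        for (a, ca), (b, cb) in combinations(sorted(contribs.items()), 2):
            scale = min(1.0, M / pair_totals[(a, b)])
            subsidies[project] += scale * sqrt(ca * cb)
    return subsidies

# Two honest contributors versus a colluding pair recycling money through a
# fake project: the colluders' extractable subsidy is bounded by M.
print(pairwise_bounded_subsidies(
    {"real": {"alice": 100, "bob": 100}, "fake": {"carol": 10_000, "dave": 10_000}},
    M=50,
))
```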
This form of algorithm is not applicable in the Community Notes context, because each user casts very few votes: on average, any two users have almost no rated Notes in common, so simply looking at each pair of users individually does not let the algorithm learn the users' polarities. The whole point of the machine learning model is to "fill in" the matrix from source data that is far too sparse to analyze directly in this way. The challenge of this approach is that extra effort is needed to ensure the results remain stable against a small number of bad votes.
Can Community Notes really resist left and right extremes?
We can analyze whether the Community Notes algorithm can actually resist extremes, meaning whether it performs better than a naive voting algorithm. Such a voting algorithm has already resisted extremes to some extent: a post with 200 likes and 100 dislikes performs worse than one with only 200 likes. But does Community Notes do better?
From an abstract algorithmic perspective, it is not obvious. Why couldn't a post with a high average rating but polarized votes end up with both a strong polarity and a high usefulness? The idea is that if the votes conflict, the polarity term should "absorb" the characteristics of the Note that cause the disagreement, but does it actually do that?
To check this, I ran my own simplified implementation for 100 rounds. The average results are as follows:
In this test, "good" Notes received a +2 rating among users of the same political faction and a +0 rating among users of the opposing political faction, while "good but more extreme-leaning" Notes received a +4 rating among users of the same faction and a -2 rating among users of the opposing faction. Although the average scores are the same, the polarities differ. In fact, the average usefulness of "good" Notes seems to be higher than that of "good but more extreme-leaning" Notes.
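To make the experimental setup concrete, here is roughly how such a test could be constructed on top of the fit_model sketch shown earlier. The rating values follow the description above; the group sizes, the number of Notes each user rates, and the share of polarizing Notes are invented for illustration.

```python
# Rough sketch of the experiment described above, reusing fit_model from the
# earlier sketch. Rating values follow the text (+2 / 0 for "good" Notes,
# +4 / -2 for "good but more polarizing" Notes); everything else is made up.
import numpy as np

rng = np.random.default_rng(1)
n_users, n_notes = 100, 20
user_side = np.array([0] * 50 + [1] * 50)        # two political factions
note_side = rng.integers(0, 2, n_notes)          # which faction each Note appeals to
note_is_polarizing = rng.random(n_notes) < 0.5   # "good" vs "good but more polarizing"

ratings = []
for u in range(n_users):
    for n in rng.choice(n_notes, size=8, replace=False):  # each user rates a few Notes
        same_side = user_side[u] == note_side[n]
        if note_is_polarizing[n]:
            value = 4 if same_side else -2
        else:
            value = 2 if same_side else 0
        ratings.append((u, n, value))

mu, i_u, i_n, f_u, f_n = fit_model(ratings, n_users, n_notes)
print("average usefulness of good Notes:      ", i_n[~note_is_polarizing].mean())
print("average usefulness of polarizing Notes:", i_n[note_is_polarizing].mean())
```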
An algorithm that is closer to an "economist's algorithm" would have a clearer story explaining how it punishes extremism.
How useful is all this in high-risk situations?
We can gain some insights by observing a specific case. About a month ago, Ian Bremmer complained that a tweet about Chinese government officials had a highly critical Community Note added to it, but that Note had been deleted.
This is a daunting task. Designing mechanisms in an Ethereum community environment is one thing, where the biggest complaint might just be $20,000 flowing to an extreme Twitter influencer. But when it involves political and geopolitical issues that affect millions of people, the situation is entirely different, and everyone tends to reasonably assume the worst motives. However, if mechanism designers want to make a significant impact on the world, interacting with these high-risk environments is essential.
In the case of Twitter, there is a clear reason to suspect that centralized manipulation led to the deletion of the Note: Elon Musk has many business interests in China, so it is possible that Elon Musk pressured the Community Notes team to intervene in the algorithm's output and delete this specific Note.
Fortunately, the algorithm is open-source and verifiable, so we can actually dig into it! Let’s do that. The URL of the original tweet is https://twitter.com/MFA_China/status/1676157337109946369. The number 1676157337109946369 at the end is the tweet ID. We can search the downloadable data for that ID and identify the specific row in the spreadsheet that has the aforementioned Note:
Here, we have the ID of the Note itself, 1676391378815709184. Then we search for that ID in the scored_notes.tsv and note_status_history.tsv files generated by running the algorithm. We get the following results:
The second column in the first output is the Note's current rating. The second output shows the Note's history: its current status is in the seventh column (NEEDS_MORE_RATINGS), and its first status that was not NEEDS_MORE_RATINGS is in the fifth column (CURRENTLY_RATED_HELPFUL). Thus we can see that the algorithm itself first displayed the Note and then removed it once its rating declined slightly; there is no sign of centralized intervention.
We can also look at the votes themselves to view the issue from another angle. We can scan the ratings-00000.tsv file to separate all ratings for that Note and see how many rated it as HELPFUL and NOT_HELPFUL:
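A sketch of this kind of check, using pandas, is below. The column names (noteId, helpfulnessLevel, createdAtMillis) are assumptions about the public dataset's schema; adjust them to whatever the downloaded file actually contains.

```python
# Sketch: tally HELPFUL vs NOT_HELPFUL ratings for one Note in the public data.
# Column names are assumed; check the schema of the file you actually download.
import pandas as pd

NOTE_ID = 1676391378815709184
df = pd.read_csv("ratings-00000.tsv", sep="\t")
note_ratings = df[df["noteId"] == NOTE_ID]

# Overall tally of ratings on this Note.
print(note_ratings["helpfulnessLevel"].value_counts())

# Tally of only the earliest 50 ratings, sorted by timestamp.
first_50 = note_ratings.sort_values("createdAtMillis").head(50)
print(first_50["helpfulnessLevel"].value_counts())
```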
However, if we sort them by timestamp and look at the first 50 votes, we find 40 HELPFUL votes and 9 NOT_HELPFUL votes. Therefore, we arrive at the same conclusion: the initial audience for the Note rated it more positively, while the later audience rated it lower, causing its score to start high and decline over time.
Unfortunately, the exact way the Note changed status is difficult to explain: it is not a simple case of "previously rated above 0.40, now rated below 0.40, so it was removed." Instead, a large number of NOT_HELPFUL ratings triggered one of the special conditions, raising the usefulness threshold that the Note needed to stay above.
This is another great learning opportunity, and the lesson is: keeping a credibly neutral algorithm genuinely credible requires simplicity. If a Note goes from being accepted to not accepted, there should be a simple story explaining why.
Of course, there is a completely different way to manipulate this voting: Brigading. Someone who sees a Note they disagree with can call upon a highly engaged community (or worse, a large number of fake accounts) to rate it as NOT_HELPFUL, and it may not take many votes to change the Note from "useful" to "extreme." To properly reduce the algorithm's vulnerability to such coordinated attacks, more analysis and work are needed. One possible improvement would be to not allow any user to vote on any Note, but rather to randomly assign Notes to raters using a "for you" algorithm, allowing raters to only rate those Notes they are assigned.
Is Community Notes not "brave" enough?
The main criticism of Community Notes that I have seen is that it does not do enough. I have seen two recent articles making this point. Quoting one of them:
The program is severely limited in that to make Community Notes public, there must be a broadly accepted consensus among people from various political factions.
"It has to have ideological consensus," he said. "This means that both leftists and rightists must agree that the annotation must be attached to that tweet."
He said that, essentially, it needs to "reach a cross-ideological agreement on the truth, which is nearly impossible to achieve in an environment of increasing partisan conflict."
This is a tricky issue, but ultimately I tend to think it is better to let ten misleading tweets spread freely than to have one tweet unfairly annotated. We have seen years of fact-checking that is "brave" in this way, coming from the perspective of "we actually know the truth, and we know one side lies more often than the other," and what has that gotten us?
Honestly, there is fairly widespread distrust of the very concept of fact-checking. One strategy here is to say: ignore the critics, remember that the fact-checking experts know the facts better than any voting system, and stay the course. But going all-in on this approach seems risky. There is value in building a cross-tribal institution that is at least somewhat respected by everyone. As with William Blackstone's maxim and the courts, I feel that maintaining this respect requires a system whose mistakes are errors of omission rather than errors of commission. So, to me, it seems valuable for at least one major organization to take this different path and treat its rare cross-tribal respect as a precious resource.
Another reason I think it is fine for Community Notes to be conservative is that I do not believe every misleading tweet, or even most misleading tweets, should receive a corrective annotation. Even if fewer than one percent of misleading tweets get an annotation providing context or a correction, Community Notes still provides an extremely valuable service as an educational tool. The goal is not to correct everything; rather, the goal is to remind people that multiple viewpoints exist, that posts which seem compelling and engaging in isolation are often quite wrong, and that, yes, you can usually do a basic internet search to verify that they are wrong.
Community Notes cannot become, nor is it intended to be, a panacea for all problems in public epistemology. Whatever problems it does not solve, there is plenty of room for other mechanisms to fill the gaps, whether they are novel gadgets like prediction markets or established organizations that hire full-time staff with domain expertise to try to fill these gaps.
Conclusion
Community Notes is not only an engaging social media experiment but also an example of an emerging type of mechanism design: consciously attempting to identify extremes and favoring mechanisms that promote cross-boundary dialogue rather than perpetuating divisions.
Two other examples I know of in this category are: (i) the pairwise quadratic funding mechanism used in Gitcoin Grants, and (ii) Polis, a discussion tool that uses clustering algorithms to help communities identify universally popular statements across typically divergent viewpoints. This field of mechanism design is valuable, and I hope to see more academic work in this area.
The kind of algorithmic transparency that Community Notes offers is not quite full-blown decentralized social media: if you disagree with how Community Notes works, there is no way to view the same content through a different algorithm. But it is the closest thing we will see from any hyperscale application in the coming years, and we can see that it already provides a lot of value, both by preventing centralized manipulation and by ensuring that platforms that do not engage in such manipulation get credit for it.
I look forward to seeing the development and growth of Community Notes and many similar-spirited algorithms over the next decade.