Decentralized Knowledge Graph Collaborative Platform Construction Practice
This article is sourced from the EpiK Knowledge Protocol.
On January 10, the "2021 Open Source Knowledge Movement" event hosted by the EpiK Knowledge Protocol brought the industry a feast of ideas on open, interconnected knowledge graphs. It attracted heavyweight guests including Xing Chunxiao, Vice Dean of the Institute of Information Technology at Tsinghua University; Wang Haofen, Chairman of the Knowledge Graph SIG of the China Computer Federation, renowned knowledge graph expert, and a main initiator of OpenKG; and Wang Huizhen, Deputy Director of the Natural Language Processing Laboratory at Northeastern University and founder of Xiao Niu Si Tuo.
At the conference, EpiK's concept and practice of building an open knowledge base that is jointly built, shared, and mutually beneficial, through a blockchain-based decentralized collaboration model, was a core highlight and drew high praise from the attending experts and scholars.
The following article will comprehensively analyze the EpiK Open Source Knowledge Movement from the following aspects:
Why build a decentralized knowledge graph collaboration platform
Challenges faced by the open source knowledge movement
EpiK Knowledge Protocol solution
Who can participate in this open source knowledge movement
1. Why build a decentralized knowledge graph collaboration platform
Currently, the era of artificial intelligence has entered its second half. We are no longer satisfied with unexplainable black-box models; giving AI cognitive abilities is a bottleneck that must be overcome. On the road to broadening AI cognition, knowledge graphs, as an important medium for machines to understand human knowledge, are becoming crucial infrastructure for the AI era.
However, building large-scale knowledge graph infrastructure involves massive amounts of content from many fields and demands high data quality, so it requires organizing a large workforce across fields to build it together. Yet the trust cost of co-constructing knowledge graphs is extremely high: mutual distrust between enterprises and between countries leads to a great deal of redundant labor. Hence the demand for a knowledge graph co-construction platform has emerged, and how contributors can share in the benefits on such a platform is a problem that must be solved.
In 2020, decentralized blockchain storage technology matured, making it possible to build a permissionless, tamper-proof, and traceable public database. A collaborative platform for co-constructing, sharing, and benefiting from knowledge graphs finally had a practical foundation.
2. Challenges faced by the open source knowledge movement
The price of Bitcoin keeps setting record highs, and applications such as DeFi, IPFS, and DAOs are appearing one after another, revealing ever more possibilities for blockchain. Yet building a blockchain-based platform for co-constructing, sharing, and benefiting from knowledge graphs is no easy task; it faces a series of challenges:
First, how to achieve co-construction? Organizing people from many knowledge fields to jointly build a high-quality, large-scale knowledge graph requires effective incentive mechanisms and strict data quality acceptance mechanisms.
Second, how to achieve sharing? Sharing knowledge graph data faces the problem of trustworthy storage; a tamper-proof public storage platform that all contributors can access without permission is a necessary path.
Third, how to achieve mutual benefit? Knowledge graph data can be copied and disseminated at zero cost, so finding efficient monetization methods for contributors is the driving force behind sustained collaboration.
Based on this, EpiK proposes a complete solution leveraging three cutting-edge blockchain technology branches: "decentralized storage, decentralized autonomous organizations, and token economic models."
3. EpiK Knowledge Protocol solution
In response to the pain points of decentralized knowledge graph construction, EpiK deeply analyzes the application of blockchain technology and outlines a technical architecture for decentralized knowledge graph construction based on blockchain's underlying logic.
The core part is the knowledge storage section, where we introduce three important components:
Storage, providing shared and trustworthy storage, where data cannot be arbitrarily tampered with, and access cannot be denied;
Incentive, providing incentives for various contributor roles within the ecosystem, ensuring that all parties can maximize their own interests while collaboratively building a high-quality knowledge graph;
DAO, allowing the community to participate in the governance of system parameters and dynamically adjust according to different development stages.
3.1 Storage
EpiK's Storage component is built on the IPFS protocol. IPFS is a distributed network transmission protocol that connects participating computing devices into a single file system. A file submitted to the IPFS network is split into multiple blocks, each with its own independent hash value. Using a Merkle DAG data structure, the split blocks are organized and linked under a single root node, producing a unique root hash that serves as the file's hash value. The roots of multiple files are in turn organized into a larger Merkle structure, forming a unique global Root Hash.
This structure has the advantage that duplicate data blocks are not stored redundantly, and nodes only need to synchronize the Root Hash to maintain a consistent global view of the files. Each node can freely choose which data blocks to store and inform other nodes about the data blocks they have stored. Each node will record the storage status of other nodes in the DHT, making it easy to quickly identify which nodes have the corresponding data when access requests are received.
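The chunk-and-hash scheme above can be sketched in a few lines of Python. This is a toy illustration only: real IPFS builds a Merkle DAG with content identifiers (CIDs) and 256 KiB default blocks, while here we simply pair SHA-256 hashes level by level.

```python
import hashlib

CHUNK_SIZE = 4  # tiny, for illustration; IPFS defaults to 256 KiB blocks

def chunk(data: bytes) -> list[bytes]:
    """Split a file into fixed-size blocks, as IPFS does before hashing."""
    return [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]

def block_hash(block: bytes) -> str:
    return hashlib.sha256(block).hexdigest()

def merkle_root(hashes: list[str]) -> str:
    """Fold a list of block hashes into a single root hash.

    Real IPFS links blocks in a Merkle DAG; here we just pair hashes
    level by level to show how many blocks collapse into one root.
    """
    while len(hashes) > 1:
        if len(hashes) % 2:                 # duplicate the last hash on odd levels
            hashes.append(hashes[-1])
        hashes = [block_hash((a + b).encode())
                  for a, b in zip(hashes[::2], hashes[1::2])]
    return hashes[0]

blocks = chunk(b"hello world!")
root = merkle_root([block_hash(b) for b in blocks])
# identical files always yield the identical root, so duplicate blocks
# are stored once and every node can agree on one global view
assert root == merkle_root([block_hash(b) for b in chunk(b"hello world!")])
```

Because the root is a pure function of the content, two nodes need only compare root hashes to know whether their views of the file system agree.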
IPFS successfully connects honest, altruistic nodes behind a unified file system interface. But it also has practical problems: it lacks incentive and anti-cheating mechanisms, and nodes may act maliciously or go offline at any time, so storage built on IPFS alone is unreliable.
We introduce the incentive measures in section 3.2; here we briefly describe how nodes might cheat. To ensure high availability, a file is stored in multiple places across the network. Suppose two miners broadcast that they have each stored a copy of the same file, claiming two storage rewards from the system, yet in reality they share the same physical storage and only one copy of the file exists; the system should pay only one storage reward. This is a classic Sybil attack in distributed systems.
To prevent Sybil attacks, EpiK integrates two verification methods proposed by Filecoin into the Storage component: Proof-of-Replication (PoRep) and Proof-of-Spacetime (PoSt).
The role of Proof-of-Replication is to prove that the node has indeed stored a complete new copy of the original data locally as required; the role of Proof-of-Spacetime is to prove that the node continues to store a complete new copy of the original data locally.
The principle of Proof-of-Replication is to use the globally unique ID of the current node as a seed, then seal the source file using a computation-intensive encryption algorithm, and broadcast a zero-knowledge proof of the sealed data. Although the sealing process is complex, other nodes can easily verify the correctness of the sealing process.
The principle of Proof-of-Spacetime is that a node must periodically broadcast a zero-knowledge proof of the stored file against a fresh random challenge. If the proof had to be regenerated from an unsealed source file, the process would be extremely time-consuming and the node would miss the broadcast deadline. If other nodes do not receive a node's Proof-of-Spacetime in time, they conclude that it has lost the file. Therefore, to keep its proofs timely, a node cannot discard the sealed file data.
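The intuition behind sealing and timely proofs can be sketched with a toy model. This is purely illustrative: real PoRep/PoSt use SDR encoding and zk-SNARKs, not the plain iterated hashing shown here.

```python
import hashlib

SEAL_ROUNDS = 50_000  # stand-in for Filecoin's computation-heavy sealing

def seal(data: bytes, node_id: bytes) -> bytes:
    """Derive a node-unique sealed copy by iterated hashing.

    The hash chain is strictly sequential, so sealing is deliberately
    slow, while anyone already holding the sealed bytes can answer a
    challenge instantly.
    """
    h = hashlib.sha256(node_id + data).digest()
    for _ in range(SEAL_ROUNDS):
        h = hashlib.sha256(h).digest()
    return h

def respond(sealed: bytes, challenge: bytes) -> bytes:
    """Proof-of-Spacetime-style response: hash the sealed copy with a
    fresh random challenge. Cheap if you kept the sealed data; far too
    slow if you must re-seal from the source within the time window."""
    return hashlib.sha256(sealed + challenge).digest()

sealed_a = seal(b"knowledge graph block", b"node-A")
sealed_b = seal(b"knowledge graph block", b"node-B")
# the same file sealed by two node identities yields two distinct
# replicas, so one physical copy cannot answer for both identities
assert sealed_a != sealed_b
```

Seeding the seal with the node's ID is what defeats the Sybil trick described above: a miner who claims two storage deals must actually hold two differently sealed copies.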
With the storage system and verification mechanism in place, we also need to ensure that all nodes maintain data consistency, which requires all nodes to keep consistent records of what files exist and in what order they are broadcasted to the entire network.
This introduces blockchain ledger technology, where the creation of all new files, their creation order, the behavior of nodes storing files, and the behavior of nodes submitting storage proofs are all recorded in a globally consensus blockchain ledger. Each node will synchronize the complete ledger to obtain a consistent data view with the entire network. With the file content and file order established, EpiK can store knowledge graph database operation log files in the Storage component. After each node synchronizes these log files in order, they can locally restore a complete knowledge graph database that is consistent across the entire network.
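Rebuilding a consistent local database from the ordered log can be sketched as follows. The operation-log format and the triple names are hypothetical, used only to show that replaying the same ordered log always yields the same graph on every node.

```python
# Hypothetical log format: each entry adds or removes one
# (subject, predicate, object) triple, in ledger order.
def replay(log: list[tuple[str, str, str, str]]) -> set[tuple[str, str, str]]:
    """Rebuild the knowledge graph by applying logged operations in
    the order fixed by the blockchain ledger."""
    graph: set[tuple[str, str, str]] = set()
    for op, s, p, o in log:
        if op == "add":
            graph.add((s, p, o))
        elif op == "del":
            graph.discard((s, p, o))
    return graph

log = [
    ("add", "EpiK", "uses", "IPFS"),
    ("add", "EpiK", "rewards", "miners"),
    ("del", "EpiK", "uses", "IPFS"),
    ("add", "EpiK", "uses", "blockchain"),
]
graph = replay(log)
# every node that replays the same ordered log derives the same graph
```

Because the ledger fixes both the content and the order of the log files, determinism of `replay` is all that is needed for network-wide consistency.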
Currently, over 9,000 nodes are registered in the EpiK network, with over 5,000 of them successfully providing storage. Under the current settings, each file is stored in 3,000 copies across the network; when a file has fewer than 3,000 copies, nodes that store new copies of it receive additional incentives. This makes it extremely difficult for an attacker to take down the entire EpiK knowledge graph database with a DDoS.
Moreover, with the entire network synchronizing the same ledger information, hackers would need to control over 51% of the nodes in the entire network to tamper with the ledger, resulting in extremely high attack costs.
3.2 Incentives
EpiK categorizes knowledge graph contributors into three roles: data miners, domain experts, and bounty hunters, plus a consumer role called the data gateway. The EpiK network issues a fixed number of point rewards every day. How these points are reasonably distributed among the three contributor roles to incentivize them to build the public knowledge graph database, and how points are reclaimed, is defined by the Incentives component.
Data miners are providers of physical devices who earn rewards by providing storage and bandwidth resources, with 75% of the daily point output belonging to the data miner group.
The more data a miner stores, the higher its earnings; the more download traffic it serves, the more it earns. Meanwhile, to prevent data miners from going offline at will, which would reduce data backups and system security, anyone wishing to become a data miner must pledge a portion of their points. Point rewards are distributed automatically by blockchain contracts, with no intermediaries.
Domain experts are contributors and verifiers of knowledge graph data and are the only group in the entire system authorized to upload it. They earn rewards by contributing high-quality knowledge graph data; 9% of the daily point output belongs to the domain expert group, with more contributions leading to higher earnings. However, to accommodate differences in data scale across fields, each expert's reward is proportional to the logarithm of the size of the data they contribute.
Of course, as the only group with upload rights, domain experts are subject to strict supervision. A new domain expert must be nominated by an existing domain expert, and the nominee must gather 100,000 supporting votes from the community, with each vote locking one point.
Once the number of votes (locked points) for a domain expert falls below 100,000, they lose their qualification. If a domain expert uploads false or junk data, the community will impose a disqualification penalty, and the person who nominated the disqualified domain expert will also face joint punishment. To encourage voting, 1% of the daily point output belongs to all users who participate in voting, with more votes leading to higher earnings.
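The vote-lock rule and the 1% voter pool can be sketched as follows. This is a minimal model: the threshold and the 1% share come from the article, while the function names and the exact proportional-split formula are illustrative assumptions.

```python
VOTE_THRESHOLD = 100_000   # locked points required to keep expert status
VOTER_SHARE = 0.01         # 1% of the daily point output goes to voters

def expert_active(locked_votes: int) -> bool:
    """An expert keeps upload rights only while at least 100,000
    points remain locked behind them as votes."""
    return locked_votes >= VOTE_THRESHOLD

def voter_rewards(daily_output: float,
                  votes_by_user: dict[str, int]) -> dict[str, float]:
    """Split the 1% voter pool in proportion to each user's locked votes
    (assumed proportional split; the article only says more votes earn more)."""
    pool = daily_output * VOTER_SHARE
    total = sum(votes_by_user.values())
    return {u: pool * v / total for u, v in votes_by_user.items()}
```

Locking one point per vote means voters put capital at stake behind an expert, which is what gives the joint-punishment rule its teeth.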
Before introducing bounty hunters, let's first explain the data gateway. The data gateway is the only way for users to access the latest first-hand knowledge graph data. Data gateways need to pledge points to obtain data access traffic; for example, pledging 1 point can yield 10MB of data access traffic per day. Therefore, the greater the demand for knowledge graph data on EpiK, the more points data gateways will pledge, increasing the demand for points and enhancing the value of the points held by contributors.
With the concept of data gateways pledging points established, we can now discuss bounty hunters. Bounty hunters are the annotators and verifiers of knowledge graph data, earning rewards by completing tasks issued by domain experts.
Bounty hunters' earnings change dynamically with the amount of points pledged by data gateways. If gateways pledge more points, it indicates that the quality of EpiK's knowledge graph data is good, so the system incentivizes data miners to add bandwidth and make data access smoother: more of the floating 15% of the daily point output is allocated to the data miner group. If gateways pledge fewer points, it indicates the data quality needs improvement, and more of that 15% is allocated to bounty hunters, drawing more people into improving data quality.
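The distribution rules described above — fixed shares, log-damped expert rewards, and the floating 15% — can be sketched as follows. The daily output figure and the exact shape of the pledge-based split are assumptions; the article fixes only the percentages and the direction of the adjustment.

```python
import math

DAILY_OUTPUT = 100_000.0   # illustrative figure; the real issuance differs
MINER_SHARE, EXPERT_SHARE, VOTER_SHARE, DYNAMIC_SHARE = 0.75, 0.09, 0.01, 0.15

def expert_rewards(contributed_bytes: dict[str, int]) -> dict[str, float]:
    """Experts split their 9% pool in proportion to log(data size),
    so small fields are not drowned out by data-heavy ones."""
    weights = {e: math.log(1 + size) for e, size in contributed_bytes.items()}
    total = sum(weights.values())
    pool = DAILY_OUTPUT * EXPERT_SHARE
    return {e: pool * w / total for e, w in weights.items()}

def dynamic_split(pledge_ratio: float) -> tuple[float, float]:
    """Split the floating 15% between miners and bounty hunters.

    `pledge_ratio` (0..1) stands for how heavily gateways are pledging;
    the linear curve is our assumption — the article only says more
    pledged means more to miners, less pledged means more to hunters.
    """
    pool = DAILY_OUTPUT * DYNAMIC_SHARE
    to_miners = pool * pledge_ratio
    return to_miners, pool - to_miners
```

For example, an expert contributing a gigabyte and one contributing a megabyte differ in reward by a factor of roughly 1.5 rather than 1,000, which is exactly the damping the logarithm is meant to provide.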
In the entire ecosystem, each role maximizes its own interests through the incentive model. Data miners should provide more storage and urge domain experts to optimize the quality of knowledge graph data to earn more rewards; domain experts continuously provide updated, higher-quality data to gain higher rewards through contributions; bounty hunters complete more tasks to earn more rewards, with the invisible hand driving all parties to co-build the knowledge graph.
3.3 Decentralized Community Governance
A self-driving car cruises around looking for passengers; after a passenger gets off, the car uses its earnings to recharge at a charging station, deciding how to carry out its tasks with no external help beyond its initial programming. This is the ideal of a decentralized autonomous organization (DAO) as described by Bitcoin developer Mike Hearn: an organization that runs on smart contracts without hierarchical management.
DAO is an important extension in the development of blockchain, and the EpiK Knowledge Protocol draws on this organizational form and applies it to the construction of decentralized knowledge graphs.
EpiK has multiple DAOs, including the EpiK DAO, which governs global parameters such as modifying the profit-sharing ratios among groups; the Experts DAO, which governs internal parameters among domain experts, such as modifying the point distribution algorithm among domain experts; and the Miners DAO, which governs internal parameters among miners, such as modifying the number of backups for each file.
Roles at various levels within the DAO realize their functions in the organization through smart contracts, thereby endowing the construction of knowledge graphs with an automated process system, greatly enhancing its professionalism and efficiency. Once the DAO operates, it will liberate tremendous productive forces for the construction of a global super knowledge graph.
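A minimal sketch of such parameter governance, assuming a simple majority rule; the actual voting rules of the EpiK, Experts, and Miners DAOs are not specified in this article, so the class and its quorum are illustrative only.

```python
# Hypothetical DAO parameter store: a parameter changes only after a
# proposal clears the vote threshold.
class ParameterDAO:
    def __init__(self, params: dict[str, float], quorum: float = 0.5):
        self.params = dict(params)
        self.quorum = quorum

    def propose(self, key: str, value: float,
                votes_for: int, total_votes: int) -> bool:
        """Apply the change only if more than `quorum` of votes approve."""
        if total_votes and votes_for / total_votes > self.quorum:
            self.params[key] = value
            return True
        return False

# e.g. the Miners DAO adjusting the per-file backup count
miners_dao = ParameterDAO({"file_copies": 3000})
miners_dao.propose("file_copies", 2500, votes_for=70, total_votes=100)
```

Each of EpiK's three DAOs would hold a different parameter set (profit-sharing ratios, expert distribution algorithms, backup counts), but the governance loop is the same: propose, vote, apply on-chain.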
Relying on the three-pronged approach, EpiK's knowledge graph + blockchain model bursts forth with unprecedented vitality, establishing an open-source knowledge co-construction and sharing platform for mutual benefit.
4. Who can participate in this open source knowledge movement
The EpiK open source knowledge movement has allowed more people to see the important value of knowledge graphs for AI in the future, while also encouraging more individuals to join the EpiK co-construction and sharing initiative. In fact, EpiK is a foundational data platform where people of different identities can participate in its construction. So, who can get involved?
First, senior practitioners from various industries can sign up to become domain experts in their respective fields. One of their responsibilities is to ensure data accuracy while reasonably breaking down and assigning knowledge graph data annotation tasks to the platform, allowing users to participate in jointly maintaining the knowledge graph of these fields.
Second, EpiK introduces the role of bounty hunters to assist domain experts in completing specific tasks. EpiK bounty hunters only need to answer simple multiple-choice questions, such as Yes or No, with each answer contributing to the gradual improvement of a knowledge graph. After completing tasks, bounty hunters will receive rewards allocated by domain experts based on their labor. According to current estimates, this is no less than an hourly wage of 36 yuan. EpiK hopes to mobilize more people to participate part-time using fragmented time, while also promoting new employment opportunities in third- and fourth-tier cities.
Next, individuals can choose to become miners, simply providing the necessary storage space to become data miners. While earning rich rewards, they are also contributing to the eternal knowledge base of humanity.
Finally, there is data monetization, which has two sides: as the amount of on-chain data grows, data gateways earn compensation by providing convenient access services for the aggregated knowledge; and they can connect with application builders, helping enterprises avoid the high cost of building their own databases.
5. In conclusion
This article explains the triple construction logic of the EpiK decentralized knowledge graph open collaboration platform. Based on this, the EpiK knowledge graph library will become an important cornerstone for the future development of artificial intelligence, providing crucial data support for the implementation of future intelligent applications and promoting the continuous upgrading of data value.
The EpiK open source knowledge movement is initiating an epic evangelism from carbon-based life to silicon-based life over the next 50 years, with a path to the future of AI shining brightly.