A scalable Filecoin indexer solution designed for heavy loads

FilecoinNetwork
2023-04-07 18:03:03
The most urgent goal on the road to expansion is to handle the increasing influx of load.

Source: Filecoin Network

This article describes a simple strategy for distributing a large influx of indexing load across a pool of indexer nodes, while allowing that pool to grow as capacity needs increase.

The ultimate goal of indexer scaling is to reach 10^15 indexes. This is not the byte size of stored data, but the number of stored indexes. An index is essentially a mapping that describes the relationship between the CID identifier and the content provider's data. The actual data scale will be much larger than this. Currently, we can handle about 10^12 indexes, and over time, we will progress towards the final scaling goal through a series of steps.
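To make the notion of an index concrete, the minimal sketch below models an index as a mapping from a CID to the provider records that can serve it. The names `ProviderRecord` and `IndexStore`, and their fields, are hypothetical and chosen only for illustration; they are not the storetheindex data model.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ProviderRecord:
    """Hypothetical record: who holds the content and how to retrieve it."""
    provider_id: str
    metadata: bytes  # opaque, provider-defined retrieval hints


class IndexStore:
    """Toy in-memory index: one entry per (CID, provider record) pair."""

    def __init__(self) -> None:
        self._by_cid = defaultdict(set)

    def put(self, cid: str, record: ProviderRecord) -> None:
        # Storing the same pair twice does not create a second index.
        self._by_cid[cid].add(record)

    def find(self, cid: str) -> List[ProviderRecord]:
        # Unknown CIDs yield an empty list rather than an error.
        return sorted(self._by_cid.get(cid, ()), key=lambda r: r.provider_id)

    def count(self) -> int:
        # Number of stored indexes, not the byte size of the indexed data.
        return sum(len(records) for records in self._by_cid.values())
```

The `count` method reflects the distinction drawn above: the scaling target counts mappings, not bytes of content.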

At present, most of the indexing load consists of incoming indexing data. The new data may exceed the capacity (rate and quantity) that a single indexer can handle, and it is increasing rapidly. Therefore, the most urgent goal on the path to scaling is to manage the increasing influx load.

Solution: A Simple Strategy for Handling Index Influx

Data Influx

Data influx occurs when an indexer receives an "announce" message from a publisher, announcing that there is new indexing data available. In response, the indexer retrieves all the indexing data from the publisher that has not yet been fetched. As the number of publishers increases, at some point, a single indexer node will not be able to keep up with the rate of new indexing data being published and may not have enough storage space to store all this data.
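The "fetch everything not yet fetched" step can be sketched as a walk over a chain of advertisements. This is a minimal illustration that assumes each advertisement links to its predecessor and that the announce message names the newest one; the function name, the ID format, and the `previous_of` callback are hypothetical.

```python
from typing import Callable, List, Optional


def unseen_advertisements(
    head: str,
    last_seen: Optional[str],
    previous_of: Callable[[str], Optional[str]],
) -> List[str]:
    """Return advertisement IDs newer than last_seen, oldest first.

    head        -- ID named in the announce message (hypothetical format)
    last_seen   -- newest ID this indexer already ingested, or None
    previous_of -- fetches an advertisement from the publisher and returns
                   the ID of its predecessor, or None at the chain start
    """
    pending = []
    current = head
    while current is not None and current != last_seen:
        pending.append(current)
        current = previous_of(current)
    pending.reverse()  # ingest in publication order
    return pending
```

For example, with a chain `ad1 <- ad2 <- ad3` and `last_seen="ad1"`, an announce for `ad3` yields `["ad2", "ad3"]`. The work per announce grows with how far the indexer has fallen behind, which is why a single node eventually cannot keep up as publishers multiply.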

Distributing Influx Load

The strategy for scaling indexers to handle ingestion load is based on a simple principle: distribute the incoming indexing load across a pool of indexer nodes. This allows nodes to be added as capacity needs grow, without having to move data around to rebalance. Different content publishers are assigned to different indexer nodes, so that each node handles a portion of the incoming load. The assignment is made by a separate lightweight service called the Assigner Service, which is not part of the critical ingestion path.

When an indexer reaches its configured storage limit, it will stop accepting new indexing data, while other indexers in the pool will resume accepting data from the publishers assigned to the full indexer. If storage capacity and influx load distribution needs increase, more indexer nodes will be added to the pool.

The three main components of this scaling strategy are:

  • Assigner Service: It assigns publishers to indexers.

  • Indexer Frozen Mode: In this operational mode, new content will not be indexed.

  • Handoff of Publisher Tasks: Reassigning the publisher tasks of frozen indexers to active indexers, which resume indexing from the point where the frozen indexer stopped.

This article will summarize these components. More information can be found in the design document and design presentation.

Pros and Cons of the Scaling Strategy

Advantages:

  • Less synchronization work: No need for every indexer to synchronize with every publisher.

  • Metadata is not redundantly sent to multiple indexers (similar to key sharding): Metadata only exists on the indexer processing the provider.

  • Indexers do not share data. They each manage their own publisher chains.

  • No need to read advertisements just to check providers, similar to provider sharding.

  • Indexers can have different storage capacities.

  • No consensus mechanism is required.

  • Influx load can be redistributed without moving data between indexers.

Disadvantages:

  • Uneven distribution: Some publishers may publish more indexing data than others.

  • Query requests need to be distributed and merged: Query requests are sent to all indexers, and the responses are merged into a single response for the client.

  • Changes in providers can lead to duplicate indexing (unlike provider sharding).

  • Adding indexers will not take effect immediately unless an existing indexer reaches its storage capacity limit.
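The query fan-out listed among the disadvantages can be sketched as a scatter-gather step. This is a minimal illustration, not the actual query path: each indexer is represented by a hypothetical lookup function that returns provider IDs for a CID, and merging is a simple de-duplicating union.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List

Lookup = Callable[[str], List[str]]  # hypothetical: CID -> provider IDs


def find_across_pool(cid: str, indexers: Iterable[Lookup]) -> List[str]:
    """Send the query to every indexer and merge the answers."""
    lookups = list(indexers)
    if not lookups:
        return []
    # Query all indexers concurrently; map preserves the input order.
    with ThreadPoolExecutor(max_workers=len(lookups)) as pool:
        responses = list(pool.map(lambda lookup: lookup(cid), lookups))
    merged = set()
    for providers in responses:
        merged.update(providers)
    return sorted(merged)  # de-duplicated, deterministic order
```

De-duplication matters here because the same provider can appear in more than one indexer's response, for instance after a provider change causes duplicate indexing.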

The overall benefit of this solution is that it is relatively simple to implement and removes the ingestion bottleneck that limits scaling.

Assigner Service

The Assigner Service (AS) is responsible for assigning publishers to indexers in its configured indexer pool. For an indexer pool, it runs as a single instance on the same network where its managed indexers are located. An indexer can only be a member of one assigner service's indexer pool.

In addition to assigning new publishers to indexers, the Assigner Service also detects whether indexer nodes have entered frozen mode and is responsible for reassigning publishers from frozen indexers to non-frozen indexers. The Assigner Service also republishes direct HTTP announcements over the gossip pubsub channel, so that all indexers in the pool can receive them.

The Assigner Service is intended for use within a single private deployment. This rests on several assumptions: publishers can be assigned to any indexer, all indexers' management APIs run on a private network (or a similarly protected network), and there is no established method or protocol for different parties to manage which nodes are added to or removed from the pool.


Assigning a Publisher to an Indexer

The Assigner Service listens for gossip-sub and direct HTTP messages announcing that new advertisements are available. It reads the publisher information from each message and determines whether the publisher has already been assigned to the required indexers. If not, the Assigner Service chooses the indexer with the fewest assignments and assigns the publisher to it. Once assigned, the indexer receives announcements from the publisher and handles the incoming data on its own.

The Assigner Service handles offline indexers in a way that avoids over-assigning tasks within the indexer pool. It also supports configuration options for assigning specific publishers to specific indexers.
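The assignment logic described in this section can be sketched as follows. This is a simplified illustration under stated assumptions, not the Assigner Service implementation: the `Assigner` class, its fields, and the way offline indexers are skipped are hypothetical, and each publisher is assigned to a single indexer (no replication).

```python
from typing import Dict, List, Optional, Set


class Assigner:
    """Toy assigner: tracks which publishers each indexer handles."""

    def __init__(self, indexers: List[str], preset: Optional[Dict[str, str]] = None):
        self.assigned: Dict[str, Set[str]] = {name: set() for name in indexers}
        self.offline: Set[str] = set()
        # Optional fixed mapping, publisher -> indexer; assumes every
        # indexer named here is a member of the pool.
        self.preset = dict(preset or {})

    def indexer_for(self, publisher: str) -> Optional[str]:
        for name, publishers in self.assigned.items():
            if publisher in publishers:
                return name
        return None

    def on_announce(self, publisher: str) -> Optional[str]:
        """Handle one announce message; return the responsible indexer."""
        current = self.indexer_for(publisher)
        if current is not None:
            # Already assigned, even if that indexer is offline right now,
            # so the publisher is never assigned a second time.
            return current
        target = self.preset.get(publisher)
        if target is None:
            candidates = [n for n in self.assigned if n not in self.offline]
            if not candidates:
                return None  # nobody available; retry on a later announce
            # Fewest assigned publishers wins; ties keep pool order.
            target = min(candidates, key=lambda n: len(self.assigned[n]))
        self.assigned[target].add(publisher)
        return target
```

Because the assigner only reacts to announce messages and does not touch index data, it stays off the ingestion path: after `on_announce` returns, the chosen indexer syncs with the publisher directly.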

Further Reading:

  • No Persisted Assignment State (https://github.com/ipni/storetheindex/blob/main/doc/scaling-design-for-indest.md#no-persisted-assignment-state) means that indexers can stop or restart at any time.

  • An Indexer Pool (https://github.com/ipni/storetheindex/blob/main/doc/scaling-design-for-indest.md#indexer-pool) is a collection of indexer nodes in a single deployment.

  • Assignment Replication (https://github.com/ipni/storetheindex/blob/main/doc/scaling-design-for-indest.md#replication) assigns publishers to multiple indexers.

Indexer Frozen Mode

An indexer automatically enters "frozen" mode after reaching the storage limit defined by the FreezeAtPercent configuration setting (https://pkg.go.dev/github.com/ipni/storetheindex/config#Indexer). In this operational mode, the indexer no longer stores new indexing data but continues to process updates and deletions of existing indexing data. A frozen indexer will not accept new publisher tasks. Internally, the indexer keeps tracking every advertisement chain it has read, so that it can still ingest the advertisements that carry updates and removals. The indexer also continues to respond to queries for indexing data.
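The freeze rule can be sketched in two small functions: one that compares disk usage against the configured percentage, and one that decides which advertisements a frozen indexer still processes. Only the setting name FreezeAtPercent comes from the configuration reference above; the function names, the `AdKind` categories, and the exact comparison are assumptions made for illustration.

```python
from enum import Enum


class AdKind(Enum):
    """Hypothetical classification of what an advertisement does."""
    NEW_CONTENT = "new"
    UPDATE = "update"
    REMOVAL = "removal"


def should_freeze(used_bytes: int, total_bytes: int, freeze_at_percent: float) -> bool:
    """True once disk usage reaches the configured percentage."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * used_bytes / total_bytes >= freeze_at_percent


def accepts(kind: AdKind, frozen: bool) -> bool:
    """A frozen indexer skips new content but still applies updates and removals."""
    return not frozen or kind is not AdKind.NEW_CONTENT
```

In a real process the two byte counts could come from Python's `shutil.disk_usage(path)`, which reports total, used, and free bytes for the filesystem containing `path`.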

An indexer can also be manually frozen through its admin API. This can be done to pause data ingestion temporarily until the indexer's storage capacity is increased or, when the indexer is managed by an Assigner Service, so that other indexer nodes take over the ongoing indexing work.

Further Reading:

  • Disk Usage Monitoring (https://github.com/ipni/storetheindex/blob/main/doc/scaling-design-for-indest.md#disk-usage-monitoring) is the responsibility of each indexer.

  • Freeze capability does not depend on AS (https://github.com/ipni/storetheindex/blob/main/doc/scaling-design-for-indest.md#freeze-independent-of-assigner).

  • The ability to unfreeze (https://github.com/ipni/storetheindex/blob/main/doc/scaling-design-for-indest.md#unfreeze) allows the indexer to resume indexing work.

Publisher Handoff

The Assigner Service periodically checks the indexers. If it finds that an indexer is frozen, it reassigns the publishers assigned to that frozen indexer to other, active indexers, which continue the work from where the frozen indexer stopped. During the handoff process, the active indexers also obtain provider information and related additional information from the frozen indexer.

The Assigner Service decides which indexer receives each handed-off publisher, following the same logic it uses for assigning new publishers. Each publisher is handed off individually, so the tasks of the frozen indexer are spread across the available indexers in the pool.
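The handoff step can be sketched as a loop that moves publishers off frozen indexers one at a time, reusing the fewest-assignments rule. This is a minimal sketch with hypothetical names and data shapes; it models only who is responsible for new content and omits the transfer of provider information from the frozen indexer.

```python
from typing import Dict, List, Set, Tuple


def hand_off(
    assigned: Dict[str, Set[str]],
    frozen: Set[str],
) -> List[Tuple[str, str, str]]:
    """Move publishers off frozen indexers, one publisher at a time.

    assigned -- indexer name -> publishers it currently handles (mutated)
    frozen   -- names of indexers found frozen during the periodic check
    Returns (publisher, from_indexer, to_indexer) for each completed handoff.
    """
    moves: List[Tuple[str, str, str]] = []
    active = [name for name in assigned if name not in frozen]
    if not active:
        return moves  # no capacity left; the pool needs another indexer
    for source in sorted(name for name in frozen if name in assigned):
        for publisher in sorted(assigned[source]):
            # Same rule as for new publishers: fewest assignments wins.
            target = min(active, key=lambda n: len(assigned[n]))
            assigned[target].add(publisher)
            # The frozen indexer keeps the data it already holds; only
            # responsibility for new content moves to the target.
            assigned[source].discard(publisher)
            moves.append((publisher, source, target))
    return moves
```

Handling publishers individually means a frozen indexer's load is spread over several active indexers instead of landing on a single one, and a newly added empty indexer naturally receives the first handoffs.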

Further Reading:

  • The Assigner Service can resume incomplete handoff tasks (https://github.com/ipni/storetheindex/blob/main/doc/scaling-design-for-indest.md#resuming-incomplete-handoff).

  • Publisher data is distributed between frozen and active indexers (https://github.com/ipni/storetheindex/blob/main/doc/scaling-design-for-indest.md#publisher-data-spread-across-frozen-and-active-indexers).

Setting Up an Indexer Pool with Assigner Service

The process of setting up an indexer pool with an Assigner Service is described here (https://github.com/ipni/storetheindex/blob/main/doc/assigner-deployment.md#setting-up-indexer-pool-with-assigner-service) and can be summarized in the following steps:

  • Deploy Indexers (https://github.com/ipni/storetheindex/blob/main/doc/assigner-deployment.md#deploy-indexers)

  • Deploy Assigner Service (https://github.com/ipni/storetheindex/blob/main/doc/assigner-deployment.md#deploy-assigner-service)

  • Deploy additional Indexers as needed (https://github.com/ipni/storetheindex/blob/main/doc/assigner-deployment.md#example-assigner-service-configuration)

A configuration template file for the Assigner Service is also provided (https://github.com/ipni/storetheindex/blob/main/doc/assigner-deployment.md#example-assigner-service-configuration).
