e-Preprint archive nrpl.xyz

[Proposal] [RFC] Reticulum based Inter-instance Signaling for Distributed IPFS Archive #4

Open · opened by 0xlet0.dmrg.yokohama · edited

Summary#

Replace Jetstream polling between instances with a Reticulum mesh overlay that carries only minimal pin/unpin signals. Each instance independently manages its own IPFS pins and ATProto records. Reticulum transmits nothing but a DID, two CIDs, and a boolean. IPFS handles content storage and deduplication. ATProto handles local record publishing.

Background#

The current architecture relies on each subscriber instance continuously polling ATProto Jetstream to detect new posts from known publisher instances. This creates unnecessary traffic, adds latency proportional to the polling interval, and requires each instance to maintain awareness of all remote cursors at all times.

The community also spans intercontinental nodes, some of which operate over LoRa radio links where bandwidth is severely constrained (~5 kbps on a 125 kHz channel). A signaling layer that fits within a single LoRa packet and requires no inter-instance coordination beyond the signal itself is strongly preferred.

Proposed Architecture#

Layer Responsibilities#

Content storage and deduplication:  IPFS
Local record publishing:            ATProto PDS  (each instance, own account only)
Inter-instance signaling:           Reticulum    (pin/unpin signal only)
Access control:                     DID whitelist stored as ATProto record
Fallback sync:                      ATProto list_records  (catch-up after downtime)

Each layer does exactly one thing. Reticulum carries no content and no metadata. IPFS deduplicates automatically when multiple nodes pin the same CID. ATProto is never used for inter-instance communication.

Signal Payload#

The entire inter-instance message is:

{
  "did":  "did:plc:xxxxxxxxxxxxxxxxxxxx",
  "cid1": "baf...",
  "cid2": "baf...",
  "pin":  true
}

Approximately 150 bytes. Fits in a single LoRa packet at SF7/125 kHz. Setting pin: false signals that the content should be unpinned. This covers archive withdrawal and deletion through the same channel with no added complexity.
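
To make the size claim concrete, here is a small sketch (a hypothetical build_signal helper with placeholder DID/CID values, not part of the proposal's API) that serializes the payload and checks it fits a single raw LoRa frame (255-byte PHY payload maximum):

```python
import json

def build_signal(did: str, cid1: str, cid2: str, pin: bool) -> bytes:
    # Compact separators keep the JSON payload as small as possible.
    payload = json.dumps(
        {"did": did, "cid1": cid1, "cid2": cid2, "pin": pin},
        separators=(",", ":"),
    ).encode("utf-8")
    # Raw LoRa frames carry at most 255 bytes of payload.
    assert len(payload) <= 255, "signal exceeds a single LoRa frame"
    return payload

signal = build_signal(
    "did:plc:xxxxxxxxxxxxxxxxxxxx",   # placeholder DID from this proposal
    "baf" + "y" * 56,                 # CIDv1 base32 strings are ~59 chars
    "baf" + "z" * 56,
    True,
)
print(len(signal))  # 187 with these placeholder values
```

Real did:plc identifiers are 32 characters (8-char prefix plus 24 base32 characters), so actual payloads run a few bytes longer but remain well within one frame.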

Message Flow#

Publisher instance (authenticated via OAuth)
  1. Upload binary asset to IPFS → obtain CID pair
  2. Announce { did, cid1, cid2, pin: true } via Reticulum
  3. Create net.dmrg.archive.item record on own PDS
     { cid1, cid2, title, description, createdAt }

Subscriber instance
  On Reticulum announce received:
    4. Verify did is in whitelist; discard if not
    5. Pin cid1 and cid2 via local IPFS node
       (IPFS deduplicates if already pinned by another peer)
    6. Fetch net.dmrg.archive.item record from publisher PDS via ATProto
    7. Create mirror record on own PDS

  On unpin signal received (pin: false):
    8. Unpin cid1 and cid2 from local IPFS node
    9. Remove or tombstone mirror record on own PDS

  On restart / reconnect (fallback):
    10. Call list_records on known publisher DIDs via ATProto
    11. Pin any CIDs absent from local IPFS pin set
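
The fallback pass (steps 10–11) might look like the following sketch; `atp_client` and `ipfs` stand for assumed interfaces (list_records(repo, collection) returning record dicts, pin_ls() returning the set of locally pinned CIDs, pin(cid) pinning one), not a specific SDK:

```python
async def fallback_sync(publisher_dids, atp_client, ipfs):
    """Catch-up after downtime: re-pin any CIDs missed while offline."""
    pinned = set(ipfs.pin_ls())
    newly_pinned = []
    for did in publisher_dids:
        records = await atp_client.list_records(
            repo=did, collection="net.dmrg.archive.item"
        )
        for rec in records:
            for cid in (rec["cid1"], rec["cid2"]):
                if cid not in pinned:
                    ipfs.pin(cid)          # idempotent on the IPFS side
                    pinned.add(cid)
                    newly_pinned.append(cid)
    return newly_pinned
```

Because pinning is idempotent, running this on every restart is safe even when no signals were missed.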

Why This Division#

Reticulum is the only channel between instances. The signal it carries is intentionally minimal: it tells each subscriber what to pin or unpin, nothing more. All content retrieval is handled by IPFS's native peer-to-peer protocol once the CID is known. Metadata retrieval for mirroring is a local ATProto operation from each subscriber to the publisher's PDS; not inter-instance coordination, but a standard client fetch.

The unpin signal gives the architecture a natural deletion mechanism. When a publisher signals pin: false, subscribers unpin locally. Because IPFS content persistence depends on who is pinning, a CID that no node pins will eventually be garbage-collected across the network. No tombstone propagation protocol is needed.

Why Records Are Not Visible on Bsky#

Records posted under net.dmrg.archive.* collections are outside the app.bsky.* namespace. Bsky clients do not render them. They exist on the ATProto network and are accessible via the AT Protocol API, but are not surfaced in any Bsky feed or timeline. This is consistent with the behavior of other non-Bsky ATProto applications such as Leaflet and PCKT.

OAuth and DID Whitelist#

Why OAuth#

Instances act on behalf of real user accounts rather than operating a single bot account. This allows records to carry genuine authorship, enables per-user key rotation, and avoids coupling the instance to a single long-lived credential.

Authentication Flow#

User browser
  1. Redirected to own PDS authorization endpoint
  2. User approves scope
  3. Instance receives authorization code
  4. Instance exchanges code for access token
  5. DID extracted from token → checked against whitelist
  6. If not whitelisted: token immediately revoked, access denied
  7. If whitelisted: session stored, user may publish

Whitelist Storage#

The whitelist is stored as a net.dmrg.auth.allowlist record on the admin instance's PDS. It can be managed from any ATProto client without a separate admin UI, and changes take effect on the next fetch.

async def load_whitelist(admin_did: str, client) -> set[str]:
    # Fetch allowlist entries from the admin PDS. Assumes the whole
    # allowlist fits in one page; a production version would follow
    # the pagination cursor.
    records = await client.com.atproto.repo.list_records({
        "repo": admin_did,
        "collection": "net.dmrg.auth.allowlist"
    })
    return {r.value["did"] for r in records.records}

async def oauth_callback(code: str, state: str, client):
    session = await client.auth.oauth.callback(code, state)
    did = session.did

    whitelist = await load_whitelist(ADMIN_DID, client)

    if did not in whitelist:
        await client.auth.oauth.revoke(session.access_token)
        raise PermissionError(f"DID not in whitelist: {did}")

    return session

If the whitelist record is unreachable at login time, access is denied and no session is created (fail closed).

Optional DID Document Verification#

For additional assurance, the instance may resolve the DID document at login time to verify that the user's PDS endpoint falls within a set of trusted domains. This is operator-configurable and disabled by default.

Whitelist Cache TTL#

The whitelist is fetched from the PDS on each OAuth callback. A cached copy may be used within a configurable TTL to reduce PDS load. Suggested default: 5 minutes.
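
A minimal sketch of such a TTL cache (names hypothetical; the `fetch` callable wraps whatever actually loads the whitelist from the PDS, e.g. a synchronous adapter around load_whitelist):

```python
import time

class WhitelistCache:
    """Caches the DID whitelist for a configurable TTL (default 5 min).

    `fetch` is any zero-argument callable returning the current
    whitelist as a set of DID strings; this class only handles expiry.
    """
    def __init__(self, fetch, ttl_seconds: float = 300.0):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._cached: set[str] | None = None
        self._fetched_at = 0.0

    def get(self) -> set[str]:
        now = time.monotonic()
        if self._cached is None or now - self._fetched_at > self._ttl:
            self._cached = self._fetch()      # refresh from the PDS
            self._fetched_at = now
        return self._cached
```

Note this caches on the read path only; a stale cache can at worst admit a recently removed DID for one TTL window, which is why the suggested default is short.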

Reticulum Transport Details#

Node Discovery#

Instances discover each other via Reticulum announce(). Each instance periodically broadcasts its destination hash derived from a stable on-disk identity. Other instances register an announce handler and maintain a known_peers map.

Instance A starts → destination.announce() → propagates through mesh
Instance B receives announce → stores A's destination hash
Instance B starts → destination.announce() → propagates through mesh
Instance A receives announce → stores B's destination hash

No central registry or bootstrap server is required. Path propagation is handled by Reticulum transport nodes in the mesh.
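
The bookkeeping such an announce handler would maintain can be sketched as follows; the RNS plumbing (registering the handler via RNS.Transport.register_announce_handler and deriving destination hashes from identities) is deliberately omitted, so this shows only the known_peers map logic:

```python
import time

class PeerRegistry:
    """Tracks peers seen via Reticulum announces.

    In a real instance, received_announce() would be called by an
    announce handler registered with RNS; here only the map is shown.
    """
    def __init__(self, stale_after: float = 3600.0):
        self.known_peers: dict[bytes, float] = {}   # dest hash -> last seen
        self.stale_after = stale_after

    def received_announce(self, destination_hash: bytes) -> None:
        # Called for every announce matching this app's aspect filter.
        self.known_peers[destination_hash] = time.monotonic()

    def active_peers(self) -> list[bytes]:
        # Peers not re-announced within stale_after are considered gone.
        cutoff = time.monotonic() - self.stale_after
        return [h for h, seen in self.known_peers.items() if seen >= cutoff]
```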

Direct LoRa links across ~200 km distances are not feasible. The recommended topology uses Yggdrasil as the overlay network between gateway nodes. Yggdrasil assigns each node a stable IPv6 address derived from its public key, eliminating the need for DNS or dynamic IP management. Reticulum TCPInterface peers are configured using these Yggdrasil IPv6 addresses directly.

[KR local LoRa nodes] ──RNode── [KR gateway]
                                      │
                               Yggdrasil overlay
                               (encrypted, self-routing)
                                      │
[JP local LoRa nodes] ──RNode── [JP gateway]

# Expansion
[KR LoRa] ──── [KR gateway]
                     │    \
              Yggdrasil    \
                     │      \
[JP LoRa] ──── [JP gateway]  \
                               \
[TW LoRa] ──── [TW gateway] ───┘

From Reticulum's perspective, the entire mesh is one address space.

Transport Node Configuration#

Gateway instances that relay traffic for other nodes must set:

[reticulum]
  enable_transport = True

End-user or single-purpose archive instances may leave this at the default False.

LoRa Frequency Compliance#

Country   Band          Max TX Power
Korea     920–923 MHz   10–23 dBm (channel-dependent)
Japan     920–928 MHz   13–20 dBm

Each gateway operates its LoRa interface within its own national regulations. Cross-border signaling passes through the Yggdrasil/TCP layer only.

IPFS#

IPFS is the content layer. Binary assets (images, video, audio) are uploaded by the publisher and referenced by CID in the Reticulum signal. Subscribers pin the CIDs they receive; IPFS deduplicates automatically when multiple nodes pin the same content.

Text metadata (title, description) is never stored in IPFS. It lives in ATProto records only. IPFS infrastructure (daemon, pinning service, gateway) is unchanged by this proposal.

The unpin mechanism means content that no node wishes to preserve will naturally expire from the network. Operators are responsible for deciding their own pin retention policies.

Ownership Verification for Unpin Signals#

A whitelisted DID can issue unpin signals for any CID, creating a risk that a malicious or compromised account could cause subscribers to unpin content it does not own. Two complementary checks are applied on every unpin signal received.

First check, local cache: When a pin signal is processed, the instance records a cid → did mapping in a local SQLite database. On receiving an unpin signal, the instance looks up the CID in this cache. If the cached owner does not match the signaling DID, the signal is silently discarded.

Second check, ATProto verification: If the CID is absent from the local cache (e.g. after a database reset or on a newly joined instance), the instance fetches the publisher's net.dmrg.archive.item records from their PDS and confirms that the CID appears in a record authored by the signaling DID. If no such record exists, the signal is discarded.

async def on_unpin_signal(did: str, cid1: str, cid2: str):
    # Both CIDs must pass the ownership checks below before either
    # is unpinned; a single failure discards the whole signal.
    for cid in [cid1, cid2]:
        cached_owner = db.get(f"owner:{cid}")

        if cached_owner and cached_owner != did:
            return  # cache says different owner; discard

        if not cached_owner:
            # cache miss: verify against ATProto
            if not await verify_ownership_via_atproto(did, cid):
                return  # not found on PDS; discard

    ipfs.unpin(cid1)
    ipfs.unpin(cid2)
    db.delete(f"owner:{cid1}", f"owner:{cid2}")

async def verify_ownership_via_atproto(did: str, cid: str) -> bool:
    # Assumes the publisher's archive records fit in one page; a
    # production version would follow the pagination cursor.
    records = await atp_client.com.atproto.repo.list_records({
        "repo": did,
        "collection": "net.dmrg.archive.item"
    })
    return any(
        r.value.get("cid1") == cid or r.value.get("cid2") == cid
        for r in records.records
    )

The local cache handles the common case with no network overhead. The ATProto fallback handles cache misses without requiring a dedicated coordination service. No additional infrastructure beyond a local SQLite file is needed.

Failure Modes#

Scenario                                  Behavior
Instance offline during publish           Misses signal; recovers via list_records + re-pin on next startup
Reticulum path unavailable                Signal not delivered; fallback polling covers the gap
IPFS node unreachable                     Pin fails; retried on next signal or fallback sync
Duplicate signal received                 IPFS pin is idempotent; no ill effect
DID not in whitelist                      Signal discarded; no pin performed
Whitelist record unavailable at login     Login rejected; fail closed

Open Questions#

  • Polling interval for fallback sync. Suggested default: 30 minutes for nodes expected to go offline frequently (mobile, LoRa-only); 6 hours for always-on server instances.
  • Reticulum announce interval. Explicit interval should be configurable per instance; Reticulum re-announces periodically by default.
  • Lexicon schema finalization. net.dmrg.archive.item and net.dmrg.auth.allowlist fields and validators are not defined in this proposal and should be tracked separately.
  • Multi-CID records. The current schema has exactly two CID fields. Whether this should be a variable-length array is left for the lexicon proposal.
  • Whitelist TTL. Operator-configurable; 5 minutes suggested as default.
  • Pin retention policy. Whether subscribers are expected to pin indefinitely or may unpin at their discretion is a community norm question, not a protocol question.

Implementation Steps#

  1. Define and publish net.dmrg.archive.item and net.dmrg.auth.allowlist lexicons
  2. Implement ATProto OAuth flow with DID whitelist check on callback
  3. Implement whitelist fetch and cache from net.dmrg.auth.allowlist record
  4. Implement Reticulum announce sender (pin/unpin signal)
  5. Implement Reticulum announce handler: whitelist check → IPFS pin/unpin → ATProto mirror
  6. Implement fallback list_records sync + re-pin on startup
  7. Attach RNode LoRa interfaces to each gateway
  8. Integration test: publish on the instance, verify pin on the other instance

I agree that changing the fetching method from polling to event-driven is necessary, and it's a great idea. However, I'm not sure LoRa is the right fit for that job: I think receiving events from another country can't be handled by LoRa because of its limited range (to solve the distance issue, I think an IP connection is still required). So event delivery needs to work over IP, and we should consider where LoRa can fit within nomos.

And I'm also a little confused about how pin/unpin works: how can unpin work as deleting content?

Otherwise, the proposal looks good for me!

Sorry about my late response! It was a tough week for me ;) You're right: an IP backbone is necessary when connecting two distant nodes without sufficient LoRa repeaters en route. As far as I know, Reticulum supports that as well.

Regarding the effect of pin/unpin, I have to say that unpin can NOT work as a guaranteed deletion method. However, propagating the withdrawal message means the publisher is 'requesting' that other nodes stop hosting the content on the IPFS network, so it's the closest thing to content deletion.

Thanks for the thoughtful explanation! I need to read the Reticulum manual and better understand how Reticulum works over both IP and LoRa. To be honest, LoRa still feels like an extra to me, but I think it's a pretty interesting design, so I agree with the proposal.

And thanks for clarifying pin/unpin; now I totally get how it works!

Yes, LoRa is not necessary at all. Reticulum itself is just a networking stack that can run on various low-spec hardware, so TCP/IP should be its primary backbone in this project.

By the way, if we can maintain a list of community transport nodes (like Debian's apt mirror list), anyone could use their closest/fastest node and find other instances on the network by their hash.

AT URI
at://did:plc:2jsk5qbrx3f6od3755x6cmjl/sh.tangled.repo.issue/3mf4lyoapeo22