07 — Discovery Layer

The federated indexer-worker — the only central component in BKA, and the most architecturally distinct piece. This doc specifies its responsibilities, what it stores, what it explicitly never stores, and how BabyAI routes discovery queries.


What the indexer is

A single Cloudflare Worker (~300-500 LOC) with two concerns:

  1. Federation — pulls registered creators' /feed.json on a cadence; maintains a metadata index
  2. Discovery API — answers "what should I see next" queries by routing through BabyAI specialist LoRAs

That's it. No content host. No proxy. No comment service. No subscriber list. The smallest possible central thing.


What the indexer holds

A KV / D1 store with these shapes (D1 for relational queries, KV for the cached feed snapshots):

creators table (D1) — one row per registered creator:

CREATE TABLE creators (
  google_sub      TEXT PRIMARY KEY,    -- Google's stable user id
  handle          TEXT NOT NULL UNIQUE,-- the creator's BKA handle
  feed_url        TEXT NOT NULL,       -- absolute https URL to /feed.json
  site_url        TEXT NOT NULL,       -- absolute https URL to / (display)
  registered_at   INTEGER NOT NULL,
  last_pulled_at  INTEGER,
  last_pull_status TEXT,                -- 'ok' | 'unreachable' | 'malformed' | 'rate_limited'
  paused          INTEGER NOT NULL DEFAULT 0,  -- creator-toggleable
  -- soft delete: if unregistered, mark with deleted_at and stop pulling
  deleted_at      INTEGER
);

items table (D1) — one row per discovered post per creator:

CREATE TABLE items (
  creator_sub      TEXT NOT NULL,
  slug             TEXT NOT NULL,
  title            TEXT NOT NULL,
  summary          TEXT,
  tags             TEXT,                -- comma-separated for simple search
  published_at     INTEGER NOT NULL,
  url              TEXT NOT NULL,       -- absolute URL to the post on the creator's site
  thumbnail_url    TEXT,                -- absolute URL to thumbnail (if any)
  media_kinds      TEXT,                -- comma-separated: 'video','audio','image','text'
  first_seen_at    INTEGER NOT NULL,    -- when the indexer first saw this item
  PRIMARY KEY (creator_sub, slug)
);
CREATE INDEX idx_items_pub ON items(published_at DESC);
-- Note: a substring match such as tags LIKE '%x%' cannot use this index.
CREATE INDEX idx_items_tags ON items(tags);

KV: feed-snapshot:<sub> — the raw last-fetched feed.json per creator (for cache/diff). TTL 7 days.
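A minimal sketch of writing a snapshot with the 7-day TTL. `KVLike` is a hypothetical, narrowed stand-in for the Workers KV binding (whose `put` accepts an `expirationTtl` in seconds); `snapshotKey` and `saveSnapshot` are illustrative names, not existing code.

```typescript
// 7 days, expressed in seconds as KV's expirationTtl expects.
const SNAPSHOT_TTL_SECONDS = 7 * 24 * 60 * 60;

// Hypothetical subset of the KV binding used here.
interface KVLike {
  put(key: string, value: string, options?: { expirationTtl?: number }): Promise<void>;
}

// Builds the per-creator key described above: feed-snapshot:<sub>.
function snapshotKey(googleSub: string): string {
  return `feed-snapshot:${googleSub}`;
}

// Stores the raw feed.json text so the next crawl can diff against it.
async function saveSnapshot(kv: KVLike, googleSub: string, rawFeed: string): Promise<void> {
  await kv.put(snapshotKey(googleSub), rawFeed, { expirationTtl: SNAPSHOT_TTL_SECONDS });
}
```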

KV: pull-queue — priority queue for the crawl scheduler.

That's all. Notably absent:

If the creator's site goes offline, the indexer has URLs to nothing. Visitors hitting discovery during that window get items pointing at unreachable hosts — that's the creator's problem, not ours to fix by caching.


Crawl scheduler

Runs as a Cloudflare cron trigger every 10 minutes:

  1. Select up to 100 creators from creators where paused = 0 AND deleted_at IS NULL AND (last_pulled_at IS NULL OR last_pulled_at < now() - cadence)
  2. For each:
    • GET <feed_url> with a short timeout (10s)
    • Parse as the canonical feed schema (10_FEED_SCHEMA.md)
    • Diff against last snapshot: identify new items, removed items, edited items
    • Insert new items, soft-delete removed items, update edited items
    • Cache the new snapshot in KV
  3. Update last_pulled_at + last_pull_status
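The diff step in the list above can be sketched as a pure function. This is an illustration under assumptions: `FeedItem` is a hypothetical shape (the canonical one lives in 10_FEED_SCHEMA.md), items are keyed by `slug` as in the items table, and `diffFeed` is an illustrative name.

```typescript
// Hypothetical item shape; the real schema is defined in 10_FEED_SCHEMA.md.
interface FeedItem {
  slug: string;
  title: string;
  summary?: string;
  tags?: string[];
  publishedAt: string;
  url: string;
}

interface FeedDiff {
  added: FeedItem[];
  removed: FeedItem[];
  edited: FeedItem[];
}

// Compares the previous snapshot with the freshly fetched feed, keyed by slug.
function diffFeed(previous: FeedItem[], current: FeedItem[]): FeedDiff {
  const before = new Map<string, FeedItem>();
  for (const item of previous) before.set(item.slug, item);

  const seen = new Set<string>();
  const diff: FeedDiff = { added: [], removed: [], edited: [] };

  for (const item of current) {
    seen.add(item.slug);
    const old = before.get(item.slug);
    if (old === undefined) {
      diff.added.push(item);
    } else if (JSON.stringify(old) !== JSON.stringify(item)) {
      // Assumes the feed serializes keys in a stable order;
      // a field-by-field comparison would be more robust.
      diff.edited.push(item);
    }
  }
  for (const item of previous) {
    if (!seen.has(item.slug)) diff.removed.push(item);
  }
  return diff;
}
```

Each bucket then maps onto one of the write steps: `added` to inserts, `removed` to soft-deletes, `edited` to updates.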

Cadence: hourly default per creator. Priority-pulled creators (recent publish, recent registration, high view count) move to a 10-minute cadence for ~6 hours, then back to hourly.
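A sketch of the cadence rule as code. It assumes timestamps are Unix seconds (the schema only says INTEGER) and introduces a hypothetical `priorityUntil` field that is not a column in the creators table above; all function names are illustrative.

```typescript
const MINUTE_S = 60;
const HOUR_S = 60 * MINUTE_S;

// Hypothetical scheduling state for one creator.
interface PullState {
  lastPulledAt: number | null;  // Unix seconds (assumed unit)
  priorityUntil: number | null; // end of the ~6h priority window, if any
}

// 10 minutes while inside the priority window, otherwise hourly.
function cadenceSeconds(state: PullState, now: number): number {
  const inWindow = state.priorityUntil !== null && now < state.priorityUntil;
  return inWindow ? 10 * MINUTE_S : HOUR_S;
}

// Mirrors the scheduler's selection predicate: never pulled, or pulled
// longer ago than the current cadence.
function isDue(state: PullState, now: number): boolean {
  if (state.lastPulledAt === null) return true;
  return state.lastPulledAt < now - cadenceSeconds(state, now);
}

// Called on a recent publish, a new registration, or a view-count spike.
function startPriorityWindow(state: PullState, now: number): PullState {
  return { ...state, priorityUntil: now + 6 * HOUR_S };
}
```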

Backoff: 3 consecutive failures → cadence drops to 6h, then 24h, then 7d. Recovered creators reset to hourly. Permanently unreachable creators (30d+) get marked paused and we email them (if we have an email — see open question on creator contact).
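One possible reading of the backoff ladder, sketched as code. The text does not say what triggers each further step, so this assumes the cadence escalates once per additional 3 consecutive failures; that escalation rule, the `failingSince` parameter, and both function names are assumptions rather than settled design.

```typescript
const BACKOFF_HOUR_S = 3600;
const BACKOFF_DAY_S = 24 * BACKOFF_HOUR_S;

// Maps a consecutive-failure count to a pull cadence in seconds.
// Assumption: each further 3 failures moves one step down the ladder.
function backoffCadenceSeconds(consecutiveFailures: number): number {
  if (consecutiveFailures < 3) return BACKOFF_HOUR_S;     // healthy: hourly
  if (consecutiveFailures < 6) return 6 * BACKOFF_HOUR_S; // first step: 6h
  if (consecutiveFailures < 9) return BACKOFF_DAY_S;      // second step: 24h
  return 7 * BACKOFF_DAY_S;                               // final step: 7d
}

// True once a creator has been failing for 30 days or more.
// `failingSince` is the Unix-seconds time of the first failure in the
// current streak, or null if the last pull succeeded.
function shouldAutoPause(failingSince: number | null, now: number): boolean {
  return failingSince !== null && now - failingSince >= 30 * BACKOFF_DAY_S;
}
```

A successful pull would reset the failure count to 0, which returns the creator to the hourly cadence.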

Rate limiting: never more than 1 req/sec to any single creator origin, regardless of cadence. If we're priority-pulling 100 different creators that's fine (different origins); if we have multiple feeds on the same origin (unlikely for BKA but possible) they share the rate limit.
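A sketch of a per-origin limiter that enforces the 1 req/sec rule within a single crawl batch. `OriginRateLimiter` is a hypothetical helper; it keeps state in memory, so it only covers one cron invocation and would need a shared store to hold across invocations.

```typescript
// Tracks, per origin, the earliest time (ms) the next request may be sent.
class OriginRateLimiter {
  private nextAllowed = new Map<string, number>();
  private minIntervalMs: number;

  constructor(minIntervalMs = 1000) {
    this.minIntervalMs = minIntervalMs;
  }

  // Reserves the next slot for the feed's origin and returns how many
  // milliseconds the caller should wait before sending the request.
  reserve(feedUrl: string, nowMs: number): number {
    const origin = new URL(feedUrl).origin;
    const earliest = Math.max(nowMs, this.nextAllowed.get(origin) ?? 0);
    this.nextAllowed.set(origin, earliest + this.minIntervalMs);
    return earliest - nowMs;
  }
}
```

Feeds on the same origin share one slot sequence, while feeds on different origins never delay each other.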


Discovery API

POST /api/register

Creator registers their feed for discovery. Authenticated with Google JWT (same shape as the community worker).

Request:

{
  "googleIdToken": "<jwt>",
  "handle": "chris",
  "feedUrl": "https://chris-bka-public.pages.dev/feed.json",
  "siteUrl": "https://chris-bka-public.pages.dev"
}

Worker validates JWT, fetches the feed once to confirm it's reachable + valid schema, then inserts/updates the creators row keyed on Google sub. Returns success or the validation error.
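A sketch of the request-shape check that would run before the JWT is verified and the feed is fetched. JWT verification itself is out of scope here. The handle pattern is an assumption (the doc does not define allowed characters), and `validateRegisterBody` and `isHttpsUrl` are illustrative names.

```typescript
interface RegisterRequest {
  googleIdToken: string;
  handle: string;
  feedUrl: string;
  siteUrl: string;
}

type Validation =
  | { ok: true; value: RegisterRequest }
  | { ok: false; error: string };

// True only for strings that parse as absolute https URLs.
function isHttpsUrl(value: unknown): value is string {
  if (typeof value !== "string") return false;
  try {
    return new URL(value).protocol === "https:";
  } catch {
    return false;
  }
}

function validateRegisterBody(body: unknown): Validation {
  if (typeof body !== "object" || body === null) {
    return { ok: false, error: "body must be a JSON object" };
  }
  const { googleIdToken, handle, feedUrl, siteUrl } = body as Record<string, unknown>;
  if (typeof googleIdToken !== "string" || googleIdToken.length === 0) {
    return { ok: false, error: "googleIdToken is required" };
  }
  // Assumed handle format: lowercase letters, digits, hyphens, 1 to 32 chars.
  if (typeof handle !== "string" || !/^[a-z0-9-]{1,32}$/.test(handle)) {
    return { ok: false, error: "handle is invalid" };
  }
  if (!isHttpsUrl(feedUrl)) {
    return { ok: false, error: "feedUrl must be an absolute https URL" };
  }
  if (!isHttpsUrl(siteUrl)) {
    return { ok: false, error: "siteUrl must be an absolute https URL" };
  }
  return { ok: true, value: { googleIdToken, handle, feedUrl, siteUrl } };
}
```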

POST /api/unregister

Soft-deletes the creator's registration and stops pulling. Existing items in the index are cleared after a 7-day grace period, which lets a creator re-register after a brief absence without losing discovery momentum.

Request: { "googleIdToken": "<jwt>" } — sub-keyed.
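The grace-period rule can be sketched as three small functions over the `deleted_at` column. Timestamps are assumed to be Unix seconds, and `Registration`, `unregister`, `reRegister`, and `itemsArePurgeable` are hypothetical names.

```typescript
const GRACE_PERIOD_SECONDS = 7 * 24 * 60 * 60;

// Hypothetical slice of a creators row; null means currently registered.
interface Registration {
  deletedAt: number | null;
}

// Soft delete: record when the creator left, keep the row and its items.
function unregister(reg: Registration, now: number): Registration {
  return { ...reg, deletedAt: now };
}

// Re-registering clears the marker, so existing items stay in the index.
function reRegister(reg: Registration): Registration {
  return { ...reg, deletedAt: null };
}

// Items may be cleared only once the full grace period has elapsed.
function itemsArePurgeable(reg: Registration, now: number): boolean {
  return reg.deletedAt !== null && now - reg.deletedAt >= GRACE_PERIOD_SECONDS;
}
```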

POST /api/refresh

Priority-pull hint. Creator-side BKA pings this after a publish. Validates JWT (must match a registered creator). Moves the creator to the front of the next crawl batch.

Request: { "googleIdToken": "<jwt>" }

GET /api/discover

Unauthenticated. The actual "what should I see next" route.

Query params:

Worker:

  1. Pulls a candidate set from the items table matching the filters
  2. Routes the query + candidates through a BabyAI discovery specialist (see below)
  3. Returns the ranked + cut list as JSON
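The three steps above can be sketched with the data access and the model call injected, so the glue stays testable. Everything here is an assumption for illustration: `loadCandidates` and `rank` are hypothetical dependencies, `limit` stands in for whatever cut-off the query params define, and candidates are matched to scores by `url`.

```typescript
interface Candidate {
  url: string;
  title: string;
}

interface Scored {
  url: string;
  confidence: number;
}

type LoadCandidates = () => Promise<Candidate[]>;
type Rank = (query: string, candidates: Candidate[]) => Promise<Scored[]>;

// Orders candidates by descending confidence and keeps the top `limit`.
// Scores that refer to unknown URLs are ignored.
function cutRanked(candidates: Candidate[], scores: Scored[], limit: number): Candidate[] {
  const byUrl = new Map<string, Candidate>();
  for (const c of candidates) byUrl.set(c.url, c);
  const ranked: Candidate[] = [];
  for (const s of [...scores].sort((a, b) => b.confidence - a.confidence)) {
    const c = byUrl.get(s.url);
    if (c !== undefined) ranked.push(c);
  }
  return ranked.slice(0, limit);
}

// Step 1: load candidates. Step 2: one model call. Step 3: rank and cut.
async function discover(query: string, limit: number, loadCandidates: LoadCandidates, rank: Rank) {
  const startedAt = Date.now();
  const candidates = await loadCandidates();
  const scores = await rank(query, candidates);
  return {
    items: cutRanked(candidates, scores, limit),
    model: "babyai-discovery-v1",
    tookMs: Date.now() - startedAt,
  };
}
```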

Response:

{
  "items": [
    {
      "title": "...",
      "summary": "...",
      "tags": ["..."],
      "publishedAt": "2026-06-29T20:00:00Z",
      "url": "https://...",
      "thumbnailUrl": "https://...",
      "creator": { "handle": "chris", "siteUrl": "https://..." }
    }
  ],
  "model": "babyai-discovery-v1",
  "tookMs": 145
}
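A sketch of mapping a stored row to one entry of the response above. It assumes `published_at` holds Unix seconds and that the row comes from a join of items and creators. `ItemRow`, `DiscoverItem`, and `toDiscoverItem` are illustrative names, and emitting `null` for a missing summary or thumbnail is an assumption.

```typescript
// Row shape from a join of the items and creators tables.
interface ItemRow {
  title: string;
  summary: string | null;
  tags: string | null;          // comma-separated, as stored
  published_at: number;         // Unix seconds (assumed unit)
  url: string;
  thumbnail_url: string | null;
  handle: string;
  site_url: string;
}

interface DiscoverItem {
  title: string;
  summary: string | null;
  tags: string[];
  publishedAt: string;
  url: string;
  thumbnailUrl: string | null;
  creator: { handle: string; siteUrl: string };
}

function toDiscoverItem(row: ItemRow): DiscoverItem {
  // Split the stored tag string, dropping blanks and stray whitespace.
  const tags = (row.tags ?? "")
    .split(",")
    .map((t) => t.trim())
    .filter((t) => t.length > 0);
  // toISOString() includes milliseconds; strip them to match the example.
  const publishedAt = new Date(row.published_at * 1000)
    .toISOString()
    .replace(/\.\d{3}Z$/, "Z");
  return {
    title: row.title,
    summary: row.summary,
    tags,
    publishedAt,
    url: row.url,
    thumbnailUrl: row.thumbnail_url,
    creator: { handle: row.handle, siteUrl: row.site_url },
  };
}
```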

GET /api/recent

Unauthenticated. Pure-recency view (no BabyAI). For discovery UIs that want "newest across all creators." Same response shape as /api/discover.

GET /api/creators/:handle

Public profile fetch. Returns the registered creator's metadata + recent items. Used by discovery clients to land on a creator's profile page.
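The six routes above could be dispatched from the Worker's fetch handler with a small matcher like the following. This is a sketch only; the `Route` type and `matchRoute` name are hypothetical, and the handle is returned exactly as it appears in the path.

```typescript
type Route =
  | { name: "register" | "unregister" | "refresh" | "discover" | "recent" }
  | { name: "creator"; handle: string }
  | { name: "not_found" };

// Maps an HTTP method and URL pathname to one of the API routes.
function matchRoute(method: string, pathname: string): Route {
  const verb = method.toUpperCase();
  switch (`${verb} ${pathname}`) {
    case "POST /api/register":
      return { name: "register" };
    case "POST /api/unregister":
      return { name: "unregister" };
    case "POST /api/refresh":
      return { name: "refresh" };
    case "GET /api/discover":
      return { name: "discover" };
    case "GET /api/recent":
      return { name: "recent" };
  }
  // GET /api/creators/:handle, with exactly one path segment as the handle.
  const match = /^\/api\/creators\/([^/]+)$/.exec(pathname);
  if (verb === "GET" && match !== null) {
    return { name: "creator", handle: match[1] ?? "" };
  }
  return { name: "not_found" };
}
```

Inside a handler this would be called as `matchRoute(request.method, new URL(request.url).pathname)`.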


BabyAI discovery specialists

Discovery routing uses BabyAI's MoE pattern, with specialist LoRAs trained for discovery-specific tasks:

Specialist | Trained on | Job
Content-matching | (query, accepted-item) pairs | Given a text query, rank candidate items by semantic relevance
Taste-matching | (user-history, accepted-item) pairs | Given a user's prior accepts, rank candidates by "you'd probably like this"
Freshness-balanced | n/a (heuristic) | Blend recency with relevance (boost new items mildly)
Diversity | n/a (heuristic) | Avoid returning N items from the same creator in a row

For v1 ship: only content-matching + freshness. Taste-matching needs accumulated user-history which we don't have yet. Diversity is a post-filter heuristic that doesn't need a LoRA.
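The two heuristics could look roughly like this. The boost size, the half-life, and the maximum run length are placeholder values rather than tuned numbers, timestamps are assumed to be Unix seconds, and all names are hypothetical.

```typescript
interface RankedItem {
  creator: string;
  url: string;
  relevance: number;   // confidence from the content-matching specialist
  publishedAt: number; // Unix seconds (assumed unit)
}

// Freshness: add a small bonus that halves every `halfLifeSeconds`.
function freshnessScore(
  item: RankedItem,
  now: number,
  boost = 0.1,
  halfLifeSeconds = 7 * 24 * 3600,
): number {
  const age = Math.max(0, now - item.publishedAt);
  return item.relevance + boost * Math.pow(0.5, age / halfLifeSeconds);
}

// True if appending `creator` would create a run longer than maxRun.
function wouldExceedRun(out: { creator: string }[], creator: string, maxRun: number): boolean {
  if (out.length < maxRun) return false;
  return out.slice(-maxRun).every((o) => o.creator === creator);
}

// Diversity post-filter: walk the ranked list and defer items that would
// extend a same-creator run past maxRun (assumed >= 1). Nothing is dropped.
function diversify<T extends { creator: string }>(ranked: T[], maxRun = 2): T[] {
  const remaining = [...ranked];
  const out: T[] = [];
  while (remaining.length > 0) {
    let idx = remaining.findIndex((c) => !wouldExceedRun(out, c.creator, maxRun));
    if (idx === -1) idx = 0; // only one creator left; accept the run
    out.push(...remaining.splice(idx, 1));
  }
  return out;
}
```

Because `diversify` only reorders, it can run after the cut or before it; running it before the cut gives later creators a chance to surface.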

Routing: /api/discover sends {query, candidates} to BabyAI via the same proxy pattern as the community worker (worker holds HF_TOKEN; BabyAI Space serves the model). Single model call per query, returns ranked candidates with confidence scores.

For users who give explicit feedback (thumbs-up/down on returned items), that feedback becomes training data for the taste-matching LoRA via the existing Mosh Pit / preference-learning pattern. Per the BabyAI two-LoRA architecture, preference data feeds routing LoRA training.


What the indexer specifically never does

These are load-bearing. Each one is the difference between "tiny coordinator" and "central platform."


Operator monitoring (us)

We need to know:

Use Cloudflare Workers Analytics Engine, same pattern as agicore-foundry's analytics-engine binding. Tiny dashboard at /admin/stats gated by ADMIN_TOKEN.


Scaling notes

None of these are blockers for v1. All are addressable when (if) they matter.


Creator opt-out / privacy

Creators control discovery participation entirely:

The principle: creators tell the indexer what they want indexed. We index that. Nothing more.


Future: federation with other indexers

Long-term, the indexer surface itself can be one of many — third parties can run their own indexer pointed at the same /feed.json standard, with their own discovery models, their own moderation policies. The schema is the protocol; the indexer implementation is interchangeable.

Out of scope for v1. Worth knowing the door isn't locked.