07 — Discovery Layer

The federated indexer-worker is the only central component in BKA and the most architecturally distinct piece. This doc specifies its responsibilities, what it stores, what it explicitly never stores, and how BabyAI routes discovery queries.

What the indexer is

A single Cloudflare Worker (~300-500 LOC) with two concerns:

Federation — pulls registered creators' /feed.json on a cadence; maintains a metadata index
Discovery API — answers "what should I see next" queries by routing through BabyAI specialist LoRAs

That's it. No content host. No proxy. No comment service. No subscriber list. The smallest possible central thing.

What the indexer holds

A KV / D1 store with these shapes (D1 for relational queries, KV for the cached feed snapshots):

creators table (D1) — one row per registered creator:

CREATE TABLE creators (
  google_sub      TEXT PRIMARY KEY,    -- Google's stable user id
  handle          TEXT NOT NULL UNIQUE,-- the creator's BKA handle
  feed_url        TEXT NOT NULL,       -- absolute https URL to /feed.json
  site_url        TEXT NOT NULL,       -- absolute https URL to / (display)
  registered_at   INTEGER NOT NULL,
  last_pulled_at  INTEGER,
  last_pull_status TEXT,                -- 'ok' | 'unreachable' | 'malformed' | 'rate_limited'
  paused          INTEGER NOT NULL DEFAULT 0,  -- creator-toggleable
  -- soft delete: if unregistered, mark with deleted_at and stop pulling
  deleted_at      INTEGER
);

items table (D1) — one row per discovered post per creator:

CREATE TABLE items (
  creator_sub      TEXT NOT NULL,
  slug             TEXT NOT NULL,
  title            TEXT NOT NULL,
  summary          TEXT,
  tags             TEXT,                -- comma-separated for simple search
  published_at     INTEGER NOT NULL,
  url              TEXT NOT NULL,       -- absolute URL to the post on the creator's site
  thumbnail_url    TEXT,                -- absolute URL to thumbnail (if any)
  media_kinds      TEXT,                -- comma-separated: 'video','audio','image','text'
  first_seen_at    INTEGER NOT NULL,    -- when the indexer first saw this item
  PRIMARY KEY (creator_sub, slug)
);
CREATE INDEX idx_items_pub ON items(published_at DESC);
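CREATE INDEX idx_items_tags ON items(tags);  -- plain B-tree: helps exact/prefix matches on the whole string, not per-tag lookups inside the comma-separated list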

KV: feed-snapshot:<sub> — the raw last-fetched feed.json per creator (for cache/diff). TTL 7 days.

KV: pull-queue — priority queue for the crawl scheduler.

That's all. Notably absent:

Post bodies (just URLs to the creator's site)
Media (just URLs to GitHub Releases or creator's R2)
Comments
Subscribers
Anything content-shaped

If the creator's site goes offline, the indexer has URLs to nothing. Visitors hitting discovery during that window get items pointing at unreachable hosts — that's the creator's problem, not ours to fix by caching.

Crawl scheduler

Runs as a Cloudflare cron trigger every 10 minutes:

1. Select up to 100 creators from creators where paused = 0 AND deleted_at IS NULL AND (last_pulled_at IS NULL OR last_pulled_at < now() - cadence)
2. For each:
   - GET <feed_url> with a short timeout (10s)
   - Parse as the canonical feed schema (10_FEED_SCHEMA.md)
   - Diff against last snapshot: identify new items, removed items, edited items
   - Insert new items, soft-delete removed items, update edited items
   - Cache the new snapshot in KV
3. Update last_pulled_at + last_pull_status
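The diff step above can be sketched as a pure function over two snapshots. This is a minimal sketch, not the indexer's actual code: the `FeedItem` fields are an assumed subset (the canonical shape lives in 10_FEED_SCHEMA.md), and `diffFeed`, `fingerprint`, and `FeedDiff` are hypothetical names.

```typescript
// Assumed subset of a feed item; the real shape is defined in 10_FEED_SCHEMA.md.
interface FeedItem {
  slug: string;
  title: string;
  summary?: string;
  url: string;
  publishedAt: string;
}

interface FeedDiff {
  added: FeedItem[];
  edited: FeedItem[];
  removedSlugs: string[];
}

// Stable string over the indexed fields, so JSON key order in the feed does not matter.
function fingerprint(item: FeedItem): string {
  return JSON.stringify([item.title, item.summary ?? null, item.url, item.publishedAt]);
}

// Compare the previous snapshot with the freshly fetched one, keyed on slug.
function diffFeed(previous: FeedItem[], current: FeedItem[]): FeedDiff {
  const previousBySlug = new Map<string, FeedItem>();
  for (const item of previous) previousBySlug.set(item.slug, item);

  const currentSlugs = new Set<string>();
  const added: FeedItem[] = [];
  const edited: FeedItem[] = [];
  for (const item of current) {
    currentSlugs.add(item.slug);
    const before = previousBySlug.get(item.slug);
    if (before === undefined) added.push(item);
    else if (fingerprint(before) !== fingerprint(item)) edited.push(item);
  }

  const removedSlugs = previous
    .filter((item) => !currentSlugs.has(item.slug))
    .map((item) => item.slug);
  return { added, edited, removedSlugs };
}
```

Keeping the diff pure means the D1 writes and the KV snapshot update can be driven from its result, and the function can be tested without any Worker bindings.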

Cadence: hourly default per creator. Priority-pulled creators (recent publish, recent registration, high view count) move to a 10-minute cadence for ~6 hours, then back to hourly.

Backoff: after 3 consecutive failures a creator's cadence drops to 6h; if failures continue it steps down to 24h, then 7d. Recovered creators reset to hourly. Permanently unreachable creators (30d+) get marked paused and we email them (if we have an email; see the open question on creator contact).
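The cadence and backoff rules can be folded into one small function. A sketch under stated assumptions: timestamps are epoch milliseconds, the `PullState` fields are hypothetical (the creators table as specified does not carry a failure counter or a priority window), and the ladder is read as "each further run of 3 failures moves one step down", which the text above does not pin down.

```typescript
const MINUTE_MS = 60 * 1000;
const HOUR_MS = 60 * MINUTE_MS;
const DAY_MS = 24 * HOUR_MS;

// Hypothetical per-creator scheduling state, not part of the schema above.
interface PullState {
  consecutiveFailures: number;
  priorityUntilMs: number | null; // end of the ~6h priority window, if any
  failingSinceMs: number | null;  // time of the first failure in the current streak
}

// Effective pull interval for a creator right now.
function cadenceMs(state: PullState, nowMs: number): number {
  const failures = state.consecutiveFailures;
  // Assumption: 3 failures -> 6h, 6 failures -> 24h, 9 or more -> 7d.
  if (failures >= 9) return 7 * DAY_MS;
  if (failures >= 6) return DAY_MS;
  if (failures >= 3) return 6 * HOUR_MS;
  // Healthy creators: 10-minute cadence inside the priority window, hourly otherwise.
  if (state.priorityUntilMs !== null && nowMs < state.priorityUntilMs) return 10 * MINUTE_MS;
  return HOUR_MS;
}

// Mirrors the scheduler's selection predicate for a single creator.
function isDue(lastPulledAtMs: number | null, state: PullState, nowMs: number): boolean {
  return lastPulledAtMs === null || lastPulledAtMs < nowMs - cadenceMs(state, nowMs);
}

// "Permanently unreachable" is read here as 30 days of uninterrupted failure.
function shouldAutoPause(state: PullState, nowMs: number): boolean {
  return state.failingSinceMs !== null && nowMs - state.failingSinceMs >= 30 * DAY_MS;
}
```

A successful pull would reset `consecutiveFailures` to 0 and `failingSinceMs` to null, which returns the creator to the hourly default.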

Rate limiting: never more than 1 req/sec to any single creator origin, regardless of cadence. If we're priority-pulling 100 different creators that's fine (different origins); if we have multiple feeds on the same origin (unlikely for BKA but possible) they share the rate limit.
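One way to honor the per-origin limit inside a single batch is to assign each request a start delay, spacing out only those that share an origin. A minimal sketch; `scheduleByOrigin` and `ScheduledPull` are hypothetical names, and it only covers spacing within one cron run, not across runs.

```typescript
interface ScheduledPull {
  feedUrl: string;
  delayMs: number; // how long to wait after batch start before issuing the GET
}

// Feeds on different origins start immediately; feeds sharing an origin are
// spaced at least minGapMs apart, in the order they appear in the batch.
function scheduleByOrigin(feedUrls: string[], minGapMs: number = 1000): ScheduledPull[] {
  const nextFreeSlotMs = new Map<string, number>();
  return feedUrls.map((feedUrl) => {
    const origin = new URL(feedUrl).origin;
    const delayMs = nextFreeSlotMs.get(origin) ?? 0;
    nextFreeSlotMs.set(origin, delayMs + minGapMs);
    return { feedUrl, delayMs };
  });
}
```

With 100 creators on 100 different origins every delay is 0, which matches the "that's fine" case above; only co-hosted feeds are serialized.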

Discovery API

`POST /api/register`

Creator registers their feed for discovery. Authenticated with a Google JWT (same shape as the community worker).

Request:

{
  "googleIdToken": "<jwt>",
  "handle": "chris",
  "feedUrl": "https://chris-bka-public.pages.dev/feed.json",
  "siteUrl": "https://chris-bka-public.pages.dev"
}

Worker validates JWT, fetches the feed once to confirm it's reachable + valid schema, then inserts/updates the creators row keyed on Google sub. Returns success or the validation error.
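Before the JWT check and the trial fetch, the request body itself can be validated cheaply. A sketch, assuming a handle rule that this spec does not define (`HANDLE_PATTERN` is hypothetical, as are the function and type names); JWT verification and the feed fetch are deliberately left out.

```typescript
interface RegisterRequest {
  googleIdToken: string;
  handle: string;
  feedUrl: string;
  siteUrl: string;
}

type RegisterValidation =
  | { ok: true; value: RegisterRequest }
  | { ok: false; error: string };

// The schema requires absolute https URLs for feed_url and site_url.
function isAbsoluteHttpsUrl(value: unknown): value is string {
  if (typeof value !== "string") return false;
  try {
    return new URL(value).protocol === "https:";
  } catch {
    return false; // not parseable as an absolute URL
  }
}

// Hypothetical handle rule; the spec only says handles are unique.
const HANDLE_PATTERN = /^[a-z0-9][a-z0-9_-]{0,31}$/;

function validateRegisterBody(body: unknown): RegisterValidation {
  if (typeof body !== "object" || body === null) {
    return { ok: false, error: "body must be a JSON object" };
  }
  const { googleIdToken, handle, feedUrl, siteUrl } = body as Record<string, unknown>;
  if (typeof googleIdToken !== "string" || googleIdToken.length === 0) {
    return { ok: false, error: "googleIdToken is required" };
  }
  if (typeof handle !== "string" || !HANDLE_PATTERN.test(handle)) {
    return { ok: false, error: "handle is missing or malformed" };
  }
  if (!isAbsoluteHttpsUrl(feedUrl)) return { ok: false, error: "feedUrl must be an absolute https URL" };
  if (!isAbsoluteHttpsUrl(siteUrl)) return { ok: false, error: "siteUrl must be an absolute https URL" };
  return { ok: true, value: { googleIdToken, handle, feedUrl, siteUrl } };
}
```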

`POST /api/unregister`

Soft-deletes the creator's registration and stops pulling. Existing items in the index are cleared after a 7-day grace period, so a creator who re-registers after a brief absence does not lose discovery momentum.

Request: { "googleIdToken": "<jwt>" } — sub-keyed.

`POST /api/refresh`

Priority-pull hint. Creator-side BKA pings this after a publish. Validates JWT (must match a registered creator). Moves the creator to the front of the next crawl batch.

Request: { "googleIdToken": "<jwt>" }
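How "front of the next crawl batch" might be ordered, as a sketch. Only the first rule comes from the text above; the ordering of the rest (never-pulled creators next, then the stalest) is an assumption, and `QueueEntry` and `nextBatch` are hypothetical names. Timestamps are assumed to be epoch milliseconds.

```typescript
interface QueueEntry {
  sub: string;
  lastPulledAtMs: number | null;       // null = never pulled
  refreshRequestedAtMs: number | null; // set by /api/refresh, cleared after the pull
}

// Sort key: [group, time]. Lower group wins; within a group, the older time wins.
function pullRank(entry: QueueEntry): [number, number] {
  if (entry.refreshRequestedAtMs !== null) return [0, entry.refreshRequestedAtMs];
  if (entry.lastPulledAtMs === null) return [1, 0];
  return [2, entry.lastPulledAtMs];
}

// Pick the next crawl batch without mutating the input queue.
function nextBatch(entries: QueueEntry[], batchSize: number = 100): QueueEntry[] {
  return [...entries]
    .sort((a, b) => {
      const [groupA, timeA] = pullRank(a);
      const [groupB, timeB] = pullRank(b);
      return groupA !== groupB ? groupA - groupB : timeA - timeB;
    })
    .slice(0, batchSize);
}
```

Because the hint only changes ordering, a refresh ping never carries content, which keeps /api/refresh consistent with the pull-only rule.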

`GET /api/discover`

Unauthenticated. The actual "what should I see next" route.

Query params:

q — optional text query
tags — optional comma-separated tag filter
kind — optional video|audio|image|text filter
since — optional ISO date, "items newer than this"
limit — default 20, max 100
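Parsing these parameters might look like the following. A sketch: clamping an oversized `limit` to 100 (rather than rejecting it) and rejecting an unknown `kind` are assumptions, and `parseDiscoverParams` and its types are hypothetical names. `since` is converted to epoch milliseconds.

```typescript
type MediaKind = "video" | "audio" | "image" | "text";
const MEDIA_KINDS: readonly MediaKind[] = ["video", "audio", "image", "text"];

interface DiscoverParams {
  q: string | null;
  tags: string[];
  kind: MediaKind | null;
  sinceMs: number | null;
  limit: number;
}

type ParsedDiscover =
  | { ok: true; params: DiscoverParams }
  | { ok: false; error: string };

function parseDiscoverParams(search: URLSearchParams): ParsedDiscover {
  const q = search.get("q")?.trim() || null;

  const tags = (search.get("tags") ?? "")
    .split(",")
    .map((tag) => tag.trim())
    .filter((tag) => tag.length > 0);

  const kindRaw = search.get("kind");
  const kind = MEDIA_KINDS.find((k) => k === kindRaw) ?? null;
  if (kindRaw !== null && kind === null) return { ok: false, error: "unknown kind" };

  const sinceRaw = search.get("since");
  const sinceMs = sinceRaw === null ? null : Date.parse(sinceRaw);
  if (sinceMs !== null && Number.isNaN(sinceMs)) return { ok: false, error: "since is not a valid date" };

  const limitRaw = search.get("limit");
  const requested = limitRaw === null ? 20 : Number(limitRaw);
  if (!Number.isInteger(requested) || requested < 1) return { ok: false, error: "limit must be a positive integer" };

  // Default 20, hard cap 100.
  return { ok: true, params: { q, tags, kind, sinceMs, limit: Math.min(requested, 100) } };
}
```

In a Worker this would be called as `parseDiscoverParams(new URL(request.url).searchParams)`.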

Worker:

Pulls a candidate set from the items table matching the filters
Routes the query + candidates through a BabyAI discovery specialist (see below)
Returns the ranked, truncated list as JSON
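The candidate pull in the first step could be a parameterized query built from the filters. A sketch with several assumptions: tags and media kinds are stored comma-separated without surrounding spaces, multiple tags combine with AND, `published_at` is epoch milliseconds, and tags contain no LIKE wildcards (`%` or `_`). `buildCandidateQuery` is a hypothetical name, and a real query would likely also join creators to exclude paused or deleted registrations.

```typescript
interface CandidateFilters {
  tags: string[];
  kind: string | null;
  sinceMs: number | null;
  maxCandidates: number; // how many rows to hand to the ranking step
}

interface SqlQuery {
  sql: string;
  params: Array<string | number>;
}

function buildCandidateQuery(filters: CandidateFilters): SqlQuery {
  const where: string[] = [];
  const params: Array<string | number> = [];

  for (const tag of filters.tags) {
    // Wrap the column in commas so the tag 'art' does not match 'cartoon'.
    where.push("(',' || tags || ',') LIKE ?");
    params.push(`%,${tag},%`);
  }
  if (filters.kind !== null) {
    where.push("(',' || media_kinds || ',') LIKE ?");
    params.push(`%,${filters.kind},%`);
  }
  if (filters.sinceMs !== null) {
    where.push("published_at > ?");
    params.push(filters.sinceMs);
  }

  const whereSql = where.length > 0 ? ` WHERE ${where.join(" AND ")}` : "";
  params.push(filters.maxCandidates);
  return {
    sql: `SELECT * FROM items${whereSql} ORDER BY published_at DESC LIMIT ?`,
    params,
  };
}
```

With a D1 binding the result would be run along the lines of `db.prepare(query.sql).bind(...query.params).all()`. Note that the leading-wildcard LIKE scans rows rather than using an index, which is acceptable only while the items table is small.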

Response:

{
  "items": [
    {
      "title": "...",
      "summary": "...",
      "tags": ["..."],
      "publishedAt": "2026-06-29T20:00:00Z",
      "url": "https://...",
      "thumbnailUrl": "https://...",
      "creator": { "handle": "chris", "siteUrl": "https://..." }
    }
  ],
  "model": "babyai-discovery-v1",
  "tookMs": 145
}
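Mapping a stored row onto this response shape is mostly renaming plus two conversions. A sketch, assuming `published_at` holds epoch milliseconds and that the row comes from a join of items with creators (so `handle` and `site_url` are present); `ItemRow`, `DiscoverItem`, and `toDiscoverItem` are hypothetical names.

```typescript
// Assumed shape of one row from items joined with creators.
interface ItemRow {
  title: string;
  summary: string | null;
  tags: string | null; // comma-separated
  published_at: number; // assumed epoch milliseconds
  url: string;
  thumbnail_url: string | null;
  handle: string;
  site_url: string;
}

interface DiscoverItem {
  title: string;
  summary: string | null;
  tags: string[];
  publishedAt: string;
  url: string;
  thumbnailUrl: string | null;
  creator: { handle: string; siteUrl: string };
}

function toDiscoverItem(row: ItemRow): DiscoverItem {
  return {
    title: row.title,
    summary: row.summary,
    // Split the comma-separated column back into an array, dropping empties.
    tags: (row.tags ?? "")
      .split(",")
      .map((tag) => tag.trim())
      .filter((tag) => tag.length > 0),
    // toISOString() always includes milliseconds; strip them to match the example above.
    publishedAt: new Date(row.published_at).toISOString().replace(/\.\d{3}Z$/, "Z"),
    url: row.url,
    thumbnailUrl: row.thumbnail_url,
    creator: { handle: row.handle, siteUrl: row.site_url },
  };
}
```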

`GET /api/recent`

Unauthenticated. Pure-recency view (no BabyAI). For discovery UIs that want "newest across all creators." Same response shape as /discover.

`GET /api/creators/:handle`

Public profile fetch. Returns the registered creator's metadata + recent items. Used by discovery clients to land on a creator's profile page.

BabyAI discovery specialists

Discovery routing uses BabyAI's MoE pattern, with specialist LoRAs trained for discovery-specific tasks:

| Specialist | Trained on | Job |
| --- | --- | --- |
| Content-matching | (query, accepted-item) pairs | Given a text query, rank candidate items by semantic relevance |
| Taste-matching | (user-history, accepted-item) pairs | Given a user's prior accepts, rank candidates by "you'd probably like this" |
| Freshness-balanced | n/a (heuristic) | Blend recency with relevance (boost new items mildly) |
| Diversity | n/a (heuristic) | Avoid returning N items from the same creator in a row |

For v1 ship: only content-matching + freshness. Taste-matching needs accumulated user-history which we don't have yet. Diversity is a post-filter heuristic that doesn't need a LoRA.
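The two heuristics are small enough to sketch directly. Everything numeric here is a placeholder: the 0.1 boost weight, the 7-day half-life, and the run length of 2 are assumptions, as is the idea that the content-matching specialist returns a relevance score in roughly [0, 1]. The function and type names are hypothetical.

```typescript
interface RankedCandidate {
  url: string;
  creatorHandle: string;
  relevance: number;      // assumed: confidence score from the content-matching specialist
  publishedAtMs: number;  // assumed epoch milliseconds
}

const ONE_DAY_MS = 24 * 60 * 60 * 1000;

// Mild recency boost: a brand-new item gains `weight`, halving every `halfLifeMs`.
function freshnessScore(
  candidate: RankedCandidate,
  nowMs: number,
  weight: number = 0.1,
  halfLifeMs: number = 7 * ONE_DAY_MS,
): number {
  const ageMs = Math.max(0, nowMs - candidate.publishedAtMs);
  return candidate.relevance + weight * Math.pow(0.5, ageMs / halfLifeMs);
}

function rankWithFreshness(candidates: RankedCandidate[], nowMs: number): RankedCandidate[] {
  return [...candidates].sort((a, b) => freshnessScore(b, nowMs) - freshnessScore(a, nowMs));
}

// Post-filter: never emit more than maxRun (>= 1) consecutive items from one creator,
// unless only that creator's items remain.
function diversify<T extends { creatorHandle: string }>(ranked: T[], maxRun: number = 2): T[] {
  const remaining = [...ranked];
  const out: T[] = [];
  while (remaining.length > 0) {
    const tail = out.slice(-maxRun);
    const first = tail[0];
    const blocked =
      first !== undefined &&
      tail.length === maxRun &&
      tail.every((item) => item.creatorHandle === first.creatorHandle)
        ? first.creatorHandle
        : null;
    let index = remaining.findIndex((item) => item.creatorHandle !== blocked);
    if (index === -1) index = 0; // only the blocked creator is left
    out.push(...remaining.splice(index, 1));
  }
  return out;
}
```

Because the boost is additive and capped at `weight`, a clearly more relevant older item still outranks a fresh but weaker one, which is the "mildly" in the table above.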

Routing: /api/discover sends {query, candidates} to BabyAI via the same proxy pattern as the community worker (worker holds HF_TOKEN; BabyAI Space serves the model). Single model call per query, returns ranked candidates with confidence scores.

For users who give explicit feedback (thumbs-up/down on returned items), that feedback trains the taste-matching LoRA via the existing Mosh Pit / preference-learning pattern. This follows the BabyAI two-LoRA architecture: preference data feeds routing-LoRA training.

What the indexer specifically never does

Never serves a post body, image, audio, or video. Only URLs pointing at the creator's own site.
Never caches content longer than the feed snapshot needs (and snapshots are metadata-only).
Never holds creator secrets (no GitHub tokens, no CF tokens). The only creator identifier stored is the Google sub, which is an opaque account id, not a credential.
Never proxies traffic to creator sites. Visitors load directly from the creator's CF Pages.
Never tracks individual visitors. No cookies, no fingerprinting, no per-user history server-side. (User-history for taste-matching is held client-side in the BKA app's localStorage and sent to BabyAI per-query; the indexer doesn't store it.)
Never accepts pushes from creators about content. Creators expose /feed.json; the indexer pulls. The /api/refresh route is a hint, not a content push.

These are load-bearing. Each one is the difference between "tiny coordinator" and "central platform."

Operator monitoring (us)

We need to know:

Per-creator pull success rate (is anyone unreachable?)
Cron job latency (are we falling behind?)
BabyAI call success rate (is discovery actually working?)
Top tags + top queries (for trend analysis, not personalization)

Use Cloudflare Workers Analytics Engine, same pattern as agicore-foundry's analytics-engine binding. Tiny dashboard at /admin/stats gated by ADMIN_TOKEN.
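A sketch of what a per-pull data point and the admin gate could look like. The field layout is hypothetical; the only thing taken from Analytics Engine itself is the data point shape (`indexes`, `blobs`, `doubles`) that the binding's `writeDataPoint` accepts. The `Bearer` header scheme for ADMIN_TOKEN is also an assumption.

```typescript
type PullStatus = "ok" | "unreachable" | "malformed" | "rate_limited";

interface DataPoint {
  indexes: string[]; // Analytics Engine allows a single index per data point
  blobs: string[];
  doubles: number[];
}

// Hypothetical layout: index = handle, blobs = [event, status], doubles = [latencyMs, success flag].
function pullDataPoint(handle: string, status: PullStatus, latencyMs: number): DataPoint {
  return {
    indexes: [handle],
    blobs: ["pull", status],
    doubles: [latencyMs, status === "ok" ? 1 : 0],
  };
}

// Gate for /admin/stats. Plain equality for brevity; a constant-time compare may be preferable.
function isAdminRequest(authorizationHeader: string | null, adminToken: string): boolean {
  return adminToken.length > 0 && authorizationHeader === `Bearer ${adminToken}`;
}
```

Averaging the success flag per handle gives the per-creator pull success rate, and the latency double covers the "are we falling behind" question.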

Scaling notes

D1 limits: 5GB per database. The items table grows by ~1KB per post. A creator publishing daily for 10 years = ~3650 rows × ~1KB = ~3.65MB. At that rate the raw cap is roughly 1,400 such creators per D1 (5GB / 3.65MB); planning for ~1000 leaves headroom for indexes and the creators table. Beyond that: shard by creator hash.
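Worker request limits: free tier 100K req/day; the paid plan includes 10M req/month with usage-based billing beyond that. Discovery API is the hot path; the first 10K daily-active discovery users fit on the free tier only if they average under 10 requests each per day.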
BabyAI rate limit: same as agicore-foundry's BabyAI route. If discovery starts dominating, we add a small cache (1-minute LRU on (query, tag-set) keys) to coalesce duplicate queries.
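The small cache mentioned in the last item could be a Map-backed LRU with a TTL. A sketch; `TtlLruCache` and `discoveryCacheKey` are hypothetical names, and the key normalization (trim, lowercase, sorted tags) is an assumption. In a Worker such a cache lives in one isolate's memory, so it coalesces duplicates only within that isolate.

```typescript
class TtlLruCache<V> {
  // Map preserves insertion order, so the first key is the least recently used.
  private readonly entries = new Map<string, { value: V; expiresAtMs: number }>();
  private readonly maxEntries: number;
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(maxEntries: number, ttlMs: number, now: () => number = () => Date.now()) {
    this.maxEntries = maxEntries;
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): V | undefined {
    const hit = this.entries.get(key);
    if (hit === undefined) return undefined;
    if (hit.expiresAtMs <= this.now()) {
      this.entries.delete(key); // expired
      return undefined;
    }
    // Re-insert to mark as most recently used.
    this.entries.delete(key);
    this.entries.set(key, hit);
    return hit.value;
  }

  set(key: string, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, { value, expiresAtMs: this.now() + this.ttlMs });
    if (this.entries.size > this.maxEntries) {
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
  }
}

// Normalize so "Jazz" + [b, a] and "jazz " + [a, b] share one cache entry.
function discoveryCacheKey(query: string, tags: string[]): string {
  return `${query.trim().toLowerCase()}|${[...tags].sort().join(",")}`;
}
```

Usage would be along the lines of `new TtlLruCache<string>(500, 60 * 1000)`, checking `get(discoveryCacheKey(q, tags))` before calling BabyAI and calling `set` with the ranked response afterwards; the 500-entry size is a placeholder.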

None of these are blockers for v1. All are addressable when (if) they matter.
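If sharding by creator hash ever does matter, the arithmetic and the routing are both short. A sketch: `creatorsPerDatabase` just restates the D1 estimate above using decimal units (1KB = 1000 bytes, 5GB = 5e9 bytes), and `shardFor` uses 32-bit FNV-1a as one reasonable stable hash, which is this example's choice and not something the design specifies.

```typescript
// How many creators fit in one database, ignoring index and table overhead.
function creatorsPerDatabase(capBytes: number, bytesPerRow: number, rowsPerCreator: number): number {
  return Math.floor(capBytes / (bytesPerRow * rowsPerCreator));
}

// 32-bit FNV-1a over the string's UTF-16 code units, reduced to a shard index.
function shardFor(creatorSub: string, shardCount: number): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < creatorSub.length; i++) {
    hash ^= creatorSub.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV prime, 32-bit multiply
  }
  return (hash >>> 0) % shardCount;
}
```

With the figures above, `creatorsPerDatabase(5e9, 1000, 3650)` comes out at 1369, so the ~1000 planning number leaves room for index overhead. A fixed modulo means changing `shardCount` later moves most creators between shards, so the count is worth choosing generously if this path is ever taken.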

Creator opt-out / privacy

Creators control discovery participation entirely:

Don't register → never indexed. Site still works.
Pause registration → indexer stops pulling, existing items soft-archived but not deleted (re-pull on resume).
Unregister → 7-day grace, then items purged.
Specific post opt-out: add "discoverable": false to the post's meta.json. Item drops from index on next pull.
Whole-feed opt-out for sensitive periods: set the feed's discoverable: false flag in the top-level feed.json metadata.
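On the indexer side, the two opt-out flags reduce to a filter applied before the diff. A sketch that assumes the per-post `discoverable` value from meta.json is carried through onto each feed item, which this doc does not state outright; the `FeedDocument` shape and `indexableSlugs` are hypothetical.

```typescript
// Assumed minimal feed shape; the canonical schema lives in 10_FEED_SCHEMA.md.
interface FeedDocument {
  discoverable?: boolean; // whole-feed flag; absent means discoverable
  items: Array<{ slug: string; discoverable?: boolean }>;
}

// Slugs the indexer may keep. Anything not returned here drops out of the index on this pull.
function indexableSlugs(feed: FeedDocument): string[] {
  if (feed.discoverable === false) return []; // whole-feed opt-out
  return feed.items
    .filter((item) => item.discoverable !== false) // per-post opt-out
    .map((item) => item.slug);
}
```

Treating an absent flag as "discoverable" keeps existing feeds working unchanged; only an explicit `false` opts out.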

The principle: creators tell the indexer what they want indexed. We index that. Nothing more.

Future: federation with other indexers

Long-term, the indexer surface itself can be one of many — third parties can run their own indexer pointed at the same /feed.json standard, with their own discovery models, their own moderation policies. The schema is the protocol; the indexer implementation is interchangeable.

Out of scope for v1. Worth knowing the door isn't locked.