07 — Discovery Layer
The federated indexer-worker — the only central component in BKA, and the most architecturally distinct piece. This doc specifies its responsibilities, what it stores, what it explicitly never stores, and how BabyAI routes discovery queries.
What the indexer is
A single Cloudflare Worker (~300-500 LOC) with two concerns:
- Federation: pulls registered creators' /feed.json on a cadence; maintains a metadata index
- Discovery API: answers "what should I see next" queries by routing through BabyAI specialist LoRAs
That's it. No content host. No proxy. No comment service. No subscriber list. The smallest possible central thing.
What the indexer holds
A KV / D1 store with these shapes (D1 for relational queries, KV for the cached feed snapshots):
creators table (D1) — one row per registered creator:
CREATE TABLE creators (
  google_sub       TEXT PRIMARY KEY,            -- Google's stable user id
  handle           TEXT NOT NULL UNIQUE,        -- the creator's BKA handle
  feed_url         TEXT NOT NULL,               -- absolute https URL to /feed.json
  site_url         TEXT NOT NULL,               -- absolute https URL to / (display)
  registered_at    INTEGER NOT NULL,
  last_pulled_at   INTEGER,
  last_pull_status TEXT,                        -- 'ok' | 'unreachable' | 'malformed' | 'rate_limited'
  paused           INTEGER NOT NULL DEFAULT 0,  -- creator-toggleable
  -- soft delete: if unregistered, mark with deleted_at and stop pulling
  deleted_at       INTEGER
);
items table (D1) — one row per discovered post per creator:
CREATE TABLE items (
  creator_sub   TEXT NOT NULL,
  slug          TEXT NOT NULL,
  title         TEXT NOT NULL,
  summary       TEXT,
  tags          TEXT,              -- comma-separated for simple search
  published_at  INTEGER NOT NULL,
  url           TEXT NOT NULL,     -- absolute URL to the post on the creator's site
  thumbnail_url TEXT,              -- absolute URL to thumbnail (if any)
  media_kinds   TEXT,              -- comma-separated: 'video','audio','image','text'
  first_seen_at INTEGER NOT NULL,  -- when the indexer first saw this item
  PRIMARY KEY (creator_sub, slug)
);
CREATE INDEX idx_items_pub ON items(published_at DESC);
CREATE INDEX idx_items_tags ON items(tags);
KV: feed-snapshot:<sub> — the raw last-fetched feed.json per creator (for cache/diff). TTL 7 days.
KV: pull-queue — priority queue for the crawl scheduler.
That's all. Notably absent:
- Post bodies (just URLs to the creator's site)
- Media (just URLs to GitHub Releases or creator's R2)
- Comments
- Subscribers
- Anything content-shaped
If the creator's site goes offline, the indexer has URLs to nothing. Visitors hitting discovery during that window get items pointing at unreachable hosts — that's the creator's problem, not ours to fix by caching.
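The storage boundary above can be pictured as a mapping from one feed entry to one items row. This is a minimal sketch, not the worker's actual code: the FeedItem field names are assumptions (the real shape lives in 10_FEED_SCHEMA.md), timestamps are assumed to be Unix seconds, and FeedItem, ItemRow, and toItemRow are hypothetical names. Anything content-shaped on the feed entry is simply never copied.

```typescript
// Hypothetical feed entry shape; the canonical one is defined in 10_FEED_SCHEMA.md.
interface FeedItem {
  slug: string;
  title: string;
  summary?: string;
  tags?: string[];
  publishedAt: string; // ISO 8601 (assumed)
  url: string;
  thumbnailUrl?: string;
  mediaKinds?: string[];
  [extra: string]: unknown; // anything else on the entry is ignored
}

// Mirrors the columns of the items table (D1).
interface ItemRow {
  creator_sub: string;
  slug: string;
  title: string;
  summary: string | null;
  tags: string | null;
  published_at: number; // Unix seconds (assumed unit)
  url: string;
  thumbnail_url: string | null;
  media_kinds: string | null;
  first_seen_at: number;
}

// Comma-joins a list for the TEXT columns; empty or missing lists become NULL.
function joinOrNull(values: string[] | undefined): string | null {
  return values && values.length > 0 ? values.join(",") : null;
}

// Copies metadata only. Bodies, media, and comments have no column to land in.
function toItemRow(creatorSub: string, item: FeedItem, nowSeconds: number): ItemRow {
  return {
    creator_sub: creatorSub,
    slug: item.slug,
    title: item.title,
    summary: item.summary ?? null,
    tags: joinOrNull(item.tags),
    published_at: Math.floor(Date.parse(item.publishedAt) / 1000),
    url: item.url,
    thumbnail_url: item.thumbnailUrl ?? null,
    media_kinds: joinOrNull(item.mediaKinds),
    first_seen_at: nowSeconds,
  };
}
```

Because the row type has no body or media column, a feed that accidentally includes content still cannot leak it into the index.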
Crawl scheduler
Runs as a Cloudflare cron trigger every 10 minutes:
- Select up to 100 creators from creators where paused = 0 AND deleted_at IS NULL AND (last_pulled_at IS NULL OR last_pulled_at < now() - cadence)
- For each:
  - GET <feed_url> with a short timeout (10s)
  - Parse as the canonical feed schema (10_FEED_SCHEMA.md)
  - Diff against last snapshot: identify new items, removed items, edited items
  - Insert new items, soft-delete removed items, update edited items
  - Cache the new snapshot in KV
  - Update last_pulled_at + last_pull_status
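The diff step in the loop above can be sketched as a pure function over two snapshots. This assumes items are identified by slug (matching the items primary key within one creator) and that an item counts as edited when any stored metadata field changes. SnapshotItem, FeedDiff, and diffSnapshots are hypothetical names, and the field list is an assumption.

```typescript
// Hypothetical subset of a feed entry, limited to fields the index stores.
interface SnapshotItem {
  slug: string;
  title: string;
  summary?: string;
  tags?: string[];
  url: string;
}

interface FeedDiff {
  added: SnapshotItem[];
  removed: SnapshotItem[];
  edited: SnapshotItem[];
}

// Serializes the compared fields in a fixed order so equal items always match.
function fingerprint(item: SnapshotItem): string {
  return JSON.stringify([item.title, item.summary ?? null, item.tags ?? [], item.url]);
}

function diffSnapshots(previous: SnapshotItem[], current: SnapshotItem[]): FeedDiff {
  const before = new Map<string, SnapshotItem>();
  for (const item of previous) before.set(item.slug, item);

  const diff: FeedDiff = { added: [], removed: [], edited: [] };
  const seen = new Set<string>();
  for (const item of current) {
    seen.add(item.slug);
    const old = before.get(item.slug);
    if (old === undefined) {
      diff.added.push(item); // slug not in the last snapshot
    } else if (fingerprint(old) !== fingerprint(item)) {
      diff.edited.push(item); // same slug, different metadata
    }
  }
  for (const item of previous) {
    if (!seen.has(item.slug)) diff.removed.push(item); // slug vanished from the feed
  }
  return diff;
}
```

Each bucket then maps onto one write: added to INSERT, edited to UPDATE, removed to the soft-delete.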
Cadence: hourly default per creator. Priority-pulled creators (recent publish, recent registration, high view count) move to a 10-minute cadence for ~6 hours, then back to hourly.
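A rough sketch of how the cadence rule could feed the scheduler's "is this creator due?" check. The creators table has no column for the priority window, so priorityStartedAt is a hypothetical field (it could live in the pull-queue KV entry instead). Timestamps are assumed to be Unix seconds, and "~6 hours" is taken as exactly 6.

```typescript
const HOUR_S = 3600;
const DEFAULT_CADENCE_S = HOUR_S;     // hourly default
const PRIORITY_CADENCE_S = 10 * 60;   // 10-minute cadence while prioritized
const PRIORITY_WINDOW_S = 6 * HOUR_S; // "~6 hours", taken as exactly 6 here

interface CrawlState {
  lastPulledAt: number | null;      // Unix seconds; null if never pulled
  priorityStartedAt: number | null; // hypothetical: when a priority trigger last fired
}

// Picks the cadence that applies to this creator right now.
function cadenceSeconds(state: CrawlState, now: number): number {
  const inWindow =
    state.priorityStartedAt !== null && now - state.priorityStartedAt < PRIORITY_WINDOW_S;
  return inWindow ? PRIORITY_CADENCE_S : DEFAULT_CADENCE_S;
}

// Same predicate as the scheduler's selection query:
// last_pulled_at IS NULL OR last_pulled_at < now - cadence.
function isDue(state: CrawlState, now: number): boolean {
  if (state.lastPulledAt === null) return true;
  return state.lastPulledAt < now - cadenceSeconds(state, now);
}
```

With the cron firing every 10 minutes, a prioritized creator is picked up on roughly every run and everyone else on roughly every sixth.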
Backoff: 3 consecutive failures → cadence drops to 6h, then 24h, then 7d. Recovered creators reset to hourly. Permanently unreachable creators (30d+) get marked paused and we email them (if we have an email — see open question on creator contact).
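One way to read the backoff ladder, as a sketch. The text does not say what moves a creator from 6h to 24h to 7d, so this assumes each further run of 3 consecutive failures steps one rung down. That stepping rule, the firstFailureAt field, and the function names are all assumptions.

```typescript
const HOURLY_S = 3600;
const SIX_HOURS_S = 6 * 3600;
const ONE_DAY_S = 24 * 3600;
const SEVEN_DAYS_S = 7 * ONE_DAY_S;
const PAUSE_AFTER_S = 30 * ONE_DAY_S; // "permanently unreachable (30d+)"

// Assumed stepping: 0-2 failures hourly, 3-5 -> 6h, 6-8 -> 24h, 9+ -> 7d.
function backoffCadenceSeconds(consecutiveFailures: number): number {
  if (consecutiveFailures < 3) return HOURLY_S;
  if (consecutiveFailures < 6) return SIX_HOURS_S;
  if (consecutiveFailures < 9) return ONE_DAY_S;
  return SEVEN_DAYS_S;
}

// A successful pull resets the counter, which puts the creator back on hourly.
function nextFailureCount(previousFailures: number, pullSucceeded: boolean): number {
  return pullSucceeded ? 0 : previousFailures + 1;
}

// firstFailureAt is a hypothetical field: start of the current failure streak.
function shouldPause(firstFailureAt: number, now: number): boolean {
  return now - firstFailureAt >= PAUSE_AFTER_S;
}
```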
Rate limiting: never more than 1 req/sec to any single creator origin, regardless of cadence. If we're priority-pulling 100 different creators that's fine (different origins); if we have multiple feeds on the same origin (unlikely for BKA but possible) they share the rate limit.
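The per-origin limit can be sketched as a planning pass over one crawl batch: feeds on different origins start immediately, and feeds sharing an origin are spaced at least a second apart. planFetches and PlannedFetch are hypothetical names. The sketch only covers spacing within a single batch, which is the case the paragraph describes.

```typescript
const MIN_GAP_MS = 1000; // never more than 1 req/sec to a single origin

interface PlannedFetch {
  feedUrl: string;
  startAtMs: number; // offset from the start of the batch
}

// Assigns each feed the earliest start offset that respects its origin's gap.
function planFetches(feedUrls: string[]): PlannedFetch[] {
  const nextFreeByOrigin = new Map<string, number>();
  const plan: PlannedFetch[] = [];
  for (const feedUrl of feedUrls) {
    const originKey = new URL(feedUrl).origin; // scheme + host + port
    const startAtMs = nextFreeByOrigin.get(originKey) ?? 0;
    plan.push({ feedUrl, startAtMs });
    nextFreeByOrigin.set(originKey, startAtMs + MIN_GAP_MS);
  }
  return plan;
}
```

Keying on origin rather than on the creator is what makes two feeds on the same host share one budget.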
Discovery API
POST /api/register
Creator registers their feed for discovery. Authenticated with Google JWT (same shape as the community worker).
Request:
{
  "googleIdToken": "<jwt>",
  "handle": "chris",
  "feedUrl": "https://chris-bka-public.pages.dev/feed.json",
  "siteUrl": "https://chris-bka-public.pages.dev"
}
Worker validates JWT, fetches the feed once to confirm it's reachable + valid schema, then inserts/updates the creators row keyed on Google sub. Returns success or the validation error.
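Before any JWT check or feed fetch, the request body itself can be validated cheaply. A minimal sketch of that first gate: the handle pattern is an assumption (this doc does not define handle rules), JWT verification and the feed fetch are deliberately left out, and validateRegister is a hypothetical name.

```typescript
interface RegisterRequest {
  googleIdToken: string;
  handle: string;
  feedUrl: string;
  siteUrl: string;
}

type RegisterValidation =
  | { ok: true; value: RegisterRequest }
  | { ok: false; error: string };

// Assumed handle rule (not specified in this doc): lowercase letters, digits, hyphens.
const HANDLE_PATTERN = /^[a-z0-9][a-z0-9-]{0,31}$/;

// True only for absolute https URLs, matching the feed_url / site_url column comments.
function isHttpsUrl(value: unknown): value is string {
  if (typeof value !== "string") return false;
  try {
    return new URL(value).protocol === "https:";
  } catch {
    return false; // not parseable as an absolute URL
  }
}

function validateRegister(body: unknown): RegisterValidation {
  if (typeof body !== "object" || body === null) {
    return { ok: false, error: "body must be a JSON object" };
  }
  const { googleIdToken, handle, feedUrl, siteUrl } = body as Record<string, unknown>;
  if (typeof googleIdToken !== "string" || googleIdToken.length === 0) {
    return { ok: false, error: "googleIdToken is required" };
  }
  if (typeof handle !== "string" || !HANDLE_PATTERN.test(handle)) {
    return { ok: false, error: "handle is missing or malformed" };
  }
  if (!isHttpsUrl(feedUrl)) return { ok: false, error: "feedUrl must be an absolute https URL" };
  if (!isHttpsUrl(siteUrl)) return { ok: false, error: "siteUrl must be an absolute https URL" };
  return { ok: true, value: { googleIdToken, handle, feedUrl, siteUrl } };
}
```

Returning the error string rather than throwing lines up with "returns success or the validation error."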
POST /api/unregister
Soft-deletes the creator's registration and stops pulling. Existing items in the index are cleared after a 7-day grace period, so a creator can re-register after a brief absence without losing discovery momentum.
Request: { "googleIdToken": "<jwt>" } — sub-keyed.
POST /api/refresh
Priority-pull hint. Creator-side BKA pings this after a publish. Validates JWT (must match a registered creator). Moves the creator to the front of the next crawl batch.
Request: { "googleIdToken": "<jwt>" }
GET /api/discover
Unauthenticated. The actual "what should I see next" route.
Query params:
- q: optional text query
- tags: optional comma-separated tag filter
- kind: optional video|audio|image|text filter
- since: optional ISO date, "items newer than this"
- limit: default 20, max 100
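A sketch of how these params might be parsed and clamped. Two assumptions: invalid values (unknown kind, unparseable since, non-numeric limit) fall back to "no filter" or the default instead of returning an error, and since is converted to Unix seconds to match published_at. DiscoverFilters and parseDiscoverParams are hypothetical names.

```typescript
const MEDIA_KINDS = ["video", "audio", "image", "text"] as const;
type MediaKind = typeof MEDIA_KINDS[number];

const DEFAULT_LIMIT = 20;
const MAX_LIMIT = 100;

interface DiscoverFilters {
  q: string | null;
  tags: string[];
  kind: MediaKind | null;
  sinceSeconds: number | null; // Unix seconds (assumed unit of published_at)
  limit: number;
}

function parseDiscoverParams(params: URLSearchParams): DiscoverFilters {
  const q = params.get("q")?.trim() || null;

  // Split the comma-separated filter and drop empty entries.
  const tags = (params.get("tags") ?? "")
    .split(",")
    .map((t) => t.trim())
    .filter((t) => t.length > 0);

  // Unknown kinds are ignored rather than rejected (assumption).
  const rawKind = params.get("kind");
  const kind = MEDIA_KINDS.find((k) => k === rawKind) ?? null;

  const rawSince = params.get("since");
  const sinceMs = rawSince === null ? NaN : Date.parse(rawSince);
  const sinceSeconds = Number.isNaN(sinceMs) ? null : Math.floor(sinceMs / 1000);

  // Default 20, clamped to the range 1..100.
  const rawLimit = Number.parseInt(params.get("limit") ?? "", 10);
  const limit = Number.isNaN(rawLimit)
    ? DEFAULT_LIMIT
    : Math.min(Math.max(rawLimit, 1), MAX_LIMIT);

  return { q, tags, kind, sinceSeconds, limit };
}
```

In a Worker the input would typically come from new URL(request.url).searchParams.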
Worker:
- Pulls a candidate set from the items table matching the filters
- Routes the query + candidates through a BabyAI discovery specialist (see below)
- Returns the ranked + cut list as JSON
Response:
{
  "items": [
    {
      "title": "...",
      "summary": "...",
      "tags": ["..."],
      "publishedAt": "2026-06-29T20:00:00Z",
      "url": "https://...",
      "thumbnailUrl": "https://...",
      "creator": { "handle": "chris", "siteUrl": "https://..." }
    }
  ],
  "model": "babyai-discovery-v1",
  "tookMs": 145
}
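The response shape differs from the stored row in two places: tags go from a comma-separated string to an array, and published_at goes from an integer to an ISO timestamp. A sketch of that conversion, assuming published_at is Unix seconds and that missing summary/thumbnail values are returned as null. DiscoverRow stands for a hypothetical items-joined-with-creators result.

```typescript
// Hypothetical result row: items joined with creators on creator_sub = google_sub.
interface DiscoverRow {
  title: string;
  summary: string | null;
  tags: string | null;  // comma-separated, as stored
  published_at: number; // Unix seconds (assumed unit)
  url: string;
  thumbnail_url: string | null;
  handle: string;
  site_url: string;
}

interface DiscoverItem {
  title: string;
  summary: string | null;
  tags: string[];
  publishedAt: string;
  url: string;
  thumbnailUrl: string | null;
  creator: { handle: string; siteUrl: string };
}

function toDiscoverItem(row: DiscoverRow): DiscoverItem {
  return {
    title: row.title,
    summary: row.summary,
    tags: row.tags ? row.tags.split(",") : [],
    // toISOString always emits milliseconds; trim ".000" to match the example format.
    publishedAt: new Date(row.published_at * 1000).toISOString().replace(".000Z", "Z"),
    url: row.url,
    thumbnailUrl: row.thumbnail_url,
    creator: { handle: row.handle, siteUrl: row.site_url },
  };
}

function buildDiscoverResponse(rows: DiscoverRow[], model: string, tookMs: number) {
  return { items: rows.map(toDiscoverItem), model, tookMs };
}
```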
GET /api/recent
Unauthenticated. Pure-recency view (no BabyAI). For discovery UIs that want "newest across all creators." Same response shape as /discover.
GET /api/creators/:handle
Public profile fetch. Returns the registered creator's metadata + recent items. Used by discovery clients to land on a creator's profile page.
BabyAI discovery specialists
Discovery routing uses BabyAI's MoE pattern, with specialist LoRAs trained for discovery-specific tasks:
| Specialist | Trained on | Job |
|---|---|---|
| Content-matching | (query, accepted-item) pairs | Given a text query, rank candidate items by semantic relevance |
| Taste-matching | (user-history, accepted-item) pairs | Given a user's prior accepts, rank candidates by "you'd probably like this" |
| Freshness-balanced | n/a (heuristic) | Blend recency with relevance (boost new items mildly) |
| Diversity | n/a (heuristic) | Avoid returning N items from the same creator in a row |
For v1 ship: only content-matching + freshness. Taste-matching needs accumulated user-history which we don't have yet. Diversity is a post-filter heuristic that doesn't need a LoRA.
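The two heuristic rows need no model, so they can be sketched directly. Everything numeric here is an assumption chosen for illustration: a 0.2 freshness weight as one reading of "boost new items mildly", a 7-day half-life, and a cap of 2 consecutive items per creator. relevance is taken to be a 0..1 score from the content-matching specialist.

```typescript
interface ScoredItem {
  slug: string;
  creatorHandle: string;
  relevance: number;   // 0..1, assumed output of the content-matching specialist
  publishedAt: number; // Unix seconds
}

const FRESHNESS_WEIGHT = 0.2;          // assumed: a mild boost, relevance still dominates
const HALF_LIFE_S = 7 * 24 * 3600;     // assumed: freshness halves every 7 days

// 1.0 for a brand-new item, decaying exponentially with age.
function freshness(publishedAt: number, now: number): number {
  const ageSeconds = Math.max(0, now - publishedAt);
  return Math.pow(0.5, ageSeconds / HALF_LIFE_S);
}

function blendedScore(item: ScoredItem, now: number): number {
  return (1 - FRESHNESS_WEIGHT) * item.relevance + FRESHNESS_WEIGHT * freshness(item.publishedAt, now);
}

function rankWithFreshness(items: ScoredItem[], now: number): ScoredItem[] {
  return items.slice().sort((a, b) => blendedScore(b, now) - blendedScore(a, now));
}

// True if appending this creator would make a run longer than maxRun.
function wouldExceedRun(out: ScoredItem[], handle: string, maxRun: number): boolean {
  if (out.length < maxRun) return false;
  for (let i = out.length - maxRun; i < out.length; i++) {
    if (out[i].creatorHandle !== handle) return false;
  }
  return true;
}

// Post-filter: walk the ranked list, deferring items that would extend a run.
function diversify(ranked: ScoredItem[], maxRun: number): ScoredItem[] {
  const remaining = ranked.slice();
  const out: ScoredItem[] = [];
  while (remaining.length > 0) {
    let pick = remaining.findIndex((c) => !wouldExceedRun(out, c.creatorHandle, maxRun));
    if (pick === -1) pick = 0; // only one creator left; accept the run
    out.push(remaining.splice(pick, 1)[0]);
  }
  return out;
}
```

Usage would be diversify(rankWithFreshness(candidates, now), 2) followed by a slice to limit, so diversity reorders but never drops an item before the cut.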
Routing: /api/discover sends {query, candidates} to BabyAI via the same proxy pattern as the community worker (worker holds HF_TOKEN; BabyAI Space serves the model). Single model call per query, returns ranked candidates with confidence scores.
For users who give explicit feedback (thumbs-up/down on returned items), the data trains the taste-matching LoRA via the existing Mosh Pit / preference-learning pattern. Per the BabyAI two-LoRA architecture memory: preference data → routing LoRA training.
What the indexer specifically never does
- Never serves a post body, image, audio, or video. Only URLs pointing at the creator's own site.
- Never caches content longer than the feed snapshot needs (and snapshots are metadata-only).
- Never holds creator secrets (no GitHub tokens, no CF tokens — only Google sub which is a public identifier).
- Never proxies traffic to creator sites. Visitors load directly from the creator's CF Pages.
- Never tracks individual visitors. No cookies, no fingerprinting, no per-user history server-side. (User-history for taste-matching is held client-side in the BKA app's localStorage and sent to BabyAI per-query; the indexer doesn't store it.)
- Never accepts pushes from creators about content. Creators expose /feed.json; the indexer pulls. The /api/refresh route is a hint, not a content push.
These are load-bearing. Each one is the difference between "tiny coordinator" and "central platform."
Operator monitoring (us)
We need to know:
- Per-creator pull success rate (is anyone unreachable?)
- Cron job latency (are we falling behind?)
- BabyAI call success rate (is discovery actually working?)
- Top tags + top queries (for trend analysis, not personalization)
Use Cloudflare Workers Analytics Engine, same pattern as agicore-foundry's analytics-engine binding. Tiny dashboard at /admin/stats gated by ADMIN_TOKEN.
Scaling notes
- D1 limits: 5GB per database. Each post adds ~1KB to the items table. A creator publishing daily for 10 years = ~3650 rows × ~1KB = ~3.6MB, so 5GB / ~3.6MB ≈ 1,400 such creators in raw row data. We plan for ~1000 per D1 to leave headroom for indexes. Beyond that: shard by creator hash.
- Worker request limits: free tier 100K req/day, paid 10M+. Discovery API is the hot path; for the first 10K daily-active discovery users we're fine on the free tier, provided each averages no more than ~10 requests per day (100K / 10K).
- BabyAI rate limit: same as agicore-foundry's BabyAI route. If discovery starts dominating, we add a small cache (1-minute LRU on (query, tag-set) keys) to coalesce duplicate queries.
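If the coalescing cache from the last bullet is ever needed, it could look roughly like this. The sketch assumes queries and tags are compared case-insensitively and that tag order does not matter (hence "tag-set"). DiscoveryCache and cacheKey are hypothetical names. An in-memory Map is only shared within one Worker isolate, so this coalesces duplicates opportunistically and guarantees nothing.

```typescript
const CACHE_TTL_MS = 60 * 1000; // 1-minute entries

// Normalizes (query, tag-set) so equivalent requests share one key.
function cacheKey(query: string, tags: string[]): string {
  const tagSet = Array.from(new Set(tags.map((t) => t.trim().toLowerCase())))
    .filter((t) => t.length > 0)
    .sort();
  return JSON.stringify([query.trim().toLowerCase(), tagSet]);
}

interface CacheEntry<V> {
  value: V;
  expiresAt: number;
}

// Small LRU with TTL. Map iteration order is insertion order, so the first
// key is always the least recently used one.
class DiscoveryCache<V> {
  private entries: Map<string, CacheEntry<V>>;
  private maxEntries: number;
  private ttlMs: number;

  constructor(maxEntries: number, ttlMs: number = CACHE_TTL_MS) {
    this.entries = new Map();
    this.maxEntries = maxEntries;
    this.ttlMs = ttlMs;
  }

  get(key: string, now: number): V | undefined {
    const hit = this.entries.get(key);
    if (hit === undefined) return undefined;
    this.entries.delete(key);
    if (hit.expiresAt <= now) return undefined; // expired: stays deleted
    this.entries.set(key, hit); // re-insert to mark as most recently used
    return hit.value;
  }

  set(key: string, value: V, now: number): void {
    this.entries.delete(key);
    this.entries.set(key, { value, expiresAt: now + this.ttlMs });
    if (this.entries.size > this.maxEntries) {
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
  }
}
```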
None of these are blockers for v1. All are addressable when (if) they matter.
Creator opt-out / privacy
Creators control discovery participation entirely:
- Don't register → never indexed. Site still works.
- Pause registration → indexer stops pulling, existing items soft-archived but not deleted (re-pull on resume).
- Unregister → 7-day grace, then items purged.
- Specific post opt-out: add "discoverable": false to the post's meta.json. Item drops from index on next pull.
- Whole-feed opt-out for sensitive periods: set the feed's discoverable: false flag in the top-level feed.json metadata.
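On the indexer side, the two discoverable flags reduce to one filter applied before the diff. A minimal sketch under two assumptions: the per-post flag from meta.json is carried through onto the feed entry, and a missing flag means "discoverable" (opt-out, not opt-in). OptOutFeed, OptOutItem, and indexableItems are hypothetical names.

```typescript
interface OptOutItem {
  slug: string;
  discoverable?: boolean; // assumed to be copied from the post's meta.json
}

interface OptOutFeed {
  discoverable?: boolean; // top-level feed.json flag
  items: OptOutItem[];
}

// Returns only the items the creator has left open to indexing.
function indexableItems(feed: OptOutFeed): OptOutItem[] {
  if (feed.discoverable === false) return []; // whole-feed opt-out
  return feed.items.filter((item) => item.discoverable !== false);
}
```

Feeding this filtered list into the snapshot diff makes an opted-out post look like a removed item, which is how it drops from the index on the next pull.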
The principle: creators tell the indexer what they want indexed. We index that. Nothing more.
Future: federation with other indexers
Long-term, the indexer surface itself can be one of many — third parties can run their own indexer pointed at the same /feed.json standard, with their own discovery models, their own moderation policies. The schema is the protocol; the indexer implementation is interchangeable.
Out of scope for v1. Worth knowing the door isn't locked.