Physics Corpus Quality Platform
A quality and observability layer over an LLM-driven physics problem generator — the eyes the upstream corpus never had on itself.
The upstream generator, batch_exams_gen, produces physics problems one at a time by mutating anchors from a ten-year F=ma corpus. Each generation step sees only its own problem, so there is no global view. As batches accumulate, two different anchors can mutate into nearly identical problems, and nothing upstream catches it. This platform supplies that missing global view.
How it works
Anchors and every generated problem load into a normalized BigQuery schema with the
anchors → batches → generated_problems lineage made explicit, not buried in string status fields.
A single embeddings table covers both anchors and generated outputs with an entity_type
discriminator, which keeps cross-type collision queries simple.
Collision detection runs over those embeddings with BigQuery vector search, flagging any pair above a cosine-similarity threshold that starts at 0.95 and is tuned from operator-resolved collisions over time. Pairs that share the same anchor are excluded — that similarity is expected lineage, not a collision.
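The detection rule above can be sketched in plain Python. This is a minimal in-memory illustration and not the BigQuery vector search query itself: the `EmbeddingRow` fields, the `find_collisions` name, and the brute-force pairwise loop are all assumptions made for the example.

```python
from dataclasses import dataclass
from itertools import combinations
from math import sqrt
from typing import List, Tuple


@dataclass(frozen=True)
class EmbeddingRow:
    # Hypothetical shape of one row in the shared embeddings table.
    entity_id: str
    entity_type: str          # discriminator: "anchor" or "generated"
    anchor_id: str            # lineage root; for an anchor, its own id
    vector: Tuple[float, ...]


def cosine_similarity(a: Tuple[float, ...], b: Tuple[float, ...]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def find_collisions(
    rows: List[EmbeddingRow], threshold: float = 0.95
) -> List[Tuple[str, str, float]]:
    """Return (left_id, right_id, score) for every pair above the threshold."""
    flagged = []
    for left, right in combinations(rows, 2):
        # Same-anchor similarity is expected lineage, not a collision.
        if left.anchor_id == right.anchor_id:
            continue
        score = cosine_similarity(left.vector, right.vector)
        if score > threshold:
            flagged.append((left.entity_id, right.entity_id, score))
    return flagged
```

The pairwise loop is quadratic in the number of rows, which is fine for illustrating the rule but is exactly the cost a vector index is meant to avoid at corpus scale.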
Every flagged pair lands in a durable collisions table with an operator resolution status
(open, accepted, or regenerated) plus resolved_at and resolved_by. That record is the
artifact the upstream generator structurally cannot produce.
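As a rough sketch of what one row in the collisions table might carry, the record and its status transition could look like the following. The class and field names other than `resolved_at` and `resolved_by` are hypothetical, and the rule that a collision can be resolved only once is an assumption of this example, not something the platform is known to enforce.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class Resolution(str, Enum):
    OPEN = "open"
    ACCEPTED = "accepted"
    REGENERATED = "regenerated"


@dataclass
class CollisionRecord:
    left_id: str               # hypothetical column names for the flagged pair
    right_id: str
    similarity: float
    status: Resolution = Resolution.OPEN
    resolved_at: Optional[datetime] = None
    resolved_by: Optional[str] = None

    def resolve(self, status: Resolution, operator: str) -> None:
        """Record an operator decision on an open collision."""
        if status is Resolution.OPEN:
            raise ValueError("resolve() needs 'accepted' or 'regenerated'")
        # Assumption: a resolved collision is not silently overwritten.
        if self.status is not Resolution.OPEN:
            raise ValueError("collision is already resolved")
        self.status = status
        self.resolved_at = datetime.now(timezone.utc)
        self.resolved_by = operator
```

Keeping the status as a closed enumeration, instead of a free-form string, matches the schema goal stated earlier of not burying state in string status fields.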
Orchestration
An Airflow DAG runs ingest → embed → detect → report daily, with idempotent reruns, a retry
policy, and a collision count passed to the report task. Each batch also gets an auto-generated
Mermaid lineage diagram showing its anchor → batch → generated relationships, which doubles as
documentation.
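A per-batch lineage diagram of this kind could be generated along the following lines. This is a sketch only: the `lineage_mermaid` function, its input shape, and the left-to-right layout are assumptions, and the ids are assumed to already be valid Mermaid node identifiers (letters, digits, underscores).

```python
from typing import Iterable, Tuple


def lineage_mermaid(batch_id: str, problems: Iterable[Tuple[str, str]]) -> str:
    """Build Mermaid flowchart text for one batch.

    `problems` is an iterable of (anchor_id, problem_id) pairs.
    """
    lines = ["graph LR"]
    seen_anchors = set()
    for anchor_id, problem_id in problems:
        # Draw each anchor -> batch edge once, however many problems it feeds.
        if anchor_id not in seen_anchors:
            lines.append(f"    {anchor_id} --> {batch_id}")
            seen_anchors.add(anchor_id)
        lines.append(f"    {batch_id} --> {problem_id}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(lineage_mermaid("batch_7", [("anchor_1", "gen_1"), ("anchor_2", "gen_2")]))
```

The output is plain text, so it can be written next to the batch report and rendered by any Mermaid-aware viewer.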
The embedding step uses OpenAI text-embedding-3-small as a cost-efficient choice for the prototype, and is structured so that
Vertex AI can replace it in production without changes to the rest of the pipeline. Local development runs through the Astronomer CLI.
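One way to keep the embedding provider swappable is to have the pipeline depend on a small interface and not on a specific client. The sketch below is illustrative: `Embedder`, `FakeEmbedder`, and `embed_problems` are hypothetical names, and the fake provider stands in for a real OpenAI or Vertex AI wrapper, neither of which is shown here.

```python
import hashlib
from typing import Dict, List, Protocol, Sequence


class Embedder(Protocol):
    """Anything that turns a batch of texts into one vector per text."""

    def embed(self, texts: Sequence[str]) -> List[List[float]]: ...


class FakeEmbedder:
    """Deterministic stand-in for a real provider, useful in local tests."""

    def __init__(self, dimensions: int = 4) -> None:
        self.dimensions = dimensions

    def embed(self, texts: Sequence[str]) -> List[List[float]]:
        vectors = []
        for text in texts:
            digest = hashlib.sha256(text.encode("utf-8")).digest()
            # Scale the first few hash bytes into [0, 1]; not semantically meaningful.
            vectors.append([byte / 255.0 for byte in digest[: self.dimensions]])
        return vectors


def embed_problems(embedder: Embedder, problems: Sequence[str]) -> List[Dict[str, object]]:
    """Embed problem statements using whichever provider was passed in."""
    vectors = embedder.embed(problems)
    return [{"text": text, "vector": vector} for text, vector in zip(problems, vectors)]
```

Under this arrangement, moving from one provider to another means writing a new class with the same `embed` method; `embed_problems` and everything downstream of it stay unchanged.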
The platform sits between batch_exams_gen upstream and PhysElo downstream, acting as the quality gate before generated problems reach a live contest.