VeriBiota Profile Catalog v0.1 – Spec¶

This catalog defines what VeriBiota profiles judge across the stack. Profiles are grouped by implementation tier.

Profile	Status	Domain	Consumers
global_affine_v1	Implemented (proof-backed anchors)	alignment	OGN, Helix
edit_script_v1	Implemented (proof-backed anchors)	edit	Helix, OGN
edit_script_normal_form_v1	Implemented (proof-backed anchors)	edit	Helix
prime_edit_plan_v1	Implemented (contract-checked; anchors reserved)	CRISPR/Prime	Helix
pair_hmm_bridge_v1	Implemented (contract-checked; anchors reserved)	HMM/variant	OGN
read_set_conservation_v1	Schema/manifest only (no `check` route yet)	pipeline	OGN
vcf_normalization_v1	Tier 1 (contract-checked; anchors reserved)	variant	OGN, external
offtarget_score_sanity_v1	Schema/manifest only (no `check` route yet)	CRISPR	Helix
snapshot_signature_v1	Emitted provenance record (proof-backed anchor)	provenance	Helix, VeriBiota

Proof status note: “proof-backed anchors” means the referenced theorem IDs are non-placeholder Lean theorems in Biosim/VeriBiota/Theorems.lean. “anchors reserved” means the profile is implemented and tested, but one or more referenced theorem IDs are still placeholders today.

Conventions¶

Profile name: snake_case_vN (e.g., global_affine_v1).
JSON schema: schemas/<domain>/<profile>.schema.json.
Golden fixtures: Tests/profiles/<profile>/*.json.
Manifest: profiles/manifest.json maps profile name → schema hash, theorem IDs.
Verdict JSON shape:

{
  "profile": "<name>",
  "profile_version": "X.Y.Z",
  "status": "passed | failed",
  "theorems": ["VB_..."],
  "instance": { "...summary..." },
  "engine": {
    "veribiota_version": "...",
    "lean_version": "...",
    "build_id": "..."
  }
}

Exit codes: 0 profile holds (passed); 2 profile checked but failed; 1 malformed input or internal error.

Tier 0 – Implemented Profiles¶

1. global_affine_v1 (Status: Implemented, contract locked)¶

Purpose: Guarantee that an implementation’s global affine alignment result matches the formal DP specification and trace semantics.
Intended consumers: OGN core aligner (CPU, GPU); any external tool claiming “VeriBiota-verified global alignment”.
Inputs (conceptual): seqA, seqB strings; scoring { match, mismatch, gap_open, gap_extend }; witness with score (int) and trace list of steps { op: "M" | "I" | "D" }.
Verified properties:
Trace validity: path from (0,0) to (|A|,|B|); no out-of-bounds or illegal transitions.
Score consistency: traceScore(seqA, seqB, scoring, trace) = witness.score.
DP optimality/equivalence: DP recurrence satisfied; dpScore(seqA, seqB, scoring) = specGlobalAffine(seqA, seqB, scoring) = witness.score.
Instance summary: verdict includes seqA_length, seqB_length, trace_length.
Theorem anchors: VB_ALIGN_CORE_001 (DP equals spec score); VB_ALIGN_CORE_002 (trace from DP achieves dpScore).

2. edit_script_v1 (Status: Implemented, contract locked)¶

Purpose: Guarantee that a claimed edit script coherently transforms seqA into seqB.
Intended consumers: Helix edit planner; any tool generating explicit edit scripts (CRISPR, Prime, patching).
Inputs (conceptual): seqA, seqB; edits list of { type: "ins" | "del" | "sub", pos: int, len: int (for del/sub), payload: str (for ins/sub) }.
Verified properties:
Well-defined application: edits apply sequentially without index under/overflow; positions and lengths respect current string state.
Final result equality: applying edits left-to-right yields exactly seqB.
Basic coherence: no degenerate no-op edits (unless later normalized); no overlapping ambiguous edits.
Instance summary: seqA_length, seqB_length, edit_count, and applied_ok boolean.
Theorem anchors: VB_EDIT_001 – edit script application is total and well-defined.

3. edit_script_normal_form_v1 (Status: Implemented, proof-anchored)¶

Purpose: Provide a canonical normalized form so different tools agree on “the same edit”.
Intended consumers: Helix planner; any diff/patch layer in OGN; future audit/tracking.
Inputs: seqA, seqB; edits arbitrary script; optionally normalized_edits claimed normal form.
Verified properties:
Normalization correctness: applyEdits(seqA, edits) = applyEdits(seqA, normalized_edits) = seqB.
Normal form invariants: no adjacent merges possible; minimal representation under chosen ordering (e.g., leftmost, shortest); no redundant sub that is a match.
Idempotence: re-normalizing normalized_edits yields the same script.
Theorem anchors: VB_EDIT_001 – base script correctness; VB_EDIT_002 – normalization preserves semantics and is idempotent.

Implemented profiles with reserved theorem anchors¶

4. prime_edit_plan_v1 (Status: Implemented, anchors reserved)¶

Purpose: Verify that a Prime Editing plan (guide, RT template, nicking strategy) corresponds to a desired net edit script.
Intended consumers: Helix prime editor module (simulated); future recommendations, constraints, AI models.
Inputs (conceptual):
ref_seq: reference context window.
target_seq: intended edited sequence (local window).
plan: pegRNA fields (spacer, PAM, RT template, PBS), optional nicking guide, cut positions, net_edit_script (reuse edit script schema) from ref_seq to target_seq.
Verified properties:
Plan-to-edit linkage: net_edit_script encodes exactly the delta between ref_seq and target_seq; sites of change align with RT template design.
Internal consistency: PBS and RT lengths within allowed bounds; PAM positions consistent with spacer location.
Relationship to edit_script_v1: calls edit_script_v1 on (ref_seq, target_seq, net_edit_script).
Theorem anchors: VB_PE_001 – reserved anchor for prime plan net-edit linkage; VB_EDIT_001 reused for base script correctness.

5. pair_hmm_bridge_v1 (Status: Implemented, anchors reserved)¶

Purpose: Tie DP-based alignment scoring to Pair-HMM likelihood scoring under a specified parameter mapping (small instances).
Intended consumers: OGN variant calling stack; any HMM-based likelihood computation referencing DP.
Inputs: seqA, seqB; dp_scoring match/mismatch/gap penalties; hmm_params transition and emission probabilities (log space); dp_witness DP score (optional trace); hmm_witness log-likelihood or Viterbi score.
Verified properties:
Parameter mapping correctness: DP gap open/extend and mismatch penalties correspond to log transition/emission probabilities via the mapping.
Score equivalence within tolerance: for finite sequences, dp_score equals (or within rounding epsilon of) hmm_score under the mapping.
Structural equivalence (optional): Viterbi path encoded by HMM corresponds to a valid DP alignment path.
Theorem anchors: VB_HMM_001 – reserved anchor for mapping lemma (DP parameters ↔ HMM parameters); VB_HMM_002 – reserved anchor for small-instance equivalence theorem.

6. read_set_conservation_v1 (Status: Schema/manifest only, planned)¶

Purpose: Verify that a pipeline stage conserves the multiset of reads and sample labels modulo defined operations (e.g., marking duplicates, sorting).
Intended consumers: OGN end-to-end pipeline; any ETL/transform stage in the stack.
Inputs: input_summary (read count, sample IDs, checksums per-sample/lane, optional k-mer sketch); output_summary same structure; allowed_ops flags (e.g., sort, tag, mark-dup).
Verified properties:
Count and identity preservation: for non-dropping stages, total counts match; sample IDs preserved without cross-sample mixing.
Checksum or sketch consistency: aggregate hashes or sketches consistent with allowed operations; if dropping low-quality reads, the allowed pattern is codified.
Theorem anchors (planned): VB_PIPE_001 – multiset preservation under pure reordering; VB_PIPE_002 – constrained drop rules.

7. vcf_normalization_v1 (Status: Tier 1 – contract-checked, anchors reserved)¶

Purpose: Guarantee that VCF entries are normalized (left-aligned, minimal representation) without changing variant meaning.
Inputs: hashes/IDs of pre- and post-normalization VCFs; optional reference FASTA hash; per-variant records describing original vs normalized locus/ref/alt and operations applied.
Verified properties:
Variant preservation: canonicalized original equals canonicalized normalized (no gain/loss of variants).
Left-alignment/minimality: normalized record matches canonical form (no shared prefix/suffix remains).
Ref/alt consistency: ref/alt remain valid after trimming and stay aligned to the reference context.
Idempotence: re-normalizing an already normalized variant yields the same variant.
Theorem anchors (Tier 1, reserved today): VB_VCF_001 – normalization preserves semantics; VB_VCF_002 – normalization is idempotent/unique in window.

Tier 2 – Future Profiles (Defined for alignment now)¶

8. offtarget_score_sanity_v1¶

Purpose: Sanity-check off-target scoring fields (CRISPR/Prime design) for monotonicity and boundedness.
Inputs: sites list with mismatch counts, distances to genomic landmarks, reported scores/probabilities.
Verified properties:
Monotonicity: more mismatches must not yield higher risk scores than strictly fewer mismatches in the same context.
Bounded probabilities: probabilities sum to at most 1 across mutually exclusive outcomes.
Basic consistency: no NaNs, infinities, or out-of-range values.
Theorem anchors (planned): VB_OFF_001 – monotonicity lemma over discrete mismatch counts.

9. snapshot_signature_v1 (Provenance binding)¶

Purpose: Emit a provenance record binding a verification run to its input hash, schema hash/id, theorem IDs, and build metadata.
Important: despite the name, snapshot_signature_v1 is not a cryptographic signature. Cryptographic signing is handled separately (Ed25519/JWS on checks/certificates).
Inputs: snapshot_profile, snapshot_hash, schema_hash, schema_id, theorem_ids, build/version/timestamp, plus optional instance_summary and provenance metadata.
Verified properties:
Binding: the emitted document’s schema_hash, schema_id, and theorem_ids mirror the manifest entry used for the run.
Hash recording: the emitted snapshot_hash records the tool-computed sha256:<hex> digest over the exact input bytes supplied for verification.
Theorem anchors: VB_SIG_001 – snapshot signature binds hash + manifest metadata (implemented).

How to Use This Spec¶

Treat this file as the jurisdiction for what VeriBiota is allowed to verify.
Wire Tier 0 profiles wherever needed; implement Tier 1 next; keep Tier 2 visible to guide design choices.
Manifest (profiles/manifest.json), schemas (schemas/...), and golden fixtures (Tests/profiles/...) should stay in sync with these definitions.