For Enterprise & Procurement Teams · v1.3

Evidence Pack

System behavior guarantees, a worked failure trace, and the audit readiness model — for teams evaluating adoption. Self-attested, open for community review.

Diagram of the five-layer NHID-Clinical trust stack: STIR/SHAKEN, NHID-Clinical v1.3, NHID-Auth v2, FHIR AuditEvent R4, and OpenTelemetry, over the Layer 0 NPI gap.

Five testable layers that turn trust claims into evidence

Scope of this document: This describes the reference implementation's technical properties. NHID-Clinical does not issue certifications, conduct audits, or validate vendor implementations. Everything here is self-attested and open for community review.

1. System Behavior Guarantees

Deterministic output

Identical input event + identical policy version → identical trace output. The policy engine is a pure function with no side effects.
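The determinism property can be illustrated with a minimal sketch. The function name `evaluate_policy`, the `CONTINUE` action, and the exact event fields are assumptions made for illustration (the fields are borrowed from the failure trace in section 3); the reference implementation's real signature may differ.

```python
from typing import Any, Dict

POLICY_VERSION = "nhid-clinical-v1.3"


def evaluate_policy(event: Dict[str, Any],
                    policy_version: str = POLICY_VERSION) -> Dict[str, Any]:
    """Hypothetical pure policy function.

    The result depends only on the arguments: no clock reads, no I/O,
    no module-level state is consulted or modified.
    """
    needs_disclosure = (
        event.get("turn_count", 0) == 0
        and not event.get("disclosure_confirmed", False)
    )
    return {
        "policy_version": policy_version,
        "control": "IDG-01",
        # "CONTINUE" is an illustrative placeholder action name.
        "action": "DISCLOSE_IDENTITY" if needs_disclosure else "CONTINUE",
    }


event = {"turn_count": 0, "disclosure_confirmed": False}
first = evaluate_policy(event)
second = evaluate_policy(event)
# Identical input and identical policy version give identical output.
assert first == second
```

Keeping timestamps and other run-specific values out of the function's return value is what makes this kind of equality check meaningful.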

Replay guarantee

Any stored trace can be replayed. Policy version is embedded in each event header; version mismatches are flagged.
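A replay check along these lines might look as follows. This is a sketch under stated assumptions: the stored-event layout (`header`, `input`, `decision`), the status strings, and the `toy_evaluate` stand-in are all hypothetical, not the reference implementation's schema.

```python
from typing import Any, Callable, Dict


def replay(stored_event: Dict[str, Any],
           engine_version: str,
           evaluate: Callable[[Dict[str, Any]], Dict[str, Any]]) -> Dict[str, Any]:
    """Re-run a stored event and compare against the recorded decision.

    If the recorded policy version differs from the running engine's
    version, the mismatch is reported instead of evaluating.
    """
    recorded = stored_event["header"]["policy_version"]
    if recorded != engine_version:
        return {"status": "VERSION_MISMATCH",
                "recorded": recorded,
                "engine": engine_version}
    replayed = evaluate(stored_event["input"])
    matches = replayed == stored_event["decision"]
    return {"status": "MATCH" if matches else "DIVERGED", "decision": replayed}


def toy_evaluate(payload: Dict[str, Any]) -> Dict[str, Any]:
    # Stand-in for the real policy engine.
    action = "DISCLOSE_IDENTITY" if payload["turn_count"] == 0 else "CONTINUE"
    return {"action": action}


stored = {
    "header": {"policy_version": "nhid-clinical-v1.3"},
    "input": {"turn_count": 0},
    "decision": {"action": "DISCLOSE_IDENTITY"},
}
same_version = replay(stored, "nhid-clinical-v1.3", toy_evaluate)
other_version = replay(stored, "nhid-clinical-v1.4", toy_evaluate)
```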

Failure invariants

The engine never raises an unhandled exception. Malformed input returns a deterministic error trace; the caller always gets a response.
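One way to get this invariant is to wrap evaluation so that every parse or validation error maps to a fixed error trace. The sketch below assumes a JSON request body and a required `turn_count` field; both, along with the name `safe_evaluate` and the shape of the error trace, are illustrative assumptions.

```python
import json
from typing import Any, Dict


def safe_evaluate(raw_body: str) -> Dict[str, Any]:
    """Always return a trace dict; malformed input yields an error trace."""
    try:
        event = json.loads(raw_body)
        if not isinstance(event, dict) or "turn_count" not in event:
            raise ValueError("missing turn_count")
        return {"status": "ok", "turn_count": event["turn_count"]}
    except (ValueError, TypeError) as exc:
        # json.JSONDecodeError subclasses ValueError, so parse errors
        # are handled here as well. The error trace contains no
        # timestamps, so the same bad input gives the same trace.
        return {"status": "error", "error_type": type(exc).__name__}


bad = safe_evaluate("{not json")
good = safe_evaluate('{"turn_count": 0}')
```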

Idempotency

Submitting the same request_id twice produces the same policy decision. The event store deduplicates at PERSIST.
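Deduplication at the persistence step can be sketched with SQLite, which the reference implementation uses as its event store. The table layout and the `persist` helper below are assumptions for illustration; only `request_id` comes from the text above.

```python
import json
import sqlite3
from typing import Any, Dict


def open_store() -> sqlite3.Connection:
    """In-memory store with request_id as the primary key (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events (request_id TEXT PRIMARY KEY, trace TEXT NOT NULL)"
    )
    return conn


def persist(conn: sqlite3.Connection, request_id: str,
            trace: Dict[str, Any]) -> Dict[str, Any]:
    """Write the trace once; a repeated request_id returns the stored trace."""
    conn.execute(
        "INSERT OR IGNORE INTO events (request_id, trace) VALUES (?, ?)",
        (request_id, json.dumps(trace, sort_keys=True)),
    )
    conn.commit()
    row = conn.execute(
        "SELECT trace FROM events WHERE request_id = ?", (request_id,)
    ).fetchone()
    return json.loads(row[0])


conn = open_store()
first = persist(conn, "req-1", {"action": "DISCLOSE_IDENTITY"})
second = persist(conn, "req-1", {"action": "DISCLOSE_IDENTITY"})
event_count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```

The primary-key constraint does the deduplication, so the second call neither raises nor writes a second row.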

2. Real-Corpus Detection Rates

The CTS YAML suite validates the policy engine against synthetic, hand-authored cases. To check behavior against real conversational phrasing, the engine was also run against the Fabricate Battle-Test Corpus — 550 real-world voice AI conversations (4,839 turns). Detection rate is the share of corpus turns where a control correctly fired.

Control	Detection Rate	Notes
IDG-01 (Identity Disclosure)	100%	Holds against real phrasing
EIT-01 (Escalation Trigger)	94.7%	Holds against real phrasing
PDX-01 (PHI Data Exchange)	58.6%	Partial — real-world phrasing diverges from synthetic cases
DBC-01 (Deceptive Behavior Claim)	2.5%	Weak — heuristic phrase list does not yet cover most real-world phrasing
ATR-01 (supplemental)	0.0%	Corpus/adapter structural limitation — not a representative test

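These figures are self-reported and not independently audited. IDG-01 and EIT-01 generalize well beyond the synthetic test suite, and PDX-01 generalizes only partially. DBC-01 and ATR-01 are known weak points against real conversational data (the ATR-01 figure also reflects a corpus/adapter structural limitation rather than a representative test) and are active areas of work, not resolved claims.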
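For readers who want to reproduce a figure of this kind on their own labeled data, the arithmetic is a simple ratio. The sketch below assumes each turn is labeled with whether the control was expected to fire and whether it did, and that the denominator is the set of turns where it was expected to fire; that reading of "detection rate" is an assumption, and the sample data is a toy list, not the corpus.

```python
from typing import Iterable, Tuple


def detection_rate(turns: Iterable[Tuple[bool, bool]]) -> float:
    """Percent of expected-to-fire turns on which the control fired.

    turns: (expected_to_fire, fired) pairs, one per labeled turn.
    Returns 0.0 when no turn was expected to fire.
    """
    outcomes = [fired for expected, fired in turns if expected]
    if not outcomes:
        return 0.0
    return 100.0 * sum(outcomes) / len(outcomes)


# Toy labels: three turns where the control should fire, two detected.
toy_turns = [(True, True), (True, False), (False, False), (True, True)]
rate = detection_rate(toy_turns)
```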

3. Anonymized Failure Trace Example

The example below is synthetic — constructed from observed behavior patterns, with all identifying information removed. It shows what an IDG-01 (late disclosure) violation looks like in the NHID-Clinical audit trace, and what a payer auditor would see when reviewing it.

Anonymized Failure Trace — IDG-01 Violation
Source: Synthetic example based on observed behavior patterns. No real PHI, no real provider data.
Generated: 2026-06-07 | Policy version: nhid-clinical-v1.3 | Correlation ID: [REDACTED]

t=00:00.000  INGEST      POST /voice/process received
             session_id: [REDACTED]
             call_sid:   [REDACTED]
             caller_type: ai_agent

t=00:00.084  VALIDATE    SpeechResult normalized
             turn_count: 0
             content_hash: [REDACTED]

t=00:00.091  STATE       Session reconstructed
             turn_count: 0
             disclosure_timestamp: null
             disclosure_confirmed: false

t=00:00.098  POLICY      IDG-01 evaluated
             rule: "Disclose AI identity before any data exchange"
             turn_count: 0
             disclosure_confirmed: false
             trigger: FIRST_TURN_NO_DISCLOSURE
             action: DISCLOSE_IDENTITY
── Violation recorded ──────────────────────────────────────────────
t=00:00.103  VIOLATION   IDG-01
             severity: critical
             message: "AI identity not disclosed at call start"
             action_taken: DISCLOSE_IDENTITY (forced)
             data_exchanged_before_disclosure: false
             recoverable: true
────────────────────────────────────────────────────────────────────
t=00:00.109  EXEC        TwiML rendered — forced disclosure statement
             text: "This call is being handled by an automated system on behalf of [Provider Name Redacted]."
             disclosure_forced: true

t=00:00.114  PERSIST     Event written
             disclosure_timestamp: 00:00.109
             boundary_violations: ["IDG-01"]
             partial_failure: true
             deterministic_hash: [REDACTED]

── What this means ─────────────────────────────────────────────────
The AI agent did not disclose its automated nature at call start.
The policy engine detected a turn_count=0 exchange with no prior
disclosure and forced a disclosure statement before any data could
be shared. The violation is logged as critical but recoverable.
A payer auditing this session would see:
  - disclosure_timestamp set 109ms into the call (forced, not voluntary)
  - partial_failure: true
  - boundary_violations: ["IDG-01"]
────────────────────────────────────────────────────────────────────
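An auditor-side reading of the fields in this trace can be sketched as a small function. The field names come from the trace above; combining the EXEC field `disclosure_forced` with the PERSIST fields in one record, and the function name `audit_findings`, are assumptions made for illustration.

```python
from typing import Any, Dict, List


def audit_findings(event: Dict[str, Any]) -> List[str]:
    """Summarize what a reviewer would note from one persisted event."""
    findings: List[str] = []
    if event.get("disclosure_timestamp") is None:
        findings.append("no disclosure recorded")
    if event.get("disclosure_forced"):
        findings.append("disclosure was forced by the policy engine")
    if event.get("partial_failure"):
        findings.append("partial failure recorded")
    for control in event.get("boundary_violations", []):
        findings.append(f"boundary violation: {control}")
    return findings


persisted = {
    "disclosure_timestamp": "00:00.109",
    "disclosure_forced": True,
    "boundary_violations": ["IDG-01"],
    "partial_failure": True,
}
findings = audit_findings(persisted)
```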

4. Failure & Attack Simulation Coverage

The failure injection harness covers the following scenarios:

Scenario	Expected behavior
Empty SpeechResult	Policy evaluated, event written, no 500
Null bytes in input	Sanitized before engine, sanitized text stored
Missing CallSid (session binding failure)	400 returned, no event written, structured error body
Late disclosure (IDG-01 + PDX-01)	DENY_DATA action, 2 critical violations logged
Escalation path unavailable (EIT-01)	ESCALATE_HUMAN with TwiML fallback, violation logged
Deceptive artifact (DBC-01)	LOG_ONLY, partial_failure=true, session continues
Missing audit fields (ATR-01)	Violation logged, pipeline continues, gap recorded
Bot-to-bot, undisclosed agent	DENY_DATA, stricter gate for ai_agent counterparty
Replay with external_calls_cached=false	Divergence detected, ATR-01 violation, replay flagged FAIL
Duplicate request_id (idempotency)	Identical trace returned, no duplicate event written
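Three rows of the table (empty `SpeechResult`, null bytes, missing `CallSid`) can be exercised with a toy handler like the one below. It is a framework-free sketch: `handle_request`, the in-memory `store` list, and the error body are hypothetical stand-ins for the FastAPI endpoint and event store.

```python
from typing import Any, Dict, List, Tuple


def handle_request(form: Dict[str, str],
                   store: List[Dict[str, Any]]) -> Tuple[int, Dict[str, Any]]:
    """Toy handler returning (status_code, body)."""
    if not form.get("CallSid"):
        # Session binding failure: structured error body, nothing persisted.
        return 400, {"error": "missing_call_sid"}
    # Strip null bytes before the text reaches the engine or the store.
    text = form.get("SpeechResult", "").replace("\x00", "")
    event = {"call_sid": form["CallSid"], "text": text}
    store.append(event)
    return 200, event


store: List[Dict[str, Any]] = []
missing_sid = handle_request({"SpeechResult": "hello"}, store)
null_bytes = handle_request({"CallSid": "CA1", "SpeechResult": "a\x00b"}, store)
empty_speech = handle_request({"CallSid": "CA2", "SpeechResult": ""}, store)
```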

5. Audit Readiness Model

An external auditor reconstructing a session from the event store can determine:

When the call started and when the first disclosure statement was made
Whether disclosure preceded any PHI or credential exchange
Whether opt-out or escalation was requested and how it was handled
Which policy engine version processed each event
Whether any partial failures or boundary violations were recorded
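The second question in the list, whether disclosure preceded any PHI or credential exchange, reduces to an ordering check over the session's events. The sketch below assumes events are available as (offset in seconds, kind) pairs; the kind labels `DISCLOSURE`, `PHI_EXCHANGE`, and `CREDENTIAL_EXCHANGE` are hypothetical names, not fields defined by the specification.

```python
from typing import Iterable, Optional, Tuple

EXCHANGE_KINDS = ("PHI_EXCHANGE", "CREDENTIAL_EXCHANGE")


def disclosure_preceded_exchange(
        events: Iterable[Tuple[float, str]]) -> Optional[bool]:
    """Return True/False, or None when the session had no exchange at all."""
    disclosure_at = None
    first_exchange_at = None
    for offset, kind in sorted(events):
        if kind == "DISCLOSURE" and disclosure_at is None:
            disclosure_at = offset
        elif kind in EXCHANGE_KINDS and first_exchange_at is None:
            first_exchange_at = offset
    if first_exchange_at is None:
        return None
    return disclosure_at is not None and disclosure_at < first_exchange_at


compliant = disclosure_preceded_exchange(
    [(0.109, "DISCLOSURE"), (2.5, "PHI_EXCHANGE")])
late = disclosure_preceded_exchange(
    [(1.0, "PHI_EXCHANGE"), (2.0, "DISCLOSURE")])
```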

Example correlation ID lifecycle:

correlation_id: "auth-2026-05-26-001"

t=00:00.000  INGEST     POST /voice/process received
t=00:00.123  VALIDATE   SpeechResult normalized
t=00:00.131  STATE      Session reconstructed: turn_count=0, disclosure=null
t=00:00.140  POLICY     IDG-01: DISCLOSE_IDENTITY triggered (turn_count=0)
t=00:00.145  EXEC       TwiML disclosure message rendered
t=00:00.152  PERSIST    Event written — disclosure_timestamp set

6. Architecture & Scale Notes

Current reference implementation

FastAPI + SQLite event store. Stateless policy engine. Suitable for development and self-validation. Not load-tested for production at scale.

Path to distributed event store

Replace SQLite with a Kafka- or S3-backed event log. The policy engine is stateless, so it can be scaled horizontally.

Replay preservation

Store input payload + policy version with each event. Policy version change detection prevents silent audit corruption.

7. Risk Register

Risk	Mitigation
Timestamps break exact replay	Hash computed over non-timestamp fields only
Policy engine version change between runs	Policy version embedded in every event; replay rejects mismatches
JSON key ordering variance	Canonical JSON (sorted keys) enforced before hashing
LLM re-invocation during replay	JSON Schema enforces external_calls_cached=true when replay_mode=cached
partial_failure accumulation undetected	boundary_violations[] written per event; rate trackable across sessions
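The first and third mitigations (hashing only non-timestamp fields, and canonical JSON with sorted keys) can be combined in a few lines. This is a sketch: the choice of SHA-256, the `TIMESTAMP_FIELDS` set (including the made-up `received_at`), and the function name are assumptions, and only top-level fields are filtered.

```python
import hashlib
import json
from typing import Any, Dict

# Hypothetical list of fields excluded from the hash.
TIMESTAMP_FIELDS = {"disclosure_timestamp", "received_at"}


def deterministic_hash(event: Dict[str, Any]) -> str:
    """Hash an event over canonical JSON, ignoring top-level timestamp fields."""
    stable = {k: v for k, v in event.items() if k not in TIMESTAMP_FIELDS}
    # sort_keys plus fixed separators gives one byte sequence per content.
    canonical = json.dumps(stable, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


run_a = {"boundary_violations": ["IDG-01"], "partial_failure": True,
         "disclosure_timestamp": "00:00.109"}
# Same content, different key order and a different timestamp.
run_b = {"disclosure_timestamp": "00:00.212", "partial_failure": True,
         "boundary_violations": ["IDG-01"]}
hash_a = deterministic_hash(run_a)
hash_b = deterministic_hash(run_b)
```

Two runs that differ only in timing or key order hash identically, while any change to the recorded violations changes the hash.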

8. One-Page Architecture Summary

What it is: A lightweight, stateless service that logs AI voice agent disclosure behavior. Input: call events from Twilio or equivalent. Output: tamper-evident, deterministically reproducible trace with policy decision and boundary violations.

What it is not: A caller identity verifier, a certification body, or a compliance guarantor. Adoption does not confer HIPAA or TCPA compliance.

Event flow:

[AI Voice Agent] → INGEST → VALIDATE → STATE → POLICY → EXEC → PERSIST
                                                        ↓
                                               [Event Store]
                                                        ↓
                                          [Auditor / Payer System]
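The stage order in the diagram can be sketched as a list of functions applied in sequence. The stage bodies here are placeholders (most simply pass the context through), and the names `run_pipeline`, `validate`, `policy`, and the `CONTINUE` action are illustrative assumptions rather than the reference implementation's API.

```python
from typing import Any, Callable, Dict, List, Tuple

Context = Dict[str, Any]
Stage = Callable[[Context], Context]


def passthrough(ctx: Context) -> Context:
    return ctx


def validate(ctx: Context) -> Context:
    # Normalize the speech text.
    return {**ctx, "text": ctx.get("SpeechResult", "").strip()}


def policy(ctx: Context) -> Context:
    disclosed = ctx.get("disclosure_confirmed", False)
    return {**ctx, "action": "CONTINUE" if disclosed else "DISCLOSE_IDENTITY"}


STAGES: List[Tuple[str, Stage]] = [
    ("INGEST", passthrough), ("VALIDATE", validate), ("STATE", passthrough),
    ("POLICY", policy), ("EXEC", passthrough), ("PERSIST", passthrough),
]


def run_pipeline(request: Context, stages: List[Tuple[str, Stage]]) -> Context:
    """Apply each stage in order and record the stage names visited."""
    ctx: Context = dict(request, trace=[])
    for name, stage in stages:
        ctx = stage(ctx)
        ctx["trace"].append(name)
    return ctx


result = run_pipeline({"SpeechResult": " hello "}, STAGES)
```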

Related Resources

Developers reference: API, traces, test suite
Shadow Evaluation Guide: 90-day payer observation process
Governance Simulator: deterministic control walkthrough

Procurement checklist

Run the open-source test suite, review a sample failure trace, confirm audit fields in vendor JSON logs, then decide whether to require conformance in the next RFP cycle.

For Payers guide
Operational Blueprint (PDF)

Open for feedback

Questions about implementation or adoption?

Reach out directly or join the community discussion.
