Specification (v7.1)

This document is the authoritative specification for tacet, a Bayesian timing side-channel detection system. It defines the statistical methodology, abstract types, and requirements that implementations MUST follow to be conformant.

For implementation details (algorithms, numerical procedures), see the Implementation Guide. For language-specific APIs, see the Rust API, C API, or Go API. For interpreting results, see Interpreting Results and Attacker Models.

Terminology (RFC 2119)

The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in RFC 2119.

In summary:

MUST / REQUIRED / SHALL: Absolute requirement
MUST NOT / SHALL NOT: Absolute prohibition
SHOULD / RECOMMENDED: Strong recommendation (valid reasons to deviate may exist)
SHOULD NOT: Strong discouragement (valid reasons to deviate may exist)
MAY / OPTIONAL: Truly optional

1. Overview

1.1 Problem Statement

Timing side-channel attacks exploit data-dependent execution time in cryptographic implementations. Existing detection tools have significant limitations:

T-test approaches (DudeCT) compare means, missing distributional differences such as cache effects that only affect upper quantiles
P-value misinterpretation: Statistical significance does not equal practical significance; with enough samples, negligible effects become “significant”
CI flakiness: Fixed sample sizes cause tests to pass locally but fail in CI (or vice versa) due to environmental noise
Binary output: No distinction between “no leak detected” and “couldn’t measure reliably”

1.2 Solution

tacet addresses these issues with:

Wasserstein-1 distance: A single metric capturing both uniform shifts and tail effects, measuring the minimum cost to transform one distribution into another
Adaptive Bayesian inference: Collect samples until confident, with natural early stopping
Three-way decisions: Pass / Fail / Inconclusive, distinguishing “safe” from “unmeasurable”
Interpretable output: Posterior probability (0-100%) instead of p-values
Fail-safe design: Prefer Inconclusive over confidently wrong

1.3 Design Goals

Interpretable: Output is a probability, not a t-statistic
Adaptive: Collects more samples when uncertain, stops early when confident
CI-friendly: Three-way output prevents flaky tests
Portable: Handles different timer resolutions via adaptive batching
Honest: Never silently clamps thresholds; reports what it can actually resolve
Fail-safe: CI verdicts SHOULD almost never be confidently wrong
Reproducible: Deterministic results given identical samples and configuration

2. Abstract Types and Semantics

This section defines the types that all implementations MUST provide. Types are specified using language-agnostic pseudocode.

2.1 Outcome

The primary result type returned by the oracle:

Outcome =
  | Pass {
      leak_probability: Float,        // P(m(δ) > θ_eff | Δ), always at θ_eff
      effect: EffectEstimate,
      theta_user: Float,              // User-requested threshold (ns)
      theta_eff: Float,               // Effective threshold used (ns)
      theta_floor: Float,             // Measurement floor at decision time (ns)
      decision_threshold_ns: Float,   // θ_eff at which decision was made
      samples_used: Int,
      quality: MeasurementQuality,
      diagnostics: Diagnostics
    }
  | Fail {
      leak_probability: Float,        // P(m(δ) > θ_eff | Δ)
      effect: EffectEstimate,
      theta_user: Float,
      theta_eff: Float,               // MAY exceed θ_user (leak detected above floor)
      theta_floor: Float,
      decision_threshold_ns: Float,   // θ_eff at which decision was made
      samples_used: Int,
      quality: MeasurementQuality,
      diagnostics: Diagnostics
    }
  | Inconclusive {
      reason: InconclusiveReason,
      leak_probability: Float,        // P(m(δ) > θ_eff | Δ)
      effect: EffectEstimate,
      theta_user: Float,
      theta_eff: Float,
      theta_floor: Float,
      samples_used: Int,
      quality: MeasurementQuality,
      diagnostics: Diagnostics
    }
  | Unmeasurable {
      operation_ns: Float,            // Estimated operation time
      threshold_ns: Float,            // Timer resolution
      platform: String,
      recommendation: String
    }

Semantics:

The field leak_probability MUST be computed as $P(m(\delta) > \theta_{\text{eff}} \mid \Delta)$ , where $\theta_{\text{eff}}$ is the effective threshold used for inference at decision time.

When $\theta_{\text{eff}}$ > $\theta_{\text{user}}$ , the oracle cannot support a Pass claim at $\theta_{\text{user}}$ , because effects in the range ( $\theta_{\text{user}}$ , $\theta_{\text{eff}}$ ] are not distinguishable from noise under the measured conditions.

Implementations MUST NOT substitute $\theta_{\text{user}}$ into leak_probability when $\theta_{\text{eff}}$ > $\theta_{\text{user}}$ .

Pass: MUST be returned when ALL of the following hold:
1. leak_probability < pass_threshold (default 0.05)
2. $\theta_{\text{eff}} \le \theta_{\text{user}} + \varepsilon_{\theta}$ where $\varepsilon_{\theta} = \max(\theta_{\text{tick}}, 10^{-6} \cdot \theta_{\text{user}})$
3. All verdict-blocking quality gates pass
4. $\theta_{\text{user}}$ > 0
Fail: MUST be returned when ALL of the following hold:
1. leak_probability > fail_threshold (default 0.95)
2. All verdict-blocking quality gates pass
Note: Fail MAY be returned when $\theta_{\text{eff}}$ > $\theta_{\text{user}}$ . Detecting $m(\delta)$ > $\theta_{\text{eff}}$ implies $m(\delta)$ > $\theta_{\text{user}}$ (since $\theta_{\text{eff}}$ ≥ $\theta_{\text{user}}$ by construction).
Inconclusive: MUST be returned when ANY of the following hold:
1. A verdict-blocking quality gate fails
2. Resource budgets exhausted without reaching a decision threshold
3. leak_probability < pass_threshold but $\theta_{\text{eff}} > \theta_{\text{user}} + \varepsilon_{\theta}$ (threshold elevated)
Unmeasurable: MUST be returned when the operation is too fast to measure reliably (see §4.5)

Exploratory mode (θ_user = 0): When $\theta_{\text{user}}$ = 0, Pass/Fail semantics do not apply. The oracle MUST return Inconclusive with the posterior estimates, allowing users to interpret the results themselves. This mode is useful for profiling and debugging but is not suitable for CI gating.
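The verdict rules above can be sketched as a small decision function. This is a minimal illustration, not the reference implementation: the function name `classify`, its parameters, the string return values, and the use of `None` for "no verdict yet" are assumptions made for this example. The thresholds and $\varepsilon_{\theta}$ follow the definitions in this section.

```python
def classify(leak_probability, theta_user, theta_eff, theta_tick, gates_ok,
             pass_threshold=0.05, fail_threshold=0.95):
    """Return a verdict name per the rules in section 2.1, or None if undecided."""
    # A failed verdict-blocking gate always yields Inconclusive.
    if not gates_ok:
        return "Inconclusive"
    # Exploratory mode: Pass/Fail semantics do not apply at theta_user = 0.
    if theta_user == 0.0:
        return "Inconclusive"
    # Fail propagates even when the threshold is elevated.
    if leak_probability > fail_threshold:
        return "Fail"
    eps_theta = max(theta_tick, 1e-6 * theta_user)
    if leak_probability < pass_threshold:
        if theta_eff <= theta_user + eps_theta:
            return "Pass"
        return "Inconclusive"  # threshold elevated: Pass does not propagate
    # Hypothetical sentinel: the caller keeps sampling or reports budget exhaustion.
    return None
```

For example, `classify(0.01, 100.0, 150.0, 0.5, True)` yields `"Inconclusive"` because the pass criterion is met only at an elevated threshold.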

2.2 AttackerModel

Threat model presets defining the minimum effect size worth detecting:

AttackerModel =
  | SharedHardware      // θ = 0.4 ns (~2 cycles @ 5GHz)
  | PostQuantumSentinel // θ = 2.0 ns (~10 cycles @ 5GHz)
  | AdjacentNetwork     // θ = 100 ns
  | RemoteNetwork       // θ = 50,000 ns (50 μs)
  | Custom { threshold_ns: Float }

Model	Threshold	Use Case
SharedHardware	0.4 ns	SGX enclaves, cross-VM, containers, hyperthreading
PostQuantumSentinel	2.0 ns	Post-quantum crypto (ML-KEM, ML-DSA)
AdjacentNetwork	100 ns	LAN services, HTTP/2 APIs
RemoteNetwork	50 μs	Internet-exposed services

Cycle-based thresholds use a 5 GHz reference frequency. This choice is conservative because it assumes fast attacker hardware: a higher frequency yields a smaller θ, which makes the test more sensitive.
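The cycle-to-nanosecond arithmetic behind the presets is a single division. The helper name `cycles_to_ns` is hypothetical and is shown only to make the conversion explicit.

```python
def cycles_to_ns(cycles, freq_ghz=5.0):
    """Convert a cycle count to nanoseconds at a given clock frequency in GHz."""
    # One cycle at f GHz lasts 1/f nanoseconds.
    return cycles / freq_ghz
```

With the 5 GHz reference, 2 cycles correspond to 0.4 ns and 10 cycles to 2.0 ns, matching the SharedHardware and PostQuantumSentinel presets.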

There is no single correct threshold. The choice of attacker model is a statement about your threat model.

For exploratory analysis without a decision threshold, use Custom { threshold_ns: 0.0 }.

2.3 EffectEstimate

Summary of the detected timing effect:

EffectEstimate = {
  max_effect_ns: Float,                 // Posterior mean of W₁ distance (in nanoseconds)
  credible_interval_ns: (Float, Float), // 95% CI for W₁
  tail_diagnostics: TailDiagnostics     // Decomposition of effect into shift and tail components
}

TailDiagnostics = {
  shift_ns: Float,             // Uniform shift component (median difference)
  tail_ns: Float,              // Tail-specific component (beyond shift)
  tail_share: Float,           // Fraction of effect from tail [0-1]
  tail_slow_share: Float,      // Directionality of tail (p95+): fraction that are slowdowns [0-1]
  quantile_shifts: QuantileShifts,  // Per-quantile differences for interpretation
  pattern_label: EffectPattern // Classification of effect pattern
}

QuantileShifts = {
  p50_ns: Float,  // Median difference (50th percentile shift)
  p90_ns: Float,  // 90th percentile shift
  p95_ns: Float,  // 95th percentile shift
  p99_ns: Float   // 99th percentile shift
}

EffectPattern =
  | TailEffect      // Leak concentrated in upper quantiles (tail_share > 0.6)
  | UniformShift    // Leak affects all quantiles equally (tail_share < 0.3)
  | Mixed           // Combination of shift and tail (0.3 ≤ tail_share ≤ 0.6)
  | Negligible      // No significant effect detected

Interpreting W₁ distance:

The W₁ (Wasserstein-1) distance measures the minimum cost to transform one distribution into another. The tail_diagnostics field decomposes this single scalar metric into interpretable components, helping users understand whether the leak is:

A uniform shift (all quantiles affected equally, e.g., constant-time overhead difference)
A tail effect (upper quantiles affected more, e.g., cache misses)
A mixed pattern (combination of both)

The decomposition works by comparing the W₁ distance to the median difference:

If W₁ ≈ median difference, the effect is uniform (constant shift)
If W₁ >> median difference, the effect is concentrated in the tail (e.g., cache misses)

The tail_slow_share field indicates the directionality of tail deviations (p95+): values near 1.0 indicate tail deviations are predominantly slowdowns, values near 0.0 indicate speedups, and values near 0.5 indicate balanced directionality. This metric operates on quantile-aligned differences and measures what fraction of tail deviation magnitude comes from positive (slowdown) differences.

The quantile_shifts provide per-quantile differences for understanding the effect distribution, helping identify at which percentiles the leak manifests.
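The classification thresholds for EffectPattern can be sketched as follows. The `pattern_label` function uses the tail_share cut-offs stated above. The `decompose` function is an assumption: this section does not give the exact decomposition formula, so the sketch takes the shift as the absolute median difference and the tail as whatever W₁ remains beyond it. The criterion for Negligible is also not specified here, so it is passed in as a flag.

```python
def decompose(w1_ns, median_diff_ns):
    """Split W1 into shift and tail parts (assumed formula, for illustration)."""
    shift = abs(median_diff_ns)
    tail = max(0.0, w1_ns - shift)
    tail_share = tail / w1_ns if w1_ns > 0 else 0.0
    return shift, tail, tail_share

def pattern_label(tail_share, negligible=False):
    """Map tail_share to an EffectPattern name using the documented cut-offs."""
    if negligible:
        return "Negligible"
    if tail_share > 0.6:
        return "TailEffect"
    if tail_share < 0.3:
        return "UniformShift"
    return "Mixed"
```

Under these assumptions, W₁ = 10 ns with a 1 ns median difference gives tail_share = 0.9 and the label TailEffect.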

2.4 MeasurementQuality

Assessment of measurement precision:

MeasurementQuality =
  | Excellent  // MDE < 5 ns
  | Good       // MDE 5–20 ns
  | Poor       // MDE 20–100 ns
  | TooNoisy   // MDE > 100 ns

2.5 InconclusiveReason

InconclusiveReason =
  | DataTooNoisy { message: String, guidance: String }
  | NotLearning { message: String, guidance: String }
  | WouldTakeTooLong { estimated_time_secs: Float, samples_needed: Int, guidance: String }
  | ThresholdElevated {
      theta_user: Float,                    // What user requested
      theta_eff: Float,                     // What we measured at
      leak_probability_at_eff: Float,       // P(m(δ) > θ_eff | Δ)
      meets_pass_criterion_at_eff: Bool,    // True if P < pass_threshold at θ_eff
      achievable_at_max: Bool,              // Could θ_user be reached with max budget?
      message: String,
      guidance: String
    }
  | TimeBudgetExceeded { current_probability: Float, samples_collected: Int }
  | SampleBudgetExceeded { current_probability: Float, samples_collected: Int }
  | ConditionsChanged { drift: ConditionDrift }

The meets_pass_criterion_at_eff field indicates whether $P(m(\delta) > \theta_{\text{eff}} \mid \Delta)$ < pass_threshold. This allows CI systems to implement policies like “treat pass-criterion-met-at-floor as acceptable” without changing inference semantics.

The achievable_at_max field distinguishes:

true: $\theta_{\text{floor}}$ > $\theta_{\text{user}}$ now, but $\theta_{\text{floor}}(n_{\max}) \le \theta_{\text{user}}$ (more sampling may help)
false: $\theta_{\text{floor}}(n_{\max}) > \theta_{\text{user}}$ (cannot reach $\theta_{\text{user}}$ on this platform/configuration)

2.6 Diagnostics

Diagnostics = {
  // Dependence and effective samples
  dependence_length: Int,           // b̂ from Politis-White (bootstrap only)
  effective_sample_size: Int,       // n / τ̂
  iact_combined: Float,             // max(τ̂_F, τ̂_R)

  // Stationarity
  stationarity_ratio: Float,
  stationarity_ok: Bool,

  // Outlier handling
  outlier_rate_baseline: Float,
  outlier_rate_sample: Float,
  outlier_asymmetry_ok: Bool,

  // Timer and mode
  discrete_mode: Bool,
  timer_resolution_ns: Float,
  duplicate_fraction: Float,

  // Run information
  preflight_ok: Bool,
  calibration_samples: Int,
  total_time_secs: Float,
  seed: Option<Int>,
  threshold_ns: Float,
  timer_name: String,
  platform: String,

  // Gibbs sampler
  gibbs_iters_total: Int,
  gibbs_burnin: Int,
  gibbs_retained: Int,
  lambda_mean: Float,               // Posterior mean of λ (prior precision)
  lambda_mixing_ok: Bool,

  // Robust likelihood
  likelihood_inflated: Bool,        // True if κ_mean < 0.3

  // Warnings
  warnings: List<String>,
  quality_issues: List<QualityIssue>
}

QualityIssue = {
  code: IssueCode,
  message: String,
  guidance: String
}

IssueCode =
  | DependenceHigh       // High autocorrelation, reduced effective samples
  | PrecisionLow         // Limited measurement precision
  | DiscreteMode         // Coarse timer resolution
  | ThresholdIssue       // Cannot achieve requested threshold
  | FilteringApplied     // Outliers were capped
  | StationarityIssue    // Conditions may have changed
  | NumericalIssue       // Gibbs sampler convergence concern
  | LikelihoodInflated   // Uncertainty inflated for robustness

3. Statistical Methodology

This section describes the mathematical foundation of tacet. All formulas in this section are normative; implementations MUST produce equivalent results.

3.1 Test Statistic: Wasserstein-1 Distance

We collect timing samples from two classes:

Fixed class (F): A specific input (e.g., all zeros)
Random class (R): Randomly sampled inputs

We measure the Wasserstein-1 (W₁) distance between the two timing distributions. The W₁ distance measures the minimum cost to transform one distribution into another, where cost is the amount of probability mass times the distance it must be moved.

The test statistic is a 1D scalar (in nanoseconds):

\Delta = W_1(\hat{F}_{\text{Fixed}}, \hat{F}_{\text{Random}}) \in \mathbb{R}

where $\hat{F}$ denotes the empirical cumulative distribution function.

Debiased W₁ distance:

For human-readable output, implementations MAY compute a debiased estimator:

W_1^{\text{deb}} = \max\left(0, W_1(\text{Fixed}, \text{Random}) - \theta_{\text{floor}}\right)

where $\theta_{\text{floor}}$ is the measurement noise floor (§3.3.3). The debiased estimator provides a display-friendly interpretation of effect magnitude relative to measurement noise. Inference (§3.4) uses the raw W₁ distance.

Why W₁ distance?

Timing leaks manifest in different ways:

Uniform shift: A different code path adds constant overhead → entire distribution shifts
Tail effect: Cache misses occur probabilistically → upper quantiles shift more

The W₁ distance captures both patterns in a single scalar metric:

For uniform shifts, W₁ ≈ shift magnitude (median difference)
For tail effects, W₁ emphasizes the tail differences
Mixed patterns are captured proportionally

W₁ properties:

Single metric: Captures distributional differences in nanoseconds
Optimal transport: In 1D, equals the mean absolute difference between matched quantiles of the two distributions
Natural zero: W₁ between identical distributions equals zero
1D inference: Enables fast Bayesian updating with simple conjugacy

W₁ computation: Implementations MUST compute W₁ using the sorted-samples method for discrete distributions. See the Implementation Guide for the algorithm.
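For two samples of equal size, the sorted-samples method reduces to pairing order statistics. The sketch below assumes equal sample counts per class; the function name `wasserstein_1` is illustrative, and the normative algorithm is the one in the Implementation Guide.

```python
def wasserstein_1(xs, ys):
    """W1 between two equal-size samples via sorted (quantile-matched) pairs."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("expects two non-empty samples of equal size")
    # The i-th smallest value of one sample is transported to the
    # i-th smallest value of the other; W1 is the mean transport distance.
    pairs = zip(sorted(xs), sorted(ys))
    return sum(abs(a - b) for a, b in pairs) / len(xs)
```

A uniform 3 ns shift gives W₁ = 3 ns. A single 100 ns outlier among 10 samples gives W₁ = 10 ns, since only one tenth of the probability mass moves.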

3.2 Two-Phase Architecture

The system operates in two phases:

┌─────────────────────────────────────────────────────────────────┐
│                           Architecture                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐       │
│  │  Calibration │───▶│   Adaptive   │───▶│   Decision   │       │
│  │    Phase     │    │     Loop     │    │    Output    │       │
│  └──────────────┘    └──────────────┘    └──────────────┘       │
│         │                   │                   │                │
│         ▼                   ▼                   ▼                │
│   • Estimate Σ_rate    • Collect batch    • Pass (P<5%)         │
│   • Compute θ_floor    • Update Δ         • Fail (P>95%)        │
│   • Set prior scale    • Scale Σ by 1/n   • Inconclusive        │
│   • Warmup caches      • Update θ_floor                         │
│   • Pre-flight checks  • Compute P(>θ)                          │
│                        • Check quality                          │
│                        • Check stopping                          │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Phase 1: Calibration (runs once)

Collect initial samples to characterize measurement noise
Estimate covariance structure via stream-based block bootstrap
Compute the “covariance rate” $\Sigma_{\text{rate}}$, which scales as $\Sigma = \Sigma_{\text{rate}} / n$ (a scalar for the 1D W₁ statistic; written $\text{var}_{\text{rate}}$ in §3.3.2)
Compute initial measurement floor $\theta_{\text{floor}}$ and floor-rate constant c_floor
Compute effective threshold $\theta_{\text{eff}}$ and calibrate prior scale $\sigma$
Run pre-flight checks (timer sanity, harness sanity, stationarity)

Phase 2: Adaptive Loop (iterates until decision)

Collect batches of samples
Update quantile estimates from all data collected so far
Scale covariance: $\Sigma_n = \Sigma_{\text{rate}}/n$
Update $\theta_{\text{floor}}(n)$ using floor-rate constant
Run Gibbs sampler to approximate posterior and compute $P(\text{effect} > \theta_{\text{eff}})$
Check quality gates (posterior ≈ prior → Inconclusive)
Check decision boundaries (P > 95% → Fail, P < 5% → Pass)
Check time/sample budgets

Why this structure?

The key insight is that covariance scales as 1/n for quantile estimators. By estimating the covariance rate once during calibration, we can cheaply update the posterior as more data arrives; no re-bootstrapping needed. This makes adaptive sampling computationally tractable.

3.3 Calibration Phase

The calibration phase runs once at startup to characterize measurement noise.

Sample collection:

Implementations SHOULD collect n_cal samples per class (default: 5,000). This is enough to estimate covariance structure reliably while keeping calibration fast.

Fragile regime:

A fragile regime is a measurement condition where standard statistical assumptions may not hold, requiring more conservative estimation. The fragile regime is triggered when either of the following holds:

Discrete timer mode: The timer has coarse resolution (see §3.6)
Low uniqueness: The minimum uniqueness ratio across classes is below 10%: $\min\left(\frac{|\text{unique}(F)|}{n_F}, \frac{|\text{unique}(R)|}{n_R}\right) < 0.10$

When a fragile regime is detected, implementations apply more conservative procedures (larger block lengths, regularized covariance) to maintain calibration.
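The low-uniqueness trigger can be checked directly from the calibration samples. This is a minimal sketch; the names `uniqueness_ratio` and `is_fragile` are hypothetical, and `discrete_mode` is assumed to be determined elsewhere (§3.6).

```python
def uniqueness_ratio(samples):
    """Fraction of distinct values among the samples."""
    return len(set(samples)) / len(samples)

def is_fragile(fixed, random_cls, discrete_mode, min_ratio=0.10):
    """True if either fragile-regime trigger holds."""
    low_uniqueness = min(uniqueness_ratio(fixed),
                         uniqueness_ratio(random_cls)) < min_ratio
    return discrete_mode or low_uniqueness
```

A class with only 5 distinct values among 100 samples has a uniqueness ratio of 0.05 and triggers the fragile regime even when the timer is not in discrete mode.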

3.3.1 Acquisition Stream Model

Measurement produces an interleaved acquisition stream indexed by time:

\{(c_t, y_t)\}_{t=1}^{T}, \quad c_t \in \{F, R\}, \; T \approx 2n

where $y_t$ is the measured runtime (or ticks) at acquisition index $t$ , and F/R denote Fixed and Random classes.

Per-class samples are obtained by filtering the stream:

F := \{y_t : c_t = F\}, \quad R := \{y_t : c_t = R\}

Critical principle: The acquisition stream is the data-generating process. All bootstrap and dependence estimation MUST preserve adjacency in acquisition order, not per-class position.

3.3.2 Variance Estimation via Stream-Based Block Bootstrap

Timing measurements exhibit autocorrelation: nearby samples are more similar than distant ones due to cache state, frequency scaling, etc. Standard bootstrap assumes i.i.d. samples, underestimating variance. Implementations MUST use block bootstrap on the acquisition stream to preserve the true dependence structure.

Block length selection: Implementations SHOULD use the Politis-White algorithm to select the optimal block length, with class-conditional ACF to avoid underestimation from interleaved sampling. See the Implementation Guide for the algorithm.
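The following sketch illustrates resampling blocks from the acquisition stream rather than from each class separately. It is an illustration under stated assumptions: it uses a circular block bootstrap (one common variant, chosen here for simplicity), takes the block length as given rather than running Politis-White, and computes W₁ by integrating the absolute difference of the empirical CDFs so that unequal class counts in a replicate are handled. All function names are hypothetical.

```python
import random

def w1_cdf(xs, ys):
    """Exact 1D W1 as the integral of |F_x - F_y|; sample sizes may differ."""
    xs, ys = sorted(xs), sorted(ys)
    points = sorted(set(xs) | set(ys))
    i = j = 0
    total = 0.0
    for k in range(len(points) - 1):
        while i < len(xs) and xs[i] <= points[k]:
            i += 1
        while j < len(ys) and ys[j] <= points[k]:
            j += 1
        total += abs(i / len(xs) - j / len(ys)) * (points[k + 1] - points[k])
    return total

def stream_bootstrap_variance(stream, block_len, n_boot=200, seed=0):
    """Bootstrap variance of W1 from an interleaved stream of (class, value) pairs."""
    rng = random.Random(seed)
    T = len(stream)
    blocks_needed = -(-T // block_len)  # ceiling division
    reps = []
    for _ in range(n_boot):
        resampled = []
        for _ in range(blocks_needed):
            start = rng.randrange(T)
            # Copy a contiguous run in acquisition order, wrapping at the end.
            resampled.extend(stream[(start + k) % T] for k in range(block_len))
        resampled = resampled[:T]
        fixed = [y for c, y in resampled if c == "F"]
        rand = [y for c, y in resampled if c == "R"]
        if fixed and rand:  # skip degenerate replicates with an empty class
            reps.append(w1_cdf(fixed, rand))
    mean = sum(reps) / len(reps)
    return sum((v - mean) ** 2 for v in reps) / (len(reps) - 1)
```

Because whole runs of the stream are copied, each replicate keeps neighbouring Fixed and Random measurements together, which is the adjacency property required in §3.3.1.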

Variance rate (scalar):

The variance of the W₁ estimator scales as 1/n. The variance rate is a scalar for the 1D W₁ distance:

\text{var}_{\text{rate}} = \hat{\sigma}^2_{\text{cal}} \cdot n_{\text{cal}}

where $\hat{\sigma}^2_{\text{cal}}$ is the bootstrap variance estimate from calibration.

Define the calibration long-run variance proxy, which is the same quantity as $\text{var}_{\text{rate}}$ above:

V_{\text{cal}} := \hat{\sigma}^2_{\text{cal}} \cdot n_{\text{cal}}

This is a scalar (not a matrix) because W₁ is a 1D statistic.

Effective sample size via IACT:

Under strong temporal dependence, n samples do not provide n independent observations. Implementations SHOULD compute integrated autocorrelation time (IACT) for diagnostic purposes:

n_{\text{eff}} := \frac{n}{\hat{\tau}}

where $\hat{\tau}$ is estimated via the Geyer Initial Monotone Sequence (IMS) method or equivalent. See the Implementation Guide for algorithms.

Implementations SHOULD report as diagnostics:

Diagnostics.dependence_length = $\hat{b}$ (block length from bootstrap)
Diagnostics.iact_combined = $\hat{\tau}$ (IACT estimate)
Diagnostics.effective_sample_size = $n_{\text{eff}}$ (for diagnostics only)
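A compact sketch of an IACT estimate in the style of Geyer's initial monotone sequence is given below. It sums adjacent autocorrelation pairs until a pair becomes non-positive and forces the pair sums to be non-increasing. The function name and the `max_lag` parameter are assumptions for illustration; the Implementation Guide remains the reference for the actual algorithm.

```python
def iact_geyer(xs, max_lag=None):
    """Integrated autocorrelation time via an initial-monotone-sequence estimate."""
    n = len(xs)
    mean = sum(xs) / n
    centered = [x - mean for x in xs]
    var = sum(v * v for v in centered) / n
    if var == 0.0:
        return 1.0
    max_lag = max_lag or n - 1

    def rho(k):
        # Biased sample autocorrelation at lag k.
        return sum(centered[t] * centered[t + k] for t in range(n - k)) / (n * var)

    tau = -1.0
    previous = float("inf")
    k = 0
    while k + 1 <= max_lag:
        pair = rho(k) + rho(k + 1)
        if pair <= 0.0:
            break  # initial positive sequence ends here
        pair = min(pair, previous)  # enforce a monotone non-increasing sequence
        tau += 2.0 * pair
        previous = pair
        k += 2
    return max(1.0, tau)
```

Given per-class estimates, the diagnostics would then be `iact_combined = max(tau_F, tau_R)` and `effective_sample_size = n / iact_combined`, matching the field comments in §2.6.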

Variance scaling during adaptive loop:

Since $V_{\text{cal}}$ is computed via block bootstrap (which preserves temporal dependence structure), it already represents the long-run variance rate. During inference with $n$ samples:

\text{var}_n = \frac{V_{\text{cal}}}{n}

The block bootstrap accounts for autocorrelation in the variance estimate. IACT estimates are computed and reported as diagnostics but do not affect inference.

3.3.3 Measurement Floor and Effective Threshold

A critical design element is distinguishing between what the user wants to detect ( $\theta_{\text{user}}$ ) and what the measurement can detect ( $\theta_{\text{floor}}$ ).

Floor-rate constant (from null distribution):

Implementations MUST compute a floor-rate constant once at calibration by bootstrapping the null distribution of raw W₁:

Generate null W₁ replicates via within-class splits: split baseline into two halves and compute W₁ between them; similarly for sample class
Scale each null replicate by $\sqrt{n_{\text{blocks}}}$ where $n_{\text{blocks}} = \max(1, \lfloor n_{\text{cal}} / L \rfloor)$ is the number of independent blocks at calibration
Compute $c_{\text{floor}}$ as the 95th percentile of these scaled null replicates

This SHOULD use at least 2,000 null bootstrap replicates for stable percentile estimation.

Measurement floor (dynamic):

During the adaptive loop, the statistical floor decreases as sample size grows:

\theta_{\text{floor,stat}}(n) = \max\left(\theta_{\text{tick}}, \frac{c_{\text{floor}}}{\sqrt{n_{\text{blocks}}(n)}}\right)

where $n_{\text{blocks}}(n) = \max(1, \lfloor n / L \rfloor)$ is the number of independent blocks for sample size $n$ , and $L$ is the block length.

The tick floor is fixed once batching is determined:

\theta_{\text{tick}} = \frac{\text{1 tick (ns)}}{K}

where $K$ is the batch size. The combined floor:

\theta_{\text{floor}}(n) = \theta_{\text{floor,stat}}(n)

(The tick floor is already incorporated in the $\max$ within $\theta_{\text{floor,stat}}$ .)

Effective threshold ( $\theta_{\text{eff}}$ ):

\theta_{\text{eff}} = \max(\theta_{\text{user}}, \theta_{\text{floor}})
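The floor and effective-threshold formulas translate directly into code. This is a sketch with hypothetical function and parameter names; `tick_ns` is the duration of one timer tick and `batch_size` is $K$.

```python
import math

def theta_floor(n, c_floor, block_len, tick_ns, batch_size):
    """Measurement floor at sample size n: max of tick floor and statistical floor."""
    theta_tick = tick_ns / batch_size
    n_blocks = max(1, n // block_len)
    return max(theta_tick, c_floor / math.sqrt(n_blocks))

def theta_eff(theta_user, floor):
    """Effective threshold: never below what the measurement can resolve."""
    return max(theta_user, floor)
```

As a worked example with assumed values c_floor = 200, L = 10, and a 1 ns tick with K = 1: at n = 5,000 there are 500 blocks and the floor is about 8.94 ns, so a 2 ns request is elevated. At n = 100,000 there are 10,000 blocks, the floor drops to 2.0 ns, and the effective threshold equals the requested one.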

Threshold elevation decision rule:

When $\theta_{\text{eff}}$ > $\theta_{\text{user}}$ :

Fail propagates: Detecting $m(\delta)$ > $\theta_{\text{eff}}$ implies $m(\delta)$ > $\theta_{\text{user}}$ . Implementations MAY return Fail.
Pass does not propagate: “No detectable effect above $\theta_{\text{eff}}$ ” is compatible with effects in ( $\theta_{\text{user}}$ , $\theta_{\text{eff}}$ ]. Implementations MUST NOT return Pass when $\theta_{\text{eff}}$ > $\theta_{\text{user}}$ + $\varepsilon_{\theta}$ .

Dynamic floor updates:

During the adaptive loop, $\theta_{\text{floor}}(n)$ decreases as n grows. Implementations MUST:

Recompute $\theta_{\text{eff}} = \max(\theta_{\text{user}}, \theta_{\text{floor}}(n))$ after each batch
If $\theta_{\text{floor}}(n)$ drops to $\theta_{\text{user}}$ or below, Pass becomes possible (subject to posterior)
Report the $\theta_{\text{eff}}$ used for the final decision

3.3.4 Prior Scale Calibration (Half-t Prior)

The prior on the 1D W₁ effect $\delta$ MUST be a half-t distribution (Student’s t restricted to positive values):

\delta \sim \text{half-}t_\nu(0, \sigma^2)

where $\nu$ is a fixed degrees-of-freedom parameter and $\sigma$ is a scale parameter calibrated to an exceedance target.

Because W₁ distances are non-negative, the prior places no mass on negative values. It is implemented via a scale mixture of Gaussians (see §3.4.3).

Degrees of freedom ( $\nu$ ):

Implementations MUST use $\nu := 4$ .

Calibrating $\sigma$ via exceedance target:

The scale $\sigma$ MUST be chosen so that the prior exceedance probability equals a fixed target $\pi_0$ (default 0.62):

P\left(\delta > \theta_{\text{user}} \;\middle|\; \delta \sim \text{half-}t_\nu(0, \sigma^2)\right) = \pi_0

This MUST be solved by deterministic 1D root-finding using Monte Carlo integration. See the Implementation Guide for the algorithm.

The prior is calibrated against $\theta_{\text{user}}$ (the user’s threat-model threshold). The measurement floor $\theta_{\text{floor}}$ is handled separately in the decision logic (§3.3.3).

Design properties: The half-t with ν = 4 provides heavy tails while maintaining finite variance. The calibrated scale σ balances informativeness and flexibility: the prior is weakly informative (allowing data to dominate when signal is strong) without being vacuous.
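The exceedance equation can be cross-checked without Monte Carlo, because Student's t with 4 degrees of freedom has a closed-form CDF. The sketch below substitutes that closed form and a bisection search for the Monte Carlo procedure this section requires, so it is a verification aid and not the normative method. Function names are hypothetical, and $\theta_{\text{user}}$ > 0 is assumed.

```python
import math

def t4_cdf(t):
    """Closed-form CDF of Student's t with 4 degrees of freedom."""
    u = t * t / (1.0 + t * t / 4.0)
    return 0.5 + 0.375 * (t / math.sqrt(1.0 + t * t / 4.0)) * (1.0 - u / 12.0)

def half_t4_exceedance(theta, sigma):
    """P(delta > theta) for delta ~ half-t_4(0, sigma^2)."""
    return 2.0 * (1.0 - t4_cdf(theta / sigma))

def calibrate_sigma(theta_user, pi0=0.62):
    """Find sigma with P(delta > theta_user) = pi0 by bisection on a log scale."""
    lo, hi = theta_user * 1e-6, theta_user * 1e6
    for _ in range(200):
        mid = math.sqrt(lo * hi)
        # Exceedance grows with sigma, so a low value means sigma is too small.
        if half_t4_exceedance(theta_user, mid) < pi0:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)
```

The solution scales linearly with the threshold: doubling $\theta_{\text{user}}$ doubles the calibrated $\sigma$, since the exceedance depends only on the ratio $\theta_{\text{user}} / \sigma$.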

3.3.5 Deterministic Seeding Policy

To ensure reproducible results, all random number generation MUST be deterministic by default.

Normative requirement:

Given identical timing samples and configuration, the oracle MUST produce identical results (up to floating-point roundoff).

Seeding policy:

The bootstrap RNG seed MUST be deterministically derived from:
- A fixed library constant seed (default: 0x74696D696E67, “timing” in ASCII)
- A stable hash of configuration parameters
All Monte Carlo RNG seeds (leak probability, floor constant, prior scale) MUST be similarly deterministic
The Gibbs sampler RNG seed MUST be deterministic
The chosen seeds SHOULD be reported in diagnostics
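One way to derive seeds deterministically is sketched below. Only the library constant comes from this section. The choice of SHA-256, the JSON canonicalization, the XOR combination, and the `purpose` tag that separates the bootstrap, Monte Carlo, and Gibbs streams are all assumptions for illustration.

```python
import hashlib
import json

LIBRARY_SEED = 0x74696D696E67  # "timing" in ASCII, as given in this section

def derive_seed(config, purpose):
    """Derive a reproducible 64-bit seed from configuration and a purpose tag."""
    # sort_keys makes the hash independent of dictionary insertion order.
    payload = json.dumps(config, sort_keys=True).encode() + b"|" + purpose.encode()
    digest = hashlib.sha256(payload).digest()
    return LIBRARY_SEED ^ int.from_bytes(digest[:8], "big")
```

Identical configuration yields identical seeds across runs and platforms, while distinct purposes (for example `"bootstrap"` and `"gibbs"`) receive different seeds.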

3.4 Bayesian Model

The prior is a half-t distribution over the 1D W₁ distance, implemented via Gibbs sampling on the scale-mixture representation.

3.4.1 Latent Parameter

The latent parameter is the true W₁ distance between timing distributions:

\delta \in \mathbb{R}^+ \quad \text{(true W₁ distance in nanoseconds)}

This is constrained to be non-negative (W₁ distances are always ≥ 0).

3.4.2 Likelihood (Robust t-likelihood via scale mixture)

The observed W₁ statistic $\Delta$ may deviate from the Gaussian approximation when $\text{var}_n$ is underestimated. To prevent pathological posterior certainty under variance misestimation, implementations MUST use a robust likelihood with a scalar precision factor $\kappa$ .

\Delta \mid \delta, \kappa \sim \mathcal{N}\!\left(\delta, \frac{\text{var}_n}{\kappa}\right)

\kappa \sim \text{Gamma}\!\left(\frac{\nu_\ell}{2}, \frac{\nu_\ell}{2}\right)

Marginally, this gives a univariate t-distribution:

\Delta \mid \delta \sim t_{\nu_\ell}(\delta, \text{var}_n)

Gamma parameterization: shape–rate. $E[\kappa] = 1$ .

Degrees of freedom for likelihood ( $\nu_{\ell}$ ):

Implementations MUST use $\nu_{\ell}$ := 4.

Robustness mechanism: When $\text{var}_n$ is underestimated, $\kappa$ is pulled downward and inflates uncertainty automatically. The parameter ν_ℓ = 4 matches the prior degrees of freedom, maintaining consistency in tail behavior between prior and likelihood.

Likelihood inflation warning:

Implementations SHOULD set Diagnostics.likelihood_inflated = true when $\kappa_{\text{mean}} < 0.3$ , indicating the likelihood variance was effectively scaled up for robustness.

3.4.3 Prior (Half-t via scale mixture)

The prior on $\delta \in \mathbb{R}^+$ is:

\delta \sim \text{half-}t_\nu(0, \sigma^2), \quad \nu = 4

Implementations MUST implement inference using the equivalent hierarchical model:

\lambda \sim \text{Gamma}\left(\frac{\nu}{2}, \frac{\nu}{2}\right)

\delta \mid \lambda \sim \text{half-}\mathcal{N}\left(0, \frac{\sigma^2}{\lambda}\right)

where Gamma uses shape–rate parameterization, and half-𝒩 denotes a half-normal distribution (normal restricted to positive values).

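Marginal prior variance:

For $\nu$ = 4, the untruncated $t_\nu(0, \sigma^2)$ distribution has variance $\sigma^2 \nu / (\nu - 2) = 2\sigma^2$, which equals the second moment $E[\delta^2]$ of the half-t prior:

V_0^{\text{marginal}} := E[\delta^2] = 2\sigma^2

This scalar is used as the prior variance reference in Gate 1 (§3.5.2), whose Gaussian surrogate for the prior is centered at zero.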

3.4.4 Posterior Inference (Deterministic Gibbs Sampling)

The posterior is approximated using a deterministic Gibbs sampler over $(\delta, \lambda, \kappa)$ .

Gibbs schedule (normative):

Parameter	Value	Description
N_gibbs	5000	Total iterations
N_burn	1000	Burn-in (discarded)
N_keep	4000	Retained samples

Initialization:

\lambda^{(0)} = 1, \quad \kappa^{(0)} = 1

Iteration order:

For t = 1, …, N_gibbs:

Sample $\delta^{(t)} \sim p(\delta \mid \lambda^{(t-1)}, \kappa^{(t-1)}, \Delta)$ (truncated normal, positive only)
Sample $\lambda^{(t)} \sim p(\lambda \mid \delta^{(t)})$
Sample $\kappa^{(t)} \sim p(\kappa \mid \delta^{(t)}, \Delta)$

The Gibbs conditionals and numerical implementation are detailed in the Implementation Guide.
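For orientation, the sampler can be sketched with conditionals derived directly from the model in §3.4.2 and §3.4.3. These derivations are an assumption of this example and have not been checked against the Implementation Guide: $\delta$ given the rest is a normal truncated to positive values with precision $\lambda/\sigma^2 + \kappa/\text{var}_n$, and $\lambda$ and $\kappa$ are Gamma with shape $(\nu + 1)/2$ and $(\nu_\ell + 1)/2$ respectively. The function name and the simple rejection step for the truncated normal are also illustrative choices.

```python
import random

def gibbs_posterior(delta_obs, var_n, sigma, nu=4.0, nu_l=4.0,
                    n_gibbs=5000, n_burn=1000, seed=0x74696D696E67):
    """Return retained posterior draws of delta (assumes delta_obs >= 0)."""
    rng = random.Random(seed)  # fixed seed keeps the sampler deterministic
    lam, kappa = 1.0, 1.0
    kept = []
    for t in range(n_gibbs):
        # delta | lam, kappa, Delta: normal truncated to delta > 0.
        precision = lam / sigma ** 2 + kappa / var_n
        mean = (kappa * delta_obs / var_n) / precision
        sd = precision ** -0.5
        delta = rng.gauss(mean, sd)
        while delta <= 0.0:  # rejection is cheap because mean >= 0
            delta = rng.gauss(mean, sd)
        # lam | delta: Gamma with rate nu/2 + delta^2 / (2 sigma^2).
        # random.gammavariate takes (shape, scale), so scale = 1 / rate.
        lam = rng.gammavariate((nu + 1) / 2,
                               1.0 / (nu / 2 + delta ** 2 / (2 * sigma ** 2)))
        # kappa | delta, Delta: Gamma with rate nu_l/2 + residual^2 / (2 var_n).
        kappa = rng.gammavariate((nu_l + 1) / 2,
                                 1.0 / (nu_l / 2 + (delta_obs - delta) ** 2 / (2 * var_n)))
        if t >= n_burn:
            kept.append(delta)
    return kept
```

With the default schedule the function returns 4,000 draws, and repeated calls with the same arguments return identical lists.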

Why 5000 iterations?

The 1D Gibbs sampler converges much faster than the previous 9D sampler, but we use more iterations to ensure high-quality posterior approximation with minimal Monte Carlo error. The computational cost is still 5-10× lower than the 9D approach due to simpler conditionals.

$\lambda$ mixing diagnostics:

Implementations SHOULD set Diagnostics.lambda_mixing_ok = false when lambda_cv < 0.1 or lambda_ess < 20, indicating the sampler may not have converged.

3.4.5 Decision Functional and Leak Probability

The decision functional is simply the W₁ distance itself:

m(\delta) := \delta

The leak probability is:

P(\text{leak} > \theta_{\text{eff}} \mid \Delta) = P(\delta > \theta_{\text{eff}} \mid \Delta)

Estimation via Gibbs samples:

\widehat{P}(\text{leak}) = \frac{1}{N_{\text{keep}}} \sum_{s=1}^{N_{\text{keep}}} \mathbf{1}\left[\delta^{(s)} > \theta_{\text{eff}}\right]

Posterior summaries:

Implementations MUST compute:

Posterior mean: $\delta_{\text{post}} := \frac{1}{N_{\text{keep}}} \sum_s \delta^{(s)}$
Credible interval: 95% CI for $\delta$ from empirical quantiles of $\{\delta^{(s)}\}_{s=1}^{N_{\text{keep}}}$
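The estimator and summaries above amount to a few passes over the retained draws. In this sketch the function name and the index-based quantile convention are assumptions; implementations may use a different empirical quantile rule.

```python
def summarize(samples, theta_eff):
    """Posterior mean, 95% credible interval, and leak probability from draws."""
    ordered = sorted(samples)
    n = len(ordered)
    mean = sum(ordered) / n
    # Simple empirical quantiles by index (one of several valid conventions).
    lower = ordered[int(0.025 * (n - 1))]
    upper = ordered[int(0.975 * (n - 1))]
    leak_probability = sum(1 for d in ordered if d > theta_eff) / n
    return mean, (lower, upper), leak_probability
```

The leak probability is the fraction of retained draws strictly above $\theta_{\text{eff}}$, so its resolution is 1/N_keep.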

Interpreting the probability:

This is a posterior probability, not a p-value. When we report “72% probability of a leak,” we mean: given the data and our model, 72% of the posterior mass corresponds to W₁ distances exceeding $\theta_{\text{eff}}$ .

3.5 Adaptive Sampling Loop

The core innovation: collect samples until confident, with natural early stopping.

Verdict-blocking semantics:

Pass/Fail verdicts MUST be emitted only if all measurement quality gates pass. Otherwise the oracle MUST return Inconclusive.

This policy implements the fail-safe design goal (§1.3): CI verdicts should almost never be confidently wrong.

3.5.1 Stopping Criteria

The adaptive loop terminates when any of these conditions is met:

Pass: leak_probability < pass_threshold (default 0.05) AND $\theta_{\text{eff}} \le \theta_{\text{user}} + \varepsilon_{\theta}$ (§2.1) AND all quality gates pass
Fail: leak_probability > fail_threshold (default 0.95) AND all quality gates pass
Inconclusive: Any quality gate fails OR budget exhausted without reaching decision threshold

Why adaptive sampling works for Bayesian inference:

Frequentist methods suffer from optional stopping: if you keep sampling until you get a significant result, you inflate your false positive rate.

Bayesian methods don’t have this problem. The posterior probability is valid regardless of when you stop; this is the likelihood principle. Your inference depends only on the data observed, not your sampling plan.

3.5.2 Quality Gates (Verdict-Blocking)

Quality gates detect when data is too poor to reach a confident decision. When any gate triggers, the outcome MUST be Inconclusive.

Gate 1: Insufficient Information Gain

This gate detects when the data provides insufficient information relative to the prior, either because the posterior barely moved from the prior (data too noisy) or because the posterior stopped updating (not learning).

Implementations MUST compute the KL divergence between Gaussian surrogates of the prior and posterior. For the 1D case:

$$\mathrm{KL} = \frac{1}{2}\left( \frac{V_{\text{post}}}{V_0^{\text{marginal}}} + \frac{\mu_{\text{post}}^2}{V_0^{\text{marginal}}} - 1 + \ln\frac{V_0^{\text{marginal}}}{V_{\text{post}}} \right)$$

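where $V_{\text{post}}$ is the posterior variance of $\delta$ and $\mu_{\text{post}}$ is its posterior mean ($\delta_{\text{post}}$ in §3.4.5), both estimated from Gibbs samples, and $V_0^{\text{marginal}} = 2\sigma^2$ is the marginal prior variance (§3.4.3). The surrogate prior is centered at zero, which is why the mean term is $\mu_{\text{post}}^2$ rather than a squared difference of means.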

Trigger conditions:

KL < KL_min (default 0.7 nats) → DataTooNoisy
Sum of recent inter-batch KL divergences < 0.001 for 5+ batches → NotLearning
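
The KL formula and both trigger conditions can be sketched as below. The function names are hypothetical, and reading "for 5+ batches" as "the sum over the five most recent inter-batch KL values" is an assumption made for illustration.

```python
import math


def kl_information_gain(mu_post, v_post, sigma):
    """KL(posterior || prior) for the 1D Gaussian surrogates, in nats."""
    v0 = 2.0 * sigma ** 2  # marginal prior variance for nu = 4
    return 0.5 * (v_post / v0 + mu_post ** 2 / v0 - 1.0 + math.log(v0 / v_post))


def gate1_reason(kl, recent_batch_kls, kl_min=0.7,
                 stall_sum=0.001, stall_batches=5):
    """Return the Inconclusive reason for Gate 1, or None if the gate passes."""
    if kl < kl_min:
        return "DataTooNoisy"
    # Assumed reading: sum the last `stall_batches` inter-batch divergences.
    if (len(recent_batch_kls) >= stall_batches
            and sum(recent_batch_kls[-stall_batches:]) < stall_sum):
        return "NotLearning"
    return None


# A posterior identical to the prior surrogate carries no information.
no_gain = kl_information_gain(mu_post=0.0, v_post=2.0, sigma=1.0)
# A tighter posterior shifted away from zero clears the default KL_min.
some_gain = kl_information_gain(mu_post=2.0, v_post=0.5, sigma=1.0)
```

With $\sigma = 1$, the second example evaluates to $\tfrac{1}{2}(0.25 + 2 - 1 + \ln 4) \approx 1.32$ nats.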

Gate 2: Would Take Too Long

Extrapolate time to decision based on current convergence rate. If projected time exceeds budget by a large margin (e.g., 10×), trigger Inconclusive with reason WouldTakeTooLong.

Gate 3: Resource Budget Exceeded

If elapsed time exceeds configured time budget → TimeBudgetExceeded
If total samples per class exceeds configured maximum → SampleBudgetExceeded

Gate 4: Condition Drift Detected

The variance estimate $\text{var}_{\text{rate}}$ is computed during calibration. If measurement conditions change during the adaptive loop, this estimate becomes invalid.

Detect condition drift by comparing measurement statistics from calibration (subscript cal) against those from the full test run (subscript post):

Variance ratio: $\sigma_{\text{post}}^2 / \sigma_{\text{cal}}^2$
Autocorrelation change: $|\rho_{\text{post}}(1) - \rho_{\text{cal}}(1)|$
Mean drift: $|\mu_{\text{post}} - \mu_{\text{cal}}| / \sigma_{\text{cal}}$

If variance ratio is outside [0.5, 2.0], or autocorrelation change exceeds 0.3, or mean drift exceeds 3.0, trigger Inconclusive with reason ConditionsChanged.
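
The three drift statistics and the trigger rule can be sketched as follows. The function names are hypothetical, and the choice of sample variance and of this particular lag-1 autocorrelation estimator is an assumption; the Implementation Guide may prescribe different estimators.

```python
from statistics import fmean, variance


def lag1_autocorrelation(xs):
    """Lag-1 sample autocorrelation; returns 0.0 for a constant series."""
    m = fmean(xs)
    denom = sum((x - m) ** 2 for x in xs)
    if denom == 0.0:
        return 0.0
    return sum((a - m) * (b - m) for a, b in zip(xs, xs[1:])) / denom


def conditions_changed(cal, post):
    """True if any Gate 4 drift statistic is outside its allowed range."""
    var_cal = variance(cal)
    if var_cal == 0.0:
        raise ValueError("calibration data has zero variance")
    variance_ratio = variance(post) / var_cal
    autocorr_change = abs(lag1_autocorrelation(post) - lag1_autocorrelation(cal))
    mean_drift = abs(fmean(post) - fmean(cal)) / var_cal ** 0.5
    return (not 0.5 <= variance_ratio <= 2.0
            or autocorr_change > 0.3
            or mean_drift > 3.0)


# Toy timings in nanoseconds; real inputs would be far longer series.
cal = [100.0, 102.0, 98.0, 101.0, 99.0, 100.0, 103.0, 97.0]
shifted = [x + 50.0 for x in cal]               # mean drift only
widened = [100.0 + 3.0 * (x - 100.0) for x in cal]  # variance ratio of 9
```

Here `conditions_changed(cal, shifted)` and `conditions_changed(cal, widened)` both trigger, while comparing `cal` with itself does not.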

Note on threshold elevation: The case where $\theta_{\text{floor}} > \theta_{\text{user}}$ is handled by the threshold elevation decision rule (§3.3.3), not by a quality gate.

3.6 Discrete Timer Mode

When the timer has low resolution (e.g., Apple Silicon’s 41ns cntvct_el0), quantile estimation behaves differently due to tied values.

Trigger condition:

Discrete timer mode triggers when minimum uniqueness ratio is below 10%.
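
A minimal sketch of the trigger, assuming the uniqueness ratio of a class is the number of distinct timing values divided by the number of samples, and that "minimum" means the smaller ratio across the two classes. Both the definition and the function names are assumptions for illustration.

```python
def min_uniqueness_ratio(fixed, random_):
    """Smallest distinct-values / sample-count ratio across the two classes."""
    return min(len(set(samples)) / len(samples) for samples in (fixed, random_))


def discrete_mode(fixed, random_, threshold=0.10):
    """True if the minimum uniqueness ratio is below the discrete threshold."""
    return min_uniqueness_ratio(fixed, random_) < threshold


# A 41 ns tick produces heavy ties: 300 samples, only 3 distinct values.
coarse = [41.0 * k for k in (1, 2, 3)] * 100
fine = [float(i) for i in range(300)]
```

With these inputs `discrete_mode(coarse, fine)` is true because the coarse class has a ratio of 0.01, even though the other class has no ties at all.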

Mid-distribution quantiles:

Instead of standard quantile estimators, use mid-distribution quantiles which handle ties correctly. See the Implementation Guide.

Work in ticks internally:

In discrete mode, implementations SHOULD perform computations in ticks (timer’s native unit) and convert to nanoseconds only for display.

Gaussian approximation caveat:

The Gaussian likelihood is a rougher approximation with discrete data. Implementations MUST report a quality issue about discrete timer mode.

3.7 Calibration Validation Requirements

The Bayesian approach requires empirical validation that posteriors are well-calibrated.

Null calibration test (normative requirement):

Implementations MUST provide a “fixed-vs-fixed” validation that measures end-to-end false positive rates:

FPR_gated = P(Fail | H₀, all verdict-blocking gates pass)

Acceptance criteria:

Metric	Target	Acceptable
FPR_gated	2-5%	≤ 5%
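
Because Pass and Fail are only emitted when all gates pass, FPR_gated can be estimated from a batch of fixed-vs-fixed runs as the share of Fail among decided runs. The sketch below assumes outcomes are recorded as plain strings; the function name is hypothetical.

```python
def gated_false_positive_rate(outcomes):
    """Share of Fail verdicts among runs that reached Pass or Fail.

    Returns None when every run was Inconclusive (rate undefined).
    """
    decided = [o for o in outcomes if o in ("Pass", "Fail")]
    if not decided:
        return None
    return decided.count("Fail") / len(decided)


# 100 null runs: 95 Pass, 3 Fail, 2 blocked by a quality gate.
null_runs = ["Pass"] * 95 + ["Fail"] * 3 + ["Inconclusive"] * 2
fpr_gated = gated_false_positive_rate(null_runs)
```

In this example the estimate is 3/98, about 3.1%; Inconclusive runs are excluded from the denominator.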

Large-effect detection tests:

Implementations MUST include validation tests ensuring that inference under the half-t prior correctly detects obvious leaks even when measurement noise is high.

4. Measurement Requirements

This section defines abstract requirements for measurement. For implementation details, see the Implementation Guide.

4.1 Timer Requirements

Implementations MUST use a timer that:

Is monotonic (never decreases)
Has known resolution
Reports results that can be converted to nanoseconds

Implementations SHOULD use the highest-resolution timer available on the platform.

4.2 Acquisition Stream Requirements

Measurements MUST be collected as an interleaved acquisition stream (see §3.3.1):

Fixed and Random class measurements MUST be interleaved
The interleaving order SHOULD be randomized
The full acquisition stream (with class labels) MUST be preserved for bootstrap
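
One way to satisfy these three requirements is sketched below: shuffle a balanced list of class labels, then record every measurement together with its label in acquisition order. A full shuffle is only one possible randomization scheme, and the names `build_schedule`, `acquire`, and the `measure` callback are hypothetical.

```python
import random


def build_schedule(n_per_class, seed=0x74696D696E67):
    """Balanced, randomly interleaved class labels ('F' fixed, 'R' random)."""
    rng = random.Random(seed)  # seeded for reproducibility
    schedule = ["F"] * n_per_class + ["R"] * n_per_class
    rng.shuffle(schedule)
    return schedule


def acquire(schedule, measure):
    """Run measure(label) in schedule order, keeping (label, timing) pairs."""
    return [(label, measure(label)) for label in schedule]


schedule = build_schedule(100)
# Stand-in for a real timed operation: constant timings per class.
stream = acquire(schedule, lambda label: 10.0 if label == "F" else 12.0)
```

Keeping the stream as ordered `(label, timing)` pairs, rather than two per-class arrays, is what lets a block bootstrap resample contiguous segments later.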

4.3 Input Pre-generation

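All inputs MUST be generated before the measurement loop begins. Generating inputs inside the timed region adds the cost of input generation to the measurement; because that cost differs between the Fixed and Random classes, it causes false positives.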

4.4 Outlier Handling

Implementations MUST cap (winsorize), not drop, outliers:

Compute t_cap = 99.99th percentile from pooled data
Cap samples exceeding t_cap
Winsorization happens before quantile computation

Quality thresholds: >0.1% capped → warning; >1% → acceptable; >5% → TooNoisy.
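
The capping step can be sketched as follows. The nearest-rank percentile rule and the function name are assumptions for illustration; the returned capped fraction is the quantity the quality thresholds are applied to.

```python
import math


def winsorize_pooled(fixed, random_, percentile=99.99):
    """Cap both classes at a percentile of the pooled data.

    Returns (capped_fixed, capped_random, t_cap, capped_fraction).
    """
    pooled = sorted(list(fixed) + list(random_))
    n = len(pooled)
    if n == 0:
        raise ValueError("no samples")
    # Nearest-rank percentile (assumed convention), clamped to [1, n].
    rank = min(n, max(1, math.ceil(percentile * n / 100.0)))
    t_cap = pooled[rank - 1]
    # Cap, never drop: sample counts per class are unchanged.
    capped_fixed = [min(x, t_cap) for x in fixed]
    capped_random = [min(x, t_cap) for x in random_]
    capped_fraction = sum(1 for x in pooled if x > t_cap) / n
    return capped_fixed, capped_random, t_cap, capped_fraction
```

With fewer than 10,000 pooled samples the nearest-rank cap equals the maximum, so nothing is capped; the step only takes effect at realistic sample sizes.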

4.5 Adaptive Batching

On platforms with coarse timer resolution, implementations SHOULD batch operations:

When batching is needed:

If pilot measurement shows fewer than 5 ticks per call, enable batching.

Batch size selection:

$$K = \text{clamp}\left( \left\lceil \frac{50}{\text{ticks\_per\_call}} \right\rceil, 1, 20 \right)$$

Effect scaling:

Reported effects MUST be divided by K to give per-operation estimates.
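
A short sketch of batch size selection and effect scaling, using the constants above (5-tick trigger, 50-tick target, maximum of 20). The function and parameter names are hypothetical.

```python
import math


def select_batch_size(ticks_per_call, min_ticks=5.0,
                      target_ticks=50.0, max_batch=20):
    """Return K: 1 when batching is unnecessary, else the clamped batch size."""
    if ticks_per_call <= 0.0:
        raise ValueError("ticks_per_call must be positive")
    if ticks_per_call >= min_ticks:
        return 1  # pilot shows enough ticks per call; no batching
    k = math.ceil(target_ticks / ticks_per_call)
    return max(1, min(k, max_batch))


def per_operation_effect(batch_effect_ns, k):
    """Scale an effect measured on K-operation batches back to one operation."""
    return batch_effect_ns / k
```

For example, a pilot of 4 ticks per call gives $K = \lceil 50/4 \rceil = 13$, and an effect of 130 ns measured on those batches is reported as 10 ns per operation. A pilot of 0.5 ticks per call would ask for 100 and is clamped to 20.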

4.6 Measurability

If ticks per call < 5 even with maximum batching (K=20), implementations MUST return Unmeasurable.

4.7 Pre-flight Checks

Implementations SHOULD perform pre-flight checks:

Timer sanity: Verify monotonicity and reasonable resolution
Harness sanity (fixed-vs-fixed): Detect test harness bugs
Stationarity: Detect drift during measurement

5. API Design Principles

This section provides language-agnostic guidance for API design. These are recommendations (SHOULD) unless marked otherwise.

5.1 Input Specification

Two-class pattern:

Implementations SHOULD expose the DudeCT two-class pattern:

Baseline class: Fixed input (typically all zeros)
Sample class: Variable input (typically random)

This pattern tests for data-dependent timing, not specific value comparisons.

5.2 Configuration Ergonomics

Attacker model presets as primary entry point:

The primary configuration entry point SHOULD be attacker model selection, not raw threshold values.

Sane defaults:

Default configuration SHOULD:

Use AdjacentNetwork attacker model (or equivalent)
Set time budget to 60 seconds
Set sample budget to 1,000,000
Set pass/fail thresholds to 0.05/0.95

5.3 Result Communication

Leak probability prominence:

The leak probability MUST be prominently displayed in results and human-readable output.

Threshold transparency:

When $\theta_{\text{eff}} > \theta_{\text{user}}$, implementations MUST clearly indicate this to the user.

Inconclusive guidance:

For Inconclusive outcomes, implementations MUST provide the reason and SHOULD provide actionable guidance.

6. Configuration Parameters

6.1 Required Parameters

Parameter	Type	Description
attacker_model OR threshold_ns	AttackerModel or Float	Defines the effect threshold

6.2 Optional Parameters

Parameter	Default	Description
time_budget	60 seconds	Maximum test duration
max_samples	1,000,000	Maximum samples per class
pass_threshold	0.05	P(leak) below this → Pass
fail_threshold	0.95	P(leak) above this → Fail
calibration_samples	5,000	Samples for calibration phase
batch_size	1,000	Samples per adaptive batch
bootstrap_iterations	2,000	Bootstrap iterations for covariance
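
The parameters in §6.1 and §6.2 could be grouped into a single configuration object, as in the sketch below. The class name `Config`, the field name `time_budget_s`, representing the attacker model as a string, and treating the required parameter as "exactly one of the two" are all assumptions for illustration; the language-specific API documents define the real types.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Config:
    # Required: one of these defines the effect threshold.
    attacker_model: Optional[str] = None
    threshold_ns: Optional[float] = None
    # Optional parameters with the defaults from the table above.
    time_budget_s: float = 60.0
    max_samples: int = 1_000_000
    pass_threshold: float = 0.05
    fail_threshold: float = 0.95
    calibration_samples: int = 5_000
    batch_size: int = 1_000
    bootstrap_iterations: int = 2_000

    def __post_init__(self):
        # Assumed rule: exactly one threshold source must be given.
        if (self.attacker_model is None) == (self.threshold_ns is None):
            raise ValueError("set exactly one of attacker_model or threshold_ns")
        if not 0.0 < self.pass_threshold < self.fail_threshold < 1.0:
            raise ValueError("need 0 < pass_threshold < fail_threshold < 1")


default_config = Config(attacker_model="AdjacentNetwork")
```

Validating at construction time keeps an impossible threshold pair from reaching the adaptive loop.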

Appendix A: Mathematical Notation

Symbol	Meaning
{(c_t, y_t)}	Acquisition stream: class labels and timing measurements
T	Acquisition stream length (≈ 2n)
F, R	Per-class sample sets (filtered from stream)
$\Delta$	Scalar: observed W₁ distance (in nanoseconds)
$\delta$	Scalar: true (latent) W₁ distance (in nanoseconds)
W₁	Wasserstein-1 distance between timing distributions
$\text{var}_n$	Variance of W₁ estimator at sample size n
$\text{var}_{\text{rate}}$	Variance rate (scalar)
$V_{\text{cal}}$	Calibration long-run variance proxy: $\text{var}_{\text{cal}} \cdot n_{\text{cal}}$
$\hat{\tau}$	Integrated autocorrelation time (via Geyer IMS)
n_eff	Effective sample size: $n_{\text{eff}} = n / \hat{\tau}$
n_blocks	Number of independent blocks: $\max(1, \lfloor n / L \rfloor)$ where $L$ is block length
$\nu$	Half-t degrees of freedom (fixed at 4)
$\sigma$	Half-t prior scale (calibrated via exceedance target)
$\lambda$	Latent prior precision multiplier in scale-mixture representation
$\kappa$	Latent likelihood precision multiplier (robust t-likelihood)
$\nu_{\ell}$	Likelihood degrees of freedom (fixed at 4)
$t_{\nu}(0, \sigma^2)$	Univariate Student’s t with $\nu$ df, location 0, scale $\sigma^2$
$\text{half-}t_{\nu}(0, \sigma^2)$	Half-t distribution (t restricted to positive values)
$V_0^{\text{marginal}}$	Marginal prior variance: $2\sigma^2$ for $\nu = 4$
$V_{\text{post}}$	Posterior variance of $\delta$ (estimated from Gibbs samples)
$\delta_{\text{post}}$	Posterior mean of $\delta$ (from Gibbs samples)
N_gibbs	Total Gibbs iterations (5000)
N_burn	Burn-in iterations (1000)
N_keep	Retained Gibbs samples (4000)
$\theta_{\text{user}}$	User-requested threshold
$\theta_{\text{floor}}$	Measurement floor (smallest resolvable effect)
$c_{\text{floor}}$	Floor-rate constant: $\theta_{\text{floor,stat}} = c_{\text{floor}}/\sqrt{n}$
$\theta_{\text{tick}}$	Timer resolution component of floor
$\theta_{\text{eff}}$	Effective threshold used for inference
$m(\delta)$	Decision functional: $\delta$ (the W₁ distance itself)
b̂	Block length (Politis-White, on acquisition stream)
MDE	Minimum detectable effect
n	Samples per class
B	Bootstrap iterations
KL	KL divergence: KL(posterior ∥ prior) for Gate 1
KL_min	Minimum KL threshold for conclusive verdict (default 0.7 nats)

Appendix B: Normative Constants

These constants define conformant implementations. Implementations MAY use different values only where noted.

Constant	Default	Normative	Rationale
Test statistic	W₁ distance	MUST	1D scalar, optimal transport
Prior family	Half-t	MUST	Continuous scale adaptation, non-negative
Degrees of freedom ($\nu$)	4	MUST	Heavy tails + finite variance
Gamma parameterization	shape–rate	MUST	Avoid library ambiguity
Gibbs iterations (N_gibbs)	5000	MUST	High-quality posterior approximation
Gibbs burn-in (N_burn)	1000	MUST	Conservative convergence
Gibbs retained (N_keep)	4000	MUST	Low MC variance
Gibbs initialization ($\lambda^{(0)}$, $\kappa^{(0)}$)	1	MUST	Prior means
Likelihood df ($\nu_{\ell}$)	4	MUST	Robustness to variance misestimation
Prior exceedance target ($\pi_0$)	0.62	SHOULD	Genuine uncertainty
Prior calibration MC draws	50,000	SHOULD	Stable $\sigma$ calibration
Bootstrap iterations	2,000	SHOULD	Variance estimation accuracy
Monte Carlo samples (c_floor)	50,000	SHOULD	Floor-rate constant estimation
Batch size	1,000	SHOULD	Adaptive iteration granularity
Calibration samples	5,000	SHOULD	Initial variance estimation
Pass threshold	0.05	SHOULD	95% confidence of no leak
Fail threshold	0.95	SHOULD	95% confidence of leak
KL_min (nats)	0.7	MUST	Minimum information gain for conclusive verdict
Block length cap	min(3√T, T/3)	SHOULD	Prevent degenerate blocks
Discrete threshold	10% unique	SHOULD	Trigger discrete mode
Min ticks per call	5	SHOULD	Measurability floor
Max batch size	20	SHOULD	Limit microarch artifacts
Default time budget	60 s	MAY	Maximum runtime
Default sample budget	1,000,000	MAY	Maximum samples
Default RNG seed	0x74696D696E67	SHOULD	“timing” in ASCII
Likelihood inflation threshold	0.3	SHOULD	$\kappa_{\text{mean}}$ triggering LikelihoodInflated

Appendix C: References

Statistical methodology:

Bishop, C. M. (2006). Pattern Recognition and Machine Learning, Ch. 3. Springer. (Bayesian linear regression)
Politis, D. N. & White, H. (2004). “Automatic Block-Length Selection for the Dependent Bootstrap.” Econometric Reviews 23(1):53–70.
Künsch, H. R. (1989). “The Jackknife and the Bootstrap for General Stationary Observations.” Annals of Statistics. (Block bootstrap)
Hyndman, R. J. & Fan, Y. (1996). “Sample quantiles in statistical packages.” The American Statistician 50(4):361–365.
Welford, B. P. (1962). “Note on a Method for Calculating Corrected Sums of Squares and Products.” Technometrics 4(3):419–420.
Gelman, A. et al. (2013). Bayesian Data Analysis, 3rd ed., Ch. 11-12. CRC Press. (Gibbs sampling, scale mixtures)
Lange, K. L., Little, R. J. A., & Taylor, J. M. G. (1989). “Robust Statistical Modeling Using the t Distribution.” JASA 84(408):881-896. (Student’s t for robustness)

Timing attacks:

Reparaz, O., Balasch, J., & Verbauwhede, I. (2016). “Dude, is my code constant time?” DATE. (DudeCT methodology)
Crosby, S. A., Wallach, D. S., & Riedi, R. H. (2009). “Opportunities and Limits of Remote Timing Attacks.” ACM TISSEC 12(3):17. (Exploitability thresholds)
Van Goethem, T., et al. (2020). “Timeless Timing Attacks.” USENIX Security. (HTTP/2 timing attacks)
Bernstein, D. J. et al. (2024). “KyberSlash.” (Timing vulnerability example)
Dunsche, M. et al. (2025). “SILENT: A New Lens on Statistics in Software Timing Side Channels.” arXiv:2504.19821. (Relevant hypotheses framework)

Existing tools:

dudect (C): https://github.com/oreparaz/dudect
dudect-bencher (Rust): https://github.com/rozbb/dudect-bencher

Appendix D: Changelog

v7.1 (from v7.0)

Statistical correctness refinements:

This release fixes inference semantics in the v7.0 W₁ implementation based on statistician review.

Core changes:

Inference uses raw W₁ (§3.1): Bayesian inference uses raw W₁ distance without debiasing or clamping. Debiased W₁ is computed only for display purposes to help users interpret effect magnitude above measurement noise.
Floor from null distribution (§3.3.3): Measurement floor $c_{\text{floor}}$ is calibrated from the 95th percentile of null W₁ replicates (via within-class splits). This provides correct Type I error control under the null hypothesis.
Block count terminology (Appendix A): Variable $n_{\text{blocks}} = \max(1, \lfloor n / L \rfloor)$ replaces ambiguous $n_{\text{eff}}$ in floor calculations. Effective sample size $n_{\text{eff}} = n / \hat{\tau}$ remains for IACT-based diagnostics.
Prior targets user threshold (§3.3.4): Half-t prior scale $\sigma$ is calibrated so that $P(\delta > \theta_{\text{user}}) = \pi_0$. The prior encodes the user’s security threshold.
Robust likelihood (§3.4.2): Student-t likelihood degrees of freedom changed from $\nu_{\ell} = 8$ to $\nu_{\ell} = 4$ (matching prior ν = 4) for consistency. Robustness parameter $\kappa \sim \text{Gamma}(\nu_{\ell}/2, \nu_{\ell}/2)$ guards against variance underestimation.
Tail directionality metric (§2.3): tail_slow_share measures fraction of tail deviation magnitude (p95+) from slowdowns, computed as $\sum \max(d_i - \text{shift}, 0) / \sum |d_i - \text{shift}|$ over tail indices. Operates on quantile-aligned differences, not sample identities.

Migration notes:

These changes affect statistical inference but not API surface
Results may differ slightly from v7.0 due to corrected floor calibration and prior targeting
No code changes required for users; test outcomes may be more conservative (fewer false positives)

v7.0 (from v6.0)

Migration from 9D quantile differences to 1D Wasserstein-1 distance:

This is a breaking change that fundamentally simplifies the statistical methodology while improving sensitivity and performance.

Core changes:

Test statistic (§3.1): Replace 9D quantile-difference vector with 1D W₁ (Wasserstein-1) distance
- W₁ measures the minimum cost to transform one distribution into another
- Naturally captures both uniform shifts and tail effects in a single scalar
- Debiased estimator: W₁_deb = max(0, W₁(baseline, sample) - θ_floor)
Prior (§3.4.3): Replace 9D multivariate t-distribution with 1D half-t distribution
- Half-t is appropriate for non-negative W₁ distances
- Implemented via scale mixture: λ ~ Gamma(ν/2, ν/2), δ | λ ~ half-𝒩(0, σ²/λ)
- Marginal prior variance: V₀^marginal = 2σ² (scalar, not matrix)
Variance estimation (§3.3.2): Replace 9×9 covariance matrix with scalar variance
- var_rate (scalar) instead of Σ_rate (matrix)
- Variance scaled by sample size: var_n = var_rate / n
- Block bootstrap still used to preserve temporal dependence
Bayesian inference (§3.4): Replace 9D Gibbs sampler with 1D Gibbs sampler
- Gibbs iterations increased from 256 to 5000 (1000 burn-in, 4000 retained)
- 1D sampler is much simpler but we use more iterations for low MC variance
- Decision functional simplified: m(δ) = δ (the W₁ distance itself)
- Posterior probability: P(δ > θ_eff | Δ)
Quality gate (§3.5.2): Simplified KL divergence formula for 1D case
- KL = ½(V_post/V₀ + μ_post²/V₀ - 1 + ln(V₀/V_post))
- Same threshold (KL_min = 0.7 nats) for data-too-noisy detection
EffectEstimate (§2.3): Updated to reflect W₁ as primary metric
- max_effect_ns: Posterior mean of W₁ distance
- credible_interval_ns: 95% CI for W₁
- tail_diagnostics: Decomposition into shift and tail components
  - shift_ns: Median difference (uniform shift component)
  - tail_ns: Tail-specific component (beyond shift)
  - tail_share: Fraction of effect from tail [0-1]
  - quantile_shifts: Per-quantile differences (p50, p90, p95, p99) for interpretation
  - pattern_label: TailEffect | UniformShift | Mixed | Negligible

Expected benefits:

5-10× faster inference: 1D Gibbs sampler converges much faster than 9D, despite using more iterations
Better tail detection: Optimal transport naturally emphasizes distributional differences
Simpler interpretation: Single distance in nanoseconds vs. 9 correlated quantile differences
Natural debiasing: W₁(F, F) = 0 exactly (unlike quantile differences which vary due to sampling)

Migration notes:

Existing code using tacet v6.x will need to update result handling
The top_quantiles field in EffectEstimate has been replaced with tail_diagnostics
All posterior probabilities now refer to W₁ distance, not max quantile difference
Mathematical notation updated in Appendix A to reflect 1D formulation

v6.0 (from v5.7)

Specification simplification and modularization:

Removed: Research mode as separate outcome type. When θ_user = 0, the oracle returns Inconclusive with posterior estimates. This removes ~50 lines and the ResearchStatus enum.
Removed: 2D projection for reporting (§3.4.6 in v5.7). The shift/tail decomposition and EffectPattern enum have been removed. EffectEstimate now reports max|δ|, 95% CI, and top quantiles by exceedance probability.
Removed: Exploitability classification (§2.4 in v5.7). This is now documented only in the user guide, not the statistical specification.
Simplified: κ robust likelihood diagnostics. Only likelihood_inflated: Bool is exposed; detailed κ mixing diagnostics (kappa_mean, kappa_sd, kappa_cv, kappa_ess, kappa_mixing_ok) are internal implementation details.
Consolidated: Quality gates from 6 to 4:
- Gates 1+2 → Insufficient Information Gain
- Gates 4+5 → Resource Budget Exceeded
- Gate 3 → Would Take Too Long
- Gate 6 → Conditions Changed
Consolidated: Issue codes from 30+ to 8 categories: DependenceHigh, PrecisionLow, DiscreteMode, ThresholdIssue, FilteringApplied, StationarityIssue, NumericalIssue, LikelihoodInflated.
Created: Separate Implementation Guide for algorithms (IACT, block bootstrap, numerical stability, Gibbs conditionals, quantile computation).
Created: Separate Power Module Specification for power/EM analysis.
Removed: Projection mismatch diagnostics and threshold calibration (no longer needed without 2D projection).
Removed: Dimension-aware regularization details (moved to Implementation Guide; timing uses d=9 which rarely triggers stressed regimes).

Rationale: These changes reduce specification complexity by ~40% while preserving all normative statistical requirements. The removed features were either:

Reporting conveniences (2D projection, exploitability classification) that don’t affect inference
Separate code paths for edge cases (Research mode) that can be handled by existing mechanisms
Implementation details (algorithms, numerical procedures) better suited to a separate guide