
Specification (v7.1)

This document is the authoritative specification for tacet, a Bayesian timing side-channel detection system. It defines the statistical methodology, abstract types, and requirements that implementations MUST follow to be conformant.

For implementation details (algorithms, numerical procedures), see the Implementation Guide. For language-specific APIs, see the Rust API, C API, or Go API. For interpreting results, see Interpreting Results and Attacker Models.


The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in RFC 2119.

In summary:

  • MUST / REQUIRED / SHALL: Absolute requirement
  • MUST NOT / SHALL NOT: Absolute prohibition
  • SHOULD / RECOMMENDED: Strong recommendation (valid reasons to deviate may exist)
  • SHOULD NOT: Strong discouragement (valid reasons to deviate may exist)
  • MAY / OPTIONAL: Truly optional

Timing side-channel attacks exploit data-dependent execution time in cryptographic implementations. Existing detection tools have significant limitations:

  • T-test approaches (DudeCT) compare means, missing distributional differences such as cache effects that only affect upper quantiles
  • P-value misinterpretation: Statistical significance does not equal practical significance; with enough samples, negligible effects become “significant”
  • CI flakiness: Fixed sample sizes cause tests to pass locally but fail in CI (or vice versa) due to environmental noise
  • Binary output: No distinction between “no leak detected” and “couldn’t measure reliably”

tacet addresses these issues with:

  1. Wasserstein-1 distance: A single metric capturing both uniform shifts and tail effects, measuring the minimum cost to transform one distribution into another
  2. Adaptive Bayesian inference: Collect samples until confident, with natural early stopping
  3. Three-way decisions: Pass / Fail / Inconclusive, distinguishing “safe” from “unmeasurable”
  4. Interpretable output: Posterior probability (0-100%) instead of p-values
  5. Fail-safe design: Prefer Inconclusive over confidently wrong

Design principles:

  • Interpretable: Output is a probability, not a t-statistic
  • Adaptive: Collects more samples when uncertain, stops early when confident
  • CI-friendly: Three-way output prevents flaky tests
  • Portable: Handles different timer resolutions via adaptive batching
  • Honest: Never silently clamps thresholds; reports what it can actually resolve
  • Fail-safe: CI verdicts SHOULD almost never be confidently wrong
  • Reproducible: Deterministic results given identical samples and configuration

This section defines the types that all implementations MUST provide. Types are specified using language-agnostic pseudocode.

The primary result type returned by the oracle:

Outcome =
  | Pass {
      leak_probability: Float,       // P(m(δ) > θ_eff | Δ), always at θ_eff
      effect: EffectEstimate,
      theta_user: Float,             // User-requested threshold (ns)
      theta_eff: Float,              // Effective threshold used (ns)
      theta_floor: Float,            // Measurement floor at decision time (ns)
      decision_threshold_ns: Float,  // θ_eff at which decision was made
      samples_used: Int,
      quality: MeasurementQuality,
      diagnostics: Diagnostics
    }
  | Fail {
      leak_probability: Float,       // P(m(δ) > θ_eff | Δ)
      effect: EffectEstimate,
      theta_user: Float,
      theta_eff: Float,              // MAY exceed θ_user (leak detected above floor)
      theta_floor: Float,
      decision_threshold_ns: Float,  // θ_eff at which decision was made
      samples_used: Int,
      quality: MeasurementQuality,
      diagnostics: Diagnostics
    }
  | Inconclusive {
      reason: InconclusiveReason,
      leak_probability: Float,       // P(m(δ) > θ_eff | Δ)
      effect: EffectEstimate,
      theta_user: Float,
      theta_eff: Float,
      theta_floor: Float,
      samples_used: Int,
      quality: MeasurementQuality,
      diagnostics: Diagnostics
    }
  | Unmeasurable {
      operation_ns: Float,           // Estimated operation time
      threshold_ns: Float,           // Timer resolution
      platform: String,
      recommendation: String
    }

Semantics:

The field leak_probability MUST be computed as P(m(δ) > θ_eff | Δ), where θ_eff is the effective threshold used for inference at decision time.

When θ_eff > θ_user, the oracle cannot support a Pass claim at θ_user, because effects in the range (θ_user, θ_eff] are not distinguishable from noise under the measured conditions.

Implementations MUST NOT substitute θ_user into leak_probability when θ_eff > θ_user.

  • Pass: MUST be returned when ALL of the following hold:

    1. leak_probability < pass_threshold (default 0.05)
    2. θ_eff ≤ θ_user + ε_θ, where ε_θ = max(θ_tick, 10⁻⁶ · θ_user)
    3. All verdict-blocking quality gates pass
    4. θ_user > 0
  • Fail: MUST be returned when ALL of the following hold:

    1. leak_probability > fail_threshold (default 0.95)
    2. All verdict-blocking quality gates pass

    Note: Fail MAY be returned when θ_eff > θ_user. Detecting m(δ) > θ_eff implies m(δ) > θ_user (since θ_eff ≥ θ_user by construction).

  • Inconclusive: MUST be returned when ANY of the following hold:

    1. A verdict-blocking quality gate fails
    2. Resource budgets exhausted without reaching a decision threshold
    3. leak_probability < pass_threshold but θ_eff > θ_user + ε_θ (threshold elevated)
  • Unmeasurable: MUST be returned when the operation is too fast to measure reliably (see §4.5)

Exploratory mode (θ_user = 0): When θ_user = 0, Pass/Fail semantics do not apply. The oracle MUST return Inconclusive with the posterior estimates, allowing users to interpret the results themselves. This mode is useful for profiling and debugging but is not suitable for CI gating.
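The verdict rules above can be condensed into a small decision function. This is an illustrative sketch with hypothetical names, omitting the full Outcome payload; it is not the tacet API:

```python
def verdict(leak_probability, theta_user, theta_eff, theta_tick,
            gates_ok, pass_threshold=0.05, fail_threshold=0.95):
    """Sketch of the three-way decision rule (illustrative, not the tacet API)."""
    if theta_user == 0.0:
        return "Inconclusive"  # exploratory mode: Pass/Fail semantics do not apply
    if not gates_ok:
        return "Inconclusive"  # a verdict-blocking quality gate failed
    eps = max(theta_tick, 1e-6 * theta_user)  # ε_θ tolerance
    if leak_probability > fail_threshold:
        return "Fail"          # valid even when θ_eff > θ_user (Fail propagates)
    if leak_probability < pass_threshold:
        if theta_eff <= theta_user + eps:
            return "Pass"
        return "Inconclusive"  # ThresholdElevated: Pass does not propagate
    return "Inconclusive"      # budget exhausted before reaching a boundary

print(verdict(0.01, 100.0, 100.0, 1.0, True))  # Pass
print(verdict(0.99, 100.0, 250.0, 1.0, True))  # Fail (θ_eff elevated, still valid)
print(verdict(0.01, 100.0, 250.0, 1.0, True))  # Inconclusive (threshold elevated)
```

Note how the same posterior probability (0.01) yields Pass or Inconclusive depending on whether θ_eff stayed within ε_θ of θ_user.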

Threat model presets defining the minimum effect size worth detecting:

AttackerModel =
  | SharedHardware                // θ = 0.4 ns (~2 cycles @ 5 GHz)
  | PostQuantumSentinel           // θ = 2.0 ns (~10 cycles @ 5 GHz)
  | AdjacentNetwork               // θ = 100 ns
  | RemoteNetwork                 // θ = 50,000 ns (50 μs)
  | Custom { threshold_ns: Float }

Model                 Threshold   Use Case
SharedHardware        0.4 ns      SGX enclaves, cross-VM, containers, hyperthreading
PostQuantumSentinel   2.0 ns      Post-quantum crypto (ML-KEM, ML-DSA)
AdjacentNetwork       100 ns      LAN services, HTTP/2 APIs
RemoteNetwork         50 μs       Internet-exposed services

Cycle-based thresholds use a conservative 5 GHz reference frequency (assumes fast attacker hardware—smaller θ = more sensitive = safer).

There is no single correct threshold. The choice of attacker model is a statement about your threat model.

For exploratory analysis without a decision threshold, use Custom { threshold_ns: 0.0 }.

Summary of the detected timing effect:

EffectEstimate = {
  max_effect_ns: Float,                  // Posterior mean of W₁ distance (in nanoseconds)
  credible_interval_ns: (Float, Float),  // 95% CI for W₁
  tail_diagnostics: TailDiagnostics      // Decomposition of effect into shift and tail components
}

TailDiagnostics = {
  shift_ns: Float,                  // Uniform shift component (median difference)
  tail_ns: Float,                   // Tail-specific component (beyond shift)
  tail_share: Float,                // Fraction of effect from tail [0-1]
  tail_slow_share: Float,           // Directionality of tail (p95+): fraction that are slowdowns [0-1]
  quantile_shifts: QuantileShifts,  // Per-quantile differences for interpretation
  pattern_label: EffectPattern      // Classification of effect pattern
}

QuantileShifts = {
  p50_ns: Float,   // Median difference (50th percentile shift)
  p90_ns: Float,   // 90th percentile shift
  p95_ns: Float,   // 95th percentile shift
  p99_ns: Float    // 99th percentile shift
}

EffectPattern =
  | TailEffect     // Leak concentrated in upper quantiles (tail_share > 0.6)
  | UniformShift   // Leak affects all quantiles equally (tail_share < 0.3)
  | Mixed          // Combination of shift and tail (0.3 ≤ tail_share ≤ 0.6)
  | Negligible     // No significant effect detected

Interpreting W₁ distance:

The W₁ (Wasserstein-1) distance measures the minimum cost to transform one distribution into another. The tail_diagnostics field decomposes this single scalar metric into interpretable components, helping users understand whether the leak is:

  • A uniform shift (all quantiles affected equally, e.g., constant-time overhead difference)
  • A tail effect (upper quantiles affected more, e.g., cache misses)
  • A mixed pattern (combination of both)

The decomposition works by comparing the W₁ distance to the median difference:

  • If W₁ ≈ median difference, the effect is uniform (constant shift)
  • If W₁ >> median difference, the effect is concentrated in the tail (e.g., cache misses)

The tail_slow_share field indicates the directionality of tail deviations (p95+): values near 1.0 indicate tail deviations are predominantly slowdowns, values near 0.0 indicate speedups, and values near 0.5 indicate balanced directionality. This metric operates on quantile-aligned differences and measures what fraction of tail deviation magnitude comes from positive (slowdown) differences.

The quantile_shifts provide per-quantile differences for understanding the effect distribution, helping identify at which percentiles the leak manifests.
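As an illustration of the decomposition described above, here is a minimal sketch. It assumes equal per-class sample counts and uses simplified classification rules; the normative computation lives in the Implementation Guide:

```python
import numpy as np

def tail_decomposition(fixed, random):
    """Illustrative shift/tail decomposition of the W₁ distance.

    With equal sample counts, W₁ between empirical distributions is the mean
    absolute difference of sorted samples. The shift component is the median
    difference; the remainder is attributed to the tail.
    """
    f, r = np.sort(fixed), np.sort(random)
    w1 = float(np.mean(np.abs(f - r)))            # W₁ via sorted samples
    shift = abs(float(np.median(r) - np.median(f)))  # uniform-shift component
    tail = max(0.0, w1 - shift)                   # tail-specific remainder
    tail_share = tail / w1 if w1 > 0 else 0.0
    if tail_share > 0.6:
        pattern = "TailEffect"
    elif tail_share < 0.3:
        pattern = "UniformShift"
    else:
        pattern = "Mixed"
    return w1, shift, tail, tail_share, pattern

rng = np.random.default_rng(0)
base = rng.normal(1000.0, 5.0, 4000)
# 10% of samples pay a 50 ns cache-miss-style penalty → tail effect
leaky = base + 50.0 * (rng.random(4000) < 0.10)
w1, shift, tail, share, pattern = tail_decomposition(base, leaky)
print(pattern)
```

The median barely moves (only ~10% of the mass shifts), so almost all of W₁ is attributed to the tail and the pattern classifies as TailEffect.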

Assessment of measurement precision:

MeasurementQuality =
  | Excellent   // MDE < 5 ns
  | Good        // MDE 5–20 ns
  | Poor        // MDE 20–100 ns
  | TooNoisy    // MDE > 100 ns

InconclusiveReason =
  | DataTooNoisy { message: String, guidance: String }
  | NotLearning { message: String, guidance: String }
  | WouldTakeTooLong { estimated_time_secs: Float, samples_needed: Int, guidance: String }
  | ThresholdElevated {
      theta_user: Float,                 // What user requested
      theta_eff: Float,                  // What we measured at
      leak_probability_at_eff: Float,    // P(m(δ) > θ_eff | Δ)
      meets_pass_criterion_at_eff: Bool, // True if P < pass_threshold at θ_eff
      achievable_at_max: Bool,           // Could θ_user be reached with max budget?
      message: String,
      guidance: String
    }
  | TimeBudgetExceeded { current_probability: Float, samples_collected: Int }
  | SampleBudgetExceeded { current_probability: Float, samples_collected: Int }
  | ConditionsChanged { drift: ConditionDrift }

The meets_pass_criterion_at_eff field indicates whether P(m(δ) > θ_eff | Δ) < pass_threshold. This allows CI systems to implement policies like “treat pass-criterion-met-at-floor as acceptable” without changing inference semantics.

The achievable_at_max field distinguishes:

  • true: θ_floor > θ_user now, but θ_floor(n_max) ≤ θ_user (more sampling may help)
  • false: θ_floor(n_max) > θ_user (cannot reach θ_user on this platform/configuration)

Diagnostics = {
  // Dependence and effective samples
  dependence_length: Int,      // b̂ from Politis-White (bootstrap only)
  effective_sample_size: Int,  // n / τ̂
  iact_combined: Float,        // max(τ̂_F, τ̂_R)
  // Stationarity
  stationarity_ratio: Float,
  stationarity_ok: Bool,
  // Outlier handling
  outlier_rate_baseline: Float,
  outlier_rate_sample: Float,
  outlier_asymmetry_ok: Bool,
  // Timer and mode
  discrete_mode: Bool,
  timer_resolution_ns: Float,
  duplicate_fraction: Float,
  // Run information
  preflight_ok: Bool,
  calibration_samples: Int,
  total_time_secs: Float,
  seed: Option<Int>,
  threshold_ns: Float,
  timer_name: String,
  platform: String,
  // Gibbs sampler
  gibbs_iters_total: Int,
  gibbs_burnin: Int,
  gibbs_retained: Int,
  lambda_mean: Float,          // Posterior mean of λ (prior precision)
  lambda_mixing_ok: Bool,
  // Robust likelihood
  likelihood_inflated: Bool,   // True if κ_mean < 0.3
  // Warnings
  warnings: List<String>,
  quality_issues: List<QualityIssue>
}

QualityIssue = {
  code: IssueCode,
  message: String,
  guidance: String
}

IssueCode =
  | DependenceHigh      // High autocorrelation, reduced effective samples
  | PrecisionLow        // Limited measurement precision
  | DiscreteMode        // Coarse timer resolution
  | ThresholdIssue      // Cannot achieve requested threshold
  | FilteringApplied    // Outliers were capped
  | StationarityIssue   // Conditions may have changed
  | NumericalIssue      // Gibbs sampler convergence concern
  | LikelihoodInflated  // Uncertainty inflated for robustness

This section describes the mathematical foundation of tacet. All formulas in this section are normative; implementations MUST produce equivalent results.

3.1 Test Statistic: Wasserstein-1 Distance


We collect timing samples from two classes:

  • Fixed class (F): A specific input (e.g., all zeros)
  • Random class (R): Randomly sampled inputs

We measure the Wasserstein-1 (W₁) distance between the two timing distributions. The W₁ distance measures the minimum cost to transform one distribution into another, where cost is the amount of probability mass times the distance it must be moved.

The test statistic is a 1D scalar (in nanoseconds):

Δ = W₁(F̂_Fixed, F̂_Random) ∈ ℝ

where F̂ denotes the empirical cumulative distribution function.

Debiased W₁ distance:

For human-readable output, implementations MAY compute a debiased estimator:

W₁^deb = max(0, W₁(Fixed, Random) − θ_floor)

where θ_floor is the measurement noise floor (§3.3.3). The debiased estimator provides a display-friendly interpretation of effect magnitude relative to measurement noise. Inference (§3.4) uses the raw W₁ distance.

Why W₁ distance?

Timing leaks manifest in different ways:

  • Uniform shift: A different code path adds constant overhead → entire distribution shifts
  • Tail effect: Cache misses occur probabilistically → upper quantiles shift more

The W₁ distance captures both patterns in a single scalar metric:

  • For uniform shifts, W₁ ≈ shift magnitude (median difference)
  • For tail effects, W₁ emphasizes the tail differences
  • Mixed patterns are captured proportionally

W₁ properties:

  • Single metric: Captures distributional differences in nanoseconds
  • Optimal transport: Weights differences by their quantile position
  • Natural zero: W₁ between identical distributions equals zero
  • 1D inference: Enables fast Bayesian updating with simple conjugacy

W₁ computation: Implementations MUST compute W₁ using the sorted-samples method for discrete distributions. See the Implementation Guide for the algorithm.
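For equal per-class sample counts, the sorted-samples method reduces to the mean absolute difference of order statistics. A minimal sketch (the general unequal-count case requires evaluating both quantile functions on a common grid; see the Implementation Guide):

```python
import numpy as np

def w1_sorted(a, b):
    """W₁ distance between two empirical distributions via sorted samples.

    Assumes equal sample counts, where W₁ is exactly the mean absolute
    difference of order statistics.
    """
    a, b = np.sort(np.asarray(a, float)), np.sort(np.asarray(b, float))
    assert len(a) == len(b), "sketch assumes equal sample counts"
    return float(np.mean(np.abs(a - b)))

# A uniform 3 ns shift yields W₁ equal to the shift magnitude
rng = np.random.default_rng(42)
x = rng.normal(100.0, 2.0, 10_000)
print(round(w1_sorted(x, x + 3.0), 3))  # prints 3.0
```

For a pure shift every order statistic moves by the same amount, so W₁ recovers the shift; for a tail effect only the upper order statistics move, and W₁ averages those differences over all quantiles.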

The system operates in two phases:

┌─────────────────────────────────────────────────────────────────┐
│ Architecture │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Calibration │───▶│ Adaptive │───▶│ Decision │ │
│ │ Phase │ │ Loop │ │ Output │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ • Estimate Σ_rate • Collect batch • Pass (P<5%) │
│ • Compute θ_floor • Update Δ • Fail (P>95%) │
│ • Set prior scale • Scale Σ by 1/n • Inconclusive │
│ • Warmup caches • Update θ_floor │
│ • Pre-flight checks • Compute P(>θ) │
│ • Check quality │
│ • Check stopping │
│ │
└─────────────────────────────────────────────────────────────────┘

Phase 1: Calibration (runs once)

  • Collect initial samples to characterize measurement noise
  • Estimate covariance structure via stream-based block bootstrap
  • Compute “covariance rate” Σ_rate that scales as Σ = Σ_rate / n
  • Compute initial measurement floor θ_floor and floor-rate constant c_floor
  • Compute effective threshold θ_eff and calibrate prior scale σ
  • Run pre-flight checks (timer sanity, harness sanity, stationarity)

Phase 2: Adaptive Loop (iterates until decision)

  • Collect batches of samples
  • Update quantile estimates from all data collected so far
  • Scale covariance: Σ_n = Σ_rate / n
  • Update θ_floor(n) using floor-rate constant
  • Run Gibbs sampler to approximate posterior and compute P(effect > θ_eff)
  • Check quality gates (posterior ≈ prior → Inconclusive)
  • Check decision boundaries (P > 95% → Fail, P < 5% → Pass)
  • Check time/sample budgets

Why this structure?

The key insight is that covariance scales as 1/n for quantile estimators. By estimating the covariance rate once during calibration, we can cheaply update the posterior as more data arrives; no re-bootstrapping needed. This makes adaptive sampling computationally tractable.

The calibration phase runs once at startup to characterize measurement noise.

Sample collection:

Implementations SHOULD collect n_cal samples per class (default: 5,000). This is enough to estimate covariance structure reliably while keeping calibration fast.

Fragile regime:

A fragile regime is a measurement condition where standard statistical assumptions may not hold, requiring more conservative estimation. The fragile regime is triggered when either:

  • Discrete timer mode: The timer has coarse resolution (see §3.6)
  • Low uniqueness: The minimum uniqueness ratio across classes is below 10%: min(|unique(F)| / n_F, |unique(R)| / n_R) < 0.10

When a fragile regime is detected, implementations apply more conservative procedures (larger block lengths, regularized covariance) to maintain calibration.

Measurement produces an interleaved acquisition stream indexed by time:

{(c_t, y_t)}_{t=1}^{T},  c_t ∈ {F, R},  T ≈ 2n

where y_t is the measured runtime (or ticks) at acquisition index t, and F/R denote Fixed and Random classes.

Per-class samples are obtained by filtering the stream:

F := {y_t : c_t = F},  R := {y_t : c_t = R}

Critical principle: The acquisition stream is the data-generating process. All bootstrap and dependence estimation MUST preserve adjacency in acquisition order, not per-class position.

3.3.2 Variance Estimation via Stream-Based Block Bootstrap


Timing measurements exhibit autocorrelation: nearby samples are more similar than distant ones due to cache state, frequency scaling, etc. Standard bootstrap assumes i.i.d. samples, underestimating variance. Implementations MUST use block bootstrap on the acquisition stream to preserve the true dependence structure.

Block length selection: Implementations SHOULD use the Politis-White algorithm to select the optimal block length, with class-conditional ACF to avoid underestimation from interleaved sampling. See the Implementation Guide for the algorithm.

Variance rate (scalar):

The variance of the W₁ estimator scales as 1/n. The variance rate is a scalar for the 1D W₁ distance:

var_rate = σ̂²_cal · n_cal

where σ̂²_cal is the bootstrap variance estimate from calibration.

Define the calibration long-run variance proxy:

V_cal := var_cal · n_cal

This is a scalar (not a matrix) because W₁ is a 1D statistic.
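The stream-based estimate can be sketched with a circular block bootstrap over the interleaved acquisition stream. This is an illustrative simplification (fixed block length, quantile-grid W₁), not the normative Politis-White procedure:

```python
import numpy as np

def w1_quantile(a, b, m=256):
    """Approximate W₁ via quantile functions on a common grid (handles the
    unequal per-class counts that arise in stream-resampled replicates)."""
    q = (np.arange(m) + 0.5) / m
    return float(np.mean(np.abs(np.quantile(a, q) - np.quantile(b, q))))

def variance_rate(classes, values, block_len, n_boot=200, seed=0):
    """Sketch of a stream-based circular block bootstrap.

    Resamples contiguous blocks of the interleaved (class, value) stream,
    preserving acquisition-order adjacency, recomputes W₁ per replicate,
    and returns var_rate = bootstrap variance × n_cal (per class).
    """
    rng = np.random.default_rng(seed)
    T = len(values)
    n_blocks = int(np.ceil(T / block_len))
    reps = []
    for _ in range(n_boot):
        starts = rng.integers(0, T, n_blocks)
        idx = (starts[:, None] + np.arange(block_len)) % T  # circular blocks
        idx = idx.ravel()[:T]
        c, y = classes[idx], values[idx]
        reps.append(w1_quantile(y[c == 0], y[c == 1]))
    n_cal = int(min(np.sum(classes == 0), np.sum(classes == 1)))
    return float(np.var(reps) * n_cal)

# Interleaved F/R stream (class 0 = Fixed, class 1 = Random), T = 4000
rng = np.random.default_rng(1)
classes = np.tile([0, 1], 2000)
values = rng.normal(500.0, 10.0, 4000)
rate = variance_rate(classes, values, block_len=20)
print(rate > 0.0)
```

Resampling the stream rather than each class separately is what preserves the dependence structure that makes the i.i.d. bootstrap underestimate variance.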

Effective sample size via IACT:

Under strong temporal dependence, n samples do not provide n independent observations. Implementations SHOULD compute integrated autocorrelation time (IACT) for diagnostic purposes:

n_eff := n / τ̂

where τ̂ is estimated via the Geyer Initial Monotone Sequence (IMS) method or equivalent. See the Implementation Guide for algorithms.

Implementations SHOULD report as diagnostics:

  • Diagnostics.dependence_length = b̂ (block length from bootstrap)
  • Diagnostics.iact_combined = τ̂ (IACT estimate)
  • Diagnostics.effective_sample_size = n_eff (for diagnostics only)

Variance scaling during adaptive loop:

Since V_cal is computed via block bootstrap (which preserves temporal dependence structure), it already represents the long-run variance rate. During inference with n samples:

var_n = V_cal / n

The block bootstrap accounts for autocorrelation in the variance estimate. IACT estimates are computed and reported as diagnostics but do not affect inference.

3.3.3 Measurement Floor and Effective Threshold


A critical design element is distinguishing between what the user wants to detect (θ_user) and what the measurement can detect (θ_floor).

Floor-rate constant (from null distribution):

Implementations MUST compute a floor-rate constant once at calibration by bootstrapping the null distribution of raw W₁:

  1. Generate null W₁ replicates via within-class splits: split baseline into two halves and compute W₁ between them; similarly for sample class
  2. Scale each null replicate by √n_blocks, where n_blocks = max(1, ⌊n_cal / L⌋) is the number of independent blocks at calibration
  3. Compute c_floor as the 95th percentile of these scaled null replicates

This SHOULD use at least 2,000 null bootstrap replicates for stable percentile estimation.

Measurement floor (dynamic):

During the adaptive loop, the statistical floor decreases as sample size grows:

θ_floor,stat(n) = max(θ_tick, c_floor / √n_blocks(n))

where n_blocks(n) = max(1, ⌊n / L⌋) is the number of independent blocks for sample size n, and L is the block length.

The tick floor is fixed once batching is determined:

θ_tick = (1 tick in ns) / K

where K is the batch size. The combined floor:

θ_floor(n) = θ_floor,stat(n)

(The tick floor is already incorporated in the max within θ_floor,stat.)
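The dynamic floor is cheap to evaluate after each batch; a sketch with hypothetical numbers (c_floor = 40 ns, L = 50, θ_tick = 0.2 ns):

```python
import math

def theta_floor(n, c_floor, block_len, theta_tick):
    """Sketch of the dynamic measurement floor θ_floor(n): the statistical
    floor shrinks as 1/√n_blocks but never drops below the tick floor."""
    n_blocks = max(1, n // block_len)
    return max(theta_tick, c_floor / math.sqrt(n_blocks))

# The floor decreases with n and eventually saturates at θ_tick
print(theta_floor(5_000, c_floor=40.0, block_len=50, theta_tick=0.2))    # 4.0
print(theta_floor(500_000, c_floor=40.0, block_len=50, theta_tick=0.2))  # 0.4
print(theta_floor(50_000_000, c_floor=40.0, block_len=50, theta_tick=0.2))  # 0.2 (tick floor)

# The effective threshold follows: with θ_user = 2 ns, the early floor elevates it
theta_user = 2.0
print(max(theta_user, theta_floor(5_000, 40.0, 50, 0.2)))  # 4.0 (elevated)
```

At n = 5,000 the floor (4.0 ns) exceeds θ_user (2.0 ns), so θ_eff is elevated and Pass is blocked; by n = 500,000 the floor has dropped below θ_user and Pass becomes possible again.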

Effective threshold (θ_eff):

θ_eff = max(θ_user, θ_floor)

Threshold elevation decision rule:

When θ_eff > θ_user:

  1. Fail propagates: Detecting m(δ) > θ_eff implies m(δ) > θ_user. Implementations MAY return Fail.

  2. Pass does not propagate: “No detectable effect above θ_eff” is compatible with effects in (θ_user, θ_eff]. Implementations MUST NOT return Pass when θ_eff > θ_user + ε_θ.

Dynamic floor updates:

During the adaptive loop, θ_floor(n) decreases as n grows. Implementations MUST:

  1. Recompute θ_eff = max(θ_user, θ_floor(n)) after each batch
  2. If θ_floor(n) drops to θ_user or below, Pass becomes possible (subject to posterior)
  3. Report the θ_eff used for the final decision

3.3.4 Prior Scale Calibration (Half-t Prior)


The prior on the 1D W₁ effect δ MUST be a half-t distribution (Student’s t restricted to positive values):

δ ~ half-t_ν(0, σ²)

where ν is a fixed degrees-of-freedom parameter and σ is a scale parameter calibrated to an exceedance target.

Because W₁ distances are non-negative, the prior places all its mass on δ ≥ 0. It is implemented via a scale mixture of Gaussians (see §3.4.3).

Degrees of freedom (ν):

Implementations MUST use ν := 4.

Calibrating σ via exceedance target:

The scale σ MUST be chosen so that the prior exceedance probability equals a fixed target π₀ (default 0.62):

P(δ > θ_user | δ ~ half-t_ν(0, σ²)) = π₀

This MUST be solved by deterministic 1D root-finding using Monte Carlo integration. See the Implementation Guide for the algorithm.

The prior is calibrated against θ_user (the user’s threat-model threshold). The measurement floor θ_floor is handled separately in the decision logic (§3.3.3).

Design properties: The half-t with ν = 4 provides heavy tails while maintaining finite variance. The calibrated scale σ balances informativeness and flexibility: the prior is weakly informative (allowing data to dominate when signal is strong) without being vacuous.
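The exceedance calibration can be sketched as fixed Monte Carlo draws plus deterministic bisection, exploiting that the exceedance probability is monotone increasing in σ. Illustrative only; the normative root-finding procedure is in the Implementation Guide:

```python
import numpy as np

def calibrate_sigma(theta_user, pi0=0.62, nu=4, n_mc=200_000, seed=7):
    """Sketch: choose σ so that P(δ > θ_user) = π₀ for δ ~ half-t_ν(0, σ²).

    A half-t(ν, σ) variable is σ·|t_ν|, so the exceedance probability is the
    fraction of fixed |t_ν| draws exceeding θ_user/σ -- monotone in σ, hence
    solvable by deterministic bisection.
    """
    rng = np.random.default_rng(seed)
    t = np.abs(rng.standard_t(nu, n_mc))  # fixed half-t(ν, scale=1) draws

    def exceed(sigma):
        return float(np.mean(sigma * t > theta_user))

    lo, hi = 1e-9, 1e9
    for _ in range(200):                  # deterministic bisection
        mid = np.sqrt(lo * hi)            # geometric midpoint (scale parameter)
        if exceed(mid) < pi0:
            lo = mid                      # σ too small → exceedance too low
        else:
            hi = mid
    return float(np.sqrt(lo * hi))

sigma = calibrate_sigma(theta_user=100.0)
# Check: exceedance at the calibrated σ is close to the target π₀ = 0.62
rng = np.random.default_rng(7)
t = np.abs(rng.standard_t(4, 200_000))
print(abs(float(np.mean(sigma * t > 100.0)) - 0.62) < 0.01)
```

Because the Monte Carlo draws and the bisection schedule are both fixed, the calibrated σ is fully deterministic, as the seeding policy below requires.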

To ensure reproducible results, all random number generation MUST be deterministic by default.

Normative requirement:

Given identical timing samples and configuration, the oracle MUST produce identical results (up to floating-point roundoff).

Seeding policy:

  • The bootstrap RNG seed MUST be deterministically derived from:
    • A fixed library constant seed (default: 0x74696D696E67, “timing” in ASCII)
    • A stable hash of configuration parameters
  • All Monte Carlo RNG seeds (leak probability, floor constant, prior scale) MUST be similarly deterministic
  • The Gibbs sampler RNG seed MUST be deterministic
  • The chosen seeds SHOULD be reported in diagnostics
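A sketch of one possible derivation scheme satisfying these requirements. The hash construction here is illustrative; the spec mandates determinism, not this exact scheme:

```python
import hashlib
import struct

LIB_SEED = 0x74696D696E67  # "timing" in ASCII, the fixed library constant

def derive_seed(component: str, config: dict) -> int:
    """Illustrative deterministic seed derivation: fixed library constant,
    a per-component tag, and a stable hash of configuration parameters."""
    h = hashlib.sha256()
    h.update(struct.pack("<Q", LIB_SEED))
    h.update(component.encode())            # e.g. "bootstrap", "gibbs"
    for key in sorted(config):              # sorted → stable iteration order
        h.update(f"{key}={config[key]!r};".encode())
    return int.from_bytes(h.digest()[:8], "little")

cfg = {"threshold_ns": 100.0, "pass_threshold": 0.05, "fail_threshold": 0.95}
# Identical configuration → identical seeds; different component → different seed
print(derive_seed("bootstrap", cfg) == derive_seed("bootstrap", dict(cfg)))
print(derive_seed("bootstrap", cfg) != derive_seed("gibbs", cfg))
```

Sorting the configuration keys is the load-bearing detail: without a stable iteration order the hash, and therefore the seed, would vary between runs.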

The prior is a half-t distribution over the 1D W₁ distance, implemented via Gibbs sampling on the scale-mixture representation.

The latent parameter is the true W₁ distance between timing distributions:

δ ∈ ℝ⁺   (true W₁ distance in nanoseconds)

This is constrained to be non-negative (W₁ distances are always ≥ 0).

3.4.2 Likelihood (Robust t-likelihood via scale mixture)


The observed W₁ statistic Δ may deviate from the Gaussian approximation when var_n is underestimated. To prevent pathological posterior certainty under variance misestimation, implementations MUST use a robust likelihood with a scalar precision factor κ.

Δ | δ, κ ~ N(δ, var_n / κ)
κ ~ Gamma(ν_ℓ / 2, ν_ℓ / 2)

Marginally, this gives a univariate t-distribution:

Δ | δ ~ t_{ν_ℓ}(δ, var_n)

Gamma parameterization: shape–rate. E[κ] = 1.

Degrees of freedom for likelihood (ν_ℓ):

Implementations MUST use ν_ℓ := 4.

Robustness mechanism: When var_n is underestimated, κ is pulled downward and inflates uncertainty automatically. The parameter ν_ℓ = 4 matches the prior degrees of freedom, maintaining consistency in tail behavior between prior and likelihood.

Likelihood inflation warning:

Implementations SHOULD set Diagnostics.likelihood_inflated = true when κ_mean < 0.3, indicating the likelihood variance was effectively scaled up for robustness.

The prior on δ ∈ ℝ⁺ is:

δ ~ half-t_ν(0, σ²),  ν = 4

Implementations MUST implement inference using the equivalent hierarchical model:

λ ~ Gamma(ν / 2, ν / 2)
δ | λ ~ half-N(0, σ² / λ)

where Gamma uses shape–rate parameterization, and half-𝒩 denotes a half-normal distribution (normal restricted to positive values).

Marginal prior variance:

For ν = 4:

V₀^marginal := Var(δ) = 2σ²

This scalar variance is used as the prior variance reference in Gate 1 (§3.5.2).

3.4.4 Posterior Inference (Deterministic Gibbs Sampling)


The posterior is approximated using a deterministic Gibbs sampler over (δ, λ, κ).

Gibbs schedule (normative):

Parameter   Value   Description
N_gibbs     5000    Total iterations
N_burn      1000    Burn-in (discarded)
N_keep      4000    Retained samples

Initialization:

λ^(0) = 1,  κ^(0) = 1

Iteration order:

For t = 1, …, N_gibbs:

  1. Sample δ^(t) ~ p(δ | λ^(t−1), κ^(t−1), Δ) (truncated normal, positive only)
  2. Sample λ^(t) ~ p(λ | δ^(t))
  3. Sample κ^(t) ~ p(κ | δ^(t), Δ)

The Gibbs conditionals and numerical implementation are detailed in the Implementation Guide.
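For concreteness, here is a sketch of the sampler, with conditionals derived from the scale-mixture model in §3.4.2–§3.4.3. This is illustrative (e.g., it uses naive rejection for the truncated normal); consult the Implementation Guide for the normative conditionals and numerics:

```python
import numpy as np

def gibbs(delta_obs, var_n, sigma, nu=4, nu_l=4, n_iter=5000, n_burn=1000, seed=0):
    """Illustrative Gibbs sampler over (δ, λ, κ).

    Conditionals derived from the hierarchical model:
      δ | λ, κ, Δ : normal truncated to δ > 0
      λ | δ       ~ Gamma(ν/2 + 1/2,  ν/2 + δ²/(2σ²))
      κ | δ, Δ    ~ Gamma(ν_ℓ/2 + 1/2, ν_ℓ/2 + (Δ − δ)²/(2·var_n))
    """
    rng = np.random.default_rng(seed)
    lam, kap = 1.0, 1.0          # λ^(0) = κ^(0) = 1 (normative initialization)
    keep = []
    for t in range(n_iter):
        # 1. δ: combine likelihood precision κ/var_n with prior precision λ/σ²
        prec = kap / var_n + lam / sigma ** 2
        mean, sd = (kap * delta_obs / var_n) / prec, prec ** -0.5
        d = -1.0
        while d <= 0.0:          # naive rejection; adequate when mean is well above 0
            d = rng.normal(mean, sd)
        # 2. λ given δ (shape-rate Gamma; numpy takes shape, scale = 1/rate)
        lam = rng.gamma(nu / 2 + 0.5, 1.0 / (nu / 2 + d * d / (2 * sigma ** 2)))
        # 3. κ given δ and Δ
        kap = rng.gamma(nu_l / 2 + 0.5,
                        1.0 / (nu_l / 2 + (delta_obs - d) ** 2 / (2 * var_n)))
        if t >= n_burn:
            keep.append(d)
    return np.array(keep)

# Observed Δ = 10 ns with var_n = 1 ns² and prior scale σ = 5 ns
draws = gibbs(delta_obs=10.0, var_n=1.0, sigma=5.0)
leak_prob = float(np.mean(draws > 5.0))   # P̂(δ > θ_eff) for θ_eff = 5 ns
print(len(draws), leak_prob > 0.95)
```

With a 10σ-clear signal the retained draws concentrate near Δ and the leak probability at θ_eff = 5 ns approaches 1, which is exactly the estimator defined in §3.4.5.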

Why 5000 iterations?

The 1D Gibbs sampler converges much faster than the previous 9D sampler, but we use more iterations to ensure high-quality posterior approximation with minimal Monte Carlo error. The computational cost is still 5-10× lower than the 9D approach due to simpler conditionals.

λ mixing diagnostics:

Implementations SHOULD set Diagnostics.lambda_mixing_ok = false when lambda_cv < 0.1 or lambda_ess < 20, indicating the sampler may not have converged.

3.4.5 Decision Functional and Leak Probability


The decision functional is simply the W₁ distance itself:

m(δ) := δ

The leak probability is:

P(leak > θ_eff | Δ) = P(δ > θ_eff | Δ)

Estimation via Gibbs samples:

P̂(leak) = (1 / N_keep) · Σ_{s=1}^{N_keep} 1[δ^(s) > θ_eff]

Posterior summaries:

Implementations MUST compute:

  • Posterior mean: δ_post := (1 / N_keep) · Σ_s δ^(s)
  • Credible interval: 95% CI for δ from empirical quantiles of {δ^(s)}, s = 1, …, N_keep

Interpreting the probability:

This is a posterior probability, not a p-value. When we report “72% probability of a leak,” we mean: given the data and our model, 72% of the posterior mass corresponds to W₁ distances exceeding θ_eff.

The core innovation: collect samples until confident, with natural early stopping.

Verdict-blocking semantics:

Pass/Fail verdicts MUST be emitted only if all measurement quality gates pass. Otherwise the oracle MUST return Inconclusive.

This policy enforces the fail-safe design principle: CI verdicts should almost never be confidently wrong.

The adaptive loop terminates when any of these conditions is met:

  1. Pass: leak_probability < pass_threshold (default 0.05) AND all quality gates pass
  2. Fail: leak_probability > fail_threshold (default 0.95) AND all quality gates pass
  3. Inconclusive: Any quality gate fails OR budget exhausted without reaching decision threshold

Why adaptive sampling works for Bayesian inference:

Frequentist methods suffer from optional stopping: if you keep sampling until you get a significant result, you inflate your false positive rate.

Bayesian methods don’t have this problem. The posterior probability is valid regardless of when you stop; this is the likelihood principle. Your inference depends only on the data observed, not your sampling plan.

Quality gates detect when data is too poor to reach a confident decision. When any gate triggers, the outcome MUST be Inconclusive.

Gate 1: Insufficient Information Gain

This gate detects when the data provides insufficient information relative to the prior, either because the posterior barely moved from the prior (data too noisy) or because the posterior stopped updating (not learning).

Implementations MUST compute the KL divergence between Gaussian surrogates of the prior and posterior. For the 1D case:

\mathrm{KL} = \frac{1}{2}\left( \frac{V_{\text{post}}}{V_0^{\text{marginal}}} + \frac{\mu_{\text{post}}^2}{V_0^{\text{marginal}}} - 1 + \ln\frac{V_0^{\text{marginal}}}{V_{\text{post}}} \right)

where V_{\text{post}} is the posterior variance of \delta (estimated from Gibbs samples) and V_0^{\text{marginal}} = 2\sigma^2 is the marginal prior variance (§3.4.3).

Trigger conditions:

  • KL < KL_min (default 0.7 nats) → DataTooNoisy
  • Sum of recent inter-batch KL divergences < 0.001 for 5+ batches → NotLearning
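The 1D KL computation and its DataTooNoisy trigger can be sketched as follows (non-normative Python; the posterior moments are assumed to come from the Gibbs draws and \sigma from prior calibration):

```python
import math

def kl_gate(mu_post, v_post, sigma, kl_min=0.7):
    """Gate 1 (non-normative sketch): KL between Gaussian surrogates of the
    posterior N(mu_post, V_post) and the prior N(0, V0), where
    V0 = 2 * sigma^2 is the marginal half-t prior variance for nu = 4."""
    v0 = 2.0 * sigma ** 2
    kl = 0.5 * (v_post / v0 + mu_post ** 2 / v0 - 1.0 + math.log(v0 / v_post))
    return kl, kl < kl_min          # second value: DataTooNoisy trigger

# Posterior indistinguishable from the prior surrogate -> KL = 0 -> gate fires.
kl_noisy, too_noisy = kl_gate(0.0, 200.0, 10.0)
# Posterior concentrated away from zero -> ample information gain.
kl_informative, still_noisy = kl_gate(15.0, 4.0, 10.0)
```

The first case models data so noisy that the posterior barely moved; the second models a decisive update, well above the 0.7-nat threshold.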

Gate 2: Would Take Too Long

Extrapolate the time to decision from the current convergence rate. If the projected time exceeds the budget by a large margin (e.g., 10×), trigger Inconclusive with reason WouldTakeTooLong.
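One hypothetical way to implement the extrapolation (non-normative Python sketch; the scalar `progress` measure of convergence toward a decision threshold is illustrative, not specified here):

```python
def would_take_too_long(elapsed_s, progress, time_budget_s=60.0, margin=10.0):
    """Gate 2 (non-normative sketch): linearly extrapolate total time to
    decision from the fraction of convergence progress made so far."""
    if progress <= 0.0:
        return True                         # no progress at all: give up
    projected_total_s = elapsed_s / progress
    return projected_total_s > margin * time_budget_s

on_track = would_take_too_long(30.0, 0.5)    # projects 60 s total: fine
stalled = would_take_too_long(30.0, 0.004)   # projects 7500 s total: trigger
```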

Gate 3: Resource Budget Exceeded

  • If elapsed time exceeds the configured time budget → TimeBudgetExceeded
  • If total samples per class exceed the configured maximum → SampleBudgetExceeded

Gate 4: Condition Drift Detected

The variance rate estimate \text{var}_{\text{rate}} is computed during calibration. If measurement conditions change during the adaptive loop, this estimate becomes invalid.

Detect condition drift by comparing measurement statistics from calibration against the full test run:

  • Variance ratio: \sigma_{\text{post}}^2 / \sigma_{\text{cal}}^2
  • Autocorrelation change: |\rho_{\text{post}}(1) - \rho_{\text{cal}}(1)|
  • Mean drift: |\mu_{\text{post}} - \mu_{\text{cal}}| / \sigma_{\text{cal}}

If variance ratio is outside [0.5, 2.0], or autocorrelation change exceeds 0.3, or mean drift exceeds 3.0, trigger Inconclusive with reason ConditionsChanged.
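The three drift statistics and their combined trigger can be sketched as follows (non-normative Python; the dict layout for summary statistics is illustrative):

```python
import math

def conditions_changed(cal, post):
    """Gate 4 (non-normative sketch): compare calibration-phase statistics
    against full-run statistics. `cal` and `post` carry keys
    mean, var, rho1 (lag-1 autocorrelation)."""
    var_ratio = post["var"] / cal["var"]
    autocorr_delta = abs(post["rho1"] - cal["rho1"])
    mean_drift = abs(post["mean"] - cal["mean"]) / math.sqrt(cal["var"])
    return (not 0.5 <= var_ratio <= 2.0      # variance ratio out of [0.5, 2.0]
            or autocorr_delta > 0.3          # autocorrelation shifted
            or mean_drift > 3.0)             # mean drifted > 3 sigma_cal

cal = {"mean": 100.0, "var": 4.0, "rho1": 0.10}
steady = {"mean": 100.5, "var": 4.4, "rho1": 0.15}
drifted = {"mean": 110.0, "var": 4.0, "rho1": 0.10}
ok_run = conditions_changed(cal, steady)     # within all bounds
bad_run = conditions_changed(cal, drifted)   # mean drift = 5 sigma: trigger
```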

Note on threshold elevation: The case where \theta_{\text{floor}} > \theta_{\text{user}} is handled by the threshold elevation decision rule (§3.3.3), not by a quality gate.

When the timer has low resolution (e.g., Apple Silicon’s 41ns cntvct_el0), quantile estimation behaves differently due to tied values.

Trigger condition:

Discrete timer mode triggers when minimum uniqueness ratio is below 10%.

Mid-distribution quantiles:

Instead of standard quantile estimators, use mid-distribution quantiles which handle ties correctly. See the Implementation Guide.
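For intuition, a mid-distribution quantile can be sketched as below (non-normative Python; the Implementation Guide remains authoritative). The mid-CDF places each distinct value at F(v) − ½·p(v), and quantiles interpolate linearly between distinct values, so heavy ties no longer collapse onto a single order statistic:

```python
def mid_quantile(sorted_vals, p):
    """Mid-distribution quantile for tied data (non-normative sketch)."""
    n = len(sorted_vals)
    vals, mids = [], []          # distinct values and their mid-CDF positions
    i, cum = 0, 0
    while i < n:
        j = i
        while j < n and sorted_vals[j] == sorted_vals[i]:
            j += 1               # scan the run of tied values
        count = j - i
        vals.append(sorted_vals[i])
        mids.append((cum + 0.5 * count) / n)   # F(v) - 0.5 * p(v)
        cum += count
        i = j
    if p <= mids[0]:
        return vals[0]           # clamp below the interpolation range
    if p >= mids[-1]:
        return vals[-1]          # clamp above the interpolation range
    for k in range(1, len(mids)):
        if p <= mids[k]:
            w = (p - mids[k - 1]) / (mids[k] - mids[k - 1])
            return vals[k - 1] + w * (vals[k] - vals[k - 1])

# Coarse 41-tick timer: half the samples land on 41, half on 82.
ticks = sorted([41] * 5 + [82] * 5)
q50 = mid_quantile(ticks, 0.5)   # interpolates between the two tied runs
```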

Work in ticks internally:

In discrete mode, implementations SHOULD perform computations in ticks (timer’s native unit) and convert to nanoseconds only for display.

Gaussian approximation caveat:

The Gaussian likelihood is a rougher approximation with discrete data. Implementations MUST report a quality issue about discrete timer mode.

The Bayesian approach requires empirical validation that posteriors are well-calibrated.

Null calibration test (normative requirement):

Implementations MUST provide a “fixed-vs-fixed” validation that measures end-to-end false positive rates:

  • FPR_gated = P(Fail | H₀, all verdict-blocking gates pass)

Acceptance criteria:

| Metric | Target | Action if Exceeded |
|---|---|---|
| FPR_gated | 2–5% (≤ 5%) | |

Large-effect detection tests:

Implementations MUST include validation tests ensuring the Student’s t prior correctly detects obvious leaks even when measurement noise is high.


This section defines abstract requirements for measurement. For implementation details, see the Implementation Guide.

Implementations MUST use a timer that:

  • Is monotonic (never decreases)
  • Has known resolution
  • Reports results that can be converted to nanoseconds

Implementations SHOULD use the highest-resolution timer available on the platform.

Measurements MUST be collected as an interleaved acquisition stream (see §3.3.1):

  • Fixed and Random class measurements MUST be interleaved
  • The interleaving order SHOULD be randomized
  • The full acquisition stream (with class labels) MUST be preserved for bootstrap

All inputs MUST be generated before the measurement loop begins. Generating inputs inside the timed region causes false positives.

Implementations MUST cap (winsorize), not drop, outliers:

  1. Compute t_cap = 99.99th percentile from pooled data
  2. Cap samples exceeding t_cap
  3. Winsorization happens before quantile computation

Quality thresholds: >0.1% capped → warning (still acceptable up to 1%); >5% → TooNoisy.
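The capping procedure can be sketched as follows (non-normative Python; the simple index-based percentile pick is illustrative, not a mandated quantile estimator):

```python
def winsorize(samples, q=0.9999):
    """Cap (do not drop) outliers at the pooled 99.99th percentile
    (non-normative sketch)."""
    s = sorted(samples)
    t_cap = s[int(q * (len(s) - 1))]            # pooled upper quantile
    capped = [min(x, t_cap) for x in samples]   # cap, never remove
    frac_capped = sum(1 for x in samples if x > t_cap) / len(samples)
    return capped, t_cap, frac_capped

# 10,000 well-behaved samples plus one pathological outlier.
samples = list(range(10_000)) + [10**9]
capped, t_cap, frac = winsorize(samples)
```

Because samples are capped rather than dropped, the sample count (and hence the interleaved acquisition stream) is preserved for the bootstrap.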

On platforms with coarse timer resolution, implementations SHOULD batch operations:

When batching is needed:

If pilot measurement shows fewer than 5 ticks per call, enable batching.

Batch size selection:

K = \text{clamp}\left( \left\lceil \frac{50}{\text{ticks\_per\_call}} \right\rceil, 1, 20 \right)

Effect scaling:

Reported effects MUST be divided by K to give per-operation estimates.

If ticks per call < 5 even with maximum batching (K=20), implementations MUST return Unmeasurable.
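The batch-size rule and per-operation scaling can be sketched as follows (non-normative Python; function names are illustrative):

```python
import math

def choose_batch_size(ticks_per_call, target_ticks=50, k_max=20):
    """K = clamp(ceil(50 / ticks_per_call), 1, 20) -- non-normative sketch."""
    return max(1, min(k_max, math.ceil(target_ticks / ticks_per_call)))

def per_op_effect_ns(batched_effect_ns, k):
    """Reported effects are divided by K to give per-operation estimates."""
    return batched_effect_ns / k

# Coarse timer: 2.5 ticks per call -> batch 20 calls per timed region.
k = choose_batch_size(2.5)
```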

Implementations SHOULD perform pre-flight checks:

  • Timer sanity: Verify monotonicity and reasonable resolution
  • Harness sanity (fixed-vs-fixed): Detect test harness bugs
  • Stationarity: Detect drift during measurement

This section provides language-agnostic guidance for API design. These are recommendations (SHOULD) unless marked otherwise.

Two-class pattern:

Implementations SHOULD expose the DudeCT two-class pattern:

  • Baseline class: Fixed input (typically all zeros)
  • Sample class: Variable input (typically random)

This pattern tests for data-dependent timing, not specific value comparisons.

Attacker model presets as primary entry point:

The primary configuration entry point SHOULD be attacker model selection, not raw threshold values.

Sane defaults:

Default configuration SHOULD:

  • Use AdjacentNetwork attacker model (or equivalent)
  • Set time budget to 60 seconds
  • Set sample budget to 1,000,000
  • Set pass/fail thresholds to 0.05/0.95

Leak probability prominence:

The leak probability MUST be prominently displayed in results and human-readable output.

Threshold transparency:

When \theta_{\text{eff}} > \theta_{\text{user}}, implementations MUST clearly indicate this to the user.

Inconclusive guidance:

For Inconclusive outcomes, implementations MUST provide the reason and SHOULD provide actionable guidance.


| Parameter | Type | Description |
|---|---|---|
| attacker_model OR threshold_ns | AttackerModel or Float | Defines the effect threshold |

| Parameter | Default | Description |
|---|---|---|
| time_budget | 60 seconds | Maximum test duration |
| max_samples | 1,000,000 | Maximum samples per class |
| pass_threshold | 0.05 | P(leak) below this → Pass |
| fail_threshold | 0.95 | P(leak) above this → Fail |
| calibration_samples | 5,000 | Samples for calibration phase |
| batch_size | 1,000 | Samples per adaptive batch |
| bootstrap_iterations | 2,000 | Bootstrap iterations for covariance |
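For illustration, the optional parameters map naturally onto a configuration object like the following (hypothetical Python sketch; field names are illustrative and not part of any real tacet API):

```python
from dataclasses import dataclass

@dataclass
class OracleConfig:
    # Hypothetical configuration mirroring the defaults above;
    # every field name here is illustrative, not a real API surface.
    time_budget_s: float = 60.0
    max_samples: int = 1_000_000
    pass_threshold: float = 0.05
    fail_threshold: float = 0.95
    calibration_samples: int = 5_000
    batch_size: int = 1_000
    bootstrap_iterations: int = 2_000

cfg = OracleConfig()   # all defaults; users typically override the budget
```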

| Symbol | Meaning |
|---|---|
| \{(c_t, y_t)\} | Acquisition stream: class labels and timing measurements |
| T | Acquisition stream length (≈ 2n) |
| F, R | Per-class sample sets (filtered from stream) |
| \Delta | Scalar: observed W₁ distance (in nanoseconds) |
| \delta | Scalar: true (latent) W₁ distance (in nanoseconds) |
| W₁ | Wasserstein-1 distance between timing distributions |
| \text{var}_n | Variance of W₁ estimator at sample size n |
| \text{var}_{\text{rate}} | Variance rate (scalar) |
| V_{\text{cal}} | Calibration long-run variance proxy: \text{var}_{\text{cal}} \cdot n_{\text{cal}} |
| \hat{\tau} | Integrated autocorrelation time (via Geyer IMS) |
| n_eff | Effective sample size: n_{\text{eff}} = n / \hat{\tau} |
| n_blocks | Number of independent blocks: \max(1, \lfloor n / L \rfloor) where L is block length |
| \nu | Half-t degrees of freedom (fixed at 4) |
| \sigma | Half-t prior scale (calibrated via exceedance target) |
| \lambda | Latent prior precision multiplier in scale-mixture representation |
| \kappa | Latent likelihood precision multiplier (robust t-likelihood) |
| \nu_{\ell} | Likelihood degrees of freedom (fixed at 4) |
| t_{\nu}(0, \sigma^2) | Univariate Student’s t with \nu df, location 0, scale \sigma^2 |
| \text{half-}t_{\nu}(0, \sigma^2) | Half-t distribution (t restricted to positive values) |
| V_0^{\text{marginal}} | Marginal prior variance: 2\sigma^2 for \nu = 4 |
| V_{\text{post}} | Posterior variance of \delta (estimated from Gibbs samples) |
| \delta_{\text{post}} | Posterior mean of \delta (from Gibbs samples) |
| N_gibbs | Total Gibbs iterations (5000) |
| N_burn | Burn-in iterations (1000) |
| N_keep | Retained Gibbs samples (4000) |
| \theta_{\text{user}} | User-requested threshold |
| \theta_{\text{floor}} | Measurement floor (smallest resolvable effect) |
| c_{\text{floor}} | Floor-rate constant: \theta_{\text{floor,stat}} = c_{\text{floor}}/\sqrt{n} |
| \theta_{\text{tick}} | Timer resolution component of floor |
| \theta_{\text{eff}} | Effective threshold used for inference |
| m(\delta) | Decision functional: \delta (the W₁ distance itself) |
| L | Block length (Politis–White, on acquisition stream) |
| MDE | Minimum detectable effect |
| n | Samples per class |
| B | Bootstrap iterations |
| KL | KL divergence: KL(posterior ∥ prior) for Gate 1 |
| KL_min | Minimum KL threshold for conclusive verdict (default 0.7 nats) |

These constants define conformant implementations. Implementations MAY use different values only where noted.

| Constant | Default | Normative | Rationale |
|---|---|---|---|
| Test statistic | W₁ distance | MUST | 1D scalar, optimal transport |
| Prior family | Half-t | MUST | Continuous scale adaptation, non-negative |
| Degrees of freedom (\nu) | 4 | MUST | Heavy tails + finite variance |
| Gamma parameterization | shape–rate | MUST | Avoid library ambiguity |
| Gibbs iterations (N_gibbs) | 5000 | MUST | High-quality posterior approximation |
| Gibbs burn-in (N_burn) | 1000 | MUST | Conservative convergence |
| Gibbs retained (N_keep) | 4000 | MUST | Low MC variance |
| Gibbs initialization (\lambda^{(0)}, \kappa^{(0)}) | 1 | MUST | Prior means |
| Likelihood df (\nu_{\ell}) | 4 | MUST | Robustness to variance misestimation |
| Prior exceedance target (\pi_0) | 0.62 | SHOULD | Genuine uncertainty |
| Prior calibration MC draws | 50,000 | SHOULD | Stable \sigma calibration |
| Bootstrap iterations | 2,000 | SHOULD | Variance estimation accuracy |
| Monte Carlo samples (c_floor) | 50,000 | SHOULD | Floor-rate constant estimation |
| Batch size | 1,000 | SHOULD | Adaptive iteration granularity |
| Calibration samples | 5,000 | SHOULD | Initial variance estimation |
| Pass threshold | 0.05 | SHOULD | 95% confidence of no leak |
| Fail threshold | 0.95 | SHOULD | 95% confidence of leak |
| KL_min (nats) | 0.7 | MUST | Minimum information gain for conclusive verdict |
| Block length cap | min(3√T, T/3) | SHOULD | Prevent degenerate blocks |
| Discrete threshold | 10% unique | SHOULD | Trigger discrete mode |
| Min ticks per call | 5 | SHOULD | Measurability floor |
| Max batch size | 20 | SHOULD | Limit microarch artifacts |
| Default time budget | 60 s | MAY | Maximum runtime |
| Default sample budget | 1,000,000 | MAY | Maximum samples |
| Default RNG seed | 0x74696D696E67 | SHOULD | “timing” in ASCII |
| Likelihood inflation threshold | 0.3 | SHOULD | \kappa_{\text{mean}} triggering LikelihoodInflated |

Statistical methodology:

  1. Bishop, C. M. (2006). Pattern Recognition and Machine Learning, Ch. 3. Springer. (Bayesian linear regression)

  2. Politis, D. N. & White, H. (2004). “Automatic Block-Length Selection for the Dependent Bootstrap.” Econometric Reviews 23(1):53–70.

  3. Künsch, H. R. (1989). “The Jackknife and the Bootstrap for General Stationary Observations.” Annals of Statistics. (Block bootstrap)

  4. Hyndman, R. J. & Fan, Y. (1996). “Sample quantiles in statistical packages.” The American Statistician 50(4):361–365.

  5. Welford, B. P. (1962). “Note on a Method for Calculating Corrected Sums of Squares and Products.” Technometrics 4(3):419–420.

  6. Gelman, A. et al. (2013). Bayesian Data Analysis, 3rd ed., Ch. 11–12. CRC Press. (Gibbs sampling, scale mixtures)

  7. Lange, K. L., Little, R. J. A., & Taylor, J. M. G. (1989). “Robust Statistical Modeling Using the t Distribution.” JASA 84(408):881–896. (Student’s t for robustness)

Timing attacks:

  1. Reparaz, O., Balasch, J., & Verbauwhede, I. (2016). “Dude, is my code constant time?” DATE. (DudeCT methodology)

  2. Crosby, S. A., Wallach, D. S., & Riedi, R. H. (2009). “Opportunities and Limits of Remote Timing Attacks.” ACM TISSEC 12(3):17. (Exploitability thresholds)

  3. Van Goethem, T., et al. (2020). “Timeless Timing Attacks.” USENIX Security. (HTTP/2 timing attacks)

  4. Bernstein, D. J. et al. (2024). “KyberSlash.” (Timing vulnerability example)

  5. Dunsche, M. et al. (2025). “SILENT: A New Lens on Statistics in Software Timing Side Channels.” arXiv:2504.19821. (Relevant hypotheses framework)

Existing tools:

  1. dudect (C): https://github.com/oreparaz/dudect
  2. dudect-bencher (Rust): https://github.com/rozbb/dudect-bencher

Statistical correctness refinements:

This release fixes inference semantics in the v7.0 W₁ implementation based on statistician review.

Core changes:

  • Inference uses raw W₁ (§3.1): Bayesian inference uses raw W₁ distance without debiasing or clamping. Debiased W₁ is computed only for display purposes to help users interpret effect magnitude above measurement noise.

  • Floor from null distribution (§3.3.3): Measurement floor c_{\text{floor}} is calibrated from the 95th percentile of null W₁ replicates (via within-class splits). This provides correct Type I error control under the null hypothesis.

  • Block count terminology (Appendix A): Variable n_{\text{blocks}} = \max(1, \lfloor n / L \rfloor) replaces ambiguous n_{\text{eff}} in floor calculations. Effective sample size n_{\text{eff}} = n / \hat{\tau} remains for IACT-based diagnostics.

  • Prior targets user threshold (§3.3.4): Half-t prior scale \sigma is calibrated so that P(\delta > \theta_{\text{user}}) = \pi_0. The prior encodes the user’s security threshold.

  • Robust likelihood (§3.4.2): Student-t likelihood degrees of freedom changed from \nu_{\ell} = 8 to \nu_{\ell} = 4 (matching prior \nu = 4) for consistency. Robustness parameter \kappa \sim \text{Gamma}(\nu_{\ell}/2, \nu_{\ell}/2) guards against variance underestimation.

  • Tail directionality metric (§2.3): tail_slow_share measures the fraction of tail deviation magnitude (p95+) from slowdowns, computed as \sum \max(d_i - \text{shift}, 0) / \sum |d_i - \text{shift}| over tail indices. Operates on quantile-aligned differences, not sample identities.

Migration notes:

  • These changes affect statistical inference but not API surface
  • Results may differ slightly from v7.0 due to corrected floor calibration and prior targeting
  • No code changes required for users; test outcomes may be more conservative (fewer false positives)

Migration from 9D quantile differences to 1D Wasserstein-1 distance:

This is a breaking change that fundamentally simplifies the statistical methodology while improving sensitivity and performance.

Core changes:

  • Test statistic (§3.1): Replace 9D quantile-difference vector with 1D W₁ (Wasserstein-1) distance

    • W₁ measures the minimum cost to transform one distribution into another
    • Naturally captures both uniform shifts and tail effects in a single scalar
    • Debiased estimator: W₁_deb = max(0, W₁(baseline, sample) - θ_floor)
  • Prior (§3.4.3): Replace 9D multivariate t-distribution with 1D half-t distribution

    • Half-t is appropriate for non-negative W₁ distances
    • Implemented via scale mixture: λ ~ Gamma(ν/2, ν/2), δ | λ ~ half-𝒩(0, σ²/λ)
    • Marginal prior variance: V₀^marginal = 2σ² (scalar, not matrix)
  • Variance estimation (§3.3.2): Replace 9×9 covariance matrix with scalar variance

    • var_rate (scalar) instead of Σ_rate (matrix)
    • Variance scaled by sample size: var_n = var_rate / n
    • Block bootstrap still used to preserve temporal dependence
  • Bayesian inference (§3.4): Replace 9D Gibbs sampler with 1D Gibbs sampler

    • Gibbs iterations increased from 256 to 5000 (1000 burn-in, 4000 retained)
    • 1D sampler is much simpler but we use more iterations for low MC variance
    • Decision functional simplified: m(δ) = δ (the W₁ distance itself)
    • Posterior probability: P(δ > θ_eff | Δ)
  • Quality gate (§3.5.2): Simplified KL divergence formula for 1D case

    • KL = ½(V_post/V₀ + μ_post²/V₀ - 1 + ln(V₀/V_post))
    • Same threshold (KL_min = 0.7 nats) for data-too-noisy detection
  • EffectEstimate (§2.3): Updated to reflect W₁ as primary metric

    • max_effect_ns: Posterior mean of W₁ distance
    • credible_interval_ns: 95% CI for W₁
    • tail_diagnostics: Decomposition into shift and tail components
      • shift_ns: Median difference (uniform shift component)
      • tail_ns: Tail-specific component (beyond shift)
      • tail_share: Fraction of effect from tail [0-1]
      • quantile_shifts: Per-quantile differences (p50, p90, p95, p99) for interpretation
      • pattern_label: TailEffect | UniformShift | Mixed | Negligible

Expected benefits:

  1. 5-10× faster inference: 1D Gibbs sampler converges much faster than 9D, despite using more iterations
  2. Better tail detection: Optimal transport naturally emphasizes distributional differences
  3. Simpler interpretation: Single distance in nanoseconds vs. 9 correlated quantile differences
  4. Natural debiasing: W₁(F, F) = 0 exactly (unlike quantile differences which vary due to sampling)
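The exact-zero property in item 4 follows from the 1D closed form for equal-size samples: the optimal transport plan matches order statistics, so W₁ is the mean absolute difference of sorted samples. A non-normative sketch:

```python
def w1_distance(xs, ys):
    """W1 between two equal-size empirical distributions (non-normative
    sketch): sort both samples and average the absolute differences,
    which is the 1D optimal-transport closed form."""
    assert len(xs) == len(ys)
    pairs = zip(sorted(xs), sorted(ys))
    return sum(abs(a - b) for a, b in pairs) / len(xs)

base = [10.0, 12.0, 15.0, 20.0]
identical = w1_distance(base, base)                   # exactly zero
shifted = w1_distance(base, [x + 3.0 for x in base])  # uniform 3 ns shift
```

A uniform shift moves every quantile by the same amount, so W₁ equals the shift; tail-only effects contribute through the upper order statistics alone.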

Migration notes:

  • Existing code using tacet v6.x will need to update result handling
  • The top_quantiles field in EffectEstimate has been replaced with tail_diagnostics
  • All posterior probabilities now refer to W₁ distance, not max quantile difference
  • Mathematical notation updated in Appendix A to reflect 1D formulation

Specification simplification and modularization:

  • Removed: Research mode as separate outcome type. When θ_user = 0, the oracle returns Inconclusive with posterior estimates. This removes ~50 lines and the ResearchStatus enum.

  • Removed: 2D projection for reporting (§3.4.6 in v5.7). The shift/tail decomposition and EffectPattern enum have been removed. EffectEstimate now reports max|δ|, 95% CI, and top quantiles by exceedance probability.

  • Removed: Exploitability classification (§2.4 in v5.7). This is now documented only in the user guide, not the statistical specification.

  • Simplified: κ robust likelihood diagnostics. Only likelihood_inflated: Bool is exposed; detailed κ mixing diagnostics (kappa_mean, kappa_sd, kappa_cv, kappa_ess, kappa_mixing_ok) are internal implementation details.

  • Consolidated: Quality gates from 6 to 4:

    • Gates 1+2 → Insufficient Information Gain
    • Gates 4+5 → Resource Budget Exceeded
    • Gate 3 → Would Take Too Long
    • Gate 6 → Conditions Changed
  • Consolidated: Issue codes from 30+ to 8 categories: DependenceHigh, PrecisionLow, DiscreteMode, ThresholdIssue, FilteringApplied, StationarityIssue, NumericalIssue, LikelihoodInflated.

  • Created: Separate Implementation Guide for algorithms (IACT, block bootstrap, numerical stability, Gibbs conditionals, quantile computation).

  • Created: Separate Power Module Specification for power/EM analysis.

  • Removed: Projection mismatch diagnostics and threshold calibration (no longer needed without 2D projection).

  • Removed: Dimension-aware regularization details (moved to Implementation Guide; timing uses d=9 which rarely triggers stressed regimes).

Rationale: These changes reduce specification complexity by ~40% while preserving all normative statistical requirements. The removed features were either:

  1. Reporting conveniences (2D projection, exploitability classification) that don’t affect inference
  2. Separate code paths for edge cases (Research mode) that can be handled by existing mechanisms
  3. Implementation details (algorithms, numerical procedures) better suited to a separate guide