Specification (v7.1)
This document is the authoritative specification for tacet, a Bayesian timing side-channel detection system. It defines the statistical methodology, abstract types, and requirements that implementations MUST follow to be conformant.
For implementation details (algorithms, numerical procedures), see the Implementation Guide. For language-specific APIs, see the Rust API, C API, or Go API. For interpreting results, see Interpreting Results and Attacker Models.
Terminology (RFC 2119)
Section titled “Terminology (RFC 2119)”The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in RFC 2119.
In summary:
- MUST / REQUIRED / SHALL: Absolute requirement
- MUST NOT / SHALL NOT: Absolute prohibition
- SHOULD / RECOMMENDED: Strong recommendation (valid reasons to deviate may exist)
- SHOULD NOT: Strong discouragement (valid reasons to deviate may exist)
- MAY / OPTIONAL: Truly optional
1. Overview
Section titled “1. Overview”1.1 Problem Statement
Section titled “1.1 Problem Statement”Timing side-channel attacks exploit data-dependent execution time in cryptographic implementations. Existing detection tools have significant limitations:
- T-test approaches (DudeCT) compare means, missing distributional differences such as cache effects that only affect upper quantiles
- P-value misinterpretation: Statistical significance does not equal practical significance; with enough samples, negligible effects become “significant”
- CI flakiness: Fixed sample sizes cause tests to pass locally but fail in CI (or vice versa) due to environmental noise
- Binary output: No distinction between “no leak detected” and “couldn’t measure reliably”
1.2 Solution
Section titled “1.2 Solution”tacet addresses these issues with:
- Wasserstein-1 distance: A single metric capturing both uniform shifts and tail effects, measuring the minimum cost to transform one distribution into another
- Adaptive Bayesian inference: Collect samples until confident, with natural early stopping
- Three-way decisions: Pass / Fail / Inconclusive, distinguishing “safe” from “unmeasurable”
- Interpretable output: Posterior probability (0-100%) instead of p-values
- Fail-safe design: Prefer Inconclusive over confidently wrong
1.3 Design Goals
Section titled “1.3 Design Goals”- Interpretable: Output is a probability, not a t-statistic
- Adaptive: Collects more samples when uncertain, stops early when confident
- CI-friendly: Three-way output prevents flaky tests
- Portable: Handles different timer resolutions via adaptive batching
- Honest: Never silently clamps thresholds; reports what it can actually resolve
- Fail-safe: CI verdicts SHOULD almost never be confidently wrong
- Reproducible: Deterministic results given identical samples and configuration
2. Abstract Types and Semantics
Section titled “2. Abstract Types and Semantics”This section defines the types that all implementations MUST provide. Types are specified using language-agnostic pseudocode.
2.1 Outcome
Section titled “2.1 Outcome”The primary result type returned by the oracle:
Outcome = | Pass { leak_probability: Float, // P(m(δ) > θ_eff | Δ), always at θ_eff effect: EffectEstimate, theta_user: Float, // User-requested threshold (ns) theta_eff: Float, // Effective threshold used (ns) theta_floor: Float, // Measurement floor at decision time (ns) decision_threshold_ns: Float, // θ_eff at which decision was made samples_used: Int, quality: MeasurementQuality, diagnostics: Diagnostics } | Fail { leak_probability: Float, // P(m(δ) > θ_eff | Δ) effect: EffectEstimate, theta_user: Float, theta_eff: Float, // MAY exceed θ_user (leak detected above floor) theta_floor: Float, decision_threshold_ns: Float, // θ_eff at which decision was made samples_used: Int, quality: MeasurementQuality, diagnostics: Diagnostics } | Inconclusive { reason: InconclusiveReason, leak_probability: Float, // P(m(δ) > θ_eff | Δ) effect: EffectEstimate, theta_user: Float, theta_eff: Float, theta_floor: Float, samples_used: Int, quality: MeasurementQuality, diagnostics: Diagnostics } | Unmeasurable { operation_ns: Float, // Estimated operation time threshold_ns: Float, // Timer resolution platform: String, recommendation: String }Semantics:
The field leak_probability MUST be computed as , where is the effective threshold used for inference at decision time.
When > , the oracle cannot support a Pass claim at , because effects in the range (, ] are not distinguishable from noise under the measured conditions.
Implementations MUST NOT substitute into leak_probability when > .
-
Pass: MUST be returned when ALL of the following hold:
leak_probability < pass_threshold(default 0.05)- where
- All verdict-blocking quality gates pass
- > 0
-
Fail: MUST be returned when ALL of the following hold:
leak_probability > fail_threshold(default 0.95)- All verdict-blocking quality gates pass
Note: Fail MAY be returned when > . Detecting > implies > (since ≥ by construction).
-
Inconclusive: MUST be returned when ANY of the following hold:
- A verdict-blocking quality gate fails
- Resource budgets exhausted without reaching a decision threshold
leak_probability < pass_thresholdbut (threshold elevated)
-
Unmeasurable: MUST be returned when the operation is too fast to measure reliably (see §4.5)
Exploratory mode (θ_user = 0): When = 0, Pass/Fail semantics do not apply. The oracle MUST return Inconclusive with the posterior estimates, allowing users to interpret the results themselves. This mode is useful for profiling and debugging but is not suitable for CI gating.
2.2 AttackerModel
Section titled “2.2 AttackerModel”Threat model presets defining the minimum effect size worth detecting:
AttackerModel = | SharedHardware // θ = 0.4 ns (~2 cycles @ 5GHz) | PostQuantumSentinel // θ = 2.0 ns (~10 cycles @ 5GHz) | AdjacentNetwork // θ = 100 ns | RemoteNetwork // θ = 50,000 ns (50 μs) | Custom { threshold_ns: Float }| Model | Threshold | Use Case |
|---|---|---|
| SharedHardware | 0.4 ns | SGX enclaves, cross-VM, containers, hyperthreading |
| PostQuantumSentinel | 2.0 ns | Post-quantum crypto (ML-KEM, ML-DSA) |
| AdjacentNetwork | 100 ns | LAN services, HTTP/2 APIs |
| RemoteNetwork | 50 μs | Internet-exposed services |
Cycle-based thresholds use a conservative 5 GHz reference frequency (assumes fast attacker hardware—smaller θ = more sensitive = safer).
There is no single correct threshold. The choice of attacker model is a statement about your threat model.
For exploratory analysis without a decision threshold, use Custom { threshold_ns: 0.0 }.
2.3 EffectEstimate
Section titled “2.3 EffectEstimate”Summary of the detected timing effect:
EffectEstimate = { max_effect_ns: Float, // Posterior mean of W₁ distance (in nanoseconds) credible_interval_ns: (Float, Float), // 95% CI for W₁ tail_diagnostics: TailDiagnostics // Decomposition of effect into shift and tail components}
TailDiagnostics = { shift_ns: Float, // Uniform shift component (median difference) tail_ns: Float, // Tail-specific component (beyond shift) tail_share: Float, // Fraction of effect from tail [0-1] tail_slow_share: Float, // Directionality of tail (p95+): fraction that are slowdowns [0-1] quantile_shifts: QuantileShifts, // Per-quantile differences for interpretation pattern_label: EffectPattern // Classification of effect pattern}
QuantileShifts = { p50_ns: Float, // Median difference (50th percentile shift) p90_ns: Float, // 90th percentile shift p95_ns: Float, // 95th percentile shift p99_ns: Float // 99th percentile shift}
EffectPattern = | TailEffect // Leak concentrated in upper quantiles (tail_share > 0.6) | UniformShift // Leak affects all quantiles equally (tail_share < 0.3) | Mixed // Combination of shift and tail (0.3 ≤ tail_share ≤ 0.6) | Negligible // No significant effect detectedInterpreting W₁ distance:
The W₁ (Wasserstein-1) distance measures the minimum cost to transform one distribution into another. The tail_diagnostics field decomposes this single scalar metric into interpretable components, helping users understand whether the leak is:
- A uniform shift (all quantiles affected equally, e.g., constant-time overhead difference)
- A tail effect (upper quantiles affected more, e.g., cache misses)
- A mixed pattern (combination of both)
The decomposition works by comparing the W₁ distance to the median difference:
- If W₁ ≈ median difference, the effect is uniform (constant shift)
- If W₁ >> median difference, the effect is concentrated in the tail (e.g., cache misses)
The tail_slow_share field indicates the directionality of tail deviations (p95+): values near 1.0 indicate tail deviations are predominantly slowdowns, values near 0.0 indicate speedups, and values near 0.5 indicate balanced directionality. This metric operates on quantile-aligned differences and measures what fraction of tail deviation magnitude comes from positive (slowdown) differences.
The quantile_shifts provide per-quantile differences for understanding the effect distribution, helping identify at which percentiles the leak manifests.
2.4 MeasurementQuality
Section titled “2.4 MeasurementQuality”Assessment of measurement precision:
MeasurementQuality = | Excellent // MDE < 5 ns | Good // MDE 5–20 ns | Poor // MDE 20–100 ns | TooNoisy // MDE > 100 ns2.5 InconclusiveReason
Section titled “2.5 InconclusiveReason”InconclusiveReason = | DataTooNoisy { message: String, guidance: String } | NotLearning { message: String, guidance: String } | WouldTakeTooLong { estimated_time_secs: Float, samples_needed: Int, guidance: String } | ThresholdElevated { theta_user: Float, // What user requested theta_eff: Float, // What we measured at leak_probability_at_eff: Float, // P(m(δ) > θ_eff | Δ) meets_pass_criterion_at_eff: Bool, // True if P < pass_threshold at θ_eff achievable_at_max: Bool, // Could θ_user be reached with max budget? message: String, guidance: String } | TimeBudgetExceeded { current_probability: Float, samples_collected: Int } | SampleBudgetExceeded { current_probability: Float, samples_collected: Int } | ConditionsChanged { drift: ConditionDrift }The meets_pass_criterion_at_eff field indicates whether < pass_threshold. This allows CI systems to implement policies like “treat pass-criterion-met-at-floor as acceptable” without changing inference semantics.
The achievable_at_max field distinguishes:
true: > now, but (more sampling may help)false: (cannot reach on this platform/configuration)
2.6 Diagnostics
Section titled “2.6 Diagnostics”Diagnostics = { // Dependence and effective samples dependence_length: Int, // b̂ from Politis-White (bootstrap only) effective_sample_size: Int, // n / τ̂ iact_combined: Float, // max(τ̂_F, τ̂_R)
// Stationarity stationarity_ratio: Float, stationarity_ok: Bool,
// Outlier handling outlier_rate_baseline: Float, outlier_rate_sample: Float, outlier_asymmetry_ok: Bool,
// Timer and mode discrete_mode: Bool, timer_resolution_ns: Float, duplicate_fraction: Float,
// Run information preflight_ok: Bool, calibration_samples: Int, total_time_secs: Float, seed: Option<Int>, threshold_ns: Float, timer_name: String, platform: String,
// Gibbs sampler gibbs_iters_total: Int, gibbs_burnin: Int, gibbs_retained: Int, lambda_mean: Float, // Posterior mean of λ (prior precision) lambda_mixing_ok: Bool,
// Robust likelihood likelihood_inflated: Bool, // True if κ_mean < 0.3
// Warnings warnings: List<String>, quality_issues: List<QualityIssue>}
QualityIssue = { code: IssueCode, message: String, guidance: String}
IssueCode = | DependenceHigh // High autocorrelation, reduced effective samples | PrecisionLow // Limited measurement precision | DiscreteMode // Coarse timer resolution | ThresholdIssue // Cannot achieve requested threshold | FilteringApplied // Outliers were capped | StationarityIssue // Conditions may have changed | NumericalIssue // Gibbs sampler convergence concern | LikelihoodInflated // Uncertainty inflated for robustness3. Statistical Methodology
Section titled “3. Statistical Methodology”This section describes the mathematical foundation of tacet. All formulas in this section are normative; implementations MUST produce equivalent results.
3.1 Test Statistic: Wasserstein-1 Distance
Section titled “3.1 Test Statistic: Wasserstein-1 Distance”We collect timing samples from two classes:
- Fixed class (F): A specific input (e.g., all zeros)
- Random class (R): Randomly sampled inputs
We measure the Wasserstein-1 (W₁) distance between the two timing distributions. The W₁ distance measures the minimum cost to transform one distribution into another, where cost is the amount of probability mass times the distance it must be moved.
The test statistic is a 1D scalar (in nanoseconds):
where denotes the empirical cumulative distribution function.
Debiased W₁ distance:
For human-readable output, implementations MAY compute a debiased estimator:
where is the measurement noise floor (§3.3.3). The debiased estimator provides a display-friendly interpretation of effect magnitude relative to measurement noise. Inference (§3.4) uses the raw W₁ distance.
Why W₁ distance?
Timing leaks manifest in different ways:
- Uniform shift: A different code path adds constant overhead → entire distribution shifts
- Tail effect: Cache misses occur probabilistically → upper quantiles shift more
The W₁ distance captures both patterns in a single scalar metric:
- For uniform shifts, W₁ ≈ shift magnitude (median difference)
- For tail effects, W₁ emphasizes the tail differences
- Mixed patterns are captured proportionally
W₁ properties:
- Single metric: Captures distributional differences in nanoseconds
- Optimal transport: Weights differences by their quantile position
- Natural zero: W₁ between identical distributions equals zero
- 1D inference: Enables fast Bayesian updating with simple conjugacy
W₁ computation: Implementations MUST compute W₁ using the sorted-samples method for discrete distributions. See the Implementation Guide for the algorithm.
3.2 Two-Phase Architecture
Section titled “3.2 Two-Phase Architecture”The system operates in two phases:
┌─────────────────────────────────────────────────────────────────┐│ Architecture │├─────────────────────────────────────────────────────────────────┤│ ││ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ││ │ Calibration │───▶│ Adaptive │───▶│ Decision │ ││ │ Phase │ │ Loop │ │ Output │ ││ └──────────────┘ └──────────────┘ └──────────────┘ ││ │ │ │ ││ ▼ ▼ ▼ ││ • Estimate Σ_rate • Collect batch • Pass (P<5%) ││ • Compute θ_floor • Update Δ • Fail (P>95%) ││ • Set prior scale • Scale Σ by 1/n • Inconclusive ││ • Warmup caches • Update θ_floor ││ • Pre-flight checks • Compute P(>θ) ││ • Check quality ││ • Check stopping ││ │└─────────────────────────────────────────────────────────────────┘Phase 1: Calibration (runs once)
- Collect initial samples to characterize measurement noise
- Estimate covariance structure via stream-based block bootstrap
- Compute “covariance rate” that scales as
- Compute initial measurement floor and floor-rate constant c_floor
- Compute effective threshold and calibrate prior scale
- Run pre-flight checks (timer sanity, harness sanity, stationarity)
Phase 2: Adaptive Loop (iterates until decision)
- Collect batches of samples
- Update quantile estimates from all data collected so far
- Scale covariance:
- Update using floor-rate constant
- Run Gibbs sampler to approximate posterior and compute
- Check quality gates (posterior ≈ prior → Inconclusive)
- Check decision boundaries (P > 95% → Fail, P < 5% → Pass)
- Check time/sample budgets
Why this structure?
The key insight is that covariance scales as 1/n for quantile estimators. By estimating the covariance rate once during calibration, we can cheaply update the posterior as more data arrives; no re-bootstrapping needed. This makes adaptive sampling computationally tractable.
3.3 Calibration Phase
Section titled “3.3 Calibration Phase”The calibration phase runs once at startup to characterize measurement noise.
Sample collection:
Implementations SHOULD collect n_cal samples per class (default: 5,000). This is enough to estimate covariance structure reliably while keeping calibration fast.
Fragile regime:
A fragile regime is a measurement condition where standard statistical assumptions may not hold, requiring more conservative estimation. The fragile regime is triggered when either:
- Discrete timer mode: The timer has coarse resolution (see §3.6)
- Low uniqueness: The minimum uniqueness ratio across classes is below 10%:
When a fragile regime is detected, implementations apply more conservative procedures (larger block lengths, regularized covariance) to maintain calibration.
3.3.1 Acquisition Stream Model
Section titled “3.3.1 Acquisition Stream Model”Measurement produces an interleaved acquisition stream indexed by time:
where is the measured runtime (or ticks) at acquisition index , and F/R denote Fixed and Random classes.
Per-class samples are obtained by filtering the stream:
Critical principle: The acquisition stream is the data-generating process. All bootstrap and dependence estimation MUST preserve adjacency in acquisition order, not per-class position.
3.3.2 Variance Estimation via Stream-Based Block Bootstrap
Section titled “3.3.2 Variance Estimation via Stream-Based Block Bootstrap”Timing measurements exhibit autocorrelation: nearby samples are more similar than distant ones due to cache state, frequency scaling, etc. Standard bootstrap assumes i.i.d. samples, underestimating variance. Implementations MUST use block bootstrap on the acquisition stream to preserve the true dependence structure.
Block length selection: Implementations SHOULD use the Politis-White algorithm to select the optimal block length, with class-conditional ACF to avoid underestimation from interleaved sampling. See the Implementation Guide for the algorithm.
Variance rate (scalar):
The variance of the W₁ estimator scales as 1/n. The variance rate is a scalar for the 1D W₁ distance:
where is the bootstrap variance estimate from calibration.
Define the calibration long-run variance proxy:
This is a scalar (not a matrix) because W₁ is a 1D statistic.
Effective sample size via IACT:
Under strong temporal dependence, n samples do not provide n independent observations. Implementations SHOULD compute integrated autocorrelation time (IACT) for diagnostic purposes:
where is estimated via the Geyer Initial Monotone Sequence (IMS) method or equivalent. See the Implementation Guide for algorithms.
Implementations SHOULD report as diagnostics:
Diagnostics.dependence_length= (block length from bootstrap)Diagnostics.iact_combined= (IACT estimate)Diagnostics.effective_sample_size= (for diagnostics only)
Variance scaling during adaptive loop:
Since is computed via block bootstrap (which preserves temporal dependence structure), it already represents the long-run variance rate. During inference with samples:
The block bootstrap accounts for autocorrelation in the variance estimate. IACT estimates are computed and reported as diagnostics but do not affect inference.
3.3.3 Measurement Floor and Effective Threshold
Section titled “3.3.3 Measurement Floor and Effective Threshold”A critical design element is distinguishing between what the user wants to detect () and what the measurement can detect ().
Floor-rate constant (from null distribution):
Implementations MUST compute a floor-rate constant once at calibration by bootstrapping the null distribution of raw W₁:
- Generate null W₁ replicates via within-class splits: split baseline into two halves and compute W₁ between them; similarly for sample class
- Scale each null replicate by where is the number of independent blocks at calibration
- Compute as the 95th percentile of these scaled null replicates
This SHOULD use at least 2,000 null bootstrap replicates for stable percentile estimation.
Measurement floor (dynamic):
During the adaptive loop, the statistical floor decreases as sample size grows:
where is the number of independent blocks for sample size , and is the block length.
The tick floor is fixed once batching is determined:
where is the batch size. The combined floor:
(The tick floor is already incorporated in the within .)
Effective threshold ():
Threshold elevation decision rule:
When > :
-
Fail propagates: Detecting > implies > . Implementations MAY return Fail.
-
Pass does not propagate: “No detectable effect above ” is compatible with effects in (, ]. Implementations MUST NOT return Pass when > + .
Dynamic floor updates:
During the adaptive loop, decreases as n grows. Implementations MUST:
- Recompute after each batch
- If drops to or below, Pass becomes possible (subject to posterior)
- Report the used for the final decision
3.3.4 Prior Scale Calibration (Half-t Prior)
Section titled “3.3.4 Prior Scale Calibration (Half-t Prior)”The prior on the 1D W₁ effect MUST be a half-t distribution (Student’s t restricted to positive values):
where is a fixed degrees-of-freedom parameter and is a scale parameter calibrated to an exceedance target.
W₁ distances are non-negative, so the prior is a half-t distribution (Student’s t restricted to positive values). This is implemented via a scale mixture of Gaussians (see §3.4.3).
Degrees of freedom ():
Implementations MUST use .
Calibrating via exceedance target:
The scale MUST be chosen so that the prior exceedance probability equals a fixed target (default 0.62):
This MUST be solved by deterministic 1D root-finding using Monte Carlo integration. See the Implementation Guide for the algorithm.
The prior is calibrated against (the user’s threat-model threshold). The measurement floor is handled separately in the decision logic (§3.3.3).
Design properties: The half-t with ν = 4 provides heavy tails while maintaining finite variance. The calibrated scale σ balances informativeness and flexibility: the prior is weakly informative (allowing data to dominate when signal is strong) without being vacuous.
3.3.5 Deterministic Seeding Policy
Section titled “3.3.5 Deterministic Seeding Policy”To ensure reproducible results, all random number generation MUST be deterministic by default.
Normative requirement:
Given identical timing samples and configuration, the oracle MUST produce identical results (up to floating-point roundoff).
Seeding policy:
- The bootstrap RNG seed MUST be deterministically derived from:
- A fixed library constant seed (default: 0x74696D696E67, “timing” in ASCII)
- A stable hash of configuration parameters
- All Monte Carlo RNG seeds (leak probability, floor constant, prior scale) MUST be similarly deterministic
- The Gibbs sampler RNG seed MUST be deterministic
- The chosen seeds SHOULD be reported in diagnostics
3.4 Bayesian Model
Section titled “3.4 Bayesian Model”The prior is a half-t distribution over the 1D W₁ distance, implemented via Gibbs sampling on the scale-mixture representation.
3.4.1 Latent Parameter
Section titled “3.4.1 Latent Parameter”The latent parameter is the true W₁ distance between timing distributions:
This is constrained to be non-negative (W₁ distances are always ≥ 0).
3.4.2 Likelihood (Robust t-likelihood via scale mixture)
Section titled “3.4.2 Likelihood (Robust t-likelihood via scale mixture)”The observed W₁ statistic may deviate from the Gaussian approximation when is underestimated. To prevent pathological posterior certainty under variance misestimation, implementations MUST use a robust likelihood with a scalar precision factor .
Marginally, this gives a univariate t-distribution:
Gamma parameterization: shape–rate. .
Degrees of freedom for likelihood ():
Implementations MUST use := 4.
Robustness mechanism: When is underestimated, is pulled downward and inflates uncertainty automatically. The parameter ν_ℓ = 4 matches the prior degrees of freedom, maintaining consistency in tail behavior between prior and likelihood.
Likelihood inflation warning:
Implementations SHOULD set Diagnostics.likelihood_inflated = true when , indicating the likelihood variance was effectively scaled up for robustness.
3.4.3 Prior (Half-t via scale mixture)
Section titled “3.4.3 Prior (Half-t via scale mixture)”The prior on is:
Implementations MUST implement inference using the equivalent hierarchical model:
where Gamma uses shape–rate parameterization, and half-𝒩 denotes a half-normal distribution (normal restricted to positive values).
Marginal prior variance:
For = 4:
This scalar variance is used as the prior variance reference in Gate 1 (§3.5.2).
3.4.4 Posterior Inference (Deterministic Gibbs Sampling)
Section titled “3.4.4 Posterior Inference (Deterministic Gibbs Sampling)”The posterior is approximated using a deterministic Gibbs sampler over .
Gibbs schedule (normative):
| Parameter | Value | Description |
|---|---|---|
| N_gibbs | 5000 | Total iterations |
| N_burn | 1000 | Burn-in (discarded) |
| N_keep | 4000 | Retained samples |
Initialization:
Iteration order:
For t = 1, …, N_gibbs:
- Sample (truncated normal, positive only)
- Sample
- Sample
The Gibbs conditionals and numerical implementation are detailed in the Implementation Guide.
Why 5000 iterations?
The 1D Gibbs sampler converges much faster than the previous 9D sampler, but we use more iterations to ensure high-quality posterior approximation with minimal Monte Carlo error. The computational cost is still 5-10× lower than the 9D approach due to simpler conditionals.
mixing diagnostics:
Implementations SHOULD set Diagnostics.lambda_mixing_ok = false when lambda_cv < 0.1 or lambda_ess < 20, indicating the sampler may not have converged.
3.4.5 Decision Functional and Leak Probability
Section titled “3.4.5 Decision Functional and Leak Probability”The decision functional is simply the W₁ distance itself:
The leak probability is:
Estimation via Gibbs samples:
Posterior summaries:
Implementations MUST compute:
- Posterior mean:
- Credible interval: 95% CI for from empirical quantiles of
Interpreting the probability:
This is a posterior probability, not a p-value. When we report “72% probability of a leak,” we mean: given the data and our model, 72% of the posterior mass corresponds to W₁ distances exceeding .
3.5 Adaptive Sampling Loop
Section titled “3.5 Adaptive Sampling Loop”The core innovation: collect samples until confident, with natural early stopping.
Verdict-blocking semantics:
Pass/Fail verdicts MUST be emitted only if all measurement quality gates pass. Otherwise the oracle MUST return Inconclusive.
This policy ensures: CI verdicts should almost never be confidently wrong.
3.5.1 Stopping Criteria
Section titled “3.5.1 Stopping Criteria”The adaptive loop terminates when any of these conditions is met:
- Pass:
leak_probability < pass_threshold(default 0.05) AND all quality gates pass - Fail:
leak_probability > fail_threshold(default 0.95) AND all quality gates pass - Inconclusive: Any quality gate fails OR budget exhausted without reaching decision threshold
Why adaptive sampling works for Bayesian inference:
Frequentist methods suffer from optional stopping: if you keep sampling until you get a significant result, you inflate your false positive rate.
Bayesian methods don’t have this problem. The posterior probability is valid regardless of when you stop; this is the likelihood principle. Your inference depends only on the data observed, not your sampling plan.
3.5.2 Quality Gates (Verdict-Blocking)
Section titled “3.5.2 Quality Gates (Verdict-Blocking)”Quality gates detect when data is too poor to reach a confident decision. When any gate triggers, the outcome MUST be Inconclusive.
Gate 1: Insufficient Information Gain
This gate detects when the data provides insufficient information relative to the prior, either because the posterior barely moved from the prior (data too noisy) or because the posterior stopped updating (not learning).
Implementations MUST compute the KL divergence between Gaussian surrogates of the prior and posterior. For the 1D case:
where is the posterior variance of (estimated from Gibbs samples) and is the marginal prior variance (§3.4.3).
Trigger conditions:
- KL < KL_min (default 0.7 nats) →
DataTooNoisy - Sum of recent inter-batch KL divergences < 0.001 for 5+ batches →
NotLearning
Gate 2: Would Take Too Long
Extrapolate time to decision based on current convergence rate. If projected time exceeds budget by a large margin (e.g., 10×), trigger Inconclusive with reason WouldTakeTooLong.
Gate 3: Resource Budget Exceeded
If elapsed time exceeds configured time budget → TimeBudgetExceeded
If total samples per class exceeds configured maximum → SampleBudgetExceeded
Gate 4: Condition Drift Detected
The covariance estimate is computed during calibration. If measurement conditions change during the adaptive loop, this estimate becomes invalid.
Detect condition drift by comparing measurement statistics from calibration against the full test run:
- Variance ratio:
- Autocorrelation change:
- Mean drift:
If variance ratio is outside [0.5, 2.0], or autocorrelation change exceeds 0.3, or mean drift exceeds 3.0, trigger Inconclusive with reason ConditionsChanged.
Note on threshold elevation: The case where > is handled by the threshold elevation decision rule (§3.3.3), not by a quality gate.
3.6 Discrete Timer Mode
Section titled “3.6 Discrete Timer Mode”When the timer has low resolution (e.g., Apple Silicon’s 41ns cntvct_el0), quantile estimation behaves differently due to tied values.
Trigger condition:
Discrete timer mode triggers when minimum uniqueness ratio is below 10%.
Mid-distribution quantiles:
Instead of standard quantile estimators, use mid-distribution quantiles which handle ties correctly. See the Implementation Guide.
Work in ticks internally:
In discrete mode, implementations SHOULD perform computations in ticks (timer’s native unit) and convert to nanoseconds only for display.
Gaussian approximation caveat:
The Gaussian likelihood is a rougher approximation with discrete data. Implementations MUST report a quality issue about discrete timer mode.
3.7 Calibration Validation Requirements
Section titled “3.7 Calibration Validation Requirements”The Bayesian approach requires empirical validation that posteriors are well-calibrated.
Null calibration test (normative requirement):
Implementations MUST provide a “fixed-vs-fixed” validation that measures end-to-end false positive rates:
- FPR_gated = P(Fail | H₀, all verdict-blocking gates pass)
Acceptance criteria:
| Metric | Target | Action if Exceeded |
|---|---|---|
| FPR_gated | 2-5% | ≤ 5% |
Large-effect detection tests:
Implementations MUST include validation tests ensuring the Student’s t prior correctly detects obvious leaks even when measurement noise is high.
4. Measurement Requirements
Section titled “4. Measurement Requirements”This section defines abstract requirements for measurement. For implementation details, see the Implementation Guide.
4.1 Timer Requirements
Section titled “4.1 Timer Requirements”Implementations MUST use a timer that:
- Is monotonic (never decreases)
- Has known resolution
- Reports results that can be converted to nanoseconds
Implementations SHOULD use the highest-resolution timer available on the platform.
4.2 Acquisition Stream Requirements
Section titled “4.2 Acquisition Stream Requirements”Measurements MUST be collected as an interleaved acquisition stream (see §3.3.1):
- Fixed and Random class measurements MUST be interleaved
- The interleaving order SHOULD be randomized
- The full acquisition stream (with class labels) MUST be preserved for bootstrap
4.3 Input Pre-generation
Section titled “4.3 Input Pre-generation”All inputs MUST be generated before the measurement loop begins. Generating inputs inside the timed region causes false positives.
4.4 Outlier Handling
Section titled “4.4 Outlier Handling”Implementations MUST cap (winsorize), not drop, outliers:
- Compute t_cap = 99.99th percentile from pooled data
- Cap samples exceeding t_cap
- Winsorization happens before quantile computation
Quality thresholds: >0.1% capped → warning; >1% → acceptable; >5% → TooNoisy.
4.5 Adaptive Batching
Section titled “4.5 Adaptive Batching”On platforms with coarse timer resolution, implementations SHOULD batch operations:
When batching is needed:
If pilot measurement shows fewer than 5 ticks per call, enable batching.
Batch size selection:
Effect scaling:
Reported effects MUST be divided by K to give per-operation estimates.
4.6 Measurability
Section titled “4.6 Measurability”If ticks per call < 5 even with maximum batching (K=20), implementations MUST return Unmeasurable.
4.7 Pre-flight Checks
Section titled “4.7 Pre-flight Checks”Implementations SHOULD perform pre-flight checks:
- Timer sanity: Verify monotonicity and reasonable resolution
- Harness sanity (fixed-vs-fixed): Detect test harness bugs
- Stationarity: Detect drift during measurement
5. API Design Principles
Section titled “5. API Design Principles”This section provides language-agnostic guidance for API design. These are recommendations (SHOULD) unless marked otherwise.
5.1 Input Specification
Section titled “5.1 Input Specification”Two-class pattern:
Implementations SHOULD expose the DudeCT two-class pattern:
- Baseline class: Fixed input (typically all zeros)
- Sample class: Variable input (typically random)
This pattern tests for data-dependent timing, not specific value comparisons.
5.2 Configuration Ergonomics
Section titled “5.2 Configuration Ergonomics”Attacker model presets as primary entry point:
The primary configuration entry point SHOULD be attacker model selection, not raw threshold values.
Sane defaults:
Default configuration SHOULD:
- Use
AdjacentNetworkattacker model (or equivalent) - Set time budget to 60 seconds
- Set sample budget to 1,000,000
- Set pass/fail thresholds to 0.05/0.95
5.3 Result Communication
Section titled “5.3 Result Communication”Leak probability prominence:
The leak probability MUST be prominently displayed in results and human-readable output.
Threshold transparency:
When > , implementations MUST clearly indicate this to the user.
Inconclusive guidance:
For Inconclusive outcomes, implementations MUST provide the reason and SHOULD provide actionable guidance.
6. Configuration Parameters
Section titled “6. Configuration Parameters”6.1 Required Parameters
Section titled “6.1 Required Parameters”| Parameter | Type | Description |
|---|---|---|
| attacker_model OR threshold_ns | AttackerModel or Float | Defines the effect threshold |
6.2 Optional Parameters
Section titled “6.2 Optional Parameters”| Parameter | Default | Description |
|---|---|---|
| time_budget | 60 seconds | Maximum test duration |
| max_samples | 1,000,000 | Maximum samples per class |
| pass_threshold | 0.05 | P(leak) below this → Pass |
| fail_threshold | 0.95 | P(leak) above this → Fail |
| calibration_samples | 5,000 | Samples for calibration phase |
| batch_size | 1,000 | Samples per adaptive batch |
| bootstrap_iterations | 2,000 | Bootstrap iterations for covariance |
Appendix A: Mathematical Notation
Section titled “Appendix A: Mathematical Notation”| Symbol | Meaning |
|---|---|
| {(c_t, y_t)} | Acquisition stream: class labels and timing measurements |
| T | Acquisition stream length (≈ 2n) |
| F, R | Per-class sample sets (filtered from stream) |
| Scalar: observed W₁ distance (in nanoseconds) | |
| Scalar: true (latent) W₁ distance (in nanoseconds) | |
| W₁ | Wasserstein-1 distance between timing distributions |
| Variance of W₁ estimator at sample size n | |
| Variance rate (scalar) | |
| Calibration long-run variance proxy: | |
| Integrated autocorrelation time (via Geyer IMS) | |
| n_eff | Effective sample size: |
| n_blocks | Number of independent blocks: where is block length |
| Half-t degrees of freedom (fixed at 4) | |
| Half-t prior scale (calibrated via exceedance target) | |
| Latent prior precision multiplier in scale-mixture representation | |
| Latent likelihood precision multiplier (robust t-likelihood) | |
| Likelihood degrees of freedom (fixed at 4) | |
| Univariate Student’s t with df, location 0, scale | |
| Half-t distribution (t restricted to positive values) | |
| Marginal prior variance: for | |
| Posterior variance of (estimated from Gibbs samples) | |
| Posterior mean of (from Gibbs samples) | |
| N_gibbs | Total Gibbs iterations (5000) |
| N_burn | Burn-in iterations (1000) |
| N_keep | Retained Gibbs samples (4000) |
| User-requested threshold | |
| Measurement floor (smallest resolvable effect) | |
| Floor-rate constant: | |
| Timer resolution component of floor | |
| Effective threshold used for inference | |
| Decision functional: (the W₁ distance itself) | |
| b̂ | Block length (Politis-White, on acquisition stream) |
| MDE | Minimum detectable effect |
| n | Samples per class |
| B | Bootstrap iterations |
| KL | KL divergence: KL(posterior ∥ prior) for Gate 1 |
| KL_min | Minimum KL threshold for conclusive verdict (default 0.7 nats) |
Appendix B: Normative Constants
Section titled “Appendix B: Normative Constants”These constants define conformant implementations. Implementations MAY use different values only where noted.
| Constant | Default | Normative | Rationale |
|---|---|---|---|
| Test statistic | W₁ distance | MUST | 1D scalar, optimal transport |
| Prior family | Half-t | MUST | Continuous scale adaptation, non-negative |
| Degrees of freedom () | 4 | MUST | Heavy tails + finite variance |
| Gamma parameterization | shape–rate | MUST | Avoid library ambiguity |
| Gibbs iterations (N_gibbs) | 5000 | MUST | High-quality posterior approximation |
| Gibbs burn-in (N_burn) | 1000 | MUST | Conservative convergence |
| Gibbs retained (N_keep) | 4000 | MUST | Low MC variance |
| Gibbs initialization (, ) | 1 | MUST | Prior means |
| Likelihood df () | 8 | MUST | Robustness to variance misestimation |
| Prior exceedance target () | 0.62 | SHOULD | Genuine uncertainty |
| Prior calibration MC draws | 50,000 | SHOULD | Stable calibration |
| Bootstrap iterations | 2,000 | SHOULD | Variance estimation accuracy |
| Monte Carlo samples (c_floor) | 50,000 | SHOULD | Floor-rate constant estimation |
| Batch size | 1,000 | SHOULD | Adaptive iteration granularity |
| Calibration samples | 5,000 | SHOULD | Initial variance estimation |
| Pass threshold | 0.05 | SHOULD | 95% confidence of no leak |
| Fail threshold | 0.95 | SHOULD | 95% confidence of leak |
| KL_min (nats) | 0.7 | MUST | Minimum information gain for conclusive verdict |
| Block length cap | min(3√T, T/3) | SHOULD | Prevent degenerate blocks |
| Discrete threshold | 10% unique | SHOULD | Trigger discrete mode |
| Min ticks per call | 5 | SHOULD | Measurability floor |
| Max batch size | 20 | SHOULD | Limit microarch artifacts |
| Default time budget | 60 s | MAY | Maximum runtime |
| Default sample budget | 1,000,000 | MAY | Maximum samples |
| Default RNG seed | 0x74696D696E67 | SHOULD | ”timing” in ASCII |
| Likelihood inflation threshold | 0.3 | SHOULD | triggering LikelihoodInflated |
Appendix C: References
Section titled “Appendix C: References”Statistical methodology:
-
Bishop, C. M. (2006). Pattern Recognition and Machine Learning, Ch. 3. Springer. (Bayesian linear regression)
-
Politis, D. N. & White, H. (2004). “Automatic Block-Length Selection for the Dependent Bootstrap.” Econometric Reviews 23(1):53–70.
-
Künsch, H. R. (1989). “The Jackknife and the Bootstrap for General Stationary Observations.” Annals of Statistics. (Block bootstrap)
-
Hyndman, R. J. & Fan, Y. (1996). “Sample quantiles in statistical packages.” The American Statistician 50(4):361–365.
-
Welford, B. P. (1962). “Note on a Method for Calculating Corrected Sums of Squares and Products.” Technometrics 4(3):419–420.
-
Gelman, A. et al. (2013). Bayesian Data Analysis, 3rd ed., Ch. 11-12. CRC Press. (Gibbs sampling, scale mixtures)
-
Lange, K. L., Little, R. J. A., & Taylor, J. M. G. (1989). “Robust Statistical Modeling Using the t Distribution.” JASA 84(408):881-896. (Student’s t for robustness)
Timing attacks:
-
Reparaz, O., Balasch, J., & Verbauwhede, I. (2016). “Dude, is my code constant time?” DATE. (DudeCT methodology)
-
Crosby, S. A., Wallach, D. S., & Riedi, R. H. (2009). “Opportunities and Limits of Remote Timing Attacks.” ACM TISSEC 12(3):17. (Exploitability thresholds)
-
Van Goethem, T., et al. (2020). “Timeless Timing Attacks.” USENIX Security. (HTTP/2 timing attacks)
-
Bernstein, D. J. et al. (2024). “KyberSlash.” (Timing vulnerability example)
-
Dunsche, M. et al. (2025). “SILENT: A New Lens on Statistics in Software Timing Side Channels.” arXiv:2504.19821. (Relevant hypotheses framework)
Existing tools:
- dudect (C): https://github.com/oreparaz/dudect
- dudect-bencher (Rust): https://github.com/rozbb/dudect-bencher
Appendix D: Changelog
Section titled “Appendix D: Changelog”v7.1 (from v7.0)
Section titled “v7.1 (from v7.0)”Statistical correctness refinements:
This release fixes inference semantics in the v7.0 W₁ implementation based on statistician review.
Core changes:
-
Inference uses raw W₁ (§3.1): Bayesian inference uses raw W₁ distance without debiasing or clamping. Debiased W₁ is computed only for display purposes to help users interpret effect magnitude above measurement noise.
-
Floor from null distribution (§3.3.3): Measurement floor is calibrated from the 95th percentile of null W₁ replicates (via within-class splits). This provides correct Type I error control under the null hypothesis.
-
Block count terminology (Appendix A): Variable replaces ambiguous in floor calculations. Effective sample size remains for IACT-based diagnostics.
-
Prior targets user threshold (§3.3.4): Half-t prior scale is calibrated so that . The prior encodes the user’s security threshold.
-
Robust likelihood (§3.4.2): Student-t likelihood degrees of freedom changed from to (matching prior ν = 4) for consistency. Robustness parameter guards against variance underestimation.
-
Tail directionality metric (§2.3):
tail_slow_sharemeasures fraction of tail deviation magnitude (p95+) from slowdowns, computed as over tail indices. Operates on quantile-aligned differences, not sample identities.
Migration notes:
- These changes affect statistical inference but not API surface
- Results may differ slightly from v7.0 due to corrected floor calibration and prior targeting
- No code changes required for users; test outcomes may be more conservative (fewer false positives)
v7.0 (from v6.0)
Section titled “v7.0 (from v6.0)”Migration from 9D quantile differences to 1D Wasserstein-1 distance:
This is a breaking change that fundamentally simplifies the statistical methodology while improving sensitivity and performance.
Core changes:
-
Test statistic (§3.1): Replace 9D quantile-difference vector with 1D W₁ (Wasserstein-1) distance
- W₁ measures the minimum cost to transform one distribution into another
- Naturally captures both uniform shifts and tail effects in a single scalar
- Debiased estimator: W₁_deb = max(0, W₁(baseline, sample) - θ_floor)
-
Prior (§3.4.3): Replace 9D multivariate t-distribution with 1D half-t distribution
- Half-t is appropriate for non-negative W₁ distances
- Implemented via scale mixture: λ ~ Gamma(ν/2, ν/2), δ | λ ~ half-𝒩(0, σ²/λ)
- Marginal prior variance: V₀^marginal = 2σ² (scalar, not matrix)
-
Variance estimation (§3.3.2): Replace 9×9 covariance matrix with scalar variance
- var_rate (scalar) instead of Σ_rate (matrix)
- Variance scaled by sample size: var_n = var_rate / n
- Block bootstrap still used to preserve temporal dependence
-
Bayesian inference (§3.4): Replace 9D Gibbs sampler with 1D Gibbs sampler
- Gibbs iterations increased from 256 to 5000 (1000 burn-in, 4000 retained)
- 1D sampler is much simpler but we use more iterations for low MC variance
- Decision functional simplified: m(δ) = δ (the W₁ distance itself)
- Posterior probability: P(δ > θ_eff | Δ)
-
Quality gate (§3.5.2): Simplified KL divergence formula for 1D case
- KL = ½(V_post/V₀ + μ_post²/V₀ - 1 + ln(V₀/V_post))
- Same threshold (KL_min = 0.7 nats) for data-too-noisy detection
-
EffectEstimate (§2.3): Updated to reflect W₁ as primary metric
- max_effect_ns: Posterior mean of W₁ distance
- credible_interval_ns: 95% CI for W₁
- tail_diagnostics: Decomposition into shift and tail components
- shift_ns: Median difference (uniform shift component)
- tail_ns: Tail-specific component (beyond shift)
- tail_share: Fraction of effect from tail [0-1]
- quantile_shifts: Per-quantile differences (p50, p90, p95, p99) for interpretation
- pattern_label: TailEffect | UniformShift | Mixed | Negligible
Expected benefits:
- 5-10× faster inference: 1D Gibbs sampler converges much faster than 9D, despite using more iterations
- Better tail detection: Optimal transport naturally emphasizes distributional differences
- Simpler interpretation: Single distance in nanoseconds vs. 9 correlated quantile differences
- Natural debiasing: W₁(F, F) = 0 exactly (unlike quantile differences which vary due to sampling)
Migration notes:
- Existing code using tacet v6.x will need to update result handling
- The top_quantiles field in EffectEstimate has been replaced with tail_diagnostics
- All posterior probabilities now refer to W₁ distance, not max quantile difference
- Mathematical notation updated in Appendix A to reflect 1D formulation
v6.0 (from v5.7)
Section titled “v6.0 (from v5.7)”Specification simplification and modularization:
-
Removed: Research mode as separate outcome type. When θ_user = 0, the oracle returns Inconclusive with posterior estimates. This removes ~50 lines and the ResearchStatus enum.
-
Removed: 2D projection for reporting (§3.4.6 in v5.7). The shift/tail decomposition and EffectPattern enum have been removed. EffectEstimate now reports max|δ|, 95% CI, and top quantiles by exceedance probability.
-
Removed: Exploitability classification (§2.4 in v5.7). This is now documented only in the user guide, not the statistical specification.
-
Simplified: κ robust likelihood diagnostics. Only
likelihood_inflated: Boolis exposed; detailed κ mixing diagnostics (kappa_mean, kappa_sd, kappa_cv, kappa_ess, kappa_mixing_ok) are internal implementation details. -
Consolidated: Quality gates from 6 to 4:
- Gates 1+2 → Insufficient Information Gain
- Gates 4+5 → Resource Budget Exceeded
- Gate 3 → Would Take Too Long
- Gate 6 → Conditions Changed
-
Consolidated: Issue codes from 30+ to 8 categories: DependenceHigh, PrecisionLow, DiscreteMode, ThresholdIssue, FilteringApplied, StationarityIssue, NumericalIssue, LikelihoodInflated.
-
Created: Separate Implementation Guide for algorithms (IACT, block bootstrap, numerical stability, Gibbs conditionals, quantile computation).
-
Created: Separate Power Module Specification for power/EM analysis.
-
Removed: Projection mismatch diagnostics and threshold calibration (no longer needed without 2D projection).
-
Removed: Dimension-aware regularization details (moved to Implementation Guide; timing uses d=9 which rarely triggers stressed regimes).
Rationale: These changes reduce specification complexity by ~40% while preserving all normative statistical requirements. The removed features were either:
- Reporting conveniences (2D projection, exploitability classification) that don’t affect inference
- Separate code paths for edge cases (Research mode) that can be handled by existing mechanisms
- Implementation details (algorithms, numerical procedures) better suited to a separate guide