
This page describes the statistical methodology used by tacet to detect timing side channels.

In plain English: We run your code thousands of times with two types of inputs (baseline and sample). If the timing distributions differ by more than your threshold, we report a leak. The key innovation is how we make that comparison:

  1. Compare distributions, not averages: A timing leak might only affect some runs (cache misses, branch mispredictions), so we compare the full timing distribution using multiple percentiles.

  2. Bayesian probability: Instead of a yes/no answer, we compute the probability that there’s a security-relevant leak. This lets us say “72% confident there’s a leak” rather than just “the t-statistic is 4.7.”

  3. Adaptive stopping: We keep collecting samples until we’re confident enough to decide. Clear cases stop early; borderline cases get more evidence.

tacet uses an adaptive two-phase approach:

          Timing Samples
  ┌──────────────┴──────────────┐
  ▼                             ▼
┌───────────────┐      ┌──────────────────┐
│ Phase 1:      │      │ Phase 2:         │
│ Calibration   │      │ Adaptive Loop    │
├───────────────┤      ├──────────────────┤
│ 5,000 samples │      │ Batch collection │
│ Covariance    │      │ Posterior update │
│ estimation    │      │ Decision check   │
│ Prior setup   │ ───> │ Quality gates    │
│               │      │                  │
│ Fixed cost    │      │ Stops when       │
│               │      │ P < 0.05 or      │
│               │      │ P > 0.95         │
└───────────────┘      └──────────────────┘

The calibration phase collects an initial batch of samples (default: 5,000 per class) to:

  1. Estimate noise covariance: Timing measurements are noisy. Block bootstrap estimates the covariance structure of this noise.

  2. Compute minimum detectable effect (MDE): Based on noise levels, what’s the smallest effect we could reliably detect?

  3. Set Bayesian priors: The prior distribution is centered on zero effect with spread determined by the MDE.
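As a sketch of the calibration outputs, the function below uses the standard power-analysis form for the MDE with α = 0.05 (two-sided) and 80% power; those constants and the function name are assumptions, since the text only says the MDE is derived from noise levels.

```python
def calibrate(sigma_hat, prior_no_leak=0.75):
    """Sketch of calibration, assuming a textbook power-analysis MDE.
    sigma_hat: bootstrap noise scale estimated from the calibration batch."""
    z_alpha, z_beta = 1.96, 0.8416        # normal quantiles: 95% conf., 80% power
    mde = (z_alpha + z_beta) * sigma_hat  # smallest reliably detectable effect
    # Prior over the true W1 distance: centered on zero effect, spread tied
    # to the MDE, with 75% prior mass on "no leak" by default (per the text).
    prior = {"center": 0.0, "scale": mde, "p_no_leak": prior_no_leak}
    return mde, prior
```

Noisier measurements give a larger MDE, which widens the prior and raises the bar for declaring a leak.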

After calibration, the adaptive loop collects samples in batches until a decision is reached:

  1. Collect batch: Gather N samples per class (default: 1,000)

  2. Update posterior: Combine new data with existing posterior using Bayesian updating

  3. Check decision thresholds: If P(leak) < 0.05, pass. If P(leak) > 0.95, fail.

  4. Apply quality gates: Check for conditions that would make continued sampling futile

  5. Repeat or stop: Either collect more samples or return a result

This adaptive approach:

  • Stops early on clear pass/fail cases (saves time)
  • Gathers more evidence when the signal is near the threshold
  • Returns Inconclusive when hitting resource limits or quality gates
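The five steps above can be sketched as a loop; the three callables are hypothetical placeholders for tacet's internals, not its real API.

```python
def run_adaptive(collect_batch, update_posterior, check_gates,
                 p_lo=0.05, p_hi=0.95, batch=1000):
    """Skeleton of the adaptive loop. collect_batch gathers samples,
    update_posterior folds them in and returns P(leak), check_gates
    returns a tripped gate name or None (all names are illustrative)."""
    while True:
        data = collect_batch(batch)       # 1. gather N samples per class
        p_leak = update_posterior(data)   # 2. Bayesian update -> P(leak)
        if p_leak < p_lo:                 # 3. confident there is no leak
            return "Pass"
        if p_leak > p_hi:                 # 3. confident there is a leak
            return "Fail"
        if check_gates() is not None:     # 4. futility / budget gates
            return "Inconclusive"
        # 5. otherwise loop and collect another batch
```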

Rather than comparing means (which can miss distributional differences), tacet uses the Wasserstein-1 distance (also known as the earth mover’s distance) to measure how different the timing distributions are:

$$W_1(\text{baseline}, \text{sample}) = \frac{1}{n} \sum_{i=1}^{n} \left| x_{\text{baseline}}^{(i)} - x_{\text{sample}}^{(i)} \right|$$

where $x_{\text{baseline}}^{(i)}$ and $x_{\text{sample}}^{(i)}$ are the $i$-th smallest values from each class.

Why W₁? The Wasserstein-1 distance captures the “cost” of transforming one distribution into another via optimal transport. Unlike a simple mean comparison, it’s sensitive to both location shifts and shape changes in the distribution.
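Concretely, the rank-matched W₁ is just the mean absolute difference of the sorted samples. A minimal sketch (the function name is ours, and equal class sizes are assumed):

```python
import numpy as np

def w1_distance(baseline, sample):
    """Wasserstein-1 distance between two equal-sized timing samples:
    mean absolute difference of the rank-matched (sorted) values."""
    b = np.sort(np.asarray(baseline, dtype=float))
    s = np.sort(np.asarray(sample, dtype=float))
    if b.shape != s.shape:
        raise ValueError("classes must have equal sample counts")
    return float(np.mean(np.abs(b - s)))
```

A uniform 1 ns shift between classes yields W₁ = 1 ns regardless of the distribution's shape, while a shift confined to the upper quantiles still contributes its share of the average.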

The W₁ distance can be decomposed into two interpretable components:

$$W_1 = |\text{shift}| + \text{tail}$$

where:

  • shift = median of the rank-matched differences $\{x_{\text{baseline}}^{(i)} - x_{\text{sample}}^{(i)}\}$
  • tail = mean of the centered residuals $\left| (x_{\text{baseline}}^{(i)} - x_{\text{sample}}^{(i)}) - \text{shift} \right|$

The tail share ratio distinguishes timing leak patterns:

$$\text{tail\_share} = \frac{\text{tail}}{|\text{shift}| + \text{tail}}$$

Tail Share   Pattern        Likely cause
< 0.20       UniformShift   Branch on secret data (all timings shift uniformly)
0.20–0.50    Mixed          Multiple leak sources (branches + cache effects)
> 0.50       TailEffect     Data-dependent cache behavior (upper quantiles affected more)

This decomposition is useful because:

  • A pure UniformShift suggests a conditional branch on secret data
  • A pure TailEffect suggests data-dependent cache misses or memory access patterns
  • Mixed patterns suggest multiple leak sources or complex microarchitectural effects
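The decomposition and classification above can be sketched as follows; the function name is illustrative, and the thresholds are the ones from the table.

```python
import numpy as np

def decompose_w1(baseline, sample):
    """Split the rank-matched differences into a median shift and a
    tail term, then classify the leak pattern by tail share."""
    d = np.sort(np.asarray(baseline, float)) - np.sort(np.asarray(sample, float))
    shift = float(np.median(d))                 # uniform component
    tail = float(np.mean(np.abs(d - shift)))    # spread around the shift
    total = abs(shift) + tail
    tail_share = tail / total if total > 0 else 0.0
    if tail_share < 0.20:
        pattern = "UniformShift"
    elif tail_share <= 0.50:
        pattern = "Mixed"
    else:
        pattern = "TailEffect"
    return shift, tail, tail_share, pattern
```

For example, a sample class that is uniformly 2 ns slower produces a pure shift with zero tail, which classifies as UniformShift.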

The core question: given observed timing data, what’s the probability that the timing difference exceeds your security threshold θ?

In plain English: We start by assuming there’s probably no leak (the prior). As we collect data, we update our belief. If the data strongly suggests a difference, our confidence in a leak increases. If not, it decreases. The final number (72%, 3%, etc.) represents how confident we are that there’s a security-relevant leak.

We start by assuming “probably no leak”:

  • The prior is centered at zero effect
  • Its spread is determined by how noisy our measurements are (the MDE)
  • Default: 75% prior probability of no leak

Each batch of samples updates our belief. The likelihood tells us: “How probable is this data if the true W₁ distance is δ?” We model the W₁ distance using a Student’s t distribution with robust variance estimation to handle measurement uncertainty.

After each batch, we combine our prior belief with the new evidence using Bayes’ theorem:

$$p(\delta \mid \text{data}) \propto p(\text{data} \mid \delta) \cdot p(\delta)$$

This gives us an updated probability distribution over the true W₁ distance $\delta$.

We integrate the posterior to answer: “What’s the probability that the W₁ distance exceeds the security threshold θ?”

$$P(\text{leak}) = P(\delta > \theta \mid \text{all data})$$

Probability              Decision
P(leak) < 0.05           Pass: confident there’s no exploitable leak
P(leak) > 0.95           Fail: confident there’s a leak
0.05 ≤ P(leak) ≤ 0.95    Continue collecting or Inconclusive
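The threshold probability can be sketched with a simple grid approximation. Two simplifications relative to the text, both deliberate: the prior below is a plain half-normal with MDE scale (the actual prior also puts 75% mass on "no leak"), and the likelihood is Gaussian rather than the Student's t described above. The function name is ours.

```python
import numpy as np

def p_leak(w1_obs, sigma_hat, theta, mde, grid_n=2001):
    """Grid approximation of P(delta > theta | data) for delta >= 0.
    w1_obs: observed W1; sigma_hat: bootstrap noise scale;
    theta: security threshold; mde: prior scale from calibration."""
    delta = np.linspace(0.0, max(theta, w1_obs) + 6 * sigma_hat, grid_n)
    prior = np.exp(-0.5 * (delta / mde) ** 2)               # centered on zero
    likelihood = np.exp(-0.5 * ((w1_obs - delta) / sigma_hat) ** 2)
    posterior = prior * likelihood
    posterior /= posterior.sum()                            # normalize on grid
    return float(posterior[delta > theta].sum())
```

An observed W₁ far above θ relative to the noise drives the probability toward 1 (Fail), while an observation near zero drives it toward 0 (Pass).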

This is different from DudeCT’s t-test approach. DudeCT asks “is there any statistical difference?” while tacet asks “is there a security-relevant difference above my threshold $\theta$?”

The adaptive loop includes five quality gates that can stop sampling early:

Gate                   Triggers when                             Outcome
DataTooNoisy           Posterior ≈ prior after calibration       Inconclusive
NotLearning            KL divergence collapsed for 5+ batches    Inconclusive
WouldTakeTooLong       Projected time exceeds budget             Inconclusive
TimeBudgetExceeded     Wall-clock time exceeded                  Inconclusive
SampleBudgetExceeded   Sample limit reached                      Inconclusive

These gates prevent wasting time on tests that won’t converge.
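The gate logic can be sketched as a single check run after each batch. Every threshold value below (time budget, sample budget, KL floor, stall count) is an illustrative default, not tacet's, and the function name is ours.

```python
def check_gates(posterior_near_prior, kl_history, elapsed_s, projected_s,
                samples, time_budget_s=300.0, sample_budget=1_000_000,
                kl_floor=1e-4, stall_batches=5):
    """Return the name of the tripped quality gate, or None to keep sampling.
    kl_history: per-batch KL divergence between successive posteriors."""
    if posterior_near_prior:                      # data carries no information
        return "DataTooNoisy"
    recent = kl_history[-stall_batches:]
    if len(recent) == stall_batches and all(kl < kl_floor for kl in recent):
        return "NotLearning"                      # posterior stopped moving
    if projected_s > time_budget_s:
        return "WouldTakeTooLong"                 # convergence beyond budget
    if elapsed_s > time_budget_s:
        return "TimeBudgetExceeded"
    if samples >= sample_budget:
        return "SampleBudgetExceeded"
    return None
```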

On platforms with coarse timer resolution (e.g., Apple Silicon’s ~42ns counter), single-operation measurements are quantized. Adaptive batching compensates:

  1. Pilot measurement: Measure ~100 warmup iterations to estimate operation time $t_{\text{op}}$

  2. $K$ selection: Choose batch size $K$ to achieve ≥50 timer ticks per measurement:

$$K = \max\!\left(1, \left\lceil \frac{50 \cdot r_{\text{timer}}}{t_{\text{op}}} \right\rceil\right)$$

where $r_{\text{timer}}$ is the timer resolution in nanoseconds.

  3. Bound $K$: Cap at $K = 20$ to avoid microarchitectural artifacts

  4. Batch measurement: Measure $K$ iterations together, analyze batch totals

Batching is disabled when:

  • Timer resolution is already fine (>5 ticks per call)
  • Using cycle-accurate timers (kperf, perf_event)
  • The operation is slow enough that timer quantization is negligible
  • Running on x86_64 (rdtsc provides cycle-accurate timing)
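Putting the selection rule and the skip condition together (the function name is ours; the 50-tick target, the >5-tick skip, and the K = 20 cap are the values from the text):

```python
import math

def batch_size(t_op_ns, timer_res_ns, target_ticks=50, k_max=20):
    """Choose batch size K so one measurement spans >= target_ticks
    timer ticks; skip batching when one call already spans >5 ticks."""
    if t_op_ns > 5 * timer_res_ns:        # resolution already fine
        return 1
    k = max(1, math.ceil(target_ticks * timer_res_ns / t_op_ns))
    return min(k, k_max)                  # cap against uarch artifacts
```

On Apple Silicon's ~42 ns counter, a 100 ns operation would need K = 21 to reach 50 ticks, so the cap kicks in and K = 20 is used.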

Variance estimation uses circular block bootstrap to account for potential autocorrelation in timing data:

  1. Divide samples into overlapping blocks of length $\ell \approx n^{1/3}$
  2. Resample blocks with replacement
  3. Compute W₁ distance on resampled data
  4. Repeat $B = 2{,}000$ times to estimate the variance $\hat{\sigma}^2$

Block bootstrap handles short-range dependencies that might arise from CPU state, cache effects, or measurement overhead.
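The four steps above can be sketched as follows; the function name is ours, and resampling each class with its own block starts is a simplifying assumption.

```python
import numpy as np

def block_bootstrap_var(baseline, sample, B=2000, seed=0):
    """Circular block bootstrap estimate of Var(W1): resample overlapping
    blocks of length ~ n^(1/3), wrapping at the array end, recompute the
    rank-matched W1, and take the variance of the replicates."""
    rng = np.random.default_rng(seed)
    b = np.asarray(baseline, float)
    s = np.asarray(sample, float)
    n = len(b)
    ell = max(1, round(n ** (1 / 3)))     # block length ~ n^(1/3)
    n_blocks = -(-n // ell)               # enough blocks to cover n samples
    offsets = np.arange(ell)

    def resample(x):
        starts = rng.integers(0, n, size=n_blocks)
        idx = ((starts[:, None] + offsets[None, :]) % n).ravel()[:n]
        return x[idx]                     # modulo makes the blocks circular

    reps = np.empty(B)
    for r in range(B):
        reps[r] = np.mean(np.abs(np.sort(resample(b)) - np.sort(resample(s))))
    return float(reps.var(ddof=1))
```

Because whole blocks are kept contiguous, short-range autocorrelation within a block survives resampling and is reflected in the variance estimate.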

For the complete mathematical specification, see the Specification.

Academic references:

  • Reparaz, O., Balasch, J., & Verbauwhede, I. (2016). “Dude, is my code constant time?” DATE. (Original DudeCT methodology)
  • Van Goethem, T., et al. (2020). “Timeless Timing Attacks.” USENIX Security. (HTTP/2 timing attacks)
  • Crosby, S. A., Wallach, D. S., & Riedi, R. H. (2009). “Opportunities and limits of remote timing attacks.” ACM TISSEC. (Effect size to exploitability mapping)