[WIP] Linear time pairwise alignment of random strings

This post is a work-in-progress (WIP) sketch of a proof that pairwise alignment of random strings with random mutations can be done in linear time.

Pairwise alignment in subquadratic time

Backurs and Indyk (2018) show that edit distance cannot be computed in strongly subquadratic time (i.e. \(O(n^{2-\delta})\) for any \(\delta >0\)), assuming the Strong Exponential Time Hypothesis.

To break this barrier, there are multiple approaches:

  1. Use algorithms with an output-sensitive runtime, such as \(O(ns)\) or \(O(n + s^2)\), where \(s\) is the edit distance; this is \(o(n^2)\) when \(s = o(n)\).

  2. Use approximation algorithms: these do not have to adhere to the \(n^2\) lower bound for exact algorithms.

  3. Assume (uniform) random input, which is what we will do here. There are a few possibilities:

    1. A given string, and a random string.
    2. A non-random/arbitrary string, and a random mutation of it.
    3. A random string, and a non-random/arbitrary/adversarial mutation of it.
    4. Two independent random strings.
    5. One random string, of which the other is a random mutation via some model.

Overview of these settings (numbers refer to the approaches and cases above):

|                                                   | given \(A\)              | random \(A\)                 |
|---------------------------------------------------|--------------------------|------------------------------|
| given \(B\)                                       | \(\omega(n^{2-\delta})\) | -                            |
| \(B\) given at distance \(s\)                     | 1: \(O(n+s^2)\)          | 3.3: ?                       |
| independent random \(B\)                          | 3.1: ?                   | 3.4: ?                       |
| \(B\) random, \(e\%\leq f(n)\) mutations of \(A\) | 3.2: ?                   | 3.5: \(O(n)\)[1], this post  |

We will make the strongest possible assumption on the randomness of the input: that \(A\) is random, and that \(B\) is a random mutation of \(A\) with a limited/bounded error rate \(e \leq f(n)\).

Random model

There are multiple possible random models. One decision to make is how many errors to introduce for a given error rate \(e\):

Fixed total
Use exactly \(e\cdot n\) errors.
Binomial total
Use \(Bin(n, e)\) errors.
Bernoulli per position
Loop over the positions, and for each character apply a mutation with probability \(e\).
Geometric per position
A position can be mutated multiple times. Use a geometric distribution with parameter \(p = 1/(1+e)\) for the number of mutations, so that each position has expected \(e\) mutations.
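
For concreteness, here is a small sketch of the four options for drawing the number of mutations. The function name and parameterization are only illustrative; the geometric parameter \(p = 1/(1+e)\) is taken from the description above.

```python
import random

def number_of_mutations(n, e, model, rng=random.Random(42)):
    """Sketch: draw the total number of mutations for a string of length n
    and error rate e, under the four models listed above."""
    if model == "fixed":             # exactly e*n errors
        return round(e * n)
    if model == "binomial":          # Bin(n, e) errors in total
        return sum(rng.random() < e for _ in range(n))
    if model == "bernoulli":         # one coin per position; the total is again
        # Bin(n, e), but each mutation is now tied to a specific position.
        return sum(rng.random() < e for _ in range(n))
    if model == "geometric":         # per position: number of failures before the
        # first success of a p = 1/(1+e) coin, with expectation (1-p)/p = e.
        p = 1 / (1 + e)
        total = 0
        for _ in range(n):
            while rng.random() >= p:
                total += 1
        return total
    raise ValueError(f"unknown model: {model}")
```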

Another choice is the order in which mutations are applied:

Iterated errors
For each mutation, first choose whether it will be an insertion, deletion, or substitution. Then choose a random position in the string, and for insertions and substitutions a replacement character. This way, mutations can affect earlier mutations.
Up-front
Generate all errors up-front, before applying any. This can work by generating a list of insert/delete/substitute instructions and then executing them. Substitute instructions always refer to characters of the original string (never to inserted characters), and inserted characters are never deleted. This seems to be similar to the indel channel model [TODO: read on this; citation].
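
To make the up-front model concrete, here is one possible reading of it as a Python sketch. The uniform choice of operation type and position, the alphabet, and the way conflicting instructions on the same position are resolved are assumptions, not fixed by the description above.

```python
import random

def mutate_up_front(A, num_errors, sigma="ACGT", rng=random.Random(42)):
    """Sketch of the up-front model: all instructions are generated against the
    original string A before any of them is applied.  Substitutions only hit
    original characters, and inserted characters are never deleted."""
    n = len(A)
    ops = []                                      # (position in A, kind, character)
    for _ in range(num_errors):
        kind = rng.choice(["ins", "del", "sub"])
        if kind == "ins":
            ops.append((rng.randrange(n + 1), "ins", rng.choice(sigma)))
        elif kind == "del":
            ops.append((rng.randrange(n), "del", None))
        else:                                     # identity substitutions are allowed here
            ops.append((rng.randrange(n), "sub", rng.choice(sigma)))

    subs = {}       # original position -> replacement character (last one wins)
    dels = set()    # deleted original positions (deletion wins over substitution)
    inss = {}       # original position -> characters inserted before it
    for pos, kind, ch in ops:
        if kind == "ins":
            inss.setdefault(pos, []).append(ch)
        elif kind == "del":
            dels.add(pos)
        else:
            subs[pos] = ch

    # Execute everything in a single left-to-right pass over A.
    B = []
    for i in range(n + 1):
        B.extend(inss.get(i, []))                 # insertions before position i
        if i < n and i not in dels:
            B.append(subs.get(i, A[i]))           # substituted or original character
    return "".join(B)
```

Keeping the original index \(i\) next to each emitted character directly gives the character mapping between \(A\) and \(B\) discussed below.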

A final choice is whether or not to allow identity substitutions that do not change the current character.

For any model, we have a mapping between the characters of \(A\) and \(B\). In particular for the up-front model, each character of \(A\) maps to some interval of \(B\), and each character of \(B\) maps to a single character of \(A\). These characters correspond to each other.

Comparison

The main difference between these methods is that the iterated-errors model slightly favours longer runs of inserts: once a character is inserted, the probability that a second character will be inserted in the same ‘run’ is slightly larger, because there are now two positions instead of just one.

The benefit of the up-front method is that it is easier to analyse the probability distribution of the number of errors in an interval, since the generated mutations do not depend on earlier mutations. I think this is the model of choice to analyse our probabilistic algorithm.

Algorithm

The algorithm runs an A* search from the start state \(s\) to the end state \(t\), using a heuristic \(h\). During the search, a match is pruned as soon as its start is expanded. This increases the value of \(h\) and thus makes the A* search more efficient.
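
As a minimal sketch (not A*PA itself), the search can look as follows, with the heuristic passed in as a function and match pruning only indicated by a comment. This simple version assumes the heuristic is consistent, which holds for the unpruned heuristic described next.

```python
import heapq

def astar_edit_distance(A, B, h):
    """Minimal A* sketch over the edit graph.

    States <i, j> count the characters of A and B consumed so far.  h(i, j)
    must be an admissible lower bound on the remaining distance; this simple
    version also assumes h is consistent.  Match pruning would hook into the
    expansion step below and is omitted here."""
    n, m = len(A), len(B)
    s, t = (0, 0), (n, m)
    g = {s: 0}
    queue = [(h(0, 0), s)]                 # entries are (f, state)
    expanded = set()

    while queue:
        f, (i, j) = heapq.heappop(queue)
        if (i, j) == t:
            return g[t]
        if (i, j) in expanded:
            continue
        expanded.add((i, j))
        # A real implementation would prune matches starting at <i, j> here.
        sub = 0 if i < n and j < m and A[i] == B[j] else 1
        for di, dj, cost in ((1, 1, sub), (1, 0, 1), (0, 1, 1)):
            ni, nj = i + di, j + dj
            if ni > n or nj > m:
                continue
            ng = g[(i, j)] + cost
            if ng < g.get((ni, nj), ng + 1):
                g[(ni, nj)] = ng
                heapq.heappush(queue, (ng + h(ni, nj), (ni, nj)))
    return None
```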

Counting-seeds heuristic

We use a simplified version of the chained-seeds heuristic used in A*PA, which is in fact much closer to the original seed heuristic of Ivanov, Bichsel, and Vechev (2021). We partition \(A\) into seeds of length \(k\), and find all matches for each seed. The heuristic at a state \(u\) is simply the number of remaining seeds for which no matches exist. This is an admissible heuristic, i.e. a lower bound on the remaining distance to \(t\), since all remaining seeds must be aligned, and each seed without a match will incur a cost of at least \(1\) for its alignment.
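
A direct, unoptimized sketch of this heuristic follows. Whether the seed currently being crossed still counts as ‘remaining’ is a boundary choice; here a seed counts as long as its start has not been passed.

```python
def counting_seeds_heuristic(A, B, k):
    """Sketch of the counting-seeds heuristic.

    A is split into consecutive seeds of length k.  A seed 'has a match' if it
    occurs exactly as a k-mer somewhere in B.  h(i, j) is the number of seeds
    starting at or after position i of A that have no match: each such seed
    must still be aligned and costs at least 1."""
    kmers_of_B = {B[j:j + k] for j in range(len(B) - k + 1)}
    seed_starts = range(0, len(A) - k + 1, k)
    unmatched = [0 if A[s:s + k] in kmers_of_B else 1 for s in seed_starts]
    # suffix[q] = number of unmatched seeds among seeds q, q+1, ...
    suffix = [0] * (len(unmatched) + 1)
    for q in reversed(range(len(unmatched))):
        suffix[q] = suffix[q + 1] + unmatched[q]

    def h(i, j=0):
        # Only the seed covering position i matters; j is ignored.
        q = min((i + k - 1) // k, len(unmatched))   # first seed starting at or after i
        return suffix[q]
    return h
```

Plugging the returned \(h\) into the A* sketch above gives the basic, pruning-free version of the algorithm.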

Note that the value of \(h(u)\) only depends on the position of the seed that ‘covers’ \(u\).

Match pruning

Once we expand the start of a match, we remove this match from consideration. If this is the last non-pruned match for a given seed, this means that \(h\) increases by \(1\) for all states left of \(u\). This can be implemented using e.g. a Fenwick tree: both queries and updates take \(O(\log (n/k))\) time this way.
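
A minimal Fenwick tree over seed indices, as one way to implement this; the integration sketched afterwards is an illustration, not A*PA's actual implementation.

```python
class Fenwick:
    """Standard binary indexed tree over seed indices 0..m-1, where m is the
    number of seeds.  Point updates and prefix sums both take O(log m)."""
    def __init__(self, m):
        self.tree = [0] * (m + 1)

    def add(self, q, delta=1):
        """Add delta at seed index q."""
        q += 1
        while q < len(self.tree):
            self.tree[q] += delta
            q += q & -q

    def prefix(self, q):
        """Sum over seed indices 0..q-1."""
        s = 0
        while q > 0:
            s += self.tree[q]
            q -= q & -q
        return s
```

When the last match of seed \(q\) is pruned, call `add(q)`. The pruning contribution to \(h\) at column \(i\) is then the number of newly match-less seeds at or after the seed covering \(i\), i.e. `prefix(m) - prefix(ceil(i/k))` with \(m = n/k\) seeds; both the update and the query take \(O(\log(n/k))\).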

TODO Analysis

We can choose \(k \geq \log_\Sigma n^2\) to ensure \(o(1)\) false-positive matches in expectation.
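
As a back-of-the-envelope version of this calculation (treating the compared \(k\)-mers as independent and uniformly random, and counting only exact matches): there are \(n/k\) seeds and at most \(n\) \(k\)-mers of \(B\), and two random \(k\)-mers coincide with probability \(\Sigma^{-k} \leq n^{-2}\) once \(k \geq \log_\Sigma n^2\), so the expected number of false-positive matches is at most
\[
\frac{n}{k} \cdot n \cdot \Sigma^{-k} \;\leq\; \frac{n^2}{k} \cdot \frac{1}{n^2} \;=\; \frac{1}{k} \;=\; o(1).
\]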

We make the following assumptions, and show that each is false only with exponentially small probability.

  • All matches are true positives: they indicate a substring of \(B\) that corresponds to a substring of \(A\).
  • The optimal alignment of \(A\) and \(B\) goes through all matches.
  • [There will probably be more.]

A* works by keeping a priority queue of states ordered by \(f = g + h\). Our \(h\langle i, j\rangle\) only depends on \(i\), and is non-increasing in \(i\). This means that \(f=g+h\) can only increase when \(g\) goes up by \(1\) (because of a mismatch or indel) while \(h\) stays the same. Thus, if it weren’t for pruning, at any point in time all states have either value \(F\) or \(F+1\), where \(F\) is the smallest \(f(u)\) over explored (non-expanded) states \(u\).

In our accounting, we will pay for the expansion of all states with \(f\)-value \(F\) as soon as \(F\) first reaches that value.

References

Backurs, Arturs, and Piotr Indyk. 2018. “Edit Distance Cannot Be Computed in Strongly Subquadratic Time (Unless SETH Is False).” SIAM Journal on Computing 47 (3): 1087–97. https://doi.org/10.1137/15m1053128.
Ivanov, Pesho, Benjamin Bichsel, and Martin Vechev. 2021. “Fast and Optimal Sequence-to-Graph Alignment Guided by Seeds.” bioRxiv. https://doi.org/10.1101/2021.11.05.467453.

  1. or \(O(n \log n)\)? ↩︎