Research topics

Here I list some ideas for research topics / papers / tasks that need doing:

In progress

A* pairwise aligner [GitHub]
Exact global pairwise alignment of random strings in expected linear time.

Contains proof of correctness, implementation, evals and comparison with WFA and edlib on random data.

Proof of expected linear time alignment
I have a proof of concept to show that a simplified version of the algorithm currently implemented by A* pairwise aligner runs in expected linear time on random input with sufficiently low edit distance (\(|\Sigma|^{1/e} \ll n\)), but need to spend some time on details and writing it down. WIP post: linear time pairwise alignment.
Finish review post
Write text on diagonal transition and fill in other remaining TODO sections.
Post comparing heuristics
A comparison of the different heuristics that have been used so far.

On hold

Spaced $k$mer similarity
See this post. Not as interesting as the A*PA work.

Pending ideas/blogposts

Some ideas I have, that each deserve at least a blogpost once I find the time.

Low memory WFA backtracking
Instead of storing furthest reaching points for all wavefronts, it is sufficient to only store critical points where paths split/merge. This should lower memory usage of WFA to close to linear, without needing BiWFA.
NW/Diagonal transition with exponential search and heuristics
Think about merging NW/Diagonal transition with exponential search and heuristics. When doing exponential search, we can use a heuristic to limit the band/number of diagonals to process. If we are given the actual distance, this may lead to a linear time algorithm to prove that it is indeed the correct distance.

TODO: Figure out how to use pruning heuristics in this case – so far this only works well for non-pruning/static heuristics.

Match score for asymmetric and dual affine costs
As mentioned in the WFAlm paper by Eizenga and Paten, a match score can be handled in the cost-only setting by instead increasing the cost of mismatches and extends. This idea generalizes to asymmetric costs and dual affine costs. Further, it can be modified to not have half-integer costs: instead of effectively adding cost \(l/2\) per processed character, we can add cost \(l\) each time a character of \(A\) is processed (that is, for mismatches and deletion-extend edges). By adding it in the direction of the shorter sequence, the final score will be lower. Some investigation is needed to see whether adding score \(l\) to characters of \(A\) or \(B\) actually makes a significant difference.
Homopolymer affine layers
We could make affine layers with a special (lower) cost that only allows homopolymer inserts/deletes. From first experiments that such a cost is not necessarily well-defined, so further evaluations are needed.
A*PA post
A post on the A*PA paper.
  • Additionally, there can be a post on how it extends/is similar to many other existing ideas/algorithms/papers.
Semi-global pairwise alignment review
similar to the global pairwise alignment review, but for semi-global / $k$-approximate string matches.
Finding (in)exact matches
This turns out to be hard to do fast. For exact matches the fastest is a hashmap mapping qgrams to locations. Instead of an absl::flat_hash_map, it may be better to handroll something and ignore/bail on collisions.

For inexact matches, everything is more complicated, but again the fastest seems to be a hashmap and looking up all mutations of a string.

A simple optimization here is to build a map over \(A\), which contains only \(n/k\) seeds, and do lookups for all \(n\) $k$mers in \(B\).

Smaller tasks

Extract A*PA benchmarks
simple coding task Extract the evals directory to a separate reusable repo with benchmarks and tests.
Find ONT reads corresponding to a reference
annoying It would be nice to test A*PA on a set of real ONT reads. We need to find a dataset where the reads are from exactly the same genome as the reference, so that no biological mutations (in particular long indels) are present in the data.
Test/bench datasets
It would be nice to have more testcases than just random ones. In particular, the following would be nice:
  • random strings with multiple long indels
  • Random with non-uniform mutations (only inserts, only deletions, …)
  • Random with different length distribution on indel length, so they are longer than the default geometric distribution.

Future plans

Exact global pairwise alignment review
Finish that post, and convert the post into a paper, once it’s done.
Extend A*PA to build on diagonal-transition
needs work Currently the algorithm is based on the vanilla \(O(nm)\) DP. Instead we can base it on the diagonal transition methods to reduce the number of states visited and the memory needed to store \(g\).

This should provide a speedup especially in regions where the linear search falls back to quadratic behaviour.

More A*PA extensions
  • Ends-free/semi-global alignment: easy I know how this would work and just needs doing.
  • Affine costs: tricky should be possible, but harder. Will be very tricky to get right (bug-free).
  • Replace gap-cost transition by letter-count-cost transition: hard very unclear how this would work, and whether the transformation can be preserved.
Review paper on semi-global pairwise alignment
low priority lots of work/time Similar to the table I made for global exact pairwise alignment, but for semi-global/mapping. There are a lot of papers in this area. Navarro (2001) also does this with a focus on $k$-approximate string matching, but it quite old by now.

Open questions

  • Can WFA/diagonal transition benefit from bit-parallel techniques? (Likely answer: No.)
  • [unrelated] Given a function \(f : \Sigma^k \to \{0,1\}\) on $k$-mers. How often do you expect this to change value when computing it for all $k$-mers of a length \(2k\) string. Assume that \(f\) has some structure (so that its values correlate for similar strings), but is mostly independent (?) for unrelated strings, i.e. something similar to the sign of Tensor Sketching, or e.g. whether the number of zeros or ones in the $k$mer is larger.