Research.CuriousCoding
https://research.curiouscoding.nl/
Recent content on Research.CuriousCoding
Thu, 11 Aug 2022 00:00:00 +0200

Diamond optimisation for diagonal transition
https://research.curiouscoding.nl/posts/diamond-optimization/
Mon, 01 Aug 2022 00:00:00 +0200

Table of Contents Diamond transition or how technicalities can break concepts But let’s take a closer look Conclusion References

Diamond transition or how technicalities can break concepts

We assume the reader has some basic knowledge about pairwise alignment and in particular the WFA algorithm.
In this post we dive into a potential 2x speedup of WFA — one that turns out not to work.
Let’s take a look at one of the most important and efficient algorithms for pairwise alignment: WFA (Marco-Sola et al.

Bidirectional A*
https://research.curiouscoding.nl/notes/bidirectional-astar/
Thu, 28 Jul 2022 17:59:00 +0200

These are some links and papers on bidirectional A* variants. Nothing insightful at the moment.
small lecture introduces \(h_f(u) = \frac 12 (\pi_f(u) - \pi_r)\). Not found a paper yet.
An Improved Bidirectional Heuristic Search Algorithm (Champeaux 1977) introduces a bidirectional variant.
Bidirectional Heuristic Search Again (Champeaux 1983) fixes a bug in the above paper.
Efficient modified bidirectional A* algorithm for optimal route-finding. Didn’t read closely yet.
A new bidirectional algorithm for shortest paths (Pijls 2008). Actually a new method.

The BiWFA meeting condition
https://research.curiouscoding.nl/notes/biwfa-meeting-condition/
Mon, 11 Jul 2022 00:00:00 +0200

Table of Contents References

cross references: BiWFA GitHub issue
It seems that getting the meeting/overlap condition of BiWFA (Marco-Sola et al. (2022), Algorithm 1 and Lemma 2.1) correct is tricky.
Let \(p := \max(x, o+e)\) be the maximal cost of any edge in the edit graph. As in the BiWFA paper, let \(s_f\) and \(s_r\) be the distances of the forward and reverse fronts computed so far.
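As a small worked instance of the definition of \(p\), here is a sketch that assumes the usual gap-affine model: a mismatch edge costs \(x\), opening a gap costs \(o+e\), and extending a gap costs \(e\). The function name and the example penalties are illustrative only and are not taken from the note.

```python
def max_edge_cost(x: int, o: int, e: int) -> int:
    """Maximal cost of a single edge in a gap-affine edit graph.

    Assumed edge types: match (0), mismatch (x), gap open (o + e),
    gap extend (e). For non-negative penalties this equals max(x, o + e).
    """
    edge_costs = {
        "match": 0,
        "mismatch": x,
        "gap_open": o + e,
        "gap_extend": e,
    }
    return max(edge_costs.values())


# Illustrative penalties only.
p = max_edge_cost(x=4, o=6, e=2)
print(p)
```

With these example penalties the gap-open edge dominates; with a large mismatch cost, such as `max_edge_cost(10, 1, 1)`, the mismatch edge does.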
We prove the following lemma:

A* variants
https://research.curiouscoding.nl/notes/astar-variants/
Sun, 12 Jun 2022 12:04:00 +0200

These are some quick notes listing papers related to A* itself and variants. In particular, here I’m interested in papers that update \(h\) during the A* search, as a background for pruning.
Specifically, our version of pruning increases \(h\) during a single A* search, and in fact the heuristic becomes inadmissible after pruning.
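To make "inadmissible" concrete, the following minimal sketch checks the standard admissibility condition \(h(u) \leq d(u, t)\) on a toy graph. The graph, the heuristic values, and the function names are all made up for illustration. Raising \(h\) at one node only stands in for the effect an update to \(h\) might have; it is not the pruning rule these notes refer to.

```python
import heapq


def distances_to_target(graph, t):
    """Exact d(u, t) for every u that can reach t (Dijkstra on reversed edges)."""
    reverse = {}
    for u, edges in graph.items():
        for v, w in edges:
            reverse.setdefault(v, []).append((u, w))
    dist = {t: 0}
    queue = [(0, t)]
    while queue:
        d, v = heapq.heappop(queue)
        if d > dist.get(v, float("inf")):
            continue  # stale queue entry
        for u, w in reverse.get(v, []):
            if d + w < dist.get(u, float("inf")):
                dist[u] = d + w
                heapq.heappush(queue, (d + w, u))
    return dist


def inadmissible_states(graph, h, t):
    """States u where h(u) overestimates the true distance d(u, t)."""
    dist = distances_to_target(graph, t)
    return sorted(u for u, d in dist.items() if h(u) > d)


# Toy directed graph: node -> list of (neighbour, edge cost).
graph = {
    "s": [("a", 1), ("b", 4)],
    "a": [("b", 1), ("t", 5)],
    "b": [("t", 1)],
    "t": [],
}
h_before = {"s": 2, "a": 2, "b": 1, "t": 0}
h_after = dict(h_before, a=4)  # hypothetical update that raises h at "a"

print(inadmissible_states(graph, h_before.get, "t"))
print(inadmissible_states(graph, h_after.get, "t"))
```

Here the true distance from `a` to `t` is 2, so `h_before` is admissible everywhere, while `h_after` overestimates at `a`.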
Changing \(h\)

The original A* paper has a proof of optimality. Later papers also consider this for heuristics that change their value over time.

IGGSY 22 Slides
https://research.curiouscoding.nl/notes/iggsy-presentation-slides/
Sun, 12 Jun 2022 12:04:00 +0200

These are the slides Pesho Ivanov and I presented at IGGSY 2022 on Astarix and A*PA.
Drive: here
Pdf: here

Benchmark attention points
https://research.curiouscoding.nl/notes/benchmarks/
Thu, 28 Apr 2022 23:33:00 +0200

Pin CPU frequency

CPUs, especially in laptops, have turbo boost, (thermal) throttling, and powersave features. Make sure to pin the CPU core frequency low enough that it can be sustained for a long time without throttling. In my case, the `performance` governor can fix the CPU frequency. The base frequency of my CPU is 2.6GHz, but I set it slightly lower since I prefer consistency.
sudo cpupower frequency-set -g performance
sudo cpupower frequency-set -u 1.

Motivation
https://research.curiouscoding.nl/notes/motivation/
Thu, 28 Apr 2022 23:22:00 +0200

It’s not the need for faster software that motivates; it’s the mathematical discovery that needs sharing.

[WIP] Linear time pairwise alignment of random strings
https://research.curiouscoding.nl/notes/linear-time-pa/
Sun, 24 Apr 2022 00:00:00 +0200

Table of Contents Pairwise alignment in subquadratic time Random model Comparison Algorithm Counting-seeds heuristic Match pruning TODO Analysis References

This post is a work in progress [WIP]/sketch proof to show that pairwise alignment of random strings with random mutations can be done in linear time.
Pairwise alignment in subquadratic time

Backurs and Indyk (2018) show that computing edit distance cannot be done in strongly subquadratic time (i.e. \(O(n^{2-\delta})\) for any \(\delta >0\)) assuming the Strong Exponential Time Hypothesis.

Variations on the WFA recursion
https://research.curiouscoding.nl/posts/wfa-variations/
Sun, 17 Apr 2022 03:14:00 +0200

Table of Contents Gap open Gap close Symmetric alternatives Another symmetry Conclusions References

cross references: BiWFA GitHub issue
In this post I will explore some variations of the recursion used by WFA/BiWFA for the affine version of the diagonal transition algorithm. In particular, we will go over a gap-close variant, and look into some more symmetric formulations.
Gap open

WFA (Marco-Sola et al. 2020) introduces the affine cost variant of the classic diagonal transition method.

Publications
https://research.curiouscoding.nl/pages/publications/
Fri, 15 Apr 2022 00:00:00 +0200

(Groot Koerkamp and van der Wegen 2019) (Groot Koerkamp and Živný 2021)
Groot Koerkamp, Ragnar, and Marieke van der Wegen. 2019. “Stable gonality is computable.” Discrete Mathematics & Theoretical Computer Science vol. 21 no. 1, ICGT 2018 (June). https://doi.org/10.23638/DMTCS-21-1-10.
Groot Koerkamp, Ragnar, and Stanislav Živný. 2021. “On Rainbow-Free Colourings of Uniform Hypergraphs.” Theoretical Computer Science 885 (September): 69–76. https://doi.org/10.1016/j.tcs.2021.06.022.

Research topics
https://research.curiouscoding.nl/pages/todo/
Fri, 15 Apr 2022 00:00:00 +0200

Table of Contents In progress On hold Pending ideas/blogposts Smaller tasks Future plans Open questions

Here I list some ideas for research topics / papers / tasks that need doing:
In progress

A* pairwise aligner [GitHub]
Exact global pairwise alignment of random strings in expected linear time. Contains proof of correctness, implementation, evals and comparison with WFA and edlib on random data.
Proof of expected linear time alignment
I have a proof of concept to show that a simplified version of the algorithm currently implemented by A* pairwise aligner runs in expected linear time on random input with sufficiently low edit distance (\(|\Sigma|^{1/e} \ll n\)), but need to spend some time on details and writing it down.

Glossary
https://research.curiouscoding.nl/pages/glossary/
Thu, 14 Apr 2022 00:00:00 +0200

This is a growing list of ambiguous terms and their definitions. More of a place to store random remarks than a complete reference for now.
diagonal transition
name introduced by Navarro (2001)

approximate
approximate algorithm: an algorithm that does not always give the correct answer.
$k$-approximate string matching: a variant of semi-global alignment where we find all matches of a pattern in a reference with at most \(k\) mistakes.
Also approximate string matching: alternative name for global pairwise alignment.

A review of exact global pairwise alignment
https://research.curiouscoding.nl/posts/pairwise-alignment/
Fri, 01 Apr 2022 00:00:00 +0200

Table of Contents Variants of pairwise alignment Cost models Alignment types A chronological overview of global pairwise alignment Algorithms in detail Classic DP algorithms Cubic algorithm of Needleman and Wunsch (1970) A quadratic DP Local alignment Affine costs Minimizing vs. maximizing duality Four Russians method TODO \(O(ns)\) methods TODO Exponential search on band TODO LCS: thresholds, $k$-candidates and contours TODO Diagonal transition: furthest reaching and wavefronts TODO Suffixtree for \(O(n+s^2)\) expected runtime Using less memory Computing the score in linear space Divide-and-conquer TODO LCSk[++] algorithms Theoretical lower bound TODO A note on DP (toposort) vs Dijkstra vs A* TODO Tools TODO Notes for other posts Semi-global alignment papers Approximate pairwise aligners Old vs new papers References

This post explains the many variants of pairwise alignment, and covers papers defining and exploring the topic.

Pruning for A* heuristics
https://research.curiouscoding.nl/notes/pruning/
Sat, 11 Dec 2021 00:00:00 +0100

Note: this post extends the concept of multiple-path pruning presented in Poole, David L. and Mackworth, Alan K. (2017).
Say we’re running A* in a graph from \(s\) to \(t\). \(d(s,t)\) is the distance we are looking for.
An A* heuristic has to satisfy \(h(u) \leq d(u, t)\) to be admissible: the estimated distance to the end should never be larger than the actual distance to guarantee that the algorithm finds a shortest path.

AStarix
https://research.curiouscoding.nl/notes/astarix/
Fri, 12 Nov 2021 13:05:00 +0100

Papers
AStarix: Fast and Optimal Sequence-to-Graph Alignment
Fast and Optimal Sequence-to-Graph Alignment Guided by Seeds

AStarix is a method for aligning sequences (reads) to graphs:
Input
A reference sequence or graph
Alignment costs \((\Delta_{match}, \Delta_{subst}, \Delta_{del}, \Delta_{ins})\) for a match, substitution, deletion and insertion
Sequence(s) to align
Output
An optimal alignment of each input sequence

The input is a reference graph (automaton really) \(G_r = (V_r, E_r)\) with edges \(E_r \subseteq V_r\times V_r\times \Sigma\) that indicate the transitions between states.

Neighbor joining
https://research.curiouscoding.nl/notes/neighbor-joining/
Fri, 12 Nov 2021 11:57:00 +0100

Neighbor joining (NJ, paper) is a phylogeny reconstruction method. It differs from UPGMA in the way it computes the distances between clusters.
This algorithm first assumes that the phylogeny is a star graph. Then it finds the pair of vertices that, when merged and split out, gives the minimal total edge length \(S_{i,j}\) of the new almost-star graph. (See eq. (4) and figures 2a and 2b in the paper.)
\[ S_{i,j} = \frac1{2(n-2)} \sum_{k\not\in \{i,j\}}(d(i, k)+d(j,k)) + \frac 12 d(i,j)+\frac 1{n-2} \sum_{k<l,\, k, l\not\in\{i,j\}}d(k,l). \]

UPGMA
https://research.curiouscoding.nl/notes/upgma/
Thu, 28 Oct 2021 11:56:00 +0200

Unweighted pair group method with arithmetic mean (UPGMA) is a phylogeny reconstruction method.
Input
Matrix of pairwise distances
Output
Phylogeny
Algorithm
Repeatedly merge the nearest two clusters. The distance between clusters is the average of all pairwise distances between them. When merging two clusters, the distances of the new cluster are the weighted averages of distances from the two clusters being merged.
Complexity
\(O(n^3)\) naive, \(O(n^2 \ln n)\) using a heap.

RTFE
https://research.curiouscoding.nl/notes/rfte/
Fri, 22 Oct 2021 15:16:00 +0200

Read The F*ing Error
When you complain about an error without reading it first.
When you assume you understand the problem halfway through reading the error, and only after more debugging you realize you failed to read properly.

1st law of Procrastination
https://research.curiouscoding.nl/notes/procrastination/
Fri, 22 Oct 2021 11:46:00 +0200

Important deadlines require important procrastination.

Data should be reviewed
https://research.curiouscoding.nl/notes/data-should-be-reviewed/
Fri, 22 Oct 2021 11:41:00 +0200

Experiments and their analysis should be reproducible, and all data/figures in a paper should be reviewable. Pipelines (e.g. snakemake files) to generate them should be attached to the paper.
I’ve asked for automated scripts to reproduce test data on 3+ github repositories now, and got a satisfactory answer zero times:
WFA: https://github.com/smarco/WFA/issues/26
Link to a datadump on the block-aligner repository. Good to have actual data, but exactly how this data was created is unclear to me.

Spaced K-mer Seeded Distance
https://research.curiouscoding.nl/posts/spaced-kmer-distance/
Wed, 20 Oct 2021 00:00:00 +0200

Table of Contents Background $k$-mers Sketching MinHash Terminology Introduction Spaced $k$-mer Seeded Distance Improving performance Analysis Pruning false positive candidate matches Phylogeny reconstruction Running the algorithm TODO Assembly

\[ \newcommand{\vp}{\varphi} \newcommand{\A}{\mathcal A} \newcommand{\O}{\mathcal O} \newcommand{\N}{\mathbb N} \newcommand{\ed}{\mathrm{ed}} \newcommand{\mh}{\mathrm{mh}} \newcommand{\hash}{\mathrm{hash}} \]
Background

Quickly finding similar pieces of DNA within large datasets is at the core of computational biology. This has many applications:
Alignment: Given two pieces of related DNA, align them to find where mutations (i.

Open Science
https://research.curiouscoding.nl/posts/open-science/
Tue, 19 Oct 2021 00:00:00 +0200

Let’s go over some reasons why I’m writing this blog.
The internet is more accessible than papers

The inspiration for this blog is the post on Succinct de Bruijn Graphs by Alex Bowe. I think blog posts are a great way to quickly learn about new ideas and concepts, since they are usually more accessible than papers. A blog post can omit some of the more formal text required in papers and spend more time explaining things on an intuitive level.

Hugo and ox-hugo
https://research.curiouscoding.nl/notes/hugo/
Thu, 14 Oct 2021 00:00:00 +0200

Here’s the customary “how I made this site using X” post.
This site is built using Hugo and ox-hugo.
The source is written in Org mode, which is converted to markdown by ox-hugo. To get started yourself, check out the initial commit of the source repository and build from there.
Some notes:
I’m using the Hugo-coder theme.
Since the conversion from Org to markdown is done using an Emacs plugin, the emacs folder contains a simple init.

Hello, World!
https://research.curiouscoding.nl/notes/hello-world/
Wed, 13 Oct 2021 00:00:00 +0200

print("Hello, World!")
std::cout << "Hello, World!" << std::endl;

About
https://research.curiouscoding.nl/about/
Mon, 01 Jan 0001 00:00:00 +0000

Hi there ;) I’m doing a PhD in bioinformatics at the BMI lab at ETH Zurich. Currently I’m working on near-linear algorithms for exact pairwise alignment.
This blog is where I dump my thoughts on my PhD research. For now it includes some short notes/remarks/ideas for research, and a few longer posts that may eventually turn into papers.
Feel free to use this blog as inspiration and build on the ideas you see here, as long as you cite appropriately.

Readme
https://research.curiouscoding.nl/readme/
Mon, 01 Jan 0001 00:00:00 +0000

Research notes

This repository contains the source of my blog: https://research.curiouscoding.nl.
Feel free to comment on the code or create an issue if you see something off.
This blog is written in Org, converted to markdown by ox-hugo and built using Hugo.
License

All written text (i.e. everything rendered on my blog) is licensed under CC BY-SA 4.0.
The Hugo, ox-hugo, and Org mode related source code (everything in the initial commit) is licensed under MIT.