Note: This is a bloggification of these IWBDA 2019 slides on how path-finding tricks used in Google Maps and strategy games can also help build long DNA molecules.



Story time! The Edinburgh Genome Foundry, where I worked for a few years, is a platform that sells custom synthetic DNA. Customers email long sequences (ATGCTAC..., typically 10,000 nucleotides or longer), which are then synthesized on the robotic setup below by progressively assembling smaller DNA fragments into bigger ones.

The platform can assemble thousands of DNA fragments per day, assuming you can channel enough projects to feed the beast. And the first step is to remove barriers for the customers. How do you enable biologists worldwide to order DNA constructs as easily as they would order dinner or a plane trip?

One of our first software projects, a collaboration with Autodesk, was a customer portal where users could design DNA sequences and order them from different foundries:

The website was sunset by Autodesk before it could become the Skyscanner of large DNA, and while it is open-source, we never had the infrastructure or the bandwidth to maintain it. Customers would just have to design their sequences with classic applications such as Benchling, SnapGene, or more likely Microsoft Word (no kidding).

But the quote generation problem, or how to automatically analyse a customer’s DNA sequence and determine the cheapest, fastest way our foundry could build it, remained a hot topic. Manually assessing and planning a large DNA construction project to the point where you can deliver a quote can take weeks for the most complex sequences, and some customers won’t have the patience. This is how the DNA Weaver project was born. The next sections describe how it represents and solves different DNA construction problems.

From parts, from scratch, from snatch

How you build long DNA sequences depends on your preferred DNA assembly techniques, which DNA fragment providers you trust, what DNA fragments you may already have in stock, what organisms you could snatch DNA from, and how cheap and fast you want the whole project to be.

Parts assembly projects

The easiest scenario is when customers request DNA sequences that can be assembled from standard genetic parts. These are DNA fragments that are often easy to find (laboratories near you may have them in their fridge if you don’t already), easy to replenish (via miniprep), and designed to easily clip together into a construct. Sometimes this construct is in turn used as a part in a subsequent, higher order construct, and so on until you obtain the desired final sequence.

Standard parts assembly is cheap, reliable, and fast (around one day per assembly), and the planning consists mainly of finding the best order in which to assemble the parts, which can be aided by software solutions like RavenCAD:

Synthesis from scratch

DNA sequences that are completely novel can be assembled from the bottom up. You start from oligos, small DNA molecules of 40-100 nucleotides synthesized one nucleotide at a time. From these oligos you assemble bigger fragments (around 1,000 nucleotides), then fuse these fragments into even bigger blocks, and so on.

Whole synthetic genomes of hundreds of thousands (and soon millions) of nucleotides are built this way. Each synthesis project has its particularities, and implements custom software to plan the thousands of assembly operations required (which are then performed either by automated biofoundries or by armies of students and postdocs).

Here is a fully synthetic and re-designed Caulobacter crescentus genome by Christen et al. Note that they didn’t start from printed oligos, and instead ordered 1,000-nucleotide blocks from a company specializing in DNA printing:

Bit-of-both assembly projects

Some sequences are assembled neither entirely from parts, nor entirely from scratch, but rather from a mix of DNA reuse and de-novo synthesis. A typical example is the assembly of pre-existing genetic parts from a standard library, together with a gene that would be synthesized by one of our external vendors:

Some vendors are cheap (count $100 for a 1000-nucleotide gene) but will only accept “easy” sequences; others will be up to twice as expensive, but will really try and synthesize anything you throw at them.

All vendors have limits on the sequence size they’ll accept, so the longest sequences must be ordered as multiple smaller fragments (sometimes from different providers) that are then fused together, possibly at the same time as the other parts of the desired sequence:

It’s starting to look like an interesting computational problem! Let’s throw in one more popular cloning technique.

If the gene you want is from an organism you have in your freezer (for instance E. coli bacteria), you can easily extract it via PCR, by ordering two small oligos that match its starting and ending sequences and will guide an enzyme to create copies of it from the organism’s chromosome:

This method can get you 4,000-nucleotide genetic components for just a few dollars and a day of work! But what if the sequence naturally present in E. coli is only almost the sequence you want, with two different nucleotides?

In that case you would use a protocol called site-directed mutagenesis, where you order 6 oligos and perform 3 separate PCRs to create 3 fragments with slightly altered sequences, and fuse everything together:

Selecting a stitch

Not only can DNA fragments come from many different sources, they can also be fused together in different ways. Here are schematics of Golden Gate Assembly, Gibson Assembly, and LCR assembly, to show only a few:

These methods vary in their capabilities, for instance how many sequence fragments they can assemble at once, which sequence patterns will cause problems (repeats, secondary structures, homologies, etc.), which fragment and oligo sequences must be ordered to perform an assembly, and so on.

So yeah, it’s complicated

The multitude of possible sources and assembly methods can make the planning of large DNA assemblies a hard combinatorial problem. I’ve seen meetings for complex projects where everyone with DNA wisdom is gathered in the same room, the requested sequence(s) are put on display, and it goes a bit like this:

Even with many experts in the room, errors can be made, and workload estimations can be wrong. So how do you replace this meeting by software?

Step 1: representing the problem

The best way I found to represent DNA assembly problems is supply networks, which show the flow of DNA fragments from the original sources and providers to the final construct. Here is the supply network for “I can build DNA constructs via Golden Gate assembly of fragments either bought from vendor CheapDNA, or assembled from oligos”:

While DNA fragments flow top to bottom in the network, DNA requests flow bottom to top: the Golden Gate assembly station asks CheapDNA and Oligo Assembly “I need this fragment, which of you can produce it and for what price? I’ll go with the cheapest.”, and both providers come back with their best offer. CheapDNA determines the price based on a pricing policy (for instance 10c per basepair), while Oligo Assembly first asks its own provider Oligos.com for oligo prices, before adding the cost of the oligo fusion operation.
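To make the exchange concrete, here is a minimal sketch of that quoting logic in Python, assuming simple per-basepair pricing; the class names (CommercialVendor, OligoAssemblyStation) and numbers are hypothetical, not DNA Weaver’s actual API:

class CommercialVendor:
    """A DNA source that prices fragments with a simple per-basepair policy."""

    def __init__(self, name, price_per_basepair):
        self.name = name
        self.price_per_basepair = price_per_basepair

    def quote(self, sequence):
        return self.price_per_basepair * len(sequence)

class OligoAssemblyStation:
    """A source that builds fragments from oligos bought from its own supplier."""

    def __init__(self, name, oligo_vendor, assembly_cost=20.0, oligo_size=50):
        self.name = name
        self.oligo_vendor = oligo_vendor
        self.assembly_cost = assembly_cost
        self.oligo_size = oligo_size

    def quote(self, sequence):
        # Ask the oligo provider for every oligo, then add the cost of the fusion operation.
        oligos = [
            sequence[i:i + self.oligo_size]
            for i in range(0, len(sequence), self.oligo_size)
        ]
        return sum(self.oligo_vendor.quote(oligo) for oligo in oligos) + self.assembly_cost

# The assembly station simply asks every source and keeps the best offer:
cheap_dna = CommercialVendor("CheapDNA", price_per_basepair=0.10)
oligos_com = CommercialVendor("Oligos.com", price_per_basepair=0.05)
oligo_assembly = OligoAssemblyStation("OligoAssembly", oligo_vendor=oligos_com)
best_source = min([cheap_dna, oligo_assembly], key=lambda source: source.quote("ATGC" * 250))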

Here are some more detailed schemas of how different sources of DNA may respond to a fragment request:

Supply networks are generally modeled using a Python script (example), but I also made a toy web application (ever a toy) that lets you build a supply network and submit a sequence to build:

Here are some screenshots of the app’s output, showing a suggested assembly plan and a shopping list of parts and oligos to buy (part sequences are in a separate spreadsheet):

The main advantage of supply networks is how easily they can represent common scenarios. Here is “I assemble DNA via Gibson Assembly and I have 3 vendors to buy fragments from”.

With the schema above, each DNA fragment may come from a different vendor, which is not ideal (it’s a lot of paperwork!). Here is a variation meaning “I have 3 vendors to choose from, but all fragments will come from the vendor which can deliver all fragments for the lowest total price”.
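As a tiny illustration of that rule (the vendor names and quote format below are made up), the comparator only considers vendors able to quote every fragment, and picks the one with the lowest total:

def best_single_vendor(quotes):
    """Pick the one vendor able to deliver every fragment, at the lowest total price.

    `quotes` maps vendor name -> {fragment_id: price, or None if the vendor refuses it}.
    """
    totals = {
        vendor: sum(prices.values())
        for vendor, prices in quotes.items()
        if all(price is not None for price in prices.values())
    }
    return min(totals, key=totals.get)

# VendorB wins despite being pricier on frag1, because VendorA refuses frag2:
best_single_vendor({
    "VendorA": {"frag1": 80, "frag2": None},
    "VendorB": {"frag1": 100, "frag2": 120},
    "VendorC": {"frag1": 110, "frag2": 130},
})  # -> "VendorB"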

Here is “I assemble long DNA molecules from scratch in several steps, starting with oligos I order.”

Here is the supply network for site-directed mutagenesis, described earlier:

Finally, here is “I use yeast recombination to assemble big fragments, which I obtain via E. coli PCR, or by Golden Gate or Gibson Assembly of smaller fragments obtained from DeluxeDNA, CheapDNA, or oligo assembly”. You get the idea.

So it is possible to represent the most common DNA assembly problems via networks of DNA suppliers following mostly simple rules. The only nodes of this network that require complex internal logic are the assembly stations, which have the central responsibility of decomposing the requested sequence into fragments that will be as cheap as possible for their sources to produce. How do they do it?

Step 2: finding the right cuts

Given a set of DNA sources (vendors, libraries, PCR extraction…) and a given sequence to assemble, which sequence fragments should be ordered, and from which sources? After all, there are trillions of ways of decomposing a sequence into fragments:

There have been many approaches to this problem, using deterministic algorithms, genetic algorithms, etc., but these approaches generally explore only a small fraction of the possibilities. The most efficient solution I found consists in representing the question as a graph shortest-path problem. First, take a segment of the sequence, and add any necessary flanking sequences (which, depending on the assembly method used, could be homologies with neighboring segments, enzyme restriction sites, etc.). You obtain the DNA sequence to order for that segment:

Then submit this fragment to all the assembly station’s suppliers. Some will respond that they cannot provide the sequence, and some will respond with a price. Select the best price and mark the segment:

Do the same for every subsegment of the sequence, to obtain a decomposition graph:

In that graph, find the shortest path from the first to the last nucleotide: it corresponds to the optimal sequence decomposition into contiguous segments. All you need at this point is to remember the best supplier for each segment, and order the fragments from them:

This method is very easy to code, as most graph libraries implement the Dijkstra shortest-path algorithm. Here is a minimal example in Python (using the Networkx framework) where a sequence is decomposed so that fragments will be either purchased (at 10c per nucleotide) or obtained from the fridge for free:

import networkx as nx

def decompose_sequence(sequence, free_fragment_sequences, basepair_cost=0.1):
    def cost(subsequence):
        # Fragments already in the fridge are free, everything else is bought per basepair.
        if subsequence in free_fragment_sequences:
            return 0
        else:
            return basepair_cost * len(subsequence)
    # One edge (i, j) per possible segment of the sequence, weighted by that segment's cost.
    graph_edges = [
        (i, j, {"cost": cost(sequence[i:j])})
        for i in range(len(sequence))
        for j in range(i + 2, len(sequence) + 1)
    ]
    graph = nx.DiGraph(graph_edges)
    # The cheapest decomposition is the shortest path from the first to the last nucleotide.
    path = nx.shortest_path(graph, 0, len(sequence), weight='cost')
    return [
        dict(segment=(i, j), sequence=sequence[i:j], cost=graph[i][j]['cost'])
        for i, j in zip(path, path[1:])
    ]

decompose_sequence(
    sequence = "ATTCTACTAAATTTGCACTCAGAGAGAACCATTA",
    free_fragment_sequences = ["AAATTT", "GAGAGA", "GGCCC"]
)
# Result:
# >>> [{'segment': (0, 8),   'sequence': 'ATTCTACT', 'cost': 0.8 },
#      {'segment': (8, 14),  'sequence': 'AAATTT',   'cost': 0   },
#      {'segment': (14, 21), 'sequence': 'GCACTCA',  'cost': 0.7 },
#      {'segment': (21, 27), 'sequence': 'GAGAGA',   'cost': 0   },
#      {'segment': (27, 34), 'sequence': 'ACCATTA',  'cost': 0.7 }]

This shortest-path trick is one of my proudest computational stunts, but I can’t be certain I was the first to use it. California-based DNA vendor Twist Bioscience was recruiting developers with graph theory skills years ago, and more recently Italy-based Doulix published this tweet which really looks like graph-based assembly optimization (and they may even have nailed circular shortest paths to assemble plasmids!).

Finally, a Github search gave me this project from the University of Washington, which also uses shortest-path algorithms and integrates with other lab automation software. It’s a nice tribute to applied maths that different groups around the world came up with the same abstraction of a pretty recent problem.

Cloning constraints as graph operations

The great thing about graph representations of DNA construction problems is that you can model practical cloning constraints with simple operations (a small code sketch follows the list):

  • “My assembly method doesn’t work with fragments shorter than X or longer than Y”: just remove all graph edges spanning more than Y or fewer than X nucleotides.

  • “This assembly method works badly when the fragment fusion regions contain very low GC or repeated sequences”: remove the graph nodes corresponding to such regions.

  • “I need to force the junction between two fragments to absolutely happen at a certain location in the sequence, so I can swap parts later”. Simply remove all edges of the graph that go over that location:

  • “The assembly method works badly when assembling more than 10 parts”: if the shortest path has more than 10 edges, add a constant penalty weight to each edge and compute the new shortest path, which should have fewer edges. Increase the penalty if necessary.

  • “Golden Gate assembly overhangs should all be inter-compatible” (read this other article on that subject): if the assembly strategy corresponding to the shortest path has DNA fragments with incompatible overhangs, use a backtracking algorithm (Yen 1971, as implemented in Networkx) to iterate through the 2nd shortest path, 3rd, 4th, etc. until one assembly has compatible overhangs.

  • “Oligo assembly only works with an even number of fragments” – I actually don’t have an elegant solution for this one, but here is a stackoverflow suggestion for finding even-edged shortest paths via a heavy transformation of the graph.
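Here is a minimal sketch of a few of these operations, assuming a decomposition graph built like the one in decompose_sequence() above (integer nodes from 0 to len(sequence), a "cost" attribute on every edge); the function names and the is_valid_assembly check are hypothetical:

import networkx as nx

def enforce_fragment_sizes(graph, min_size, max_size):
    # "No fragments shorter than min_size or longer than max_size":
    # drop every edge whose segment falls outside the allowed range.
    bad_edges = [
        (i, j) for i, j in graph.edges()
        if not (min_size <= j - i <= max_size)
    ]
    graph.remove_edges_from(bad_edges)

def force_cut_at(graph, position):
    # Force a fragment junction at `position` by removing every edge spanning it.
    spanning_edges = [(i, j) for i, j in graph.edges() if i < position < j]
    graph.remove_edges_from(spanning_edges)

def limit_fragment_number(graph, start, end, max_fragments, penalty=1.0):
    # Penalize every edge until the shortest path uses at most max_fragments edges.
    path = nx.shortest_path(graph, start, end, weight="cost")
    while len(path) - 1 > max_fragments:
        for _i, _j, data in graph.edges(data=True):
            data["cost"] += penalty
        path = nx.shortest_path(graph, start, end, weight="cost")
    return path

def first_valid_assembly(graph, start, end, is_valid_assembly):
    # Iterate over the 1st, 2nd, 3rd... shortest paths (Yen's algorithm) until one
    # passes a user-provided check, e.g. "all Golden Gate overhangs are compatible".
    for path in nx.shortest_simple_paths(graph, start, end, weight="cost"):
        if is_valid_assembly(path):
            return path
    return None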

Harder, faster, sometimes better: the A-star algorithm

One bottleneck of the graph representation approach is that large sequences (tens of thousands of nucleotides and beyond) require costing millions of edges, which can take a few minutes. This is generally acceptable for foundry operators, given the reward, but too slow to give web customers the instantaneous, flight-booking-like experience that will keep them asking for prices as they tweak their DNA designs.

A solution to this is A*, another shortest-path algorithm, which can simply ignore the edges of the graph that are less likely to be part of the optimal solution. It may sometimes be wrong and miss the best path, but it is a very good approximator.

Here is an illustration (from user Subh83 on the A* wikipedia page) where A* finds a similar path that is only 10% longer by exploring 50% fewer edges:

Dijkstra algorithm vs. A* algorithm

Its speed and good approximations make A* a good choice for interactive applications, and it has been used in real-time strategy games such as Age of Empires, which need to move many units across the map without any lag:

A* is so good at ignoring graph edges because it is fed a particular piece of information (the heuristic) from which it can evaluate, at any point of the search, how far it is from its goal. For the illustration above, that information may be the geometric distance to the goal divided by the average terrain speed. If you are looking for the fastest trip in a city, Google Maps will probably use the distance to your goal divided by the typical speed of the city’s transportation options. To find the cheapest DNA construction strategy, that information will be the number of nucleotides left, times the typical per-nucleotide price of DNA.

The art is to decide what a “typical per-nucleotide price” is. If you pick it too low, the A* algorithm will have practically no advantage over Dijkstra (it will be just as slow), and if you pick it too high, A* will assume that any move is cheap, and go for the most obvious (and expensive) ways to build the sequence.
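For the record, here is a minimal sketch (not DNA Weaver’s actual code) of how A* can run on the same kind of decomposition graph with Networkx, using “nucleotides left times an assumed typical price” as the heuristic. Note that this toy version still builds and costs every edge upfront; the real speed gain comes from only costing the edges that A* actually explores.

import networkx as nx

def astar_decompose(sequence, cost, typical_price_per_nucleotide=0.15):
    # Same decomposition graph as in decompose_sequence(), with cost(subsequence)
    # returning the price of obtaining that fragment.
    graph = nx.DiGraph([
        (i, j, {"cost": cost(sequence[i:j])})
        for i in range(len(sequence))
        for j in range(i + 2, len(sequence) + 1)
    ])

    def estimated_remaining_cost(node, target):
        # Optimistic guess of what the rest of the sequence will cost. If it
        # over-estimates (e.g. a free PCR fragment lies ahead), A* may return a
        # slightly sub-optimal decomposition: that is the speed/quality trade-off.
        return typical_price_per_nucleotide * (target - node)

    path = nx.astar_path(
        graph, 0, len(sequence),
        heuristic=estimated_remaining_cost, weight="cost",
    )
    return [(i, j, sequence[i:j]) for i, j in zip(path, path[1:])]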

As a first example, say your commercial providers are two companies charging 10c/nucleotide and 20c/nucleotide respectively:

Cutting through the middle, 15c/nucleotide is a good choice of typical price to feed to A*. And indeed, an assembly station running an A* search with this parameter will be more than 10 times faster, and still find optimal solutions. A lower parameter value provides no speed advantage, and a higher value creates naive and expensive DNA assembly plans:

For problems with more sources it can get trickier, but you get the general idea. The important point is that when the algorithm finds out that a 5000-nucleotide fragment of the sequence can be obtained virtually for free via PCR, it understands that it is on a “fast lane” and that there is probably no other way but to use that free source. Let’s come back to this problem:

As it turns out, the optimal price heuristic seems to also be ~15c/nucleotide, meaning, in a nutshell, “DeluxeDNA or oligo assembly are considered expensive, look for alternatives when possible”. With this setting for all stations, computing an assembly strategy for a 50,000-nucleotide sequence with E. coli homologies and a distribution of patterns goes from 12 seconds to 0.8 seconds, which is the difference between a long lag and a quick click, between slow iterations by trial and error and real-time price updates as you edit your sequence:

Conclusion

Fifty years ago, synthesizing a 50-nucleotide bit of DNA was a scientific achievement. Today a high-schooler can order a gene for a few hundred dollars. But for longer sequences we’re not out of the woods yet. It will take time to plan, it will hit your budget, and in some cases success is not even guaranteed. I’ve seen this be a big factor in project paralysis in Synthetic Biology, with researchers pondering their options and planning their orders for months before taking the leap.

I didn’t even scratch the surface of combinatorial assemblies, multiplexed PCRs, genome editing, and many other cloning techniques I know less about. And add to this that five years from now, commercial DNA offers, cloning techniques, and even the nature of projects will be different, which makes it a lot of work for software teams to keep their tools up to date. So if you are interested in algorithms and bioengineering, there should be plenty to do in the coming years.