Computational Thinking in Biology: The Genome Assembly Problem

Why?

Genome assembly was one of the first problems encountered in computational biology, or

bioinformatics. During the human genome project, researchers encountered a problem where they could not sequence the entire human genome all at once, they had to cut it up, sequence the pieces, and then put it all back together like a puzzle.


The sequencers aren't 100% accurate, so copies of the genome must be created to get a

more accurate sequence. They call this copying “coverage”, researchers typically want a sequence to be covered 50 times, or “50x coverage”. The idea is that the sequencer won't mess up in the same spot every time it sequences a section, so you can use the fact that mistakes occur in different spots each time as a tool to create one master “consensus” sequence, a sum of all of the different copies of a particular section.


Sequencing machines are not powerful enough to sequence a whole genome all at

once, so each organisms genome must be “cut up” into smaller chunks. These sequence chunks are random sizes on purpose so that they are unique, like a puzzle. The result is a “graph” of overlapping genome pieces.


The sequencing machine is bad at sequencing repetitive regions. If there is a part in a

genome that is just “AAAAA” for hundreds of base pairs, the sequencer tends to throw this out, leaving huge gaps in the genome. The computer must make up for this by guessing at a “Hamiltonian path” or “a best path” through the directed graph.



How?

Genome sequencing steps:

1) The original sequence is copied many times

2) The copies are cut into smaller, randomly sized chunks

3) The small genome chunks are fed into a sequencer, the sequencer outputs "sequence reads"

4) Those sequence chunks are then fit together by a computer

a) Identify overlapping regions

b) Join the overlapping regions together

c) Determine the best path (Hamiltonian Path)

d) Discard repetitive information, or copies of the best path

e) Create consensus sequence from what's left

5) The pieces are used to create a “consensus sequence”


Activity:

Assemble a genome of rainbow colored string.

The students will be doing the job of the computer, they will fit together the many pieces of DNA (string) into a “genome graph”.


During the activity, the students will be asked questions on how they are assembling the string.

  • Are they creating one long, continuous solution, or does their solution appear to have many layers, that together would make a continuous solution?

  • Did they encounter any colors that were obviously missing? Things that should have been included, but couldn't be found? How did they make up for this?

  • Did they encounter any areas which seem to switch back and forth between two colors, and it's not clear where the region should be placed, or how long it actually is? What colors were involved here?

  • Were there any color regions which were too similar, and could not be told apart easily? How did they overcome this?

The students should assemble one group of strings, record how they did it, and then use the same method to assemble another group of strings.

  • What was their method, how did they pick the "best path"?

  • Were there any parts of the string genome they had, and decided to not include in the final product? Why did these strings get thrown out? What colors were included on the strings?

  • Did they have to change their method at all?

  • What is the most complex part about how they decide on the "consensus sequence"?


What are the results?

This activity is key to teaching a student how to turn a concrete problem into an abstract idea that can be turned into an algorithm for a computer to do. The genome assembly problem is similar to many other problems in computer science, in fact, some of the first machine learning algorithms were built to assemble a genome. While machine learning is generally the name for algorithms which require a repetitive training set, genome assembly algorithms tend to compare previous assemblies to the current working set. The point of machine learning algorithms is to make a guess based on probability, a genome assembly will never be 100% correct but if we can create something where the mistakes are easy to make up for, like having repetitive or ambiguous colors, that is a good algorithm.


During this activity, the student sees how the specific order of bases doesn't matter, the base order creates a big picture of different DNA features and that's what matters. The world is full of seemingly repetitive information like this, companies these days have giant databases of customer information. While assembling the multi-colored genome string, they can experience some of the guesswork involved in creating algorithms. The students can see clearly how what makes a good algorithm isn't that it just works, it's how it works, it's the amount of time it takes, it's how precise it is, how easy it is to replicate and re-use for other tasks, all of these things should be taken into account when solving problems later.

0 views

© 2018 by the CodeNC Fellowship.  

Proudly supported by grants from the College of Computing and Informatics at UNC Charlotte.

Join us if you believe that every student should have a fundamental right to computer science literacy. 

Join us if you believe that every student should have a fundamental right to computer science literacy.