How to quickly identify important genes

DNA is often referred to as the blueprint for life because it contains all the instructions needed to build and sustain an organism. If you were to look at this blueprint carefully, you would see billions of genetic building blocks called nucleotides. The order of these nucleotides gives rise to the different proteins and other molecules that make up our bodies. We can determine the order of nucleotides through a process called DNA sequencing. DNA sequencing has recently gained a high-profile place in popular culture through TV shows like CSI and companies like 23andMe. However, it is also used for many research purposes. One such application is “Transposon-Insertion sequencing”, or simply “Transposon sequencing” (Tn-seq), which identifies which of a given organism’s genes are essential (i.e. required) for survival.

Transposons are small stretches of “selfish” DNA that only care about replicating themselves; they typically do not confer any benefit to their host organism. However, they can certainly have negative effects. Transposons typically replicate by inserting copies of themselves randomly across the genome. This means that sometimes transposons will insert into the middle of an important DNA sequence like a gene and mess up the blueprint instructions. Researchers have taken advantage of this feature of transposons by growing up millions of bacterial cells and inserting a transposon into each cell. The transposon will integrate randomly into a different location in the DNA of each host cell, and these locations can be mapped across a population of cells by DNA sequencing (Fig 1a). Transposon insertions can also give valuable information about the relative importance of different regions of the host DNA. For example, if a transposon disrupts a gene that is essential for life, then the cell will die, whereas if the transposon disrupts a non-essential gene the cell will continue to grow happily (Fig 1b).

While there are many interesting applications of Tn-seq, it turns out there are also many ways to statistically analyze the data! A general problem when analyzing Tn-seq data is that many of the genes identified as essential turn out to be false positives (i.e. incorrectly identified as essential) when they are validated with other approaches. Therefore, hammering out the best recipe for processing the data is extremely important for standardizing results and reducing false positives. Two key issues that arise when analyzing Tn-seq data are:

  1. There is a lot of random variation between experiments.
  2. A lot of important information is encoded in a genome beyond the genes themselves. For example, regions of DNA outside of a gene are commonly involved in regulating how a cell decodes the gene and uses its products. Most common approaches for analyzing Tn-seq data ignore these regions between genes.

Pritchard et al. addressed these two issues in a software program they developed called “ARTIST”. Rather than just taking the number of transposon insertions at face value, the authors ran simulations by randomly picking a set number of transposon insertions and seeing how their results changed. Their tool automatically performs these simulations and corrects for the technical and biological variation in these experiments. They also developed a method for identifying regions of the genome of any size, rather than only individual genes, that can be called “essential” or “non-essential”.
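To make the idea of repeated random sampling more concrete, here is a minimal Python sketch. It is not ARTIST's actual code or algorithm: the function names, the toy read counts, the sampling depth of 30 reads and the 200 repetitions are all assumptions chosen for illustration. It simply redraws a fixed number of reads many times and records how much of a region appears disrupted in each draw.

```python
import random

def resample_read_counts(read_counts, n_reads, rng):
    """Draw n_reads reads at random, weighting each site by its observed count.

    Hypothetical helper for illustration; not part of any published tool.
    """
    sites = range(len(read_counts))
    draws = rng.choices(sites, weights=read_counts, k=n_reads)
    resampled = [0] * len(read_counts)
    for site in draws:
        resampled[site] += 1
    return resampled

def fraction_disrupted(counts, start, end):
    """Fraction of sites in counts[start:end] that carry at least one read."""
    region = counts[start:end]
    return sum(1 for c in region if c > 0) / len(region)

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Made-up reads per potential insertion site along a short stretch of genome.
observed = [12, 0, 7, 0, 0, 0, 0, 0, 9, 15, 3, 0, 8, 6]

# Repeat the random draw many times and track one region (sites 3 to 7).
fractions = []
for _ in range(200):
    sample = resample_read_counts(observed, n_reads=30, rng=rng)
    fractions.append(fraction_disrupted(sample, 3, 8))

print(min(fractions), max(fractions))
```

If the disrupted fraction for a region stays low across every random draw, the lack of insertions there is unlikely to be a sampling accident. A region whose value swings widely between draws deserves more caution.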

Both of these contributions are extremely useful for researchers working with Tn-seq data. Hopefully, this post helps readers understand why work like this is important and interesting. Who knows, maybe a few years from now friends will be gathered around a laptop screen arguing over how best to analyze DNA sequencing data.

Figure 1: Big-picture view of Tn-seq analysis. (a) Extracted DNA (corresponding only to sites where the transposon inserted) is sequenced. The sequencer output is millions of short stretches of nucleotides (As, Cs, Gs, and Ts) called “reads”, which are typically 100-300 letters in length. Since many bacterial genomes have been sequenced already (and pasted together through a method called assembly), the next step is to “map” the reads, which means figuring out where each read belongs in the genome. (b) Essential genes can be identified by the lack of transposon insertions (i.e. very few reads will map to them) and non-essential genes are identified by the presence of many transposon insertions.
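The logic of panel (b) can be sketched in a few lines of Python. This is a deliberately naive illustration, not the method used by ARTIST or any other tool: the gene coordinates, the insertion positions and the density cutoff of 0.02 are all made-up assumptions, and it applies no statistics at all, which is exactly the kind of face-value counting that more careful methods improve on.

```python
def count_insertions_per_gene(insertion_positions, genes):
    """Count distinct insertion sites that fall inside each gene.

    genes maps a gene name to inclusive (start, end) genome coordinates.
    """
    unique_sites = set(insertion_positions)
    return {
        name: sum(1 for pos in unique_sites if start <= pos <= end)
        for name, (start, end) in genes.items()
    }

def candidate_essential_genes(counts, genes, max_density):
    """Return genes whose insertions per base fall at or below max_density."""
    flagged = []
    for name, (start, end) in genes.items():
        length = end - start + 1
        if counts[name] / length <= max_density:
            flagged.append(name)
    return flagged

# Hypothetical gene coordinates and mapped insertion positions.
genes = {"geneA": (1, 100), "geneB": (101, 200), "geneC": (201, 300)}
positions = [5, 17, 17, 42, 88, 150, 210, 233, 250, 299]

counts = count_insertions_per_gene(positions, genes)
flagged = candidate_essential_genes(counts, genes, max_density=0.02)
print(counts)
print(flagged)
```

Here geneB has far fewer insertions than its neighbours, so it is the only gene flagged as a candidate. In real data, a single fixed cutoff like this is one source of the false positives discussed above.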

Summary written by: Gavin Douglas

To read the full article click the following link:

ARTIST: High-Resolution Genome-Wide Assessment of Fitness Using Transposon-Insertion Sequencing 

Justin R. Pritchard, Michael C. Chao, Sören Abel, Brigid M. Davis, Catherine Baranowski, Yanjia J. Zhang, Eric J. Rubin, Matthew K. Waldor
