The Human Genome Project

Saturday November 11, 2017

I was cleaning up my notes and I found three DOIs (digital object identifiers) that pointed to three papers on DNA sequencing. Then I thought to myself, let’s do a simple and short write-up on this.

DNA sequencing became a thing in 1990 when the Human Genome Project (HGP) kicked off. This ambitious project was motivated by the following goals, among others: mapping the physical location of human genes and linking genes to specific, functional human traits. The interest in mapping the physical location of genes began in the early 20th century when Alfred Sturtevant, an undergraduate researcher in Thomas Hunt Morgan’s lab (yes, the Morgan of the centimorgan, or cM), realized that he could map the locations of fruit fly genes. That was in 1911.

genomic research timeline

Dulbecco’s Essay. The name “Dulbecco” is well known among bench scientists who carry out tissue culture experiments on a daily basis. Dr. Renato Dulbecco won the 1975 Nobel Prize in Physiology or Medicine for his work on viruses that could cause cancer. His name is preserved in the most famous tissue culture solution (we call it “medium”), Dulbecco’s Modified Eagle Medium. Nope, he did not get that from an eagle; he modified a tissue culture medium that was originally formulated by Dr. Harry Eagle.

Dr. Renato Dulbecco. Image from Wikimedia Commons.

In 1986, Dulbecco published a letter in the journal Science, arguing that the scientific community needed to usher in a new era of characterizing the cellular genome. Dulbecco was working with tumor-causing viruses (also known as oncoviruses). Through his research, he was able to characterize the viral genes responsible for causing tumors. He later reasoned that although he and a handful of other people had elucidated the mechanism of cancer with respect to the role of oncoviruses, they still had no clue how the oncoviruses interacted with our human DNA. Some of their data suggested that our genome, or its state, could influence how well a virus can cause a tumor.

To find this missing piece, Dulbecco used the letter to call for a new effort to obtain detailed knowledge of our DNA. He argued that knowledge of (our) genes would open up new therapeutic studies. He was right. In 2017 (31 years later), the FDA approved a cancer drug for use based on a genetic feature of the patient’s tumor, not the location of the cancer (we’ll save this discussion for later).

Sequencing Protocols. Sequencing technology is a little bit hard for me to explain because (I) I do not have enough hands-on experience with DNA sequencing to know how it actually works at the molecular level, (II) there are tons of sequencing technologies out there, and (III) I am still learning the theory behind sequencing methods and protocols, and there is just a lot to catch up on.

With that caveat, here is my elementary-level understanding of how the investigators carried out the sequencing protocol when they were trying to map the whole human DNA sequence during the HGP. The main method of determining the DNA sequence was Sanger’s chain-terminating dideoxy method. However, this technique can only be applied once the DNA fragments have been prepared, and that preparation deserves its own discussion. So here is another question: how did the researchers prepare the DNA fragments before reading their sequence with the chain-terminating dideoxy method?

Hierarchical shotgun sequencing. Source: Nature 409, 860–921 (15 February 2001)

Among the methods that were employed, one is hierarchical shotgun sequencing (HSS). It is also known by several other names, for example, “BAC-based” or “clone-by-clone”. BAC here refers to the bacterial artificial chromosome, a special kind of DNA vector. Our genome is way too large for a single continuous read: feeding one long stretch of 3 billion nucleotides into the dideoxy sequencing machine would pose a real technical challenge. So the researchers fragmented the DNA into pieces of about 100,000 to 200,000 nucleotides in length, which were easier to handle in subsequent reactions. These DNA fragments were inserted into BACs, and then each BAC insert was broken randomly into even smaller pieces, around 500 nucleotides in size, which were placed into another vector called M13.
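The two-stage shredding described above can be sketched in a few lines of Python. This is only a toy model under my own assumptions: the function names (`random_genome`, `split_into_clones`, `shotgun_reads`, `hierarchical_shotgun`) are made up for illustration, the sizes are scaled way down from the real 100k-200k and 500 nucleotide figures, and the "clones" here are cut end to end, whereas real BAC inserts overlap.

```python
import random


def random_genome(length, seed=0):
    """Make a random toy 'genome' string (illustrative only)."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(length))


def split_into_clones(genome, clone_size):
    """Stage 1: cut the genome into large pieces (stand-ins for BAC inserts).

    Each clone is stored with its start position, mimicking the fact that
    the clones are placed on a map before they are sequenced.
    """
    return [(start, genome[start:start + clone_size])
            for start in range(0, len(genome), clone_size)]


def shotgun_reads(clone, read_length, n_reads, rng):
    """Stage 2: sample short reads at random positions inside one clone."""
    if len(clone) < read_length:
        return [(0, clone)]  # clone shorter than one read: keep it whole
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(0, len(clone) - read_length + 1)
        reads.append((start, clone[start:start + read_length]))
    return reads


def hierarchical_shotgun(genome, clone_size=2000, read_length=50,
                         reads_per_clone=200, seed=1):
    """Return {clone_start: [(offset_in_clone, read), ...]} for every clone."""
    rng = random.Random(seed)
    return {clone_start: shotgun_reads(clone, read_length, reads_per_clone, rng)
            for clone_start, clone in split_into_clones(genome, clone_size)}


genome = random_genome(10_000)
library = hierarchical_shotgun(genome)
print(len(library), "clones,", sum(len(r) for r in library.values()), "reads")
```

The point of the toy is that every read carries a "which clone did I come from" label, so the puzzle only ever has to be solved one clone at a time.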

Another method, proposed and used by Celera Genomics, was whole genome shotgun (WGS) sequencing. Unlike HSS, where the DNA is shredded in two successive steps before it is actually sequenced, WGS skips the intermediate BAC step and shreds the whole genome directly into small pieces, leaving all of the reassembly to be worked out from the reads themselves. Because of that, researchers back then argued that it was better to employ HSS so that they could generate sequence data with minimal errors.
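To get a feel for why skipping the clone map puts so much weight on reassembly, here is a minimal greedy overlap sketch in Python. It is a textbook-style toy, not the assembler Celera or anyone else actually used; the function names and the `min_len` parameter are my own assumptions, and it assumes error-free reads with no repeated stretches, which real genomes very much have.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if the best match is shorter than `min_len`.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def drop_contained(reads):
    """Remove duplicate reads and reads that sit entirely inside another read."""
    unique = list(dict.fromkeys(reads))
    return [r for r in unique
            if not any(r != other and r in other for other in unique)]


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap.

    Stops when one contig is left or no pair overlaps by at least `min_len`.
    """
    contigs = drop_contained(reads)
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            break  # nothing left to merge; several contigs remain
        merged = contigs[best_i] + contigs[best_j][best_len:]
        rest = [r for idx, r in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs = drop_contained(rest + [merged])
    return contigs


genome = "ATGCCGTAAGCTTGACCA"
reads = [genome[i:i + 10] for i in (8, 0, 4)]  # three overlapping reads, out of order
print(greedy_assemble(reads))
```

With no clone labels to narrow things down, every read has to be compared against every other read, and a repeated stretch of DNA can trick the greedy merge into joining the wrong pieces. That is the intuition behind the worry about errors.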

In hindsight, the HGP was not only about generating the map of our genome. It was also a constant battleground for new technologies to speed up DNA sequencing, which paved the way for what we call next-generation sequencing (NGS), a topic I might touch on in a different write-up.

The Drafts. After quite a while, scientists finally announced that they had solved the sequence of human DNA. It is worth noting that before actually sequencing the human genome, scientists performed (if you will) test runs on model organisms first. And to make a fair assessment of which organism got sequenced first, we need to take a brief tour of the decades before the HGP.

The two sequencing methods (Maxam-Gilbert and Sanger) were described in 1977. Guess what, the first (micro)organism to have its genome sequenced was the bacteriophage MS2, and that was done in 1976 by a team in Belgium led by Walter Fiers at the University of Ghent. I do not know how they did it (I have not gone deeper in my reading yet), but here is a fun fact: Fiers’s team was also the team that revealed the complete sequence of the SV40 virus, which is biologically relevant for a number of reasons.

Thanks to our knowledge of the SV40, we benefit a lot. Specifically, I am pretty sure a lot of us who are doing tissue culture can appreciate the HEK293T cell line. We can talk about this later.

Then in 1977, another bacteriophage known as ΦX174 was completely sequenced, this time by Fred Sanger’s team in Cambridge. After doing a little bit of reading, the method seems bizarre yet cool to me. They employed neither the Maxam-Gilbert method nor the dideoxy chain-termination method, relying instead on Sanger’s earlier “plus and minus” method, which uses radiolabeled nucleotides. I do not know how this works, but I will figure it out later.

Then, starting from 1995 when the HGP was in motion, the modern sequencing method began to dominate. The first organism to benefit from it was the bacterium Haemophilus influenzae, followed by the venerable yeast Saccharomyces cerevisiae and then our favorite worm, C. elegans.

Not long afterward, the fruit fly Drosophila melanogaster and the plant model organism Arabidopsis thaliana had their genomes sequenced. Finally, in 2001, the first draft of the human genome was announced and made public.

For those who care about the specifics, here is a little bit more information: the initial rough draft of the human genome was made available in June 2000, the first working draft was made available in February 2001, and the final map was revealed in April 2003. Information derived from Wikipedia.

But then, did the scientists sequence the whole thing? Unfortunately, they did not.

There are technical reasons why they had not actually completed the sequence. Talking about these reasons would probably turn this article into a textbook. Let’s not go there.

It is sufficient, for now, to know that there are two physical forms of DNA: a fairly loose structure and a super-condensed structure. The latter is a little bit hard to deal with, but fortunately it only accounts for a relatively small part of our genome. This is not to say that these regions are not important (they are!), but in addressing the initial question (“which genes do this and that”), the problematic regions were left for future investigators to bear the brunt of (that’s probably including me, oh lordy).

STAT News wrote about this and I think you should read it.

Do you want more fun facts? Sure. The initial draft of the human genome was a mosaic, in the sense that it came not from just one individual but from a pool of anonymized volunteers. The investigators did not know who the individuals were.

The Projects That Came Afterwards. The HGP provided what scientists of this era need the most: technologies that work and data that can be referred to. The HGP validated that Sanger’s sequencing method is reliable and could serve as the basis for improved sequencing methods that get more data faster, cheaper, and more reliably.

As much as I would like to enumerate all the projects birthed by the HGP, there is no way I am turning this into an encyclopedia. Instead, I am going to focus on the projects that I am fairly familiar with, though I will not claim myself as an expert.

Let’s briefly talk about the International HapMap (haplotype map) Project. The Broad Institute of MIT and Harvard has a good explainer about this project that you can read here, so I won’t go over it.

Instead, let’s delve straight into the controversy that came with this project, which seems to stem from the way it was (intentionally) designed. The HapMap’s premise was to catalog the demographic-specific variations and structures of human DNA. In other words, how different is the DNA of a Central European population from that of a Han Chinese population?

Why is this question important?

Well, consider the fact that people in certain demographic groups are less vulnerable to certain diseases. For example, a subset of people of African descent is less vulnerable to malaria (while being at risk for sickle-cell anemia), and a subset of people of European descent is less vulnerable to HIV infection.

So, what is the problem?

First of all, scientists do not consider race a valid scientific construct. Scientists do not believe that race is a thing. But the HapMap project could set a precedent for categorizing humans further into subspecies. This could open a huge can of worms, and we already have enough cans of worms to take care of. Geography already divides us. The language we speak also divides us. Which soccer team we support could spark a brutal exchange of words and fists, so clearly we don’t need more.

But, as there’s always a but, HapMap does come with benefits. Researchers are looking for demographic-specific variants to uncover why certain metabolic diseases are more prevalent in certain people. Data from HapMap could shed some light on this. “HapMap is a necessary evil for a greater good”, as someone would point out. Perhaps this article by Scientific American can help.

Now, let’s take a step back and talk about the findings from the HGP. Of all the DNA that was sequenced (excluding the super-condensed regions), scientists were perplexed to find that only about 1.5% can actually be transcribed and translated into proteins, roughly 20,000 of them.

What about the rest? 98.5% is a huge number.
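To put those percentages into actual nucleotide counts, here is a quick back-of-the-envelope calculation in Python. It only reuses the round figures from this post (about 3 billion nucleotides, 1.5% coding), so treat the output as rough orders of magnitude; the function name is mine.

```python
GENOME_SIZE = 3_000_000_000  # nucleotides, the round figure used earlier in this post
CODING_PER_THOUSAND = 15     # 1.5% expressed as parts per thousand to stay in integers


def coding_split(genome_size, coding_per_thousand):
    """Return (coding, non_coding) nucleotide counts for the given fraction."""
    coding = genome_size * coding_per_thousand // 1000
    return coding, genome_size - coding


coding, non_coding = coding_split(GENOME_SIZE, CODING_PER_THOUSAND)
print(f"coding:     {coding:,} nucleotides")
print(f"non-coding: {non_coding:,} nucleotides")
```

So the protein-coding part is on the order of tens of millions of nucleotides, while the "rest" is close to the full 3 billion.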

Nobody has the complete picture for sure. Some call it junk DNA (it isn’t junk), some call it biological dark matter (feels like taking a physics class now), and some just (boringly) call it the non-coding region (which is true). These non-coding regions are actually pretty useful. They can act as regulators, playing the role of a senator who decides whether or not a gene gets transcribed. Maybe we should talk about endogenous retroviruses in the future because they are relevant to our topic here.

But then, why bother spending more time on the non-coding regions when we do not have the complete knowledge of the coding regions? Pondering upon priority, characterizing the functional coding regions is the wisest thing to do.

Enter, the ENCODE project.

Shortly after the final draft of the HGP was announced, ENCODE took off. The premise is simple: we have the human genome sequence as the reference, so let’s annotate it all and see which gene does what. As I write this (Nov 2017), the 4th phase of ENCODE has just taken off (Feb 2017).

Okay so now we have the HapMap that attempted to characterize the variations in the human genome as a function of demographic classification. We also have ENCODE to functionally characterize the protein-coding regions. Given the fact that the DNA sequencing technologies are getting faster and cheaper, scientists decided to catalog the variants in the human genome by sequencing as many people as possible.

Thus, we arrived at the 1000 Genomes Project, a.k.a. 1KGP. I must say that I am disappointed because the name for this project is not flashy.

Despite the un-flashi-ness of its name, the findings do have their own brilliance (puns!). Scientists figured out how frequently de novo mutations take place per nucleotide base in germline cells (that specific), how many variants lead to loss of function (the number is 250-300 variants per person, and I don’t know what this implies), and other genetic events that can happen inside your body with respect to genetic inheritance.

It does look like a fun paper to read. Read here (OA).

Closing thoughts. Reading about human populations, their genetic composition, and how it affects our daily life is really interesting. Initially, I kept notes on human genetics, but then I thought I could do better by writing an actual magazine-tier article about this.

And I did. I hope you had fun reading this. I might write the sequel as I learn more about this subject. Maybe, say, 6 months from today?

Papers. Not for the faint-hearted.