Caspershire

Loft of an Eldritch Metaphor

Mutations & Strains

Posted on 30 Apr 2020

For a while I resisted the urge to write anything (read: opinions) about SARS-CoV-2/COVID-19 because things have been moving so fast that I often find myself gasping for air and needing some breathing support while combing through the scientific literature, news reports, and tweets.

Oftentimes, my friends would ask me “is this true?” or I would find myself replying to their panic-laden tweets and calmly assuring them that “yeah, things are on fire, but it is not that bad of a fire”. Lately, I hit the proverbial wall where tweets won’t cut it anymore to convey complex information and difficult concepts.

I would like to talk about the concept of mutation and classification by using this tweet as an example.

Awani’s Tweet

Journalists from Awani, if you happen to read this, I have nothing personal against you. It is just a little unfortunate that a friend of mine shared this with me and I thought it would be a good example to illustrate my point. To DG Hisham, great job navigating the country through this difficult time. It would be hard to find someone of a caliber matching yours, but if you need my CV/resume let me know. Teehee.

Is it true that we have 3 different types of SARS-CoV-2?

Well, it is a little more complicated than that.

Jump To Section:

  1. Typing, Categorizing, and Naming
  2. How Viruses Are Generally Classified
  3. Genomic Identity
  4. Genomic Identity Is Not Functional Identity
  5. SARS-CoV-2 Type A, B, and C?
  6. What Should We Do Now?

Typing, Categorizing, and Naming

It is our intrinsic nature to classify things, to put things into discrete bins. This is especially true when you are dealing with a large number of things and referring to them one by one would be a Herculean task.

Classification Problem

Wait… this sounds like an introduction to a machine learning seminar.

Anyway, it is natural to put a label on things to make communication easier. Labels are definitely helpful. However, are they meaningful? For example, we are so accustomed to using the concept of race to sort the human population into discrete units of identity. I am a Malay, my reader right now could be a Chinese, or an Indian, etc.

This classification could be helpful for me to understand the demographic of my readers, but it is not actually functionally meaningful. For example, is there any fundamental difference between these races in their ability to understand what I write here? Race would be a poor predictor of literacy, hence we should find a better unit of measurement. Maybe level of education?

But, to classify someone based on level of education is a little bit more challenging than using race. Why? By using race, it is quick to identify and to label a person. You look at them, and you know (with a certain degree of accuracy, of course). To identify someone’s level of education, you need to ask that person.

Translating the notion above to the topic at hand, the temptation to label which SARS-CoV-2 is causing the highest rate of infection and mortality is real and strong, and it is even being weaponized to a certain extent. However, it is not the virus alone that is 100% responsible for causing damage. As we all know, government policy (e.g. enforcement of lockdown), healthcare capacity (number of beds and ventilators), and societal reaction (e.g. abiding by the lockdown and not flaunting your privilege if you are the child of a VIP) all have a say in determining the shape of the curve.

Thus, I would exercise extreme caution when I see “ah this strain from Europe that came to the US is super deadly” when in reality the federal response of the US government has been weak sauce.

How Viruses Are Generally Classified

Okay now, real science stuff.

First, there are broad classes of viruses. For example, we have influenza virus, hepatitis virus, poxvirus, so on and so forth. There are just too many of them and scientists have not found most of them yet. Safe to say we are barely scratching the surface.

Then within these broad classes, there are subclasses. For influenza virus, we have type A, type B, and type C that could infect humans. Then, we have subtypes. Within influenza A, we currently have subtypes H1N1 and H3N2 that are causing disease in humans. Within these subtypes, we have clades, for example H3N2 has clades 3c2.A, 3c3.A, so on and so forth. And then, we have individual strains, for example the A/Kansas/14/2017 strain within clade 3c3.A.
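The strain name at the bottom of this hierarchy packs several fields into one string. As a minimal sketch, assuming the common four-field form used for human isolates (type / place of isolation / isolate number / year), a name like A/Kansas/14/2017 can be pulled apart as below. The names `StrainName` and `parse_strain_name` are made up for illustration, and names of non-human isolates, which carry an extra host field, are deliberately not handled.

```python
from typing import NamedTuple


class StrainName(NamedTuple):
    """Fields of a human influenza strain name (illustrative type, not a standard API)."""
    virus_type: str      # e.g. "A"
    location: str        # place of isolation, e.g. "Kansas"
    isolate_number: str  # lab isolate number, kept as text
    year: int            # year of isolation


def parse_strain_name(name: str) -> StrainName:
    """Split a four-field strain name such as 'A/Kansas/14/2017'."""
    parts = name.split("/")
    if len(parts) != 4:
        # Assumption: only the four-field human form is supported here.
        raise ValueError(f"expected 4 fields separated by '/', got {len(parts)}: {name!r}")
    virus_type, location, isolate_number, year = parts
    return StrainName(virus_type, location, isolate_number, int(year))


kansas = parse_strain_name("A/Kansas/14/2017")
hong_kong = parse_strain_name("A/Hong Kong/4801/2014")
print(kansas)
print(hong_kong.year < kansas.year)  # HK/14 was identified before Kansas/17
```

Note that the subtype (H3N2) and the clade (3c3.A) are not part of the name itself; they have to come from sequence data or from a database lookup.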

I study influenza A viruses. This hierarchy of classification sometimes leaves me scratching the wall in agony while letting out ghastly screams.

When talking about mutations, we say that strain A/Kansas/14/2017 (Kansas/17 for short, identified in 2017) has mutations that make it different from an earlier strain A/Hong Kong/4801/2014 (HK/14 for short, identified in 2014), different enough even though they both are of the same subtype (H3N2).

How different, you ask?

Well, different enough that if you were exposed to HK/14 (through vaccination or infection), you might get a wee bit sick if you got infected by Kansas/17, but not too sick. This means that you have some level of protective immunity, but not as high. Since our immune system might struggle a little bit against Kansas/17, we could say that Kansas/17 is relatively new. If it were not new at all and very similar to HK/14, you wouldn’t even know you got infected, because your immune system would annihilate Kansas/17 on first sight.

In other words, if our immune system struggles to recognize the new strain even though we could have encountered a variant of it before, it could be something new, like how Tifa took time to recognize Cloud.

But why the “could be”? Why the uncertainty?

Well, you need to sequence its genome to know for sure. If the genomic identity matches something that we already know and that has been in human circulation for a while (we are talking on the scale of years to decades), it could be that the said person is immunocompromised (elderly, pregnant, or living with AIDS). But let’s assume that our person of interest here is a healthy individual.

Now, let’s switch gears and talk about SARS-CoV-2.

Genomic Identity

Soon after the first full genomic sequence of SARS-CoV-2 was uploaded to the public GenBank database, it was immediately scrutinized by researchers worldwide. Debates broke out here and there as people tried to understand what this meant. Is it a totally new virus? How did it get here? How far has it spread? I will not discuss the origin and such, but I would like to make it clear that the broader scientific community agrees this virus WAS NOT the result of artificial genetic modification in a lab.

If this is a biological weapon, it is truly a bad one.

Genomic identity

Here I plotted 3 human coronaviruses: SARS-CoV (2003), MERS-CoV (2013), and SARS-CoV-2 (2019). Plotting was done by measuring the differences in nucleic acid sequence. Essentially, this is a principal component analysis (PCA) plot where the difference along the x-axis is more important than the difference along the y-axis, but we are not going to talk about PCA plots and dimensionality reduction today.

The illustration here makes it clear that the novel SARS-CoV-2 is different from the other two deadly coronaviruses based on nucleic acid identity. However, nucleic acid identity does not tell the whole story. Through a complicated biological process, the nucleic acid sequence must be translated into an amino acid sequence. It is like translating your internal thoughts and monologues into real action.

For example, you thought about eating but thinking alone won’t help. You need to walk to your fridge, open it, find the food, microwave it, and eat it. In other words, nucleic acid has to be processed for its effect (e.g. function) to be readily observed and measured.

If you noticed, I used the term nucleic acid instead of calling it DNA. Why? Because coronaviruses do not have DNA. Their genetic material is RNA, since they are RNA viruses, but that is a story for another day.

I bet you have seen a protein structure (if you saw tweets on that somewhere). That protein structure is the output after the nucleic acid has been processed. For example, here is the spike protein that decorates the outer layer of coronaviruses, which earned them the epithet corona (Latin for crown).
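To make the "thought into action" step a bit more concrete, here is a rough sketch of translation in code, assuming the standard genetic code and reading the RNA three letters (one codon) at a time. The input is a short made-up toy sequence, not a real piece of the SARS-CoV-2 genome, and the `translate` function is written here for illustration only. Real translation in a cell involves far more machinery than a lookup table.

```python
from itertools import product

# Standard genetic code, listed in the conventional U, C, A, G order.
# '*' marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(rna: str) -> str:
    """Translate an RNA string into one-letter amino acid codes, stopping at a stop codon."""
    rna = rna.upper().replace("T", "U")  # accept DNA-style letters too
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Toy example: AUG (M), UUU (F), GUU (V), UAA (stop)
print(translate("AUGUUUGUUUAA"))  # prints "MFV"
```

The point to take away is that the same table maps several different codons to the same amino acid, which matters later when we ask whether a mutation changes anything at all.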

SARS-CoV-2 Spike Protein

Having said all this, having access to the nucleic acid sequence does help a lot. It facilitates pandemic surveillance, it informs treatment options, and it guides vaccine design.

Still remember my little story on naming things above? We are going to get back to that in a second.

Genomic Identity Is Not Functional Identity

At the time of writing, there are more than 15,000 nucleic acid sequences uploaded to the curated database EpiCoV, hosted by the Global Initiative on Sharing All Influenza Data (GISAID). I would love to talk about GISAID, its inception, its role during the 2009 H1N1 pandemic, and the challenges it faces while serving the scientific community. But this is a story for another day.

Armed with such a wealth of data, the scientific community is working hard to make sense of what is happening. There are so many questions that have not been answered.

  1. Why is it pretty good at spreading? How stable is it in the environment?
  2. Where else in the body does it invade? Lungs? Heart? Kidneys?
  3. What would the immune response dynamics look like after a certain period of time?
  4. (more questions here)

As I write this, SARS-CoV-2 is all over the world. The transmission has spiralled out of control and the world is working hard to contain it. In some parts of the world, the danger it poses is no longer as threatening (e.g. New Zealand), but fear and anxiety are still looming in anticipation of a second wave. The second wave of the 1918 H1N1 pandemic was a few times deadlier than the first one, and we must not repeat history.

Going back to the data that is available for analysis, it is true that 15,000 sequences sounds like a large number. It really is. But it severely lacks two key pieces of information: temporal information and the capability of SARS-CoV-2 to evade immunity. The latter is a little challenging to explain, so I will go with the former for today.

In the grand scheme of things, SARS-CoV-2 is still too young. Influenza A virus has been studied since the 1930s but we are still learning about it (we have good vaccines!), HIV has been studied since the 1980s but we are still struggling to design a good vaccine (we have good treatments!), and dengue virus has been studied since the 1940s but the vaccine is not quite 100% working (but we do know how to avoid getting infected!). The bottom line is, we need more time to understand the behavior of SARS-CoV-2.

It is a little premature to come up with anything conclusive about the true nature of SARS-CoV-2 at this point. What we do have now (among others) are good epidemiological studies that characterize the virus at the level of the human population. This enables us to answer the question “why does the virus spread faster in this area but not another”, and the answer, again, is not 100% down to the intrinsic nature of the virus, because the societal response also plays a key role.

In the pursuit of understanding whether the circulating SARS-CoV-2 has significantly evolved, it is very challenging to resist the temptation to classify the viruses based on the sequence data that we currently have. As mentioned previously, nucleic acid sequence information can only tell us so much, and there are caveats when trying to interpret the data solely based on sequence information.

This is where high-throughput analysis could lend some help. Such a wealth of data could allow scientists to generate many hypotheses about the virus. What this means is that in this sea of data, one could spot many small, interesting, peculiar things. “Oh, this mutation at genomic position 23,667 is different, what does it mean?”

But spotting something peculiar doesn’t mean it is certainly functionally different. It is like having a friend who wears an Apple Watch but uses an Android phone, eats pasta with sambal belacan, doesn’t believe in email and still faxes documents. That is certainly peculiar, but not any less human than the rest of us.

The next logical step would be testing the hypotheses. Do the experiment, then do the analysis. Here, the type of experiment matters! You don’t just sequence your way out because sequencing is not the answer to everything.
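One simple reason a spotted difference may mean nothing functionally: several codons encode the same amino acid, so a single-letter change in the RNA can leave the protein untouched. The sketch below illustrates this under the standard genetic code. The codons are invented toy examples, not taken from the SARS-CoV-2 genome, and `classify_change` is a hypothetical helper written for this post.

```python
from itertools import product

# Standard genetic code in the conventional U, C, A, G order ('*' = stop).
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_change(codon_before: str, codon_after: str) -> str:
    """Say whether a codon change alters the encoded amino acid."""
    before = CODON_TABLE[codon_before]
    after = CODON_TABLE[codon_after]
    if before == after:
        return f"synonymous ({before} stays {after})"
    return f"non-synonymous ({before} -> {after})"


# Toy examples (assumed codons, for illustration only)
print(classify_change("GUU", "GUC"))  # third letter changes, amino acid does not
print(classify_change("GUU", "GCU"))  # second letter changes, amino acid changes
```

Even a non-synonymous change only tells you the protein sequence is different. Whether that difference affects how the virus behaves is exactly the kind of hypothesis that still has to be tested at the bench.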

I wrote about a brief history of sequencing here: The Human Genome Project

SARS-CoV-2 Type A, B, and C?

So, is it true that we have three types of SARS-CoV-2 right now?

The answer is… no, not really. So, where did that statement come from? If I got my detective work done right, it originated from the Daily Mail.

Three Distinct: Daily Mail

If you are a regular Redditor and you subscribe to the r/worldnews subreddit, you would know to never take things that come from the Daily Mail seriously.

Reddit and Daily Mail

But let’s say that the Daily Mail is right on this one (no, not quite) and move on to analyze the paper behind the claim.

The paper that proposed the classification was published on 17 Mar 2020 in the Proceedings of the National Academy of Sciences (PNAS), authored by Forster et al. On 9 Apr 2020, the paper came under fire because the analysis done by the authors was not the right kind of analysis. Two researchers that I know of, Nick Loman and Andrew Rambaut, tweeted that this paper should not be taken seriously.

To quote Rambaut:

What really bothers me is that these authors pull down some data from #GISAID run it through an easy to use software package, make some very inappropriate choices for this virus and publish what they get out.

But the damage was done. The media picked it up, spiced it up a little, and let it run. Mavian et al. authored a perspective on the whole issue of trying to classify things, stating the following:

Our analysis clearly shows severe limitations in the present data, in light of which any finding should be considered, at the very best, preliminary and hypothesis-generating. Hence the need for avoiding stigmatization based on partial information, and for continuing concerted efforts to increase number and quality of the sequences required for robust tracing of the epidemic.

Having said all this, I recognize that there is some degree of usefulness in this A, B, and C type classification for the sole purpose of tracing the epidemic, but the usefulness stops right there. This type classification does not inform whether any of the types is more dangerous than the others. I would go as far as saying that if you got infected with one of the three types, you would develop antibodies that should be protective against the other two types. There is not enough evidence yet to say these three types are really that different.

While viruses are generally classified by how their mutations could cause differences in immune response/degree of immunity, the differences in genomic identity among the 3 “types” of SARS-CoV-2 do not currently prove that they have any functional consequence. Hence, the current classification of SARS-CoV-2 is not really meaningful.

What Should We Do Now?

Fact-checking is difficult and I get that. Everyone is trying to be an expert to score internet cookie points (I might be guilty of this too, but hey I have a Master’s degree in Microbiology haha!), muddying the water and causing confusion.

I have no magic pill, no panacea, no antidote to remedy this situation. What I cling to is a piece of advice from a scientist. She said:

If you prepare to say something, you should plan to be wrong. And when you think you are wrong, there is no shame in feeling that. And when that happens (that you know you are wrong), you better have plan already who to ask to get clarification.

I am in a unique position where I have immediate access to factually accurate information (I can ask my thesis mentors, or senior colleagues, or friends in the same field as I am), thanks to my proximity to people with domain-specific expertise.

Now, I am playing my part by bringing myself into closer proximity to the public, and I may never know whether I am doing a good or a bad job at that. All I hope is that I am not causing any further confusion, because that would not be cool.