Professor Sir Mike Stratton

27 Jul 2023

Prefer to listen to the interview? Head over to our podcast 'Conversations with COSMIC' to hear the full interview with Prof Sir Mike Stratton and follow us for new episodes coming soon!

Professor Sir Mike Stratton is now a Senior Group Leader in the Cancer, Ageing and Somatic Mutation Programme after having stepped down from his position as Director of The Wellcome Sanger Institute. His primary research interests have been in the genetics of cancer, with his early research focusing on inherited susceptibility. Mike was responsible for mapping and identifying the major high-risk breast cancer susceptibility gene BRCA2 and subsequently a series of moderate-risk breast cancer and other cancer susceptibility genes.

In 2000 Mike initiated the Cancer Genome Project at the Wellcome Sanger Institute, through which he discovered somatic mutations of the BRAF gene in malignant melanoma and several other mutated cancer genes in lung, renal, breast and other cancers. He has described the basic patterns of somatic mutation in cancer genomes, revealing underlying DNA mutational and repair processes. He is also a Fellow of the Royal Society (FRS) and was knighted by the Queen in 2013.

Mike was integral in the conceptualisation and subsequent building of the COSMIC database when it began in 2004. We caught up with him as he prepared to step down as Director of The Wellcome Sanger Institute to explore the history of COSMIC, how the genomics landscape has changed since the database started and exactly how it had intersected with his career.

Today, we're going to dive into your experiences with COSMIC given that you were there with us from the very start. Could you talk us through the beginnings? What made you come up with the idea for COSMIC?

These were the very early days of the Cancer Genome Project at the Sanger Institute. We are talking about sequencing tiny segments of DNA and harvesting tiny numbers of mutations, compared to what we could imagine today. Nevertheless, there were more mutations than we could keep in our heads, and more mutations than we could interpret, because at that time, there was only one database and this was of P53 mutations. There were certainly no databases for somatic mutations of multiple genes.

We needed to match the mutations that we were finding in our early screens of cancer genomes by sequencing with what already existed, and we couldn't do that, other than by searching through the literature for a paper that had found mutations in a particular gene. With no database aiming to be comprehensive, and the anticipation that over the forthcoming years, there were going to be orders and orders of magnitude more somatic mutations from cancer genomes, coming out of our projects at Sanger (and others elsewhere), navigating our way through all of those was going to be an increasing challenge and really beyond us going through mutations one by one. With this in mind, we decided to establish a database to hold the mutations that we were generating, and to bring in from the literature, the mutations that were already there so we could check the data that was coming out of our cancer genome sequencing and that of others against what already existed.

So this was the reason for doing it: it was completely pragmatic and absolutely necessary for us to begin to interpret our own data.

You've been a pioneer in mutational signatures and leader of the Mutographs project, an area of research that's really taking the scientific community by storm. It's one of the most popular downloads and pages on the COSMIC website, so for those who aren't familiar, can you briefly summarise this work?

Mutational signatures are essentially patterns of mutations. Mutations come in a number of different types. There are six subclasses of base substitution mutation: T to A, T to C, T to G, C to A, C to T and C to G. Those are the basic classes and we can elaborate on that classification. There are small insertions and deletions, there are rearrangements, there are copy number changes; there are all sorts of things that are happening in cancer genomes.

The fact is, what is happening in one cancer genome (i.e. the types of mutations that are occurring) is not the same as what is happening in the next cancer genome or the one after that. There are patterns of mutations that occur, and these patterns of mutations are called mutational signatures. Each mutational signature is a pattern that you can find in many cancer genomes, but not all, and sometimes, if the pattern is rare, it's only in very few cancer genomes.

Each mutational signature, by definition, has an underlying mutational process. This mutational process can range from an exposure to mutagenic agents, such as radiation or a chemical, to a defect in DNA repair, for example. So, by building up a set of mutational signatures across thousands of cancer genomes, you're essentially assembling a compendium of the mutational processes that have generated those mutations in the first place.
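The six base substitution classes mentioned above follow from a common convention: because DNA is double-stranded, a substitution is usually reported relative to the pyrimidine (C or T) of the mutated base pair, so twelve possible changes collapse to six. A minimal Python sketch of that bookkeeping is shown here; the function and variable names are hypothetical and are not part of any COSMIC tool.

```python
# Illustrative sketch only: names here are hypothetical, not a COSMIC API.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

# The six pyrimidine-referenced base substitution classes.
SIX_CLASSES = ("C>A", "C>G", "C>T", "T>A", "T>C", "T>G")


def substitution_class(ref: str, alt: str) -> str:
    """Collapse a single-base substitution to its pyrimidine-referenced class."""
    ref, alt = ref.upper(), alt.upper()
    if ref not in COMPLEMENT or alt not in COMPLEMENT or ref == alt:
        raise ValueError(f"not a single-base substitution: {ref}>{alt}")
    # A purine reference (A or G) is reported from the opposite strand.
    if ref in ("A", "G"):
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}>{alt}"


def count_classes(substitutions):
    """Tally an iterable of (ref, alt) pairs into the six classes."""
    counts = {name: 0 for name in SIX_CLASSES}
    for ref, alt in substitutions:
        counts[substitution_class(ref, alt)] += 1
    return counts
```

For example, `substitution_class("G", "A")` is reported as a C to T change, because a G to A change on one strand is a C to T change on the other. A per-genome tally from `count_classes` is the simplest form of the "pattern" that a signature describes.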

When you started out in signatures, did you ever expect it would have this level of impact?

Right from the beginning of sequencing cancer genomes, we were aware from the relatively limited amounts of data that were already available, that there were these different patterns of mutations in different cancer types. We also knew that those patterns related to likely exposures.

So, if we looked at the pattern of somatic mutations in lung cancers from smokers, they had a particular pattern with a predominance of C to A mutations. On the other hand, if you looked at skin squamous carcinomas that had been caused by ultraviolet light, they had a rather different overall pattern of C to T mutations. At this stage, each tumour was basically giving us one mutation, so we were having to aggregate data across numbers of cancers in order to see these patterns. Clearly what those patterns were telling us about were the likely exposures, in other words, the original causes of those cancers. Some of those patterns were relating to tobacco carcinogens (chemicals within tobacco smoke), others were relating to the DNA damage caused by ultraviolet light, others relating to defective DNA repair that is sometimes found in cancers.

The inspiring element of that, which was evident to us right from the beginning, is that those patterns took us back in time. They're like an archaeological investigation of the exposures that caused the cancers in the first place, and exploring those early stages of cancers in that way, seemed a real opportunity to understand the fundamentals of carcinogenesis. We always felt that this type of analysis of cancer genomes would be a very important readout of cancer genome sequences. Subsequently, when we started getting rather large numbers of mutations from individual cancer genomes, hundreds, thousands, tens of thousands, hundreds of thousands of mutations from single cancer genomes, we developed the mutational signatures framework. This depended on, as I've alluded to, doing whole genome sequences, finding enough mutations and then the development of mathematical approaches to deconvolute the signatures from each other.

When we started, it was uncertain how complex that landscape of mutational signatures was going to be. Were there going to be 10 signatures? 100 signatures? 1000 signatures? We really didn't know until we developed those algorithmic approaches to the huge volume of data that came from whole genome sequencing.

As it turns out, it is quite a complex landscape. There are many different signatures, let's call it about 100 base substitution signatures, and clearly there is plenty of complexity to explore. Many are still unknown. It is giving us an insight into that compendium of mutational processes that are contributing to cancer, and is telling us which of those we don't understand or recognise and, subsequently, potential causes of cancer that we don't understand and don't have evidence for.
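The interview does not spell out the mathematics behind deconvoluting signatures from each other. One widely used family of techniques for this kind of problem is non-negative matrix factorisation (NMF), which approximates a table of mutation counts (mutation classes by samples) as the product of a signatures matrix and an exposures matrix. The pure-Python sketch below uses the standard multiplicative update rules and is an illustration under that assumption only; the function names are hypothetical and it is not the pipeline used for COSMIC.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


def frobenius_error(v, w, h):
    """Sum of squared differences between v and the product w * h."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, n_signatures, n_iter=200, seed=0, eps=1e-12):
    """Factorise v (classes x samples) into w (classes x signatures)
    and h (signatures x samples), keeping every entry non-negative."""
    rng = random.Random(seed)
    n_rows, n_cols = len(v), len(v[0])
    # Strictly positive random starting point.
    w = [[rng.random() + 0.1 for _ in range(n_signatures)] for _ in range(n_rows)]
    h = [[rng.random() + 0.1 for _ in range(n_cols)] for _ in range(n_signatures)]
    for _ in range(n_iter):
        # Update exposures: h <- h * (w^T v) / (w^T w h)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n_cols)]
             for k in range(n_signatures)]
        # Update signatures: w <- w * (v h^T) / (w h h^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(n_signatures)]
             for i in range(n_rows)]
    return w, h
```

Here the columns of `w` play the role of signatures and the rows of `h` the amount each signature contributes to each sample. The number of signatures has to be chosen by the analyst, which is one reason it was unclear at the outset whether there would be 10, 100 or 1000 of them.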

With this complex landscape constantly unfolding, what are your hopes for the future of this tool for mutational signatures? And how do you hope they'll be used?

I think we are moving forward now, to a complete compendium of mutational signatures. It'll never be absolutely complete, because you can't exclude the possibility that there is a rare exposure somewhere in a small number of cancers, and until those cancers enter the sequencing pipelines around the world, we won't see that. We are moving towards a more or less complete compendium of the mutational signatures that are operative in cancer, but also in normal cells. That in itself is going to be a staging post. This leads in a number of directions, such as basic biological understanding of mutation formation.

Mutagenesis is a fundamental biological process. None of us would be here without mutagenesis. Our understanding of it, of what is happening in the natural world, has been quite rudimentary, so we need to understand it better. Generating that compendium, that full repertoire, is the first stage of research that will tell us about all the mutational processes operating in humans, and actually across the whole of the tree of life as well.

One of the outputs, which I've already mentioned, is that many of the mutational signatures we are already seeing don't have a known cause. It is providing us a way of looking at the exogenous exposures and lifestyle factors that we believe must be contributing to cancer incidence across the world and will continue to do so. Almost every type of solid tumour varies in its incidence across the world, and the only real way of interpreting that is to say: there are different lifestyle and exogenous exposures that are causing those differences, and (at least in principle) some of those exposures, if they are mutagenic, should be detectable using the mutational signatures approach. That is the underlying principle of the Mutographs Cancer Grand Challenge, and of the subsequent studies that we hope will follow it. It is absolutely dependent on getting as many genomes from cancers or normal tissues as possible, deconvoluting the signatures out of those, and then potentially associating those with those differences in cancer incidence around the world. In Mutographs, and in other studies, we are finding quite common exposures in some parts of the world; we have no idea what they are, but they are geographically limited.

So that's one big area that mutational signatures will be critical in exploring. In addition to that, some of the defects of DNA repair that occur in cancers offer us opportunities for the activity of certain therapeutics, for example, PARP inhibitors and platinum agents. These work well in cancers which have defects in homologous recombination based DNA repair, those associated with defects in BRCA1 and BRCA2. Those defects in DNA repair leave their own signatures on cancer genomes, which can be detected by sequencing those genomes and matching to what we have in the COSMIC database. This can then tell us 'this is a cancer which has got a defect of homologous recombination repair that you would find in a BRCA1 or BRCA2 mutant cancer and therefore, this cancer could be treatable with PARP inhibitors or other such chemotherapeutic agents'.

That's one other way in which mutational signatures can be used, but there are multiple applications that people have developed over the last few years and continue to develop.
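The matching step described above, comparing the pattern seen in a tumour against reference signatures, is often done with a similarity score. The sketch below uses cosine similarity over the six base substitution classes. It is only an illustration: the reference profiles and their names (`sig_X`, `sig_Y`) are invented numbers, not real COSMIC signatures, and published catalogues use finer categories than six classes.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-negative profiles."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty profile matches nothing
    return dot / (norm_a * norm_b)


def best_match(profile, references):
    """Return (name, score) of the reference profile most similar to profile."""
    scores = {name: cosine_similarity(profile, ref)
              for name, ref in references.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical reference profiles over C>A, C>G, C>T, T>A, T>C, T>G.
REFERENCES = {
    "sig_X": [0.60, 0.10, 0.10, 0.10, 0.05, 0.05],  # invented, C>A dominated
    "sig_Y": [0.05, 0.05, 0.70, 0.10, 0.05, 0.05],  # invented, C>T dominated
}

# Invented mutation counts for one tumour, in the same class order.
tumour_counts = [10, 5, 120, 20, 8, 6]
name, score = best_match(tumour_counts, REFERENCES)
```

Cosine similarity ignores the total number of mutations and compares only the shape of the profile, which is why raw counts can be compared directly with normalised reference profiles. A real tumour usually carries a mixture of signatures, so a single best match is a simplification.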

Could you share an example of how you use COSMIC to help answer a question in your research?

I certainly use the mutational signatures part of the database regularly! Further to this, I use COSMIC as my first port of call in interpreting a set of mutations that I've found from a project on a particular type or set of cancers, in order to begin to understand whether the mutations we're looking at are drivers or not. For 20 years, I've logged on to an individual gene, checked the profile of mutations from our data against the mutations that are on COSMIC, and checked the mutations at each different amino acid residue across the protein encoded by that gene, checking whether the mutations we have found are frequently reported and whether they're suggestive of being driver mutations or not.

In the old days, I used to spend whole afternoons just working my way through a series of genes without anything necessarily from our own projects! Looking through gene by gene, looking at the patterns, the distribution of the mutations across the gene, hotspots of mutation in some parts of the gene or the protein, working away through those, letting the data speak to me as to whether mutations in the particular gene I'd looked at through the course of the afternoon were contributing to cancer development. So I immersed myself in COSMIC for good long periods of time, shutting my door and almost cutting myself off from the rest of the world!

Losing yourself in COSMIC data can be slightly addictive once you dive in! Of course, all of this data comes from COSMIC's manual curation process, which is embedded as one of our core strengths and values. Increasingly though, we are facing challenges regarding AI or machine based processes and whether they could be a better solution. I'd be interested to get your take on why cancer curation, and specifically manual curation, is still so important?

It is useful for us to have the aspiration in COSMIC that we have had right from the beginning: to register every somatic mutation ever found in cancer or in normal cells. We may never achieve it, but let's have that ambition in our minds. The somatic mutation data that comes to us comes in so many different forms, with a lot of different types of metadata, and therefore it lends itself at times to being better curated manually. That does lead us to think about whether we, collectively, should or could develop a set of standards for the way that mutations are registered, written about and documented in papers. That would make it easier for us to automatically upload them. The fact is that we do have some standards in place for larger datasets (e.g. whole genome sequences, whole exome sequences, things from TCGA or from ICGC), and of course, these do make critical contributions to COSMIC, but there are a large number of small studies, still studying a few genes, where those standards are not adhered to. These are still very important studies because often they are directed at particular cancer genes and driver mutations, and so a higher proportion of the mutations within them are biologically active (as opposed to the wealth of mutations in a whole genome sequence that are just passenger mutations). These small studies are also often associated with rich biological data about the patient and about the cancer, so they are still very important studies to bring into COSMIC. From the perspective of our users, the fact that a particular mutation is associated with a rare type of cancer that has only had 20 samples sequenced is an important piece of information to somebody out there, and an important piece of information for us to get through to that broader audience.

I can't, at this moment, see a way of achieving this through automated approaches, as opposed to manual curation. We will continue to do that manual curation, both for the mutational data and the metadata, for as long as it's necessary to achieve that aspiration of having every mutation in COSMIC.

Of course, this isn't the only challenge that databases are facing today. Databases aren't exactly considered ‘exciting’ science at the moment, and due to this, they face funding challenges. In your opinion, is COSMIC as relevant now as it was when you launched it? And how does it continue to contribute to the field of cancer research?

I absolutely do believe it is as relevant; in fact, it is more relevant now. When we started, it was relevant to a relatively small group of cancer genomicists who were exploring the cancer genome. Today, it's relevant to everybody and anybody working clinically, or in cancer research. At some point or other, everyone in these fields needs to understand the mutations in the cancer or gene they're working on, and will want to know: is this particular gene mutated? And how? What is the pattern of the mutations seen there? And that's where COSMIC, and of course other databases, have their importance.

You say correctly, that databases are not necessarily exciting, and so by allocating a body of money to them, you're not necessarily buying into the possibility of a really amazing discovery that's going to emerge in three years. Despite this, I think everybody agrees that they are obviously critical in a world of exponentially increasing amounts of data. As far as we can tell, that exponential increase is going to continue for quite some time. We need to be able to hold that data, organise it, and present it if people are going to be using it, and they do. People use it all the time! It will become ever more embedded in just the thinking about cancer, so they certainly need to be funded.

They need to be seen as and funded as infrastructure, rather than as discovery projects. We recognise in the world of research that you do need infrastructure, you need infrastructure of all sorts, the internet, buildings, laboratories, you need all of these things in order for research to take place. Databases are part of this as globally accessible entities that store data such as the European Bioinformatics Institute, a place that has made it its business to provide that sort of storage.

The problem with databases, however useful they are, is that they need support forever. That's an unusual perspective for funders to have to take, because part of their ability to use, deploy and redeploy their funding is to say, ‘Here's some money for somebody to do some work for five years. If it's good, we may give more; if it is not so good, we'll send it elsewhere’. Databases need to be supported forever and there has been some recognition of that; many of the biggest funders have ‘bitten that bullet’, but as the databases get bigger and there are more of them, they require more maintenance. So there is always this unease about funding new databases, and even maintaining the funding of old ones.

That's definitely where COSMIC’s model comes in, where we are funded through the commercial customers who make profit through our data. They truly help us to keep running long term. It's one of our core values to be built for longevity, so people can really trust that we will be here as long as we're needed.

We'd be remiss if we didn't take some time to reflect on the announcement that you're stepping down from directorship of the Wellcome Sanger Institute. Can you share some reflections with us about your time leading the Sanger Institute and in particular, any standout moments for you?

The core theme and cultural element of Sanger is large scale genomics. This embodies doing grand projects that most other people and organisations in the world cannot do, or even imagine.

That's the thing that Sanger and other organisations like it stand for: doing things that most others cannot do. That is key; it defines us, and it defines which projects we should be working on. Of course, we started that way: the Institute was founded in order to sequence the reference human genome, and that was something that we did in collaboration with five, six other groups around the world. We did a third of the human genome at that time, and it was an absolutely monumental project. This absolutely defines that culture and that mentality, and what goes with this is having large groups of people working together as part of a pipeline of activities. From bringing in samples to data generation to computational analysis, the whole picture was established in the days of the human genome.

This element of Sanger has reiterated itself over the last 23 years since the human genome draft was announced, and each time one embarks on a new project, there's a pulse of excitement around the place that we're on to another big horizon that we can see in the distance, illuminating a landscape that is currently in the dark, that the genomes are going to put into the light.

Embarking on sequencing cancer genomes absolutely embodies the same mentality. The number of things we didn't know about cancer genomes was quite extraordinary. When we started sequencing them, even at relatively trivial amounts of DNA sequence, we would suddenly find a mutation and it would cause great arguments internally, because although these mutations weren't that common, they were coming out reasonably frequently. “Surely, there couldn't be that many driver mutations?” some of us said, and others said, “Well, surely there can't be things like passenger mutations.” What are now completely intuitive concepts to us were things that we were discovering on a day by day basis by embarking on these very large scale projects.

If I have to choose one that evokes that same sense of excitement and challenge, it was what we encountered in the last three years with Coronavirus, where we were all overtaken by a pandemic. In our parasites and microbes programme at Sanger, we had the view that genomic surveillance of infectious microorganisms as they spread would be a key way of understanding and managing the organism as it spread and evolved. The notion of doing it on this scale was not something that was really thought of, or imaginable, by others.

When the pandemic hit, we didn't know how we could make it work, but it's the usual combination of logistics, which is not necessarily part of ordinary science, and academic science. Logistics and iterative technologies are applied, then reapplied, and improved and improved over time, ensuring data flows out to the right people; in this case, it was the health security agency. As a result of this, Sanger ended up sequencing more than 2 million of the 10 million Coronavirus genomes that were sequenced in the world over that time. 20% of them were sequenced at Sanger in the UK, from that process that was put in place. So it's about vision, a sort of grandeur, the sense of what you need to put your elbow to if you're going to achieve, then the benefits, what you harvest and reap by getting data at that sort of scale, more or less in real time.

Could you summarise the challenges addressed by cutting edge cancer research specifically, and how that's changed during your time at Sanger?

I think that the goals of cancer research essentially remain the same.

We need to understand what the causes of cancer are, how cancers work, the machinery inside them etc. Then finding those Achilles heels that allow us to develop new therapies, detect cancer early and prevent it. Those were things that were talked about at the beginning, those are the things we have talked about for the last 50 years and those are still the things that are in front of us.

Of course, some obstacles have been overcome. We can now sequence whole cancer genomes, essentially at will, to find mutated cancer genes and mutational signatures; these were completely unimaginable capabilities 20 years ago. In 2003, the early days of the Cancer Genome Project, we could not predict that we would get to this scale of throughput. We would have been very happy to be told that's where it would go, but we had no way of predicting that! So that ability to see into the genome of any cancer, more or less at will, is being taken advantage of by Genomics England on essentially all paediatric cancers in this country as part of the routine diagnosis of an individual's cancer.

So that's where we are today. Alongside that, genomics, mutations, driver mutations and cancer genes have embedded themselves in the minds of all people working on cancer, whether that's in research, clinics, or in therapeutic development. So those are the obstacles that have been overcome, that's where things have changed for sure. However, we haven't found the causes of all cancers. We haven't found treatments or preventions for all cancers. Although there have been remarkable advances in that area, the search continues for those critical insights through genomics, through building up ever larger sets of genomes that provide us with ever greater powers of inference, but also through other approaches.

Thinking of some of those who may be listening and just starting out in their career, is there any advice you'd like to share?

It is a reasonable question: What do early career researchers feel as they embark on a career in research?

There is almost always deep uncertainty and countless questions in the minds of early researchers. How do you think of a new and interesting idea? How do you make a new, interesting discovery that will make a material contribution? How do you, as one small human being, have the wherewithal to make a substantive contribution? When you look at that person over there, and you think they're more intelligent, or better organised, or have better computational skills than yourself, how do you do it? Well, you know what's inside of you, you know your weaknesses, and often you under assess your strengths. Everybody starts out in that way. Nobody starts off confident, nobody starts off knowing where they're going. We all somewhat fumble around in a fog, working out what we want to do, and what we think is interesting.

There is no single route to success as a young scientist. So whether you have 10 ideas a day or one idea a decade, that's how you as an individual work, and you have to learn to believe in those different ways of working. Your individuality is what you have to contribute to science. You can't pretend to be a different scientist, you can only be yourself as a scientist and work in the way that you are comfortable with. You have to believe in that. Even though most of the time, it's rather difficult to believe in oneself in the face of the welter of stuff that's going on elsewhere, that's the thing one has to hold on to. That reassurance that everybody starts out that way, when someone alluded to it, I found helpful to know.

We'd love to know what's next for you. Would you continue to use COSMIC in the future?

Stepping down from being director of Sanger, I've thought about what I might do from here. The thing that still fires me, still excites me, still interests me, still really does get me up in the morning, is doing fundamental genomic science about cancer cells and normal cells. You give me a good mutation to look at, that is asking a question of me, I can spend a couple of days worrying over that single mutation. That's what I will now have more time to do! In doing that, I'll absolutely be looking at COSMIC every day in order to help myself come to some conclusions about those mutations.

Any chance for a return to COSMIC?

Well, I think the contribution that I made to COSMIC is probably wildly outdated now! I used to meet the team every week or so, and most of the time, they didn't bother me with the mutations. They bothered me with questions like “this cancer has been given a strange name. Can you work out, Mike, what they actually mean by this?” That's what we used to meet weekly to discuss, but after a while, they found their own ways to do that. So I don't think I'm much use to COSMIC, but COSMIC is a lot of use to me.

And I'd say thank you so much for your contribution to COSMIC and taking the time to speak with us. Thank you very much for discussing with me, it's been a pleasure. Thank you.
