Aired:
August 23, 2024
Category:
Podcast

Swimming in the Deep End of AI

In This Episode

Explore the cutting-edge intersection of AI and life sciences with the 'Life Sciences DNA' podcast, sponsored by Agilisium Labs. In this episode, host Dr. Amar Drawid engages with Brendan Frey, Founder and Chief Innovation Officer of Deep Genomics, to discuss how AI is revolutionizing therapeutic development. Discover how Deep Genomics leverages deep learning to analyze massive RNA biology datasets, paving the way for innovative therapies. Gain insights into how these advancements are unlocking the vast potential of RNA biology, enabling predictions of therapeutic outcomes with unprecedented accuracy.

Episode highlights
  • Explore the transformative role of AI in drug development and therapeutic innovation.
  • Focus on Deep Genomics' pioneering use of deep learning in RNA biology analysis.
  • Highlight how AI technologies, such as machine learning and generative AI, are enhancing drug discovery.
  • Delve into the strategic decision to connect the genome with the molecular determinants of disease.
  • Examine how these innovations are shaping the future of personalized medicine and healthcare.
  • Provide an overview of AI's role in revolutionizing the life sciences industry.

Transcript

Daniel Levine (00:00)

The Life Sciences DNA podcast is sponsored by Agilisium Labs, a collaborative space where Agilisium works with its clients to co-develop and incubate POCs, products, and solutions. To learn how Agilisium Labs can use the power of its generative AI for life sciences analytics, visit them at labs.agilisium.com.

We've got Brendan Frey on the show today. For viewers not familiar with him, who is he? Brendan Frey is an entrepreneur, engineer, and scientist. He is the head of platform, chief innovation officer, and founder of Deep Genomics, and also co-founder of the Vector Institute for Artificial Intelligence. He's made fundamental contributions in the fields of deep learning, genomic medicine, and information technology.

And his work with Geoffrey Hinton on the wake-sleep algorithm actually helped to launch the field of deep learning. He's also advised the leadership of Microsoft Research on machine learning and deep learning technology, and he was a member of its technical advisory board. And what is Deep Genomics? Brendan was a pioneer in developing AI systems that could accurately predict normal and pathological cell and genome biology

to facilitate biomedical breakthroughs. This work led to the founding of Deep Genomics in 2015, with the first AI system for predicting pathogenic mutations, identifying therapeutic targets, and developing new therapeutic candidates for patients with genetic disorders. Well, what are you hoping to hear from Brendan today? The company is focused on RNA therapeutics. So I'd like to understand why RNA?

And it's also a big AI company, so I would like to understand the AI approach that it's taking. Well, before we begin, I want to remind listeners that if they want to keep up on the latest episodes, they should hit the subscribe button. If you enjoy what we do, please hit the like button and share your thoughts in the comments section below. With that, let's welcome Brendan to the show.

Brendan, thanks for joining us. We're going to talk today about RNA, AI, and Deep Genomics' efforts to use its AI platform to create a pipeline of therapies. When you say AI for drug development, can you talk about what you mean? Do you mean machine learning, deep learning, automation, generative AI? Can you talk a bit more about that? Yeah, sure. So first of all, I'm

pleased to be here. Thanks for inviting me to be on the program. So when I talk about machine learning or AI for drug discovery, I do mean machine learning, primarily machine learning. We also create AI to sort of put the different machine learning pieces together. And in our case, it's directed at molecular and cell biology. A lot of people, you know, may be looking at AI or machine learning for electronic health records or, you know, making clinical trials more efficient or things like that. At Deep Genomics, our focus is on the biology.

To make a good drug, you need good biology and you need a lot of other stuff, but our focus is on the biology. And when you are talking about machine learning, are you using, I mean, there are a lot of methods like the support vector machine, like a lot of tree-based methods, but are you also using deep learning, like the neural networks or so? So can you tell us a bit more about that? Yeah, sure. You know, we're definitely using deep learning. In fact, back in the late 1990s,

I was part of the group here in Toronto with Geoff Hinton and Yann LeCun, with Yoshua Bengio and others, who were working on what we now call deep learning. Back in those days, we didn't have enough data, we didn't have enough compute in order for it all to work. But yeah, I was at the ground level when it comes to the innovation and the creation of deep learning. A lot of people chose to focus on things like computer vision, speech processing, and text analysis in the early 2000s.

And I decided to focus on connecting the dots between the genome and the molecular determinants of disease and how we could treat those effects. Why did you decide on the drug discovery aspect? Yeah, so at that point in time, as I said, we were all working on what's now called deep learning. And it was clear that we needed huge amounts of data and also a lot more compute. I remember as a graduate student,

it would take a month or two months to do one experiment on a small data set with a pretty small model. So iteration time was just too slow. But I could tell that in these different fields, like computer vision, speech analysis, and text analysis, we would see huge amounts of data. Cell phones were starting to circulate, and devices were being used to record images and text and so on. But then in 2002, my wife at the time and I found out that the

baby she was carrying had a genetic issue. We saw a genetic counselor, and the genetic counselor said that it could be nothing or it could be a disaster. And so my partner and I were faced with this pretty difficult information and had to try to make a decision of what to do with information that was not very helpful. And so I kind of realized that, okay, we can sequence the genome, we have the ability to sequence people's DNA,

measure mutations, but there's a huge gap between our ability to do that and to interpret the consequences of those mutations in the context of disease and then also to identify ways of treating those mutations. And that's really what got me into the field. And so from 2002 onward, that was my focus. And between 2002 and around 2015, my group made some of the most fundamental contributions in the field of AI or deep learning for genome biology and drug discovery,

publishing a lot of papers in leading journals like Nature and Science. And then in 2015, the technology was advanced enough that we spun out Deep Genomics. We'd love to know more about the progress and the innovations that your team made up to 2015. So can you talk more about that, please? Yeah, sure. So when we launched the company in 2015, our machine learning and AI could evaluate mutations and predict their mechanism and also predict their

probability of pathogenicity, and do so more accurately than a typical molecular geneticist. So our success rate was higher than a typical lab that you might send your sample to. And that was really the first signal, around 2015, that these deep learning systems were advanced enough that we could actually have an impact on medicine and make a difference. And back then, if you rewind the clock to 2015,

the kinds of models we were using were state of the art, and massive models. But the models we're using now at Deep Genomics, it's all the latest architectures like Transformers and Mamba, ResNets and ConvNets, different kinds of architectures like that. And our machine learning architectures are truly huge. We're talking about tens of billions of parameters or even a trillion parameters. So really big models now.

Okay, and we're going to come back to that. But I wanted to ask you more about how AI is changing the drug discovery and development process, and to what extent, where are we in this revolution? Yeah, that's a great question. And actually, I would say that really the first 10 years of AI for drug discovery have not delivered according to expectations.

And the good news is we know the cause, and recent AI innovations will overcome it. But I think it is worthwhile to just take a look at kind of the whole field, the whole community of AI drug discovery companies, and what's going on there. And so in terms of examples of failures, I'll talk about the field more broadly and then talk about Deep Genomics. Andrew Dunn had an interesting article in Endpoints News last fall.

He talked about Recursion Pharmaceuticals, Exscientia, and BenevolentAI, who have all seen failures or setbacks in their clinical programs. One quote is, over the last year plus, the first handful of molecules created by artificial intelligence have failed trials or been deprioritized. And I thought that was a great article, actually, kind of a level set in terms of how AI is doing. Now, in the words of Exscientia's CEO, he said, we don't want to be one of those companies that keeps pushing a program forward

because that's the only thing they have. So first of all, I think it's important to laud these companies for advancing programs to the clinic and then also realizing when they're not working out. But I guess the point that's important for all of us to acknowledge is that when introducing a new disruptive technology, it's important to learn from failures, conduct root cause analysis, and adjust the technology to work better. And we've seen this over and over again, from the invention of the light bulb,

which required many iterations to get right, to things like computer storage devices, with all sorts of iterations on computer disk drives until the technology was perfected. And I think this is true for Deep Genomics and also for others using AI for drug discovery. Now, just to kind of bracket your comment there, I will say an advantage of the AI approach compared to other, traditional approaches

is that because we select our targets and our molecules using a process that's comprehensive, reliable, and reproducible, we have a wonderful opportunity to trace back failures and figure out what went wrong and fix it. By comprehensive, I mean that AI is used to numerically evaluate thousands or even millions of possibilities and quantitatively evaluate them too, put numbers on it.

By reliable, I mean that AI does precisely what you ask it to do. So there's not a lot of variability in terms of which question it's answering. It will answer the same question over and over again. And by reproducible, I mean that if you run the same AI a second time on your data, you will get the same answer. Of course, if you improve the AI, you'll get better answers. But the point is, if you run the same version of that AI system, you'll get the same answer. And these three things are things that humans aren't actually so great at: reliability, reproducibility,

and being comprehensive, but AI is. And so it actually gives us a really good framework, all of us, the whole community, for tracing back failure, tracing back errors, and figuring out what's going on. And Deep Genomics, of course, we have our own special take on that compared to others. OK. And why have you been focusing on RNA? So can you talk more about why RNA is a...

There are some RNA therapies out there I wanted to ask you about as well, because the big RNA story recently was the COVID vaccines that came out. These were RNA therapies. So what made you focus on RNA? Are they better for analysis or so? Yeah, great question. So first of all, just a couple of other examples of RNA therapies, to answer the middle part of your question there. You mentioned the COVID vaccines, which are obviously vaccines.

Another couple of examples would be Spinraza for spinal muscular atrophy from Biogen and Ionis, which has saved thousands of lives of babies born with spinal muscular atrophy. And that's an oligonucleotide, so it's a sequence of 20 letters. It binds to RNA and manipulates RNA, manipulates gene regulation by doing that. Another example would be patisiran from Alnylam

for polyneuropathy and that's also a different kind of oligonucleotide but also an RNA therapeutic. So those are a couple examples. Now, as to why we're focused on RNA biology and this does reach back to when I got into this problem in 2002 in terms of orienting deep learning at genome biology. And genome biology is basically kind of if you look at the genome sequence of DNA letters, everything that flows from that.

Right. And so things like gene regulation, so transcription; RNA processing, so splicing and polyadenylation, processes like that; and then also translation into proteins. So that's kind of like genome biology. And the important thing is that it's all digital. RNA sequencing data sets are digital information, sequences of letters. DNA is digital, a sequence of letters. RNA is digital, a sequence of letters. And so really we think of it as being like the software of biology.

You know, the DNA in different cells is pretty much the same, but obviously your brain cell is very different from a liver cell. And the reason is RNA. The RNA is expressed differently. It's digital. It's really the software of biology. The other reason why it was attractive to me as an AI researcher is there's more data when it comes to RNA biology than any other area of biology, whether it's proteomics or whether it's...

intercellular signaling or something like that, or presentation of antigens on the surface, things like that. So we have more data. You mean like there's a lot of expression data, gene expression data, right? That's what you're talking about? Yeah, and it's not just gene expression data. It's also, for example, data profiling how proteins bind to RNA, or data profiling structure, structural components of these molecules, and

things like microRNAs, which are another kind of molecule that's floating around in the cell, how those interact with RNA. So there's lots and lots of data, over 100 petabytes of data, just truly a huge amount of data. So those are really the reasons that I and Deep Genomics are focused on RNA biology. Having said that, we are branching out to look at other areas, branching out to develop models for DNA, for protein, for

cell state modeling and things like that. Okay. Yeah, I can see that. I mean, having worked in this area, a lot of times we're doing a lot of experimentation with the gene expression, et cetera, but then we're saying in the end, okay, well, it's going to become a protein and then we focus on the protein as a target, right? Rather than RNA. But what you're doing is that, okay, well, you don't want to necessarily make that assumption and you're

focusing on the RNA itself because that's the data that is available rather than making any assumption on that. Is that right? Yeah, yeah, and actually you touched on something there that I forgot to mention, which I think is really poignant, which is you can treat the root cause of the disease at the RNA level. And so, for example, small molecules that interact with protein, it's kind of like...

It's kind of like you've got an engine and it's not working properly so you smash it with a sledgehammer. There's really only one thing you can do with it, which is kind of wreck it. Whereas if you want to do something, like if there's a component in that engine that needs to be swapped out or changed, that's a more subtle change. Can't do that easily with a small molecule. Whereas if you look at RNA biology, that's the kind of thing you can do. And so for example, spinal muscular atrophy, there's a mutation which causes one of the copies of the gene to not work properly.

If you can adjust the effects of that mutation by precisely manipulating RNA biology, you can produce a functioning protein which then can be used to treat the disease. And so it's a bit like, you know, you've got this engine and now we've got a whole bunch of different kinds of tools and screwdrivers that we can use to manipulate all the different parts of that engine to make it perform the way we want it to perform rather than just clobbering it with a small molecule. Okay. Okay.

Let's talk about Deep Genomics now. So can you tell us more about how the Deep Genomics platform technology works? You've been talking about AI and, you know, deep learning or machine learning aspects. So can you give us a sense of what is done at Deep Genomics? Yeah, great. So as I said, when we spun it out in 2015, we had machine learning systems that could accurately pinpoint the genetic determinants of disease and then also

predict certain mechanisms of action for potential molecules. For example, I mentioned splicing for spinal muscular atrophy, and our machine learning systems were really good at predicting splicing. Now, I'll put all of that in the context of what I started saying before, in terms of things not having gone as we'd hoped for the entire community for the last 10 years. Now, Deep Genomics is different, as we were just talking about, compared to say Recursion, Exscientia, BenevolentAI, and others.

Because of our focus on RNA biology, it means that we actually have a really good opportunity to understand where our predictions are going right and where they're going wrong. And the reason for that is because it's always in the realm of genome biology. We can measure things, we can do experiments, we can find out what's going on. So for us, rather than identifying drug candidates using our AI and then rushing to the clinic, we took our AI predictions

and comprehensively evaluated a wide range of predictions using in vitro and in vivo models. And I guess I'll ask you to kind of reflect on that for a minute, because a lot of biotechs, what they do is find a molecule that has some evidence and then start to go into preclinical research and IND-enabling studies, to kind of move it toward the clinic. Our approach was different. Our approach was, because these predictions are testable in the lab,

let's do a large number of in vitro and in vivo studies to do validation. Just for our audience, can you explain to them what in vitro and in vivo are, please? Yeah, sure. Great. And so one of the nice things, as you mentioned, about RNA biology and RNA therapeutics is that they're very precise in terms of fixing a particular biological mechanism. And so now we can test those using cells in a Petri dish, or

what we really do is use robotics, so we're looking at many thousands of cells, thousands of wells. And then also in vivo, which means putting them into mice and then measuring what's going on in the mice in terms of the RNA biology. So this gives us an opportunity to ask the broader question. So instead of asking the question, does this one molecule look like a good drug in the clinic? We can ask the question, for this AI system for generating molecules,

overall, how is it looking? Overall, how well is it doing? And so the goal is to understand the characteristics of the predictions. What I mean by that is: are they accurate for some tasks but not others? As you know, for a drug to be successful, there are a lot of things you've got to get right. You've got to have good target genetics, you've got to have good target biology, you've got to have a mechanism of action, your drug has to activate that mechanism of action.

You want your drug to not have off-target effects, for safety and tolerability. You want your drug to be manufacturable, and all these things you've got to get right. And so we can ask the question, was our AI accurate for some tasks and not others? Is the AI consistent? Does the AI get better with more data? You know, the naive assumption is that with more data the AI will get better.

But you'd be surprised at how many AI systems out there don't actually get better with more data. And so asking questions like that. The other thing I think is worth noting is that when it comes to RNA therapeutics and whether it's oligonucleotides or whether it's RNA editing or DNA editing or gene therapy or mRNA, the vaccines you mentioned, the space of possible drugs is very sparse.

What I mean by that is like digging for gold. There's just very few nuggets, very, very few targets and molecules that are actually going to succeed and work. And because the space is so sparse, typical ways of using AI don't really quite apply. So most AI systems are focused on kind of the average case. So they make predictions that are kind of accurate on average. Whereas for drug discovery, you actually need AI systems that can find the outliers, the exceptional cases,

ones that are quite different from the average. So we can ask all those kinds of questions, and that's what we've been doing for the last two years to improve the AI platform. And that's resulted in us taking a very different approach. And how do you solve the problem that, I mean, in biology you just have a limited number of samples and you have a lot of variables, right? Like the entire set of genes, the entire set of even the mutations, right? So how do you try to solve that problem?

Yeah, that's a really good question and it points to really how our focus has changed in the last year and a half. So two years ago, we would proudly talk about how Deep Genomics had 40 different machine learning models. We had one machine learning model for each of these different things you were just talking about. So we had one machine learning model that was good at predicting transcription. We had one machine learning model that was good at predicting splicing. We had one machine learning model that was good at predicting...

polyadenylation, all these different processes that go on. We had a machine learning model that was used for predicting off-target effects. And that's actually pretty common in terms of what's been going on in the AI community. When I talk about the root cause analysis and what's going on in the entire field, if you look at the last 10 years, you'll see a lot of companies having built bespoke AI systems to solve just sort of one narrow problem within the whole range of problems you need to solve to have a good drug.

So you'll have one company that's focused on AI for small molecule protein interactions. You'll have another company that's focused on analyzing genetic data to find targets. But they all kind of have specialized AI. Now, Deep Genomics, we were excited because we had 40 different machine learning models. But what we learned, and this is the key learning for Deep Genomics, is that was just untenable. 40 different models, each of those had its own training set.

Each of those had a carefully engineered pipeline for generating training data and then a pipeline for training the model. Each of those models had a team that had to coalesce around that model to build it. Each of those models had to be validated on its own. And then you had to have experts that could apply these 40 different models to these different problems and then interpret all of the outputs. And so what Deep Genomics has gone through, and I think you're going to see this in the rest of the community as well,

is kind of like the ChatGPT for drug discovery. And what I mean by that is, instead of having 40 different models, Deep Genomics now has one unified model for RNA biology, which we call BigRNA. And it's a single model, not 40 models, a single model. It's trained on one really, really big data set. We're talking about over a trillion training cases. It's a very, very large model. As I talked about before, it's

on the order of 10 billion parameters, and we expect that to scale up to about a trillion parameters in the next eight months or so. This is truly huge. These models are starting to have a size that's comparable to the human brain in terms of the number of synapses, if you think about parameters and synapses. And so BigRNA for us, the single model, this one foundation model that can do all of these different tasks,

has been transformational. It means we can scale up the amount of data and retrain the model. Instead of having to generate 40 different data sets, we have one data set; we scale it up, 10 times more data, and we get a better model. We can also scale up the complexity of the model and use more sophisticated machine learning techniques. We talked about Transformers and Mamba and other kinds of models. And we can evaluate it using one suite of evaluators applied to that single model.

And so we can ask the question, is it getting better for drug discovery? Is it getting better at identifying targets? Is it getting better at designing therapeutics? Gotcha. See, as you said earlier, for something to become a really good drug, there are a lot of things, a lot of checks that you need to have. And let's say you're working to identify a drug for a specific disease.

And then let's say you identify some of these potential candidates. But then one of the later parts is about how the body is going to handle that, a lot of the pharmacokinetics and pharmacodynamics. Is that kind of common for all diseases? And do you have a kind of common model or a sub-model to identify whether this will be handled well in a body?

Yeah, and so you're right, there's many different things you have to get right and you're alluding to one of them. And what I would say is, I guess I have two comments there. First of all, Deep Genomics, BigRNA for example, is focused on RNA biology. And so there are of course other areas like protein biology or DNA. And so to address that, I think what we're going to see...

is companies needing to go even broader than just sort of a foundation model for one thing, but a foundation model for RNA, DNA, proteins, cell states, which is what we do at Deep Genomics now. And so our mission at Deep Genomics is really to have the best foundation models for everything. And so we look across the entire field and at academic labs that are generating new foundation models, as I mentioned, for protein or for cell states.

And then we incorporate or ingest those into Deep Genomics, into our mega foundation model, if you like, a foundation model that's composed of all these other foundation models. That's the first way to answer your question. But the second part has to do with adapting those models to solve custom problems. And so if you're looking at a particular delivery system, or if you're looking at a particular disease, there are just different

contexts in which you're going to want to make predictions, and the foundation model needs to be tuned so it does a better job. For example, if you're looking at a neurodegenerative disease, there's usually a context around that neurodegenerative disease, such as certain proteins not being expressed. And so now when you're asking BigRNA, for example, to make predictions for how we could target genes, identify patient mutations, identify targets, and then

develop molecules against those targets, it has to be in the context of the fact that those proteins are not being expressed properly. So that means fine-tuning BigRNA so that it can make predictions in the context of that neurodegenerative disease. And so to answer your question, there's a BigRNA component, which is: all right, you've got this foundation model for a particular area, in our case BigRNA for RNA biology. Can you fine-tune it to solve a specific problem within a disease context?
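To make the fine-tuning idea concrete, here is a minimal PyTorch sketch of the general pattern Brendan describes: freeze a pretrained sequence backbone and train only a small head on disease-context data. The model and names are illustrative stand-ins, not Deep Genomics' actual code or API.

```python
import torch
import torch.nn as nn

# Illustrative stand-in for a large pretrained RNA-biology foundation model.
class TinyBackbone(nn.Module):
    def __init__(self, vocab=4, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, tokens):                   # tokens: (batch, seq_len)
        return self.encoder(self.embed(tokens))  # (batch, seq_len, dim)

backbone = TinyBackbone()
for p in backbone.parameters():
    p.requires_grad = False        # freeze the general RNA-biology knowledge

head = nn.Linear(64, 1)            # small disease-specific prediction head
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

# Toy batch of disease-context data: 8 sequences of 200 nucleotide tokens.
tokens = torch.randint(0, 4, (8, 200))
target = torch.randn(8, 200, 1)    # per-position labels from the disease context

pred = head(backbone(tokens))
loss = nn.functional.mse_loss(pred, target)
loss.backward()                    # gradients flow only into the trainable head
optimizer.step()
```

Freezing the backbone keeps what the foundation model already knows about RNA biology intact, while the small head adapts the predictions to the disease context, such as certain proteins being under-expressed.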

Okay, okay. Now you talk about foundation models. These days, when you talk about generative AI, you hear a lot about these foundation models, right? A lot of these LLMs, the large language models. Now RNA, I mean, it's talking a language of, you know, A, C, U, and G, right? And so what do you think about these kinds of

large language models, are they applicable in RNA therapeutics or in therapeutics in general? Surprisingly, yes. And so, for example, BigRNA is based on a transformer. In fact, we have different versions of BigRNA and we made one version public. We wanted to show that BigRNA could solve important problems in realistic drug discovery tasks.

So we wrote a paper and put that online last September and we looked at 10 different tasks in RNA therapeutic drug discovery ranging from target identification to designing molecules and so on. And the reason we put that paper online is because we wanted to sort of answer the question of how well does this foundation model really solve important tasks in drug discovery and provide people with the data, provide people with the evidence. That version of BigRNA is based on a transformer,

which is a particular kind of deep learning architecture. And transformers, as I'm sure you know, but the listeners may or may not know, were first invented for natural language processing, as you alluded to, and then also applied to areas like computer vision. And roughly speaking, transformers are able to look at very long sequences of letters,

many, many thousands of letters in a sentence, if you like, and then identify different components, different words, and also different words that go together in terms of meaning. Like you might have an adjective and a noun, you might have a pronoun, and so on in English. And then compose those together when making a prediction, in the case of natural language, for example, for translation to a new language or what have you.
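As an aside for readers, a common way to turn a genomic sequence into "words" a transformer can attend over, used by several published genomics models though not necessarily by BigRNA, is to tokenize it into overlapping k-mers. A toy sketch:

```python
from itertools import product

K = 3
# Vocabulary of all 64 possible 3-mers, each mapped to an integer token id.
VOCAB = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=K))}

def tokenize(seq: str, k: int = K) -> list[int]:
    """Slide a window of width k across the sequence, one token per k-mer."""
    return [VOCAB[seq[i:i + k]] for i in range(len(seq) - k + 1)]

print(tokenize("ACGTAC"))  # [6, 27, 44, 49]: the 3-mers ACG, CGT, GTA, TAC
```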

Now, what's interesting: DNA and RNA are not the same as English. First of all, in DNA there's a huge amount of redundancy. There are a lot of nucleotides in DNA that don't matter; you can change them and nothing really happens in the cell. For the most part, that's not true in a human written language; the redundancy is far lower. Obviously, the structure of things like words is very different. DNA is not literally like words as we think of them in a human language.

However, what's interesting is that same architecture, the transformer architecture: we found when we gave it a huge amount of data, as I said, we needed to give it a lot of data, a trillion data points, to build this model. What was interesting is, yes, the transformer was able to accurately predict RNA biology. So it could predict the effects of patient mutations and the effects of oligonucleotide therapies and other kinds of therapeutic interventions, even though it had never seen any data on those. So we never trained it with patient mutations,

and we never trained it with RNA therapeutics. And yet once we had trained it just on the natural genome and all of the RNA-seq data out there, up to a trillion data points, it was able to make predictions for these tasks that it had never seen before. That's called zero-shot learning, and it's something that foundation models turn out to be really good at. Yeah, yeah. So it was able to deal with the huge redundancy that's there. It was able to ignore that to really focus on the functional elements.
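The zero-shot pattern Brendan describes can be sketched simply: score a mutation by running the trained model on the reference and the mutated sequence and comparing the predicted tracks, with no variant-specific training at all. `predict_tracks` below is a hypothetical placeholder for such a model, not BigRNA itself.

```python
import numpy as np

def predict_tracks(seq: str) -> np.ndarray:
    """Hypothetical stand-in: per-position predictions, shape (len(seq), n_tracks)."""
    rng = np.random.default_rng(sum(map(ord, seq)))   # deterministic toy output
    return rng.random((len(seq), 4))

def variant_effect(ref_seq: str, pos: int, alt: str) -> float:
    """Score a single-nucleotide variant with zero variant training data."""
    mut_seq = ref_seq[:pos] + alt + ref_seq[pos + 1:]
    diff = np.abs(predict_tracks(mut_seq) - predict_tracks(ref_seq))
    return float(diff.max())       # largest predicted change across all tracks

print(variant_effect("ACGTACGTAC", pos=4, alt="G"))
```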

Yes, yes. And so a lot of the new AI systems use a method called attention. And attention, roughly speaking, is kind of like, you could think of it as a pre-processing method that goes through the data that's available, looks at the input signal, and just looks for correlations between things that seem to pop out and stand out. And so if you have a component, it could be like a sequence of nucleotides in the DNA,

that is just always uncorrelated with anything else, doesn't seem to relate to anything else, then the system will pay less attention to that. So it's kind of looking at things that are correlated. Correlation is a very simple way of measuring importance. And so this attention mechanism focuses on things that tend to be correlated with something. By doing that, it can sort of pull out the things that seem to be most relevant for the prediction task. And that was what transformers introduced.
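For the technically inclined, here is scaled dot-product attention in a few lines of NumPy, a minimal sketch of the "weight what correlates" idea rather than any production implementation:

```python
import numpy as np

def attention(Q, K, V):
    """Each output position is a mixture of inputs, weighted by similarity."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # pairwise similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
    return weights @ V                               # attend to what correlates

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 6, 8))    # toy queries, keys, values: 6 positions
print(attention(Q, K, V).shape)         # (6, 8): one output per input position
```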

And after transformers, other architectures have come up with different ways of doing a similar thing, whether it's Mamba or ConvNeXt, an architecture that came out of the Facebook lab. Mm-hmm. Gotcha. You talked about a trillion data points. Can you tell us more about what kind of data you're using? Yeah, so for BigRNA, each training case is a gene,

say, for example, 200,000 nucleotides, so truly a long sentence in terms of DNA. And then the thing we're predicting is 3,000 different molecular biology tracks. Now, what I mean by a track is a set of measurements that are aligned with the DNA. An example of a track would be RNA-seq data. RNA-seq data tells you, if you look at a stretch of DNA, for each position of the DNA, whether you're seeing any

bits of RNA in the cell that correspond to that location of the DNA. So, for example, if this is the DNA sequence, the RNA-seq data might read 0, 0, 0, 0, then pop up to 100, then drop back down to 0, and then pop back up to 100. And every time it pops up, it's saying that in the cell there are little fragments of RNA that correspond to those regions of the DNA. So if you're thinking about what's going on inside of the cell,

it's like saying, here's the DNA, but here are the parts of the DNA that actually do something inside of the cell. So that's what the RNA-seq data tells you, and that's what a track is. So the training data for BigRNA would be the DNA sequence and then 3,000 different tracks. And so it would be like the RNA-seq data in brain, RNA-seq data in liver, RNA-seq data in heart. But then there are tracks corresponding to things like protein binding data. So does the protein bind to this region or not?

Or does a microRNA bind to this region or not? Okay. And those 3,000 different tracks, if you like, are kind of a rich description of what's going on inside of the cell. And once trained, now you can ask BigRNA: here's a patient mutation. Does it change any of those 3,000 different molecular phenotypes in the cell? Does anything change in the cell because of that mutation? And we can make predictions based on BigRNA, and we've shown that they're very accurate.
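Schematically, one training case as described, a long DNA window paired with position-aligned tracks, could be laid out like this (toy sizes here; per the conversation, the real inputs are on the order of 200,000 nucleotides and 3,000 tracks):

```python
import numpy as np

SEQ_LEN, N_TRACKS = 2_000, 30   # toy scale; real: ~200,000 nt, ~3,000 tracks
BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot(seq: str) -> np.ndarray:
    """Encode a DNA string as a (length, 4) matrix, one column per base."""
    x = np.zeros((len(seq), 4), dtype=np.float32)
    x[np.arange(len(seq)), [BASES[b] for b in seq]] = 1.0
    return x

# Input: the gene's DNA window.
dna = one_hot("ACGT" * (SEQ_LEN // 4))     # shape (2000, 4)

# Target: tracks aligned to the same positions, e.g. track 0 = RNA-seq
# coverage in brain: mostly 0, popping up to ~100 over expressed regions.
tracks = np.zeros((SEQ_LEN, N_TRACKS), dtype=np.float32)
tracks[400:520, 0] = 100.0                 # a toy "expressed" stretch
```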

Then you can also ask the question, would this oligonucleotide, if we fed that into the cell, change how the DNA is being interpreted within the cell so as to set things right again, to correct the effect of the patient mutation? And we've also shown that that is quite accurately predicted using BigRNA. You talked about the 3,000 tracks,

let's say RNA-seq data in multiple organs, like tissues, but then that's a huge amount of data, you said, right? But how much of that is actually available to you? Like how much is publicly available? And are there a lot of experiments that you do to create data in a lot of those tracks? Because I'm imagining it will be hard to get data in all 3,000 tracks, right? Yeah, that's a great question.

So BigRNA, and Deep Genomics more broadly I should say, uses huge amounts of proprietary data as well as public data. And so some of the RNA-seq data sets that you just mentioned, yeah, we can get those from the public repositories; GTEx is one data set that a lot of people use. But we also generate a lot of proprietary data. So for example, I mentioned microRNAs and how microRNAs interact with genes.

And microRNAs are very important in terms of their effects on gene regulation. There isn't a lot of microRNA data out there, and so we've generated very large amounts of microRNA data. So really it's a mixture of the two, and both kinds of data, proprietary and public, are growing exponentially, which is exciting to see. And if we look at BigRNA and we say, okay, what's the performance of BigRNA, for example, for identifying novel patient mutations that could lead to drugs,

and we plot that as a function of how much data we fed into BigRNA, so we'll train a version of BigRNA with a certain amount of data, and then with twice as much data, and then with four times as much data, and then 16 times, and so on. What's exciting is, with a lot of AI you'll see the performance plateau, so it's just not going to get better. With BigRNA, we see the performance going like that, so it gets better and better. So that's really exciting. And it tells us that the more of this data that we

and also public entities generate, the better these foundation models will get. Great. And is that because this is transformer-based? Because in neural networks, as the data grows, the accuracy keeps increasing, right, which is not the case in some of the tree-based methods or so. So is that a property of that? Or are you also seeing that other, even neural network-based methods

plateau off after some amount of data, after the increase in data? Yeah, first of all, you're right. Neural networks are very flexible models; they're capable of modeling very complex relationships. And so it is important that we're using deep learning. And I talked about the number of parameters and that being really important. That's a large number of parameters. The other thing I will comment on, which is very different now than even just five years ago, and for the prior 30 years

when it comes to statistics and machine learning, is that the old intuitions about overfitting your training data no longer hold. By that, I just mean if you have a model that's too complicated, it can memorize the training data, but then it's not really good at predicting new things. And, you know, it's like if you have a set of data that looks like this,

if you've got data that looks like this, right, and you fit a straight line, okay, the straight line kind of looks like that. Okay, so that's your predictor. It looks pretty good. Now, if you train a model with too many parameters, it'll come up with a prediction like this,

which goes through all of the data points, each and every data point, perfectly. But that's a crazy curve, right? You wouldn't think that that would be useful. And so for probably around a hundred years, actually, that was the focus of machine learning and statistics: the bias-variance trade-off. And a lot of research and a lot of theory went into trying to understand that.
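The picture he is drawing is easy to reproduce: fit the same ten noisy, roughly linear points with a 2-parameter line and with a 10-parameter polynomial that threads every point. A small NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 10)
y = 2 * x + rng.normal(scale=0.1, size=10)   # roughly linear data with noise

line = np.polyfit(x, y, deg=1)     # 2 parameters: a sensible fit
wiggly = np.polyfit(x, y, deg=9)   # 10 parameters: interpolates every point

x_new = 0.05                       # a point between the training samples
print(np.polyval(line, x_new))     # close to the true trend (about 0.1)
print(np.polyval(wiggly, x_new))   # the "crazy curve" can swing far off
```

In the classical bias-variance picture, the degree-9 fit is the cautionary tale; the surprise Brendan describes next is that today's heavily overparameterized networks often escape it.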

The wild thing with the latest generation of deep learning models, and this is really just the last three, four, five years, is that we can train models where the number of parameters exceeds the number of data points, and this does not happen. This overfitting thing doesn't happen. And so the community doesn't quite understand yet what's going on. But one other thing that we're seeing happen is not only

does it not memorize the training data, it actually makes predictions that are useful. But these machine learning systems are also able to make predictions for things that they weren't even trained to predict. So I gave you an example with BigRNA: it had never seen oligonucleotide data, and yet it's able to predict the effect of an oligo. So that's truly exciting. And some people are talking about that as emergent intelligence, where these machine learning systems are trained on massive amounts of data, and they have a huge number of parameters,

and they're behaving in ways that we hadn't expected them to behave, in terms of predicting new things and accurately predicting new things too. Very interesting. So this is really becoming the wild west in machine learning at this point. Yeah, yeah. So, coming back from AI to more like, you know, the drugs. So your model is that you develop these new treatments through partnerships, right? So can you talk

about like what is your model? Yeah, so two years ago, our plan was, like many of these other companies, to go to the clinic with three or four programs. And as I said, we realized that our models, and actually others in the community, were not performing as well as we'd hoped they would perform. And so we had a choice at that point in time. We did have programs, and we could have said, okay, we're just going to proceed to the clinic with those programs at risk.

But with the knowledge that the quality of the AI at that time was not sufficient to substantially change the probability of success of the program. Typically in drug discovery, the probability of success is around 5% or 10% going from an IND to a product.

We, of course, wanted it to be much higher, right, at 80% or 90%. But when we evaluated our AI systems, we realized they were not where we needed them to be, again, because we were not solving all these different tasks properly. Now, a lot of companies in the AI community then just went ahead to the clinic, and they're taking on that high risk of failure. We decided to do something different, which was take a step back, understand why the machine learning systems weren't doing what they needed to do,

and what we needed to change to make them work better. And that's what led to the foundation models we've just talked about. And so what we decided to do about a year ago, because of the success of BigRNA and also as we thought about our new mission of having the best of all foundation models, so not just BigRNA for RNA biology,

but foundation models for protein design, foundation models for DNA, foundation models for single cell data and so on, and disease specific foundation models. We realized the best way we could offer value to the entire community was for us to focus on this breakthrough, focus on what we're really good at, focus on the discoveries we'd made, which is really the foundation models for biology. And so that led us to reconceive our corporate strategy as being one of partnership.

So advancing programs with partners. And that's what we've been focusing on is that new model. Gotcha, gotcha. Now, one basic question I haven't asked you is why the name Deep Genomics? Well, it's a double entendre. Of course, the first meaning is deep learning, as you pointed out at the very beginning of our conversation. And the second one is that to understand how the genome generates disease, you have to go deep.

So the traditional approach in the 2000s was genome-wide association studies. You just correlate mutations with the phenotype, like cancer or no cancer, whatever. And our point was, no, there are many layers in between, many, many layers. And you really need to model all those layers of biology in order to do a good job of discovering targets and also designing molecules. Now, Deep Genomics has raised about $227 million to date. How far

will this funding take you and what's your plan to raise additional capital?

Yeah, so I won't comment on the specifics in terms of the money in the bank and so on, but we have a lot of cash in the bank for multiple years of runway. And so that puts us in a strong position. And I think what you're going to see increasingly in the AI drug discovery field is major partnerships. And by that I mean like a transformational deal where two companies become highly aligned in one disease area, for example,

or M&A. And so if you look at recently, Recursion Pharmaceuticals merged with Exscientia. So you're starting to see M&A in the AI drug discovery space. And the reason for that is, as I mentioned before, there are all these different things you have to get right to have a successful drug, and companies have been focusing kind of narrowly on one aspect. Through M&A, you can acquire other aspects. And so Deep Genomics has a lot of cash.

And M&A is something that we're open to considering as well to acquire new data sets or new technologies or to just accelerate the pharmaceutical industry. That's what we want to do. We want to help patients in the best way we can. Our expertise is this AI and these foundation models. And so we'll take a path that's best in terms of using our technology to create drugs for patients.

So we're now seeing the emergence of these tech-bio companies, right? But there is still a gulf between the culture of a tech company and a biotech company. So how well integrated are those? And with your experience, you've seen that, right? Like the difference. So how do you have these tech and biotech teams come together and work as one company and have a similar culture?

Yeah, that's a good question. And I think it comes down to company values. I think it's hard for an existing company to kind of bolt AI on the side in a meaningful way. You really need to have it within the employees as, like, a core value. They kind of need to believe fundamentally that this is important. And so at Deep Genomics, for example, we say multilingualism is a core value.

And the reason we use the word multilingualism is that understanding the language is the first step, of course, to being somebody who understands it really deeply. And so that's why we focus on multilingualism. And so we ask our biologists to understand, for example, deep learning, and to ask about different aspects of deep learning systems, gain intuition for how much training data you need to build a system, because

the common error that will occur in the field is people say, I've got this great data set, I'm going to build a deep learning system. And then it turns out the data set is 1,000 data points, and that's just not enough to build a deep learning system. And so having our biologists and our chemists at Deep Genomics and our BD team, really everyone, kind of understand the different aspects of deep learning and machine learning, its capabilities and also its limitations, and the different kinds of systems that are out there in different companies or in the academic community,

is an important aspect of multilingualism. And then the other way around as well: ensuring that our computational people, our AI people, our software engineers, our data scientists, our statistical geneticists, they all really take to heart the real issues that are present in drug discovery. The real deal, what it really takes to make a good drug and to have a good target. So that the machine learning systems and so on that they're creating are actually

aimed at actually producing value in the drug discovery community. Great. Brendan Frey, Chief Innovation Officer and founder of Deep Genomics. Brendan, thank you for your time today. It was a pleasure, Amar. I enjoyed speaking with you. Thanks. Well, Amar, what did you think? It was a fascinating discussion about this pretty interesting company. We talked a lot about AI and then basically machine learning,

how that has evolved and really how machine learning is now being used in drug discovery, in a lot of detail. And also I think we went quite a bit deep into the machine learning aspects here. So yeah, it was a fascinating discussion, learning more about how AI and machine learning are being used in drug discovery. Brendan said AI for drug discovery hasn't delivered in its first 10 years.

Did that surprise you to hear him say that, and do you think he's right? I think so. I mean, what we've seen is that with any new technology that comes in, there's a lot of hype about it, and people think it's a silver bullet that's going to change everything and save humankind, right? We saw that even back in 2000 with the Human Genome Project, right? People thought that

when the entire genome was sequenced, we were going to be able to get drugs for all diseases. It's always like this: each of these innovations helps in creating more drugs, and that has been happening. But when you have a new technology or new innovation, it takes time for us to understand it. It takes time for us to really know how to apply it and then really focus on

applying it to the right problems, right? So it takes time. I would have been surprised if we had, like, you know, 50 drugs coming out of AI in the first few years. I think this is how it's going to be. There are going to be a lot of failures, and we will learn from that. But what you're going to see is that over the next few years, the next five to 10 years, there will be some good drugs coming out of AI. You did ask about

the company's focus on RNA. RNA in some ways makes sense in that he likened it to the software of biology. He talked about the fact that there's a rich amount of data around it. Is this well suited for AI-based drug development? It is. See, what he talked about was that a lot of the data that we can quickly get is RNA data.

And of course, I mean, you know, RNA therapeutics is important. Protein therapeutics is important. Cell therapeutics is important. All of these are pretty important. And what matters in the end is what drug we can actually give to patients to help them, right? But in terms of RNA therapeutics, because so much of the data that's available is RNA data, what they're doing is using that, and then you don't have to make a leap of faith of, okay, well,

if the RNA is doing this, then the protein is going to do the same thing. It's because, fundamentally, right, from DNA the RNA comes, and from RNA there's a protein. And a lot of times in traditional drug discovery, you're trying to find the protein targets and then you create drugs against those protein targets. So what they're saying is that the assumption that we have to make, from RNA to protein, it's not necessarily one-to-one,

you know, from one RNA, I mean from RNA to one protein, right? There are a lot of things that happen: proteins get modified, there's a lot of post-translational modification that happens. So just because we understand the biology of the RNA doesn't necessarily mean exactly the same thing is going to happen to the protein. So with AI, the limitation right now, because not a whole lot of protein data is available, is that you're going to make assumptions based on the RNA data about the protein.

What he's doing is saying, okay, well, I have RNA data, so let me just, you know, do the prediction on an RNA therapeutic with that. Which is fine, I mean, that's one way of doing things. Of course, we need to create RNA therapeutics, but we also need to continue to create other modalities of therapies as well. But I like the idea of, well, let me just focus on the data that I have, right? Brendan also talked about how different machine learning models are good at predicting

individual aspects of a potential therapy. Companies focused on developing systems that are good at one thing. He said this was untenable, as Deep Genomics had 40 different models with 40 different teams and the need for 40 different validations. The company is now using a unified model for RNA biology that's trained on a single large data set. Do you expect others to follow suit and do the same?

What do you think this might mean for the value of AI in drug development? I would say so. I mean, 40 models, that's just way too much. And I know there might be a lot of accuracy that each of these models would be providing, but just going from one model to another to another and then trying to see if the final result is good, I mean, that is a hard thing. So I completely get it that they have

consolidated these models. Now, of course, it's not that easy to consolidate the models. So I'm sure they had to do a lot of work to consolidate those, because you want to make sure that the consolidated model is actually performing the tasks of each of those models. So that's a pretty difficult problem actually to solve. But also, having 40 models and then having the data sets, and then, see, for each of these models

you need to have what we call MLOps, machine learning operations. So you need to make sure there are full operations set out, validation that needs to be done. There are a lot of things that need to be done to productionize a machine learning model, and doing that for 40 models is just so much work. So I can see this. I can also see other companies going to a smaller number of models, as long as they're able to consolidate those. It could be one, it could be two or three, but

yeah, it's going to be a small number rather than a large number like that. One of the things that really struck me is he talked about the ability of the Deep Genomics system to predict the behavior of RNA mutations that it had never seen. He talked about this being zero-shot learning. Yeah. And the massive amounts of data it takes to be able to do that. We talked about performance plateauing with normal AI, but what they've seen with

BigRNA is that they've been able to make that curve continue upwards. What did you make of that? So see, this is some of the new territory that we're getting into in the machine learning field. Of course, the neural networks themselves have been around for a long, long time. But the breakthroughs came around 2010, when

so much data became available, because the neural networks need huge amounts of data and huge amounts of computation time, neither of which was available early on, even 20 years ago. But then what happened in the 2010s is we got a tremendous amount of data that started to become available for computer vision, photos and stuff, but also a lot of social media data.

There's a lot of text data there, but then also at the same time, we got these graphical processing units, or GPUs, which are very powerful computing units. And that's what really rekindled the field of neural networks. And then that became deep learning, basically. And that's when we started noticing these things where, unlike other machine learning methods,

the performance of deep learning actually keeps getting better and better with more and more data, which is something interesting. I mean, this went against a lot of the machine learning theory that had been developed previously. That theory says there's a limitation, you get saturation of performance, but with neural networks, it has actually been different. And now we're seeing this emergence he talked about, right?

Usually a machine learning method predicts from stuff that it has seen, but now we're seeing this behavior where it's actually predicting stuff that it has never seen. And this is really uncharted territory at this point; that's what's happening in the area of machine learning. And this is very much related to generative AI, by the way. I mean, the transformers and stuff, generative AI is very much related to that. So in this area right now...

In terms of practice, we're going ahead and ahead and ahead. But in terms of the theory, it's going to take time, because we need to really work on the theory. And we're going so fast in practice that the theory is going to take a long time to catch up. But it's going to be interesting to see how that plays out. This is really, as I said, the wild west at this point in machine learning. Well, it was a great conversation. I'm looking forward to the next one. Thanks so much.

Thank you.

Thanks again to our sponsor, Agilisium Labs.

For Life Sciences DNA and Dr. Amar Drawid, I'm Daniel Levine. Thanks for joining us.

Our Host

Dr. Amar Drawid, an industry veteran who has worked in data science leadership with top biopharmaceutical companies. He explores the evolving use of AI and data science with innovators working to reshape all aspects of the biopharmaceutical industry from the way new therapeutics are discovered to how they are marketed.

Our Speaker

Brendan Frey is the Founder and Chief Innovation Officer at Deep Genomics, a trailblazing company at the forefront of AI-driven therapeutic development. With a visionary approach, Brendan has been instrumental in merging the fields of AI and genomics to transform how therapies are discovered and developed. Over the past 15 years, he has co-authored more than 200 influential papers in machine learning and genome biology, with numerous publications in prestigious journals like Nature, Science, and Cell. Brendan’s groundbreaking work continues to shape the future of precision medicine, making him a pioneering figure in the biotech industry.