Driving Insights from Real World Oncology Data with AI
In This Episode
In this episode of Life Sciences DNA, Dr Amar Drawid talks with Dr Aaron Cohen, Head of Research Oncology Clinical Data at Flatiron Health, about how artificial intelligence and large language models are transforming oncology research and real-world cancer care. They discuss how AI turns unstructured electronic health records into actionable insights, improving patient outcomes, clinical decision making, and drug development. The discussion highlights how AI-driven real-world data is revolutionizing cancer diagnostics, treatment planning, and the future of precision oncology.
- Real World Data for Better Care: AI connects clinical trial findings with real patient outcomes, helping oncologists personalize treatments and assess effectiveness using real world evidence from oncology practices.
- Turning Notes into Insights: Large language models analyse clinical notes, pathology reports, and biomarker data, converting unstructured text into structured insights that reveal disease patterns and treatment responses.
- Combining Human and AI Strengths: Flatiron blends expert-guided rules with AI extraction to process millions of oncology records, maintaining near-human accuracy while scaling data analysis for research and care.
- Testing and Trusting AI Models: By comparing AI outputs with expert-curated data across 14 cancer types, Flatiron showed that AI can match human precision in identifying cancer progression.
- Advancing Cancer Research and Drug Development: AI-driven extraction of real world data accelerates drug development and clinical research, enabling faster validation of therapies and better-informed treatment decisions.
Transcript
Daniel Levine (00:00)
The Life Sciences DNA podcast is sponsored by Agilisium Labs, a collaborative space where Agilisium works with its clients to co-develop and incubate POCs, products, and solutions. To learn how Agilisium Labs can use the power of its generative AI for life sciences analytics, visit them at labs.agilisium.com. Amar, we've got Aaron Cohen on the show today. Who's Aaron?
Amar Drawid (00:28)
Dr. Aaron Cohen is head of research oncology clinical data for Flatiron Health.
Daniel Levine (00:33)
And what is Flatiron Health?
Amar Drawid (00:35)
Flatiron Health is a health tech company focused on transforming cancer care and research through technology and data science. The company collects and analyzes real-world clinical data from electronic health records across a network of oncology practices and academic cancer centers. And it offers products and services to support researchers, providers, and biopharma companies with clinical trials, drug development, and care optimization. They curate that data and use it to help accelerate research and support clinical decision making.
Daniel Levine (01:08)
And what are you hoping to hear from Aaron today?
Amar Drawid (01:10)
So, Aaron recently did a study on the use of large language models to extract and analyze clinical details from unstructured data in electronic health records. So, I want to talk to him about the study, but also about where we are in terms of using AI to do this kind of work. And then the challenges, of course, that we're facing in doing that, and how it's going to better enable the use of and access to real world data and, of course, how it's going to help us improve everything from, let's say, patient selection for clinical trials to outcomes for patients in real world settings.
Daniel Levine (01:52)
Before we begin, I just want to remind our audience that they can stay up on the latest episodes of Life Sciences DNA by hitting the subscribe button.
If you enjoy the content, be sure to hit the like button and let us know your thoughts in the comment section. And don't forget to listen to the podcast on the go by downloading an audio-only version of the show from your preferred podcast platform. With that, let's welcome Aaron to the show.
Amar Drawid (02:21)
Aaron, thanks for joining us. We're going to talk today about how Flatiron Health has used large language models to extract clinical data from unstructured electronic health records or EHRs and the potential this technology has for changing cancer drug development and patient care. Let's start with Flatiron Health. Can you tell us a bit more about Flatiron Health?
Aaron Cohen (02:44)
Yeah, happy to, and thanks for having me on. I'm really excited for the discussion. So, Flatiron Health is an oncology-specific health tech company. We own an electronic health record called OncoEMR. It's widely used across the country by mostly community oncologists. And through our business service agreements with them, we get access to de-identified patient data that gives us a sense of how patients in the real world are doing with their cancer care. And that over the last 10 years has proven to actually be very insightful and valuable in better understanding how patients are doing, as opposed to, say, on a clinical trial where patients tend to be younger, healthier, higher socioeconomic status. That's not always the case in the real world. And so, for example, when I'm seeing a patient, one of the first things I have to do is think, okay, this is the drug I want to give the patient. This is the study that the drug was approved on. How similar is this patient to the patients that were on the study or not? Would the patient have met criteria or not met criteria? And that's a challenging thing for a physician to need to do.
So, with some of the things that Flatiron does with, I would say, advances in real-world evidence and understanding how valuable it could be, that's starting to be able to answer some of those questions.
Amar Drawid (04:07)
And when you talk about EHRs or EMRs, the electronic health records, what type of data do we usually find in those that you can do research on?
Aaron Cohen (04:19)
There's a lot of different types of data, I would say. You know, I think for the purposes of this conversation, it's helpful to distinguish between structured data and unstructured data. Structured data I mostly think of as the things that, when I'm using the EHR, are point and click and readily usable, analyzable. Unstructured data is pretty much everything else. And so if you've been at a physician's visit recently and ever seen them, hopefully making eye contact, but if not, staring at the computer typing, all that's free text, all that's unstructured, and that is the details of the visit and how the patient's doing. But all of that is unstructured and not readily usable. So, there's clinical notes with that sort of free text. There's PDFs, biomarker reports, like molecular testing, biopsy reports, staging information. There's communications with social work. So there's really all sorts of different types of things in there, and really figuring out how to sift through all that and make use of it is harder than it sounds.
Amar Drawid (05:26)
Yeah, absolutely. But then the structured data, which is the tabular data, which is like, you can capture a lot of, let's say, the volume, those are a lot of the other measurements, and those are in a table format, and those are much easier to analyze, whereas the notes are much, much harder to analyze and get the right insights from.
Aaron Cohen (05:48)
Correct, yes. When I was training, I didn't put a lot of thought into how I entered data into the EHR. Most know that EHRs primarily exist for billing purposes and to justify why you're making a certain decision. And when you're training, you kind of learn what your process and workflow is going to be for having a visit and documenting it. And I would say one of the main things I paid attention to were ICD codes, because it was important for billing. And ICD codes are a structured data point that give you information about a patient's diagnosis or problem. But that's one of the, I would say, few structured data points readily available in the EHR that actually give you some idea of what's going on with a person's cancer. And there's a lot of different components of cancer care and how a patient experiences cancer. And that ranges from when the patient's initially diagnosed and what the imaging says and the biopsy and the biopsy results, maybe they have molecular testing on it, all of those things are not structured. They're all in those PDF reports I referenced earlier.
They're all, maybe I had the resource elsewhere and I'm just transcribing it into the EHR. So certainly, in my new role, it's been interesting to understand the value of being able to put data into the EHR in a more usable format from the get-go. But even still, just how it's set up currently really lends itself to typing and just unstructured data entry.
Amar Drawid (07:38)
Okay. And as you mentioned, right, EHRs were designed for insurance and billing purposes, and not for research. As you use this information for research purposes, what are some of the challenges that you have to deal with?
Aaron Cohen (07:46)
There are several, I would say. And rightfully so, because it's not on clinicians' minds how what they're documenting might be able to advance cancer care or improve patient outcomes. There is messiness, there is missingness, there are contradictions in the chart. I'm guilty of this too, like copy-pasting. So, you know, pulling forward information from the last visit I saw the patient. Hopefully that gets updated. But with all the things going on with a visit and all the patients that someone needs to see, you know, that information may just not get updated. And so as a result, you have to kind of solve the puzzle of what's the salient information, what's missing. If something's not there, is it because it didn't happen and the answer is no? Or is it unknown? And there's just nuances and things like that. At Flatiron, we've been thinking about this for a decade now, how to guide humans through the EHR and pull out these complex details I referenced earlier. We've been able to take a lot of those learnings and build off them over time.
Amar Drawid (09:13)
And you mentioned one thing earlier, which is de-identification, for our audience members who may not know what that means. And there's always a big question that patients have, which is: if my data is being used for research, how is that? Can they identify me or not? Can you talk a bit about what de-identification is?
Aaron Cohen (09:33)
I think the simplest way to say it is that there's certain types of health information, we'll call it PHI, and those are things that you could in theory use to identify a patient. So that might be, you know, obviously things like social security number people think of, but it's simpler than that oftentimes, you know, name, address, things of that nature. And so, making sure that those things are removed from the actual data that is being analyzed is critical.
But it actually gets to be more complicated than that, even, because some of the studies or some of the things that we'll be interested in doing can get to really small cohort sizes or really rare disorders. And when you think about it, the smaller the group of patients you're looking at or the more rare something is, in theory, the more at risk you might be for being able to re-identify. How many 100-year-olds with sarcoma metastatic to the brain are there? So, that's great. Yes, yes, great point. And with the geography component as well. On top of the boilerplate PHI data points that people think of, you know, we also make sure we go through a process where we are having the combinations of different clinical details looked at to ensure that it's not necessarily at risk for re-identification.
Amar Drawid (11:09)
So, let's talk about large language models or the AI. And so how are you using these large language models to pull the data from these unstructured notes, right, from EHRs and analyzing it? Can you talk more about that?
Aaron Cohen (11:26)
I referenced this earlier and I really want to hit it home. Before answering how we're using large language models, I really want it to be clear why we want to do this, why it's important to be able to understand how patients in the real world are being cared for and how they're doing. I used the example earlier of, like, is this patient with lupus that I'm seeing going to behave similarly or not to a patient that was on a trial? Take the example of some of the more critical, important data points that can happen in a chart, but, like, taking a step back, actually have a patient experience like cancer progression, right? These are serious events that happen. They deserve a lot of thought in terms of how it's communicated to the patient. How that ends up getting documented in the chart is different than how it might be documented in a clinical trial where everything is so protocolized, everything is measured, written down in detail every single visit, like the cancer grew this number of centimeters or shrunk this number of centimeters. When I'm documenting in the chart, back to the point about what EHRs are for, that's not how it shows up. It shows up as the cancer grew or the patient did not respond. All of these different ways of saying that the cancer progressed, there may be a date that was mentioned. There may be a date that's referenced, like in the past, or I expect it to happen in the future. And so with all this messiness, what we at Flatiron have had to learn to do is to say, okay, this is a critical endpoint, progression. We need to know that we can identify these events and use them in analyses to answer important questions. So how do we define it?
How do we guide a human to go through the chart and find these events so that we feel like we're capturing it accurately? And for a long period of time, just because the technology wasn't there yet, that's what Flatiron would do. We would train human abstractors. We, clinicians like myself, would write out really detailed policies and procedures to guide a human through a chart to say, you know, if you see something like pseudo progression mentioned, which is something that can happen when immunotherapy is given to a patient and the cancer looks like it's growing, but it's really just inflammation and it didn't actually mean that the cancer was worsening. Those sorts of things happen all the time and you need to have rules and guidance throughout to handle that. And we've come up over the last 10 years with those guidelines for humans to be able to do that well. And we're really proud of that. We've published on it. We validated our approach for real world progression as an endpoint. But the main limitation of that is it's not scalable. You know, everything I just described is pretty manual. And even though at Flatiron we figured out how to leverage technology to make that as efficient as I think anyone could, there's still the inherent limitation of having a human look through a chart, follow specific rules, and pull out details. And so what LLMs or large language models or any sort of machine learning algorithm that could allow you to extract data allows you to do, hopefully, is maintain that accuracy that you were getting with a human, but scale. So to the point where you may have progression on 500 patients, progression events of 500 patients, can you get it at 900,000 patients? And so back to your question, like how do we do it, after answering why we would want to do it? That's been a learning process. It's been an evolution because, as I already mentioned, we used to and we still do use humans. We learned a ton about how humans can go about abstracting this. Maybe six years ago, we really started exploring natural language processing, deep learning models for extraction, taking advantage of all of that really high quality labeled data that we had by virtue of doing this human abstraction, doing supervised learning to show this data to these models, to be able to evaluate it against this high quality data. And we really learned a lot about what those models could do, what they could do well, how they fit in with humans. We explored something that we called hybrid data curation where, if a deep learning model spit out a confidence threshold or a confidence number for how certain it thinks it is and it was above a certain threshold, we would say, we trust the model. We're going to use that answer. But if it was below that threshold, we'd say, you know what, this is a challenging one for the model. We still need to use a human for it. And the last two years have just been incredibly exciting and interesting with large language models coming into the picture. And I'll say, like, it's not intuitive that what I just described would be a task that lends itself well to a large language model, right? What we are really trying to do is say, is it in the chart? Yes or no?
And if so, pull it out while maintaining the initial intent of what was documented, not changing it, not assuming, not predicting. And then, of course, large language models are, given these words in a row, predicting what the next word would be. So we've had to figure out, how do you take these large language models that are coming out with new versions every couple of months and have these extraordinary capabilities? How do we ensure that it's not predicting whether the patient had cancer, but it's extracting that the patient had cancer? So those are just, I think, some examples of some of the things that we've been able to leverage our past learnings or past experiences and use for large language models today.
Amar Drawid (17:55)
And for this large language model to do a good job of doing that, I always kind of think about it, it needs to have an MD, it needs to have a lot of that training or knowledge, and it also needs to have this, like, the ways of working, like how to work with the chart, how to interpret the information. So do you use, like, RAG and stuff for knowledge, or, like, prompt engineering, fine tuning? Like, what are the kind of techniques that you use so that it can do a good job?
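Editor's sketch: the hybrid data curation Aaron describes in this answer (accept the model's answer when its confidence clears a threshold, otherwise send the chart to a human abstractor) can be illustrated with a few lines of Python. This is only a minimal sketch, not Flatiron's implementation. The threshold value, the `ModelOutput` fields, the `route` function, and the patient IDs are all hypothetical, and treating a score exactly at the threshold as trusted is an assumption.

```python
from dataclasses import dataclass

# Hypothetical cutoff; the episode does not state the real threshold.
CONFIDENCE_THRESHOLD = 0.90


@dataclass
class ModelOutput:
    patient_id: str
    progressed: bool    # the model's yes/no answer for a progression event
    confidence: float   # the model's self-reported certainty, from 0 to 1


def route(output: ModelOutput, threshold: float = CONFIDENCE_THRESHOLD) -> str:
    """Return 'model' to accept the model's answer, or 'human' to abstract by hand."""
    # Assumption: a score equal to the threshold counts as trusted.
    return "model" if output.confidence >= threshold else "human"


# Made-up example outputs, for illustration only.
outputs = [
    ModelOutput("pt-001", True, 0.97),
    ModelOutput("pt-002", False, 0.62),
    ModelOutput("pt-003", True, 0.91),
]

routing = {o.patient_id: route(o) for o in outputs}
print(routing)
```

Raising the threshold sends more charts to humans and lowers the share handled by the model, so in practice the cutoff would be tuned against labeled data rather than picked by hand.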
Aaron Cohen (18:26)
You know, I didn't have any computer science or technical background when I joined Flatiron. I have learned so much from working with the data scientists and the machine learning engineers at Flatiron. And it is just so amazing to watch the different experiments and techniques and ways that my cross-functional colleagues will go about evaluating different approaches. You mentioned some of them. A couple things that I think I want to emphasize. The first thing is related to large language models not necessarily being the intuitive tool you would use. They realized, okay, we can't have this spit out more text, which is also unstructured. We need this to come out in a JSON format, which we can analyze. And so that was something key that may be overlooked, that has been important in being able to actually extract the data in a way that is analyzable. We have done all sorts of, I would say, exploring with RAG, with fine tuning, LoRA related to fine tuning, chain of thought. We've found thus far, and we use, I would say, all of them, but I think the thing that certainly I can wrap my head around the most, and I think where we've had a lot of success, is the prompt engineering and the figuring out how to ask an LLM to do something the way you want it done or the way you might ask a human to do that. And you heard me reference earlier how I have helped, our clinicians have helped write these policies and procedures, these guidelines for how a human would go through. We've been able to take those rules and guidelines and use them to prompt the LLM to do it that same way. And in that sense, we have a 10-year kind of head start in thinking about how you can effectively ask an LLM to do these hard things and anticipate what sorts of edge cases you're going to run into. It also really sets up, and we'll talk about maybe evaluation later, but it enables you to evaluate the large language model, which in theory is following a process similar to or the same as the one you've asked a human to follow, in a more apples-to-apples way.
Amar Drawid (20:51)
Yeah. And just for the audience who don't know what JSON is, that's the format where you can then have information in a structured way. Going back to the training, do you find it easier to train humans or easier to train the LLM?
Aaron Cohen (21:10)
It's a little bit of both, I would say. You know, it's funny, you know, as an oncologist that went through a lot of training to become an oncologist, think about what that took to train me to do that. I just think it's humorous that when we're prompting the LLM, we'll say, imagine you're an oncologist. I mean, that helps, you know, I would say, the LLM better frame the problem that it's trying to do. And that's like a sentence of text, and all of a sudden they have ostensibly the training that I have. And so, in that sense, LLMs are really cool and nimble and agile because you can make changes to the processes, iterate, do error analysis and go back and change things in a way that's just much easier than certainly working with humans and certainly working with more traditional supervised learning approaches where you're limited by the labels that you have and that the models learn from. And it's not easy to change those rules that it's learned without collecting all new labels. That said, there are still some things that we're finding that LLMs can't do very well, or as well as we would like them to do or need them to do, to do the important research and use cases that we're interested in. I would say one example of that is they're not very good when the answer is not there. I talked about no and unknown earlier, right? And a human, they will do their due diligence of looking through the chart to see, did the patient have progression, yes or no, or unknown. If they don't see it, they will reliably say: didn't find this. What we've had issues with in discovery with large language models is they don't like to give that answer. Everyone has heard hallucinate, but I don't think people might realize what that means in the context of what we're talking about in terms of real world evidence and data curation. For the purposes of this, they will confidently say, patient had progression on January 1st, 2022. And it sounds competent, it's entirely plausible, but it's not there at all. And so that's one area that we've had to really think about, how are we in a position to evaluate these large language models accurately and know when those errors are happening? Another thing that we found challenging, that I think large language models, again, they're improving every month, will eventually get there, but we still have some challenges with, are these complicated clinical concepts that might span different pages in the EHR or documents in the EHR where you need to piece together information to then get the answer.
An example of this is, like, local regional recurrence. So, if a patient has breast cancer, oncologists following that patient, God forbid, you know, hopefully the patient does not progress. But unfortunately, all too often it does. And there's two main ways that could happen. It could travel distantly to another organ. You know, we call that metastasis or distant metastasis, but it also just could come back locally, like in the area that it started, or local regional recurrence. Both of those are considered advanced disease, but they're managed differently. They have different implications for the patient and how they'll do. And so it's important to be able to distinguish between those. But to be able to say whether a patient had a local regional recurrence, a couple of things need to happen. One, the clinician who's documenting this needs to actually be aware that specifically saying local regional recurrence might be important or useful. We've talked earlier about why clinicians document, and it's for billing and it's for justifying treatment and making decisions, not for research. So very frequently, a physician will just say, patient recurred. And you just don't know if it was distant, you don't know if it's local. And so, there's just some inherent subjectivity there that's challenging to begin with, and we found humans are able to kind of handle that ambiguity better than LLMs. But also it requires you to sometimes piece together, as I mentioned earlier, the information across different documents. And so, a radiology report may say the patient had recurrence to the chest wall, right? And then a couple of documents later, the clinician who's interpreting this radiology report might say that the patient did recur, and piecing that it was the chest wall together with the fact that it came back equals local regional recurrence. That's something a clinician would think when they read it, but that's hard to teach an LLM to do.
Amar Drawid (26:33)
Yeah, that's a great point, right? That you have these different pieces which may be in different areas and it needs to have that kind of sense to put that together, which I'm not sure to what extent they can do very well at this point, right? Like the current language models in 2025.
Aaron Cohen (26:54)
Yeah, but I'm optimistic. I think that as data becomes more longitudinal, as it becomes more important to piece together information across documents and over time, as novel AI methods make it more possible and feasible than before to combine clinical data with other sorts of data points, I do think we're going to be able to answer questions that we haven't been able to before by piecing information together over time.
Amar Drawid (27:27)
Talk about that. You mentioned the why, right? So why we want to get this information. So by being able to extract this information, what kind of answers have you been able to get? How has that been helping, let's say, in clinical practice so far?
Aaron Cohen (27:48)
Amar, I would say that the main one that stands out, and it's the one that I referenced earlier, is you have these clinical trials that have shown what progression-free survival looks like for patients. And that is one of the main endpoints in clinical trials that's used because it correlates with overall survival and it happens more frequently than death events. And so it's easier to measure that outcome than it is overall survival.
But there's no guarantee when you do a trial that the progression-free survival that's been measured for that cohort generalizes, again, to the patient that I'm seeing in front of me, right? And so there's a lot of interest in understanding, what is actually the progression-free survival of patients in the real world? And that was some of the main motivation behind recent research that we did and presented at AACR AI. We wanted to see how well large language models could do at extracting progression events and their dates. We wanted to do that specifically to see how well they could do compared to how well a human could do it. And I'll tell you what I mean by that. Usually when you, you know, evaluate an algorithm, you evaluate it against some reference data. And often in real world evidence and other fields, what you're evaluating against is human-labeled data. But that becomes problematic when the reference data, the human-labeled data that you may be evaluating against, isn't 100% correct. Right. And so why might it not be a hundred percent correct? Well, you know, as we've discussed earlier, real world data is super messy. There's subjectivity, there's tasks that are just hard. You know, I told you about local regional recurrence. So even really highly trained human abstractors, and we pride ourselves on who we hire and then how we train them and how we evaluate them, still make mistakes. Like, we should acknowledge that humans aren't perfect, right? And when you evaluate against that test data that is imperfect, it limits your ability to understand how well the model, or in this case, the large language model, actually did. You know, the analogy I usually use, because everyone can relate to it, is imagine you took a test. You studied a really long time and you get the results back and you got a 75% and you're disappointed. You thought you should have gotten higher. And then it's revealed later that the answer key had errors. And that's super frustrating because, like, now you don't know how well you actually did. There could have been answers where you were right and the answer key was wrong, but it was you that was penalized. And so that same thing happens when evaluating models and, in this case, large language models. And so a couple of things: the higher the quality of the reference data you can get, the more accurately you're going to be able to really understand how well the large language models work.
And what we did in the study, if you're able to contextualize how difficult that task is for a human by comparing the LLM to reference data and a human to separate reference data, then you're more on equal footing. There may still be mistakes in the reference data, but you're not evaluating the LLM on only one human and being done with it, you're saying, okay, there's a human approach, there's an LLM approach, we're gonna evaluate them on the same data. And in that sense, you can see how similar or not their performance is. You can generate regular metrics like recall or precision or F1 score. And so in this study, what we did is we trained it, or we taught an LLM, to extract progression events and dates across 14 different cancer types. And we actually compared how the LLM did on those tasks to extract progression to an expert human. And we measured the results and we contextualized the results by saying the LLM was within four F1 points of a human. I think that that's just a more intuitive, easier way to understand the quality of a model than traditional metrics. I could tell you that the precision or recall is 75% for the model. You won't know anything about the reference data and you don't know what to make of it. And in my experience, people love it when they hear that numbers are in the nineties. Like, that sounds great. We're good. Right? They know what sounds bad, which has tended to be metrics in the 50s or 60s. But they don't really know what to make of or how to think about something in the 60s, 70s and 80s. And I don't blame them. I often don't either, because it's only a piece of the picture. And so what we really wanted to do was show that we could do this well, that we could do this for a hard and important problem that varies by cancer type, because how progression happens and how it's documented, how frequently it happens, all things that can impact performance, change based on the cancer type. And so we wanted to show that we could do this across a multitude of different cancer types, that we could do it close to how well an expert human could do it. And transparently, we also wanted to see where we weren't doing it as well so that we could understand and dig into that. And in this study, we saw that almost across the board, our results were very, very close to that of a human. And there were a couple of exceptions where the difference between how well the LLM was doing and a human was doing was larger than we would like. And we were able to say, okay, that's a signal that we can do something better here, right? There's something different about this cancer or these scenarios that we need to understand. So, you know, working as a cross-functional team at Flatiron, which we almost always do, we did error analysis and we discovered that for, let's say, hepatocellular carcinoma, which was one of the cancers where we were struggling more, we saw that the large language model was struggling to discern or differentiate between whether the patient was developing advanced disease and that was it, or whether the patient already had advanced disease and then was experiencing a progression event. And basically, whether to lump that progression event in with the initial diagnosis or if it was distinct. And humans were able to do that and had, I would say, a better idea of being able to classify that than the large language models. And as a result, we were able to go back and prompt it differently. Like, if you're in this scenario: If it's been X number of months, consider it part of initial diagnosis.
If it's been more than X number of months, consider it a distinct progression event. And when we did that, we saw the gap close and the performance improve.
Amar Drawid (35:15) Okay. And in terms of progression-free survival, right, we have the progression-free survival numbers for clinical trials, but in the real world, we always think, well, it's probably 10 to 15% less. What did you find actually by doing all of this that you published?
Aaron Cohen (35:34) In this study, we evaluated real-world progression-free survival. I referenced the work we've done on endpoints with human abstraction and publications earlier, which we're very proud of. We wanted to do an analysis that would actually reflect how this LLM-extracted data would actually be used. Because it's one thing to say the metrics are this, but metrics only tell you a part of the story and they change depending on what
you're evaluating on. What you really want to know is, okay, when you use this data instead of human-abstracted data, do you get the same answer or not? Can you trust the answer? And so what we did is we performed analyses where we looked at what the real-world progression-free survival was across these 14 different cancer types if you used human-extracted data, and then we did it again except subbing in the LLM-extracted data. And the results we saw were almost identical, which was very cool to see. And that's, I think, another example of where metrics don't tell the whole story, because you can have metrics that look great but, when you're using these models and these variables in combination, don't necessarily lead to the right answer. Or you can have metrics that don't look as good, but when you're using them in combination with each other, the things you made mistakes on sometimes tend to fall out of the cohort of interest, and the results may actually still be okay. So I know you're asking specifically about how the real-world progression results compared to the trial. And what I answered was how the real-world progression events of human-extracted data compare to large language model data. That was really the focus of the study we were looking at. That said, and it varies based on the study that you're looking at and the trial and the cohort, I think the intuition is correct that real-world PFS does tend to be a bit less, just because of the differences between a trial and the patients in the trial and the real world.
Amar Drawid (37:43) Yeah, absolutely. And it's the really nice care that you get in a clinical trial versus in the real world, and the follow-ups and everything, right? That definitely makes a difference. 
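As an aside for readers, the substitution analysis Aaron describes (compute real-world PFS from human-extracted data, then repeat with LLM-extracted data and compare) can be sketched roughly as follows. This is only an illustration under stated assumptions: the episode does not describe the study's actual statistical code, the function name `km_median` is hypothetical, and the durations below are invented toy values, not study data. The sketch uses a plain Kaplan-Meier estimate of median time to progression.

```python
from typing import List, Optional


def km_median(durations: List[float], events: List[bool]) -> Optional[float]:
    """Kaplan-Meier median: first time the survival estimate falls to 0.5 or below.

    durations: months from start of therapy to progression or last follow-up.
    events: True if progression was observed, False if the patient was censored.
    Returns None if the estimate never reaches 0.5.
    """
    data = sorted(zip(durations, events))
    at_risk = len(data)
    survival = 1.0
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_total = 0
        # Group all patients who share the same time point.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_total += 1
            i += 1
        if n_events:
            survival *= 1 - n_events / at_risk
            if survival <= 0.5:
                return t
        # Patients with an event or censored at t leave the risk set.
        at_risk -= n_total
    return None


# Toy cohort (invented numbers): same patients, two extraction approaches.
events = [True, True, False, True, True, False]
human_durations = [3, 5, 7, 8, 12, 14]  # hypothetical human-abstracted dates
llm_durations = [3, 6, 7, 8, 12, 14]    # hypothetical LLM-extracted dates

human_median = km_median(human_durations, events)
llm_median = km_median(llm_durations, events)
print(human_median, llm_median)
```

In this toy cohort the LLM places one progression date a month later than the human abstractor, yet the median is unchanged, which mirrors the point that a date-level error does not necessarily change the downstream answer.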
Yeah, but that's great, because we always think about that intuitively, but you're able to see it using the data. I know the humans did the work to compare as well, but now this gives us an ability with the LLMs to be able to do that in the future in a pretty fast way. That definitely is going to be helpful when we're developing drugs, especially. That's very helpful for pharmaceutical companies. Yeah. So, you mentioned that sometimes the AI did better, sometimes not. As we're going into the future, what are some of the opportunities you see to improve the LLMs so they can do an even better job than what they have been able to do so far?
Aaron Cohen (38:46) I think there's ways to answer that from the researcher data scientist perspective, and there's ways to answer that from the clinician perspective. And so maybe I'll try to answer both with different hats on. I think that from the data science research perspective, validation is so important. These large language models are incredible in what they can do. And they're improving. It's a moving target, right? Like you think they can't do something