Great, wow, it's exciting to be here.
And I'm here to share some important needs, and
also opportunities in healthcare.
So if we look at the numbers, the needs in healthcare are definitely large.
In the United States, we spend more on healthcare,
and get worse outcomes than comparable countries.
And if we look at different diseases, the success rates of
our treatments vary from, say, 30 to 60% for a lot of common diseases.
Okay, so there's definitely work to be done here.
Now the good news is that there's lots of really cool
clinical questions that data science can address.
So some general work in the field.
There's great work in imaging,
looking at trying to figure out where strokes are in patients.
How to control seizures through neurostimulation, how to predict sepsis.
So these are just examples, and all of this work is being done by female PIs.
And then, work in our lab where we're trying to predict interventions in
the ICU or taking data from electronic health records and online health forums
to try to better understand diseases, such as autism spectrum disorder.
It is very heterogeneous, and
a big barrier in that regime is even figuring out what a disease is,
right, and if we can categorize it better, maybe we can do better treatments.
And then finally,
another area that we're working in is optimizing the treatment of HIV.
So lots of really, really cool applications.
They can have huge impact for patients, right?
So these are real problems where we can have real impact,
and I'm gonna share today the technical part of just one of these problems
which is the optimization of treatment of patients with HIV.
All right. So just to give you a little bit of
background, HIV affects about 36 million people worldwide.
And the way it's treated is people are given cocktails of antiretroviral drugs
and the disease is rapidly mutating.
So, you use the one drug cocktail for some time, you become resistant or
rather the virus becomes resistant, you have to switch to something else.
So, we have to think about a sequence of decisions here.
Because if we make a choice of cocktail that will cause a patient to be
resistant to our future cocktails then we're in trouble, right?
So we need to figure out how do we plan for now as well as the future.
So how have data scientists attacked this problem in the past?
I'm gonna share two general ideas.
And this is just
to give you a feel of what sort of cool things we can do in the healthcare space.
So there's a class of techniques called Kernel-based approaches.
And you can think of them as nearest neighbors.
So if I want to predict what's going to be good for a particular patient,
say the one with the red dot over here, I'll look at similar patients and
I'll look at what worked well for them.
And so, what has been, to date, a popular or
common approach is to do this, and also to look just in the near term and
say, all right, if I gave this drug, it dropped the patient's viral load,
the viral level, for a short period of time, and that we consider a success.
But as I mentioned, we have to consider the long term as well, right?
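As a rough illustration of the nearest-neighbor idea just described, here is a minimal sketch in Python. It is not the method from the talk: the Gaussian kernel, the `bandwidth` parameter, the feature tuples, and the cocktail names are all assumptions made for the example. The `outcome` value could be a short-term signal (did the viral load drop) or a long-term one, which is the distinction the talk goes on to draw.

```python
import math

def kernel_weight(x, y, bandwidth=1.0):
    """Gaussian kernel (an assumed choice): closer patients get weights nearer 1."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2 * bandwidth ** 2))

def kernel_action_values(patient, cohort, bandwidth=1.0):
    """Kernel-weighted average outcome for each drug cocktail.

    cohort: list of (features, action, outcome) tuples from past patients.
    """
    totals, weights = {}, {}
    for features, action, outcome in cohort:
        w = kernel_weight(patient, features, bandwidth)
        totals[action] = totals.get(action, 0.0) + w * outcome
        weights[action] = weights.get(action, 0.0) + w
    return {a: totals[a] / weights[a] for a in totals if weights[a] > 0}

def recommend(patient, cohort, bandwidth=1.0):
    """Pick the cocktail that worked best for similar patients."""
    values = kernel_action_values(patient, cohort, bandwidth)
    return max(values, key=values.get)

# Hypothetical toy cohort: (patient features, cocktail given, observed outcome).
cohort = [
    ((0.0, 0.0), "cocktail_A", 1.0),
    ((0.1, 0.0), "cocktail_A", 0.8),
    ((0.0, 0.1), "cocktail_B", 0.2),
    ((5.0, 5.0), "cocktail_B", 1.0),
]
print(recommend((0.0, 0.0), cohort))
```

The recommendation depends heavily on which patients count as "near", which is why the approach struggles when a patient has no close neighbors.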
Now, there's another class of techniques that are called model
based techniques in reinforcement learning.
So the idea here is, instead of looking for nearest neighbors, right, what we're gonna
do is we're actually gonna try to simulate what the disease process might look like.
So what we're going to do is we're gonna say there's some hidden disease state that
we don't know, right?
People are very squishy and diseases are even squishier, so
we have no idea what's going on.
We just call that the white circle over there.
We give drugs, and that changes the system.
The system that we can't observe.
But we can observe certain measurements, like the CD4 counts or
the viral loads, etc.
And if we have a model for
the system, the patient, then we can figure out what the best treatments are.
And this is appealing, but it's hard to do in practice, because we have so
many measurements, and not enough data. How do we actually train these models?
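To make the "hidden disease state plus observed measurements" picture concrete, here is a minimal sketch of one belief update in a discrete hidden-state model (a standard Bayes filter). The two states, the single action, the observation names, and every probability below are invented for illustration; the talk does not give the model's actual structure or parameters.

```python
def belief_update(belief, action, observation, transition, emission):
    """One step of a discrete Bayes filter over a hidden disease state.

    belief: dict mapping state -> probability
    transition[action][s][s2]: P(next state s2 | state s, action)
    emission[s2][observation]: P(observation | state s2)
    """
    states = list(belief)
    unnormalized = {}
    for s2 in states:
        # Predict: where could the hidden state have moved after the drug?
        predicted = sum(belief[s] * transition[action][s][s2] for s in states)
        # Correct: reweight by how well each state explains the measurement.
        unnormalized[s2] = emission[s2][observation] * predicted
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("observation has zero probability under the model")
    return {s: p / total for s, p in unnormalized.items()}

# Hypothetical two-state model with made-up numbers.
transition = {
    "cocktail_A": {
        "controlled": {"controlled": 0.9, "resistant": 0.1},
        "resistant": {"controlled": 0.2, "resistant": 0.8},
    }
}
emission = {
    "controlled": {"low_viral_load": 0.8, "high_viral_load": 0.2},
    "resistant": {"low_viral_load": 0.3, "high_viral_load": 0.7},
}
belief = {"controlled": 0.5, "resistant": 0.5}
new_belief = belief_update(belief, "cocktail_A", "high_viral_load",
                           transition, emission)
print(new_belief)
```

Even this toy version needs a full transition and emission table per action, which hints at why such models are hard to fit when there are many measurements and limited data.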
So the key insight that we had for this particular work,
is that these approaches actually have complementary strengths.
So imagine that we have this picture and
there's patients that are in clusters, right?
So if my red patient, the one for whom I'm trying to predict which drug cocktail
to give, happens to have a lot of other similar patients,
maybe the right thing to do is just look around at those similar patients,
those are the best examples that I have.
And say, let's try to copy what works best with those patients.
Seems very reasonable, right?
If you're lucky to have a clone in your database, then there's nothing better,
right, cuz it models you exactly.
On the other hand, you might have patients who don't have nearby neighbors.
And this is where those nearest neighbor approaches fail.
Because it's not a good idea to just map
to the nearest neighbor if the nearest neighbor is far away.
And in such situations,
it might actually be better to fall back on a more simplified notion of how
disease progresses rather than something that's wrong, right?
So the key insight here is that,
it's better to have a simple explanation, rather than a wrong explanation.
Okay?
And then what do we do?
We combine.
So here's the insight, and the way we are gonna put it together
is we're going to say, well our kernel is going to suggest an action, our POMDP,
which if you don't know what it is, it doesn't matter.
Think of it as the model based approach.
It's gonna select an action and
then we're going to choose between the two recommendations based on again,
where the patient lives in this patient space, right?
That's the key idea.
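The selection step can be sketched as a small gate that looks at where the patient sits relative to the rest of the cohort and then picks one of the two recommendations. This is only a sketch under assumptions: the logistic form, the two features (nearest-neighbor distance and history length), and the weight values are hypothetical, not taken from the talk.

```python
import math

def gate_probability(features, weights, bias):
    """Logistic gate: probability that the model-based action should be used."""
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def choose_action(kernel_action, model_action, features, weights, bias):
    """Return the model-based action when the gate favors it, else the kernel's."""
    p_model = gate_probability(features, weights, bias)
    return model_action if p_model > 0.5 else kernel_action

# Hypothetical features: [distance to nearest neighbor, history length].
weights, bias = [2.0, 0.1], -3.0

# Patient with a close neighbor and a short history: trust the kernel.
print(choose_action("cocktail_A", "cocktail_B", [0.1, 2], weights, bias))
# Patient far from everyone with a long history: fall back on the model.
print(choose_action("cocktail_A", "cocktail_B", [3.0, 10], weights, bias))
```

In practice the gate's parameters would be learned from data rather than set by hand as they are here.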
So again, the Kernel Action,
in the past people have used short-term success criteria.
Like does the viral load go down or not?
And we extend that to take into account, are we creating additional mutations?
Cuz we don't wanna create mutations,
those are the things that will screw us over in the future.
And are we keeping the viral load down?
And then we build the POMDP.
I'm not gonna go into all the details, and solve the policy, right?
So, jumping to the conclusion, does this work?
Well, we looked at a database of around 33,000 patients,
holding out about 3,000 for testing.
And again, we had information not only about traditional measurements like
CD4 counts and viral loads, but we also had all the mutations
of the virus over time for these patients, which is becoming increasingly common.
This database comes from the EU, but recently there are efforts in South Africa
as well to do more genotyping, cuz we've found that it's very valuable to
understand, and actionable in terms of choosing treatments for HIV.
I also wanna emphasize the large size of the action space here.
So there's 20 drugs that were used in over a thousand different combinations,
and we limited ourselves to only 300, right?
So if you're thinking about this in terms of hard problems in data science, there's the fact
that we have so many different choices for actions with a relatively small dataset.
This is the other thing I want to highlight.
In healthcare, our data are inconveniently sized,
which is what I often like to say, because they're not small, right, 33,000 patients.
It's not a small data set, but it is small by big data standards.
We have to think a little bit carefully about how we're going to use the data.
And then here are the results, with slightly different scales on the rewards. So
if we apply just a random policy, we will do pretty poorly.
The ST policy is the policy that you get if you just think about
short term rewards, so you think, I just wanna keep the viral load down tomorrow or
in the next three weeks, and I don't think about the future.
How well do I do in the long run?
And then the LT, long term, nearest neighbor or kernel policy,
you'll notice it does significantly better in the long run.
And it may be obvious to all of us in the room because we're used to thinking about
machine learning and sequential decision making processes.
But this was actually new to some of the clinicians that we worked with,
who were not sure that it made sense to even think about the future.
Because they thought the future might be too uncertain.
They're like, who knows what will happen to the patient.
We were actually able to show that
there is enough predictability in this disease progression,
that you can think about the future and you can optimize for it.
Now what you'll notice is that the POMDP policy, which is a model based approach,
the approach that I said is too simple,
actually does quite a bit worse, right?
But if you combine the two in the mixture that I mentioned,
we do quite a bit better than the long-term kernel policy.
And the POMDP is actually being chosen 30% of the time.
So now, like good data scientists, we went through and we checked:
was our hypothesis correct?
Is it the case that we are choosing the Kernel when we have nearby neighbors,
and we're using the model when there are no nearby neighbors, right?
That was again our hypothesis for this domain.
And we find out it is the case.
So these graphs show that if you look at the distances,
the distances to the nearest neighbors are higher when the POMDP is chosen than when
the nearest-neighbor kernel is chosen, and they are also correlated with history length.
So if you have long history, it's harder to look similar to someone, right?
Cuz you've just had a lot of things happen to you, and what are the chances that
somebody else also had all those exact same things happen to them, right?
So this was super exciting.
This is recent work that's gonna be published at AMIA this year.
And my colleague just spent two days this past
week going through it with our collaborators on HIV,
going through and checking, okay, does this policy actually make sense, right,
because we evaluated using some retrospective data analysis techniques on
the observational cohort.
There could be all sorts of funny biases.
But again, two days of vetting by the clinicians,
they actually look reasonable, so I'm super excited.
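The talk does not say which retrospective technique was used to evaluate the policy. One standard off-policy evaluation estimator is trajectory-level importance sampling, sketched below as an illustration only. The function names, the trajectory format, and the toy policies are all assumptions; the sketch also assumes the behavior policy (what clinicians actually did) gives nonzero probability to every logged action.

```python
def importance_sampling_value(trajectories, target_policy, behavior_policy,
                              gamma=1.0):
    """Estimate the value of target_policy from data logged under behavior_policy.

    trajectories: list of lists of (state, action, reward) tuples
    target_policy, behavior_policy: functions (state, action) -> probability
    """
    estimates = []
    for trajectory in trajectories:
        ratio, discounted_return = 1.0, 0.0
        for t, (state, action, reward) in enumerate(trajectory):
            # Reweight by how much more (or less) likely the new policy
            # is to take the logged action than the clinicians were.
            ratio *= target_policy(state, action) / behavior_policy(state, action)
            discounted_return += (gamma ** t) * reward
        estimates.append(ratio * discounted_return)
    return sum(estimates) / len(estimates)

# Hypothetical toy data: clinicians chose uniformly between two cocktails.
def behavior(state, action):
    return 0.5

# Candidate policy that always gives cocktail "a".
def target(state, action):
    return 1.0 if action == "a" else 0.0

logged = [[("s0", "a", 1.0)], [("s0", "b", 0.0)]]
print(importance_sampling_value(logged, target, behavior))
```

Estimators like this can have high variance on long treatment histories, one reason the results still need vetting by clinicians.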
So we've taken a data science problem, a clinical problem, we've formalized it,
we found a key insight that made it work and we got reasonable results, right?
So that's just giving you the story of a full pipeline of the sort of
things you can do in this space, and as this stuff becomes more and more vetted,
of course we're hoping that it will inform the treatment of actual patients.
So now I want to zoom out, and I want to mention that, so
I mentioned all the clinical problems right, working with seizures,
working with HIV, working with disease subtypes.
What are the data science problems that you could be solving in this space?
So when it comes to healthcare, we have a lot of low quality biased data, right?
It's a lot of observational data, and it captures really important populations,
cuz if we look at people who are actually just showing up to hospitals,
that's the real people who are showing up to hospitals, right?
Right? We're not selecting out for,
I only want to look for this race that shows up, or
I only want to look at this age group or this gender.
We're actually seeing the real population.
But that means again, that our data sources are often of very low quality.
We can mark some of those, so
semi-supervised methods are very important.
People only come in when they feel like it.
So, we really need models for data that only shows up by convenience.
Off-policy evaluation, as I said, and more important than anything else,
I think, is interpretability.
Because we have to be able to convince the clinicians that we haven't screwed up
somehow, and we're doing something that's reasonable.
Now, these are really, really cool problems.
And what I want to finish up with, is just to zoom back a little bit,
because this is also a Women in Data Science Conference.
Tell you a little bit about my path to getting here, cuz I am so excited about
the work that I'm doing, but it definitely was not a straight shot to get here.
So I grew up with my great uncle,
who was part of India's independence movement.
I went to a high school that had a focus on government and
national studies.
Even though it had a really fantastic math program, the focus was on government,
and yet I found that, as much as I was kind of interested in activism,
my interests were more towards academics.
It's kind of embarrassing, I like statistics, and science, and
I couldn't really imagine myself working in a soup kitchen, which,
in my mind at the time, was what service was all about.
And it took me like five academic degrees, hopefully it doesn't
take you guys that long, [LAUGH] to figure out how am I going to deliver
those warm fuzzies that I really cared about, with cold hard numbers.
And I'm very happy that today I found a way to combine things that
I'm passionate about with the sort of skills that I have and
the things that I like doing day to day.
And we are all incredibly privileged, because you all love data, and
data is everywhere.
So the last thing that I want to end with, is that, no matter what cause
you care about, there are a lot of really important problems out there.
And whatever cause you care about, there is data associated with those, and
I really encourage you to pursue your passions and solve those real problems.
And if that area happens to be in health, or you think it might be in health,
one place that I encourage you to check out is the machine learning for
healthcare conference.
If you go to mathmed.org, the next conference is happening in
Boston next summer, but also on our website is a list of past speakers,
people in the field that you can talk to.
There's a lot of really important areas in this space, and
I hope you'll be a part of the solution to solving them, thank you.
[APPLAUSE] >> Thank you.
That is a great conference.
So we have five minutes for some questions.
We have a mic?
There's a question here as well for you later.
Yeah. >> Can I ask a question about slide 10?
You talked about the nearest neighbor.
This may be a little.
>> I think there.
Yeah.
>> Yup.
>> Yeah, so you said that you're looking at the nearest neighbors.
And you are looking at the neighbors, so how do you define this threshold or
neighborhood that you look at, or don't look at?
>> I see, yeah, it's a good question.
So it goes into the classifier here under patient statistics.
So we look at the quantile distances. The distance between this patient
and the nearest patient is not necessarily the most informative, though we put that
into the regression as well, but you can also look at the first quantiles,
so if you look at the 25% of the patients who are nearest to this person,
what distance is that?
So that gives you a sense of clustering or clumping together.
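The quantile-distance features mentioned in this answer can be sketched as follows. This is an illustration under assumptions: Euclidean distance, the specific quantile convention, and the function name are choices made for the example, not details confirmed in the talk.

```python
import math

def neighbor_distance_features(patient, cohort, quantiles=(0.0, 0.25)):
    """Distances from a patient to the cohort at the given quantiles.

    A quantile of 0.0 gives the nearest-neighbor distance; 0.25 gives the
    distance within which the closest 25% of the cohort falls.
    """
    if not cohort:
        raise ValueError("cohort must contain at least one patient")
    dists = sorted(math.dist(patient, other) for other in cohort)
    features = []
    for q in quantiles:
        # Index of the last patient inside the closest q fraction (assumed convention).
        idx = max(math.ceil(q * len(dists)) - 1, 0)
        features.append(dists[idx])
    return features

# Hypothetical cohort of 8 patients at distances 1 through 8 from the query.
cohort = [(float(i), 0.0) for i in range(1, 9)]
print(neighbor_distance_features((0.0, 0.0), cohort))
```

Small values at both quantiles suggest the patient sits inside a cluster; a small nearest distance with a large 25% distance suggests a single lucky match rather than a dense neighborhood.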
>> So you explored many techniques to map the best treatments for certain patients
for pretty well researched diseases and conditions like HIV.
But is there a way to possibly map these conclusions to more rare conditions,
perhaps in neurology and endocrinology, or things like that,
that may be able to benefit from machine learning or predictive modeling?
>> I think there's a lot of opportunities in this space.
I think as soon as you have diseases that are more rare,
I think it becomes important to bring in more and more domain knowledge.
So I think the technical question becomes, how do we take information that clinicians
know, because they know a lot, and I think there's often this machine learning,
data science hubris, of like, let's just look at the data.
But the more we can incorporate domain knowledge and
say what's gonna be possible, we can use it to filter out hypotheses.
But yes, I think there's important technical questions and
I think that problem can be addressed.
>> So next year when you're done with high school, can you please apply to Stanford?
>> [LAUGH] >> Next question.
>> Hi, I think the clinical work is fantastic, but
if you think of healthcare more broadly,
we have people that have called our infant mortality rates an embarrassment.
We pay far more than any other country and we haven't even insured everyone.
We have more uninsured people than are in many European countries.
So, do you see any opportunities to make headway
in sort of the healthcare space more broadly?
>> So, that's a great question.
So, my personal take on this is that, I think that for people who want to do
the analytics, if that's where your heart is in terms of your passion and your skill,
I think some of this more point of care, bedside stuff is more accessible or
manageable, because you can find clinicians who will adopt your techniques and
try them and get those policies to change on a small scale.
I think for changing policy on a high level,
data will definitely be needed to support it.
But honestly I think the evidence is already there, we can look at these other
countries and we can see there are certain social structures and
certain healthcare structures that result in better quality, and
I don't think the question is, does the data support that we need change?
I think that we need to advocate to our government for
a change, but it's not really a question of like, does it need to be done?
It does need to be done.
>> All right, well,
this is a great point to change over to the next speaker in a bit.
So, thank you so much again, Finale for coming.
>> [APPLAUSE].