Working with electronic health records
Developing computational models that can detect rare disease patients in population-scale databases such as electronic health records (EHRs) is challenging for several reasons. Perhaps the most daunting is the limited number of already-diagnosed, ‘labelled’ patients from which to learn.
We overcome this obstacle with a novel lightly-supervised algorithm that leverages unlabelled and/or unreliably-labelled patient data – which is typically plentiful – to facilitate model induction. Importantly, we can prove the algorithm is safe: adding unlabelled or unreliably-labelled data to the learning procedure produces models that are usually more accurate, and guaranteed never to be less accurate, than models learned from reliably-labelled data alone.
Volv's methods are shown to substantially outperform state-of-the-art models in patient-finding.
State-of-the-Art Performance
Compare our results:
inTrigue: accuracy = 90.8% (standard error = 0.4%) and AUC = 93.0% (standard error = 0.1%)
kopcke (L2-regularised logistic regression on full feature set): accuracy = 73.1% (standard error = 1.1%) and AUC = 74.9% (standard error = 1.0%)
miotto (classification based on relevance scoring on full feature set): accuracy = 74.2% (standard error = 1.0%) and AUC = 75.8% (standard error = 1.0%)
This means we identify cohorts more accurately, eliminating inappropriate candidates and finding the true candidates more successfully. The result is better-quality recruitment and retention, for example in clinical trials.
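For readers who want to see how figures like the accuracy, AUC and standard error quoted in the comparison above are typically computed, here is a minimal Python sketch. It is not the evaluation code behind the reported results. The function names are hypothetical, and it assumes the standard error is taken over repeated evaluation runs (for example cross-validation folds), which the text above does not state.

```python
import math
import statistics


def accuracy(y_true, y_pred):
    """Fraction of predicted labels that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def auc(y_true, scores):
    """AUC as the probability that a randomly chosen positive patient
    receives a higher score than a randomly chosen negative one
    (ties count as half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_and_standard_error(values):
    """Mean of a metric across runs and the standard error of that mean.

    Assumption: one metric value per independent run or fold.
    """
    mean = statistics.mean(values)
    se = statistics.stdev(values) / math.sqrt(len(values))
    return mean, se


# Made-up illustrative data, not taken from any real patient cohort.
y_true = [1, 0, 1, 0, 1]
scores = [0.9, 0.2, 0.4, 0.6, 0.8]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

run_accuracy = accuracy(y_true, y_pred)
run_auc = auc(y_true, scores)

# Hypothetical per-fold accuracies, summarised as mean and standard error.
fold_mean, fold_se = mean_and_standard_error([0.90, 0.92, 0.91, 0.89])
```

Accuracy depends on the decision threshold (0.5 here), while AUC measures how well the scores rank true patients above non-patients across all thresholds, which is why both are reported side by side.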