inCognita Parkinson's Disease (PD) Results

Suppose we are given access to passively-collected smartphone data for a patient and wish to predict whether the individual has a brain disorder of interest. A supervised learning approach to this task entails:

i.) acquiring smartphone sensor traces for a cohort of patients who have been labeled according to whether they have the target disorder (TD), preferably by clinical experts, and

ii.) using these examples to learn a model which recognizes the TD-status of new (unseen) persons [20]. Unfortunately, in medical passive-sensing applications there are rarely enough labeled examples to support induction of good, generalizable models.

inCognita Results: Detecting and Monitoring Parkinson's Disease with I+FLL

We now evaluate the performance of Algorithm I+FLL for the task of detecting Parkinson's Disease (PD) from smartphone data. To allow the robustness and generalizability of the approach to be investigated, two different datasets, corresponding to disparate patient cohorts, are used in the experiments:

a ‘US cohort’ of 31 individuals, 23 of whom have PD [10],
a ‘Turkish cohort’ of 40 individuals, 20 of whom have PD [12].

The accuracy of Algorithm I+FLL is assessed by comparing its predictions with those of both strong benchmark models and established PD screening questionnaires. Additionally, we study the predictive power of distinct feature subsets obtained through different signal processing schemes.

Results with US cohort

The prediction task in the first set of experiments is to learn models which accurately determine whether a given individual has PD based upon brief smartphone recordings of the person’s voice. The ‘US cohort’ dataset [10] used in these experiments:

is composed of 195 (de-identified) voice recordings for 31 individuals, 23 of whom have PD, with most recordings lasting for a few seconds; each voice recording is described by 22 features expected to be PD-relevant [10,25], including
- 16 standard ‘statistical’ attributes (e.g. mean/maximum/minimum fundamental frequency, variation of fundamental frequency and signal amplitude, signal-to-noise ratio) – see [25] for details;
- 6 information-theoretic/complexity measures (e.g. fractal dimension, pitch entropy, dynamical complexity) – see [25] for details;
contains recordings of PD patients in the early stages of the disease – for example, 26% have little or no functional disability (stage ≤ 1.5 on the Hoehn/Yahr scale [10]) and 35% have been diagnosed for ≤ 2yr;
includes a label for each patient, ‘PD’ or ‘healthy’, assigned by PD specialists based upon thorough clinical examination (see [10] for details).

Predictive performance is measured with area under the ROC curve (AUC), estimated through leave-one-out cross-validation (together with statistical significance of performance differences [31]). Quantitative assessment of the accuracy of Algorithm I+FLL is facilitated through comparison with two state-of-the-art benchmark models as well as two simplified versions of the algorithm. The five PD-detection strategies implemented in the experiment are:

‘I+FLL’: Algorithm I+FLL deployed with all 22 physiological features (see above and [25]), where preliminary predictions are made with an RF classifier trained on DL, prediction refinement is based upon two feature labels with initial estimates u0,increased_pitch_entropy = +1, u0,increased_signal/noise = −1, and β1 = β2 = β3 = 0.3;
‘I+FLL-statistical’: Algorithm I+FLL as above but using only 16 voice statistics as features (e.g. mean fundamental frequency, signal/noise ratio) [25];
‘I+FLL-information-theoretic’: Algorithm I+FLL as above but using only 6 information-theoretic and complexity quantities as features (e.g. pitch-entropy, fractal dimension of voice signal embedding) [25];
‘little’: state-of-the-art PD detection model built with an SVM classifier and extensive feature engineering [10];
‘psorakis’: relevance vector machine with sophisticated basis function and training sample selection [34].
Tests of standard learning methodologies (traditional SVMs, logistic regression) [20] for this task suggest these perform poorly, likely because the training dataset is small.
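The evaluation protocol named above (AUC estimated through leave-one-out cross-validation) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the helper names, the one-feature nearest-mean classifier standing in for the real models, and the toy data, which are not values from [10]. The text does not say whether the held-out unit is a recording or an individual; the sketch simply holds out one example at a time.

```python
from collections import defaultdict


def auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity.

    labels: 1 for PD, 0 for healthy; scores: higher means more PD-like.
    A tie between a positive and a negative score counts as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def leave_one_out_scores(X, y, fit, predict_score):
    """Hold out each example once, train on the rest, score the held-out one."""
    scores = []
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        model = fit(train_X, train_y)
        scores.append(predict_score(model, X[i]))
    return scores


# Hypothetical stand-in classifier: class means of a single feature.
def fit_centroids(X, y):
    sums, counts = defaultdict(float), defaultdict(int)
    for x, label in zip(X, y):
        sums[label] += x[0]
        counts[label] += 1
    return {label: sums[label] / counts[label] for label in sums}


def centroid_score(model, x):
    # Positive when x is closer to the PD mean than to the healthy mean.
    return abs(x[0] - model[0]) - abs(x[0] - model[1])


# Toy data for illustration only.
X = [[0.1], [0.2], [0.3], [0.7], [0.8], [0.9]]
y = [0, 0, 0, 1, 1, 1]
loo = leave_one_out_scores(X, y, fit_centroids, centroid_score)
print(auc(y, loo))
```

Because every held-out score comes from a model that never saw that example, the resulting AUC estimates performance on unseen cases rather than fit to the training data.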

Results of the ‘US cohort’ PD-detection experiment are shown in Figure 9. It is seen that a prediction model learned using Algorithm I+FLL (cyan bar) significantly outperforms all benchmark models (AUC = 0.985 with p < 0.01 for AUC differences). Notice that the sets of statistical and information-theoretic/complexity features are predictive on their own and complementary when combined.

We also compared the PD-detection accuracy of Algorithm I+FLL, which requires only passive smartphone data, to that of high-quality PD-screening questionnaires. PD-screening tools evaluated in this test include NMS-Quest [35] and Tele-Quest [36], and ground-truth PD labels reflect diagnoses made by PD specialists based upon thorough clinical examination. Because [35,36] report sensitivity and specificity [20], rather than AUC, those are also used here. The results of the study are:

Algorithm I+FLL: sensitivity = 0.945, specificity = 0.960;
NMS-Quest [35]: sensitivity = 0.718, specificity = 0.885;
Tele-Quest [36]: sensitivity = 0.890, specificity = 0.880.

Thus Algorithm I+FLL, processing only smartphone data, achieves better screening accuracy than labor-intensive, clinically-validated PD questionnaires.
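For reference, sensitivity and specificity as used in this comparison follow the standard definitions: the fraction of true PD cases flagged as PD, and the fraction of healthy individuals flagged as healthy. A minimal sketch is given below; the function name and the toy labels are illustrative assumptions and do not reproduce the study data.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels (1 = PD, 0 = healthy)."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1  # PD correctly detected
        elif truth == 1 and pred == 0:
            fn += 1  # PD missed
        elif truth == 0 and pred == 0:
            tn += 1  # healthy correctly cleared
        else:
            fp += 1  # healthy flagged as PD
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one PD and one healthy example")
    return tp / (tp + fn), tn / (tn + fp)


# Toy labels for illustration only.
truth = [1, 1, 1, 1, 0, 0]
preds = [1, 1, 1, 0, 0, 1]
sens, spec = sensitivity_specificity(truth, preds)
print(sens, spec)
```

Unlike AUC, both quantities depend on the decision threshold applied to the model's scores, so a fixed operating point is implied whenever they are reported.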

Figure 9. PD-detection accuracy (AUC) of Algorithm I+FLL with all features (tan) and four benchmark models: ‘little’ [10] (red), ‘psorakis’ [34] (blue), Algorithm I+FLL with statistical features (teal), and Algorithm I+FLL with information-theoretic/complexity features (green).

Results with Turkish Cohort

The next set of experiments also involves learning models which detect PD in individuals based upon smartphone recordings of the person’s voice. In this case, though, we apply Algorithm I+FLL to a cohort that is quite different from the ‘US cohort’ – the individuals speak another language and are culturally and ethnically distinct. These differences allow examination of the algorithm’s robustness and generalizability.

In greater detail, the ‘Turkish cohort’ [12] used in these experiments

consists of 1040 (de-identified) voice recordings for 40 individuals, 20 of whom have PD, with most recordings lasting for a few seconds; each voice recording is described by 26 features which are expected to be PD-relevant [10,12,25] and are similar to the ‘statistical’ attributes summarized above for the ‘US cohort’ (see [25] for more details);
contains recordings of PD patients in the early stages of the disease; for example, no patient has more than mild impairment and 40% have little or no functional disability (UPDRS assessment [12]); additionally, no patient has been diagnosed for > 6yr;
includes a label for each patient, ‘PD’ or ‘healthy’, assigned by PD specialists based upon thorough clinical examination (see [12] for details);
has more feature measurement variability than the ‘US cohort’ dataset, motivating a modest extension to Algorithm I+FLL to provide enhanced robustness to feature noise (see below).

Three versions of Algorithm I+FLL are implemented, corresponding to distinct ways of mitigating the presence of feature noise in the ‘Turkish cohort’:

‘naïve’: Algorithm I+FLL with raw smartphone features, i.e., no feature label learning or physiological features (essentially the classifier developed in [37]);
‘feature-averaging’: Algorithm I+FLL with feature-averaging – each individual is modeled using the mean of that person’s recording-level features;
‘probability-averaging’: Algorithm I+FLL is applied to voice recordings, and individual-level predictions are formed by averaging that person’s recording-level predicted probabilities.
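The difference between the ‘feature-averaging’ and ‘probability-averaging’ schemes can be made concrete with a short sketch. It is only an illustration under stated assumptions: the function names, the (subject id, feature vector) data layout, and the hard-threshold `toy_model` are hypothetical and stand in for the trained I+FLL model.

```python
from collections import defaultdict
from statistics import mean


def feature_averaging(recordings, predict_proba):
    """Average each subject's feature vectors, then predict once per subject."""
    by_subject = defaultdict(list)
    for subject_id, features in recordings:
        by_subject[subject_id].append(features)
    return {
        subject_id: predict_proba([mean(col) for col in zip(*rows)])
        for subject_id, rows in by_subject.items()
    }


def probability_averaging(recordings, predict_proba):
    """Predict for every recording, then average probabilities per subject."""
    by_subject = defaultdict(list)
    for subject_id, features in recordings:
        by_subject[subject_id].append(predict_proba(features))
    return {subject_id: mean(probs) for subject_id, probs in by_subject.items()}


# Hypothetical recording-level model: a hard threshold on one feature.
def toy_model(features):
    return 1.0 if features[0] > 0.5 else 0.0


# Toy recordings for one subject; not data from [12].
recordings = [("a", [0.9]), ("a", [0.9]), ("a", [0.0])]
by_feature = feature_averaging(recordings, toy_model)
by_probability = probability_averaging(recordings, toy_model)
print(by_feature["a"], by_probability["a"])
```

With a nonlinear model the two schemes generally disagree: averaging features first lets one outlying recording shift the single input the model sees, whereas averaging probabilities limits each recording's influence to its share of the subject's predictions.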

The results of PD-detection experiments with the ‘Turkish cohort’ are displayed in Figure 10. The ‘probability-averaging’ version of Algorithm I+FLL (green bar) is substantially more accurate than the other schemes (AUC = 0.973 with p < 0.01 for AUC differences). This outcome indicates that averaging recording-level predictions may be a valuable extension of Algorithm I+FLL when smartphone sensor readings are especially noisy. These results and those of the preceding subsection offer evidence of the robustness and generalizability of the proposed methodology across quite different patient cohorts.

Figure 10. PD-detection accuracy (AUC) of three versions of Algorithm I+FLL: ‘naïve’ (red), prediction based upon patient’s average feature-vector (blue), and prediction by averaging patient’s recording-level model predictions (green).

Concluding remarks

The results show that, with the inCognita framework, Volv presents a new method for leveraging passively-collected smartphone data and machine learning to detect and monitor brain disorders such as Parkinson's disease. Crucially, the algorithm is able to learn accurate, clinically-compatible models from small numbers of labeled examples (i.e., smartphone users for whom sensor data has been gathered and disease status has been determined). Predictive modeling is achieved by learning from augmented training sets composed of both real and ‘synthetic’ patients.

InCognita References

References listing for the inCognita product