The prediction task in the first set of experiments is to learn models that accurately determine whether a given individual has PD based upon brief smartphone recordings of the person’s voice. The ‘US cohort’ dataset used in these experiments (a schematic data-loading sketch follows the list):
- is composed of 195 (de-identified) voice recordings for 31 individuals, 23 of whom have PD, with most recordings lasting a few seconds; each voice recording is described by 22 features expected to be PD-relevant [10,25], comprising
- 16 standard ‘statistical’ attributes (e.g. mean/maximum/minimum fundamental frequency, variation of fundamental frequency and signal amplitude, signal-to-noise ratio) – see [10,25] for details;
- 6 information-theoretic/complexity measures (e.g. fractal dimension, pitch entropy, dynamical complexity) – see [10,25] for details;
- contains recordings of PD patients in the early stages of the disease – for example, 26% have little or no functional disability (stage ≤ 1.5 on the Hoehn and Yahr scale) and 35% have been diagnosed for ≤ 2 years;
- includes a label for each patient, ‘PD’ or ‘healthy’, assigned by PD specialists based upon thorough clinical examination.
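For concreteness, the minimal Python sketch below loads data with this structure; the file name and column names (subject_id, status, and the 22 feature columns) are assumptions for illustration, not the authors’ actual schema.

```python
# Hypothetical loading sketch for a dataset with the structure described above.
# The file name and column names are assumed, not taken from the paper.
import pandas as pd

df = pd.read_csv("us_cohort_voice.csv")        # 195 rows, one per recording
feature_cols = [c for c in df.columns if c not in ("subject_id", "status")]

assert len(feature_cols) == 22                 # 16 statistical + 6 complexity
print(df["subject_id"].nunique())              # 31 individuals
print(df.groupby("subject_id")["status"].first().sum())  # 23 with PD (status = 1)

X = df[feature_cols].to_numpy()
y = df["status"].to_numpy()
```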
Predictive performance is measured with the area under the ROC curve (AUC), estimated through leave-one-out cross-validation, together with the statistical significance of performance differences. Quantitative assessment of the accuracy of Algorithm I+FLL is facilitated through comparison with two state-of-the-art benchmark models as well as two simplified versions of the algorithm.
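The evaluation loop itself is straightforward; the sketch below assumes a scikit-learn random forest as the preliminary classifier and the X, y arrays from the loading sketch above. It illustrates the protocol only and is not the authors’ implementation.

```python
# Leave-one-out AUC estimation: hold out each recording once, score it with a
# model trained on the rest, and compute a single AUC over all held-out scores.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneOut

def loo_auc(X, y, seed=0):
    scores = np.empty(len(y))
    for train_idx, test_idx in LeaveOneOut().split(X):
        clf = RandomForestClassifier(n_estimators=500, random_state=seed)
        clf.fit(X[train_idx], y[train_idx])
        scores[test_idx] = clf.predict_proba(X[test_idx])[:, 1]
    return roc_auc_score(y, scores)
```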
The five PD-detection strategies implemented in the experiment are:
- ‘I+FLL’: Algorithm I+FLL deployed with all 22 physiological features (see above), where preliminary predictions are made with an RF classifier trained on DL, prediction refinement is based upon two feature labels with initial estimates u0,increased_pitch_entropy = +1, u0,increased_signal/noise = −1, and β1 = β2 = β3 = 0.3 (a schematic sketch of this refinement step follows the list);
- ‘I+FLL-statistical’: Algorithm I+FLL as above but using only the 16 voice statistics as features (e.g. mean fundamental frequency, signal-to-noise ratio);
- ‘I+FLL-information-theoretic’: Algorithm I+FLL as above but using only the 6 information-theoretic and complexity quantities as features (e.g. pitch entropy, fractal dimension of the voice-signal embedding);
- ‘little’: a state-of-the-art PD-detection model built with an SVM classifier and extensive feature engineering;
- ‘psorakis’: a relevance vector machine with sophisticated basis-function and training-sample selection.
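Because this excerpt does not spell out the I+FLL refinement equations, the sketch below is schematic only: it combines the inputs named above (preliminary RF scores, the two feature-label estimates u0 = ±1, and the weight 0.3) in one plausible, assumed linear blend; the feature indices are hypothetical.

```python
# Schematic sketch of feature-label-guided refinement; the actual I+FLL update
# rules are not given in this excerpt, so the blend below is an assumption.
import numpy as np

PITCH_ENTROPY_IDX, SNR_IDX = 17, 5           # hypothetical column positions
u0 = {PITCH_ENTROPY_IDX: +1, SNR_IDX: -1}    # initial feature-label estimates
BETA = 0.3                                   # beta1 = beta2 = beta3 = 0.3

def refine(rf_scores, X_std, u0=u0, beta=BETA):
    """rf_scores: preliminary RF probabilities of PD, shape (n,).
    X_std: feature matrix standardized per column, shape (n, 22)."""
    adjustment = np.zeros_like(rf_scores)
    for j, u in u0.items():
        # u = +1: elevated feature values push the score toward PD;
        # u = -1: elevated values push it away (e.g. high signal-to-noise).
        adjustment += u * X_std[:, j]
    return np.clip(rf_scores + beta * adjustment, 0.0, 1.0)
```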
Tests of standard learning methodologies (traditional SVMs, logistic regression) on this task suggest that these perform poorly, likely because the training dataset is small.
Results of the ‘US cohort’ PD-detection experiment are shown in Figure 9. It is seen that a prediction model learned using Algorithm I+FLL (cyan bar) significantly outperforms all benchmark models (AUC = 0.985 with p < 0.01 for AUC differences). Notice that the sets of statistical and information-theoretic/complexity features are predictive on their own and complementary when combined.
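One common way to attach a p-value to an AUC difference, not necessarily the test used here, is a paired bootstrap over recordings that resamples both models’ held-out scores together:

```python
# Paired bootstrap for the difference in AUC between two models' held-out
# scores (scores_a, scores_b) on the same labels y; an illustrative test only.
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_diff_pvalue(y, scores_a, scores_b, n_boot=10000, seed=0):
    rng = np.random.default_rng(seed)
    n, diffs = len(y), []
    while len(diffs) < n_boot:
        idx = rng.integers(0, n, n)
        if np.unique(y[idx]).size < 2:   # need both classes to compute AUC
            continue
        diffs.append(roc_auc_score(y[idx], scores_a[idx])
                     - roc_auc_score(y[idx], scores_b[idx]))
    diffs = np.asarray(diffs)
    # two-sided p-value: fraction of bootstrap differences crossing zero
    return 2 * min((diffs <= 0).mean(), (diffs >= 0).mean())
```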
We also compared the PD-detection accuracy of Algorithm I+FLL, which requires only passive smartphone data, to that of high-quality PD-screening questionnaires. The PD-screening tools evaluated in this test are NMS-Quest and Tele-Quest [35,36], and ground-truth PD labels reflect diagnoses made by PD specialists based upon thorough clinical examination. Because [35,36] report sensitivity and specificity rather than AUC, those metrics are also used here (a sketch of their computation follows the list). The results of the study are:
- Algorithm I+FLL: sensitivity = 0.945, specificity = 0.960;
- NMS-Quest: sensitivity = 0.718, specificity = 0.885;
- Tele-Quest: sensitivity = 0.890, specificity = 0.880.
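For reference, both metrics follow directly from the confusion matrix of binary predictions; thresholding the refined scores at 0.5 is an assumption here, not a choice stated in the paper.

```python
# Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP).
import numpy as np

def sens_spec(y_true, scores, threshold=0.5):
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(scores) >= threshold).astype(int)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    return tp / (tp + fn), tn / (tn + fp)
```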
Thus Algorithm I+FLL, processing only smartphone data, achieves better screening accuracy than labor-intensive, clinically validated PD questionnaires.