Learning models from synthetic patients

In some applications, the labeled training data is extremely limited, class-imbalanced, or both, and this is an obstacle to effective learning. The situation arises, for example, in smartphone-based health monitoring: it is often difficult to collect smartphone data traces for individuals with a confirmed diagnosis of the Target Disease (TD) of interest, and typical datasets have far fewer ‘cases’ (patients with the TD) than ‘controls’ (unaffected patients).

We hypothesize that, if sufficiently realistic ‘synthetic’ patient vectors could be acquired, especially for the minority class, adding these vectors to the training set could improve the predictive modeling process. Specifically, it may be possible to increase the predictive accuracy of our algorithms by generating synthetic patient vectors, combining these vectors with real labeled examples, and training our algorithms on this larger ‘hybrid’ dataset. The basic aspects of this learning from synthetic patients (LSP) method are now described. (See [1,2] for a more thorough exposition.)
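The ‘hybrid’ dataset idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name build_hybrid_dataset, the toy data, and the jittered-copy stand-in for synthetic patients are all hypothetical and are not taken from [1,2].

```python
import numpy as np

def build_hybrid_dataset(X_real, y_real, X_synth, synth_label=1, seed=0):
    """Stack real labeled vectors with synthetic ones and shuffle the rows.

    Returns the hybrid features, the hybrid labels, and a boolean mask that
    marks which rows are synthetic.
    """
    X = np.vstack([X_real, X_synth])
    y = np.concatenate([y_real, np.full(len(X_synth), synth_label)])
    is_synth = np.concatenate([np.zeros(len(X_real), dtype=bool),
                               np.ones(len(X_synth), dtype=bool)])
    order = np.random.default_rng(seed).permutation(len(y))
    return X[order], y[order], is_synth[order]

# Toy, class-imbalanced data: 5 'cases' (label 1) and 95 'controls' (label 0).
rng = np.random.default_rng(42)
X_real = rng.normal(size=(100, 4))
y_real = np.array([1] * 5 + [0] * 95)

# Placeholder synthetic patients (assumption): jittered copies of real cases.
# A real SP generator would replace these two lines.
cases = X_real[y_real == 1]
X_synth = cases[rng.integers(0, len(cases), size=45)] + 0.1 * rng.normal(size=(45, 4))

X_hyb, y_hyb, is_synth = build_hybrid_dataset(X_real, y_real, X_synth)
print(X_hyb.shape, int(y_hyb.sum()), int(is_synth.sum()))
```

The synthetic-row mask is returned so that a downstream evaluation can, if desired, be restricted to real patients only; any classifier can then be trained on X_hyb and y_hyb.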

Previous studies have found that, to be useful for improving learning performance, synthetic examples should be realistic and diverse [2,3,4]. We integrate two complementary mechanisms to create such synthetic patients.

Realism is achieved by using adversarial learning to produce synthetic patient feature vectors (SPs) that are indistinguishable from those of real patients (RPs) [1,2] and by working directly in the original patient feature space.
Diversity is obtained by appropriately injecting randomness into SP construction, with an emphasis on outputting SPs which are well-separated in feature space [1,2,3].

As is demonstrated empirically in subsequent sections, the resulting SPs are useful and convenient for downstream analysis (e.g. predictive modeling) and are also intuitively interpretable by clinicians.

The iterative algorithm used to build synthetic patients which are both realistic and diverse is illustrated in the figure. Informally, at each iteration, the realism of the current SPs is increased by adapting their feature vectors so as to reduce the ability of a well-trained classifier to distinguish them from RPs. SP diversity is encouraged primarily through the use of two stochastic processes: random initialization of SP ‘seed’ vectors and random modification of the adaptation procedure aimed at increasing SP realism.
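A minimal sketch of such an iteration is given below. It is not the algorithm of [1,2]. It assumes a simple logistic-regression discriminator, gradient steps that raise each SP's predicted probability of being real, random Gaussian seeds, and a random feature mask plus small noise as the ‘random modification’. All function names and parameter values are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))

def fit_discriminator(real, synth, steps=300, lr=0.05):
    """Logistic regression by gradient descent: label 1 = real, 0 = synthetic."""
    X = np.vstack([real, synth])
    y = np.concatenate([np.ones(len(real)), np.zeros(len(synth))])
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        err = sigmoid(X @ w + b) - y
        w -= lr * (X.T @ err) / len(y)
        b -= lr * err.mean()
    return w, b

def generate_synthetic_patients(real, n_synth, n_iters=40, step=0.2,
                                keep_prob=0.8, noise=0.02, seed=0):
    rng = np.random.default_rng(seed)
    # Stochastic process 1: random initialization of the SP seed vectors.
    synth = rng.standard_normal((n_synth, real.shape[1]))
    for _ in range(n_iters):
        # Retrain the classifier that tries to tell RPs from current SPs.
        w, b = fit_discriminator(real, synth)
        p_real = sigmoid(synth @ w + b)
        # Gradient of log p_real with respect to each SP's features.
        grad = (1.0 - p_real)[:, None] * w
        # Stochastic process 2: randomly mask features and add small noise,
        # so that different SPs follow different adaptation paths.
        mask = rng.random(synth.shape) < keep_prob
        synth = synth + step * mask * grad + noise * rng.standard_normal(synth.shape)
    return synth

# Toy 'real patients': 40 vectors with 5 features, centred away from the seeds.
real = 3.0 + np.random.default_rng(1).standard_normal((40, 5))
sp = generate_synthetic_patients(real, n_synth=20)
print(np.linalg.norm(sp.mean(axis=0) - real.mean(axis=0)))
```

Because the discriminator in this sketch is linear, it mainly detects differences in feature means; a more expressive classifier would be needed to match finer structure of the RP distribution.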

This basic idea is formalized with the SP generation and learning algorithms. The main focus is applications with very few positive-class training examples, as this is common in smartphone-based health monitoring.

However, as delineated, synthetic patients corresponding to both positive-class and negative-class RPs can be created if needed.

References

[1] Colbaugh, R and K Glass, ‘Predictability-oriented defense against adaptive adversaries’, IEEE SMC, Seoul, Korea, October 2012.
[2] Colbaugh, R and K Glass, ‘Enhancing smartphone-based depression prediction with synthetic patients’, Technical Report, Volv Global, Lausanne, Switzerland, June 2019.
[3] Goodfellow, I et al., ‘Generative adversarial networks’, NIPS, Montreal, Canada, December 2014.
[4] Chawla, N et al., ‘SMOTE: Synthetic minority oversampling technique’, JAIR, Vol. 16, 2002.
