In some applications, the labeled training data is extremely limited, class-imbalanced, or both, and this is an obstacle to effective learning. This situation arises, for example, because it is often difficult to collect smartphone data traces from individuals with a confirmed diagnosis of the target disease (TD) of interest, and because typical datasets contain far fewer ‘cases’ (patients with the TD) than ‘controls’ (unaffected patients).
We hypothesize that, if sufficiently realistic ‘synthetic’ patient vectors could be acquired, especially for the minority class, adding these vectors to the training set could improve the predictive modeling process. Specifically, it may be possible to increase the predictive accuracy of our algorithms by generating synthetic patient vectors, combining these vectors with real labeled examples, and training our algorithms on the resulting larger ‘hybrid’ dataset. The basic aspects of this learning from synthetic patients (LSP) method are now described. (See [1,2] for a more thorough exposition.)
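The construction of the ‘hybrid’ dataset can be illustrated with a minimal sketch. The function name `build_hybrid_dataset`, the toy feature values, and the `is_synthetic` flag are hypothetical and introduced here only for illustration; the sketch assumes that every synthetic vector is assigned the minority-class label.

```python
from collections import Counter


def build_hybrid_dataset(real_X, real_y, synth_X, synth_label=1):
    """Append synthetic patient vectors to the real labeled examples.

    Assumption: all synthetic vectors carry the same label (by default the
    minority class, encoded here as 1). The returned is_synthetic flags are
    an illustrative addition that records which rows are synthetic.
    """
    if len(real_X) != len(real_y):
        raise ValueError("real_X and real_y must have the same length")
    X = [list(x) for x in real_X] + [list(x) for x in synth_X]
    y = list(real_y) + [synth_label] * len(synth_X)
    is_synthetic = [False] * len(real_X) + [True] * len(synth_X)
    return X, y, is_synthetic


# Toy data (made up): four controls (label 0) and a single case (label 1).
real_X = [[0.1, 0.2], [0.0, 0.3], [0.2, 0.1], [0.1, 0.1], [0.9, 0.8]]
real_y = [0, 0, 0, 0, 1]
# Three synthetic minority-class vectors (made up).
synth_X = [[0.8, 0.9], [1.0, 0.7], [0.85, 0.75]]

X, y, is_synthetic = build_hybrid_dataset(real_X, real_y, synth_X)
print(Counter(real_y))  # class counts before augmentation
print(Counter(y))       # class counts in the hybrid dataset
```

In this toy case the hybrid dataset contains four examples of each class, and any standard learning algorithm could then be trained on `X` and `y`.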
Previous studies have found that, to be useful for improving learning performance, synthetic examples should be realistic and diverse [2,3,4]. We integrate two complementary mechanisms to create synthetic patients (SPs) with these properties.
As is demonstrated empirically in subsequent sections, the resulting SPs are useful and convenient for downstream analysis (e.g., predictive modeling) and are also intuitively interpretable by clinicians.
The iterative algorithm used to build synthetic patients that are both realistic and diverse is illustrated in the figure. Informally, at each iteration, the realism of the current SPs is increased by adapting their feature vectors so as to reduce the ability of a well-trained classifier to distinguish them from real patients (RPs). SP diversity is encouraged primarily through the use of two stochastic processes: random initialization of SP ‘seed’ vectors and random modification of the adaptation procedure aimed at increasing SP realism.
This basic idea is formalized in the SP generation and learning algorithms. The main focus is on applications with very few positive-class training examples, as this situation is common in smartphone-based health monitoring.
However, as delineated, synthetic patients corresponding to both positive-class and negative-class RPs can be created if needed.
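The iterative procedure described informally above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the formal SP generation algorithm: the discriminating classifier is taken to be a logistic regression retrained at every iteration, the SP seeds are drawn from a standard normal distribution, and the random modification of the adaptation step is implemented as random feature masking. All function and parameter names (`train_discriminator`, `generate_synthetic`, `step`, `keep_prob`) are hypothetical.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(w, b, x):
    """Discriminator's estimated probability that x is a real patient."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_discriminator(real, synth, epochs=200, lr=0.1):
    """Fit logistic regression by batch gradient descent (1 = RP, 0 = SP)."""
    d = len(real[0])
    w, b = [0.0] * d, 0.0
    data = [(x, 1.0) for x in real] + [(x, 0.0) for x in synth]
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in data:
            err = score(w, b, x) - y
            for j in range(d):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / len(data) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(data)
    return w, b


def generate_synthetic(real, n_synth, n_iters=30, step=0.5, keep_prob=0.7, seed=0):
    """Adapt randomly seeded vectors so they become harder to tell apart from RPs."""
    rng = random.Random(seed)
    d = len(real[0])
    # Stochastic process 1 (assumed form): random initialization of SP seeds.
    synth = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n_synth)]
    for _ in range(n_iters):
        w, b = train_discriminator(real, synth)
        for x in synth:
            # The gradient of log P(real | x) with respect to x is (1 - p) * w,
            # so this step moves x toward the region the classifier calls "real".
            scale = step * (1.0 - score(w, b, x))
            for j in range(d):
                # Stochastic process 2 (assumed form): each feature is adapted
                # only with probability keep_prob.
                if rng.random() < keep_prob:
                    x[j] += scale * w[j]
    return synth


# Toy usage: 20 made-up positive-class RPs with two features each.
data_rng = random.Random(42)
real_patients = [[data_rng.gauss(5.0, 1.0), data_rng.gauss(5.0, 1.0)] for _ in range(20)]
synthetic_patients = generate_synthetic(real_patients, n_synth=10)
```

In this toy setting the SPs should drift from their random seeds toward the region occupied by the RPs while remaining distinct from one another. A linear discriminator can only detect differences that a linear boundary separates, so a more expressive classifier would be needed to capture finer structure in real patient data.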