Our synthetic data solution is based on our cutting-edge research in machine learning for healthcare.
Our predictive modeling solution is based on comprehensive research in machine learning.
We have research publications on other data modalities, such as clinical notes and medical images (e.g., X-rays).
Real electronic health records (EHRs) are high-dimensional: they include diagnoses (ICD codes), procedures (CPT codes), and medications. Altogether, over 20K dimensions need to be modeled and synthesized. Most existing synthetic data generators cannot produce such high-dimensional data. Instead, they often require users to select a handful of variables of interest from the vast number of features in the real data, and they generate only those few variables (usually on the order of tens). In comparison, MediSyn can produce high-dimensional EHRs at their original resolution with high fidelity.
Each dot corresponds to a single medical code (ICD or CPT code). High R^2 indicates high fidelity.
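To make the per-code comparison concrete, here is a minimal sketch of how such a fidelity check could be computed. It is an illustration only, not MediSyn's evaluation code: the data layout (one list of codes per patient), the function names, and the choice of squared Pearson correlation as R^2 are all assumptions.

```python
from collections import Counter


def code_prevalence(patients):
    """Fraction of patients whose record contains each code at least once.

    `patients` is a hypothetical layout: one list of code strings per patient.
    """
    counts = Counter()
    for codes in patients:
        counts.update(set(codes))  # count each code once per patient
    n = len(patients)
    return {code: c / n for code, c in counts.items()}


def r_squared(real_prev, synth_prev):
    """Squared Pearson correlation of the two prevalence maps.

    Codes missing from one side are treated as prevalence 0 (an assumption).
    """
    codes = sorted(set(real_prev) | set(synth_prev))
    xs = [real_prev.get(c, 0.0) for c in codes]
    ys = [synth_prev.get(c, 0.0) for c in codes]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("prevalence values are constant; R^2 is undefined")
    return (sxy * sxy) / (sxx * syy)


# Toy example: each dot in the scatter plot would be one (x, y) pair below.
real = [["I10", "E11"], ["I10"], ["99213"], ["I10", "99213"]]
synthetic = [["I10"], ["I10", "E11"], ["99213", "I10"], ["99213"]]
print(r_squared(code_prevalence(real), code_prevalence(synthetic)))
```

Each code contributes one point, with its real prevalence on one axis and its synthetic prevalence on the other; a value near 1 means the synthetic data reproduces how common each code is.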
MediSyn can capture the co-occurrence patterns of medical codes within a visit. The prevalence of medical code pairs in synthetic data correlates very strongly with their prevalence in real data, even though over 5 million code pairs have to be modeled.
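As a rough illustration of what "co-occurrence within a visit" means as a statistic, the sketch below counts unordered code pairs per visit. The visit layout and the function name are hypothetical, and defining prevalence as the fraction of visits containing a pair is an assumption rather than a documented MediSyn definition.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_prevalence(visits):
    """Fraction of visits in which each unordered pair of codes appears together.

    `visits` is a hypothetical layout: one list of code strings per visit.
    """
    counts = Counter()
    for codes in visits:
        # Sort the de-duplicated codes so (a, b) and (b, a) map to one key.
        counts.update(combinations(sorted(set(codes)), 2))
    n = len(visits)
    if n == 0:
        return {}
    return {pair: c / n for pair, c in counts.items()}


visits = [["I10", "E11", "99213"], ["E11", "I10"], ["99213"]]
for pair, prevalence in sorted(cooccurrence_prevalence(visits).items()):
    print(pair, round(prevalence, 3))
```

Running the same function on real and on synthetic visits gives two prevalence maps whose values can then be compared pair by pair.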
MediSyn generates realistic longitudinal patient records of multiple visits over time. The temporal correlation of medical codes is accurately captured. Each dot in the plot indicates a pair of medical codes that occurs in consecutive visits. The x-axis is the prevalence of this pair in real data, while the y-axis corresponds to that in synthetic data.
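The temporal statistic described above can be sketched in the same way. This is an illustrative assumption, not the actual evaluation code: a patient is modeled as an ordered list of visits, a pair (a, b) means code a in one visit and code b in the next, and prevalence is taken as the fraction of consecutive-visit transitions that contain the pair.

```python
from collections import Counter
from itertools import product


def consecutive_pair_prevalence(patients):
    """Fraction of visit-to-visit transitions containing each ordered code pair.

    `patients` is a hypothetical layout: for each patient, a time-ordered
    list of visits, each visit being a list of code strings.
    """
    counts = Counter()
    n_transitions = 0
    for visits in patients:
        for current, following in zip(visits, visits[1:]):
            n_transitions += 1
            # Ordered pairs: (code in this visit, code in the next visit).
            counts.update(product(set(current), set(following)))
    if n_transitions == 0:
        return {}
    return {pair: c / n_transitions for pair, c in counts.items()}


patients = [
    [["I10"], ["E11"], ["I10", "99213"]],
    [["I10"], ["E11"]],
]
for pair, prevalence in sorted(consecutive_pair_prevalence(patients).items()):
    print(pair, round(prevalence, 3))
```

Unlike the within-visit statistic, the pairs here are ordered, so ("I10", "E11") and ("E11", "I10") are tracked separately.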
Our synthetic data can support machine learning modeling:
Our synthetic patient records are not mapped to any specific real patient. Furthermore, we thoroughly test all synthetic data against privacy attacks to ensure that the privacy of real patients is preserved.
A membership attack attempts to discover the identities of real patients in the training data. We introduce two versions of membership attacks.
Our experimental results show that attackers are unable to identify the real patients in the training data. In all settings, the attack success probability is close to 0.5, which is the success rate of random guessing.
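To show why a success probability near 0.5 is the desired outcome, here is a minimal sketch of one generic, distance-based membership attack. It is not either of the two attack versions referred to above, whose details are not given here; the Jaccard distance, the threshold rule, and all names are assumptions made for illustration.

```python
def jaccard_distance(a, b):
    """1 - |a ∩ b| / |a ∪ b| for two collections of codes."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def nearest_distance(record, synthetic):
    """Distance from a real record to its closest synthetic record."""
    return min(jaccard_distance(record, s) for s in synthetic)


def attack_accuracy(members, non_members, synthetic, threshold):
    """Accuracy of guessing 'in training data' when a close synthetic match exists.

    `members` were in the (hypothetical) training set, `non_members` were not.
    With equally sized groups, 0.5 is what random guessing achieves.
    """
    correct = 0
    for record in members:
        correct += nearest_distance(record, synthetic) <= threshold
    for record in non_members:
        correct += nearest_distance(record, synthetic) > threshold
    return correct / (len(members) + len(non_members))


synthetic = [{"I10", "E11"}, {"99213", "J45"}]
members = [{"I10", "E11"}, {"N18", "Z79"}]
non_members = [{"99213", "J45"}, {"K21"}]
print(attack_accuracy(members, non_members, synthetic, threshold=0.1))
```

In this toy setup, synthetic records sit equally close to members and non-members, so the attacker's guesses are right only half of the time. If the generator had copied training records, members would be systematically closer and the accuracy would rise toward 1.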