Synthetic Data in China’s AI Training Race

Startups embrace synthetic datasets to overcome restrictions on real-world data.
✍️ By Dr. Alan Hughes | Telecoms & Space Policy Analyst
China’s AI sector faces a data dilemma: stricter rules on privacy and content are limiting access to large, real-world datasets. In response, startups and research labs are increasingly turning to synthetic data, meaning computer-generated datasets built to train machine learning models.
Synthetic data lets developers train AI systems on millions of simulated scenarios without relying on sensitive personal information. Autonomous driving firms, for example, can generate a wide range of driving conditions, from rain-slick highways to crowded urban streets, without exposing real driver data.
In healthcare, synthetic patient records enable algorithm training while protecting confidentiality. In finance, artificial transaction data helps improve fraud detection without risking exposure of private accounts.
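To make the finance example concrete, here is a minimal sketch of how artificial transaction data might be generated. It is an illustration only, not a method used by any firm mentioned here: the `Transaction` fields, the `generate_transactions` function, the fraud rate, and the amount and time-of-day distributions are all assumptions chosen for readability.

```python
import random
from dataclasses import dataclass


@dataclass
class Transaction:
    account_id: str  # made-up identifier, not tied to any real account
    amount: float    # transaction amount in arbitrary currency units
    hour: int        # hour of day, 0-23
    is_fraud: bool   # label a fraud-detection model would learn from


def generate_transactions(n, fraud_rate=0.02, seed=0):
    """Return n synthetic transactions; all distributions are illustrative assumptions."""
    rng = random.Random(seed)  # seeded so the dataset is reproducible
    rows = []
    for _ in range(n):
        is_fraud = rng.random() < fraud_rate
        if is_fraud:
            # Assumed pattern: fraudulent transactions are larger and occur late at night.
            amount = round(rng.lognormvariate(6.0, 1.0), 2)
            hour = rng.choice([0, 1, 2, 3, 4])
        else:
            # Assumed pattern: ordinary transactions are smaller and occur during the day.
            amount = round(rng.lognormvariate(3.5, 0.8), 2)
            hour = rng.randint(8, 22)
        account_id = f"SYN-{rng.randint(0, 9999):04d}"
        rows.append(Transaction(account_id, amount, hour, is_fraud))
    return rows


if __name__ == "__main__":
    data = generate_transactions(1000, fraud_rate=0.05, seed=42)
    print(len(data), "synthetic transactions,", sum(t.is_fraud for t in data), "labeled as fraud")
```

Because every record is drawn from hand-picked distributions, no real account is ever touched. The same property is also the weakness critics point to: a model trained on this data can only learn the patterns the generator was told to produce.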
Chinese regulators, aware of the tension between privacy and innovation, have cautiously supported this approach. Universities in Beijing and Shanghai now run joint labs with startups focused exclusively on synthetic data generation.
Critics argue that synthetic data may not fully capture the complexity of real-world behavior, which could leave models performing worse once they meet real conditions. Still, as restrictions tighten globally, China’s adoption of synthetic data could give it an edge in navigating the innovation-regulation dilemma.