Weikai Yang¹, Changjian Chen¹, Jiangning Zhu¹, Lei Li², Peng Liu², Shixia Liu¹
¹ Tsinghua University
² China Aerospace Science & Industry Corporation
The success of machine learning relies on high-quality training data. In practice, however, the quality of training data is hard to guarantee because the data come from many sources and some annotators lack expertise. By tightly integrating machine learning with visualization, visual analytics brings humans into the loop of data quality analysis and improvement, which helps raise training data quality and, in turn, model performance. In this survey, we first summarize the main types of training data quality issues. We then categorize and summarize the relevant visual analytics work according to these issue types. Finally, we analyze the opportunities and challenges in research on visual-analytics-based training data quality improvement.