基于可视分析的训练数据质量提升研究
A Survey of Visual Analytics Research for Improving Training Data Quality

Weikai Yang¹   Changjian Chen¹   Jiangning Zhu¹   Lei Li²   Peng Liu²   Shixia Liu¹

¹Tsinghua University       ²China Aerospace Science & Industry Corporation

Teaser Image

Three main types of training data quality issues, with typical examples

Abstract

The success of machine learning depends on high-quality training data. In practice, however, training data quality is hard to guarantee because the data come from many sources and some annotators lack experience. Visual analytics addresses this problem by tightly integrating machine learning with visualization, bringing humans into the loop of data quality analysis and improvement, which improves training data quality and, in turn, model performance. This survey first summarizes the main types of training data quality issues, then categorizes and summarizes the relevant visual analytics work according to those types, and finally analyzes the opportunities and challenges facing research on improving training data quality with visual analytics.

Materials
pdf