Using Machine Learning to Recommend Correctness Checks for Geographic Map Data
SEIP Industry Program
Developing an industry application that serves geographic map data to users across the world presents a significant challenge: verifying that data with "data correctness checks." The size of the data to be checked (the entire world) and its churn rate (thousands of changes per day) make executing the full set of candidate checks cost-prohibitive. Current techniques rely on hand-curated, static subsets of checks that run at different stages of the data production pipeline. These hard-coded subsets are unaware of data changes, so a large fraction of their checks have no chance of detecting bugs, while other, relevant checks are often excluded. Checks are executed billions of times per release, taking hundreds of human and machine hours. To address these problems, we have developed new representations of map data changes and checks, formally defined "check safety," and built a recommender system that dynamically and automatically selects and ranks a relevant subset of checks using signals from the latest data changes. Empirical evaluation shows that it improves (1) efficiency, by eliminating 65% of checks unrelated to changes; (2) coverage, by recommending and ranking change-related checks, drawn from the full set of candidate checks, that the manual process previously excluded; and (3) overall visibility into the data editing process, by quickly and automatically identifying the latest fault-prone parts of the data.
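To make the idea of change-aware check selection concrete, the following is a minimal sketch, not the system described above. It assumes a hypothetical representation in which each check declares the feature types it inspects and each batch of changes is summarized as edit counts per feature type. It swaps the learned recommender for a plain overlap score, and it does not implement the formal notion of "check safety." The names `Check`, `feature_types`, and `recommend_checks` are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    name: str
    feature_types: frozenset  # hypothetical: feature types this check inspects


def recommend_checks(checks, changed_counts):
    """Return (check name, score) pairs for checks that touch changed data.

    changed_counts maps a feature type to its number of recent edits.
    The score is a simple overlap heuristic (an assumption of this sketch):
    the total number of edits to feature types the check inspects.
    Checks with a score of zero are dropped; the rest are sorted by
    descending score, with ties broken by name.
    """
    scored = []
    for check in checks:
        score = sum(changed_counts.get(ft, 0) for ft in check.feature_types)
        if score > 0:
            scored.append((check.name, score))
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Illustrative data only; these checks and counts are made up.
checks = [
    Check("road_connectivity", frozenset({"road"})),
    Check("address_on_road", frozenset({"road", "address"})),
    Check("coastline_closed", frozenset({"coastline"})),
]
changed = {"road": 120, "address": 30}
ranking = recommend_checks(checks, changed)
```

In this toy run, the coastline check is skipped because no coastline data changed, and the two remaining checks are ordered by how many recent edits fall within their scope. A production system would presumably replace the overlap score with learned signals and would need a safety argument before dropping any check.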