Papers
arxiv:1708.01402

Exploiting Redundancy, Recurrence and Parallelism: How to Link Millions of Addresses with Ten Lines of Code in Ten Minutes

Published on Aug 4, 2017
Authors:
,
,

Abstract

Accurate and efficient record linkage is an open challenge of particular relevance to Australian Government Agencies, who recognise that so-called wicked social problems are best tackled by forming partnerships founded on large-scale data fusion. Names and addresses are the most common attributes on which data from different government agencies can be linked. In this paper, we focus on the problem of address linking. Linkage is particularly problematic when the data has significant quality issues. The most common approach for dealing with quality issues is to standardise raw data prior to linking. If a mistake is made in standardisation, however, it is usually impossible to recover from it to perform linkage correctly. This paper proposes a novel algorithm for address linking that is particularly practical for linking large disparate sets of addresses, being highly scalable, robust to data quality issues and simple to implement. It obviates the need for labour intensive and problematic address standardisation. We demonstrate the efficacy of the algorithm by matching two large address datasets from two government agencies with good accuracy and computational efficiency.

Community

Sign up or log in to comment

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/1708.01402 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/1708.01402 in a dataset README.md to link it from this page.

Spaces citing this paper 1

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.