Swoosh: A Generic Approach to Entity Resolution

Citation: Stanford University Technical Report
Authors: Omar Benjelloun
Hector Garcia-Molina
Jeff Jonas
Qi Su
Jennifer Widom


We consider the Entity Resolution (ER) problem (also known as deduplication, or merge-purge), in which records determined to represent the same real-world entity are successively located and merged. We formalize the generic ER (genER) problem, treating the functions for comparing and merging records as black boxes, which permits expressive and extensible ER solutions. We identify a set of important properties that the black-box functions should satisfy to enable efficient and deterministic ER algorithms. We develop two such algorithms, R-GenER and F-GenER (R for Record and F for Feature). The GenER algorithms minimize invocations of the match and merge functions while still providing accurate ER results. We report on experiments demonstrating that the GenER algorithms provide significant performance gains over naive approaches.
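
To make the black-box framing concrete, here is a minimal Python sketch of a naive baseline: it repeatedly merges any pair of records that the match function accepts, until no pair matches. This is not R-GenER or F-GenER, whose details the abstract does not give. The record representation (a frozenset of attribute-value pairs), the names `naive_resolve`, `match`, and `merge`, and the sample data are all assumptions made for illustration.

```python
from typing import Callable, FrozenSet, List, Tuple

# Assumed record model: a set of (attribute, value) pairs.
Record = FrozenSet[Tuple[str, str]]

# Hypothetical identifying attributes used by the example match function.
KEY_ATTRIBUTES = {"email", "phone"}


def match(a: Record, b: Record) -> bool:
    """Example black box: records match if they share an identifying pair."""
    return any(attr in KEY_ATTRIBUTES for attr, _ in a & b)


def merge(a: Record, b: Record) -> Record:
    """Example black box: the merged record keeps every pair from both."""
    return a | b


def naive_resolve(
    records: List[Record],
    match_fn: Callable[[Record, Record], bool],
    merge_fn: Callable[[Record, Record], Record],
) -> List[Record]:
    """Naive baseline: merge matching pairs until no pair matches.

    Each merge shrinks the list by one record, so the loop terminates.
    """
    resolved = list(records)
    changed = True
    while changed:
        changed = False
        for i in range(len(resolved)):
            for j in range(i + 1, len(resolved)):
                if match_fn(resolved[i], resolved[j]):
                    merged = merge_fn(resolved[i], resolved[j])
                    # Delete the higher index first so index i stays valid.
                    del resolved[j]
                    del resolved[i]
                    resolved.append(merged)
                    changed = True
                    break
            if changed:
                break  # restart the scan over the updated list
    return resolved


r1 = frozenset({("name", "J. Doe"), ("email", "jd@example.com")})
r2 = frozenset({("name", "John Doe"), ("email", "jd@example.com"),
                ("phone", "555-0100")})
r3 = frozenset({("name", "Johnny D."), ("phone", "555-0100")})
r4 = frozenset({("name", "Alice Roe"), ("email", "alice@example.com")})

resolved = naive_resolve([r1, r2, r3, r4], match, merge)
```

In this sample, `r1` and `r3` share no identifying value, yet they end up in the same merged record because `r1` merged with `r2` matches `r3`; the four inputs collapse to two records. The sketch restarts its pairwise scan after every merge, so it invokes the match function many more times than necessary, which is the kind of cost the GenER algorithms are described as minimizing.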
