Citation | Stanford University Technical Report |
Authors | Omar Benjelloun, Hector Garcia-Molina, Jeff Jonas, Qi Su, Jennifer Widom |
We consider the Entity Resolution (ER) problem (also known as deduplication, or merge-purge), in which records determined to represent the same real-world entity are successively located and merged. We formalize the generic ER (GenER) problem, treating the functions for comparing and merging records as black boxes, which permits expressive and extensible ER solutions. We identify a set of important properties that should be satisfied by the black-box functions to enable efficient and deterministic ER algorithms. We develop two such algorithms, R-GenER and F-GenER (R for Record and F for Feature). The GenER algorithms minimize invocations of the match and merge functions while still providing accurate ER results. We report on experiments demonstrating that the GenER algorithms provide significant performance gains over naive approaches.
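To make the black-box setting concrete, the following is a minimal sketch of a naive ER baseline of the kind the abstract contrasts against. It is not R-GenER or F-GenER, whose details are not given here. The record representation (a frozenset of "attribute=value" strings), the function names `naive_er`, `share_value`, and `union_merge`, and the sample data are all assumptions made for illustration.

```python
from typing import Callable, FrozenSet, Iterable, List

# Hypothetical record type: a set of "attribute=value" strings.
Record = FrozenSet[str]


def naive_er(
    records: Iterable[Record],
    match: Callable[[Record, Record], bool],
    merge: Callable[[Record, Record], Record],
) -> List[Record]:
    """Naive baseline: rescan all pairs after every merge until none match.

    `match` and `merge` are treated as black boxes; this loop never
    inspects record contents itself.
    """
    result = list(records)
    changed = True
    while changed:
        changed = False
        for i in range(len(result)):
            for j in range(i + 1, len(result)):
                if match(result[i], result[j]):
                    merged = merge(result[i], result[j])
                    # Drop the two source records and keep the merged one.
                    result = [r for k, r in enumerate(result) if k not in (i, j)]
                    if merged not in result:
                        result.append(merged)
                    changed = True
                    break
            if changed:
                break  # restart the scan from the beginning
    return result


# Illustrative black boxes (assumptions, not the report's functions).
def share_value(a: Record, b: Record) -> bool:
    """Two records match if they share any attribute=value pair."""
    return bool(a & b)


def union_merge(a: Record, b: Record) -> Record:
    """Merging keeps every attribute=value pair from both records."""
    return a | b


sample = [
    frozenset({"name=J. Doe", "phone=555-1234"}),
    frozenset({"name=John Doe", "phone=555-1234"}),
    frozenset({"name=John Doe", "email=jd@example.com"}),
    frozenset({"name=Ann Lee"}),
]
resolved = naive_er(sample, share_value, union_merge)
```

In this sketch the first three sample records collapse into one, because each merge can expose new matches, while the fourth stays separate. The full rescan after every merge is what makes the baseline call `match` far more often than necessary, which is the cost the abstract says the GenER algorithms aim to reduce.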