Locality-sensitive hashing (LSH) is a collection of techniques that seem almost magical. Faced with a problem that involves finding the similar pairs from a large set, you can avoid doing work that is quadratic in the size of the set if you do it right. Avoiding quadratic behavior is vital, since even a “small” set of a million records would imply half a trillion comparisons if done naively. We shall illustrate the general principle of LSH by considering a particular problem: entity resolution. We are given a set of records, say of credit-card transactions and bank records, and we want to find which records refer to the same person. But names, addresses, phones, and other indicators of who you are might not be exactly the same. There could be spelling errors, people move, and other changes may make the representation of the same person look somewhat different in different records. We hope, however, that the records representing the same person will look more similar than records representing different people. We’ll talk about a problem of this type that I was once called upon to solve, and the way I used LSH to solve it.