Modern fundraising, outreach, and advocacy programs require reaching your target audience across many different platforms. You have to meet people where they are, and they are all over the place: on the internet, answering a door knock, or walking into your event. Organizations collect data on these individuals in each location, which creates disparate data sets containing the same people. You want to tailor your email campaign around the events someone has attended, and your text messages around their donation history. For that, you need a golden identity list: a 360-degree view of your audience and members.
Civis Platform steps up to this challenge with its Identity Resolution (IDR) offering, a comprehensive solution for unifying and deduplicating person-level data across multiple sources. In this blog, we explore some of the science and data engineering that goes into this type of solution.
At its core, Civis Identity Resolution takes one or more data sources about people (e.g., members, customers, donors) as input and finds matches among the person records they contain. It uses these matches to identify duplicates within and across sources and thereby determine which sets of records correspond to the same individuals. Note that by “duplicates”, we mean a set of records that refer to the same individual and should be linked together. Such records may come from different sources and have different information (e.g., one may include name and phone number while another has name and email address).
The user of Civis IDR can then confidently perform analyses about their business and interact with their people in a unified way, free from problems that might arise from duplication, having disconnected person data in multiple places, etc.
Let's clarify why one might need an identity resolution system. Deriving value from person data poses many challenges: duplicate records within a single source, the same individuals appearing in disconnected datasets, and records that carry different, incomplete pieces of contact information, to name a few.
The Civis IDR system is designed to address issues such as these. From one or more sources of person data, Civis IDR produces two primary outputs: the Cluster Table, which groups input records that refer to the same individual under a single resolved ID, and the Golden Table, which provides canonical PII for each individual.
Our approach to Identity Resolution has been refined over multiple years of use with dozens of clients across multiple industries.
With this in mind, we can dive a bit deeper into how Civis IDR works.
At a high level, the IDR system performs five steps, which are listed here, illustrated in Figures 1-6, and described in more detail in subsequent sections:

1. Preprocessing the input data into standard formats
2. Finding candidate match pairs
3. Scoring the candidate pairs with a statistical model
4. Grouping matched records together via graph clustering
5. Producing the Cluster Table and Golden Table
In practice, the first step of using IDR is to get data and preprocess it into standard formats. For this, we leverage the Imports and Data Enhancements functions of Civis Platform. The Platform provides connectivity to a variety of types of data sources (e.g., Amazon Redshift, PostgreSQL, Salesforce), making it easy to get data into Civis Platform and ready to process.
Additionally, Civis Platform provides Data Enhancements to standardize person datasets. These include jobs for standardizing addresses (CASS) and updating potentially outdated addresses (NCOA) as well as a custom Person Data Standardization job for detecting and correcting issues with formatting of phone numbers, email addresses, etc. Once the data is loaded into Civis Platform and preprocessed as necessary, we can begin the IDR process.
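To make the standardization step concrete, here is a minimal sketch of the kind of normalization such a job performs. This is an illustrative stand-in, not the actual Civis Person Data Standardization implementation; the field names and rules are assumptions for the example.

```python
import re

def standardize_record(record):
    """Normalize common formatting issues in person PII (illustrative only)."""
    rec = dict(record)
    # Lowercase and trim email addresses
    if rec.get("email"):
        rec["email"] = rec["email"].strip().lower()
    # Keep only digits in phone numbers; drop a leading US country code
    if rec.get("phone"):
        digits = re.sub(r"\D", "", rec["phone"])
        if len(digits) == 11 and digits.startswith("1"):
            digits = digits[1:]
        rec["phone"] = digits
    # Normalize name casing and collapse extra whitespace
    for field in ("first_name", "last_name"):
        if rec.get(field):
            rec[field] = " ".join(rec[field].split()).title()
    return rec

print(standardize_record({
    "first_name": "  jane ", "last_name": "DOE",
    "email": " Jane.Doe@Example.COM ", "phone": "+1 (312) 555-0199",
}))
```

Consistent formats like these matter downstream: two records for the same person with phone numbers "(312) 555-0199" and "312.555.0199" should agree, not disagree, when compared.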
Fundamentally, Identity Resolution involves finding similar person records among a set of potentially millions of input records. To find groups of similar person records, we first find pairs of similar person records. Because the number of possible pairs grows quadratically (e.g., roughly half a trillion pairs for a million input records), it would be prohibitively costly in time and compute resources to perform a detailed comparison of every possible pair.
Therefore, before computing precise match scores for pairs of records, we find candidate match pairs by looking for basic shared features (e.g., the same name and ZIP code, or the same name and birthday). The academic literature on record linkage refers to this process as “blocking” (see, e.g., Fellegi and Sunter, 1969). Blocking avoids the need to compare a record for a Jane Doe who lives in Chicago, IL against a record for a John Smith who lives in Portland, OR just to check whether they represent the same person.
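A minimal sketch of blocking might look like the following. The blocking keys here (name + ZIP, name + birthday) are assumptions for illustration; a production system would choose and tune its keys carefully.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(records, key_fns):
    """Group records by blocking keys; only pairs sharing a key become candidates."""
    candidates = set()
    for key_fn in key_fns:
        blocks = defaultdict(list)
        for i, rec in enumerate(records):
            key = key_fn(rec)
            if key is not None:  # skip records missing the key's fields
                blocks[key].append(i)
        for ids in blocks.values():
            candidates.update(combinations(ids, 2))
    return candidates

records = [
    {"name": "jane doe", "zip": "60606", "dob": "1980-01-01"},
    {"name": "jane doe", "zip": "60606", "dob": None},
    {"name": "john smith", "zip": "97201", "dob": "1975-06-15"},
]
key_fns = [
    lambda r: (r["name"], r["zip"]),
    lambda r: (r["name"], r["dob"]) if r["dob"] else None,
]
print(candidate_pairs(records, key_fns))  # {(0, 1)}
```

Only the two Jane Doe records share a blocking key, so the John Smith record is never compared against them at all.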
Once we have a set of candidate match pairs, we use a statistical model that employs a more detailed and computationally intensive set of features in order to provide a match score, ranging from 0 to 1, representing the probability that the two records match. The set of features includes, in particular, the frequencies of names and various population statistics so that matching on a rare name in a sparsely populated location will lead to a higher score than matching on a very frequent name in a large metropolitan area. The features also take into account subtleties like the fact that a mismatch of a piece of information (e.g., phone number) should be treated differently than the case where one of the records is missing that piece of information.
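The idea of weighting evidence by frequency can be sketched with a toy Fellegi–Sunter-style scorer. This is a simplified illustration with made-up weights, not the statistical model Civis IDR actually uses: evidence is accumulated as log-odds, so a shared rare name contributes more than a shared common one, a phone mismatch counts against the pair, and a missing phone is neutral.

```python
import math

def match_score(a, b, name_freq):
    """Toy scorer: sum log-odds evidence, squash to a [0, 1] probability."""
    score = 0.0
    # A shared rare name is stronger evidence than a shared common one
    if a["name"] == b["name"]:
        score += -math.log(name_freq.get(a["name"], 1e-4))
    # A phone mismatch is evidence against; a missing phone is neutral
    if a.get("phone") and b.get("phone"):
        score += 3.0 if a["phone"] == b["phone"] else -3.0
    return 1 / (1 + math.exp(-score))  # logistic squash to [0, 1]

name_freq = {"jane doe": 0.001, "john smith": 0.05}
rare = match_score({"name": "jane doe", "phone": "3125550199"},
                   {"name": "jane doe", "phone": "3125550199"}, name_freq)
common = match_score({"name": "john smith", "phone": "3125550199"},
                     {"name": "john smith", "phone": "3125550199"}, name_freq)
print(rare > common)  # True: the rarer name yields the higher score
```

In a real system the weights would be estimated from data rather than hard-coded, and many more features (addresses, emails, population statistics) would contribute.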
At this point, we have identified pairwise matches both across sources and within each source; for each record, we know its likely candidate matches, if any exist.
However, this list of pairs of matching records does not by itself provide a grouping of all the records corresponding to an individual (i.e., the Cluster Table). To find such a grouping, we first convert the list of pairwise matches into a graph representation, where each vertex in the graph represents a record, and each edge in the graph is a match, with the edges being weighted according to the pairwise match scores. We then use a graph clustering algorithm to find clusters of records, where each cluster corresponds to a distinct individual.
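As a simplified sketch of this step, the following treats clustering as connected components over edges whose scores exceed a threshold, using a small union-find. Civis IDR's actual graph clustering algorithm may be more sophisticated (e.g., weighting edges and splitting weakly connected clusters); this only illustrates the records-as-vertices, matches-as-edges idea.

```python
def cluster_matches(n_records, scored_pairs, threshold=0.9):
    """Connected components over match edges above a score threshold."""
    parent = list(range(n_records))  # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for i, j, score in scored_pairs:
        if score >= threshold:
            parent[find(i)] = find(j)  # union the two components

    clusters = {}
    for i in range(n_records):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

pairs = [(0, 1, 0.97), (1, 2, 0.95), (3, 4, 0.40)]
print(cluster_matches(5, pairs))  # [[0, 1, 2], [3], [4]]
```

Note that records 0 and 2 end up in the same cluster even though they were never directly compared: they are linked transitively through record 1.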
Once we have identified which input records correspond to distinct individuals, we can create the Golden Table, a table that has canonical PII values for each individual, based on the records that were clustered together for that individual. This can serve as the primary place for contact information for the people (e.g., customers, members) associated with the IDR user’s organization.
This table can be customized based on business needs and domain knowledge. If particular sources of input records are known to have the most reliable email address values, then those sources can be prioritized when selecting the email address for the Golden Table.
Alternatively, the IDR user can select PII for the Golden Table based on frequency. For example, using this approach, if there are five records grouped together in the above steps and three of those have the same phone number, then that phone number would be selected.
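The frequency-based selection rule described above can be sketched in a few lines; this is an illustrative stand-in for the Golden Table logic, not the actual implementation.

```python
from collections import Counter

def golden_value(values):
    """Pick the most frequent non-null value in a cluster (ties: first seen)."""
    counts = Counter(v for v in values if v)
    return counts.most_common(1)[0][0] if counts else None

# Five clustered records, three of which share a phone number:
phones = ["312-555-0199", "312-555-0199", "773-555-0123", "312-555-0199", None]
print(golden_value(phones))  # 312-555-0199
```

The source-priority approach mentioned earlier would replace the frequency count with a preference ordering over sources, but the per-field selection structure is the same.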
How do we know that we are constantly improving our ability to resolve IDs?
Evaluating the efficacy of matching can be difficult. In some cases, it can be very clear whether two records are or are not a match. For example, if two records have the exact same values for full name, email address, phone number, and other fields, they almost certainly represent the same person. In other cases, matches can be more difficult to evaluate. For example, consider the following pair of records:
| First Name | Last Name | ZIP Code | Email Address   | Birth Year |
|------------|-----------|----------|-----------------|------------|
| John       | Smith     | 60606    | jsmith@abc.com  | 1980       |
| John       | Smith     | 60606    | johns@gmail.com | null       |
To determine whether these records are a match, one needs to consider information such as the frequency of the first and last names, the population of the ZIP code, and the chances of a field like email address being different in records for the same individual. It is difficult for people to determine whether such records are a match because they don’t have this sort of knowledge. Automated systems, however, can take frequency statistics and other information into account to provide an estimate of the chance that the records represent the same individual.
Customers often ask what their match rates will be — that is, what fraction of their input records will be matched. The overarching goal of identity resolution is in some sense to find matches, so the match rate is obviously important. However, a higher match rate is not always better: one can increase a match rate by including additional pairs of records that are unlikely to represent the same individuals (e.g., records that just match on first and last name).
In addition to match rates, one should consider the rate and cost of false positives. In the context of matching for IDR, a false positive is a pair of records that are marked as a match but actually represent different people. Ideally, one wants a high match rate and a low false positive rate. Unfortunately, unlike a match rate, one cannot easily compute a false positive rate because it requires distinguishing false positives from true positives (i.e., correct matches). In lieu of a precise computation of the false positive rate, it can be helpful to estimate the false positive rate by examining the PII for a sample of several dozen or more matches to check that most of the matches look reasonable (e.g., that records with wildly different PII values aren’t being matched together frequently).
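The sampling-based estimate described above is straightforward to sketch: draw a random sample of matched pairs, have reviewers label each as correct or not, and compute the fraction that are correct. The function names here are hypothetical, for illustration only.

```python
import random

def sample_for_review(matches, k=50, seed=0):
    """Draw a reproducible random sample of matched pairs for manual PII review."""
    rng = random.Random(seed)
    return rng.sample(matches, min(k, len(matches)))

def estimated_precision(labels):
    """labels: True for a correct match, False for a false positive."""
    return sum(labels) / len(labels)

# e.g., if reviewers judged 47 of 50 sampled matches to be correct:
print(estimated_precision([True] * 47 + [False] * 3))  # 0.94
```

An estimated precision of 0.94 implies a false positive rate of about 6% among accepted matches, with the usual caveat that a sample of 50 carries meaningful statistical uncertainty.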
Relatedly, it is important to consider the cost of false positives (i.e., matching records incorrectly) versus false negatives (i.e., missing a good match), which lead to low match rates. Depending on the downstream business use case, false positives may be more or less expensive than false negatives. For example, if matching two people incorrectly could lead to annoying a customer with irrelevant advertising emails, then false positives might be very expensive and one might want to set a high threshold for what constitutes a match. On the other hand, if it’s OK to potentially send a few emails based on bad matches, then one might want to use a lower threshold to reduce false negatives and increase the match rate.
As such, we expose a threshold parameter that controls how strict the system should be when grouping different records together under the same resolved ID. We also provide functionality for running experiments with different thresholds and for examining sample outputs, so that users can find a threshold they are comfortable with given their tolerance for false positives versus false negatives.
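The effect of the threshold on match rate can be illustrated with a quick sweep over candidate scores; this is a sketch of the kind of experiment described above, not the actual experimentation tooling.

```python
def match_rate(scores, threshold):
    """Fraction of candidate pairs accepted as matches at a given threshold."""
    return sum(s >= threshold for s in scores) / len(scores)

# Hypothetical match scores for six candidate pairs:
scores = [0.99, 0.95, 0.91, 0.72, 0.55, 0.31]
for t in (0.9, 0.7, 0.5):
    print(f"threshold={t}: match rate={match_rate(scores, t):.2f}")
```

Lowering the threshold from 0.9 to 0.5 raises the match rate from 0.50 to 0.83 on this toy data, but the extra matches are exactly the low-scoring pairs most likely to be false positives.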
The Civis IDR system is built with the state-of-the-art Apache Spark™ distributed computing engine. IDR can process tens of millions of records in a few hours and can be scaled to hundreds of millions of records as needed.
This document described the entire Civis Identity Resolution process, which can find links and duplicates across multiple sources of person data. However, for cases where the user has only a single table of person data and wishes to match it to the Civis Data asset, which has detailed information about most adult Americans, we provide a separate tool called Civis Data Match.
Given a table of input data and a Match Target (in most cases, the version of Civis Data to match records to), this tool produces a table of matches. These matches consist of a source record ID, a matched target record ID, and a match score indicating the strength of the match, from 0 to 1.
Of the five steps described above for IDR, Civis Data Match employs steps 1-3 (preprocessing input data, finding candidates, and scoring the candidates) but not steps 4-5 (grouping records together via graph clustering and then producing the Cluster Table and Golden Table).
Civis Analytics helps leading public and private sector organizations to use data to gain a competitive advantage in how they identify, attract, and engage people. With a blend of proprietary data, technology, and advisory services, and an interdisciplinary team of data scientists, developers, and survey science experts, Civis helps organizations stop guessing and start using statistical proof to guide decisions. Learn more about Civis at www.civisanalytics.com.
1 In connection with this work, Civis uses the L-BFGS-B optimization package that is part of the SciPy package. Attribution: C. Zhu, R. H. Byrd and J. Nocedal. L-BFGS-B: Algorithm 778: L-BFGS-B, FORTRAN routines for large scale bound constrained optimization (1997), ACM Transactions on Mathematical Software, 23, 4, pp. 550 – 560.