Modern fundraising, outreach, and advocacy programs require reaching your target audience across many different platforms. You have to meet people where they are, and they are all over the place: on the internet, answering a door knock, or walking into your event. Organizations collect data on these individuals in each location, which creates disparate data sets containing the same people. You want to tailor your email campaign around the events someone has attended, and your text messages around their donation history. For that, you need a golden identity list: a 360-degree view of your audience and members.
Civis Platform steps up to this challenge with its Identity Resolution (IDR) offering, a comprehensive solution for unifying and deduplicating person-level data across multiple sources. In this blog, we explore some of the science and data engineering that goes into this type of solution.
At its core, Civis Identity Resolution takes one or more data sources about people (e.g., members, customers, donors) as input and finds matches among the person records they contain. It uses these matches to identify duplicates within and across sources and thereby determine which sets of records correspond to the same individuals. Note that by “duplicates”, we mean a set of records that refer to the same individual and should be linked together. Such records may come from different sources and have different information (e.g., one may include name and phone number while another has name and email address).
The user of Civis IDR can then confidently perform analyses about their business and interact with their people in a unified way, free from problems that might arise from duplication, having disconnected person data in multiple places, etc.
Let's clarify why one might need an identity resolution system. Deriving value from person data poses many challenges: duplicate records within a single source, the same individuals appearing in disconnected datasets, and records that carry different, incomplete pieces of contact information, to name a few.
The Civis IDR system is designed to address issues such as these. From one or more sources of person data, Civis IDR produces two primary outputs: the Cluster Table, which groups input records that refer to the same individual under a single resolved ID, and the Golden Table, which provides canonical PII for each individual.
Our approach to Identity Resolution has been refined over multiple years of use with dozens of clients across multiple industries.
With this in mind, we can dive a bit deeper into how Civis IDR works.
At a high level, the IDR system performs five steps, which are listed here, illustrated in Figures 1-6, and described in more detail in subsequent sections:

1. Preprocessing the input data into standard formats
2. Finding candidate match pairs
3. Scoring the candidate pairs with a statistical model
4. Grouping matched records together via graph clustering
5. Producing the Cluster Table and Golden Table
In practice, the first step of using IDR is to get data and preprocess it into standard formats. For this, we leverage the Imports and Data Enhancements functions of Civis Platform. The Platform provides connectivity to a variety of types of data sources (e.g., Amazon Redshift, PostgreSQL, Salesforce), making it easy to get data into Civis Platform and ready to process.
Additionally, Civis Platform provides Data Enhancements to standardize person datasets. These include jobs for standardizing addresses (CASS) and updating potentially outdated addresses (NCOA) as well as a custom Person Data Standardization job for detecting and correcting issues with formatting of phone numbers, email addresses, etc. Once the data is loaded into Civis Platform and preprocessed as necessary, we can begin the IDR process.
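To make the standardization step concrete, here is a minimal sketch of the kind of normalization such a job performs. This is an illustrative stand-in, not the actual Civis Person Data Standardization implementation; the field names and rules are assumptions for the example.

```python
import re

def standardize_record(record):
    """Normalize common formatting issues in person PII (illustrative only)."""
    rec = dict(record)
    # Lowercase and trim email addresses
    if rec.get("email"):
        rec["email"] = rec["email"].strip().lower()
    # Keep only digits in phone numbers; drop a leading US country code
    if rec.get("phone"):
        digits = re.sub(r"\D", "", rec["phone"])
        if len(digits) == 11 and digits.startswith("1"):
            digits = digits[1:]
        rec["phone"] = digits
    # Normalize name casing and collapse extra whitespace
    for field in ("first_name", "last_name"):
        if rec.get(field):
            rec[field] = " ".join(rec[field].split()).title()
    return rec

print(standardize_record({
    "first_name": "  jane ", "last_name": "DOE",
    "email": " Jane.Doe@Example.COM ", "phone": "+1 (312) 555-0199",
}))
```

Consistent formats like these matter downstream: two records for the same person with phone numbers "(312) 555-0199" and "312.555.0199" should agree, not disagree, when compared.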
Fundamentally, Identity Resolution involves finding similar person records among a set of potentially millions of input records. To find groups of similar person records, we first find pairs of similar person records. Because the number of possible pairs grows quadratically (e.g., roughly half a trillion pairs for a million input records), it would be prohibitively costly in time and compute resources to perform a detailed comparison of every possible pair.
Therefore, before computing precise match scores for pairs of records, we find candidate match pairs by looking for basic shared features (e.g., the same name and ZIP code, or the same name and birthday). The academic literature on record linkage refers to this process as “blocking” (see, e.g., Fellegi and Sunter, 1969). Blocking avoids the need to compare a record for a Jane Doe who lives in Chicago, IL against a record for a John Smith who lives in Portland, OR just to check whether they represent the same person.
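A minimal sketch of blocking might look like the following. The blocking keys here (name + ZIP, name + birthday) are assumptions for illustration; a production system would choose and tune its keys carefully.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(records, key_fns):
    """Group records by blocking keys; only pairs sharing a key become candidates."""
    candidates = set()
    for key_fn in key_fns:
        blocks = defaultdict(list)
        for i, rec in enumerate(records):
            key = key_fn(rec)
            if key is not None:  # skip records missing the key's fields
                blocks[key].append(i)
        for ids in blocks.values():
            candidates.update(combinations(ids, 2))
    return candidates

records = [
    {"name": "jane doe", "zip": "60606", "dob": "1980-01-01"},
    {"name": "jane doe", "zip": "60606", "dob": None},
    {"name": "john smith", "zip": "97201", "dob": "1975-06-15"},
]
key_fns = [
    lambda r: (r["name"], r["zip"]),
    lambda r: (r["name"], r["dob"]) if r["dob"] else None,
]
print(candidate_pairs(records, key_fns))  # {(0, 1)}
```

Only the two Jane Doe records share a blocking key, so the John Smith record is never compared against them at all.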
Once we have a set of candidate match pairs, we use a statistical model that employs a more detailed and computationally intensive set of features in order to provide a match score, ranging from 0 to 1, representing the probability that the two records match. The set of features includes, in particular, the frequencies of names and various population statistics so that matching on a rare name in a sparsely populated location will lead to a higher score than matching on a very frequent name in a large metropolitan area. The features also take into account subtleties like the fact that a mismatch of a piece of information (e.g., phone number) should be treated differently than the case where one of the records is missing that piece of information.
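The idea of weighting evidence by frequency can be sketched with a toy Fellegi–Sunter-style scorer. This is a simplified illustration with made-up weights, not the statistical model Civis IDR actually uses: evidence is accumulated as log-odds, so a shared rare name contributes more than a shared common one, a phone mismatch counts against the pair, and a missing phone is neutral.

```python
import math

def match_score(a, b, name_freq):
    """Toy scorer: sum log-odds evidence, squash to a [0, 1] probability."""
    score = 0.0
    # A shared rare name is stronger evidence than a shared common one
    if a["name"] == b["name"]:
        score += -math.log(name_freq.get(a["name"], 1e-4))
    # A phone mismatch is evidence against; a missing phone is neutral
    if a.get("phone") and b.get("phone"):
        score += 3.0 if a["phone"] == b["phone"] else -3.0
    return 1 / (1 + math.exp(-score))  # logistic squash to [0, 1]

name_freq = {"jane doe": 0.001, "john smith": 0.05}
rare = match_score({"name": "jane doe", "phone": "3125550199"},
                   {"name": "jane doe", "phone": "3125550199"}, name_freq)
common = match_score({"name": "john smith", "phone": "3125550199"},
                     {"name": "john smith", "phone": "3125550199"}, name_freq)
print(rare > common)  # True: the rarer name yields the higher score
```

In a real system the weights would be estimated from data rather than hard-coded, and many more features (addresses, emails, population statistics) would contribute.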
At this point, we have identified pairwise matches both across sources and within each source; for each record, we know its likely candidate matches, if any exist.
However, this list of pairs of matching records does not by itself provide a grouping of all the records corresponding to an individual (i.e., the Cluster Table). To find such a grouping, we first convert the list of pairwise matches into a graph representation, where each vertex in the graph represents a record, and each edge in the graph is a match, with the edges being weighted according to the pairwise match scores. We then use a graph clustering algorithm to find clusters of records, where each cluster corresponds to a distinct individual.
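As a simplified sketch of this step, the following treats clustering as connected components over edges whose scores exceed a threshold, using a small union-find. Civis IDR's actual graph clustering algorithm may be more sophisticated (e.g., weighting edges and splitting weakly connected clusters); this only illustrates the records-as-vertices, matches-as-edges idea.

```python
def cluster_matches(n_records, scored_pairs, threshold=0.9):
    """Connected components over match edges above a score threshold."""
    parent = list(range(n_records))  # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for i, j, score in scored_pairs:
        if score >= threshold:
            parent[find(i)] = find(j)  # union the two components

    clusters = {}
    for i in range(n_records):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

pairs = [(0, 1, 0.97), (1, 2, 0.95), (3, 4, 0.40)]
print(cluster_matches(5, pairs))  # [[0, 1, 2], [3], [4]]
```

Note that records 0 and 2 end up in the same cluster even though they were never directly compared: they are linked transitively through record 1.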
Once we have identified which input records correspond to distinct individuals, we can create the Golden Table, a table that has canonical PII values for each individual, based on the records that were clustered together for that individual. This can serve as the primary place for contact information for the people (e.g., customers, members) associated with the IDR user’s organization.
This table can be customized based on business needs and domain knowledge. If particular sources of input records are known to have the most reliable email address values, then those sources can be prioritized when selecting the email address for the Golden Table.
Alternatively, the IDR user can select PII for the Golden Table based on frequency. For example, using this approach, if there are five records grouped together in the above steps and three of those have the same phone number, then that phone number would be selected.
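The frequency-based selection rule described above can be sketched in a few lines; this is an illustrative stand-in for the Golden Table logic, not the actual implementation.

```python
from collections import Counter

def golden_value(values):
    """Pick the most frequent non-null value in a cluster (ties: first seen)."""
    counts = Counter(v for v in values if v)
    return counts.most_common(1)[0][0] if counts else None

# Five clustered records, three of which share a phone number:
phones = ["312-555-0199", "312-555-0199", "773-555-0123", "312-555-0199", None]
print(golden_value(phones))  # 312-555-0199
```

The source-priority approach mentioned earlier would replace the frequency count with a preference ordering over sources, but the per-field selection structure is the same.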
How do we know that we are constantly improving our ability to resolve IDs?
Evaluating the efficacy of matching can be difficult. In some cases, it can be very clear whether two records are or are not a match. For example, if two records have the exact same values for full name, email address, phone number, and other fields, they almost certainly represent the same person. In other cases, matches can be more difficult to evaluate. For example, consider the following pair of records:
| First Name | Last Name | ZIP Code | Email Address   | Birth Year |
|------------|-----------|----------|-----------------|------------|
| John       | Smith     | 60606    | jsmith@abc.com  | 1980       |
| John       | Smith     | 60606    | johns@gmail.com | null       |
To determine whether these records are a match, one needs to consider information such as the frequency of the first and last names, the population of the ZIP code, and the chances of a field like email address being different in records for the same individual. It is difficult for people to determine whether such records are a match because they don’t have this sort of knowledge. Automated systems, however, can take frequency statistics and other information into account to provide an estimate of the chance that the records represent the same individual.
Customers often ask what their match rates will be — that is, what fraction of their input records will be matched. The overarching goal of identity resolution is in some sense to find matches, so the match rate is obviously important. However, a higher match rate is not always better: one can increase a match rate by including additional pairs of records that are unlikely to represent the same individuals (e.g., records that just match on first and last name).
In addition to match rates, one should consider the rate and cost of false positives. In the context of matching for IDR, a false positive is a pair of records that are marked as a match but actually represent different people. Ideally, one wants a high match rate and a low false positive rate. Unfortunately, unlike a match rate, one cannot easily compute a false positive rate because it requires distinguishing false positives from true positives (i.e., correct matches). In lieu of a precise computation of the false positive rate, it can be helpful to estimate the false positive rate by examining the PII for a sample of several dozen or more matches to check that most of the matches look reasonable (e.g., that records with wildly different PII values aren’t being matched together frequently).
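The sampling-based estimate described above is straightforward to sketch: draw a random sample of matched pairs, have reviewers label each as correct or not, and compute the fraction that are correct. The function names here are hypothetical, for illustration only.

```python
import random

def sample_for_review(matches, k=50, seed=0):
    """Draw a reproducible random sample of matched pairs for manual PII review."""
    rng = random.Random(seed)
    return rng.sample(matches, min(k, len(matches)))

def estimated_precision(labels):
    """labels: True for a correct match, False for a false positive."""
    return sum(labels) / len(labels)

# e.g., if reviewers judged 47 of 50 sampled matches to be correct:
print(estimated_precision([True] * 47 + [False] * 3))  # 0.94
```

An estimated precision of 0.94 implies a false positive rate of about 6% among accepted matches, with the usual caveat that a sample of 50 carries meaningful statistical uncertainty.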
Relatedly, it is important to consider the cost of false positives (i.e., matching records incorrectly) versus false negatives (i.e., missing a good match), which lead to low match rates. Depending on the downstream business use case, false positives may be more or less expensive than false negatives. For example, if matching two people incorrectly could lead to annoying a customer with irrelevant advertising emails, then false positives might be very expensive and one might want to set a high threshold for what constitutes a match. On the other hand, if it’s OK to potentially send a few emails based on bad matches, then one might want to use a lower threshold to reduce false negatives and increase the match rate.
As such, we expose a threshold parameter that controls how strict the system should be when grouping different records together under the same resolved ID. We also provide functionality for running experiments with different thresholds and for examining sample outputs, so that users can find a threshold they are comfortable with given their tolerance for false positives versus false negatives.
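The effect of the threshold on match rate can be illustrated with a quick sweep over candidate scores; this is a sketch of the kind of experiment described above, not the actual experimentation tooling.

```python
def match_rate(scores, threshold):
    """Fraction of candidate pairs accepted as matches at a given threshold."""
    return sum(s >= threshold for s in scores) / len(scores)

# Hypothetical match scores for six candidate pairs:
scores = [0.99, 0.95, 0.91, 0.72, 0.55, 0.31]
for t in (0.9, 0.7, 0.5):
    print(f"threshold={t}: match rate={match_rate(scores, t):.2f}")
```

Lowering the threshold from 0.9 to 0.5 raises the match rate from 0.50 to 0.83 on this toy data, but the extra matches are exactly the low-scoring pairs most likely to be false positives.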
The Civis IDR system is built with the state-of-the-art Apache Spark™ distributed computing engine. IDR can process tens of millions of records in a few hours and can be scaled to hundreds of millions of records as needed.
This document described the entire Civis Identity Resolution process, which can find links and duplicates across multiple sources of person data. However, for cases where the user has only a single table of person data and wishes to match it to the Civis Data asset, which has detailed information about most adult Americans, we provide a separate tool called Civis Data Match.
Given a table of input data and a Match Target (in most cases, the version of Civis Data to match records to), this tool produces a table of matches. These matches consist of a source record ID, a matched target record ID, and a match score indicating the strength of the match, from 0 to 1.
Of the five steps described above for IDR, Civis Data Match employs steps 1-3 (preprocessing input data, finding candidates, and scoring the candidates) but not steps 4-5 (grouping records together via graph clustering and then producing the Cluster Table and Golden Table).
Civis Analytics helps leading public and private sector organizations to use data to gain a competitive advantage in how they identify, attract, and engage people. With a blend of proprietary data, technology, and advisory services, and an interdisciplinary team of data scientists, developers, and survey science experts, Civis helps organizations stop guessing and start using statistical proof to guide decisions. Learn more about Civis at www.civisanalytics.com.
1 In connection with this work, Civis uses the L-BFGS-B optimization package that is part of the SciPy package. Attribution: C. Zhu, R. H. Byrd and J. Nocedal. L-BFGS-B: Algorithm 778: L-BFGS-B, FORTRAN routines for large scale bound constrained optimization (1997), ACM Transactions on Mathematical Software, 23, 4, pp. 550 – 560.