September 30, 2021

Validating 2020 voters in Pew Research Center’s survey data

By Scott Keeter, Hannah Hartig and Ruth Igielnik

Knowing who voted is critical to developing an accurate understanding of the outcome of an election. But self-reports of voting tend to be somewhat unreliable. Fortunately, surveys asking about voting can be made more accurate by validating respondents’ self-reported turnout with official voting records.

Today, Pew Research Center is releasing an updated version of its 2020 post-election survey dataset that includes validated measures of turnout in the 2020, 2018 and 2016 U.S. general elections, along with a special weight for use with these variables. We validated turnout by attempting to locate an official turnout record for each of the members of the Center’s American Trends Panel (ATP) — our nationally representative survey panel of U.S. adults — in at least one of three commercial databases. These publicly available turnout records are compiled by the states and the District of Columbia as part of their routine administration of elections. Commercial vendors then make the information available to political parties, campaigns and researchers. In this post, we’ll discuss the measures and what you can do with the dataset in more detail.

This dataset is the basis for a report we issued on June 30 about the characteristics of the 2020 electorate, including those who voted and those who did not. The dataset is available as an SPSS statistics file (with the file extension .sav) and is accompanied by a ReadMe.txt file with information about the computation of the turnout variables. All major statistical software packages can read SPSS files, and at the end of this blog post we offer some suggestions for how you can use a free package to analyze the data.

As a reminder, Pew Research Center releases nearly all of its raw survey datasets to the public. Releasing a dataset usually takes anywhere from a few months to more than a year after the data is collected. This delay allows the Center’s staff to fully analyze and report on the data, as well as to clean and anonymize the files to protect respondents from the risk of being personally identified. All data for release can be found on our website. Users are asked to register for an account, after which they can download and manage datasets as often as desired.

How validated voters are defined

To validate turnout among members of the ATP, we attempted to link panel members to a turnout record in at least one of three commercial voter files: one that serves conservative and Republican organizations and campaigns, one that serves progressive and Democratic organizations and campaigns, and one that is nonpartisan.

A member of the ATP is considered to be a voter for a given election if they told us they voted and were recorded as having voted in at least one of the three commercial voter files. Those who said they did not vote in a given election are considered nonvoters. Additionally, nonvoters include anyone — regardless of their self-reported vote — for whom a record of voting could not be found in any of the three commercial voter files. That includes respondents who were not matched in any of the three files. We assumed this last group were not registered voters and therefore had not voted.

(Note: Because of a law passed in 2018, Utah residents can opt to keep their voter registration and vote history data private. Therefore, we could not assume that the absence of a voting record meant that a Utah panelist was a nonvoter. Consequently, Utah residents in the American Trends Panel are considered to be voters if they reported having voted when asked in the post-election survey.)
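The classification rule described above can be summarized in a few lines of code. The following is a minimal Python sketch, not the Center's actual processing code. The function name, its arguments and the encoding of voter file results (True, False or None) are hypothetical and chosen only for illustration.

```python
from typing import Optional, Sequence


def is_validated_voter(
    self_reported_vote: bool,
    file_records: Sequence[Optional[bool]],
    state: str,
) -> bool:
    """Apply the validated-voter rule for one panelist and one election.

    file_records holds one entry per commercial voter file (hypothetical
    encoding): True = turnout record found, False = matched but no turnout
    record, None = panelist not matched in that file.
    """
    # Anyone who said they did not vote is a nonvoter.
    if not self_reported_vote:
        return False
    # Utah exception: a self-report of voting is accepted without a record.
    if state == "UT":
        return True
    # Otherwise a turnout record in at least one file is required.
    return any(record is True for record in file_records)


# Said they voted, record found in one of three files
print(is_validated_voter(True, [None, True, False], "PA"))
# Said they voted, but not matched in any file
print(is_validated_voter(True, [None, None, None], "PA"))
```

Note that a panelist who was not matched to any file falls through to the last line and is treated as a nonvoter, which mirrors the assumption stated above.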

Overall, we matched 97% of our voting-eligible panelists to at least one of these files and located a 2020 turnout record (or self-report in the case of Utah) for 9,668 panelists. Panelists who could not be matched or for whom no 2020 turnout record could be located were considered to be validated nonvoters (1,477 panelists).
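As a quick check on these counts, the two groups can be added together to get the number of panelists who were classified. This short Python sketch uses only the two figures quoted above, and the variable names are illustrative. The resulting percentage is an unweighted share of panelists, not an estimate of national turnout; the special weight described later in this post is what aligns turnout with official figures.

```python
validated_voters = 9_668      # 2020 turnout record located (or Utah self-report)
validated_nonvoters = 1_477   # unmatched, or no 2020 turnout record located

classified = validated_voters + validated_nonvoters
unweighted_share = validated_voters / classified

print(f"{classified:,} panelists classified")                          # 11,145
print(f"{unweighted_share:.1%} are validated voters before weighting")  # 86.7%
```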

For additional details about the voter file matching and voter verification process, see Pew Research Center’s report on the 2020 electorate and this report on how voter files are used to study U.S. politics.

How this dataset can be used

The dataset Pew Research Center is releasing today will allow users to replicate or extend portions of the analysis presented in our June report on the 2020 election. Specifically, it is intended to allow users to analyze the 2020 electorate and those who turned out in the 2020 general election. Users can also examine 2018 and 2016 turnout and vote choice among those who participated in the 2020 survey.

This dataset cannot be used to replicate the analysis presented in the Center’s reports on either the 2018 or 2016 elections, or the data from those reports that is included in the 2020 report’s detailed tables and discussed in the study. The 2018 and 2016 reports were based on post-election surveys conducted in the weeks after the two elections and voter validations conducted several months later. Those surveys did not include people who joined the panel in subsequent years. Consequently, they are not directly comparable to results from the 2020 post-election survey and its voter validation.

For panelists in the 2020 survey who were not in the panel in 2018 or 2016, their turnout in the earlier elections is coded based on the presence or absence of turnout records for the relevant elections in the three voter files obtained this year for the 2020 validation. Similarly, vote choice among these voters is based on retrospective measures asked in surveys taken after they joined the panel. For the 2020 panelists who participated in the 2018 or 2016 surveys, turnout is coded based on voter validations conducted following those elections, and their vote choice was measured in post-election surveys in those years.

The variables of interest

The dataset includes seven new variables:

A special weight to be used with any analysis of the validated vote

Measures of validated turnout for 2020, 2018 and 2016

Vote choice among panelists who voted in 2020, 2018 or 2016

Noncitizens (F_CITIZEN=2,99) are coded as missing for these measures of turnout and vote choice.

The special weight

The special weight is named WEIGHT_W78_VALIDATEDVOTE. The weight adjusts the sample on the large set of variables used in a typical wave of the ATP but includes additional parameters for turnout and vote choice in the 2020, 2018 and 2016 elections. This weight should be used in conjunction with any analysis involving the variables described below. For analysis that does not require identifying voters or nonvoters, WEIGHT_W78 should be used instead.

The methodological report for the 2020 study describes the weighting process in greater detail.

The turnout variables

Validated turnout variables for the three elections are as follows:

These are dichotomous variables coded “1” for validated vote and “0” otherwise.

When the weight is applied, voter turnout matches the national turnout among the voting-eligible population as documented by the U.S. Elections Project, based on ballots counted for the highest office in the election. The share of adults who were eligible to vote in each election is based on the 2019 American Community Survey, the latest available at the time.
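To make concrete what applying the weight to a 0/1 turnout variable involves, here is a minimal Python sketch of a weighted share. The function name and the toy values are invented for illustration and are not Pew data. In practice, the weights would come from WEIGHT_W78_VALIDATEDVOTE, and cases coded as missing (such as noncitizens) would be skipped, as the sketch does for None.

```python
def weighted_share(values, weights):
    """Weighted share of 1s in a 0/1 variable, skipping missing (None) values."""
    numerator = 0.0
    denominator = 0.0
    for value, weight in zip(values, weights):
        if value is None:  # missing, e.g. a noncitizen
            continue
        numerator += weight * value
        denominator += weight
    if denominator == 0:
        raise ValueError("no non-missing cases with positive total weight")
    return numerator / denominator


# Toy data, invented for illustration
turnout = [1, 1, 0, None, 1, 0]
weight = [0.5, 1.0, 2.0, 1.2, 0.5, 1.0]

print(weighted_share(turnout, weight))             # 0.4
print(weighted_share(turnout, [1.0] * 6))          # 0.6 with equal weights
```

In the toy data, voters carry smaller weights than nonvoters, so the weighted share is lower than the unweighted one. This is the kind of adjustment a turnout weight makes when voters are overrepresented in a sample.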

The vote choice variables

Vote choice variables for the three elections are as follows:

These variables are coded as “1” for the Republican candidate, “2” for the Democratic candidate and “3” for candidates of other parties.

When the weight is applied, candidate choice for each election matches vote shares for each party’s candidate(s), as documented by the Federal Election Commission.
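Weighted vote shares can be tabulated from the 1/2/3 coding described above in much the same way as turnout. The following is a minimal Python sketch with invented toy data. The function name and label dictionary are hypothetical, and any value outside the three codes (including missing values) is left out of the base.

```python
from collections import defaultdict

# Codes as described above; the label text is illustrative
LABELS = {1: "Republican", 2: "Democratic", 3: "Other"}


def weighted_vote_shares(choices, weights):
    """Return each candidate group's weighted share among cases coded 1, 2 or 3."""
    totals = defaultdict(float)
    for choice, weight in zip(choices, weights):
        if choice in LABELS:  # skip missing or uncoded cases
            totals[LABELS[choice]] += weight
    grand_total = sum(totals.values())
    if grand_total == 0:
        raise ValueError("no coded vote choices with positive total weight")
    return {label: total / grand_total for label, total in totals.items()}


# Toy data, invented for illustration
choices = [1, 2, 2, 3, None, 1]
weights = [1.0, 1.5, 0.5, 1.0, 2.0, 1.0]

for label, share in weighted_vote_shares(choices, weights).items():
    print(f"{label}: {share:.0%}")
```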

Some tips for analyzing the data

The 2020 dataset described here is available for download as an SPSS file. Nearly any statistical program designed for the analysis of surveys can read the SPSS file. However, it’s important to note that spreadsheet software like Microsoft Excel or Google Sheets should not be used; these and other similar programs may not be able to read the SPSS file and do not have the capability to use the survey’s weights, which are critical for producing accurate estimates.

Statistical software packages like SPSS, SAS, Stata or R can tabulate the weighted data and reproduce the analyses found in our report. But users should ensure that their software package is properly accounting for the effect of weighting on the precision of the estimates. That is, estimates of the margin of error or the significance of differences between two groups in the sample will be incorrect unless the survey software has the ability to correctly account for the impact of the weighting on the variance of the estimates.
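To give a sense of why weighting affects precision, the sketch below uses Kish's approximation, a common rule of thumb in which unequal weights shrink the "effective" sample size. This is an assumption chosen for illustration, not the method used in the Center's reports, and it is no substitute for survey software that estimates variances properly. The function names and toy weights are hypothetical.

```python
import math


def kish_effective_n(weights):
    """Kish's approximate effective sample size: (sum of w)^2 / sum of w^2."""
    total = sum(weights)
    total_of_squares = sum(w * w for w in weights)
    return total * total / total_of_squares


def approx_margin_of_error(p, weights, z=1.96):
    """Approximate 95% margin of error for a proportion p, using the effective n."""
    n_eff = kish_effective_n(weights)
    return z * math.sqrt(p * (1 - p) / n_eff)


# Toy weights, invented for illustration
toy_weights = [0.5, 0.5, 1.0, 1.0, 2.0, 2.0]

print(len(toy_weights), "cases, effective n =", round(kish_effective_n(toy_weights), 2))
print("approximate margin of error at p = 0.5:",
      round(approx_margin_of_error(0.5, toy_weights), 3))
```

With equal weights, the effective sample size equals the actual number of cases. The more the weights vary, the smaller the effective sample size and the wider the margin of error, which is why treating weighted data as a simple random sample overstates precision.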

Fortunately, the open-source statistical package R is free and has the capability to correctly handle the weighting of survey data. Our colleagues at the Center have developed some special packages within R and have written guides to the use of R and these packages.

This blog post describes the basics of using R to read and analyze Pew Research Center data, including properly handling the weights to create an accurate margin of error and tests of significance: https://medium.com/pew-research-center-decoded/how-to-analyze-pew-research-center-survey-data-in-r-f326df360713

This post is an introduction to a special package written by the survey methodology team (named “pewmethods”) to simplify several tasks in working with survey data: https://medium.com/pew-research-center-decoded/introducing-pewmethods-an-r-package-for-working-with-survey-data-97601a250a46

This post provides a guide to the use of the pewmethods package: https://medium.com/pew-research-center-decoded/exploring-survey-data-with-the-pewmethods-r-package-198c4eb9d1af

And here is an explanation of how you can use the popular set of R packages known as the “tidyverse” to explore Pew Research Center’s survey data: https://medium.com/pew-research-center-decoded/using-tidyverse-tools-with-pew-research-center-survey-data-in-r-bdfe61de0909