A challenge for intersectional survey research is the “Small n Problem”: when there are too few observations in the sample to permit the desired analysis.
To perform the multivariate quantitative techniques popular in analyses of survey data, we need an adequate number of cases in each category.
For instance, if gender and class are necessary for our study, and we want a detailed class schema (i.e., more than two nominal categories), then we face the problem of having too few cases in a given gender*class category. With extant survey and administrative data, we generally have too few cases to allow rigorous intersectional analysis with the extant statistical techniques (see Hancock 2013 for QCA methods).
In this post, we examine solutions to the Small n Problem in intersectionality research.
Solutions to the Small n Problem
To solve the Small n Problem using cross-national surveys, we have six main options (each option has its own cross-national data and measurement comparability issues):
1) Limit the number of intersections and the content of intersections, i.e. create only those that have a sufficient number of cases;
2) When analyzing a concept with multiple categories, such as social class, meaningfully combine those categories, i.e. “pooling categories”;
3) Pool countries within one survey wave, i.e. “pooling countries”;
4) Pool the same country across multiple survey waves, i.e. “pooling time”;
5) Pool countries and time;
6) Harmonize different datasets of the same country, i.e. “pooling international survey projects.”
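Options 1 and 2 can be illustrated with a minimal sketch. It counts the cases in each gender*class intersection, flags the cells that fall under a minimum size, and then repeats the check after pooling class categories. The respondent data, the class labels, the threshold of 30, and the two-category recode are all hypothetical and chosen only for illustration; they do not come from any survey discussed here.

```python
from collections import Counter

# Hypothetical respondent records as (gender, class) pairs. Illustrative only.
respondents = (
    [("woman", "professional")] * 40
    + [("woman", "skilled manual")] * 12
    + [("woman", "unskilled manual")] * 9
    + [("man", "professional")] * 45
    + [("man", "skilled manual")] * 30
    + [("man", "unskilled manual")] * 14
)

MIN_CELL = 30  # assumed minimum cell size, not a rule taken from the text


def cell_counts(records):
    """Count the cases in each gender*class intersection."""
    return Counter(records)


def small_cells(counts, min_cell=MIN_CELL):
    """Return the intersections that have fewer than min_cell cases."""
    return {cell: n for cell, n in counts.items() if n < min_cell}


# Option 2 ("pooling categories"): an assumed recode into two class categories.
POOLED_CLASS = {
    "professional": "non-manual",
    "skilled manual": "manual",
    "unskilled manual": "manual",
}


def pool_categories(records, mapping=POOLED_CLASS):
    """Replace each detailed class category with its pooled category."""
    return [(gender, mapping[cls]) for gender, cls in records]


before = small_cells(cell_counts(respondents))
after = small_cells(cell_counts(pool_categories(respondents)))
print(before)
print(after)
```

In this toy data, three of the six detailed cells fall under the threshold, and one of the four pooled cells still does. Pooling categories reduces the problem at the cost of a coarser class schema, and it may not remove the problem entirely.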
Pooling International Survey Projects, i.e. Survey Data Harmonization
We now go in-depth on ex post cross-national survey data harmonization (SDH) as a solution to the Small n Problem.
Up to now, the largest and most widely available cross-national survey data sources allow researchers to pool both countries and time (World Values Survey, European Social Survey, International Social Survey Programme, and the like).
What if you want to analyze one country within a comparable time period, and still need a large enough number of cases to create and analyze intersections? What if, in the construction of intersections, you do not want to be limited in which intersections you can create, and do not want to be forced to pool categories within the concept?
What intersectional categories are of interest and how many cases are available?
One solution is to pool international survey projects. This requires the harmonization of multiple cross-national survey projects. The Survey Data Recycling project defines it as follows: Cross-national survey data harmonization combines surveys conducted in multiple countries and across many time periods into a single, coherent dataset. It is a generic term for procedures that aim to achieve, or at least improve, the comparability of surveys over time and of surveys from different countries (Granda and Blasczyk 2010; Granda, Wolf and Hadorn 2010).
Ex post survey data harmonization is an especially complex process, because it combines projects that were not specifically designed to be comparable. Though fraught with daunting methodological complexity, ex post harmonization can tap the great wealth of cross-national surveys produced by the international social science community in ways that influence our substantive and methodological knowledge (Dubrow and Tomescu-Dubrow 2015).
In essence, we could analyze a single country within a reasonable time frame if we combine the international survey projects in which that country appears. For example, if we want to analyze Poland, we can analyze the European Social Survey in 2002, where n is approximately 1500, which is too small for fine-grained analysis of a class schema with greater than two or three categories. But, if we pool Poland ESS and WVS, for example, for the years 2000 – 2003, our n increases to approximately 3500. If we add more survey projects, we add more cases. Our modeling choices increase.
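The pooling step described above can be sketched as follows, assuming each project's data has already been reduced to a common record layout. The `Respondent` fields, the country codes, and the toy records are hypothetical; real project files differ in variable names and codings and would need harmonizing first.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Respondent:
    project: str       # survey project label, e.g. "ESS" or "WVS"
    country: str       # country code (hypothetical coding)
    year: int          # year of fieldwork
    gender: str
    social_class: str


def pool_projects(datasets, country, first_year, last_year):
    """Stack respondents from several projects for one country and period."""
    pooled = []
    for records in datasets:
        for r in records:
            if r.country == country and first_year <= r.year <= last_year:
                pooled.append(r)
    return pooled


# Toy data standing in for two survey projects. Illustrative only.
ess = (
    [Respondent("ESS", "PL", 2002, "woman", "professional")] * 3
    + [Respondent("ESS", "DE", 2002, "man", "professional")] * 2
)
wvs = (
    [Respondent("WVS", "PL", 2001, "man", "skilled manual")] * 4
    + [Respondent("WVS", "PL", 1997, "woman", "skilled manual")] * 1
)

pooled = pool_projects([ess, wvs], country="PL", first_year=2000, last_year=2003)
print(len(pooled))
```

Each pooled record keeps its `project` label, so later models can check or control for differences between the source projects rather than treating the pooled file as a single survey.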
The Survey Data Recycling project, a partnership between (a) the Institute of Philosophy and Sociology of the Polish Academy of Sciences and (b) The Ohio State University, produced a large-scale harmonized dataset that we discussed above.
Challenges of Pooling International Survey Projects
Survey Data Harmonization has a lot of promise and a lot of challenges.
It is a tremendously time-, effort-, and money-consuming endeavor, and the outcome is a target data set with a very large n and only a few harmonized variables. The methodological problems are even more daunting.
To identify all of the methodological challenges inherent in Survey Data Harmonization, it is best to start with the overarching methodological challenge in data comparability, and then recognize that there are numerous methodological problems and room for error at each step of the harmonization process (Tomescu-Dubrow and Slomczynski 2016).
In harmonization, this means moving from source variables – the original variables in the datasets of particular surveys – to target variables, i.e. the harmonized, common variables produced from the source variables.
The challenge of cross-national Survey Data Harmonization is to produce meaningful data that accounts for all of the error produced in the data lifecycle (Granda and Blasczyk 2010). This lifecycle runs from the initial data source (e.g. each country involved in the international survey research project), through the harmonization decisions undertaken by the Survey Data Harmonization project (creation of the target variables), to data cleaning of the final master file (the harmonized data).
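The move from source variables to a target variable can be sketched with a small crosswalk. This is a minimal illustration under stated assumptions: the project labels, source variable names, codes, and the target name `t_gender` are all hypothetical and do not reproduce the codings of any actual survey project or of the Survey Data Recycling data.

```python
# Hypothetical crosswalk: for each source project, the name of the source
# variable and the recode from source codes to target codes.
CROSSWALK = {
    "project_A": {"source_var": "sex", "recode": {1: "man", 2: "woman"}},
    "project_B": {"source_var": "q_gender", "recode": {"M": "man", "F": "woman"}},
}


def harmonize_gender(record, project):
    """Build the target variable from one source record.

    The target value is None when the source code has no mapping
    (for example a refusal code), so gaps stay visible instead of
    being silently recoded.
    """
    spec = CROSSWALK[project]
    source_value = record.get(spec["source_var"])
    return {
        "t_gender": spec["recode"].get(source_value),
        "source_project": project,
    }


row_a = harmonize_gender({"sex": 2}, "project_A")
row_b = harmonize_gender({"q_gender": "M"}, "project_B")
row_missing = harmonize_gender({"sex": 9}, "project_A")
print(row_a, row_b, row_missing)
```

Keeping the source project next to each target value is one way to leave a trace of where a harmonized value came from, which matters because each source can carry its own errors into the target variable.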
Thus, not only do Survey Data Harmonization projects inherit the errors of the initial data source, but they may create their own in the harmonization process (see Tomescu-Dubrow and Slomczynski 2016 on explicitly accounting for errors using the Survey Data Recycling approach).
The implication for the future of the analysis of intersectionality using existing cross-national surveys is clear: By pooling international survey projects, we can create, for a single country or a group of comparable countries, a sufficient number of cases within a comparable time period.
Big Data and Intersectionality
Pooling international survey projects creates a very large n. The idea and the resulting data set invite comparisons to the big data wave that is now popular in business and science, and has been aggressively reported on in the mass media. Sage recently launched a journal on the topic, Big Data & Society, and many books have been published about it (e.g. Mayer-Schonberger and Cukier 2013).
Survey Data Harmonization of a certain size becomes big data, but does not necessarily share the ethos of big data enthusiasts, especially those working in the business sector. In cheerleading big data, Mayer-Schonberger and Cukier (2013) argue that we should not care about the errors in the data (in their words, “messiness”).
Since their data allegedly captures everyone, they assume the data are representative of the population (in their words, “n = all”). They also advocate for correlation over the search for causation.
Big data produces large enough numbers of cases for all sorts of analyses that are popular with quantitative social scientists. As Mayer-Schonberger and Cukier (2013: 189) put it in their discussion of exit polls,
“…exit polls on election night query a randomly selected group of several hundred people to predict the voting behavior of an entire state. For straightforward questions, this process works well. But it falls apart when we want to drill down into subgroups within the sample. What if a pollster wants to know which candidate single women under 30 are most likely to vote for? How about university-educated, single Asian American women under 30? Suddenly, the random sample is largely useless, since there may be only a couple of people with those characteristics in the sample, too few to make a meaningful assessment of how the entire subpopulation will vote.”
What they call “drill down into subgroups” is what advertising companies have done for a long time. Targeted advertising is built on finding “subgroups.” The difference between the pre-big data era and now, besides the ubiquitous invasion of privacy by social media and other internet companies, is a combination of very large numbers of cases with the technological and statistical sophistication that enables researchers to use large-scale data-processing software such as Hadoop to derive correlations.
Knowing all this, there is no straight line from big data to “intersectionality.” Big data and “drilling down into subgroups” do not mean that we understand the identities of the people and groups, and they do not mean that we account for power structures. Without identity and power structures, we may not have intersectional analysis.
Joshua K. Dubrow holds a PhD from The Ohio State University and is a Professor of Sociology at the Polish Academy of Sciences.