The Rise of Big Data and Computational Social Science in Sociology

In this post, we examine the rise of big data and computational social science in sociology.

Many sociologists complain about quantification

Academics (on Twitter and elsewhere) often loudly complain about metrics: the quantification of their output, including impact factors, citations, and the “points” that universities, other academic institutions, and government institutions devise to calculate an academic’s worth in a given year.

Why do academics complain? They complain because metrics inform the policies that administrators use to reward, punish, or otherwise alter the work lives of academics.

Who is responsible? Academics have themselves to blame: they have been at the forefront of quantification, though business and other fields and endeavors have also contributed.

Krüger (2020) provides a history of bibliographic databases. In short, it started in the 1960s, innocently enough, as a way for scientists to be more aware of each other’s research. They entered the data by hand. By the 2000s, the process had been automated. By the mid-2010s, million-dollar corporations had produced so much bibliographic data that they had to market even its metadata to find a profitable use for it all.

Far from its original purpose of building awareness of scientific research, bibliographic big data became a way to control science itself.

Is quantification leading sociology to meritocracy? (answer: no)

Counting is supposedly meritocratic and logical, imposing order on a seemingly chaotic and complex world. Yet algorithms and other counting schemes are human-made and thus subject to human error, driven by social biases of many kinds (Kotliar 2020: 920).

“…while recent research has shown that algorithms stem from specific socio-cultural contexts and that data tends to mirror the social surroundings from which it was extracted, the geographical and cultural distances between those who develop profiling algorithms and those who are being profiled remains overlooked.”

Algorithms are ways to crawl over the data about us, the humans, and our interactions with machines and with other humans. Increasingly, algorithms, designed by humans and employed in human-programmed machines, help shape our social interactions (Fourcade and Johns 2020; Kotliar 2020; Edelmann et al. 2020).

Data has become part of the power structure

Data — their collection and analysis and interpretation — have become part of the power structure.

The powerful use data to create and justify decisions of all kinds. Digital technologies aid them.

Digital technologies are tools for storing and sharing information. Since the 1950s, these technologies have had two main parts: computers (software and hardware) are the storage bin, and the internet is the sharer-in-chief. The interaction chains that bind us come in three possible forms: human-to-human, human-to-computer, and computer-to-computer. Only human-to-human interaction is (potentially) without computer intrusion.

As a result, sociologists, by cheerleading quantification (and big data and computational social science), are helping to cement data’s place in the power structure.

Digital technology is efficient

There are clear benefits to digital technology, of course.

Digital technology can allow humans to talk more efficiently to other humans, or computers to talk to one another. It has enabled globalization by being the most efficient way to store and share information; it moves money, makes people money, and transfers knowledge and culture. (Tweets are worthless in and of themselves, but Tweets from the right persona can cause small-scale economic and political havoc.)

The ubiquity and portability of digital hardware strengthen the bonds of human-to-computer and computer-to-computer interaction, increasingly at the expense of human-to-human interaction. Spying with computers, where humans are unaware that a computer has intervened in the relationship, robs humans of genuine human-to-human interaction.

… but digital life can be unnatural

The unnatural environment of living in cyberspace, a reality brought into sharp relief by the Covid-19 pandemic (Fourcade and Johns 2020; Milan 2020), is built on digital technologies that bring endless information and endless segmentation. Our desire for information, manifested in the internet, is rooted in our desire to reduce choice complexity.

Humans prefer the simple over the complex: when faced with a too-large array, we aggregate and categorize; we segment. When faced with new information, we look for how the new segment fits into old segments. Then we look for ways to house this information. We created computers to house that information and to help analyze it.

When technologists feel that humans are not adequate to the task, they turn to machine learning to do more analysis without a direct human touch (Fourcade and Johns 2020: 804):

“machine learning refers to the practice of automating the discovery of rules and patterns from data, however dispersed and heterogeneous it may be, and drawing inferences from those patterns, without explicit programming.”
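
To make that definition concrete, here is a minimal sketch (assuming the scikit-learn library; the toy data and the “heavy user” label are invented for illustration, not an example from the authors cited here). No classification rule is written by hand; the model induces one from the examples.

    # A toy illustration of "discovering rules from data without explicit programming".
    # Assumes scikit-learn is installed; data and labels are invented for the example.
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Each row: [hours online per day, number of accounts]; label 1 = "heavy user".
    X = [[1, 2], [2, 1], [3, 3], [6, 8], [7, 9], [8, 7]]
    y = [0, 0, 0, 1, 1, 1]

    # No if/else rule is programmed; the model infers a splitting rule from the data.
    model = DecisionTreeClassifier(max_depth=1).fit(X, y)
    print(export_text(model, feature_names=["hours_online", "accounts"]))
    print(model.predict([[5, 6]]))  # classify a new, unseen case

The rule comes out of the data rather than being coded line by line, which is why the biases in that data, and in the choices of its designers, matter.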

Machine learning assists in Giddens’ structuration: because machines are programmed by people, machine learning reinforces the social beliefs of its programmers (Fourcade and Johns 2020). To what end, we do not yet know.

We do know that the data fed into those machines can be small, or they can be big data.

Big Data: What are the sources of big data?

Big Data are both a source of complexity and a way to organize information. In the literature, big data has many definitions, but most seem to agree that it is an unusually large dataset drawn from diverse sources of information.

Lazer and Radford (2017) define big data:

“The term big data thus refers to data that are so large (volume), complex (variety), and/or variable (velocity) that the tools required to understand them must first be invented…”

Big data sets can be millions of cases long. Quantitative social science has long thought of data as cells within rows and columns. Today, big data can also take the form of pictures, videos, words, and numbers (Lazer and Radford 2017).

We can posit three main types of big data sources (from Lazer and Radford 2017):

Digital life

These are behaviors and other expressions conducted via computer, or captured or broadcast via the internet. Social media and Wikipedia are the exemplars.

Digital traces

This is what Lazer and Radford (2017) call “the archival exhaust of the modern bureaucratic organization.” Sometimes called metadata, these are records of the actions that organizations apply to various entities, including people. They are records of action, but not the action itself: the cell phone tower, caller name, time, and date, but not the text, audio, or video of the phone call.
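
As a minimal sketch, a single digital trace of that phone call might look like the record below (the field names and values are invented for illustration); note what is absent.

    # A hypothetical digital-trace record: metadata about a call, not the call itself.
    call_trace = {
        "caller_name": "A. Smith",            # invented example values
        "cell_tower_id": "TOWER-0042",
        "timestamp": "2020-03-14T09:26:53",
        "duration_seconds": 312,
        # note what the record lacks: no "audio", no "transcript" -- the action itself
    }
    print(call_trace)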

Digitalized life

These are records of the everyday, nondigital activities of people and other nondigital entities, such as Google Books scans or Bluetooth connections. These data can be selected and combined into other big data databases, a practice that Lazer and Radford (2017) argue will become increasingly common.

What is Computational Social Science?

Big data requires computational social science just to handle it.

Edelmann et al. (2020) argue that to harness these data, computational social science (CSS) is required. CSS does not require big data, but it is necessary for big data analysis. CSS originally meant simulations, e.g. agent-based modeling. Scholars then began applying the CSS label to big data analysis, such that any big data analysis is now called CSS.
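
For readers unfamiliar with that simulation tradition, here is a minimal agent-based model sketch (a toy “voter model” of my own devising, not an example from Edelmann et al.): agents on a ring repeatedly adopt a random neighbor’s opinion, and a macro-level pattern emerges from micro-level rules.

    # A toy agent-based model: opinion copying on a ring of agents.
    # All rules and parameters are invented for illustration.
    import random

    random.seed(1)
    N, STEPS = 20, 500
    opinions = [random.choice([0, 1]) for _ in range(N)]  # initial binary opinions

    for _ in range(STEPS):
        i = random.randrange(N)                       # pick a random agent
        neighbor = (i + random.choice([-1, 1])) % N   # its left or right neighbor
        opinions[i] = opinions[neighbor]              # adopt the neighbor's opinion

    print("share holding opinion 1:", sum(opinions) / N)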

Edelmann et al (2020) define CSS:

“Computational social science is an interdisciplinary field that advances theories of human behavior by applying computational techniques to large datasets from social media sites, the Internet, or other digitized archives such as administrative records.” (62)

The uses of big data are many.

Big data can capture actual human behavior rather than self-reports of that behavior. Such data enable “nowcasting,” which seeks to depict real-time events, e.g. Covid-19 cases and deaths. Big data are used to understand mass human behavior, such as cultural connections, migration and mobility, diurnal patterns, and so on. They can serve as natural and field experiments, showing how people react to “rule changes” made to digital software by Silicon Valley corporations, or to natural events. They can solve the small-n problem: an n can be too small for statistical analysis in conventional data, but when combined with similar data, or drawn from big data sources, the “small n” becomes a “large n.”
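
As a minimal sketch of that last point, pooling several small, similarly structured samples grows the n (the toy data and the pandas workflow below are illustrative assumptions, not a method from the articles cited here).

    # Pooling small samples into a larger dataset: a toy "small n to large n" example.
    # Assumes the pandas library; countries, ages, and outcomes are invented.
    import pandas as pd

    samples = [
        pd.DataFrame({"country": "PL", "age": [34, 51], "protested": [1, 0]}),
        pd.DataFrame({"country": "DE", "age": [29, 62], "protested": [0, 0]}),
        pd.DataFrame({"country": "US", "age": [45, 23], "protested": [1, 1]}),
    ]

    # Each frame alone is too small to analyze; stacked together, the n grows.
    pooled = pd.concat(samples, ignore_index=True)
    print(len(pooled), "cases in the pooled dataset")
    print(pooled.groupby("country")["protested"].mean())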

The problems with using Big Data in sociological research

Big data also has many problems. Here are a few of them.

1. Representation and Generalizability

A major methodological problem concerns representation and generalizability. These data may be very large but heavily biased. The much-derided “convenience samples” have turned into “convenience censuses” (Lazer and Radford 2017). Big data datasets also tend to be of one type, e.g. Tweets or cell phone data, but not both. How do we merge them? This is a problem that computational social science may solve.

2. Opaqueness where transparency should be

Changes in the data may be due to software glitches or some artifact other than actual human expression.

“For example, in the Google Ngram project, the word ‘fuck’ is used with startling frequency in books published through 1800, and drops to near zero during the 1800s. Upon closer inspection, it is clear that this did not reflect some dramatic shift in social mores, but rather is an artifact of contemporary optical character recognition systematically misinterpreting an archaic version of ‘s’ as an ‘f’” (Lazer and Radford 2017: 30).
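
One way such an artifact could be caught is with a crude check of whether an “f”-word becomes a known word when the “f” is read as an archaic long s. The tiny word list and heuristic below are my own illustration, not the Ngram team’s method.

    # A crude, hypothetical check for long-s OCR errors: if swapping "f" for "s"
    # turns an odd token into a common word, the token in an old book is probably
    # an OCR artifact rather than the word itself. The lexicon is invented.
    LEXICON = {"suck", "suppose", "some", "said", "house", "self"}

    def likely_long_s_error(token: str) -> bool:
        """Return True if the token looks like a long s misread as 'f'."""
        if "f" not in token:
            return False
        candidate = token.replace("f", "s")
        return candidate in LEXICON and token not in LEXICON

    for word in ["fuck", "fuppofe", "fact"]:
        print(word, "->", likely_long_s_error(word))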

3. Influence of outside actors

The data can also reflect behaviors manipulated by outside actors, such that we cannot treat human expression in big data as natural; rather, it could have been astroturfed by tech companies, governments, or a collection of users.

4. Robots and Catfishing

Bots pose a threat: they crawl the web, too, and leave digital traces. What may be thought of as human expression may actually be “bot expression.”

Humans may not be who they say they are: they may manipulate their own profile to give a false impression (catfishing).

Ethics: A big data problem

There are ethical problems in combining data (an unassembled dataset or panel) to create revealing profiles of real people who have not given their consent to this uber-collection. Relatedly, there is the unauthorized but apparently legal use of publicly available data. For example, the Cambridge Analytica scandal was about using publicly available data, in partnership with private organizations such as Facebook, to create intensely revealing profiles of individuals.

The EU’s GDPR and the “right to be forgotten” were attempts to reduce one’s unintentional digital profile. Users’ private data are monetized to create big data. Who owns your data once they go into big data? Corporations and governments whose leaders change over time, while your data stay on their servers for the next leader to use.

Hasselbalch (2019), writing of big data ethics, referred to the “Big Data Society,” defined as the “distribution of societal powers in the socio-technical systems.” These 

“data systems are increasingly ingrained in society in multiple forms (from apps to robotics) and have limitless and wide-ranging ethical implications (from price differentiation to social scoring), necessitating that we look beyond design and computer technology as such.” (p. 5)

To understand the structures of power that help or imperil us, it is not enough to look at human-machine interactions. We need to look at human-human interactions. Hasselbalch (2019) argues that a data ethics of power is part of the larger power structure, and that we should create a “human-centric” distribution of power such that ethics are at the center of data collection, use, and distribution.

The Cheerleaders of Big Data

Despite all of these problems, cheerleaders of big data see big changes as imminent. Mayer-Schönberger and Cukier (2014) write that the social sciences will change as a result of big data:

“One of the areas that is being most dramatically shaken up by n = all is the social sciences. They have lost their monopoly on making sense of empirical social data, as big-data analysis replaces the highly skilled survey specialists of the past. The social science disciplines largely relied on sampling studies and questionnaires. But when the data is collected passively while people do what they normally do anyway, the old biases associated with sampling and questionnaires disappear. We can now collect information that we couldn’t before, be it relationships revealed via mobile phone calls or sentiments unveiled through tweets. More important, the need to sample disappears.”

Or, not. A recent article by Sturgis and Luff (2020) finds that big data has not overtaken the survey, at least as of 2015.

“We assess the case for a decline in the use of survey data in the social sciences during a period in which conventional survey research has faced existential challenges to its ongoing feasibility and growing competition from new forms of ‘Big Data’…” Noting studies of this kind since the 1980s, they “update these studies to include the period 2014 to 2015”:

“While our analysis reveals the emergence of a small proportion of articles using Big Data, we find no evidence of a concomitant decline in the use of survey data.

On the contrary, the use of surveys increased, being used in nearly half of all published articles in this set of journals in 2014/15 and, where articles reported using Big Data, many of them also used survey data.”

The Future of Computational Social Science

Perhaps social science will change but not be replaced. Edelmann et al (2020) argue that the future of computational social science is not in replacing the social sciences, but in building theory.

“In our view, the most influential work within computational social science in the coming years will be the type that is able to link macro levels of theories about topics such as cultural change to microlevel processes of decision making.” (74)

Conclusion: Coupling Social Science with Computational Social Science

The danger of degrading the links between big data, CSS, and the social sciences is clear and present.

Big data come without the direct input of social scientists. Big data are “found” data that social scientists can manipulate and use for their purposes. Edelmann et al.’s (2020) main concern is that CSS is often not rooted in sociological theory: “Indeed, the majority of theorizing in the broader field of computational social science outside of sociology either is inattentive to sociological theory or focuses on a handful of influential ideas” (74). Edelmann et al. want more theory on how computers and software mediate human communication and thus how machines reshape human thoughts and behaviors (see, for example, Fourcade and Johns 2020). Whereas Edelmann et al. (2020) argue that sociologists can assist with these problems by positing theories of the interaction between humans and machines, Mayer-Schönberger and Cukier (2014) argue that a whole barrel-full of correlations will show how humans behave.

There is a rapidly growing, and perhaps uncontrollable, situation in which machines handle human affairs (Lazer and Radford 2017; Edelmann et al 2020).

The data that we produce depend on these machines. Humans asked for these data to be collected, and let the machines do the work.

But how humans program these machines, and how we use and interpret them, will remain, hopefully, under human control.

Readings

Edelmann, Achim, Tom Wolff, Danielle Montagne, and Christopher A. Bail. “Computational Social Science and Sociology.” Annual Review of Sociology 46 (2020).

Fourcade, Marion, and Fleur Johns. “Loops, ladders and links: the recursivity of social and machine learning.” Theory and Society 49, no. 5 (2020): 803-832.

Hasselbalch, Gry. “Making sense of data ethics. The powers behind the data ethics debate in European policymaking.” Internet Policy Review 8, no. 2 (2019): 1-19.

Jenkins, J. Craig, Kazimierz M. Slomczynski, and Joshua Kjerulf Dubrow. “Political Behavior and Big Data.” International Journal of Sociology (2016): 1-7.

Kotliar, Dan M. “Data orientalism: on the algorithmic construction of the non-Western other.” Theory and Society 49, no. 5 (2020): 919-939.

Krüger, Anne K. “Quantification 2.0? Bibliometric infrastructures in academic evaluation.” Politics and Governance 8, no. 2 (2020): 58-67.

Lazer, David, and Jason Radford. “Data ex machina: Introduction to Big Data.” Annual Review of Sociology 43 (2017): 19-39.

Mayer-Schönberger, Viktor, and Kenneth Cukier. Big Data: A Revolution That Will Transform How We Live, Work, and Think. New York: Mariner Books, 2014.

Sturgis, Patrick, and Rebekah Luff. “The demise of the survey? A research note on trends in the use of survey data in the social sciences, 1939 to 2015.” International Journal of Social Research Methodology (2020).

Notes

Big bibliographics is big business. Krüger (2020: 64):

“When, in 1992, Thomson Reuters bought Web of Science from the Institute of Scientific Information, they paid $210 million (see Jayapradeep & Jose, 2017). When they resold the product to Clarivate Analytics in 2016, they received $3.55 billion for it (see Thomson Reuters, 2016).”

Copyright Joshua Dubrow, The Sociology Place 2022

Joshua K. Dubrow holds a PhD from The Ohio State University and is a Professor of Sociology at the Polish Academy of Sciences.
