Quantitative Research in the Social Sciences and ChatGPT

We covered a basic approach in How to use ChatGPT in Social Science Research; in this post, we focus on specific issues of using ChatGPT for quantitative research.

Caveat: This post reflects what I had learned as of March 2024. AI tools develop quickly, and some of the problems or functionalities I describe below may be obsolete within a year. At this stage, there are few clear rules for any of this. As such, the best advice is to develop a way of thinking about AI.

  1. Basic Principles
  2. Uses of ChatGPT
    1. Analysis of data
    2. Coding
  3. Priming
  4. Writing a quantitative paper in the social sciences using ChatGPT
    1. Caveat
    2. Start with a topic
    3. Theory and literature review
    4. Data
    5. Methods
      1. Variables
    6. Coding: Iteration between ChatGPT and R
      1. Examples of prompts into ChatGPT to produce R code
  5. Summary

Basic Principles

You are the director and you have assistants. Those assistants are AI.

AI assistants are the present and future of social science research. It is best to think now about how to use them.

Learning AI is a skill, like learning methodology. It is good to learn a particular methodological type, but it’s better to think about the basics of methods and to choose the tools that are best for a given problem. The same principle applies to AI. It’s good to learn software, but it is better to think about how to use a range of software products to meet your needs. To do that requires considering how to think about AI, and it requires experimentation.

It’s always best to be specific. Generally, the more specific you are, the better the result.

ChatGPT and similar tools require human supervision and human expertise, especially for quantitative research in the social sciences. You cannot “trust” ChatGPT to do everything right. Like a good director, you have to monitor what your assistants are doing, and give reasonably clear instructions. Unlike humans (or, at least, most humans), ChatGPT has great difficulty interpreting what you want. The AI is creative, but within boundaries, and is likely to make errors in interpretation. At least it will try.

Uses of ChatGPT

With ChatGPT, you can analyze data with Data Analyst and you can write code for R or other statistical software programs.

Analysis of data

With Data Analyst, ChatGPT can read .CSV files of a dataset, even one as large as the European Social Survey. With ChatGPT 3.5, you can input data only in the form of a .CSV file and ChatGPT can read it and provide a table or a list of the data. You can, in the same prompt, ask it to analyze the data. As of now, Data Analyst can only read a few file types. You can easily convert a STATA or SPSS file into .CSV for the Data Analyst tool.
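One way to do that conversion in R is sketched below. It assumes the haven package is installed (its read_dta() and read_sav() functions read Stata and SPSS files). The function name convert_to_csv and the file names in the usage comment are hypothetical.

```r
# Sketch: convert a Stata (.dta) or SPSS (.sav) file to .CSV for upload.
# Assumes the 'haven' package is installed; file names are placeholders.
convert_to_csv <- function(path_in, path_out) {
  ext <- tolower(tools::file_ext(path_in))
  dat <- switch(ext,
    dta = haven::read_dta(path_in),   # Stata
    sav = haven::read_sav(path_in),   # SPSS
    stop("Unsupported file type: ", ext)
  )
  # Drop value labels so that plain numeric codes are written to the CSV
  dat <- haven::zap_labels(dat)
  write.csv(dat, path_out, row.names = FALSE)
  invisible(path_out)
}

# Usage (hypothetical file names):
# convert_to_csv("my_survey.dta", "my_survey.csv")
```

Because the value labels are dropped, keep the codebook at hand: the CSV will contain only the numeric codes.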

NB: You can clean simple datasets in Data Analyst and it will provide you with a cleaned dataset to download. The functionalities of this are promising, but early, and thus not quite to be trusted.

Coding

One of the most powerful uses of ChatGPT is to assist in coding. It knows a variety of coding languages, including R, but also Python, etc. It may provide wrong syntax, but you can put the error messages into ChatGPT and ask it to fix the code. It will do so until the code provides what you intended. I discuss this in detail below.

Priming

Priming is when you provide the AI information on which to base its response. Priming is necessary to get the best use of the AI. For example, “I am an academic researcher…” or “I am a high school student writing a term paper.”

Priming works best with prior knowledge of what you are specifically looking for. At this point, AI is not a mind reader. You can provide definitions of concepts, or specifics of the theory… It depends on what you want to write. The distinction between priming and being specific is blurry. It can all be done in the same prompt, and thus priming = specificity.

Note that ChatGPT is not consistent in its replies. Sometimes it ignores commands you’ve given it, sometimes it adheres to them. But, generally, it gets close.

Writing a quantitative paper in the social sciences using ChatGPT

Caveat

We can create the foundations for a paper. It will look like a paper, but it is not a finished paper because it would need a lot of editing. Instead of thinking of what you can produce with ChatGPT as a complete paper, think of it as the foundations of one, where many of the mundane tasks are quickly accomplished. Any paper conceived and executed in this way requires expertise to turn it into a finished paper.

Start with a topic

It is best, at this stage, to have an idea as to what concepts you want to study.

Theory and literature review

I can ask the AI for definitions of terms, and for papers that examine the relationship between them.

Example of a Prompt: “Please tell me about the relationship between Future Orientation and Political Participation. Please define each term and suggest theories as to their relationship.”

Then, within ChatGPT I ask ScholarAI or some such for papers on this relationship.

Prompt: “Please tell me about major research works about the relationship between Future Orientation and Political Participation.”

The AI assistants point me towards some relevant articles. At this stage, AI bibliographic search tools such as Elicit, Consensus, Connected Papers, and ScholarAI or ScholarGPT are an emerging form of research assistant, but need a lot more development to be consistently useful.

Data

You can ask the AI for suggestions on what data to use to test the hypotheses.

Prompt: “What cross-national survey data would be best to test these hypotheses?”

I know I want to use the European Social Survey. The AI can lead you to the website to download the data, but as of now, it cannot download it for you. (Soon, it will.)

Methods

You can ask the AI…

  • whether the variable is a good fit for the concept (use the codebook)
  • to code the variable
  • which statistical method for analysis would be best, given the parameters of the data and the variables
  • for the code to analyze the data.

As of now, the Data Analyst function is limited. It can do very basic things when the data are well defined. So, we will skip it, and use an iteration between R and ChatGPT.

Variables

Let’s ask ChatGPT which variables are best. It can’t read the data in Data Analyst well enough to do the job, and it misses details when you upload a codebook. When I fed it an ESS codebook and asked for variables that deal with “future orientation,” it missed variables that I know are in there. At this stage, you still need to look through the questionnaire. Priming may solve the problem, but to prime, you need prior knowledge. So, in any case, it would still require your own research.

But, we can ask its opinion, because it is an assistant that you can talk to.

Prompt: “What about this variable that appears in ESS: Do you generally plan for your future or do you just take each day as it comes? Please express your opinion on a scale of 0 to 10, where 0 means ‘I plan for my future as much as possible’ and 10 means ‘I just take each day as it comes’.”

Coding: Iteration between ChatGPT and R

I’ve been working with this off and on since it came out, and this is a summary of what I have learned thus far:

Be specific

  • Prompts must be very specific about variables and analyses.
  • Best practice is to mention the variable explicitly in each prompt.

Effective use of ChatGPT for R requires some R knowledge, though expert coding skills are not necessary.

  • Setting the working directory and specifying the dataset in prompts are crucial for accurate syntax generation.
  • In R, there is a multitude of ways to accomplish tasks, and thus ChatGPT will provide many different ways to code or analyze data.
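If you reuse the same setup in every prompt, you can assemble the prompt text in R itself. This is only a sketch: the helper name build_prompt is hypothetical, and the paths are the same placeholders used in the example prompts further down.

```r
# Sketch: assemble a specific prompt. All paths and names are placeholders.
build_prompt <- function(working_dir, data_path, request) {
  paste0(
    "I need syntax for a research project. The syntax is in R. ",
    "The working directory is \"", working_dir, "\". ",
    "The dataset is found through this path: \"", data_path, "\". ",
    "All of these variables should be saved in the same dataset. ",
    request
  )
}

prompt <- build_prompt(
  working_dir = "fill in your path here",
  data_path   = "fill in your path here",
  request     = paste(
    "Please recode \"agea\" such that all values less than 20 and",
    "greater than 80 are treated as missing cases.",
    "Call the new variable, \"agea_20to80\"."
  )
)

# Print the prompt so it can be pasted into ChatGPT
cat(prompt, "\n")
```

Keeping the working directory, the dataset path, and the variable names in every prompt is what keeps the generated syntax consistent from one request to the next.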

ChatGPT does error correction and revision, but sometimes you need to clear the chat

  • After R gives you an error message, you can cut and paste it back into ChatGPT and ask ChatGPT to correct the code.
  • This is a chat, and thus an iterative process; however, sometimes starting new chats can be more efficient than correcting compounded errors from previous interactions. ChatGPT can build up some bad habits and produce systemic errors based on its chat history.
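If you run the generated code from a script, a small helper can capture the exact error text to paste back. This is a minimal sketch in base R. The helper name capture_error and the object name in the example are hypothetical.

```r
# Sketch: run a piece of generated code and return the error message, if any.
capture_error <- function(code_string) {
  tryCatch(
    {
      eval(parse(text = code_string))
      NA_character_                      # no error: nothing to report
    },
    error = function(e) conditionMessage(e)
  )
}

# 'ess_data_not_loaded' does not exist, so this call fails on purpose
msg <- capture_error("summary(ess_data_not_loaded$agea_20to80)")

# Print the message so it can be pasted into ChatGPT along with the code
cat("R returned this error:\n", msg, "\n")
```

Pasting the failing code together with the message, rather than the message alone, gives ChatGPT the context it needs, especially in a fresh chat.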

ChatGPT 4 is superior

  • ChatGPT 4 is much, much, much, much better than ChatGPT 3.5 in code writing, problem-solving, and helpful suggestion abilities.

Limitations in Automated Code Generation

  • Ask it to provide the entire syntax, including recodes, for accurate syntax.
  • Code generation is capped at a certain character count, requiring manual continuation and assembly of code by the user.

Keep good notes!

  • Keep a record of definitions of the terms, of how they appear in the codebook, of the ChatGPT prompts, and R syntax. 
  • Ask ChatGPT to summarize your notes.
  • As usual, annotate the syntax. ChatGPT does this automatically.

Examples of prompts into ChatGPT to produce R code

Prompt: “I need syntax for a research project. The syntax is in R. The working directory is “fill in your path here”. The dataset is found through this path: “fill in your path here”. All of these variables should be saved in the same dataset.

“Please recode “agea” such that all values less than 20 and greater than 80 are treated as missing cases. Call the new variable, “agea_20to80”. After producing this, please provide a histogram of this variable.”

“Please recode “pbldmn” such that 1 = 1, 2 = 0, and values 7, 8, and 9 = 0. Call the new variable, “pbldmn_recoded”. After producing this, please provide a summary and a histogram of this variable.”
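For orientation, the kind of base R syntax these two prompts aim for might look like the sketch below. It is one possible version, not ChatGPT’s actual output. The small data frame is a made-up stand-in; with real ESS data you would read the file from your own path instead.

```r
# Hypothetical stand-in for the survey data (values are invented)
ess <- data.frame(
  agea   = c(15, 20, 35, 52, 80, 81, 67, 19),
  pbldmn = c(1, 2, 7, 8, 9, 1, 2, 1)
)

# Values below 20 or above 80 become missing
ess$agea_20to80 <- ifelse(ess$agea < 20 | ess$agea > 80, NA, ess$agea)

# 1 stays 1; 2, 7, 8, and 9 become 0; anything else becomes missing
ess$pbldmn_recoded <- ifelse(ess$pbldmn == 1, 1,
                      ifelse(ess$pbldmn %in% c(2, 7, 8, 9), 0, NA))

# Check the recodes before moving on
print(table(ess$agea_20to80, useNA = "ifany"))
print(summary(ess$pbldmn_recoded))

# Histograms of the new variables (missing values are dropped)
hist(ess$agea_20to80, main = "Age, 20 to 80", xlab = "agea_20to80")
hist(ess$pbldmn_recoded, main = "pbldmn recoded", xlab = "pbldmn_recoded")
```

Comparing the table of the new variable against the original is a quick way to confirm that the recode did what the prompt asked.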

The ChatGPT prompts then become more sophisticated. For example:

“Please provide a correlation matrix that includes the p values for the following variables: LIST VARIABLES HERE. Please write the code so that R can send this correlation matrix that includes the p values to an Excel file to my computer.”
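A base R sketch of what this prompt is asking for follows. The built-in mtcars data and its variables are stand-ins for your own dataset. To stay in base R, the sketch writes two CSV files (which Excel opens) to a temporary folder; writing a native Excel file would need an add-on package, which is left out here.

```r
# Stand-in data: 'mtcars' ships with R; replace with your own data and variables
vars <- c("mpg", "wt", "hp")
k <- length(vars)

r_mat <- diag(1, k)                        # a variable correlates 1 with itself
dimnames(r_mat) <- list(vars, vars)
p_mat <- matrix(NA_real_, k, k, dimnames = list(vars, vars))

# Fill both matrices pair by pair; cor.test() uses Pearson's r by default
for (i in seq_len(k - 1)) {
  for (j in (i + 1):k) {
    test <- cor.test(mtcars[[vars[i]]], mtcars[[vars[j]]])
    r_mat[i, j] <- r_mat[j, i] <- unname(test$estimate)
    p_mat[i, j] <- p_mat[j, i] <- test$p.value
  }
}

# Save the correlations and the p-values as CSV files
out_dir <- tempdir()
r_file <- file.path(out_dir, "correlations.csv")
p_file <- file.path(out_dir, "correlation_p_values.csv")
write.csv(round(r_mat, 3), r_file)
write.csv(signif(p_mat, 3), p_file)
```

The diagonal of the p-value matrix is left missing on purpose, since testing a variable against itself is not meaningful.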

I wrote this prompt after much trial and error. I had also asked it, “given the types of variables, what type of regression should I use, and what fit statistics would be best?” This was a logistic regression analysis. Eventually, I wrote this prompt.

“Please run a multivariate logistic regression with PUT DV HERE as the dependent variable, and LIST IV HERE as the independent variables. Please include the p-values for all the coefficients. For model fit, please include AIC, BIC, Cox & Snell R-squared, Nagelkerke R-squared, the number of cases. Please save the print as a csv file with a timestamp. Please include the names of the variables, and have all the model summary information, including coefficients, odds ratios, AIC, BIC, R-squared values, and number of cases, saved in the CSV file.”
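Here is a minimal base R sketch of the analysis this prompt describes. The built-in mtcars data, with am as the dependent variable and wt and hp as independent variables, is only a stand-in. The two pseudo R-squared values are computed from the log-likelihoods of the fitted model and an intercept-only model using the standard Cox & Snell and Nagelkerke formulas. The output file name is hypothetical.

```r
# Stand-in model: replace the data, DV, and IVs with your own
fit      <- glm(am ~ wt + hp, data = mtcars, family = binomial)
null_fit <- update(fit, . ~ 1)             # intercept-only model

n   <- nobs(fit)
ll1 <- as.numeric(logLik(fit))
ll0 <- as.numeric(logLik(null_fit))

# Pseudo R-squared measures based on the two log-likelihoods
cox_snell  <- 1 - exp((2 / n) * (ll0 - ll1))
nagelkerke <- cox_snell / (1 - exp((2 / n) * ll0))

# Coefficients, odds ratios, and p-values
coefs <- coef(summary(fit))
coef_rows <- data.frame(
  section    = "coefficient",
  name       = rownames(coefs),
  value      = coefs[, "Estimate"],
  odds_ratio = exp(coefs[, "Estimate"]),
  p_value    = coefs[, "Pr(>|z|)"],
  row.names  = NULL
)

# Model fit statistics, in the same column layout
fit_rows <- data.frame(
  section    = "model_fit",
  name       = c("AIC", "BIC", "Cox_Snell_R2", "Nagelkerke_R2", "N"),
  value      = c(AIC(fit), BIC(fit), cox_snell, nagelkerke, n),
  odds_ratio = NA_real_,
  p_value    = NA_real_
)

out <- rbind(coef_rows, fit_rows)

# Save everything in one CSV file with a timestamp in the file name
stamp    <- format(Sys.time(), "%Y%m%d_%H%M%S")
out_file <- file.path(tempdir(), paste0("logit_results_", stamp, ".csv"))
write.csv(out, out_file, row.names = FALSE)
```

Putting coefficients and fit statistics in one table with a section column keeps the whole model summary in a single file, which is what the prompt requests.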

Summary

  • The impact of AI tools like ChatGPT on social science research is transformative.
  • It can automate mundane tasks to allow researchers to focus on more creative aspects. 
  • AI can streamline data analysis, coding, and even literature review, proposing a future where AI tools become integral to research workflows. 
  • ChatGPT can code in R. 
  • Specificity and priming are essential to interactions with AI.
  • There are limitations, such as the AI’s potential for generating inaccuracies and the crucial necessity for human oversight in quantitative research.

Joshua K. Dubrow holds a PhD from The Ohio State University and is a Professor of Sociology at the Polish Academy of Sciences.