Gokul Prasad / Annie Fan
Analyzing U.S. Congressional Tweets with OpenAI GPT-3

Background

How did this project come to be?

UCSD China Data Lab

In the "Congress Tweets" project, the UCSD China Data Lab hand-scraped and scored thousands of Tweets to see how China is represented in Tweets from the U.S. Congressmembers. You can read their results here.

Quarter 1 Results

We utilized standard machine learning models from the scikit-learn Python library to see how traditional methods perform on this task. Naive Bayes and Random Forest proved to be the highest-performing models, but class imbalance in the labeled data made those results somewhat unreliable. A minimal sketch of this kind of pipeline appears below the code links.
Annie's Q1 Code
Gokul's Q1 Code
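For a concrete picture of the Quarter 1 setup, here is a minimal sketch of a scikit-learn pipeline of the kind described above. The file name, column names, and vectorizer settings are illustrative assumptions, not the exact configuration from our notebooks.

```python
# Hypothetical sketch of a Quarter 1-style baseline: TF-IDF features fed
# into Naive Bayes and Random Forest classifiers. File and column names
# are placeholders, not the project's actual data layout.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

df = pd.read_csv("congress_tweets_labeled.csv")  # placeholder file name
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["relevance"], test_size=0.2, random_state=42)

vectorizer = TfidfVectorizer(max_features=5000, stop_words="english")
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

for name, model in [("Naive Bayes", MultinomialNB()),
                    ("Random Forest", RandomForestClassifier(n_estimators=200))]:
    model.fit(X_train_vec, y_train)
    print(name, accuracy_score(y_test, model.predict(X_test_vec)))
```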

Large Language Models (LLMs)

LLMs have exploded in popularity and sophistication over the last few years, with affordable, high-quality models becoming publicly accessible. We chose OpenAI's GPT-3 for its output quality and its ease of use from Python.
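As a rough illustration of that ease of use, below is a minimal sketch of a GPT-3 call with the legacy openai Python package (the pre-1.0 Completion interface used by Davinci-era models). The prompt wording and model name are assumptions for illustration, not our exact setup.

```python
# Minimal sketch of a GPT-3 completion call using the legacy openai
# package (pre-1.0 interface). The prompt text and model name are
# illustrative assumptions, not the project's exact configuration.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder; load from an environment variable in practice

prompt = (
    "Decide whether the following Tweet from a U.S. Congressmember is "
    "about China. Answer with exactly one word: relevant or irrelevant.\n\n"
    "Tweet: {tweet}\n"
    "Answer:"
).format(tweet="Example Tweet text goes here.")

response = openai.Completion.create(
    model="text-davinci-003",  # a GPT-3 Davinci-family model
    prompt=prompt,
    max_tokens=3,
    temperature=0,             # deterministic output for classification
)
print(response["choices"][0]["text"].strip())
```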

EDA

Visualizations to help understand the data.

[Figures: Relevance Distribution · Sentiment Distribution · Sentiment Over Time · Tweets by State]

Methods

Timeline of development for the two GPT-3-powered models.
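To make the two models concrete, here is a hypothetical sketch of how the relevance and sentiment prompt templates could be structured. The wording is illustrative only; our actual prompts went through many rounds of prompt engineering.

```python
# Hypothetical prompt templates for the two GPT-3-powered classifiers
# (relevance and sentiment). The wording is illustrative, not the final
# prompts used in the project.
RELEVANCE_TEMPLATE = (
    "You are labeling Tweets from U.S. Congressmembers.\n"
    "Is the following Tweet about China? Answer 'relevant' or 'irrelevant'.\n"
    "Tweet: {tweet}\n"
    "Answer:"
)

SENTIMENT_TEMPLATE = (
    "You are labeling Tweets from U.S. Congressmembers that mention China.\n"
    "Classify the Tweet's sentiment toward China as 'positive', 'neutral', "
    "or 'negative'.\n"
    "Tweet: {tweet}\n"
    "Answer:"
)

def build_prompt(template: str, tweet: str) -> str:
    """Insert the Tweet text into a template before sending it to GPT-3."""
    return template.format(tweet=tweet)
```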

Results

What did we find in this project?

Relevance

GPT-3 produced accuracies that fell well short of last quarter's results. Even after extensive prompt engineering, the model's peak accuracy was roughly 75%, while the Naive Bayes classifier regularly exceeded 90%. Below is an example confusion matrix generated during experimentation, which reveals a common weak point of the model: Tweets that the human coder judged irrelevant were often labeled relevant by the model.
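For reference, a confusion matrix like the one described here can be produced directly from the hand-coded labels and GPT-3's predictions with scikit-learn; the label values below are placeholders rather than project data.

```python
# Sketch of producing a relevance confusion matrix with scikit-learn.
# y_true holds the hand-coded labels and y_pred the GPT-3 predictions;
# the values here are placeholders, not project results.
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

labels = ["irrelevant", "relevant"]
y_true = ["relevant", "irrelevant", "irrelevant", "relevant"]
y_pred = ["relevant", "relevant", "irrelevant", "relevant"]

cm = confusion_matrix(y_true, y_pred, labels=labels)
ConfusionMatrixDisplay(cm, display_labels=labels).plot(cmap="Blues")
plt.title("GPT-3 relevance predictions vs. hand-coded labels")
plt.show()
```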


Sentiment

In each trial, the GPT-3 Davinci model was applied to a sample of 100-200 Tweets and achieved between 60% and 70% accuracy in classifying Tweet sentiment. As the confusion matrix below shows, GPT-3 performed best on Tweets with positive sentiment but struggled to distinguish neutral Tweets from positive and negative ones. Compared to the Random Forest classifier, which achieved roughly 55% accuracy, GPT-3 was more accurate across all sentiment classes.
[Figures: sentiment classification confusion matrices]
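One practical detail worth noting: GPT-3 returns free text rather than a fixed label, so completions have to be mapped onto the three sentiment classes before scoring. The snippet below is a hypothetical version of that post-processing step, not the project's exact implementation.

```python
# Hypothetical post-processing step: map a free-text GPT-3 completion onto
# one of the three sentiment labels before computing accuracy.
def normalize_sentiment(completion: str) -> str:
    text = completion.strip().lower()
    for label in ("positive", "negative", "neutral"):
        if label in text:
            return label
    return "neutral"  # fallback when the completion does not name a label

print(normalize_sentiment(" Positive."))                    # positive
print(normalize_sentiment("The Tweet reads as negative."))  # negative
```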

Discussion

What did we learn in this project?

Our Project

Applying LLMs to this particular topic produced results that fell short of our expectations and of our Quarter 1 numbers. Despite continuous prompt engineering, we were not able to match our original relevance-classification statistics or achieve high accuracy in classifying sentiment. Even so, working with an LLM like GPT-3 made it clear that these tools have a real place in this field. Through our conversations with the UCSD China Data Lab over the last 10 weeks, our work has helped them decide whether to move forward with LLMs in future stages of their Twitter analysis projects.

Limitations & Future

Looking at the overall performance of the GPT-3 models, and even at the supervised ML pathways from earlier, it is clear that this is a difficult task no matter which tools we use. Human interpretation of text is shaped by preexisting biases, context, and other factors that are hard to model, and the idiosyncratic, Internet-inflected language of Twitter only adds to the difficulty. Because LLMs can interpret text in its own context, beyond the purview of task-specific training data, it is worth considering a shift away from simple classification and sentiment analysis toward using the full power of these models for deeper textual analysis.

The Models

Large Language Models are undoubtedly the future of language modeling and analysis. Their power, ease of use, and high-quality outputs put them far ahead of traditional NLP methodologies. That does not mean tossing out everything that came before and trusting GPT-3 to always arrive at the right answer, however. As our project shows, GPT-3 cannot always bridge the gap created by natural differences in human interpretation and context. Subject matter expertise and domain knowledge are, and will remain, key to getting the best out of these models.

Project Material

Our Deliverables

Report

Our published paper with more information.

Poster

Our showcase poster for presentation.

Our Team

Meet our Team!


Gokul Prasad

UCSD '23, Data Science


Annie Fan

UCSD '23, Data Science & Cognitive Science


Dr. Margaret Roberts

Project Mentor


Dr. Young Yang

Advisor from China Data Lab