In the "Congress Tweets" project, the UCSD China Data Lab hand-scraped and scored thousands of Tweets to see how China is represented in Tweets from the U.S. Congressmembers. You can read their results here.
We used standard machine learning models from the scikit-learn Python library to see how traditional methods perform on this task. We found that Naive Bayes and Random Forest were the highest-performing models, but class imbalance in the data made our results less reliable than we would have liked.
Annie's Q1 Code
Gokul's Q1 Code
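As a rough illustration of the Quarter 1 approach, here is a minimal scikit-learn sketch; the DataFrame name, column names, and TF-IDF features are assumptions for illustration rather than the team's exact code.

```python
# Minimal sketch of a Quarter 1-style baseline, assuming a DataFrame `tweets`
# with a 'text' column and a binary 'relevant' label; feature choices are illustrative.
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report

X_train, X_test, y_train, y_test = train_test_split(
    tweets["text"], tweets["relevant"], test_size=0.2, stratify=tweets["relevant"]
)

# Naive Bayes over TF-IDF features
nb = make_pipeline(TfidfVectorizer(), MultinomialNB())
nb.fit(X_train, y_train)

# Random Forest; class_weight="balanced" partially offsets the class imbalance
rf = make_pipeline(
    TfidfVectorizer(),
    RandomForestClassifier(n_estimators=200, class_weight="balanced"),
)
rf.fit(X_train, y_train)

print(classification_report(y_test, nb.predict(X_test)))
print(classification_report(y_test, rf.predict(X_test)))
```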
LLMs have exploded in popularity and sophistication over the last few years, with affordable, high-quality models now publicly accessible. We chose OpenAI's GPT-3 because of its output quality and its ease of use from Python.
GPT-3 has four powerful models under its hood, but we selected davinci because of its greater capacity for following specific instructions. Furthermore, its training data extends to October 2021, nearly two years later than that of the other model choices.
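In practice, the model choice comes down to a single parameter in the API call. The sketch below uses the pre-1.0 `openai` Python library; the exact model string, prompt, and parameters are assumptions, not the project's exact configuration.

```python
# Minimal sketch of a davinci completion call with the pre-1.0 `openai` library.
# The model string and prompt wording are illustrative assumptions.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

response = openai.Completion.create(
    model="text-davinci-003",  # davinci-family completion model
    prompt="Does the following Tweet discuss China? Answer Yes or No.\n\nTweet: ...",
    max_tokens=5,
    temperature=0,  # deterministic output for classification
)
print(response["choices"][0]["text"].strip())
```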
Initially, both prompts used very simple formats. No in-context learning was used, and terms like 'relevance', 'positive', 'neutral', and 'negative' were given only brief definitions. Because each Tweet required its own call to the OpenAI GPT-3 API, the prompts were run on samples of 100 Tweets at a time rather than on the whole dataset.
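A sketch of what that initial setup might look like: a brief definition-style prompt applied to a random sample of 100 Tweets, one API call per Tweet, reusing the call pattern shown above. The prompt wording, DataFrame, and column names are illustrative assumptions.

```python
# Sketch of the initial zero-shot pass: one API call per Tweet over a 100-Tweet sample.
# Assumes `openai.api_key` is already set and `tweets` is a DataFrame with a 'text' column.
import time
import openai

ZERO_SHOT_PROMPT = (
    "A Tweet is 'relevant' if it discusses China or U.S.-China relations.\n"
    "Classify the Tweet below as Relevant or Irrelevant.\n\n"
    "Tweet: {tweet}\nLabel:"
)

sample = tweets.sample(100, random_state=0)
labels = []
for text in sample["text"]:
    resp = openai.Completion.create(
        model="text-davinci-003",
        prompt=ZERO_SHOT_PROMPT.format(tweet=text),
        max_tokens=3,
        temperature=0,
    )
    labels.append(resp["choices"][0]["text"].strip())
    time.sleep(1)  # simple pacing to stay under the API rate limit
sample["gpt3_relevance"] = labels
```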
To improve the model's results, further prompt tuning was required. Both prompts moved to 'few-shot' learning, where the prompt itself provides example inputs, outputs, and reasonings. These examples included randomly selected Tweets as well as 'pain-point' examples that highlighted where the model consistently struggled.
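For a sense of the structure, here is a sketch of a few-shot relevance prompt; the example Tweets and reasonings are invented placeholders, not items from the dataset or the team's actual prompt.

```python
# Sketch of the few-shot revision: labeled example Tweets with short reasonings
# are prepended to the prompt before the Tweet to be classified.
FEW_SHOT_EXAMPLES = """\
Tweet: "Met with constituents today about infrastructure funding."
Reasoning: No mention of China or U.S.-China policy.
Label: Irrelevant

Tweet: "We must hold Beijing accountable for unfair trade practices."
Reasoning: Directly addresses U.S. policy toward China.
Label: Relevant
"""

FEW_SHOT_PROMPT = (
    "A Tweet is 'relevant' if it discusses China or U.S.-China relations.\n"
    "Use the examples to classify the final Tweet as Relevant or Irrelevant.\n\n"
    + FEW_SHOT_EXAMPLES
    + "\nTweet: {tweet}\nLabel:"
)
```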
To assess whether GPT-3 improved on the team's Quarter 1 classification performance, we compared GPT-3's relevance classifications against the Quarter 1 Naive Bayes classifier, and its sentiment classifications against a Random Forest classifier.
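Continuing the sketches above (the Quarter 1 Naive Bayes pipeline `nb` and the GPT-3-labeled `sample`), the comparison itself can be as simple as scoring both sets of predictions against the hand-coded labels; the column names remain illustrative assumptions.

```python
# Sketch of the Quarter 1 comparison on the same sample; continues from the
# sketches above and assumes a binary hand-coded 'relevant' column (1 = relevant).
from sklearn.metrics import accuracy_score

human = sample["relevant"]
gpt3_preds = sample["gpt3_relevance"].str.strip().str.lower().eq("relevant").astype(int)
nb_preds = nb.predict(sample["text"])  # Quarter 1 Naive Bayes pipeline

print("GPT-3 accuracy:      ", accuracy_score(human, gpt3_preds))
print("Naive Bayes accuracy:", accuracy_score(human, nb_preds))
```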
After several rounds of prompt engineering, sampling, and evaluation, the final step of the project was to communicate our findings to our mentors at the China Data Lab and decide what role GPT-3 can play in the future of the 'Congress Tweets' project.
GPT-3 produced accuracies that fell well short of last quarter's results. Even after extensive prompt engineering, the model's peak accuracy was roughly 75%, while the Naive Bayes classifier regularly reached over 90%. Below is an example confusion matrix generated during experimentation, revealing a common weak point of the model: Tweets that the human coder considered irrelevant were often classified as relevant by the model.
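A confusion matrix like the one above can be generated with scikit-learn; this sketch continues from the evaluation sketch earlier and reuses its `human` and `gpt3_preds` variables.

```python
# Sketch of a relevance confusion matrix for the GPT-3 predictions; the
# irrelevant-labeled-relevant failure mode shows up in the off-diagonal cell.
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_predictions(
    human, gpt3_preds, display_labels=["Irrelevant", "Relevant"]
)
plt.title("GPT-3 relevance classification")
plt.show()
```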
In each trial, the GPT-3 davinci model was applied to a sample of 100-200 Tweets. GPT-3 achieved between 60% and 70% accuracy per trial in classifying Tweet sentiment. As shown in the confusion matrix below, GPT-3 performed best on Tweets with positive sentiment but struggled to distinguish neutral Tweets from positive and negative ones. Compared to the Random Forest classifier, which achieved approximately 55% accuracy, GPT-3 was more accurate across all sentiment classes.
Applying LLMs to this particular topic produced results that fell short of our expectations and of our Quarter 1 numbers. Despite continuous refinement of the prompt, we could not match our original relevance-classification statistics or achieve high accuracy in sentiment classification; however, the process of working with an LLM like GPT-3 made clear that these tools have a place in this field. Through our conversations with the UCSD China Data Lab over the last 10 weeks, our work has helped them determine whether to adopt LLMs in future stages of their Twitter analysis projects.
Looking at the overall performance of the GPT-3 language models, and even at the supervised ML approaches from earlier, it is easy to see that this is a difficult task no matter which tools are used. Human interpretation of text is shaped by preexisting biases, context, and other factors that cannot easily be modeled, and the idiosyncratic, Internet-shaped language of Twitter only compounds this. Because LLMs can comprehend text in its own context, beyond the purview of training data, it is worth considering a shift away from simple classification and sentiment analysis toward using the full power of these models for deeper textual analysis.
Large Language Models are undoubtedly the future of language modeling and analysis. Their power, ease of use, and high-quality outputs put them far ahead of traditional NLP methods. However, that does not mean tossing out what has been done before and trusting GPT-3 to always arrive at the right answer. As our project shows, GPT-3 is not always able to bridge the gaps that arise from natural differences in human interpretation and context. Subject-matter expertise and domain knowledge are, and will remain, key to getting the best out of these models.