Behind the Scenes: Large Language Models and Survey Feedback Analysis

January 20, 2025 · LLMs, AI

When you write a paper, so much gets left on the cutting room floor. There’s a wealth of insights, backstory, and reasoning behind the decisions that shape a research project—things that don’t always make it into the final version but might still be valuable to others working in the space. This post aims to share some of that context for our recent paper on using large language models (LLMs) for survey analysis.

If you’d like to check out the paper before looking at the rest of this post, I recommend reading at least the background and discussion sections. They’re written to be accessible and should give you a good sense of the work. Another useful reference is the post I wrote for the corporate learning team at Harvard Medical School, which explores applications of this research in a corporate setting.

Now, let’s get into some of the key learnings from this project.


Key Takeaways

  • Dense embeddings showed promise for analyzing student feedback, reaching 70% accuracy in multi-label classification, but struggled with comments covering multiple topics. This limitation led to exploring alternative approaches.

  • The release of the GPT-4 API in March 2023 marked a turning point, enabling more sophisticated analysis. This allowed us to move from simply categorizing feedback to generating detailed thematic analyses and explanations, along with more complicated multi-step workflows.

  • Evaluating LLM performance on complex tasks required developing new frameworks beyond traditional metrics. We created rubric-based assessments, judged by LLMs, and methods to measure synthesis quality across multiple comments.

  • The project evolved from using standalone tools to developing structured workflows where LLMs could make decisions about which analysis techniques to apply. This points toward more sophisticated AI-assisted research methods.


Alternative Approaches: Deep Dive into Embeddings

A research paper often presents a linear path to its results, but in reality, the journey is anything but straightforward. Early on, I explored a very different approach to this project before shifting focus to LLMs. Back in 2022, I was particularly interested in whether we could use dense embeddings to encode student comments and extract meaningful insights.

The dataset consisted of thousands of comments from online courses, covering everything from sentiment and suggestions to broader themes. My initial thought was that we could embed these comments and cluster them based on similarity to uncover underlying topics, a form of topic modeling. Additionally, for classification tasks—such as identifying whether a comment was about course logistics, animations, or video length—I considered manually labeling a subset of comments, embedding them, and then using similarity search to classify new comments based on their nearest neighbors.
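The nearest-neighbor classification idea can be sketched roughly as follows. This is a toy illustration, not the paper's implementation: the hand-written three-dimensional vectors stand in for real dense embeddings (which in practice would come from a model such as one of the SentenceTransformers models), and the labels and thresholds are hypothetical.

```python
from collections import Counter

# Toy stand-ins for dense embeddings of labeled comments; in practice these
# would be produced by an embedding model such as a SentenceTransformers model.
labeled = [
    ([0.9, 0.1, 0.0], {"logistics"}),
    ([0.8, 0.2, 0.1], {"logistics"}),
    ([0.1, 0.9, 0.1], {"video_length"}),
    ([0.0, 0.8, 0.2], {"video_length"}),
    ([0.1, 0.1, 0.9], {"animations"}),
]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

def knn_labels(query_emb, k=3, threshold=0.5):
    """Multi-label kNN: assign every label carried by at least `threshold`
    fraction of the k most similar labeled comments. A comment may receive
    several labels, or none."""
    neighbors = sorted(labeled, key=lambda item: cosine(query_emb, item[0]),
                       reverse=True)[:k]
    counts = Counter(label for _, labels in neighbors for label in labels)
    return {label for label, c in counts.items() if c / k >= threshold}
```

A query embedding close to the "logistics" cluster picks up that label from its nearest neighbors; the majority-vote threshold is one of several design knobs (k, distance metric, per-label thresholds) that all needed tuning.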

This was a multi-label classification problem, meaning a single comment could touch on multiple topics. For example, consider one comment shown in the paper: “I found the quizzes incredibly difficult, but the teacher was great and I felt I got what I paid for. If I had had more time to complete the course, this would have been even better.” The challenge was that embeddings, particularly single dense embeddings representing each full comment, struggled to represent multi-topic comments effectively. A comment with multiple topics may have an embedding that doesn’t localize well to the semantic “neighborhood” of any of the individual topics associated with that comment, decreasing the performance of downstream classifiers.

I spent a lot of time studying embeddings, and the work of Nils Reimers, the author of the wonderful SentenceTransformers library, was very helpful. His research, papers, and presentations gave me a much deeper understanding of how embeddings work and where their limitations lie.

Ultimately, the embedding approach wasn’t ideal for the problem we were tackling, especially given the complexity of multi-topic comments. Still, the effort paid off: despite these limitations, we managed to surpass 70% classification accuracy, and we developed a rigorous evaluation framework with multi-rater labeling and a solid set of metrics. That evaluation setup turned out to be extremely useful when we pivoted to using LLMs.
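The paper's exact metrics aren't reproduced here, but one common example-based choice for multi-label evaluation, shown as an illustrative sketch, is mean Jaccard overlap between true and predicted label sets:

```python
def jaccard_accuracy(true_sets, pred_sets):
    """Example-based multi-label accuracy: the mean Jaccard overlap
    |T ∩ P| / |T ∪ P| between the true and predicted label sets,
    averaged over all comments. Two empty sets count as a perfect match."""
    scores = []
    for t, p in zip(true_sets, pred_sets):
        if not t and not p:
            scores.append(1.0)
        else:
            scores.append(len(t & p) / len(t | p))
    return sum(scores) / len(scores)
```

Unlike exact-match accuracy, this gives partial credit when a classifier finds some but not all of a comment's topics, which matters for multi-topic comments like the example above.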


Large Language Models Change the Game

While I was working with embeddings, I was also keeping an eye on LLMs. I had experimented with GPT-2 and GPT-3, but they weren’t quite capable enough for the nuanced classification and analysis we needed. That changed with the release of the GPT-4 API in March 2023. It was immediately clear that not only was this model good at classification (reaching performance comparable to our human raters), but it could do so much more than just classification—it could help analyze and synthesize student feedback in a much more dynamic way.
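An LLM-based classifier of this kind reduces to prompt construction plus careful parsing of the reply. The sketch below shows those two offline pieces; the actual model call is omitted, and the topic list is illustrative rather than taken from the paper.

```python
import json

# Illustrative topic labels; the real label set came from the study's codebook.
TOPICS = ["course logistics", "animations", "video length"]

def build_classification_prompt(comment):
    """Construct a multi-label classification prompt that asks the model to
    answer with a JSON array drawn only from the allowed topics."""
    return (
        "Classify the student comment below. Respond with a JSON array "
        f"containing any of these topics that apply: {TOPICS}.\n\n"
        f"Comment: {comment}"
    )

def parse_labels(model_reply):
    """Parse the model's reply defensively, keeping only recognized topics
    and returning an empty list if the reply isn't valid JSON."""
    try:
        labels = json.loads(model_reply)
    except json.JSONDecodeError:
        return []
    return [label for label in labels if label in TOPICS]
```

Constraining the model to a fixed label vocabulary and validating its output were both necessary in practice: even strong models occasionally return labels outside the requested set or wrap the JSON in extra text.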

At that point, I made a conscious decision to pivot. Rather than focusing solely on more standard NLP tasks like classification, I wanted to explore broader possibilities for leveraging LLMs in analyzing unstructured survey data. This meant developing workflows for:

  • Thematic analysis across large datasets
  • Extracting structured insights from free-text comments
  • Leveraging and exposing the model’s chain-of-thought reasoning
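Extracting structured insights, the second item above, amounts to defining a per-comment schema the LLM fills in and then aggregating across the dataset. The schema below is hypothetical, chosen only to illustrate the shape of the workflow:

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class CommentInsight:
    """Hypothetical schema for what the LLM extracts from one comment."""
    themes: List[str]
    sentiment: str            # e.g. "positive", "negative", "mixed"
    suggestion: Optional[str] # a concrete improvement, if the comment offers one

def summarize(insights):
    """Roll per-comment extractions up into dataset-level theme counts
    and a consolidated list of actionable suggestions."""
    theme_counts = Counter(t for i in insights for t in i.themes)
    suggestions = [i.suggestion for i in insights if i.suggestion]
    return {"themes": theme_counts, "suggestions": suggestions}
```

The aggregation step is plain code; only the per-comment extraction needs the LLM, which keeps the expensive calls parallelizable and the summary deterministic.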

Unlike traditional NLP classification models, LLMs allowed us to handle more complex tasks like cross-comment synthesis and explanation generation—things that are incredibly valuable in educational research but difficult to achieve with standard models.

(Note: I’ve since implemented these same approaches with Claude and Gemini as well, with equally impressive results.)


Evaluations: Measuring the Impact

One of the key challenges in this project was designing strong evaluation frameworks. Standard classification tasks were relatively straightforward to assess, given the well-established literature on multi-label classification metrics. However, evaluating more complex tasks—like thematic analysis and structured concept extraction—was far less clear-cut.

At the time, there wasn’t much precedent for using LLMs to evaluate other LLM-generated outputs based on rubric-driven assessment. I drew from material like OpenAI’s early work on LLM-based evaluations, iterating on approaches to make sure our methodology was both rigorous and reproducible. This process involved:

  • Comparing LLM-based classifications to traditional NLP models
  • Designing rubric-based assessments for more complex tasks, with an LLM as the evaluator
  • Experimenting with ways to quantify synthesis quality in comment analysis
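A rubric-based LLM-as-judge setup, like the one in the second bullet, boils down to a scoring prompt and a parser for the judge's reply. The rubric dimensions below are hypothetical examples, not the study's actual rubric:

```python
import re

# Hypothetical rubric dimensions, each scored 1-5 by an evaluator LLM.
RUBRIC = {
    "coverage": "Does the analysis address all major themes in the comments?",
    "grounding": "Is every claim supported by the underlying comments?",
    "clarity": "Is the write-up clear and well organized?",
}

def build_judge_prompt(analysis, comments):
    """Ask the evaluator model to score the analysis on each rubric
    dimension, one 'dimension: score' pair per line."""
    criteria = "\n".join(f"- {dim}: {desc}" for dim, desc in RUBRIC.items())
    return (
        "Score the analysis below on each dimension from 1 (poor) to 5 "
        "(excellent). Answer with one 'dimension: score' pair per line.\n\n"
        f"{criteria}\n\nComments:\n{comments}\n\nAnalysis:\n{analysis}"
    )

def parse_scores(reply):
    """Pull 'dimension: N' pairs out of the judge's reply, ignoring
    anything it says beyond the requested format."""
    scores = {}
    for dim in RUBRIC:
        match = re.search(rf"{dim}\s*:\s*([1-5])", reply, re.IGNORECASE)
        if match:
            scores[dim] = int(match.group(1))
    return scores
```

Averaging such scores over multiple judge runs, and spot-checking them against human raters, was part of making this methodology defensible rather than circular.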

These evaluation challenges were some of the most interesting aspects of the project, requiring new thinking around how to measure LLM performance beyond conventional NLP benchmarks.


Open-Sourcing and Agents: Looking Ahead

As I was preparing to open-source parts of this project on GitHub, I started thinking more about tool use and agent-based workflows.

Throughout the project, I developed a range of modular tools for the LLM—some for structuring output, others for specific NLP tasks. For example, one tool helped extract specific themes from comments, while another generated structured summaries of feedback patterns. These tools could be chained together: first analyzing individual comments for themes, then synthesizing patterns across comments, and finally generating actionable insights for course improvement.
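The chaining described above can be sketched as a three-stage pipeline. In the real system each stage was an LLM-backed tool; here keyword matching and counting are stand-ins so the flow is visible end to end, and the keywords and thresholds are invented for illustration:

```python
from collections import Counter

def extract_themes(comment):
    """Stage 1 stand-in: tag one comment with themes. In the real workflow
    an LLM did this; keyword matching is used here only for illustration."""
    keywords = {"quiz": "assessment", "video": "video content", "pace": "pacing"}
    return {theme for kw, theme in keywords.items() if kw in comment.lower()}

def synthesize_patterns(theme_sets):
    """Stage 2: count how often each theme recurs across comments."""
    return Counter(t for themes in theme_sets for t in themes)

def recommend(patterns, min_count=2):
    """Stage 3: flag themes mentioned often enough to warrant action,
    most frequent first."""
    return [theme for theme, n in patterns.most_common() if n >= min_count]

def run_pipeline(comments):
    """Chain the three tools: per-comment themes -> cross-comment
    patterns -> actionable recommendations."""
    return recommend(synthesize_patterns([extract_themes(c) for c in comments]))
```

Keeping each tool's input and output simple (sets and counters here, structured JSON in the real system) is what made the stages composable in the first place.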

Over time, it became clear that these tools could be combined into multi-step workflows, where an LLM could dynamically determine which tools to use based on the task at hand. For instance, when analyzing a new batch of comments, the LLM might first decide whether to run sentiment analysis, thematic analysis, or both, depending on the needs of the educator or researcher.

This naturally led to the idea of agents—not in the fully autonomous sense, but as structured loops with decision points that allowed the model to iteratively refine its analysis. Even in its early and imperfect form, this approach felt incredibly powerful. Being able to engage with an LLM conversationally, while having it execute multi-step analyses, felt like a glimpse into the future of AI-driven workflows (including research workflows).
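Such a structured loop with decision points can be sketched in a few lines. The router below is a hard-coded stand-in for the LLM's decision; the tool names and stub outputs are hypothetical:

```python
def agent_loop(batch, tools, router, max_steps=4):
    """Structured loop: at each step the router (an LLM in the real
    system) picks the next tool given the results so far, or stops.
    A step cap keeps the loop bounded rather than fully autonomous."""
    results = {}
    for _ in range(max_steps):
        choice = router(batch, results)
        if choice is None:
            break
        results[choice] = tools[choice](batch)
    return results

def simple_router(batch, results):
    """Stand-in decision policy: run thematic analysis first, then
    sentiment, then stop. The real router was an LLM call."""
    if "themes" not in results:
        return "themes"
    if "sentiment" not in results:
        return "sentiment"
    return None

# Stub tools standing in for the LLM-backed analyses.
TOOLS = {
    "themes": lambda batch: ["placeholder themes"],
    "sentiment": lambda batch: "mixed",
}
```

The decision points live entirely in the router, so swapping the heuristic for a model call, or adding a "refine" step that revisits earlier results, changes nothing else in the loop.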


Final Thoughts

This project involved a deep exploration of embeddings before shifting to LLMs, which ultimately proved to be more effective for analyzing qualitative feedback. The transition allowed us to move far beyond simple classification toward much more nuanced, multi-faceted interpretations of survey data.

Looking forward, there is still much to refine in structured, agent-driven LLM workflows. The potential for AI in learning-related research continues to grow, and I plan to further explore these possibilities. As I mention in one of the posts referenced near the top, while our study focused on education, the applications of this technology extend far beyond. Any field dealing with large volumes of text data – from market research to customer feedback analysis – could benefit from these techniques.