Recreate : SuveyMonkey Data Transformation

Feb 20, 2025

6 min read

Hi everyone, as we continue to improve Bayeslab(the AI first data workbench), it's interesting to see how real data analysis cases can be recreated and how much can be enhanced.

So we started to look for real cases with step by step description and data to recreate. Here's the first one in this series, recreated from this video:

Day in the Life of a Data Analyst - SurveyMonkey Data Transformation
Thanks Shashank Kalanithi for creating this video.
Level: Entry Level
Data: Data - Survey Monkey Output.xlsx

Please look at the original video for details on the background and step by step thinking flow of a data analyst (very good learning).

The method in this video actually represent a very common procedure for survey data processing:
First make the data tabular by giving every column a correct title
Then unpivot the survey result(wide format) so every answer is now a separate row(long format)
Do aggregation/analysis on this unpivoted table.

Project Background

A quick recap (Video 0:00 to 4:28):

You're paid to take a SurveyMonkey data set and transform into a cleaned format to use in Tableau later.

Raw data: an excel file

Expected Result: a ready for analysis table(several dimensions and 2 measures- total respondent per question and same answer respondents per answer)

Step 1: Make it Tabular

First let's upload the file to Bayeslab( or you can use local agent to keep the file at local)

You can browse the data like in Excel, lots of columns(wide format). Now let's make it tabular by first make sure there's only one column header row (Video: 6:17 to 15:55).

Tabular is the prerequisite of most of data analysis functions later. It's also the goal of data cleaning, we need:- only one header row with complete and meaningful names- each row represent only one sample, no spillover to next row- consistent type in one column- unique row/ids, no duplicate

First we get a transpose of original header and first row(also contains part of the header)

PROMPT:

[datafile] list column header and first row, transpose

You can see it's there now. And there's a result variable automatically generated for you (Transposed DataFrame of First Row).

Then let's make the magic of filling in unnamed/blank column names with correct question/answer. This is not a easy logic to express, here's what works for me:

PROMPT:

[data] replace all unamed:x in column header(except unnamed 8) with format x - y

x come from the nearest rows above that contains "Question"
y comes from "first row"

You can see the description is still vague but AI is able to manage.

In case you wonder, I also tried to give the prompt in a more "AI" way (both 4o and o1 tried)）:

Like i said this is a tricky logic to write. If given to chatgpt with exact data and fill it might succeed, but it's not repeatable next time or on another survey result with same format.

Now let's write the headers back:

PROMPT:

[data file], replace headers with column header in [Result Variable], then remove 1st row

Result preview on the right shows the column names are now updated.

You can now save it to an csv or excel if you like. I'll just continue here as Bayeslab automatically save my variable and versions.

Step 2: Unpivot

First a little preparation (Video: 27:44 to 31:01) to drop a few columns:

Then unpivot the table to a long format(Video: 31:05 to 43:07) , in the video we use pandas's melt function which I personally think is a strange name.

So we'll just use "unpivot" to describe:

PROMPT:

[Data] unpivot this table, id columns are first 8 , value columns are the rest.

Name variable column "Question + Subquestions", value column with "Answer"

If you'd like to see the generated code, it actually does exactly what's in the video with an addition(remove empty values)

The removal of empty values actually is correct in this case since null answers would not make sense to count, but if you'd like to keep it then feel free to modify the code and run again.

Step 3: Aggregate/Join

In the original video, author continues to explain joins and others in order to prepare data for visualization or aggregation. It's perfectly ok to replicate the exact actions but we actually can get more from Bayeslab by directly asking using business logic language.

This is possible since the original data is now in a pretty good shape for analyze: tabular, good names, empty removed.

First PROMPT:

[Data] calculate for each question ,how many unique respondents for that question

Parse Question out from Question + Subquestions like: for Question 25 - Response 4, the question would be "Question 25"

output like "Question", "Respondents"

The data is same with what's in the video, cool. (Avoid so much troubleshooting time with na and syntax).

Note: I specifically make it clear of what format to output, this is a great way to convey the requirement clearly to AI.

Second Prompt:

[Data] calculate how many unique respondents answered the same answer for a given question/subquestion

output like "Question + Subquestion", Answer, Same Answer

Finally, let's join it back to get exactly the result in video:

There're some more in the video including changes to column name and save it to excel. I'm sure you know how to prompt it now

Now we have a repeatable page that can run again if the data is updated. The final page looks like this:

Reflection

I'm able to do this in about 20 minutes using Bayeslab, with back and forth tweaking my prompt a little. Most of the time spent is actually sanity checking if the data is right, exactly what you should do no matter what tools to use.

I find it's convent to be able to look at my data at every step without switching , and able to look at the code and modify when I'm in doubt.

This way, i can skip several things to improve my productivity:

Write excel formulas
Go back and forth to spell the name correctly(sheet name, column name, file name….etc)
- misspell is not an issue for AI
Remember to make a copy of your data every step for rollback
- it automatically create a new var
Setup python environment
- like Collab and others, a hosted python env
Remember python method to use or syntax
Switching back and forth between excel/vscode and other tools
Search stack overflow for answers

Of course there're things to improve as well:

Simple tasks like removing columns are still more complex than it should be
Looking at 100 columns for a data frame requires dragging the small scroll bar
The code AI writes is somewhat long. Although it's correct, but I'd like to make it more readable.
The preview of data frame has very little function to navigate compare to excel

Spoiler: We're working on a new "table block" to greatly simplify working with dataframes especially for simple operations(think powerquery), stay tuned.

We'll try this scenario again in a few weeks.