Generative artificial intelligences (AIs) promise to become important tools for journalists and publishers. These technologies shine particularly in analysing data, evaluating survey responses, and dissecting comments and sentiments on various topics. 

Yet, while generative AIs promise effectiveness, we must sidestep the allure of the “wow-effect” trap.

The wow-effect trap of new AI models

One of the most striking wow-effects of new AI models is, without a doubt, their capability to perform data analysis. When we began using OpenAI's newly launched GPT-4o after May 13, we saw video recordings of the new model responding to very simple prompts that simply made us say, "wow". These prompts usually go like this: "Conduct an in-depth analysis of this data, identify trends, perform high-level statistical analysis, create visualisations." Then someone attaches a CSV file – since GPT-4o accepts multimodal input, combining files and text, like its predecessor – and presses enter.

GPT-4o then performs what appears to be a kind of "miracle". Indeed, the speed and the results are astounding: the AI analyses the file, extracts important information, identifies trends, searches for and verifies correlations. However, as with all results obtained this quickly, a closer look reveals that things are not always as easy as they seem.

True power is not in the shortcuts but in (human) preprocessing

Generative AIs like ChatGPT, Gemini, Mistral, and Claude can process large volumes of text data quickly and efficiently. They can identify patterns, extract key themes, and assess the sentiments expressed in written responses. This capability is definitely useful in analysing massive files or handling surveys and open answers or comments, where the sheer volume of data can be overwhelming. But even these tools need preprocessing.

Preprocessing involves cleaning the data, such as removing duplicates, correcting typos, and standardising the text format. 
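As a sketch of what this preprocessing can look like before the file ever reaches the AI (the column names here are hypothetical, not from the actual survey), using pandas:

```python
import pandas as pd

# A tiny stand-in for an exported survey (hypothetical columns).
df = pd.DataFrame({
    "attendance": [" In person", "remote ", "remote ", "In Person"],
    "comments":   ["Great  event!", "ok", "ok", "Loved   it"],
})

# Remove exact duplicate rows (e.g. double form submissions).
df = df.drop_duplicates()

# Standardise the text format: trim whitespace, normalise casing for
# categorical answers, collapse runs of spaces in free text.
df["attendance"] = df["attendance"].str.strip().str.lower()
df["comments"] = df["comments"].str.strip().str.replace(r"\s+", " ", regex=True)

print(df["attendance"].tolist())  # ['in person', 'remote', 'in person']
```

The same handful of operations scales to a real export: load the CSV, deduplicate, then normalise each column before uploading.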

Although these machines can perform some of these tasks with a simple command, it's better to work step by step. This is a good strategy when working with humans, and it is just as good with generative AIs.

File anonymisation and preparation

First, anonymise the file you want to analyse. For example, when analysing survey responses from the AI event for journalists I organised in Milan, I first downloaded the CSV file from the Google Form hosting the survey. Then I anonymised the file, manually removing names, emails left with consent, and any details from open answers that could lead to personal identification. This task cannot be delegated to an online-connected AI.
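Part of that manual anonymisation can be supported by a local script run before anything goes online. A minimal sketch (column names are hypothetical): drop the direct-identifier columns entirely, then scrub email addresses from open answers with a regular expression. Names hidden inside free text still need a human read-through.

```python
import re
import pandas as pd

# Hypothetical survey export with direct identifiers.
df = pd.DataFrame({
    "name":  ["Anna Rossi", "Luca Bianchi"],
    "email": ["anna@example.com", "luca@example.com"],
    "open_answer": ["Great talk, write me at anna@example.com", "Loved it"],
})

# Drop direct-identifier columns entirely.
df = df.drop(columns=["name", "email"])

# Scrub email addresses left inside free-text answers.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
df["open_answer"] = df["open_answer"].str.replace(EMAIL, "[email removed]", regex=True)

print(df["open_answer"].tolist())
```

Because this runs entirely on your own machine, no personal data reaches an online-connected AI.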

Before uploading the file, it’s advisable to create a complex prompt explaining the file. 

This is a case study I personally ran with ChatGPT-4o. The file I wanted to analyse is a CSV containing the responses to a survey that followed an event about AI (see the full video), to understand the event's impact and how the audience evaluated it.

A part of the CSV I'm going to upload to ChatGPT-4o

A good prompt to do this job is structured as follows:

Prompt, part 1

To start the prompt, explain the context of the file you're about to upload. Then briefly describe each column of the file.

Prompt, part 2

Explain to the machine, as you would to a human who has never seen the file, what to expect in the various columns.

When there are multiple-choice answers, specify that the person who answered was allowed to select all that apply. This is particularly important when there are many choices and when punctuation used in the responses could occasionally be interpreted by the AI as a CSV separator.

Prompt, part 3

When there are open answers, clarify this well, and remind the machine that it might also find irrelevant responses.

Final instructions and analysis

Finally, provide additional instructions to the AI: explain your goal, ask it to ensure the data is formatted correctly, request that it clean the data to correct any errors and standardise the response formats, describe the output you expect and, for extra safety, ask for data anonymisation in case you missed something.
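Putting the four parts together, a hypothetical skeleton of such a prompt (the columns and wording are illustrative, not the exact prompt used in the case study) could be assembled like this:

```python
# A hypothetical skeleton of the four-part prompt, as one string.
prompt = """Context: this CSV contains anonymised responses to a post-event
survey about an AI workshop for journalists.

Columns: A = timestamp; B = attendance mode (in person / remote);
C = overall rating (1-5); D = topics of interest (multiple choice,
respondents could select all that apply; commas inside an answer are
part of the answer, not separators); E = open comments (these may
include irrelevant responses).

Task: check that the data is correctly formatted, clean it (fix errors,
standardise response formats), anonymise anything I may have missed,
then produce summary statistics and propose visualisations."""

print(prompt.splitlines()[0])
```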

Prompt, part 4

Now, load the file, associate the prompt, and press enter.

Outputs

With such an elaborate prompt we expect an elaborate output, and that is indeed the case.

Output, part 1: column analysis

Note that the AI recognises that the survey I uploaded is in Italian and recognises the various sections, meaning the first part was successful.

It then explains the next steps.

Output, part 2: next steps

Finding no missing data in the cells, ChatGPT can proceed. If it had found any, it would have handled them according to the provided instructions or asked for more information. We always have to read the output carefully.

Output, part 3: standardisation and conversion

Next, a standardisation process of the responses begins. In this case, the task is easy, but again, remember that it wouldn’t be if the file were much more complex. ChatGPT begins its work, anticipating what it will do. Everything we are seeing here as output is produced without further human intervention.

First, it provides a statistical summary, starting arbitrarily with the evaluation of the event. Here, it’s crucial to verify data consistency, for example, the total votes.
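That consistency check can also be done on your own, outside the chatbot. A minimal sketch with made-up ratings: the per-value counts must add up to the number of respondents, otherwise something was dropped or double-counted.

```python
import pandas as pd

# Hypothetical 1-5 event ratings from six respondents.
ratings = pd.Series([5, 4, 5, 3, 4, 5])

counts = ratings.value_counts()
print(counts.to_dict())  # votes per rating value

# Consistency check: total votes must equal the number of respondents.
assert counts.sum() == len(ratings)
```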

Output, part 4: next steps and summary statistics

ChatGPT then proposes visualisations, like the one you see here.

A visualisation proposed by ChatGPT

It creates some, and you can ask for more, or for different forms: for example, to turn the bar graph into a pie chart, or to generate a pie chart from other data.

Now, performing sentiment analysis, ChatGPT analyses the open responses and identifies so-called "stop words", which are not significant for understanding the text in Italian. This is not an easy task, and we will return to it in the second part of this guide, as sentiment analysis requires specific work to be effective.
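To make the idea of stop-word removal concrete, here is a deliberately minimal sketch with a tiny hand-made Italian stop-word list; real sentiment work needs a proper list from an NLP library, which is part of the "specific work" mentioned above.

```python
# A tiny, hand-made Italian stop-word list, for illustration only.
STOP_WORDS = {"il", "la", "di", "e", "che", "un", "una", "per", "molto"}

comment = "Un evento molto interessante e utile per il mio lavoro"

# Keep only the words that carry meaning for the analysis.
tokens = [w for w in comment.lower().split() if w not in STOP_WORDS]
print(tokens)  # words left after stop-word removal
```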

So I ask the AI to skip columns G and H and request further analysis and ideas: it's always worthwhile to ask these chatbots for other ideas.

Asking for other kinds of analysis

For example, as suggested, you can ask whether there are correlations between different variables (in this case, there are none), ask whether they are strong or weak, and propose other questions.

For instance, in this case, we can ask whether people who attended the event in person enjoyed it more or less than those who participated remotely.

ChatGPT calculates the average ratings for the two groups, then evaluates the distribution of votes and finally gives a brief answer to the question posed.
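Under the hood, that comparison amounts to a grouped mean, which you can also reproduce yourself to double-check the chatbot. A sketch with hypothetical anonymised data:

```python
import pandas as pd

# Hypothetical anonymised responses: attendance mode and 1-5 rating.
df = pd.DataFrame({
    "mode":   ["in person", "remote", "in person", "remote", "remote"],
    "rating": [5, 4, 4, 5, 3],
})

# Average rating per group — the core of the comparison ChatGPT runs.
means = df.groupby("mode")["rating"].mean()
print(means.to_dict())  # {'in person': 4.5, 'remote': 4.0}

# Distribution of votes within each group, for a closer look.
dist = df.groupby("mode")["rating"].value_counts()
print(dist)
```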

Output: analysis

Best practices for prompt creation

Now it's time to look, more generally, at how to create effective prompts, which is crucial for maximising the potential of generative AIs in data analysis and survey evaluation. A well-structured prompt, combined with the method I've shown so far, can significantly enhance the quality and relevance of the AI's output. Here are some tips and best practices for creating effective prompts:

1. Be clear and specific:

  • Clearly state the context and purpose of the analysis. Provide background information that helps the AI understand the data.
  • Specify the format of the data and describe the columns and types of data they contain.

2. Avoid ambiguity:

  • Use precise language to avoid misunderstandings. Ambiguity can lead to irrelevant or incorrect analysis.
  • Define any technical terms or acronyms that the AI might not recognise.

3. Break down complex tasks:

  • Divide complex tasks into smaller, manageable steps. This helps the AI process the information more effectively and reduces the likelihood of errors.
  • For example, instead of asking for a comprehensive analysis in one go, start with basic data cleaning, then move on to more complex analysis.

4. Provide examples:

  • If possible, include examples of what you expect in the response. This helps guide the AI in understanding your requirements. Examples can include sample outputs, specific formats, or detailed descriptions of the desired analysis.

5. Specify output format:

  • Clearly state the format in which you want the results. Whether it’s a summary, a statistical report, visualisations, or a combination, specifying this upfront can save time.
  • Mention any specific visualisation types (e.g., bar charts, pie charts) you need.

6. Set boundaries:

  • Define the scope of the analysis to keep the AI focused. Specify any limitations or exclusions explicitly. For instance, if certain columns or data types should be ignored, mention this clearly in the prompt.

Common pitfalls to avoid:

Be sure not to make these mistakes:

Vagueness: vague instructions can lead to irrelevant or incomplete results.

Overloading the prompt: avoid including too many instructions at once. This can overwhelm the AI and result in a less focused analysis.

Ignoring data quality: ensure that the data is clean and well-prepared before submitting it for analysis. Poor data quality can skew the results. Ask the AI to help you in cleaning the data if you don’t know how to do that task.

Lack of context: Failing to provide sufficient context can lead to misinterpretation. Always include background information and explain the purpose of the analysis.

In short, once we have built the premises for working with the chosen artificial intelligence, human imagination can guide the work. You might even conclude by asking the AI to write a draft of an executive summary or an abstract that sums up, in discursive form, everything we have seen.

The more complex the file, the more useful this type of setup is to avoid problems.

Source of the cover photo: Mika Baumeister via Unsplash

