5. Data pre-processing

There are at least two key tasks that can happen at the pre-processing stage:

Convert the data into a format that can be understood by LLMs.
Minimise the data to respect privacy.

Warning

This page could be expanded with more preprocessing strategies and examples.

5.1. Preparing your data for LLMs

The data you collected should be pre-processed into a format that the GenAI tool can use for further analysis. This usually means that anything not already collected as plain digital text will require some pre-processing. The tabs below list some options for data pre-processing, such as image-to-text conversion and automatic speech transcription. Note that these tools also use AI/ML, but we won’t go into detail on how they differ from GenAI with LLMs.

Survey answers are often provided in digital formats such as xlsx or csv. When the survey data is collected with pen and paper, see the section on Observations for more details.
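As an illustration of turning tabular survey data into plain text, here is a minimal sketch that uses only the Python standard library. The column names (`respondent`, `ai_use`, `concerns`), the example rows, and the function name `survey_rows_to_text` are hypothetical and are not taken from any real survey. A real export would be read from a file, and an xlsx file would first need to be saved as csv or read with a spreadsheet library.

```python
import csv
import io


def survey_rows_to_text(csv_text: str) -> list[str]:
    """Turn each CSV row into a plain-text block of 'column: answer' lines."""
    reader = csv.DictReader(io.StringIO(csv_text))
    blocks = []
    for row in reader:
        lines = [
            f"{column}: {answer.strip()}"
            for column, answer in row.items()
            # Skip malformed extra fields and empty answers.
            if column and isinstance(answer, str) and answer.strip()
        ]
        blocks.append("\n".join(lines))
    return blocks


# Hypothetical example data, for illustration only.
example_csv = (
    "respondent,ai_use,concerns\n"
    "R1,For summarising papers,Privacy of my data\n"
    "R2,I do not use them,\n"
)

for block in survey_rows_to_text(example_csv):
    print(block)
    print("---")
```

Each block is a small piece of plain text that can be pasted into a prompt or processed further. Empty answers are left out so that the text does not contain blank fields.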

5.2. Data minimisation

Data minimisation is one of the core principles of the GDPR. While it is often impossible to minimise the data to the point that it is truly anonymous, there are techniques for at least “minimising” the data, i.e. keeping only the minimum required for your research purpose. For example, an audio-visual interview can be converted to an audio-only format if the researcher is not planning (and did not disclose to the study participant any plan) to analyse the video data.
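The same idea applies to tabular data: keep only the columns your research purpose requires before passing anything to an LLM. The sketch below is one possible way to do this with the Python standard library. The column names, the example row, and the function name `keep_only_columns` are hypothetical. Dropping columns is minimisation, not anonymisation, because free-text answers may still contain personal data.

```python
import csv
import io


def keep_only_columns(csv_text: str, needed: list[str]) -> str:
    """Return a CSV string that contains only the columns listed in `needed`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    available = reader.fieldnames or []
    missing = [column for column in needed if column not in available]
    if missing:
        raise KeyError(f"Columns not found in the data: {missing}")

    out = io.StringIO()
    # extrasaction="ignore" silently drops every column that is not needed.
    writer = csv.DictWriter(
        out, fieldnames=needed, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
    return out.getvalue()


# Hypothetical example data, for illustration only.
raw_csv = (
    "name,email,age_group,answer\n"
    "Alice Example,alice@example.org,30-39,I use AI for coding\n"
)

minimised = keep_only_columns(raw_csv, ["age_group", "answer"])
print(minimised)
```

Listing the columns to keep, rather than the columns to drop, means that a new identifying column added to a later export is excluded by default.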

Warning

This section can be expanded with content from this slide set:

Glerean, E. (2024, February 12). Basics of personal data anonymisation. Zenodo. https://doi.org/10.5281/zenodo.10649060

Note

Do you have other uses of AI for data preprocessing? Share your knowledge with other Aalto researchers by contributing to this book!
