Choosing Labels for Text Classification

This article explains how to choose labels for text classification in qualitative research.

Choosing labels for a text classification process is decisive for successful qualitative research. It is only possible after reading the dataset first, and it requires domain expertise. Despite these challenges, it is worth the effort, because choosing the right labels has a significant effect on the depth of the research.

The main goal of classification in qualitative research is to summarize the raw dataset so that everyone has a clear idea of its content. Just as when summarizing an article or a book, we need a good understanding of the dataset, which comes from reading the entire content. This lets us foresee which labels we will need in which categories.

Make sure you have read the "What is Text Classification?" article before continuing.

Determining Categories

In text classification, the term category stands for an aspect of the data. Suppose we have a dataset of consumer feedback about a mobile phone operator. Each piece of feedback can be examined from different aspects, such as Products and Services, and Sentiment. For example, a piece of consumer feedback saying, "I can't use my data in the neighborhood, how am I supposed to go out and meet people?" can be labeled in both the Products and Services category and the Sentiment category. The label for the Products and Services category will be "Mobile Data", and the label for the Sentiment category will be "Negative".
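As a minimal sketch, one piece of feedback labeled under two categories could be stored as a mapping from category to label. The class name `LabeledFeedback` and its field names are hypothetical and chosen only for illustration; they do not come from any particular tool.

```python
from dataclasses import dataclass, field


@dataclass
class LabeledFeedback:
    """One piece of consumer feedback with one label per category.

    Hypothetical structure for illustration only.
    """
    text: str
    # Maps a category (aspect) to the label chosen within that category.
    labels: dict = field(default_factory=dict)


feedback = LabeledFeedback(
    text=(
        "I can't use my data in the neighborhood, "
        "how am I supposed to go out and meet people?"
    ),
    labels={
        "Products and Services": "Mobile Data",
        "Sentiment": "Negative",
    },
)

# Each category contributes exactly one label for this piece of feedback.
for category, label in feedback.labels.items():
    print(f"{category}: {label}")
```

Keeping the labels in a mapping keyed by category makes it easy to add or remove a category later without changing the shape of each record.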

In some cases, we may have only one category, the default category. Determining categories is about deciding how many aspects we will consider when analyzing a dataset. Each additional category requires more labeling and more human effort at the beginning, so it is a balance between the depth and the cost of the research.

Determining Labels

Labels define how classification will be made within each category. In the example above, we currently have the Mobile Data label for the Products and Services category and the Negative label for the Sentiment category. The Sentiment category will clearly also have a Positive label, and with exactly two labels it is an example of Binary Classification. The Products and Services category, on the other hand, will probably have other labels such as Voice Call, SMS, and Hotspot, which makes it an example of Multiclass Classification.
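The distinction between the two kinds of category comes down to how many distinct labels each one has. The sketch below assumes the label sets from the example; the `CATEGORY_LABELS` mapping and the `classification_type` function are hypothetical names introduced for illustration.

```python
# Hypothetical label sets, taken from the example in the text.
CATEGORY_LABELS = {
    "Sentiment": ["Positive", "Negative"],
    "Products and Services": ["Mobile Data", "Voice Call", "SMS", "Hotspot"],
}


def classification_type(labels):
    """Return "binary" for exactly two distinct labels, "multiclass" for more."""
    distinct = set(labels)
    if len(distinct) < 2:
        # With fewer than two labels there is nothing to classify.
        raise ValueError("a category needs at least two labels")
    return "binary" if len(distinct) == 2 else "multiclass"


for category, labels in CATEGORY_LABELS.items():
    print(f"{category}: {classification_type(labels)}")
```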

The number of labels in a category depends on the content of the dataset. There must be at least two labels, but the maximum depends on the research. When a category has more labels than necessary, the research becomes harder to read. When it has fewer labels than necessary, the research output will be shallow. So we need to ask, "When we visualize each category on a graph, is it both readable and satisfying?" The aim is a readable and satisfying visualization for every category.
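One rough way to approach this question is to count how often each label occurs in a category and look at the resulting distribution before drawing a graph. The sketch below uses made-up sample labels. The `max_labels` threshold of 10 is an arbitrary assumption, not a recommended value, since the right upper limit depends on the research.

```python
from collections import Counter

# Hypothetical labels assigned to seven pieces of feedback in one category.
sample_labels = [
    "Mobile Data", "Voice Call", "Mobile Data", "SMS",
    "Hotspot", "Mobile Data", "Voice Call",
]


def label_distribution(labels):
    """Return (label, count, share) tuples, most frequent label first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return [(label, n, n / total) for label, n in counts.most_common()]


def check_label_count(labels, max_labels=10):
    """Flag a category with too few or too many distinct labels.

    max_labels is an arbitrary assumption; tune it for your own research.
    """
    distinct = len(set(labels))
    if distinct < 2:
        return "too few"
    if distinct > max_labels:
        return "too many"
    return "ok"


# A text-only bar chart gives a quick impression of readability.
for label, n, share in label_distribution(sample_labels):
    print(f"{label:<12} {'#' * n} {share:.0%}")

print("label count:", check_label_count(sample_labels))
```

A long tail of labels that each cover only a tiny share of the dataset may suggest merging some of them, while one label that covers almost everything may suggest splitting it.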