Optimize Dataset-Based Custom Models

3 mins read - Updated on Oct 27, 2025

A Custom Model in Kimola is an AI system you train with your own labeled data so it can automatically classify new text according to your categories.
When you create a model using a training dataset, you provide examples of text and their corresponding labels — teaching the model how to group similar content accurately.

Optimizing this dataset helps improve your model’s accuracy, consistency, and overall reliability.
This article explains how to review, clean, and refine your dataset to achieve better model performance.

Getting Ready

To begin, sign in to your Kimola account and open your dashboard. From the left menu, click Models to access the Models page. Then, click Create AI Model in the top right corner and choose By Training Set to start building or retraining your model.

Ensure Balanced and Sufficient Data

Your training dataset teaches your model how to classify text correctly.
It should include two main columns:

  • Content: The text data to analyze (e.g., customer feedback, survey responses).
  • Label: The category each text belongs to (e.g., Delivery, Pricing, Support).

For best results, aim for a balanced dataset:

  • Each label should have a similar number of samples — ideally at least 500 per label, and 2,500+ for complex datasets.
  • Avoid giving one label too much data, as it can make your model biased.
  • Too few examples make it harder for the model to recognize that label.

However, too much repetitive data can also reduce learning quality. When a model sees many nearly identical examples, it starts to memorize instead of learning — a phenomenon called overfitting. Overfitted models perform well on training data but struggle to classify unseen text correctly.

To optimize your dataset:

  • Include varied examples that show different tones, writing styles, and contexts.
  • Remove duplicate or near-duplicate records that don’t add new information.
  • Add more data only if it brings diversity — not repetition.

Tip

A high-quality training set is both balanced and diverse — enough data to teach the model, but not so much that it memorizes patterns instead of understanding them.
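If you prepare your training file yourself, the balance and duplicate checks above can be scripted before uploading. The sketch below is illustrative, assuming a local CSV file with the two columns described above (Content and Label); the 500-sample threshold mirrors the guideline in this article, not a hard Kimola limit.

```python
import csv
from collections import Counter

MIN_SAMPLES = 500  # suggested minimum per label, per the guideline above

def audit_dataset(path):
    """Count samples per label and flag exact duplicate rows.

    Returns (label_counts, duplicate_count) so the results can be
    inspected programmatically as well as printed.
    """
    label_counts = Counter()
    seen = set()
    duplicates = 0
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            content = row["Content"].strip()
            label = row["Label"].strip()
            label_counts[label] += 1
            if (content, label) in seen:
                duplicates += 1  # near-identical rows invite overfitting
            else:
                seen.add((content, label))
    for label, count in label_counts.most_common():
        flag = "" if count >= MIN_SAMPLES else "  <- below suggested minimum"
        print(f"{label}: {count}{flag}")
    print(f"Exact duplicate rows: {duplicates}")
    return label_counts, duplicates
```

Running this before each retraining pass makes it easy to spot a label that dominates the dataset or one that has too few examples to learn from.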

Review Label Accuracy and Consistency

Label quality is as important as label quantity. If records are mislabeled or inconsistent, the model will learn the wrong patterns.

Check your dataset for the following:

  • Mixed topics: A single label that includes unrelated content.
  • Inconsistent writing: Labels with different capitalizations (e.g., delivery, Delivery).
  • Typos or empty labels: Misspelled or blank labels reduce accuracy.

Tip

Merge or rename labels that overlap in meaning before retraining your model.
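These consistency fixes can also be applied in bulk before you upload. A minimal sketch, assuming you process the Label column yourself — the MERGE_MAP entries are hypothetical examples of overlapping labels, not Kimola settings:

```python
def normalize_label(label):
    """Collapse spacing and capitalization variants,
    so 'delivery' and ' Delivery ' become the same label."""
    return " ".join(label.split()).title()

# Hypothetical merge map: fold labels that overlap in meaning
# into a single canonical label before retraining.
MERGE_MAP = {
    "Shipping": "Delivery",
}

def clean_label(label):
    label = normalize_label(label)
    return MERGE_MAP.get(label, label)
```

Applying `clean_label` to every row guarantees that each category appears under exactly one spelling when the model trains.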

Clean Your Text Data

AI models learn by analyzing text. Noisy or incomplete data can reduce performance. Before retraining, clean your dataset by removing:

  • Links (e.g., https://example.com)
  • Emojis, symbols, or non-text characters
  • Empty rows or entries without meaningful text

Keeping your text clean ensures your model focuses only on meaningful linguistic patterns.
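The cleanup steps above can be sketched in Python. The regex patterns and the three-character "meaningful text" threshold below are illustrative assumptions, not Kimola requirements — adjust them to your data:

```python
import re

# Match http/https links so they can be stripped out.
URL_RE = re.compile(r"https?://\S+")

# Keep letters, digits, whitespace, and common punctuation;
# everything else (emojis, stray symbols) is replaced with a space.
NON_TEXT_RE = re.compile(r"[^\w\s.,!?'\"-]")

def clean_text(text):
    text = URL_RE.sub(" ", text)
    text = NON_TEXT_RE.sub(" ", text)
    return " ".join(text.split())  # collapse repeated whitespace

def is_meaningful(text):
    """Treat rows with fewer than 3 characters of cleaned text as empty."""
    return len(clean_text(text)) >= 3
```

Rows where `is_meaningful` returns False are good candidates for removal, since they carry no linguistic signal for the model to learn from.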

Retrain and Compare Results

After improving your dataset, upload it again and retrain your model in Kimola.
Once training is complete, evaluate performance by:

  • Checking the Consistency Rate on the model’s overview page.
  • Testing the model on new examples.
  • Comparing predictions with your previous version.

A higher consistency rate indicates improved stability and accuracy.

Monitor Model Performance

Over time, your dataset may become outdated as language or topics evolve.
If your consistency rate decreases or new patterns appear in your data, update your dataset and retrain the model.

Regular monitoring ensures your AI stays accurate and relevant.
