How to generate an NLP dataset from any internet source?

Mar 29, 2022 - 8 min read

Having a dataset is the first step of the research process. The quality, scope and even feasibility of research depend on the initial dataset.

As a member of the Kimola team, I regularly work on large datasets with tens of thousands of pieces of customer feedback, and I can say that every single one is important. There are many sources on the internet that explain the importance of having and analyzing customer feedback, but very few explain how to create an appropriate dataset to start with. Therefore, in this article, I will explain how to create an NLP dataset from any internet source, which is the very beginning of the data analysis process, so that you can analyze raw data and complete your research.


Before diving deep into the details, I would like to note that I wrote this article to demonstrate creating a sample dataset for research purposes. Scraping all the content on a website and publishing it on another platform might be illegal and could cause problems.

Let's start with the basics to ensure we are all on the same page.

What is a dataset?

A dataset is a collection of data. Each column represents a specific variable, while each row corresponds to a single record. A researcher or a data analyst can observe changes in consumer trends, compare datasets or find insights into consumer behaviour. Datasets come in many data formats, such as numeric, image, audio or video. However, my job at Kimola is to analyze customer feedback data, so in this article I will create an NLP dataset from text data.

The text data that we need to analyze might sit in a single field or be split across several fields in a dataset. We may want to analyze the title and body together, which is usually the case for user reviews or news articles. On the other hand, there might be fields that we don't include in the text analysis but use as helpful attributes in our research, such as a date, time or link.
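As a rough illustration (the column names and values below are made up), here is how a review with a separate title and body could be merged into a single text field while the date and link are kept as helper attributes:

```python
import pandas as pd

# Hypothetical records: two text fields (title, body) plus helper attributes
# (date, link) that we keep in the dataset but do not analyze as text.
records = pd.DataFrame(
    {
        "title": ["Great battery life", "App keeps crashing"],
        "body": ["Lasted two days on a single charge.", "Crashes every time I open a chat."],
        "date": ["2022-03-01", "2022-03-05"],
        "link": ["https://example.com/review/1", "https://example.com/review/2"],
    }
)

# Combine title and body into a single field for text analysis.
records["text"] = records["title"] + ". " + records["body"]
print(records[["text", "date", "link"]])
```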

Sometimes a dataset is used to train a Machine Learning model. In a training set, besides the text data, a field must contain the label (or tag); the best-known examples are positive and negative for sentiment analysis. In the consumer research industry, a much more elaborate labelling architecture is preferred to classify massive amounts of text data and unlock insights.
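For instance, a tiny, made-up training set could look like the sketch below, with a text field plus a label column; the labels shown are only an illustration of a richer, non-sentiment labelling scheme:

```python
import pandas as pd

# Hypothetical training set: a text field plus a label column.
# For sentiment analysis the labels would typically be "positive" / "negative".
train = pd.DataFrame(
    {
        "text": [
            "Delivery was fast and the packaging was intact.",
            "Customer service never answered my emails.",
        ],
        "label": ["Delivery", "Customer Service"],
    }
)
print(train["label"].value_counts())
```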

Why are datasets important for research?

First of all, the go/no-go decision for research depends on the dataset. Every research project requires a different set of data, which determines the scope and quality of the research.

One of the essential elements to consider when creating an NLP dataset is the variety of the data. We should avoid creating biased datasets. For example, if we scrape a dataset from an internet source where consumers share only complaints, negative opinions will dominate the research results. As researchers, we should aim for balance by working with datasets collected from various sources. This approach enables us to get more comprehensive and accurate results and to understand public opinion on a topic.
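As a quick, hedged check (assuming the combined dataset is a CSV file with a "source" column recording which website each row was scraped from; both are assumptions), the distribution of sources shows at a glance whether one of them dominates:

```python
import pandas as pd

# Hypothetical balance check; the file name and "source" column are assumptions.
dataset = pd.read_csv("combined_feedback.csv")
print(dataset["source"].value_counts(normalize=True))  # share of rows per source
# If a complaints-only site accounts for most of the rows, negative opinions
# will dominate the research results.
```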

On the other hand, even if we have a dataset collected from various sources, we need to make sure it is consistent. Public free datasets in particular, and sometimes the ones we create ourselves, may contain irrelevant data. This may lead us to analyze, for example, ad copy rather than consumer opinions, which can completely change the research output.
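A minimal cleanup sketch along these lines (the file name, column name and keyword list are all assumptions, not a fixed recipe) can drop exact duplicates and rows that look like ads rather than consumer opinions:

```python
import pandas as pd

# Hypothetical cleanup; file name, column name and keywords are placeholders.
ad_keywords = ["use promo code", "sponsored", "visit our store"]

dataset = pd.read_csv("raw_reviews.csv")
dataset = dataset.drop_duplicates(subset="text")

is_ad = dataset["text"].str.lower().str.contains("|".join(ad_keywords), na=False)
dataset = dataset[~is_ad]
print(f"{is_ad.sum()} ad-like rows removed, {len(dataset)} rows kept")
```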

Last but not least, we need to make sure that we start our research with the right amount of data. There is no magic formula for finding the right amount, but we will form a strong opinion about the data size we can work with while scouting websites.

Now that we have covered the basics of datasets, we can move on to generating a dataset for our imaginary research project.

How do you find the right website for data collection?

There are millions of websites on the world wide web where people share their opinions on different topics, so here are a few essential tips for finding the right websites for creating an NLP dataset.

Before scouting websites, you should decide on the language of the dataset, because machine learning models mostly support a single language and, as you can imagine, a model that classifies consumer opinions in Spanish may not perform well in English.

After choosing the language, you can type search queries like "Top e-commerce websites in {country}", "Top forums on {topic}" or "Most influential blogs for {topic}" to find websites where people share their opinions. There are countless articles on the internet listing websites on different topics. The only thing to consider here is choosing websites that provide date-time information for each post, because research is generally expected to cover a specific date range.

How to scrape data from any website?

The most reliable way to scrape data for an NLP dataset is to use a browser extension. After choosing the websites to scrape, you can install an extension called Instant Data Scraper on Google Chrome or any Chromium-based browser. The extension's icon will then appear next to the address bar.

How does the Instant Data Scraper work?

Let's say you want to understand the pain points and motivations of mobile app users of social networks, and you choose to start by analyzing the reviews of the Twitter mobile application on the Google Play Store. For this purpose, we navigate to Twitter's Google Play page in our web browser and click the Instant Data Scraper icon next to the address bar. A pop-up window appears, as shown below:

[Image: how to scrape data from Twitter's Google Play page]

The extension analyzes the HTML content of the web page, detects patterns and captures the variables for each review. Thus, we end up with the links of the reviews, the dates, the like counts, the names of the users and the review texts, each in a separate column. That is pretty much everything a dataset needs: the main content to analyze plus helpful attributes like date-time and likes.

Google Play has a user interface that lets you scroll infinitely. To scrape all the data on such websites, all you need to do is check the "Infinite scroll" option in the extension and click the "Start crawling" button; the extension will scroll Twitter's Google Play page automatically and scrape all the data.

How to scrape data from multiple pages?

[Image: how to collect data from any internet source]

Let's say you need to scrape the reviews of a company that has a page on TrustPilot. Unlike the Google Play Store, TrustPilot displays reviews across multiple pages rather than with infinite scroll. After opening the Instant Data Scraper window by following the same steps mentioned above, all you need to do is make sure that the "Infinite scroll" option is unchecked and then click the "Locate 'Next' Button" button. When you then click the "Next" button on the open website, the extension detects it automatically, and once you start the scraping process, it will page through the site on its own and scrape all the data.

Instant Data Scraper will notify you when the entire crawling and scraping process is complete. After this notification, you can download all the reviews in CSV or XLSX format as your dataset. Just before saving the dataset as a file, you can delete unnecessary columns by clicking the cross in their top corner, so you can focus on the data that matters to your research.
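If you prefer to tidy things up after downloading instead, a minimal pandas sketch like the one below works too; the file name and column names are placeholders, since the extension names columns after the page it scraped:

```python
import pandas as pd

# Hypothetical post-download cleanup; file and column names are placeholders.
reviews = pd.read_csv("twitter_google_play_reviews.csv")

print(reviews.columns.tolist())   # see which columns the extension captured
print(len(reviews), "rows scraped")

# Keep only the fields that matter to the research and save as an Excel file,
# which is convenient if the analysis platform expects an XLSX upload.
dataset = reviews[["review_text", "date", "likes"]]
dataset.to_excel("nlp_dataset.xlsx", index=False)
```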

CONGRATS! You have generated your first dataset! 

How to analyze the Natural Language Processing dataset you generated?

Welcome to the world of text analytics! As researchers, we treat our datasets as treasures, but that doesn't mean we spend all of our time reading the content and classifying each entry by hand. That is always an option, of course, but in the age of artificial intelligence, NLP technologies are available not only to software developers but also to marketing and research professionals through Kimola Cognitive.

Kimola is a data analytics company that provides a no-code Machine Learning platform designed specifically for marketing and research professionals. The platform, Kimola Cognitive, enables researchers to either use a pre-built machine learning model or create their own to automatically classify a dataset simply by uploading an Excel file. There is a 7-day free trial and plans that grow with your needs. Although there are different data analysis techniques and tools out there, starting with one that has a Gallery of pre-built machine learning models might help you wrangle the data faster. Kimola Cognitive also has a browser extension that helps with data collection; a new article about it is coming soon and will be linked here.

In consumer research, the choice of labels for data classification depends on the context. For example, if we are analyzing consumer reviews about a chocolate bar, we will probably have labels like Taste, Packaging and Pricing to quantify the data. On the other hand, if we are analyzing consumer reviews of a bank, we would have labels like Customer Service, Ads & Campaigns, ATMs, and Credits and Loans. Kimola Cognitive offers many pre-built Machine Learning models for different industries, including banking, e-commerce, mobile network operators, automobile, mobile apps and mobile games.

To scrape text data and analyze the NLP dataset you generated, you can sign up for Kimola Cognitive here, create a free account and upload your dataset file.
