RapidMiner Text Mining
Text mining is a process for pulling words from free-form text fields, Word documents, or other data sources. Business use cases for text mining include mining survey notes, claim notes, project notes, or call center notes. It is applicable in any situation that requires parsing text to find words and the number of times they appear, in business functions such as Risk Management, Customer Service, Project Management, or Knowledge Management. RapidMiner text mining enables you to gain insight from your business information and turn it into actionable knowledge. In this example, I will use data collected from the Loch Ness Monster sighting website. Each sighting is posted with a date and a free-form text field. The goal of this example is to parse the sightings and determine the most common words by frequency.
Loch Ness Monster Sighting Example
The data from the website was scraped into a CSV file available on GitHub – Loch Ness Data. Each row contains the date of the sighting and the sighting description. To build the model and output the results, you will need the operators listed below. The Process Documents operator has a subprocess that uses five of these RapidMiner Studio operators.
- Read Document
- Process Documents
- Tokenize
- Transform Cases
- Filter Stopwords (English)
- Filter Tokens (by Length)
- Stem (Snowball)
- WordList to Data
- Write Excel
Building the RapidMiner Text Mining Model
Open RapidMiner Studio and create a new process.
Step 1 – Add the Read Document operator and point it to the Loch Ness Monster file in the Parameters window.
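If you want to inspect the same data outside RapidMiner, the sketch below loads the rows with Python's standard csv module. It is only an illustration: the column order (date first, description second), the absence of a header row, and the sample rows are assumptions, not details taken from the GitHub file.

```python
import csv
import io


def read_sightings(csv_file):
    """Return (date, description) pairs from an open CSV file object."""
    rows = []
    for row in csv.reader(csv_file):
        # Skip blank lines and rows that do not have both fields.
        if len(row) < 2:
            continue
        rows.append((row[0].strip(), row[1].strip()))
    return rows


# Hypothetical sample in the assumed layout, not real sighting data.
sample = io.StringIO(
    '2015-01-01,"Dark hump seen moving across the loch"\n'
    '2015-02-01,"Two humps near the castle, then nothing"\n'
)
sightings = read_sightings(sample)
print(len(sightings))  # 2
```

To use it on the real file, pass an opened file instead of the sample, for example `open(path, newline="", encoding="utf-8")`, where `path` is wherever you saved the CSV.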
Step 2 – Add the Process Documents operator. Connect the Read Document out port to the Process Documents doc port, and connect the Process Documents exa port to res.
Step 2 Subprocesses – Double click on the Process Documents operator and add the following operators.
Tokenize – Set the mode parameter to non letters. Connect the subprocess doc input to Tokenize doc.
Transform Cases – Set the transform to parameter to lower case. Connect Tokenize doc to Transform Cases doc.
Filter Stopwords (English) – Connect Transform Cases doc to Filter Stopwords doc.
Filter Tokens (by Length) – Set min chars parameter to 4 and max chars parameter to 25. Connect Filter Stopwords doc to Filter Tokens doc.
Stem (Snowball) – Set the language parameter to English. Connect Filter Tokens doc to Stem doc, and Stem doc to the subprocess doc output.
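For readers who think in code, the five subprocess operators can be sketched roughly as a chain of small Python steps. This is an approximation under stated assumptions, not what RapidMiner runs internally: the stopword set is a tiny hand-picked list, the regular expression only treats ASCII letters as letters, and `crude_stem` is a simple suffix stripper standing in for the Snowball stemmer.

```python
import re

# Tiny illustrative stopword list (assumption); a real English list is much longer.
STOPWORDS = {"the", "and", "was", "were", "with", "that", "this", "from", "then", "there"}


def crude_stem(token):
    """Strip a few common suffixes. A stand-in for Snowball, not the real algorithm."""
    if token.endswith("ss"):
        return token
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text, min_chars=4, max_chars=25):
    """Mirror the subprocess order: tokenize, lower case, stopwords, length, stem."""
    tokens = [t for t in re.split(r"[^A-Za-z]+", text) if t]           # Tokenize (non letters)
    tokens = [t.lower() for t in tokens]                               # Transform Cases
    tokens = [t for t in tokens if t not in STOPWORDS]                 # Filter Stopwords
    tokens = [t for t in tokens if min_chars <= len(t) <= max_chars]   # Filter Tokens (by Length)
    return [crude_stem(t) for t in tokens]                             # Stem (stand-in)


print(preprocess("The humps were moving across the Loch"))
# ['hump', 'mov', 'across', 'loch']
```

The order matters in the same way it does in the subprocess: lower-casing comes before the stopword filter so that "The" and "the" are treated alike, and stemming comes last so the length filter sees whole words.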
Step 3 – Add the WordList to Data operator. Connect the Process Documents wor output to the WordList to Data wor input.
Step 4 – Add the Write Excel operator. Connect the WordList to Data exa output to the Write Excel inp input, and the Write Excel thr output to res.
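Steps 3 and 4 turn the word list into a table and write it to a file. A minimal Python sketch of the same idea follows. It makes two assumptions: the column names `word` and `count` are my own, not RapidMiner's attribute names, and it writes CSV with the standard library rather than a real Excel file.

```python
import csv
import io
from collections import Counter


def word_list_rows(token_lists):
    """Return (word, count) rows, most frequent first, ties broken alphabetically."""
    totals = Counter()
    for tokens in token_lists:
        totals.update(tokens)
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


def write_rows(rows, out_file):
    """Write a header line and the rows to an open text file object as CSV."""
    writer = csv.writer(out_file)
    writer.writerow(["word", "count"])
    writer.writerows(rows)


# Hypothetical token lists, as a preprocessing step might produce them.
rows = word_list_rows([["loch", "hump", "loch"], ["hump", "castl", "loch"]])
buffer = io.StringIO()
write_rows(rows, buffer)
print(rows)  # [('loch', 3), ('hump', 2), ('castl', 1)]
```

Sorting by descending count puts the most frequent stems at the top, which is the view you usually want when scanning the results.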
When completed the process should look like this:
The Process Documents Sub Process should look like this:
Execute the process to see the results.
Based on the text of the Loch Ness Monster sightings, the results from the text mining model show each word with the number of times it appears in the text. For example, Loch appears 50 times and the place name Urquhart appears 20 times. Instead of reading through the text descriptions and pulling out the information by hand, the model extracts the word stems so that the word frequencies are easy to see. In a business scenario, the same process enables you to determine frequently recurring words in call center agent notes, claim notes, part failure descriptions, and so on.
I have uploaded the complete model build to YouTube – https://www.youtube.com/watch?v=-NcnKvFuqpg