Data analysis mostly focuses on structured, standardized data, e.g. data from databases, because such data can be analyzed easily. Nevertheless, unstructured data also offers opportunities to gain an advantage. Concrete applications such as content analysis or sentiment detection are discussed more and more frequently.
Of course, qualitative data analysis still has its limits. Automated mood recognition, for instance, struggles with ambiguous statements. But the sheer availability of digital texts and documents shows that analyzing unstructured data is justified and useful. Unstructured data comes in many forms, from e-mail histories to scientific papers. Analyzing such texts is complex because of the large data volumes, the differing formats and the different types of problems involved.
The free statistical programming language R is one of the leading solutions for this kind of problem. R offers a very wide range of possibilities for statistical problems of almost every kind. For example, the add-on package tm provides functions for managing text documents and makes it easier to work with heterogeneous text formats, which makes it a useful tool for text mining tasks. A multitude of text formats such as e-mails, RSS feeds and many others (HTML, CSV, PDF, etc.) can be read into R. Both the data structures and the algorithms can be adjusted to personal needs, because tm’s developers created a modular concept that supports integration, transformation and filtering. These options allow texts to be filtered according to defined criteria. The advantage of R here is that the results can be used directly for further analysis, since R is a full statistical language.
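The three stages named above (integration, transformation, filtering) can be sketched in a few lines. The example below is a language-agnostic illustration written in Python, not tm's actual API; the corpus contents, the stop word list and the function names `transform` and `filter_corpus` are all hypothetical and chosen only for this sketch.

```python
import re

# Integration: texts from different sources collected in one corpus.
# These documents are invented placeholders for e-mails, feeds, papers, ...
corpus = {
    "mail_1": "Dear team, the quarterly REPORT is attached.",
    "feed_1": "New release: the tm package handles many text formats!",
    "paper_1": "We analyse sentiment in unstructured text.",
}

# Tiny illustrative stop word list (an assumption, not a standard list).
STOP_WORDS = {"the", "is", "in", "we", "a", "of"}


def transform(text):
    """Transformation: lowercase, drop punctuation, remove stop words."""
    tokens = re.findall(r"[a-zäöüß]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def filter_corpus(docs, term):
    """Filtering: keep only documents whose tokens contain `term`."""
    return {name: toks for name, toks in docs.items() if term in toks}


transformed = {name: transform(text) for name, text in corpus.items()}
hits = filter_corpus(transformed, "text")

# Print the names of the documents that passed the filter.
print(sorted(hits))
```

Keeping each stage in its own small function mirrors the modular idea described above: a different tokenizer or filter criterion can be swapped in without touching the rest.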
The following graphic, “election promises Germany 2013 frequent words”, was created to illustrate the possibilities of data mining with R. The upcoming Bundestag election in Germany was chosen as an interesting example. Frequent words were extracted from the election promises of the five most popular German parties to demonstrate R’s ability to cope with unstructured data. The results show how often each frequent word is used in each party’s election promises, encoded in the color of the corresponding section, and can be interpreted as the parties’ topics of special interest. Dark red sections indicate that a word appears very frequently in a party’s election promises, while lighter sections indicate that it is used less often.
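The counting logic behind such a graphic can be sketched as follows. The original graphic was produced with R; this Python sketch only illustrates the idea of a word-by-party frequency table with color intensities. The party names, the token lists and the `shade` thresholds are invented placeholders, not the real election texts or the actual color scale.

```python
from collections import Counter

# Toy token lists standing in for preprocessed election texts.
# Invented placeholders, NOT the real party documents.
docs = {
    "party_a": ["tax", "tax", "tax", "family", "energy"],
    "party_b": ["energy", "energy", "family", "family", "climate"],
    "party_c": ["family", "tax", "climate", "climate", "climate", "climate"],
}

# Word counts per party.
counts = {name: Counter(tokens) for name, tokens in docs.items()}

# Overall counts decide which words count as "frequent".
total = Counter()
for c in counts.values():
    total.update(c)

# Top 3 words, ties broken alphabetically so the result is deterministic.
top_terms = sorted(total, key=lambda t: (-total[t], t))[:3]

# Rows are words, columns are parties (Counter returns 0 for missing words).
matrix = {t: {p: counts[p][t] for p in docs} for t in top_terms}
max_count = max(v for row in matrix.values() for v in row.values())


def shade(count, max_count):
    """Map a count to a color intensity (hypothetical thresholds)."""
    if count == 0:
        return "none"
    ratio = count / max_count
    if ratio > 2 / 3:
        return "dark"
    if ratio > 1 / 3:
        return "medium"
    return "light"


for term, row in matrix.items():
    cells = ", ".join(f"{p}={n} ({shade(n, max_count)})" for p, n in row.items())
    print(f"{term}: {cells}")
```

Scaling the shades relative to the largest count in the table keeps the colors comparable across parties, which is what lets dark sections be read as a party's focus topics.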
eoda’s Data Science Trainings offers a course named “qualitative data analysis” from November 18th to November 19th that covers text mining as one method of qualitative data analysis.
Heiko Miertzsch - Posted on 13.09.2013
Heiko Miertzsch is one of the two founders of eoda GmbH.