Improving our ability to find the right information at the right time is important for every business and every user. The constant growth in unstructured information makes text mining applications increasingly important in achieving this goal. In looking ahead to see how text mining will solve some of our biggest challenges in extracting value from large and noisy volumes of unstructured data, I thought I’d share some of the text mining research papers that I’ve recently come across. Covering clustering, entity-related problems and more, these papers each focus on specific techniques and may prove helpful for others facing similar issues.
Starting with categorization and classification, “Support-vector networks,” an older but still relevant text mining research paper by Corinna Cortes and Vladimir Vapnik, is worth mentioning. Another paper on the same topic is “Text categorization with support vector machines: Learning with many relevant features” by Thorsten Joachims.
Moving on to clustering, the text mining research paper “A comparison of document clustering techniques” by Michael Steinbach, George Karypis and Vipin Kumar from the Department of Computer Science at the University of Minnesota provides the foundation for understanding how clustering algorithms work.
Information scientist Don R. Swanson is known as one of the most respected scholars in text mining. His papers are relevant for understanding the big-picture evolution of the topic. Here is a collection of some of his work: (Don R Swanson – Google Scholar).
Among the text mining research papers focusing on the problem of entity linking (in other words, linking entities mentioned in a document to the corresponding entries in a knowledge base such as Wikipedia), I found a valuable resource in Heng Ji. Ji is an Associate Professor of Computer Science at Rensselaer Polytechnic Institute. You can take a look at several papers that are available on her website (publication) by searching for entity linking.
It is also worth mentioning the text mining research paper “Local and Global Algorithms for Disambiguation to Wikipedia”. This paper is authored by Lev Ratinov, Dan Roth, Doug Downey and Mike Anderson from the University of Illinois at Urbana-Champaign and treats entity linking as an optimization problem.
Finally, the last text mining research paper that I will include on entity linking is “A Neighborhood Relevance Model for Entity Linking” by Jeffrey Dalton and Laura Dietz from the University of Massachusetts. This paper focuses on the extremely important task of using the surrounding context to disambiguate entity mentions.
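To make the entity linking task discussed above more concrete, here is a minimal toy sketch in Python. It is not the method of any of the papers mentioned: it simply picks the candidate whose description shares the most words with the mention's context. The knowledge base, the alias table and the function names (`KNOWLEDGE_BASE`, `ALIASES`, `tokenize`, `link_mention`) are all hypothetical and written by hand for illustration; a real system would build them from Wikipedia.

```python
import re

# Hypothetical, hand-written knowledge base: candidate entity -> description.
# A real system would derive these descriptions from Wikipedia pages.
KNOWLEDGE_BASE = {
    "Jaguar (animal)": "large cat species native to the americas jungle predator",
    "Jaguar (car)": "british luxury car manufacturer vehicles engine",
}

# Hypothetical alias table: lowercase surface form -> candidate entities.
ALIASES = {
    "jaguar": ["Jaguar (animal)", "Jaguar (car)"],
}


def tokenize(text):
    """Lowercase the text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def link_mention(mention, context):
    """Return the candidate whose description overlaps most with the context.

    Returns None when the mention has no known candidates.
    """
    candidates = ALIASES.get(mention.lower(), [])
    if not candidates:
        return None
    context_words = tokenize(context)
    return max(
        candidates,
        key=lambda c: len(context_words & tokenize(KNOWLEDGE_BASE[c])),
    )


if __name__ == "__main__":
    print(link_mention("Jaguar", "The jaguar is a predator that hunts in the jungle"))
    print(link_mention("Jaguar", "The new Jaguar has a powerful engine and is a luxury car"))
```

Word overlap is a deliberately crude stand-in for the context models used in the research literature, but it shows the basic shape of the problem: generate candidates for a mention, then rank them using the surrounding text.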
The last text mining technique that I would reference is pure extraction, an area that offers many interesting research papers. For specific reasons related to the development of entity recognition in different languages, I would recommend the following paper about name recognition for Arabic, which is a real challenge: “A Rule Based Persons Names Arabic Extraction System,” authored by Ali Elsebai and Farid Meziane from the University of Salford and Fatma Zohra Belkredim from the Université Hassiba Ben Bouali in Chlef, Algeria.
In addition to the above text mining research papers, I would like to suggest a few books that I covered in a previous post:
- “Introduction to Information Retrieval” by Christopher Manning, Prabhakar Raghavan and Hinrich Schütze [Introduction to Information Retrieval] includes the findings of some of the most important papers and is a great resource for understanding text mining
- “Mining of Massive Datasets” by Jure Leskovec, Anand Rajaraman and Jeff Ullman [Mining of Massive Datasets] is another foundational book
- On the topic of NLP+Text Analytics, one of my favorites is “Opinion Mining and Sentiment Analysis” by Lillian Lee and Bo Pang. [The Book].