Text mining

From Wikipedia, the free encyclopedia

Text mining, sometimes called text data mining, is the process of deriving high-quality information from text. That information is typically obtained by identifying patterns and trends through means such as statistical pattern learning. Text mining usually involves structuring the input text (usually by parsing it, adding some derived linguistic features, removing others, and inserting the result into a database), deriving patterns within the structured data, and finally evaluating and interpreting the output. "High quality" in text mining usually refers to some combination of relevance, novelty, and interestingness. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities).
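The three stages described above (structuring the text, deriving patterns, and evaluating the output) can be illustrated with a minimal sketch. It assumes TF-IDF weighting and cosine similarity as the pattern-deriving step, which is only one of many possible choices and is not prescribed by the description above. The stop-word list, the function names, and the sample documents are all hypothetical and exist only for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list, chosen only for this illustration.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "on"}


def tokenize(text):
    """Structure raw text: lowercase it, split into word tokens, drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def tf_idf_vectors(documents):
    """Turn each document into a sparse {term: weight} vector using TF-IDF."""
    token_lists = [tokenize(d) for d in documents]
    n_docs = len(token_lists)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Derive a pattern: how similar are two documents' term vectors?"""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up sample documents: two about finance, one about sport.
documents = [
    "The stock market fell sharply on Monday.",
    "Investors sold stock as the market dropped.",
    "The football team won the championship game.",
]
vectors = tf_idf_vectors(documents)
sim_finance = cosine_similarity(vectors[0], vectors[1])
sim_mixed = cosine_similarity(vectors[0], vectors[2])
print(f"finance vs finance: {sim_finance:.3f}")
print(f"finance vs sport:   {sim_mixed:.3f}")
```

In this toy corpus the two finance sentences share the terms "stock" and "market", so they score higher than the finance and sport pair, which share no terms after stop-word removal. Interpreting such scores, for example by grouping similar documents into clusters, corresponds to the evaluation stage.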


Labour-intensive manual text-mining approaches first surfaced in the mid-1980s, but technological advances have enabled the field to progress swiftly during the past decade. Text mining is an interdisciplinary field that draws on information retrieval, data mining, machine learning, statistics, and computational linguistics. Because most information (over 80% by some estimates) is currently stored as text, text mining is believed to have high commercial potential.

Recently, text mining has been receiving attention in many areas, most notably in the security, commercial, and academic fields.

Probably one of the largest text mining applications in existence is the classified ECHELON surveillance system.

Research and development departments of major companies, including IBM and Microsoft, are researching text mining techniques and developing programs to further automate the mining and analysis processes.

Text mining is important to publishers who hold large databases of information that must be indexed for retrieval. This is particularly true in scientific disciplines, in which highly specific information is often contained within written text. Initiatives have therefore been launched, such as Nature's proposal for an Open Text Mining Interface (OTMI) and NIH's common Journal Publishing Document Type Definition (DTD), that would provide semantic cues allowing machines to answer specific queries about the content of a text without removing publisher barriers to public access.

Academic institutions have also become involved in text mining. The National Centre for Text Mining (NaCTeM) is a collaborative effort between the Universities of Manchester, Liverpool and Salford, funded by the Joint Information Systems Committee (JISC) and two of the UK Research Councils. It aims to provide tools, carry out research and offer advice to the academic community, with an initial focus on text mining in the biological and biomedical sciences. In the United States, the School of Information at the University of California, Berkeley is developing a program called BioText to assist bioscience researchers in text mining and analysis.

  • Anderson Analytics - provider of text analytics and content analysis especially as it relates to consumer behavior.
  • Attensity - suite of text mining solutions for a variety of industries.
  • Autonomy - suite of text mining, clustering and categorization solutions for a variety of industries.
  • Clarabridge - text mining and categorization applications for customer, healthcare, and investigative analytics.
  • Clearforest - text mining software to extract meaning from various forms of textual information.
  • Cortex Intelligence - provider of text and web content analytics.
  • IBM Intelligent Miner for Text - commercial text mining software
  • Inxight - provider of text analytics, search, and unstructured visualization technologies.
  • Nstein Technologies - provider of text analytics, and asset/web content management technologies (media, e-publishing, online publishing).
  • PolyAnalyst - commercial text mining software
  • SAS Enterprise Miner - commercial text mining software
  • SPSS - provider of TextSmart, SPSS Text Analysis for Surveys, and Clementine, commercial text analysis software
  • TEMIS - software vendor providing information discovery solutions for businesses.
  • TextAnalyst - commercial text mining software
  • Textalyser - an online text analysis tool for generating text analysis statistics of web pages and other texts.
  • Topicalizer - an online text analysis tool for generating text analysis statistics of web pages and other texts.
  • The "Ultimate Research Assistant" - a knowledge management tool that uses a combination of traditional search engine technology and text mining techniques to facilitate online research of complex topics.

Until recently, websites most often used text-based lexical searches, which match the literal words of a query. Text mining may allow queries to be answered directly by the semantic web. Text mining is also used in some email spam filters.
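As an illustration of how text mining can support spam filtering, the following is a minimal sketch of a naive Bayes text classifier. The article does not name a specific technique, so the choice of naive Bayes with add-one smoothing is an assumption, and the class name, method names, and training messages are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the message and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesSpamFilter:
    """Hypothetical toy classifier; assumes both labels have been trained."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Record the words of one labelled message ('spam' or 'ham')."""
        self.word_counts[label].update(tokenize(text))
        self.message_counts[label] += 1

    def classify(self, text):
        """Return the label with the higher log-probability for the message."""
        vocabulary = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_messages = sum(self.message_counts.values())
        scores = {}
        for label in ("spam", "ham"):
            # Log prior: fraction of training messages with this label.
            score = math.log(self.message_counts[label] / total_messages)
            total_words = sum(self.word_counts[label].values())
            for token in tokenize(text):
                count = self.word_counts[label][token]
                # Add-one (Laplace) smoothing keeps unseen words from zeroing the score.
                score += math.log((count + 1) / (total_words + len(vocabulary)))
            scores[label] = score
        return max(scores, key=scores.get)


# Made-up training messages, for illustration only.
spam_filter = NaiveBayesSpamFilter()
spam_filter.train("win free money now", "spam")
spam_filter.train("free prize claim now", "spam")
spam_filter.train("meeting agenda for monday", "ham")
spam_filter.train("lunch with the project team", "ham")

print(spam_filter.classify("claim your free money"))
print(spam_filter.classify("project meeting on monday"))
```

A production filter would train on far more messages and typically combines such word statistics with other signals; this sketch only shows the word-statistics idea.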
