Text Data Mining:
The discovery by computer of new, previously unknown information through the automatic extraction of information from a usually large collection of different unstructured textual resources. Data mining on text has been designated at various times as statistical text processing, knowledge discovery in text, intelligent text analysis, or natural language processing, depending on the application and the methodology used.
Examples of text mining tasks include classifying documents into a set of specified topic areas (supervised learning), grouping documents so that the members of each group are similar in meaning (clustering, or unsupervised learning), and finding documents that satisfy some search criteria (information retrieval).
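The third of these tasks, information retrieval, can be sketched with a minimal bag-of-words model: represent the query and each document as term-count vectors and rank documents by cosine similarity. Everything here (the `search` function, the toy documents) is illustrative, not part of any particular library.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split a document into word tokens."""
    return text.lower().split()

def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def search(query, documents):
    """Rank documents by bag-of-words similarity to the query."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(d))), d) for d in documents]
    return [d for score, d in sorted(scored, reverse=True) if score > 0]

docs = [
    "the cat sat on the mat",
    "stock markets fell sharply today",
    "a cat and a dog played in the yard",
]
print(search("cat dog", docs)[0])  # the document mentioning both terms ranks first
```

Real retrieval systems add term weighting, stemming, and inverted indexes, but the ranking idea is the same.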
Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner. As was previously stated, text data mining is concerned with data mining methodologies applied to textual sources. A document is a sequence of words and punctuation, following the grammatical rules of the language. A document is any relevant segment of text and can be of any length. Examples of documents include sentences, paragraphs, sections, chapters, books, web pages, emails, etc. A term is usually a word, but it can also be a word-pair or phrase. In this article, I will use term and word interchangeably. A corpus is a collection of documents, and we will see that many of the feature extraction methods are corpus-based.
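TF-IDF is one such corpus-based feature extraction method: a term's weight in a document grows with its frequency there (TF) and shrinks with the number of documents in the corpus that contain it (IDF). A minimal sketch, with a toy corpus chosen only for illustration:

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Compute TF-IDF weights for every term in every document.

    Returns one dict of {term: weight} per document. The IDF factor
    is what makes the method corpus-based: a term's weight depends
    on the whole document collection, not just one document.
    """
    docs = [doc.lower().split() for doc in corpus]
    n = len(docs)
    # document frequency: how many documents contain each term?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return weights

corpus = ["the cat sat", "the dog barked", "the cat and the dog met"]
w = tf_idf(corpus)
# "the" occurs in every document, so log(n / df) = log(1) = 0:
print(w[0]["the"])  # 0.0
```

Terms that appear everywhere (like "the") get zero weight, while terms concentrated in few documents get high weight, which is exactly why TF-IDF vectors are useful features for the classification and clustering tasks above.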
Text mining, sometimes referred to as text data mining and roughly equivalent to text analytics, is the process of deriving high-quality information from text. High-quality information is typically derived by identifying patterns and trends through means such as statistical pattern learning. Text mining usually involves structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluating and interpreting the output. 'High quality' in text mining usually refers to some combination of relevance, novelty, and interestingness. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities).
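The structure-then-mine process described above can be sketched end to end: tokenize each raw document, remove some features (here, a small hypothetical stopword list), insert the structured result into a database, and then derive a simple pattern from it with a query. The table layout and stopword set are assumptions for illustration only.

```python
import sqlite3

# A small illustrative stopword list; real systems use curated lists.
STOPWORDS = {"the", "a", "an", "and", "of", "in", "to"}

def structure(doc_id, text, conn):
    """Structure one raw document: tokenize, drop stopwords
    (removal of some features), and insert term counts into a database."""
    tokens = [t for t in text.lower().split() if t not in STOPWORDS]
    for term in set(tokens):
        conn.execute(
            "INSERT INTO terms (doc_id, term, freq) VALUES (?, ?, ?)",
            (doc_id, term, tokens.count(term)),
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE terms (doc_id INTEGER, term TEXT, freq INTEGER)")

docs = ["the cat chased the mouse", "a mouse ran from the cat and the dog"]
for i, doc in enumerate(docs):
    structure(i, doc, conn)

# Derive a simple pattern from the structured data:
# which terms occur in more than one document?
rows = conn.execute(
    "SELECT term FROM terms GROUP BY term HAVING COUNT(DISTINCT doc_id) > 1"
).fetchall()
print(sorted(r[0] for r in rows))  # ['cat', 'mouse']
```

Once the text is in fielded form like this, the pattern-deriving step can use any ordinary data mining machinery; the final evaluation and interpretation step remains with the analyst.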
Text analysis involves information retrieval, lexical analysis to study word frequency distributions, pattern recognition, tagging/annotation, information extraction, data mining techniques including link and association analysis, visualization, and predictive analytics. The overarching goal is, essentially, to turn text into data for analysis via the application of natural language processing (NLP) and analytical methods. A typical application is to scan a set of documents written in a natural language and either model the document set for predictive classification purposes or populate a database or search index with the information extracted.
Text mining and text analytics:
The term text analytics describes a set of linguistic, statistical, and machine learning techniques that model and structure the information content of textual sources for business intelligence, exploratory data analysis, research, or investigation. The term is roughly synonymous with text mining; indeed, Ronen Feldman modified a 2000 description of text mining in 2004 to describe text analytics. The latter term is now used more frequently in business settings, while "text mining" is used in some of the earliest application areas, dating to the 1980s, notably life-sciences research and government intelligence.
The term text analytics also describes the application of text analytics to respond to business problems, whether independently or in conjunction with query and analysis of fielded, numerical data. It is a truism that 80 percent of business-relevant information originates in unstructured form, primarily text. These techniques and processes discover and present knowledge – facts, business rules, and relationships – that is otherwise locked in textual form, impenetrable to automated processing.