Extract Text Mine

Entity Extraction

With Text Mine, you can extract the names of people, places, organizations, and other entities from text. For example, the following entities will be extracted from this passage:

The growing popularity of Linux in Asia, Europe, and the U.S. is a major concern for Microsoft. It costs less than 1 USD a month to maintain a Linux PC in Asia. By 2007, over 500,000 PCs sold in Asia may be Linux-based.

Token	Type
Linux	tech
Asia	place
Europe	place
U.S.	place
Microsoft	org
1	number
USD	currency
PC	tech
2007	number
500,000	number
PCs	miscellaneous

The types assigned to extracted entities depend on the contents of the dictionary tables for people, places, and organizations. The dictionaries can be customized by adding new entities, for example the terms used in a particular technology.
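
As a rough illustration of how dictionary contents drive the assigned types, here is a minimal Python sketch. It is not the Text Mine implementation (which is written in Perl); the names ENTITY_DICT, lookup_type, and add_entries are hypothetical, and the handful of entries is only a stand-in for the real dictionary tables.

```python
import re

# Hypothetical stand-in for the dictionary tables (assumption: a flat
# lowercase-name -> type mapping; the real tables are larger and richer).
ENTITY_DICT = {
    "linux": "tech",
    "pc": "tech",
    "asia": "place",
    "europe": "place",
    "u.s.": "place",
    "microsoft": "org",
    "usd": "currency",
}

# Digits with optional thousands separators and an optional decimal part.
NUMBER_RE = re.compile(r"^\d[\d,]*(\.\d+)?$")


def lookup_type(token, dictionary=ENTITY_DICT):
    """Return the entity type for a token, or None if nothing matches."""
    key = token.lower()
    if key in dictionary:
        return dictionary[key]
    if NUMBER_RE.match(token):
        return "number"
    return None


def add_entries(dictionary, entries):
    """Return a copy of the dictionary extended with custom entries."""
    extended = dict(dictionary)
    extended.update({name.lower(): etype for name, etype in entries.items()})
    return extended


custom = add_entries(ENTITY_DICT, {"Ubuntu": "tech"})
print(lookup_type("Ubuntu", custom))
```

In this sketch a token missing from the dictionary simply gets no type, so adding an entry is what makes a new entity visible to the extractor.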

Modules

Entity - The text is passed to the entity_ex function, which extracts entities and assigns each one a type. The function also returns the sentences in which the entities occur, which is useful for identifying key sentences for questions. See tx_entities.pl in the cgi-bin directory.

Function Calls for Entity_ex

The entity_ex function uses text_split to generate sentences from the passed text. A well-formatted article, such as a news article, will be split into sentences. If the text is not well formatted (for example, text from a web page), text_split will attempt to split it into chunks using vectors: adjacent chunks of text are compared to check for a break.
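
The two splitting strategies can be sketched in Python as follows. This is an assumption-laden illustration, not the text_split code: the sentence regex, the fixed window size, the word-count vectors, the cosine measure, and the threshold are all choices made for this example, and the function names are hypothetical.

```python
import math
import re
from collections import Counter

# Split after ., ! or ? when followed by whitespace and a capital letter.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def split_sentences(text):
    """Split well-formatted text into sentences."""
    return [s.strip() for s in SENTENCE_END.split(text) if s.strip()]


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def split_chunks(text, size=20, threshold=0.1):
    """Split unformatted text into chunks.

    The text is cut into fixed-size word windows; a break is placed
    between adjacent windows whose vectors are dissimilar.
    """
    words = text.split()
    windows = [words[i:i + size] for i in range(0, len(words), size)]
    if not windows:
        return []
    chunks = [list(windows[0])]
    for prev, cur in zip(windows, windows[1:]):
        sim = cosine(
            Counter(w.lower() for w in prev),
            Counter(w.lower() for w in cur),
        )
        if sim < threshold:
            chunks.append(list(cur))   # dissimilar: start a new chunk
        else:
            chunks[-1].extend(cur)     # similar: same chunk continues
    return [" ".join(c) for c in chunks]
```

Applied to the sample passage in the Entity Extraction section, split_sentences yields three sentences, because the period in "U.S." is followed by a lowercase word and so does not trigger a split.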

For each sentence, tokens are extracted using assemble_tokens. In the first pass, the tokens are classified into potential entities with associated entity types. In subsequent passes, the type of each entity is evaluated using context as well as rules for entities stored in a table. Finally, the most likely entity type is set for every potential entity.
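
The multi-pass idea can be sketched in Python as below. This is only an illustration under stated assumptions: the candidate table, the context rules keyed on the preceding word, and the function names first_pass and resolve are all hypothetical, and the actual rules table in Text Mine is not reproduced here.

```python
# Hypothetical candidate types per token (assumption: ordered by likelihood).
CANDIDATES = {
    "washington": ["person", "place"],
    "microsoft": ["org"],
    "asia": ["place"],
}

# Hypothetical context rules: preceding word -> entity type it favors.
CONTEXT_RULES = {
    "in": "place",
    "mr.": "person",
    "president": "person",
}


def first_pass(tokens):
    """Attach every candidate type to each token found in the table."""
    return [(tok, list(CANDIDATES.get(tok.lower(), []))) for tok in tokens]


def resolve(tagged):
    """Set one type per potential entity, using the preceding word as context."""
    result = []
    for i, (tok, types) in enumerate(tagged):
        if not types:
            continue  # not a potential entity
        prev = tagged[i - 1][0].lower() if i > 0 else ""
        favored = CONTEXT_RULES.get(prev)
        # Use the context rule when it agrees with a candidate type;
        # otherwise fall back to the first (most likely) candidate.
        chosen = favored if favored in types else types[0]
        result.append((tok, chosen))
    return result


print(resolve(first_pass("He lives in Washington".split())))
print(resolve(first_pass("President Washington spoke".split())))
```

Here the same token resolves to "place" after "in" and to "person" after "President", which is the kind of decision the later passes are described as making.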

POS Tagger

The parts of speech tagger classifies the words in text into one of 8 parts of speech. For example, the sentence "Albert Einstein was one of the greatest scientists of all time." would be classified as follows:

Token	POS Type
Albert Einstein	noun
was	verb
one	adjective
of	preposition
the	determiner
greatest	adjective
scientists	noun
of all time	adverb
.	punctuation

Modules

Pos - The text is passed to the pos_tagger function, which extracts the parts of speech. Each token is assigned a type chosen from the list of possible types for that token. The most likely POS for the token is selected using context and frequency of occurrence. See tx_pos.pl in the cgi-bin directory.

Function Calls for Pos_tagger

The pos_tagger function calls lex_tagger, which uses the dictionary to find the list of potential tags for each word. The tagger then scans the list of tokens and associated tags returned by lex_tagger and uses rules and the frequency of tag usage to set a single tag for every token.
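
A minimal Python sketch of this two-step process follows. It is not the pos_tagger or lex_tagger code: the lexicon, its frequency counts, the single contextual rule, the noun default for unknown words, and the names lex_tagger_sketch and choose_tags are all assumptions made for illustration.

```python
# Hypothetical lexicon: word -> {tag: frequency of that tag for the word}.
LEXICON = {
    "the": {"determiner": 100},
    "can": {"verb": 70, "noun": 30},
    "rusted": {"verb": 60, "adjective": 40},
    "opener": {"noun": 10},
}


def lex_tagger_sketch(tokens):
    """Look up the potential tags for each token.

    Assumption: unknown words default to a single noun candidate.
    """
    return [(t, dict(LEXICON.get(t.lower(), {"noun": 1}))) for t in tokens]


def choose_tags(candidates):
    """Set a single tag per token using one rule plus tag frequency."""
    tagged = []
    for token, tags in candidates:
        prev_tag = tagged[-1][1] if tagged else None
        # Illustrative rule: directly after a determiner, drop the verb
        # reading when another candidate tag is available.
        if prev_tag == "determiner" and len(tags) > 1:
            tags = {t: f for t, f in tags.items() if t != "verb"}
        # Otherwise the most frequent remaining tag wins.
        best = max(tags, key=tags.get)
        tagged.append((token, best))
    return tagged


print(choose_tags(lex_tagger_sketch("The can rusted".split())))
```

On its own, "can" is tagged as a verb because that tag is more frequent in the sketch lexicon; after "The", the rule removes the verb reading and the noun tag is chosen instead.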

A Decision Tree for a Part of Speech Tagger

Projects

1. Generate more training data to build an improved set of rules for extracting entities.

2. Add and correct entries in the entity dictionaries.

3. Improve the parts of speech tagger with more training data, and verify its rules.