About Text Mine

Entity Extraction

With Text Mine, you can extract the names of people, places, and organizations, as well as other entities, from text. For example, the entities listed below are extracted from the following passage:

The growing popularity of Linux in Asia, Europe, and the U.S. is a major concern for Microsoft. It costs less than 1 USD a month to maintain a Linux PC in Asia. By 2007, over 500,000 PCs sold in Asia may be Linux-based.

Token       Type
Linux       tech
Asia        place
Europe      place
U.S.        place
Microsoft   org
1           number
USD         currency
PC          tech
2007        number
500,000     number
PCs         miscellaneous

The type assigned to each extracted entity depends on the contents of the dictionary tables for people, places, and organizations. The dictionaries can be customized by adding new entities, for example the terms used in a particular technology.
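The dictionary-driven typing described above can be illustrated with a minimal Python sketch. It is not Text Mine's code: the table contents, the lookup_type and add_entries helpers, and the fallback to "miscellaneous" for unknown tokens are all assumptions made for illustration.

```python
# Hypothetical dictionary table; the real Text Mine tables are not shown here.
ENTITY_DICTIONARY = {
    "asia": "place",
    "europe": "place",
    "u.s.": "place",
    "microsoft": "org",
    "usd": "currency",
}


def lookup_type(token, dictionary):
    """Return the dictionary type for a token, or 'miscellaneous' if unknown."""
    return dictionary.get(token.lower(), "miscellaneous")


def add_entries(dictionary, entries):
    """Return a copy of the dictionary extended with custom token -> type entries."""
    extended = dict(dictionary)
    for token, entity_type in entries.items():
        extended[token.lower()] = entity_type
    return extended


# Customize the dictionary with technology terms.
custom = add_entries(ENTITY_DICTIONARY, {"Linux": "tech", "PC": "tech"})

print(lookup_type("Linux", ENTITY_DICTIONARY))  # miscellaneous
print(lookup_type("Linux", custom))             # tech
```

The same token receives a different type before and after the dictionary is extended, which is the effect of customizing the tables.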

Modules

Entity - The text is passed to the entity_ex function, which extracts the entities and assigns each one a type. The function also returns the sentences in which the entities occur, which is useful for identifying key sentences for questions. See tx_entities.pl in the cgi-bin directory.

Function Calls for entity_ex

The entity_ex function uses text_split to generate sentences from the passed text. A well-formatted article, such as a news article, is split into sentences. If the text is not well formatted (for example, text from a web page), text_split attempts to split it into chunks using vectors: adjacent chunks of text are compared to check for a break between them.
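The page does not say how the chunk vectors are built or compared. As one plausible reading, the sketch below assumes bag-of-words term-frequency vectors and cosine similarity, and treats a low similarity between adjacent chunks as a break. The function names and the 0.1 threshold are hypothetical and are not taken from text_split.

```python
import math
import re
from collections import Counter


def term_vector(chunk):
    """Build a bag-of-words term-frequency vector for a chunk of text."""
    return Counter(re.findall(r"[a-z0-9]+", chunk.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_breaks(chunks, threshold=0.1):
    """Return each index i where a break is assumed between chunks[i] and chunks[i + 1]."""
    vectors = [term_vector(chunk) for chunk in chunks]
    return [
        i
        for i in range(len(vectors) - 1)
        if cosine(vectors[i], vectors[i + 1]) < threshold
    ]


chunks = [
    "Linux sales in Asia are growing",
    "Asia buyers prefer Linux for low cost",
    "Contact us about advertising rates",
]
print(find_breaks(chunks))  # [1]
```

The first two chunks share the terms "linux" and "asia", so they stay together. The third chunk shares no terms with the second, so a break is reported after index 1.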

For each sentence, tokens are extracted using assemble_tokens. In the first pass, the tokens are classified into potential entities with associated entity types. In subsequent passes, the types of each potential entity are evaluated using context as well as the entity rules stored in a table. Finally, the most likely entity type is set for every potential entity.
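The passes described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the candidate table, the numeric scores, and the two context rules (a preceding "in" favors a place, a preceding "Mr." favors a person) are invented for the example and do not come from the Text Mine rule table.

```python
# Hypothetical candidate types with starting scores for each known token.
CANDIDATES = {
    "Washington": {"person": 0.5, "place": 0.5},
    "Microsoft": {"org": 1.0},
}

# Hypothetical context rules: (previous token, entity type, score boost).
CONTEXT_RULES = [
    ("in", "place", 0.3),
    ("Mr.", "person", 0.3),
]


def first_pass(tokens):
    """Attach a copy of the candidate types to every token (empty if unknown)."""
    return [(token, dict(CANDIDATES.get(token, {}))) for token in tokens]


def context_pass(scored):
    """Boost a candidate type when the previous token matches a context rule."""
    for i in range(1, len(scored)):
        previous = scored[i - 1][0]
        scores = scored[i][1]
        for trigger, entity_type, boost in CONTEXT_RULES:
            if previous == trigger and entity_type in scores:
                scores[entity_type] += boost
    return scored


def final_pass(scored):
    """Keep tokens that have candidates and pick the highest-scoring type."""
    return [(token, max(scores, key=scores.get)) for token, scores in scored if scores]


def extract(tokens):
    return final_pass(context_pass(first_pass(tokens)))


print(extract(["Microsoft", "opened", "an", "office", "in", "Washington"]))
# [('Microsoft', 'org'), ('Washington', 'place')]
print(extract(["Mr.", "Washington", "joined", "Microsoft"]))
# [('Washington', 'person'), ('Microsoft', 'org')]
```

The ambiguous token "Washington" starts with two equally likely types, and the surrounding context decides which one is kept.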

POS Tagger

The part-of-speech (POS) tagger classifies each word in the text into one of 8 parts of speech. For example, the sentence "Albert Einstein was one of the greatest scientists of all time." would be classified as follows:

Token             POS Type
Albert Einstein   noun
was               verb
one               adjective
of                preposition
the               determiner
greatest          adjective
scientists        noun
of all time       adverb
.                 punctuation

Modules

Pos - The text is passed to the pos_tagger function, which extracts the parts of speech. Each token is assigned a type chosen from the list of possible types for that token. The most likely POS for the token is selected using context and frequency of occurrence. See tx_pos.pl in the cgi-bin directory.

Function Calls for pos_tagger

The pos_tagger function calls lex_tagger, which uses the dictionary to find the list of potential tags for each word. The tagger then scans the list of tokens and associated tags returned by lex_tagger and uses rules and the frequency of tag usage to set a single tag for every token.
A Decision Tree for a Part of Speech Tagger
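The two-step tagging described above (dictionary lookup followed by rules and tag frequency) might look roughly like the Python sketch below. The function names mirror the text, but the bodies are illustrative stand-ins, not the Text Mine code. The lexicon, its frequencies, the default tag for unknown words, and the single rule (a word directly after a determiner is not tagged as a verb) are all assumptions.

```python
# Hypothetical lexicon: word -> {tag: relative frequency of that tag for the word}.
LEXICON = {
    "the": {"determiner": 1.0},
    "was": {"verb": 1.0},
    "one": {"adjective": 0.6, "noun": 0.4},
    "greatest": {"adjective": 1.0},
    "scientists": {"noun": 1.0},
    "run": {"verb": 0.7, "noun": 0.3},
}


def lex_tagger(tokens):
    """Return (token, candidate tags) pairs; unknown words are assumed to be nouns."""
    return [(token, LEXICON.get(token.lower(), {"noun": 1.0})) for token in tokens]


def pos_tagger(tokens):
    """Pick one tag per token using an assumed context rule, then tag frequency."""
    tagged = []
    for token, candidates in lex_tagger(tokens):
        previous_tag = tagged[-1][1] if tagged else None
        # Assumed rule: directly after a determiner, drop the verb reading
        # whenever the word has at least one other candidate tag.
        if previous_tag == "determiner" and len(candidates) > 1:
            candidates = {t: f for t, f in candidates.items() if t != "verb"}
        # Frequency: choose the most frequent remaining tag.
        tagged.append((token, max(candidates, key=candidates.get)))
    return tagged


print(pos_tagger(["They", "run"]))
# [('They', 'noun'), ('run', 'verb')]
print(pos_tagger(["the", "run"]))
# [('the', 'determiner'), ('run', 'noun')]
```

On its own, "run" receives its most frequent tag (verb). After a determiner, the context rule removes that reading and the noun tag is selected instead.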

Projects

1. Generate more training data to build an improved set of rules for extracting entities.

2. Add and correct entries in the entity dictionaries.

3. Improve the part-of-speech tagger with more training data and verify its rules.