About Text Mine

Search the Web / Local Machine

Search comes in two flavors: you can search the Web using a spider, or you can search your local machine.

In the first case, Text Mine controls a spider through a set of parameters defined for the crawl. An unconstrained spider will collect everything accessible from a set of links and quickly overwhelm the machine with junk. The parameters that constrain the spider include specific sites to visit, specific sites NOT to visit, a limit on the number of URLs, a limit on the time to fetch a URL, the number of spider levels, and keywords to prioritize the crawl.
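The constraints listed above can be sketched as a small parameter object plus two checks. This is only an illustration in Python, not Text Mine's own code: the names CrawlParams, should_visit, and priority, and all default values, are assumptions made for the example.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class CrawlParams:
    """Hypothetical container for the crawl constraints (not Text Mine's API)."""
    allowed_sites: set = field(default_factory=set)  # sites to visit (empty = any)
    blocked_sites: set = field(default_factory=set)  # sites NOT to visit
    max_urls: int = 100          # limit on the number of URLs fetched
    fetch_timeout: float = 10.0  # seconds allowed per fetch (for the fetcher)
    max_levels: int = 2          # number of spider levels to follow
    keywords: tuple = ()         # keywords used to prioritize the crawl


def should_visit(url, level, fetched_count, params):
    """Return True if the URL passes every constraint in params."""
    host = urlparse(url).hostname or ""
    if fetched_count >= params.max_urls or level > params.max_levels:
        return False
    if host in params.blocked_sites:
        return False
    if params.allowed_sites and host not in params.allowed_sites:
        return False
    return True


def priority(link_text, params):
    """Count keyword hits in the link text; higher scores are crawled first."""
    text = link_text.lower()
    return sum(text.count(k.lower()) for k in params.keywords)
```

A spider loop would call should_visit before queuing each link and sort the queue by priority, so that links mentioning the crawl keywords are fetched first.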

The spider can be controlled interactively, and multiple spiders can run in parallel. The results collected by a spider are indexed and saved in a directory. The spider collection can then be searched, with the results ranked. Other Text Mine tools such as Extract and Cluster can also be applied to the collection.

In the second case, Text Mine searches your local machine using an index built from both automatic and manual entries. You can index images, audio, video, and formatted files in any directory. Images, audio, and video are indexed automatically using the full path name, including the directory and file name. The resulting index terms may or may not be meaningful. For an accurate index, multimedia files should be indexed manually.
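As a rough illustration of indexing by full path name, the Python sketch below splits a path into lowercase terms. The function name path_terms and the splitting rule are assumptions for this example; Text Mine's actual rule may differ.

```python
import re
from pathlib import PurePosixPath


def path_terms(path):
    """Split a full path name into lowercase index terms."""
    p = PurePosixPath(path)
    # Keep the directory names and the file name without its extension.
    parts = list(p.parent.parts) + [p.stem]
    terms = []
    for part in parts:
        # Split on anything that is not a letter or digit.
        terms.extend(t.lower() for t in re.split(r"[^A-Za-z0-9]+", part) if t)
    return terms


print(path_terms("/home/user/photos/2004_paris-trip/IMG_0012.jpg"))
# ['home', 'user', 'photos', '2004', 'paris', 'trip', 'img', '0012']
```

Terms such as paris and trip are useful, while img and 0012 say nothing about the picture, which is why manual entries give a more accurate index.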

Formatted documents are converted to text if a filter exists for the document type. Currently, HTML files, POD files, and plain text files are handled. Boolean search queries are supported, and search results can be restricted to a date range.
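The idea of one filter per document type can be sketched as a lookup table keyed by file extension. This Python sketch is hypothetical: the FILTERS table, to_text, and html_to_text are names invented for the example, and it covers only HTML and plain text (a POD filter is left out).

```python
import os
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect the text between tags, skipping script and style bodies."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())


def html_to_text(markup):
    """Strip tags from an HTML string and return the visible text."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical filter table keyed by file extension.
FILTERS = {
    ".html": html_to_text,
    ".htm": html_to_text,
    ".txt": lambda text: text,
}


def to_text(filename, content):
    """Return plain text, or None when no filter exists for the type."""
    ext = os.path.splitext(filename)[1].lower()
    convert = FILTERS.get(ext)
    return convert(content) if convert else None
```

Supporting a new document type then amounts to adding one entry to the table.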

Modules

Index - Used to extract the content words from a query and match the text of a document against the query. The content words are candidate index terms. The documents that contain the query's content words are collected and ranked by the frequency of occurrence of those words. See me_search.pl in the cgi-bin directory for usage.
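Ranking by frequency of occurrence can be sketched as follows. This is a minimal Python illustration, not the Index module itself: rank_documents is a made-up name, and scoring by raw term counts is an assumption about what "frequency of occurrence" means here.

```python
import re
from collections import Counter


def rank_documents(query_words, documents):
    """Return (doc_id, score) pairs sorted by descending score.

    documents maps a document id to its text; the score is the total number
    of occurrences of the query content words in that text.
    """
    wanted = {w.lower() for w in query_words}
    scores = []
    for doc_id, text in documents.items():
        counts = Counter(re.findall(r"[a-z0-9']+", text.lower()))
        score = sum(counts[w] for w in wanted)
        if score > 0:  # documents with no matching content word are dropped
            scores.append((doc_id, score))
    # Highest score first; ties are broken by document id for a stable order.
    return sorted(scores, key=lambda pair: (-pair[1], pair[0]))


docs = {
    "a": "Iran attacks continued; attacks were reported.",
    "b": "The war spokesman spoke.",
    "c": "Nothing relevant here.",
}
print(rank_documents(["attacks", "war"], docs))
# [('a', 2), ('b', 1)]
```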

Function Calls for content_words

The content_words function extracts the words in a text that are not function words. For example, given the sentence: Iraq's war spokesman said Iran had kept up its attacks and had made "boasts of illusory victories." the function would extract the content words Iraq's, war, spokesman, Iran, kept up, attacks, made, boasts, illusory, and victories.
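A simple way to approximate this behavior is to drop every word found in a stop list. The Python sketch below is not the content_words implementation: extract_content_words and the small STOP_WORDS set are assumptions for illustration, and the sketch does not keep multi-word phrases such as "kept up" together.

```python
import re

# A tiny, illustrative stop list; a real list would be much longer.
STOP_WORDS = {
    "a", "an", "and", "had", "has", "have", "in", "is", "its",
    "of", "on", "said", "the", "to", "up", "was", "were",
}


def extract_content_words(text):
    """Return the words of text that are not in the stop list, in order."""
    # A word is a run of letters, optionally followed by an apostrophe suffix.
    tokens = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    return [t for t in tokens if t.lower() not in STOP_WORDS]


sentence = ("Iraq's war spokesman said Iran had kept up its attacks "
            'and had made "boasts of illusory victories."')
print(extract_content_words(sentence))
# ["Iraq's", 'war', 'spokesman', 'Iran', 'kept', 'attacks', 'made',
#  'boasts', 'illusory', 'victories']
```

The only difference from the list in the text is that this sketch yields kept rather than the phrase kept up, because up is treated as a function word on its own.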

A Web Crawler for a Search Engine
Projects

1. Add more filters to extract the text from formatted documents such as Word, PDF, and PostScript files.

2. Enhance the automatic indexing that uses the full path name of a file.