Search the Web / Local Machine

Search comes in two flavors: you can search the Web using a spider,
or you can search your local machine.

In the first case, Text Mine controls a spider with a set of
parameters defined for the crawl. An unconstrained spider will
collect everything accessible from a set of links and quickly
overwhelm the machine with junk. The parameters that constrain the
spider include specific sites to visit, specific sites NOT to visit,
a limit on the number of URLs, a limit on the time to fetch a URL,
the number of spider levels, and keywords to prioritize the crawl.
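
The crawl constraints described above can be sketched as a small admission check. The Python fragment below is only an illustration and is not Text Mine's own code: the names CrawlLimits, should_visit, and priority, as well as every default value, are assumptions made for this example.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class CrawlLimits:
    """Hypothetical container for the crawl parameters (not Text Mine's API)."""
    allowed_sites: set = field(default_factory=set)  # sites to visit (empty = any)
    blocked_sites: set = field(default_factory=set)  # sites NOT to visit
    max_urls: int = 100          # limit on the number of URLs
    fetch_timeout: float = 10.0  # seconds allowed to fetch one URL
    max_levels: int = 2          # number of spider levels
    keywords: tuple = ()         # terms used to prioritize the crawl


def should_visit(url, level, fetched_count, limits):
    """Return True only if the URL passes every constraint."""
    host = urlparse(url).hostname or ""
    if fetched_count >= limits.max_urls or level > limits.max_levels:
        return False
    if host in limits.blocked_sites:
        return False
    return not limits.allowed_sites or host in limits.allowed_sites


def priority(link_text, limits):
    """Count keyword occurrences so that more relevant links are fetched first."""
    text = link_text.lower()
    return sum(text.count(k.lower()) for k in limits.keywords)
```

In a sketch like this, a crawl loop would call should_visit before queueing each link, sort the queue by priority, and hand fetch_timeout to whatever HTTP client performs the download.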

The spider can be controlled interactively, and multiple spiders
can run in parallel. The results collected by a spider are indexed
and saved in a directory. The resulting collection can be searched,
and the search results are ranked. Other Text Mine tools, such as
Extract and Cluster, can be applied to the collection.

In the second case, Text Mine searches your local machine using
an index that combines automatically built entries with manual
entries. You can index images, audio, video, and formatted files in
any directory. Images, audio, and video are indexed automatically
using the full path name, including the directory and file name.
Terms taken from a path may or may not be meaningful, so for an
accurate index, multimedia files should be indexed manually.
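
To see why path-based indexing may or may not be meaningful, consider the following minimal Python sketch. It is an assumption about how index terms could be derived from a full path name, not a description of what Text Mine actually does; the function name path_terms and the sample path are hypothetical.

```python
import re
from pathlib import PurePosixPath


def path_terms(path):
    """Split a full path name into lowercase index terms, dropping the extension."""
    p = PurePosixPath(path)
    pieces = list(p.parent.parts) + [p.stem]
    terms = []
    for piece in pieces:
        # Break each directory or file name on anything that is not a letter or digit.
        terms.extend(t.lower() for t in re.split(r"[^A-Za-z0-9]+", piece) if t)
    return terms


print(path_terms("/home/user/photos/paris_2003/IMG_0042.jpg"))
# prints ['home', 'user', 'photos', 'paris', '2003', 'img', '0042']
```

Here "photos", "paris", and "2003" are useful search terms, while "img" and "0042" say nothing about the picture, which is the case where a manual entry helps.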

Formatted documents are converted to text if a filter exists for
the document type. Currently, HTML files, POD files, and plain text
files are handled. Boolean search queries are supported, and search
results can be restricted to a date range.
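
A Boolean query restricted to a date range can be illustrated with set operations over a toy inverted index. This Python sketch is not Text Mine's query syntax or implementation: the sample documents, the build_index and search functions, and their parameter names are all hypothetical.

```python
from datetime import date

# Hypothetical toy collection: file name -> (date, text)
DOCS = {
    "a.txt": (date(2003, 1, 15), "spider crawl web search"),
    "b.html": (date(2003, 6, 2), "local search index"),
    "c.pod": (date(2004, 3, 9), "index cluster extract"),
}


def build_index(docs):
    """Map each word to the set of documents that contain it."""
    index = {}
    for doc_id, (_, text) in docs.items():
        for word in text.split():
            index.setdefault(word, set()).add(doc_id)
    return index


def search(index, docs, all_of=(), any_of=(), none_of=(), start=None, end=None):
    """Boolean search with an optional, inclusive date range."""
    hits = set(docs)
    for term in all_of:                      # AND: every term must occur
        hits &= index.get(term, set())
    if any_of:                               # OR: at least one term must occur
        hits &= set().union(*(index.get(t, set()) for t in any_of))
    for term in none_of:                     # NOT: none of these may occur
        hits -= index.get(term, set())
    return sorted(
        d for d in hits
        if (start is None or docs[d][0] >= start)
        and (end is None or docs[d][0] <= end)
    )


INDEX = build_index(DOCS)
print(search(INDEX, DOCS, all_of=["search"], end=date(2003, 3, 1)))
# prints ['a.txt']
```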

Modules

Index - Extracts the content words from a query and matches the
text of a document against the query. The content words are
possibly indexed terms. The documents that refer to the query's
content words are collected and ranked by the frequency of
occurrence of those words. See me_search.pl in the cgi-bin
directory for usage.
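
Ranking by frequency of occurrence can be sketched as follows. This Python fragment only illustrates the idea; it is not the Index module's code (see me_search.pl for the real usage), and the function name rank_documents and the sample documents are assumptions.

```python
from collections import Counter


def rank_documents(query_words, docs):
    """Return (doc_id, score) pairs, highest score first.

    The score is the total number of times the query's content words
    occur in the document; documents with no occurrence are left out.
    """
    wanted = {w.lower() for w in query_words}
    scores = {}
    for doc_id, text in docs.items():
        counts = Counter(text.lower().split())
        score = sum(counts[w] for w in wanted)
        if score:
            scores[doc_id] = score
    # Sort by descending score, then by document id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


docs = {"d1": "war attacks war", "d2": "attacks", "d3": "peace"}
print(rank_documents(["war", "attacks"], docs))
# prints [('d1', 3), ('d2', 1)]
```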

Function Calls for content_words

The content_words function extracts the words in a text that are
not function words. For example, the content words Iraq's, war,
spokesman, Iran, kept up, attacks, made, boasts, illusory, and
victories would be extracted from the sentence: Iraq's war spokesman
said Iran had kept up its attacks and had made "boasts of illusory
victories."
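
The example above can be reproduced with a minimal Python sketch. It is an illustration of the idea only, not the Text Mine content_words implementation: the stop list and the phrase list are hypothetical and just large enough for this one sentence, and treating "said" as a word to drop is an assumption made so that the output matches the list in the text.

```python
import re

# Hypothetical lists, chosen only to cover the example sentence.
FUNCTION_WORDS = {"said", "had", "its", "and", "of", "the", "a", "an"}
PHRASES = {("kept", "up")}  # two-word units kept together


def content_words(text):
    """Return the words of text that are not function words, in order."""
    tokens = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    words = []
    i = 0
    while i < len(tokens):
        pair = tuple(t.lower() for t in tokens[i:i + 2])
        if pair in PHRASES:
            words.append(" ".join(tokens[i:i + 2]))  # keep the phrase whole
            i += 2
        elif tokens[i].lower() in FUNCTION_WORDS:
            i += 1                                   # skip function words
        else:
            words.append(tokens[i])
            i += 1
    return words


sentence = ('Iraq\'s war spokesman said Iran had kept up its attacks and '
            'had made "boasts of illusory victories."')
print(content_words(sentence))
```

A realistic stop list would be far longer, and phrase detection would normally come from a dictionary of collocations rather than a hand-written set.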

A Web Crawler for a Search Engine

Projects

1. Add more filters to extract the text from formatted documents
such as Word, PDF, and PostScript files.
2. Enhance the automatic indexing that uses the full path name of
a file.