Extract Text Mine


		Question & Answer

		Text Mine handles two types of question-and-answer problems. In the first, given a FAQ with several hundred questions and a question submitted in natural language, Text Mine finds the 10 closest FAQ questions. For example, in the Perl FAQ of about 300 questions, the following questions are the closest to the submitted question.

		Submitted question: How do I search strings?
		Top 10 Closest Questions:
		1. How do I strip blank space from the beginning/end of a string?
		2. Why doesn't & work the way I want it to?
		3. What's wrong with always quoting "$vars"?
		4. How can I count the number of occurrences of a substring within a string?
		5. How do I expand function calls in a string?
		6. How do I reverse a string?
		7. How do I use a regular expression to strip C style comments from a file?
		8. How do I pad a string with blanks or pad a number with zeroes?
		9. How can I expand variables in text strings?
		10. How do I expand tabs in a string?
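
		The page does not say how Text Mine measures closeness between questions. As a rough illustration only, the sketch below ranks FAQ questions by cosine similarity over IDF-weighted word counts. It is written in Python rather than Perl, and every name in it (tokenize, stem, closest_questions) is hypothetical and not part of Text Mine.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def stem(token):
    """Crude plural stripping so 'strings' matches 'string' (illustration only)."""
    return token[:-1] if token.endswith("s") and len(token) > 3 else token


def vectorize(text, idf):
    """Build a term -> weight vector; terms unseen in the FAQ get weight 0."""
    counts = Counter(stem(t) for t in tokenize(text))
    return {t: c * idf.get(t, 0.0) for t, c in counts.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def closest_questions(query, faq, top_n=10):
    """Return up to top_n FAQ questions, most similar to the query first."""
    docs = [set(stem(t) for t in tokenize(q)) for q in faq]
    n = len(faq)
    df = Counter(t for d in docs for t in d)
    # Smoothed inverse document frequency: rarer terms weigh more.
    idf = {t: math.log((n + 1) / (c + 1)) + 1.0 for t, c in df.items()}
    qvec = vectorize(query, idf)
    scored = [(cosine(qvec, vectorize(q, idf)), q) for q in faq]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [q for score, q in scored[:top_n] if score > 0]


faq = [
    "How do I reverse a string?",
    "What is a socket?",
    "Why doesn't & work the way I want it to?",
]
result = closest_questions("How do I search strings?", faq)
```

		A purely lexical measure like this one also explains why loosely related questions can appear in a top-10 list: sharing common words such as "I" or "do" is enough to produce a small nonzero score.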

		In the second type of problem, Text Mine submits a question to a search engine on the Web and fetches the pages associated with the top 10 hits. The text of each page is analyzed and broken into text chunks. The text chunk that is most likely to answer the question is presented.

		Text Mine uses a text splitter to break the text of web pages into chunks. It then categorizes the question as a person, place, organization, or other type of question. A sentence containing an entity of the type most likely to answer the question is given a higher weight. Other measures, such as the degree of overlap with the question, are also used to rank the text chunks.
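
		The ranking idea described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not Text Mine's implementation: the sentence splitter is a simple regular expression, the question type is guessed from the interrogative word alone, and a tiny hand-made gazetteer (ENTITIES) stands in for a real entity extractor. All names and the weight of 2.0 are hypothetical.

```python
import re

# Hypothetical mapping from interrogative word to expected answer type.
QUESTION_TYPES = {"who": "person", "where": "place"}

# Toy gazetteer standing in for a real entity extractor (assumption).
ENTITIES = {
    "person": {"larry wall"},
    "place": {"los angeles"},
}

STOPWORDS = {"the", "a", "an", "of", "is", "was", "in", "to", "and",
             "who", "where", "what", "did", "by", "for"}


def split_chunks(text):
    """Split text into sentence-sized chunks after '.', '!' or '?'."""
    return [c.strip() for c in re.split(r"(?<=[.!?])\s+", text) if c.strip()]


def question_type(question):
    """Guess the answer type from the first known interrogative word."""
    for word in re.findall(r"[a-z]+", question.lower()):
        if word in QUESTION_TYPES:
            return QUESTION_TYPES[word]
    return "other"


def content_words(text):
    """Lowercased words of the text with stopwords removed."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def score_chunk(question, chunk, entity_weight=2.0):
    """Overlap with the question, plus a bonus for an entity of the expected type."""
    q_words = content_words(question)
    overlap = len(q_words & content_words(chunk)) / max(len(q_words), 1)
    lowered = chunk.lower()
    names = ENTITIES.get(question_type(question), ())
    has_entity = any(name in lowered for name in names)
    return overlap + (entity_weight if has_entity else 0.0)


def best_chunk(question, text):
    """Return the highest-scoring chunk, or None if the text is empty."""
    chunks = split_chunks(text)
    return max(chunks, key=lambda c: score_chunk(question, c)) if chunks else None


page = ("Perl is a programming language. "
        "Perl was created by Larry Wall in 1987. "
        "Many people use Perl for text processing.")
answer = best_chunk("Who created Perl?", page)
```

		Here the second sentence wins because it both overlaps with the question and contains a person entity, which matches the "who" question type.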

		Modules

		Quanda - This module contains functions to categorize questions, to extract question features (such as the interrogatory nouns, the w-word for the question, and the type of question), and other functions. Check the quanda sub-directory of the tools directory for examples. The CGI script qu_browse.pl searches the Perl FAQ of about 300 questions for the FAQ questions nearest to a submitted question.

		Function Calls for get_qcat

		The get_qcat function returns a question category for the passed question. The get_iwords and get_inouns functions return the interrogatory words of a question and the interrogatory nouns in the neighborhood of those words. A list of bigrams from the question is generated using the tokens returned from assemble_tokens. A vector for the query is generated using gen_vector. The tokens of the vector are expanded using get_rel_words. Similarly, the query nouns for the question are expanded using expand_query. Next, the question features are compared with the features for each question category, and a neural network is used to select a category.
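
		To make the feature-extraction steps more concrete, here is a small Python sketch of the same general idea. It is not the Quanda module and does not reproduce its function signatures. Two simplifications are assumptions of this sketch: tokens following an interrogatory word stand in for interrogatory nouns (no part-of-speech tagging is done), and a plain feature-overlap count replaces the neural network. The category feature sets are invented for illustration.

```python
import re

# Interrogatory words recognized by this sketch.
IWORDS = {"who", "whom", "whose", "what", "when", "where", "which", "why", "how"}

# Hypothetical feature sets per question category (assumption).
CATEGORY_FEATURES = {
    "person": {"who", "whom", "whose", "author", "inventor", "president"},
    "place": {"where", "city", "country", "capital"},
    "org": {"company", "organization", "agency", "university"},
}


def tokens(question):
    """Lowercased alphabetic tokens of the question."""
    return re.findall(r"[a-z]+", question.lower())


def interrogatory_words(toks):
    """Interrogatory words in the order they appear."""
    return [t for t in toks if t in IWORDS]


def neighbors(toks, window=2):
    """Tokens within `window` positions after each interrogatory word."""
    near = []
    for i, t in enumerate(toks):
        if t in IWORDS:
            near.extend(toks[i + 1:i + 1 + window])
    return near


def bigrams(toks):
    """Adjacent token pairs."""
    return list(zip(toks, toks[1:]))


def question_features(question):
    """Collect the features used for categorization."""
    toks = tokens(question)
    return {
        "iwords": interrogatory_words(toks),
        "near": neighbors(toks),
        "bigrams": bigrams(toks),
    }


def categorize(question):
    """Pick the category whose feature set overlaps most; 'other' if none does."""
    feats = question_features(question)
    bag = set(feats["iwords"]) | set(feats["near"])
    scores = {cat: len(bag & words) for cat, words in CATEGORY_FEATURES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "other"


category = categorize("Who wrote the Perl FAQ?")
```

		The bigrams are computed but not used by the overlap scorer here; in a fuller version they would be additional inputs to whatever classifier selects the category.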


		Figure: Processes in a Pipeline to Answer Questions

		Projects

		1. Test with a different set of questions from another FAQ.
		2. Verify the set of questions returned for the Perl FAQ. Can question classification be improved?