Abstract:
This research aims to automate the summarization of frequently asked questions (FAQ) Web pages, a task that captures the most salient pieces of information in the answer to each question. Thousands of FAQ pages exist on many subjects, not only online but also offline, since most products nowadays ship with a leaflet titled FAQ that describes their functionality and usage. Moreover, FAQ Web pages are laid out so that each question has a specific heading style, e.g. bold, underlined, or tagged. The answer then follows in a different, less prominent style, usually a smaller font, and may be split under subheadings if the answer is logically divided. This uniformity in how FAQ Web pages are created, together with their abundance online, makes them well suited to automation.
To achieve this objective, we propose an approach that applies Web page segmentation to detect Q/A (Question and Answer) units and uses a selected set of statistical sentence extraction features to generate the summary. The proposed approach is English language dependent.
These features are question-answer similarity, query overlap, sentence location in answer paragraphs, and capitalized words frequency. The choice of these features was mainly guided by analyzing the different question types and anticipating the expected correct answer.
The first feature (Sentence Similarity) evaluates the semantic similarity between the question sentence and each sentence in the answer. It does so by comparing the word senses of the question words and the answer words, assigning each pair of words a numerical value, and then accumulating these values into a score for the whole sentence.
The second feature (Query Overlap) extracts the nouns, adverbs, adjectives and gerunds from the question sentence, automatically formulates a query from them, and counts the number of matches with each of the answer sentences.
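The query overlap idea can be illustrated with a minimal sketch. It rests on stated assumptions and is not the thesis implementation: a small hypothetical stop-word list stands in for the part-of-speech filter (nouns, adverbs, adjectives and gerunds), and a match is counted once per distinct query term. The names `STOP_WORDS`, `tokenize`, `build_query` and `query_overlap` are illustrative only.

```python
import re

# Assumption: a tiny stop-word list replaces the part-of-speech filter
# described in the text. It is hypothetical and far from complete.
STOP_WORDS = {
    "a", "an", "and", "the", "is", "are", "do", "does", "how", "what",
    "why", "when", "where", "which", "who", "i", "my", "you", "to", "of",
    "in", "on", "for", "can",
}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_query(question):
    """Keep the content-bearing question words as the query terms."""
    return {word for word in tokenize(question) if word not in STOP_WORDS}


def query_overlap(question, answer_sentences):
    """Count, per answer sentence, how many distinct query terms it contains."""
    query = build_query(question)
    return [len(query & set(tokenize(sentence))) for sentence in answer_sentences]


question = "How do I reset my password?"
answer = [
    "Open the account settings page.",
    "Click Reset Password and follow the instructions.",
    "A new password is emailed to you.",
]
print(query_overlap(question, answer))  # prints [0, 2, 1]
```

A part-of-speech tagger would select the query terms more faithfully than a stop-word list; the counting step would stay the same.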
The third feature (Location) gives a higher score to sentences at the beginning of the answer and progressively lower scores to the sentences that follow.
The fourth feature (Capitalized Words Frequency) computes the frequency of capitalized words in a sentence. We give each feature a weight and then combine the features linearly in a single equation, a custom weighted scoring function, to give a cumulative score for each sentence. Each feature, used on its own, performed well in some cases, since each relies on a different kind of evidence. Pilot experiments and analysis helped us obtain a suitable combination of feature weights.
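The weighted linear combination can be sketched as follows. This is an illustration under stated assumptions, not the scoring function used in the thesis: the weight values in `WEIGHTS` are hypothetical (the abstract does not report them), the location score is assumed to decay linearly, and the similarity and overlap values are assumed to be precomputed and normalized to the range 0 to 1.

```python
import re

# Assumption: illustrative weights only; the actual values were tuned in
# pilot experiments and are not given in the abstract.
WEIGHTS = {"similarity": 0.4, "overlap": 0.3, "location": 0.2, "capitalized": 0.1}


def location_score(index, total):
    """Score 1.0 for the first sentence, decreasing linearly afterwards.

    The linear decay is an assumption; the text only says that earlier
    sentences score higher.
    """
    return (total - index) / total


def capitalized_frequency(sentence):
    """Fraction of the words in the sentence that start with a capital letter."""
    tokens = re.findall(r"[A-Za-z0-9']+", sentence)
    if not tokens:
        return 0.0
    capitalized = sum(1 for token in tokens if token[0].isupper())
    return capitalized / len(tokens)


def combined_score(features, weights=WEIGHTS):
    """Linearly combine the feature values into one cumulative sentence score."""
    return sum(weights[name] * features[name] for name in weights)


sentence = "Install Copernic Summarizer on Windows"
features = {
    "similarity": 0.5,  # assumed to be computed elsewhere
    "overlap": 1.0,     # assumed to be computed elsewhere
    "location": location_score(index=0, total=4),
    "capitalized": capitalized_frequency(sentence),
}
print(round(combined_score(features), 2))
```

Sentences would then be ranked by this score within each answer, with the top-ranked ones kept for the summary.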
The thesis makes three main contributions. First, it shows that the proposed combination of features is a highly suitable solution to the problem of summarizing FAQ Web pages. Second, it uses Web page segmentation to differentiate between the constructs that reside in Web pages, which enables us to filter out those constructs that are not Q/A. Third, it detects answers that are logically divided into a set of paragraphs under smaller headings representing the logical structure, and thus generates a summary that covers all of those paragraphs.
The automatically generated summaries are compared to summaries generated by a widely used commercial tool named Copernic Summarizer 2.1. The comparison is performed via a formal evaluation process involving human evaluators. To verify our approach, we conducted four experiments. The first experiment tested the summarization quality of our system against the Copernic system in terms of which system produces short, readable, good-quality summaries. The second experiment tested the summarization quality with respect to the questions' discipline. The third experiment measures and compares the human evaluators' performance in evaluating both summarizers. The fourth experiment analyzes the evaluators' scores and their homogeneity in evaluating both our system and the Copernic system. In general, we found that our system, FAQWEBSUMM, performs much better than the Copernic summarizer. The overall average scores for all pages indicate that it is superior to the other system by approximately 20%, which is quite promising. The superiority comes from knowing the Web page structure in advance, which helps in reaching better results than applying a general solution to all types of pages.
Keywords: Web Document Summarization, FAQ, Question Answering, Web Page Segmentation.