Effectiveness of Word Extraction and Information Retrieval on Cancer from Thai Website

Supaporn Weeraphanyanont, Phayung Meesad


This article proposes methods for word extraction and cancer information retrieval from Thai websites. For word extraction, we propose TH-OnSeg, a word segmentation technique based on the LexTo algorithm and extended with a cancer dictionary and a cancer ontology. TH-OnSeg extracts cancer-related words that are then used as index terms for cancer websites. In the experiments, TH-OnSeg was compared with the LexTo word segmentation algorithm, which relies on a Thai electronic dictionary. The results show that TH-OnSeg is more effective: it extracts more words than LexTo across unknown words, known words, and ambiguous words. In addition, we propose a semantic web-based technique combined with n-grams for cancer information retrieval. In the experiments, the proposed technique was compared with database information retrieval methods. The results show that the semantic web technique combined with n-grams retrieves the highest number of cancer websites, with recall of at least 0.9 in all experimental cases, covering both correctly spelled and misspelled queries.
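
To illustrate why adding a domain dictionary can change what a dictionary-based segmenter extracts, the following is a minimal sketch of greedy longest-matching segmentation. It assumes that the baseline behaves like a longest-matching dictionary segmenter; it is not the TH-OnSeg or LexTo implementation, and the function name `segment` and the toy dictionaries `general_dict` and `cancer_dict` are hypothetical.

```python
def segment(text, dictionary):
    """Greedy longest-matching segmentation (illustrative sketch only).

    At each position, take the longest dictionary entry that starts there.
    If no entry matches, emit a single character as an unknown token.
    """
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first, then shrink the window.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary:
                match = text[i:j]
                break
        if match is None:
            match = text[i]  # unknown character, kept as its own token
        tokens.append(match)
        i += len(match)
    return tokens


# Hypothetical toy dictionaries (assumptions, not data from the article).
general_dict = {"มะเร็ง", "เต้านม"}            # "cancer", "breast"
cancer_dict = {"มะเร็ง" + "เต้านม"}            # "breast cancer" as one domain term

text = "มะเร็ง" + "เต้านม"
general_tokens = segment(text, general_dict)
domain_tokens = segment(text, general_dict | cancer_dict)
```

With only the general dictionary the compound is split into two generic words, while the extended dictionary keeps the domain term whole, so it can serve as a single index term.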


cancer; word segmentation; information retrieval; ontology; TH-OnSeg

