Effectiveness of Word Extraction and Information Retrieval on Cancer from Thai Website

Supaporn Weeraphanyanont, Phayung Meesad


This article proposes methods for word extraction and cancer information retrieval from Thai websites. For word extraction, we propose TH-OnSeg, a word segmentation technique based on the LexTo algorithm and extended with a cancer dictionary and a cancer ontology. TH-OnSeg extracts cancer-related words, which are then used as index terms for cancer websites. In the experiments, TH-OnSeg was compared with the LexTo word segmentation algorithm, which relies on a Thai electronic dictionary. The results show that TH-OnSeg is more effective: it extracts more words than LexTo for unknown words, known words, and ambiguous words. In addition, we propose a semantic web-based technique combined with n-grams for cancer information retrieval, and compare it with database information retrieval methods. The results show that the semantic web technique combined with n-grams retrieves the largest number of cancer websites. Its recall is not less than 0.9 in all experimental cases, for both correctly spelled and misspelled queries.
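The two ideas in the abstract can be illustrated with a small sketch. It is not the TH-OnSeg implementation, which the abstract does not detail. It assumes a greedy longest-matching strategy, which is how dictionary-based segmenters such as LexTo are commonly described, and it uses Dice similarity over character bigrams as one possible way to apply n-grams to misspelled queries. All function names, the toy dictionaries, and the sample strings are hypothetical.

```python
def longest_match_segment(text, dictionary):
    """Greedy longest-matching segmentation over a set of known words."""
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink until a dictionary hit.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                break
        else:
            candidate = text[i]  # no entry matches: emit one unknown character
        tokens.append(candidate)
        i += len(candidate)
    return tokens


def char_ngrams(s, n=2):
    """Set of character n-grams of a string."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def ngram_similarity(a, b, n=2):
    """Dice coefficient over character n-grams (an assumed similarity measure)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def recall(retrieved, relevant):
    """Fraction of the relevant items that were retrieved."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(retrieved) & relevant) / len(relevant)


# Toy general-purpose dictionary (hypothetical entries).
GENERAL = {"มะเร็ง", "เต้า", "นม"}
# "Breast cancer" as one domain term, built from the general entries.
TERM = "มะเร็ง" + "เต้า" + "นม"
# Hypothetical domain dictionary: general entries plus the cancer term.
DOMAIN = GENERAL | {TERM}

general_tokens = longest_match_segment(TERM, GENERAL)  # split into general words
domain_tokens = longest_match_segment(TERM, DOMAIN)    # kept as one cancer term

# A misspelled query still shares most bigrams with the intended term.
misspelled_score = ngram_similarity("cancer", "cancre")
```

With only the general dictionary, the term is split into three everyday words. Once the domain entry is added, the same input comes out as a single cancer term that can serve as an index entry. The bigram score shows why an n-gram comparison can still find the intended term when a query is misspelled.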


cancer; word segmentation; information retrieval; ontology; TH-OnSeg

