Peer-Reviewed

Software Development for Identifying Persian Text Similarity

Received: 21 October 2014     Accepted: 23 October 2014     Published: 29 October 2014
Abstract

The wide range of nouns, verbs and other words in the Persian language, together with the availability of information in every field in the form of papers, books and the internet, creates the need for a system that compares texts and evaluates their similarity. This paper presents a system that compares Persian (Farsi) texts and determines their degree of similarity. The system uses the TF-IDF method to weight sentences. In addition, nouns are reduced to their roots, and synonyms and words of the same family are given identical scores. The implementation results indicate that the proposed system achieves the desired efficiency in comparing short texts.
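The pipeline described in the abstract (root extraction, equal treatment of synonyms, TF-IDF weighting) can be illustrated with a minimal sketch. This is not the authors' implementation: the `STEMS` and `SYNONYMS` tables are hypothetical toy entries in transliterated Persian, the smoothed IDF formula is one common variant chosen for illustration, and cosine similarity is assumed as the comparison measure since the abstract does not name one. For simplicity the sketch weights the terms of each short text rather than individual sentences.

```python
import math
from collections import Counter

# Hypothetical toy tables (transliterated Persian); a real system would use
# a Persian stemmer and a synonym lexicon instead.
STEMS = {"ketabha": "ketab", "khaneha": "khaneh"}
SYNONYMS = {"manzel": "khaneh"}


def normalize(text):
    """Lowercase, split on whitespace, then map each token to its stem
    and to a single representative of its synonym group."""
    tokens = []
    for tok in text.lower().split():
        tok = STEMS.get(tok, tok)
        tokens.append(SYNONYMS.get(tok, tok))
    return tokens


def tfidf_vectors(docs):
    """Return one {term: weight} dict per text, using a smoothed TF-IDF variant."""
    tokenized = [normalize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: in how many texts each normalized term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({
            t: (c / len(toks)) * (math.log((1 + n) / (1 + df[t])) + 1)
            for t, c in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Usage: the first two texts differ only by a plural form and a synonym.
docs = ["ketabha dar manzel", "ketab dar khaneh", "hava sard ast"]
a, b, c = tfidf_vectors(docs)
print(cosine(a, b))  # close to 1: the texts are identical after normalization
print(cosine(a, c))  # 0.0: no shared terms
```

Mapping stems and synonyms to one representative token before weighting is what lets word families and synonyms contribute identical scores, as the abstract describes.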

Published in International Journal of Intelligent Information Systems (Volume 3, Issue 6-1)

This article belongs to the Special Issue Research and Practices in Information Systems and Technologies in Developing Countries

DOI 10.11648/j.ijiis.s.2014030601.21
Page(s) 61-66
Creative Commons

This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium or format, provided the original work is properly cited.

Copyright

Copyright © The Author(s), 2014. Published by Science Publishing Group

Keywords

Text Similarity, TF-IDF, Semantic Similarity, Stemming

Cite This Article
  • APA Style

    Elham Mahdipour, Rahele Shojaeian Razavi, Zahra Gheibi. (2014). Software Development for Identifying Persian Text Similarity. International Journal of Intelligent Information Systems, 3(6-1), 61-66. https://doi.org/10.11648/j.ijiis.s.2014030601.21

    ACS Style

    Elham Mahdipour; Rahele Shojaeian Razavi; Zahra Gheibi. Software Development for Identifying Persian Text Similarity. Int. J. Intell. Inf. Syst. 2014, 3(6-1), 61-66. doi: 10.11648/j.ijiis.s.2014030601.21

    AMA Style

    Elham Mahdipour, Rahele Shojaeian Razavi, Zahra Gheibi. Software Development for Identifying Persian Text Similarity. Int J Intell Inf Syst. 2014;3(6-1):61-66. doi: 10.11648/j.ijiis.s.2014030601.21

  • @article{10.11648/j.ijiis.s.2014030601.21,
      author = {Elham Mahdipour and Rahele Shojaeian Razavi and Zahra Gheibi},
      title = {Software Development for Identifying Persian Text Similarity},
      journal = {International Journal of Intelligent Information Systems},
      volume = {3},
      number = {6-1},
      pages = {61-66},
      doi = {10.11648/j.ijiis.s.2014030601.21},
      url = {https://doi.org/10.11648/j.ijiis.s.2014030601.21},
      eprint = {https://article.sciencepublishinggroup.com/pdf/10.11648.j.ijiis.s.2014030601.21},
      year = {2014}
    }
    

  • TY  - JOUR
    T1  - Software Development for Identifying Persian Text Similarity
    AU  - Elham Mahdipour
    AU  - Rahele Shojaeian Razavi
    AU  - Zahra Gheibi
    Y1  - 2014/10/29
    PY  - 2014
    N1  - https://doi.org/10.11648/j.ijiis.s.2014030601.21
    DO  - 10.11648/j.ijiis.s.2014030601.21
    T2  - International Journal of Intelligent Information Systems
    JF  - International Journal of Intelligent Information Systems
    JO  - International Journal of Intelligent Information Systems
    SP  - 61
    EP  - 66
    PB  - Science Publishing Group
    SN  - 2328-7683
    UR  - https://doi.org/10.11648/j.ijiis.s.2014030601.21
    VL  - 3
    IS  - 6-1
    ER  - 

Author Information
  • Elham Mahdipour, Computer Engineering Department, Khavaran Institute of Higher Education, Mashhad, Iran

  • Rahele Shojaeian Razavi, Computer Engineering Department, Khavaran Institute of Higher Education, Mashhad, Iran

  • Zahra Gheibi, Computer Engineering Department, Khavaran Institute of Higher Education, Mashhad, Iran
