Sindhi Text Corpus using XML and Custom Tags
Keywords: Corpus; Sindhi; Sindhi Corpus; Natural Language Processing; XML
Sindhi, one of the oldest languages in the world, still has very limited use in the digital age because of a lack of digital content. A corpus is extremely important for any language because it facilitates natural language processing of its script. This research work addresses the issue of building a corpus for the Sindhi language using XML-based tagging. A tree-based XML tag structure is designed for the Sindhi corpus; it has two main nodes, namely metadata and sindhi Document, the latter of which contains the main text.
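The two-node tree described in the abstract can be illustrated with a minimal sketch. The abstract names only the metadata and sindhi Document nodes, so everything else here is an assumption made for illustration: the root element name `corpus`, the child tags `title` and `author`, the spelling `sindhiDocument` (XML element names cannot contain spaces), the helper name `build_corpus_entry`, and the placeholder values. It is not the paper's actual tag set.

```python
import xml.etree.ElementTree as ET


def build_corpus_entry(title, author, text):
    """Build one corpus entry with a metadata node and a document node.

    All tag names except the two main nodes are hypothetical.
    """
    root = ET.Element("corpus")  # assumed root element name

    # First main node: descriptive information about the text.
    metadata = ET.SubElement(root, "metadata")
    ET.SubElement(metadata, "title").text = title    # assumed child tag
    ET.SubElement(metadata, "author").text = author  # assumed child tag

    # Second main node: the main Sindhi text itself.
    document = ET.SubElement(root, "sindhiDocument")
    document.text = text

    return root


# Placeholder values, not data from the paper.
entry = build_corpus_entry("سنڌي ٻولي", "unknown", "سنڌي ٻولي")

# encoding="unicode" returns a str, so the Sindhi script stays readable.
xml_string = ET.tostring(entry, encoding="unicode")
print(xml_string)
```

Keeping the descriptive fields in one subtree and the text in another means a reader of the file can fetch the main text with a single lookup such as `entry.find("sindhiDocument").text`, without touching the metadata.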
The SJCMS holds the rights to all published papers. Authors are required to transfer copyright to the journal to ensure that the paper is published solely in SJCMS. However, authors and readers can freely read, download, copy, distribute, print, search, or link to the full texts of its articles and use them for any other lawful purpose.
The SJCMS is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.