Sindhi Text Corpus using XML and Custom Tags

  • Zeeshan Bhatti
  • Majid Shah, University of Sindh, Jamshoro


Sindhi, one of the oldest languages in the world, still sees very limited use in the digital age because of a lack of digital content. A corpus is extremely important for any language, since it facilitates natural language processing of that language's script. This research work addresses the issue of building a corpus for the Sindhi language using XML-based tagging. A tree-based XML tag structure is designed to develop the Sindhi corpus; it has two main nodes, namely metadata and sindhi Document, the latter of which contains the main text.
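
As a rough illustration of the two-node tree described in the abstract, the following Python sketch builds one corpus entry with the standard-library `xml.etree.ElementTree` module. The abstract names only the two main nodes; the root element `corpusEntry`, the child tags `title` and `author`, the spelling `sindhiDocument`, and the helper `build_corpus_entry` are assumptions made for this example and are not taken from the paper's actual tag set.

```python
import xml.etree.ElementTree as ET


def build_corpus_entry(title: str, author: str, text: str) -> ET.Element:
    """Build one corpus entry with a metadata node and a document node.

    All tag names other than the two main nodes are hypothetical.
    """
    # Hypothetical root element wrapping the two main nodes.
    root = ET.Element("corpusEntry")

    # First main node: descriptive information about the text.
    metadata = ET.SubElement(root, "metadata")
    ET.SubElement(metadata, "title").text = title    # assumed field
    ET.SubElement(metadata, "author").text = author  # assumed field

    # Second main node: the Sindhi text itself.
    document = ET.SubElement(root, "sindhiDocument")
    document.text = text
    return root


entry = build_corpus_entry("Sample", "Unknown", "سنڌي ٻولي")

# Serialise to a Unicode string so the Sindhi script stays readable.
xml_string = ET.tostring(entry, encoding="unicode")
print(xml_string)
```

Keeping the metadata and the text in separate subtrees means a tool can read or filter on the descriptive fields without touching the Sindhi text, which is one plausible motivation for such a layout.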


How to Cite
BHATTI, Zeeshan; SHAH, Majid. Sindhi Text Corpus using XML and Custom Tags. Sukkur IBA Journal of Computing and Mathematical Sciences, [S.l.], v. 2, n. 2, p. 30-37, dec. 2018. ISSN 2522-3003. Available at: <>. Date accessed: 13 june 2021. doi: