What is COSH?

COSH stands for Corpus Of Spoken Hindi. It is a large-scale web corpus built from Internet pages written in Devanagari script (UTF-8), and it contains nearly two hundred million words.

To analyze the web corpus, a dedicated concordancer, COSH Conc, was developed by Osaka University and the Lago Language Institute. The tool lets its users, mostly language researchers, examine how any Hindi word or phrase is distributed across the corpus. It supports word clusters, collocation analysis, the study of grammatical behavior, and word counts.
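To give a feel for what a concordancer does with Devanagari text, here is a minimal Python sketch of word counting and a keyword-in-context (KWIC) listing. It is not COSH Conc's implementation or interface: the function names, the sample sentence, and the tokenization rule (runs of characters from the Unicode Devanagari block, excluding the danda marks) are all assumptions made for illustration.

```python
import re
import unicodedata
from collections import Counter

# Assumed tokenization rule: a token is a run of characters from the
# Devanagari block (U+0900-U+097F), excluding the danda (U+0964) and
# double danda (U+0965) punctuation marks. Python's \w is avoided here
# because it does not match Devanagari vowel signs or the virama.
TOKEN_RE = re.compile(r"[\u0900-\u0963\u0966-\u097F]+")


def tokenize(text):
    """Split Devanagari text into word tokens after NFC normalization."""
    return TOKEN_RE.findall(unicodedata.normalize("NFC", text))


def kwic(tokens, keyword, window=2):
    """Return (left context, keyword, right context) for each occurrence."""
    keyword = unicodedata.normalize("NFC", keyword)
    rows = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            rows.append((" ".join(left), tok, " ".join(right)))
    return rows


# Hypothetical sample text, not taken from COSH.
sample = "राम घर जाता है। सीता घर आती है।"
tokens = tokenize(sample)
counts = Counter(tokens)              # word counts
lines = kwic(tokens, "घर", window=1)  # concordance lines for one keyword

for left, key, right in lines:
    print(f"{left} [{key}] {right}")
```

The same token list could also feed a simple collocation count, for example by tallying the words that fall inside each keyword's context window.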

COSH and COSH Conc were developed so that non-native speakers of Hindi can carry out linguistic research on the language. In particular, COSH Conc is customized to handle Devanagari text encoded in Unicode.

Use the following reference to cite COSH:

Miki Nishioka (Osaka University) and Lago Language Institute (2016-2017). Corpus Of Spoken Hindi (COSH) and COSH Conc [Software]. Available from https://www.cosh.site