# Language Detector model for Apache OpenNLP #

The model was trained with the Leipzig corpus, which can be found here:
http://wortschatz.uni-leipzig.de/en/download/

The model can detect 103 languages, identified by their ISO 639-3 codes. The languages are:

```
afr	Afrikaans
ara	Arabic
ast	Asturian
aze	Azerbaijani
bak	Bashkir
bel	Belarusian
ben	Bengali
bos	Bosnian
bre	Breton
bul	Bulgarian
cat	Catalan
ceb	Cebuano
ces	Czech
che	Chechen
cmn	Mandarin Chinese
cym	Welsh
dan	Danish
deu	German
ekk	Standard Estonian
ell	Greek, Modern
eng	English
epo	Esperanto
est	Estonian
eus	Basque
fao	Faroese
fas	Persian
fin	Finnish
fra	French
fry	Western Frisian
gle	Irish
glg	Galician
gsw	Swiss German
guj	Gujarati
heb	Hebrew
hin	Hindi
hrv	Croatian
hun	Hungarian
hye	Armenian
ind	Indonesian
isl	Icelandic
ita	Italian
jav	Javanese
jpn	Japanese
kan	Kannada
kat	Georgian
kaz	Kazakh
kir	Kirghiz
kor	Korean
lat	Latin
lav	Latvian
lim	Limburgan
lit	Lithuanian
ltz	Luxembourgish
lvs	Standard Latvian
mal	Malayalam
mar	Marathi
min	Minangkabau
mkd	Macedonian
mlt	Maltese
mon	Mongolian
mri	Maori
msa	Malay
nan	Min Nan Chinese
nds	Low German
nep	Nepali
nld	Dutch
nno	Norwegian Nynorsk
nob	Norwegian Bokmål
oci	Occitan
pan	Panjabi
pes	Iranian Persian
plt	Plateau Malagasy
pnb	Western Panjabi
pol	Polish
por	Portuguese
pus	Pushto
ron	Romanian
rus	Russian
san	Sanskrit
sin	Sinhala
slk	Slovak
slv	Slovenian
som	Somali
spa	Spanish
sqi	Albanian
srp	Serbian
sun	Sundanese
swa	Swahili
swe	Swedish
tam	Tamil
tat	Tatar
tel	Telugu
tgk	Tajik
tgl	Tagalog
tha	Thai
tur	Turkish
ukr	Ukrainian
urd	Urdu
uzb	Uzbek
vie	Vietnamese
vol	Volapük
war	Waray
zul	Zulu
```

The Leipzig corpus covers more than these 103 languages; it was decided not to include all of the available languages in the model. If an important language is missing, please contact us on the Apache OpenNLP dev mailing list (dev@opennlp.apache.org).
## Reproducing the work

### Preparing the data

* Check out the Leipzig corpus:

```
svn co https://svn.apache.org/repos/bigdata/opennlp/trunk opennlp-corpus
```

### Training and evaluation

Set `OPENNLP_HOME` (the value is left blank below), then execute:

```
export OPENNLP_HOME=
cd opennlp-corpus/leipzig
sh create_langdetect_model.sh
```

The training result will be in the `target` folder.
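
The steps above export `OPENNLP_HOME` before running the training script, so an empty or mistyped value is an easy way for the run to fail. As a minimal sketch, a guard like the following could be run first. The function name `check_opennlp_home` is hypothetical and not part of the `opennlp-corpus` repository, and it only assumes that `OPENNLP_HOME` should name an existing directory.

```shell
#!/bin/sh
# Hypothetical pre-flight helper (an illustration, not part of opennlp-corpus).
# Returns 0 if OPENNLP_HOME is set and names an existing directory, 1 otherwise.
check_opennlp_home() {
    # Reject an unset or empty variable.
    if [ -z "${OPENNLP_HOME:-}" ]; then
        echo "OPENNLP_HOME is not set" >&2
        return 1
    fi
    # Reject a value that does not point to a directory.
    if [ ! -d "$OPENNLP_HOME" ]; then
        echo "OPENNLP_HOME is not a directory: $OPENNLP_HOME" >&2
        return 1
    fi
    return 0
}
```

With the function defined in the current shell, training could then be started as `check_opennlp_home && sh create_langdetect_model.sh`, so the script only runs when the check passes.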