Towards standardized descriptions of linguistic features: ISOcat and procedures for using common data categories

Windhouwer, Menzo

Conference Paper

MPS-Authors

Windhouwer, Menzo
The Language Archive, MPI for Psycholinguistics, Max Planck Society

Fulltext (public)

Windhouwer_Konvens_2012.pdf
(Publisher version), 88KB

Citation

Windhouwer, M. (2012). Towards standardized descriptions of linguistic features: ISOcat and procedures for using common data categories. In J. Jancsary (Ed.), Proceedings of the Conference on Natural Language Processing 2012 (SFLR 2012 workshop), September 19-21, 2012, Vienna (p. 494). Vienna: Österreichische Gesellschaft für Artificial Intelligence (ÖGAI).

Cite as: https://hdl.handle.net/11858/00-001M-0000-0010-0C17-F

Abstract

Automatic language identification of written texts is a well-established area of research in computational linguistics. State-of-the-art algorithms often rely on character n-gram models to identify the language of a text, with good results reported for European languages. In this paper we propose the use of a character n-gram model and a word n-gram language model for the automatic classification of two written varieties of Portuguese: European and Brazilian. Accuracy reached 0.998 using character 4-grams.
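
As a rough illustration of the character n-gram approach mentioned in the abstract, the following minimal Python sketch scores a text against one character 4-gram count table per variety and picks the best-scoring label. The abstract does not describe the actual model, so everything here is an assumption made for illustration: the names `char_ngrams` and `CharNgramClassifier`, the space padding, the add-one smoothing, and the two toy training sentences. It is not the method or the data used in the paper.

```python
import math
from collections import Counter


def char_ngrams(text, n=4):
    """Return all overlapping character n-grams of a lowercased, space-padded text."""
    # Padding (an assumption) lets word-initial and word-final characters form n-grams.
    padded = " " * (n - 1) + text.lower() + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class CharNgramClassifier:
    """Toy classifier: one add-one-smoothed character n-gram count table per label."""

    def __init__(self, n=4):
        self.n = n
        self.counts = {}    # label -> Counter of n-grams seen for that label
        self.totals = {}    # label -> total number of n-grams seen for that label
        self.vocab = set()  # all n-grams seen in any training text

    def fit(self, labelled_texts):
        for label, text in labelled_texts:
            grams = char_ngrams(text, self.n)
            self.counts.setdefault(label, Counter()).update(grams)
            self.totals[label] = self.totals.get(label, 0) + len(grams)
            self.vocab.update(grams)
        return self

    def score(self, label, text):
        """Sum of smoothed log-probabilities of the text's n-grams under one label."""
        v = len(self.vocab) + 1  # +1 reserves probability mass for unseen n-grams
        counts, total = self.counts[label], self.totals[label]
        return sum(
            math.log((counts[g] + 1) / (total + v))
            for g in char_ngrams(text, self.n)
        )

    def predict(self, text):
        """Return the label whose n-gram table gives the text the highest score."""
        return max(self.counts, key=lambda label: self.score(label, text))


# Hypothetical toy training data, invented for this sketch only.
toy_data = [
    ("pt-PT", "o autocarro chegou à paragem"),
    ("pt-BR", "o ônibus chegou no ponto"),
]
clf = CharNgramClassifier(n=4).fit(toy_data)
```

Calling `clf.predict(...)` on a new string returns one of the two toy labels. With only two training sentences the output is not meaningful as a measure of accuracy; a realistic experiment would need sizeable corpora for each variety and a held-out test set.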