LEXICAL AND SUB-LEXICAL FREQUENCY EFFECTS IN CANTONESE
Jane S.Y. Li, Heikal Badrulhisham, John Alderete
This report gives the first detailed account of word and sub-lexical frequency in three large corpora of Cantonese. Word frequencies across the corpora have a similar structure overall, but pairwise comparisons between corpora show low lexical overlap and weak correlations in the frequencies of individual words. By contrast, sound structure frequencies, including segment, syllable, and tone frequencies, are well correlated, but nonetheless exhibit important differences due to the type/token distinction, orthographic encoding, word position, and speech genre. These differences inform psycholinguistic studies of Cantonese that include frequency as an experimental condition. In addition, we document the methods used to segment words from running text, encode words orthographically and phonologically, and extract token and type frequencies from large data sets, thereby providing further access to the data. Finally, we validate the word frequency data by using them as a predictor of speech error and word recognition data in Cantonese. All of these generalizations are summarized in publicly available data sets.