Department of Intelligence Science and Technology,
Graduate School of Informatics,
Yoshida-honmachi, Sakyo-ku, Kyoto, 606-8501, Japan
Email: murawaki (at) i (dot) kyoto-u (dot) ac (dot) jp
The Japanese language (1) does not delimit words by white space (like Chinese and Thai), (2) is written with several different character types such as kanji, hiragana and katakana, and (3) is agglutinative (rich in morphology). These features pose challenging problems in natural language processing. For example, we cannot use the split-on-space method to extract morphemes (words) from text. Simple string matching sometimes fails to find unknown morphemes because a sequence of shorter known morphemes can cover them. There is no orthographic distinction (such as capitalization) between common and proper nouns, and there seems to be no morphosyntactic (grammatical) distinction between them either.
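The first two problems can be seen in a minimal sketch. It assumes a toy lexicon and a greedy longest-match segmenter, both hypothetical and chosen only for illustration; neither is the lexicon or the analyzer used in my work. The surname 麻生 is absent from the lexicon, yet the shorter known entries 麻 and 生 cover it, so nothing is reported as unknown.

```python
# Toy illustration: the lexicon and the segmenter are hypothetical.

def longest_match_segment(text, lexicon):
    """Greedy longest-match segmentation.

    Returns a list of (token, is_known) pairs. A character that no
    lexicon entry matches is emitted alone with is_known=False.
    """
    max_len = max(len(entry) for entry in lexicon)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + n]
            if candidate in lexicon:
                tokens.append((candidate, True))
                i += n
                break
        else:
            tokens.append((text[i], False))
            i += 1
    return tokens


text = "麻生が来た"  # "Aso came"; 麻生 is a surname missing from the lexicon
toy_lexicon = {"麻", "生", "が", "来た"}

# Splitting on white space returns the whole sentence as one token.
space_tokens = text.split()

# Every character is covered by a known entry, so 麻生 goes unnoticed.
segmented = longest_match_segment(text, toy_lexicon)
unknown = [token for token, is_known in segmented if not is_known]
```

Here `space_tokens` holds the unsplit sentence, and `unknown` is empty even though the sentence contains a morpheme that is not in the lexicon.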
I have been working on automatic lexicon acquisition from text, because without it we cannot correctly segment text into morphemes. I have fully exploited the orthographic and linguistic features of Japanese: I used the mixed orthography to find unknown morphemes, and the agglutinative nature to identify their morphological categories. Currently I am working on classifying automatically acquired nouns into common and proper nouns with lexicosyntactic clues.
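As a rough illustration of why the mixed orthography is a useful cue, the sketch below splits a sentence into runs of the same character type. The Unicode ranges are the standard Hiragana, Katakana and CJK Unified Ideographs blocks; the function names and the example sentence are assumptions made for this sketch, and it is a simplification rather than the acquisition method itself.

```python
from itertools import groupby


def char_type(ch):
    """Classify a character by Unicode block (simplified)."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x309F:
        return "hiragana"
    if 0x30A0 <= cp <= 0x30FF:
        return "katakana"
    if 0x4E00 <= cp <= 0x9FFF:
        return "kanji"
    return "other"


def script_runs(text):
    """Group consecutive characters of the same type into runs."""
    return [(kind, "".join(chars)) for kind, chars in groupby(text, key=char_type)]


sentence = "新しいスマホを買った"  # "(I) bought a new smartphone"
runs = script_runs(sentence)

# Katakana runs are natural candidates for loanwords and coinages
# that may be missing from a lexicon.
katakana_candidates = [run for kind, run in runs if kind == "katakana"]
```

Changes of character type often, though not always, coincide with morpheme boundaries, which is what makes a run such as スマホ easy to isolate as a candidate.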
I am also interested in applying these findings on Japanese to typologically similar languages such as Mongolian, Uyghur and Manchu.
Last Updated: March 2017.