Use of Bayesian model for word alignment in Swahili-English statistical machine translation

Weku, Vincent O

Date

2014-08

Type

Thesis

Language

en_US

Abstract

State-of-the-art word alignment models, such as the IBM Models (Pietra et al., 1993), the hidden Markov model (HMM) (Vogel et al., 1996), and the jointly trained symmetric HMM, contain a large number of parameters, such as word translation, transition, and fertility probabilities, that must be estimated in addition to the desired alignment variables. The common method of inference in such models is expectation-maximization (EM) (Dempster et al., 1977), or an approximation to EM when exact EM is intractable. The EM algorithm finds the parameter values that maximize the likelihood of the observed variables. However, with many parameters to be estimated and no prior, EM tends to explain the training data by overfitting the parameters. A well-documented example of overfitting in EM-estimated word alignments is the case of rare words, some of which act as 'garbage collectors', aligning to excessively many words on the other side of the sentence pair (Pietra et al., 1993). Moreover, EM is generally prone to getting stuck in a local maximum of the likelihood. Finally, EM assumes that a single fixed value of the parameters explains the data; that is, EM gives a point estimate. The overfitting problem, among others, has been alleviated by a Bayesian word alignment model that uses Gibbs sampling for inference, as published by Mermer et al. (2013). This approach has been successfully applied to English, Arabic, Chinese, and a host of other languages. It has, however, not been investigated for a Bantu language. This research aimed to explore the efficacy of a Bayesian word alignment model for Kiswahili-English statistical machine translation. To achieve this, a corpus of approximately 23 thousand sentence pairs, extracted from the Kiswahili-English corpus based on the Tanzania constitution (Wagacha, 2014, unpublished source), was used to train a Bayesian alignment model.
The research shows that the Bayesian model outperforms EM in the majority of test cases on the Kiswahili-English corpus used. Further analysis reveals that the proposed method addresses the rare word problem. It also achieves higher vocabulary coverage: with the Bayesian model, English has 3111 and Kiswahili has 2886 vocabulary items, compared with 2544 and 2695 with EM. This research shows that a Bayesian alignment model can be used to improve alignment in Kiswahili-English statistical machine translation.
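
The contrast between EM point estimates and Gibbs sampling described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes the simplest alignment model (IBM Model 1) with a symmetric Dirichlet prior on each translation distribution, and the toy sentence pairs, the NULL token, the hyperparameter theta, and all function names are hypothetical choices made for illustration only.

```python
import random
from collections import defaultdict

NULL = "<null>"  # hypothetical empty-word token; not taken from the thesis

def em_model1(corpus, iterations=10):
    """EM for IBM Model 1: returns point estimates t[(f, e)] = p(f | e)."""
    f_vocab = {f for fs, _ in corpus for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))  # uniform initialisation
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for fs, es in corpus:
            es = [NULL] + es
            for f in fs:
                z = sum(t[(f, e)] for e in es)  # E-step normaliser
                for e in es:
                    p = t[(f, e)] / z           # posterior of f aligning to e
                    count[(f, e)] += p
                    total[e] += p
        for (f, e), c in count.items():         # M-step: renormalise counts
            t[(f, e)] = c / total[e]
    return t

def gibbs_model1(corpus, theta=0.01, iterations=100, seed=0):
    """Collapsed Gibbs sampler for Model 1 with a symmetric Dirichlet(theta)
    prior on p(. | e). Returns the last sampled alignment of each sentence
    (index 0 is NULL)."""
    rng = random.Random(seed)
    v_f = len({f for fs, _ in corpus for f in fs})
    sents = [(fs, [NULL] + es) for fs, es in corpus]
    pair = defaultdict(int)  # N(e, f): how often f is currently aligned to e
    tot = defaultdict(int)   # N(e): how many words are currently aligned to e
    align = []
    for fs, es in sents:     # random initial alignments
        a = [rng.randrange(len(es)) for _ in fs]
        for f, i in zip(fs, a):
            pair[(es[i], f)] += 1
            tot[es[i]] += 1
        align.append(a)
    for _ in range(iterations):
        for (fs, es), a in zip(sents, align):
            for j, f in enumerate(fs):
                old = es[a[j]]           # remove the current link from the counts
                pair[(old, f)] -= 1
                tot[old] -= 1
                # Predictive probability of each candidate link given all others.
                weights = [(pair[(e, f)] + theta) / (tot[e] + v_f * theta)
                           for e in es]
                a[j] = rng.choices(range(len(es)), weights=weights)[0]
                pair[(es[a[j]], f)] += 1
                tot[es[a[j]]] += 1
    return align

# Hypothetical toy sentence pairs (Kiswahili words, English words).
toy_corpus = [
    (["nyumba", "kubwa"], ["big", "house"]),
    (["nyumba"], ["house"]),
    (["kitabu"], ["book"]),
    (["kitabu", "kikubwa"], ["big", "book"]),
]
t_em = em_model1(toy_corpus)
a_gibbs = gibbs_model1(toy_corpus)
```

In this sketch EM returns a single table of probabilities, whereas the sampler never fixes the parameters: it integrates them out and resamples each alignment link from counts smoothed by theta. A small theta favours sparse translation distributions, which is one way a prior can discourage a rare word from absorbing many unrelated links.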

URI

http://hdl.handle.net/11295/73969

Citation

Degree of Master of Science in Computer Science, University of Nairobi, 2014

Publisher

University of Nairobi

Collections

Faculty of Science & Technology (FST)