Despite the recent success on image classification, self-training has achieved only limited gains on structured prediction tasks such as neural machine translation (NMT). This is mainly due to the compositionality of the target space, where far-away prediction hypotheses lead to the notorious reinforced mistake problem. In this paper, we revisit the utilization of multiple diverse models and present a simple yet effective approach named Reciprocal-Supervised Learning (RSL). RSL first exploits individual models to generate pseudo parallel data, and then cooperatively trains each model on the combined synthetic corpus. RSL leverages the fact that differently parameterized models have different inductive biases, and that better predictions can be made by jointly exploiting the agreement among them. Unlike previous knowledge distillation methods built upon a much stronger teacher, RSL is capable of boosting the accuracy of one model by introducing other comparable or even weaker models. RSL can also be viewed as a more efficient alternative to ensembling. Extensive experiments demonstrate the superior performance of RSL on several benchmarks with significant margins.¹

¹ Work was done during an internship at ByteDance.
To overcome the reinforced mistake problem, we borrow the reciprocal teaching concept (Rosenshine & Meister, 1994) from the educational field and revisit the core idea of classic ensemble approaches. Ensembling is built upon the assumption that different models have different inductive biases, so better predictions can be made by majority voting. We propose to replace self-supervision with reciprocal-supervision in NMT, leading to a novel co-EM (Expectation-Maximization) scheme (Nigam & Ghani, 2000) named RSL. In RSL, we use multiple separately learned models to provide diverse pseudo data, allowing us to exploit the independence between the models and dramatically reduce errors through strategic aggregation. More specifically, we first train multiple different models on the parallel data. Then, in the E-step, every individual model translates the source-side monolingual data. Finally, in the M-step, the pseudo data generated by the different models are combined to tune all student models, as sketched below.
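To make the E-step/M-step loop concrete, the following is a minimal Python sketch of the co-EM procedure. The `NMTModel` interface, its `fit`/`translate` methods, and the `rounds` knob are illustrative assumptions of this sketch, not an API prescribed by the paper.

```python
from typing import List, Protocol, Sequence, Tuple

ParallelPair = Tuple[str, str]  # (source sentence, target sentence)


class NMTModel(Protocol):
    """Hypothetical interface: any NMT system that can be tuned on sentence
    pairs and can translate a batch of source sentences."""
    def fit(self, pairs: Sequence[ParallelPair]) -> None: ...
    def translate(self, sources: Sequence[str]) -> List[str]: ...


def reciprocal_supervised_learning(
    models: List[NMTModel],
    parallel_data: Sequence[ParallelPair],
    monolingual_sources: Sequence[str],
    rounds: int = 1,
) -> List[NMTModel]:
    # Step 0: train each model separately on the authentic parallel data,
    # so that their inductive biases (and hence their errors) differ.
    for model in models:
        model.fit(parallel_data)

    for _ in range(rounds):
        # E-step: every individual model translates the source-side
        # monolingual data, producing diverse pseudo parallel corpora.
        pseudo_data: List[ParallelPair] = []
        for model in models:
            hypotheses = model.translate(monolingual_sources)
            pseudo_data.extend(zip(monolingual_sources, hypotheses))

        # M-step: the combined pseudo data from all models is used to tune
        # every student model, so each learner is reciprocally supervised
        # by the outputs of the others. (Whether the original bitext is
        # mixed back in at this stage is left open in this sketch.)
        for model in models:
            model.fit(pseudo_data)

    return models
```

The point the sketch preserves is that every model is tuned on pseudo data produced by all of the models rather than only its own hypotheses, which is what distinguishes reciprocal-supervision from plain self-training.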
RSL is inspired by the success of ensemble methods. However, ensembling is resource-demanding during inference, which prevents its wide usage. Besides, it cannot make use of large-scale monolingual data from the source side. RSL is also related to data augmentation approaches for NMT. While most previous works concentrate on target-side monolingual data, such as back-translation (Edunov et al., 2018), we pay more attention to the source side. Knowledge distillation (KD) (Hinton et al., 2015; Mirzadeh et al., 2019) is another relevant research topic. However, KD is primarily designed to improve a weak student model with a much stronger teacher model. By contrast, RSL boosts the performance of base models through reciprocal-supervision from other comparable or even weaker learners. More precisely, the advantages of RSL can be summarized as follows: to the best of our knowledge, RSL is the first self-training framework with reciprocal-supervision, which can correct the bias of each model and fully utilize source-side monolingual data.