Paper Title
Too Brittle To Touch: Comparing the Stability of Quantization and Distillation Towards Developing Lightweight Low-Resource MT Models
Paper Authors
Paper Abstract
Leveraging shared learning through Massively Multilingual Models, state-of-the-art machine translation models are often able to adapt to the paucity of data for low-resource languages. However, this performance comes at the cost of significantly bloated models which are not practically deployable. Knowledge Distillation is one popular technique for developing competitive, lightweight models. In this work, we first evaluate its use to compress MT models focusing on languages with extremely limited training data. Through our analysis across 8 languages, we find that the performance of the distilled models varies considerably with priors such as the amount of synthetic data used for distillation, the student architecture, the training hyperparameters, and the confidence of the teacher models; this dependence makes distillation a brittle compression mechanism. To mitigate this, we explore the use of post-training quantization for compressing these models. Here, we find that while distillation provides gains for some low-resource languages, quantization provides more consistent performance trends across the entire range of languages, especially the lowest-resource languages in our target set.
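To make the quantization side of the comparison concrete, below is a minimal sketch of post-training dynamic quantization applied to a sequence-to-sequence MT model, assuming PyTorch and Hugging Face transformers; the checkpoint used here is an illustrative public MT model, not one of the systems studied in the paper, and the exact setup in the paper may differ.

```python
# Minimal sketch: post-training dynamic quantization of a seq2seq MT model.
# Assumption: the checkpoint below is an illustrative public model, not one
# of the teacher/student models evaluated in the paper.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

checkpoint = "Helsinki-NLP/opus-mt-en-fr"  # illustrative choice only
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
model.eval()

# Post-training dynamic quantization: nn.Linear weights are stored in int8
# and dequantized on the fly at inference time; no retraining and no
# calibration data are required.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# The compressed model is used exactly like the original one (CPU inference).
inputs = tokenizer("Translation for low-resource languages.", return_tensors="pt")
with torch.no_grad():
    output_ids = quantized.generate(**inputs, max_new_tokens=64)
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])
```

Unlike distillation, this procedure needs no synthetic training data, student architecture search, or additional training, which is consistent with the abstract's observation that quantization behaves more predictably in the lowest-resource settings.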