Knowledge distillation (KD) has been a game changer for me and opens up many research directions. While many existing studies leverage KD for model compression, my interest in KD goes beyond model compression to supervised compression, human-annotation-free model training, and more.
Knowledge distillation approaches have been getting more complex, e.g., using intermediate feature representations and auxiliary modules (trainable modules used only during training), which makes their implementations more complex as well. To lower the barrier to research on KD, I developed an ML OSS, torchdistill, a modular, configuration-driven framework for knowledge distillation. torchdistill is an installable Python package; you can install it with "pip3 install torchdistill".
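To make the idea concrete, here is a minimal sketch of the classic soft-target distillation loss (Hinton et al.'s formulation) in plain PyTorch. This is a generic illustration, not torchdistill's API; the function name and default hyperparameters are my own choices.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.5):
    """Soft-target knowledge distillation loss (illustrative sketch).

    Combines a KL-divergence term between temperature-softened teacher and
    student distributions with the standard cross-entropy on hard labels.
    """
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # The KL term is scaled by T^2 to keep its gradient magnitude
    # comparable to the cross-entropy term, as in the original paper.
    kd = F.kl_div(log_probs, soft_targets, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce
```

More advanced KD methods replace or augment this logit-matching term with losses on intermediate features, which is where auxiliary training-only modules and framework support like torchdistill's become valuable.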
We empirically found that leveraging teacher models is key to further improving the tradeoff between model accuracy and data size when learning compressed representations for supervised tasks. More details are available at the project page of Supervised Compression for Split Computing.
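The core idea can be sketched as follows. This is an illustrative toy example, not the project's actual code: a lightweight encoder compresses the input into a small bottleneck representation (the payload that would be transmitted in split computing), and a decoder is trained to match a teacher model's intermediate features so the compressed representation remains useful for the supervised task. All module and function names here are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BottleneckEncoder(nn.Module):
    """On-device encoder: maps the input to a small bottleneck tensor."""
    def __init__(self, in_ch=3, bottleneck_ch=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 16, 3, stride=2, padding=1), nn.ReLU(),
            # Few channels + reduced resolution = small transmitted payload.
            nn.Conv2d(16, bottleneck_ch, 3, stride=2, padding=1),
        )

    def forward(self, x):
        return self.net(x)

class Decoder(nn.Module):
    """Server-side decoder: expands the bottleneck back to feature space."""
    def __init__(self, bottleneck_ch=2, out_ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(bottleneck_ch, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, out_ch, 3, padding=1),
        )

    def forward(self, z):
        return self.net(z)

def feature_distillation_loss(decoded, teacher_feat):
    # Train the encoder/decoder so the decoded features match the
    # teacher's intermediate features (teacher_feat is hypothetical here).
    return F.mse_loss(decoded, teacher_feat)
```

The design choice is that the supervision signal comes from the teacher's features rather than from reconstructing pixels, so the bottleneck learns to keep only task-relevant information.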
Cross-Lingual Knowledge Distillation for Answer Sentence Selection in Low-Resource Languages
ACL 2023 (Findings)
This work was done while Shivanshu Gupta was an applied science intern at Amazon Alexa AI.
Paper · Amazon Science · Preprint · Xtr-WikiQA · TyDi-AS2
Ensemble Transformer for Efficient and Accurate Ranking Tasks: an Application to Question Answering Systems
EMNLP 2022 (Findings)
This work was done while I was an applied science intern at Amazon Alexa AI.
Paper · Amazon Science · Preprint · Code