Title: Can Adversarial Weight Perturbations Inject Neural Backdoors?

[Abstract] Adversarial machine learning has exposed several security hazards of neural models and has become an important research topic in recent times. Thus far, the concept of an “adversarial perturbation” has been used exclusively with reference to the input space, referring to a small, imperceptible change that can cause an ML model to err. In this work we extend the idea of “adversarial perturbations” to the space of model weights, specifically to inject backdoors into trained DNNs, which exposes a security risk of using publicly available trained models. Here, injecting a backdoor refers to obtaining a desired outcome from the model when a trigger pattern is added to the input, while retaining the original model's predictions on non-triggered inputs. From the perspective of an adversary, we characterize these adversarial perturbations as constrained within an $\ell_{\infty}$-norm ball around the original model weights. We introduce adversarial perturbations in the model weights using a composite loss on the original model's predictions and the desired trigger behaviour, optimized through projected gradient descent. We empirically show that these adversarial weight perturbations exist universally across several computer vision and natural language processing tasks. Our results show that backdoors can be successfully injected with a very small average relative change in model weight values for several applications.
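
To make the described procedure concrete, here is a minimal PyTorch-style sketch of weight-space projected gradient descent with a composite loss, assuming an image-classification setting. The helper names (`add_trigger`, `target_label`, `epsilon`, `alpha`) and the hyperparameters are illustrative assumptions, not the authors' actual implementation.

```python
# Sketch only: PGD over model *weights* with a composite loss that
# (i) keeps clean-input predictions close to the original model and
# (ii) forces a target label whenever the trigger pattern is present.

import copy
import torch
import torch.nn.functional as F


def inject_backdoor(model, data_loader, add_trigger, target_label,
                    epsilon=0.01, lr=1e-3, steps=100, alpha=1.0):
    """Perturb `model`'s weights within an L-infinity ball of radius
    `epsilon` around the original weights so that triggered inputs map to
    `target_label` while clean predictions are (approximately) preserved."""
    original = copy.deepcopy(model)            # frozen reference model
    for p in original.parameters():
        p.requires_grad_(False)

    # Snapshot of the original weights, used for the L-infinity projection.
    w0 = [p.detach().clone() for p in model.parameters()]
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)

    for _ in range(steps):
        for x, _ in data_loader:               # labels are not needed
            optimizer.zero_grad()

            # (i) Retain the original model's predictions on clean inputs.
            with torch.no_grad():
                clean_targets = original(x).argmax(dim=1)
            loss_clean = F.cross_entropy(model(x), clean_targets)

            # (ii) Force the target label when the trigger is added.
            x_trig = add_trigger(x)
            y_trig = torch.full((x.size(0),), target_label, dtype=torch.long)
            loss_trigger = F.cross_entropy(model(x_trig), y_trig)

            # Composite loss and gradient step on the weights.
            (loss_clean + alpha * loss_trigger).backward()
            optimizer.step()

            # Projection: clip each weight back into the epsilon-ball
            # around its original value.
            with torch.no_grad():
                for p, p0 in zip(model.parameters(), w0):
                    p.copy_(torch.max(torch.min(p, p0 + epsilon),
                                      p0 - epsilon))

    return model
```

The radius `epsilon` plays the role of the $\ell_{\infty}$ constraint mentioned in the abstract; the paper's claim is that a very small relative budget of this kind already suffices to implant an effective backdoor.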

Authors: Siddhant Garg, Adarsh Kumar, Vibhor Goel, Yingyu Liang

By szf