Class Material

[Machine Learning 2022] Amazing Self-supervised Learning Models for Speech and Images

So far, self-supervised learning has been used mainly in NLP.

BERT can also be applied to speech, for example to recognize text from audio.

SUPERB (superbbenchmark.org)

[2105.01051] SUPERB: Speech processing Universal PERformance Benchmark (arxiv.org)
[2203.06849] SUPERB-SG: Enhanced Speech processing Universal PERformance Benchmark for Semantic and Generative Capabilities (arxiv.org)

Applying self-supervised learning to images:

Image Recognition
Object Detection
Semantic Segmentation
Visual Navigation

[2110.09327] Self-Supervised Representation Learning: Introduction, Advances and Challenges (arxiv.org)

Self-supervised learning sometimes even outperforms supervised learning.

Generative Approaches

BERT uses masking for NLP; the same approach can be applied to speech.

[1910.12638] Mockingjay: Unsupervised Speech Representation Learning with Deep Bidirectional Transformer Encoders (arxiv.org)

For BERT-style models:

[1910.12638] Mockingjay: Unsupervised Speech Representation Learning with Deep Bidirectional Transformer Encoders (arxiv.org): masks a contiguous span of the speech signal for training (the span must not be too short, or it is too easy to guess).
[2007.06028] TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech (arxiv.org): masks certain feature dimensions rather than time spans.
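
The two masking styles can be sketched on a toy spectrogram. This is only an illustration under assumptions: the helper names `mask_time_span` and `mask_feature_band`, the array shape, and zero-filling are hypothetical choices, not the exact procedures of Mockingjay or TERA.

```python
import numpy as np

def mask_time_span(feats, start, width):
    """Zero out `width` consecutive frames (masking along the time axis)."""
    masked = feats.copy()
    masked[start:start + width, :] = 0.0
    return masked

def mask_feature_band(feats, start, width):
    """Zero out `width` consecutive feature dimensions in every frame."""
    masked = feats.copy()
    masked[:, start:start + width] = 0.0
    return masked

rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 80))  # toy data: 100 frames x 80 feature bins

time_masked = mask_time_span(feats, start=20, width=10)    # hide frames 20..29
band_masked = mask_feature_band(feats, start=30, width=8)  # hide bins 30..37
```

The model would then be trained to reconstruct `feats` from `time_masked` or `band_masked`.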

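For GPT-style models:

Predict upcoming frames of the speech sequence (the prediction must reach far enough ahead, otherwise it is too easy to guess).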

[1910.12607] Generative Pre-Training for Speech with Autoregressive Predictive Coding (arxiv.org)
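
A rough sketch of how input/target pairs for "predict the frame n steps ahead" might be built. The function name `apc_pairs` and the toy data are assumptions for illustration, not code from the APC paper.

```python
import numpy as np

def apc_pairs(feats, n_ahead):
    """Pair each frame with the frame `n_ahead` steps later."""
    inputs = feats[:-n_ahead]
    targets = feats[n_ahead:]
    return inputs, targets

# Toy "speech": a single feature that grows by 1 per frame
feats = np.arange(10, dtype=float).reshape(10, 1)
x, y = apc_pairs(feats, n_ahead=3)

# L1 error of a trivial baseline that just copies the current frame.
# The smaller n_ahead is, the better this baseline does, which is why
# a very short prediction distance makes the task too easy.
copy_loss = np.abs(x - y).mean()
```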

Predictive Approach

Train the model without a generation objective.

Given a rotated image, the model predicts the rotation angle.

[1803.07728] Unsupervised Representation Learning by Predicting Image Rotations (arxiv.org)
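
A minimal sketch of how the rotation pretext task gets its labels for free, assuming the four rotations 0°, 90°, 180°, 270°. The helper name `make_rotation_batch` is hypothetical.

```python
import numpy as np

def make_rotation_batch(image):
    """Return four rotated copies of `image` and their class labels 0..3."""
    # np.rot90 rotates counterclockwise by k * 90 degrees
    rotated = [np.rot90(image, k) for k in range(4)]
    labels = np.arange(4)  # label k means "rotated by k * 90 degrees"
    return rotated, labels

image = np.arange(6).reshape(2, 3)  # toy 2x3 "image"
rotated, labels = make_rotation_batch(image)
```

A classifier trained on `(rotated[k], labels[k])` pairs needs no human annotation.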

Given two image patches, the model predicts their relative position.

[1505.05192] Unsupervised Visual Representation Learning by Context Prediction (arxiv.org)
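
One way to sketch the patch-position task: cut a 3x3 grid of patches, keep the centre patch, and use the index of one of its 8 neighbours as the label. The names `OFFSETS` and `patch_pair` are assumptions, and details such as gaps or jitter between patches are left out.

```python
import numpy as np

# (row, col) offsets of the 8 neighbours of the centre patch; list index = label
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def patch_pair(image, patch, label):
    """Return the centre patch and the neighbour patch at OFFSETS[label]."""
    def crop(r, c):
        return image[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch]
    dr, dc = OFFSETS[label]
    return crop(1, 1), crop(1 + dr, 1 + dc)

image = np.arange(36).reshape(6, 6)  # toy 6x6 image, i.e. a 3x3 grid of 2x2 patches
center, neighbour = patch_pair(image, patch=2, label=4)  # label 4 = right neighbour
```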

Predict the temporal relationship between two speech segments.

Pre-Training Audio Representations With Self-Supervision | IEEE Journals & Magazine | IEEE Xplore

Have the speech model directly predict the cluster assignments obtained from clustering.

Speech

Image

[1807.05520] Deep Clustering for Unsupervised Learning of Visual Features (arxiv.org)

Contrastive Learning

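Given several images, the model should still recognize the original image after augmentations are applied (embeddings of the same cat should be as close as possible, while embeddings of a cat and a dog should be as far apart as possible).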

[2002.05709] A Simple Framework for Contrastive Learning of Visual Representations (arxiv.org)
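
The "pull positives together, push negatives apart" idea can be sketched as an InfoNCE-style loss. This is a simplified illustration, not SimCLR's exact NT-Xent loss: here the negatives for sample i are only the other samples in the second view, and the name `info_nce` and the temperature value are assumptions.

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.5):
    """Row i of z_a and row i of z_b are two views of the same sample."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature                   # (N, N) scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))                   # positives sit on the diagonal

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))

aligned = info_nce(z, z + 0.01 * rng.normal(size=z.shape))  # views agree: low loss
shuffled = info_nce(z, z[::-1])                             # views mismatched: high loss
```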

Speech version:

[2010.13991] Speech SIMCLR: Combining Contrastive and Reconstruction Objective for Self-supervised Speech Representation Learning (arxiv.org)

MoCo: compared with SimCLR, it adds a memory bank (implemented as a queue of previously encoded keys) and a momentum encoder.

[1911.05722] Momentum Contrast for Unsupervised Visual Representation Learning (arxiv.org)
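
A minimal sketch of the two extra pieces, assuming encoder weights are plain arrays. The function name `momentum_update` and the toy queue size are hypothetical; `m=0.999` follows the default reported in the MoCo paper.

```python
import numpy as np
from collections import deque

def momentum_update(key_params, query_params, m=0.999):
    """Move each key-encoder weight a small step toward the query encoder."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]

# Key encoder slowly tracks the query encoder (m=0.9 here just to make it visible)
key_params = [np.zeros(3)]
query_params = [np.ones(3)]
key_params = momentum_update(key_params, query_params, m=0.9)

# Fixed-size queue of past keys used as negatives: oldest entries drop out
queue = deque(maxlen=4)  # toy size; real queues are far larger
for step in range(6):
    queue.append(np.full(2, float(step)))
```

Only the query encoder is updated by gradients; the key encoder changes through `momentum_update`, which keeps the queued keys consistent with each other.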

MoCo v2: borrows ideas from SimCLR.

[2003.04297] Improved Baselines with Momentum Contrastive Learning (arxiv.org)

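Part of the encoder output is fed to a predictor. The predictor's output should be as close as possible to the remaining encoder outputs, which act as positives, and as far as possible from the negatives. In CPC the predictor is a GRU; in wav2vec it is a CNN.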

vq-wav2vec outputs discrete units rather than continuous vectors (this makes it possible to apply BERT and may help remove noise).

[1910.05453] vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations (arxiv.org)
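
Turning continuous frames into discrete units can be sketched as a nearest-neighbour lookup in a codebook. This is an assumption-laden toy: the function name `quantize` and the tiny codebook are hypothetical, and the training of the codebook itself is not shown.

```python
import numpy as np

def quantize(frames, codebook):
    """Map each frame to the index of its nearest codebook vector."""
    # Squared Euclidean distance between every frame and every code: shape (T, K)
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    ids = dists.argmin(axis=1)
    return ids, codebook[ids]

codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]])   # K=3 toy codes
frames = np.array([[0.1, -0.1], [0.9, 1.2], [-0.8, 0.7]])    # T=3 toy frames
ids, quantized = quantize(frames, codebook)
```

The sequence `ids` is a string of discrete tokens, so it can be fed to a BERT-style model much like text.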

[1911.03912] Effectiveness of self-supervised pre-training for speech recognition (arxiv.org)

[2006.11477] wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations (arxiv.org)

Classification vs. Contrastive

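Motivation for the contrastive idea: humans learn not only from positive signals but also from having mistakes corrected. Contrastive learning is in fact one paradigm of unsupervised learning.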

Choosing negative examples is not easy: they should be neither too hard nor too easy.

Bootstrapping Approaches

If the model is simply given no negative examples, it tends to output exactly the same vector for every input, which is known as collapse. (left)

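After a transformation is applied to the image, the vector produced by the encoder followed by a predictor is trained to be as close as possible to the vector produced by the encoder alone, and the encoder is updated. (right)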
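
Why collapse is tempting can be seen from a positive-only loss: an encoder that maps every input to the same vector already reaches the minimum. A small sketch, where `positive_only_loss` is a hypothetical name for the squared distance between normalised embeddings of two views:

```python
import numpy as np

def positive_only_loss(z_a, z_b):
    """Mean squared distance between L2-normalised embeddings of two views."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    return np.mean(np.sum((z_a - z_b) ** 2, axis=1))

# A "collapsed" encoder: 5 different inputs, all mapped to the same vector
collapsed = np.ones((5, 4))
collapsed_loss = positive_only_loss(collapsed, collapsed)

# The loss is perfect even though the embeddings carry no information
spread = collapsed.std(axis=0).max()
```

Bootstrapping methods avoid this trivial solution through asymmetry (the extra predictor and the way only one branch is updated) rather than through negatives.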

Typical knowledge distillation: a small model (the student) is trained until its output matches that of a large model (the teacher). (left)

Treat the encoder with the predictor attached as the student; after each round it becomes the new teacher. (right)

Simply Extra Regularization

The loss is split into three parts:

Invariance
Variance
Covariance
- Push the off-diagonal elements of the embeddings' covariance matrix toward 0
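
The three terms can be sketched in NumPy as follows. This is a rough illustration in the spirit of VICReg, not its reference implementation: the function name `vicreg_terms`, the target standard deviation of 1, and `eps` are assumptions, and the weights that combine the terms are omitted.

```python
import numpy as np

def vicreg_terms(z_a, z_b, eps=1e-4):
    """Return (invariance, variance, covariance) terms for two batches of embeddings."""
    n, d = z_a.shape

    # Invariance: the two views of a sample should have similar embeddings
    invariance = np.mean((z_a - z_b) ** 2)

    def variance(z):
        # Hinge on each dimension's std: penalise dimensions with std below 1
        std = np.sqrt(z.var(axis=0, ddof=1) + eps)
        return np.mean(np.maximum(0.0, 1.0 - std))

    def covariance(z):
        # Penalise off-diagonal entries of the covariance matrix
        zc = z - z.mean(axis=0)
        cov = zc.T @ zc / (n - 1)
        off_diag = cov - np.diag(np.diag(cov))
        return np.sum(off_diag ** 2) / d

    return invariance, variance(z_a) + variance(z_b), covariance(z_a) + covariance(z_b)

rng = np.random.default_rng(0)
healthy_z = 2.0 * rng.normal(size=(256, 4))  # well-spread embeddings
collapsed_z = np.ones((8, 4))                # every sample gets the same embedding

healthy = vicreg_terms(healthy_z, healthy_z)
collapsed = vicreg_terms(collapsed_z, collapsed_z)
```

For `collapsed_z` the invariance term is zero but the variance term is large, which is how the regularizer discourages collapse without using negatives.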

Concluding Remarks

Method	Image	Speech
Generative	GPT for image	Mockingjay, APC
Predictive	Rotation Prediction, etc.	HuBERT
Contrastive	SimCLR, MoCo	CPC, wav2vec series
Bootstrapping	BYOL, SimSiam	Data2vec
Regularization	Barlow Twins, VICReg	DeLoRes