Part1.RNN进阶
- GRU
- LSTM
- 深度循环神经网络
- 双向循环神经网络
GRU
RNN结构中容易出现梯度衰减或者爆炸,我们通过添加门控结构可以一定程度缓解这个问题,也可以捕捉时间序列中时间步距离较长的依赖关系。
$$R_{t} = σ(X_tW_{xr} + H_{t−1}W_{hr} + b_r)\\
Z_{t} = σ(X_tW_{xz} + H_{t−1}W_{hz} + b_z)\\
\widetilde{H}_t = tanh(X_tW_{xh} + (R_t ⊙H_{t−1})W_{hh} + b_h)\\
H_t = Z_t⊙H_{t−1} + (1−Z_t)⊙\widetilde{H}_t$$如上所示,GRU有一个重置门和一个更新门,两者都通过当前时间步的输入和上一时间步的隐藏状态计算,重置门可以决定上一个隐藏状态在计算当前候选隐藏状态的权重,有助于捕捉时间序列中的短期依赖关系。更新门是用来权衡当前候选隐藏状态与上一隐藏状态从而得到当前时间步的隐藏状态,有助于捕捉时间序列里的长期依赖关系。
从零实现GRU
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 |
def get_params(): def _one(shape): ts = torch.tensor(np.random.normal(0, 0.01, size=shape), device=device, dtype=torch.float32) #正态分布 return torch.nn.Parameter(ts, requires_grad=True) def _three(): return (_one((num_inputs, num_hiddens)), _one((num_hiddens, num_hiddens)), torch.nn.Parameter(torch.zeros(num_hiddens, device=device, dtype=torch.float32), requires_grad=True)) W_xz, W_hz, b_z = _three() # 更新门参数 W_xr, W_hr, b_r = _three() # 重置门参数 W_xh, W_hh, b_h = _three() # 候选隐藏状态参数 # 输出层参数 W_hq = _one((num_hiddens, num_outputs)) b_q = torch.nn.Parameter(torch.zeros(num_outputs, device=device, dtype=torch.float32), requires_grad=True) return nn.ParameterList([W_xz, W_hz, b_z, W_xr, W_hr, b_r, W_xh, W_hh, b_h, W_hq, b_q]) def init_gru_state(batch_size, num_hiddens, device): #隐藏状态初始化 return (torch.zeros((batch_size, num_hiddens), device=device), ) def gru(inputs, state, params): W_xz, W_hz, b_z, W_xr, W_hr, b_r, W_xh, W_hh, b_h, W_hq, b_q = params H, = state outputs = [] for X in inputs: Z = torch.sigmoid(torch.matmul(X, W_xz) + torch.matmul(H, W_hz) + b_z) R = torch.sigmoid(torch.matmul(X, W_xr) + torch.matmul(H, W_hr) + b_r) H_tilda = torch.tanh(torch.matmul(X, W_xh) + R * torch.matmul(H, W_hh) + b_h) H = Z * H + (1 - Z) * H_tilda Y = torch.matmul(H, W_hq) + b_q outputs.append(Y) return outputs, (H,) |
pytorch简洁调用
1 2 |
gru_layer = nn.GRU(input_size=vocab_size, hidden_size=num_hiddens) |
LSTM
LSTM成为长短期记忆单元,通过遗忘门、输入门、输出门来控制时序信息在单元内的流动,其中:
- 遗忘门可以控制上一时间步的记忆细胞参与计算当前步记忆细胞的权重;
- 输入门可以控制当前时间步计算的候选记忆细胞参与计算当前记忆细胞的权重;
- 输出门可以控制从记忆细胞到输出隐藏状态的权重;
- 记忆细胞是一种特殊的隐藏状态的信息流动。
$$I_t = σ(X_tW_{xi} + H_{t−1}W_{hi} + b_i) \\
F_t = σ(X_tW_{xf} + H_{t−1}W_{hf} + b_f)\\
O_t = σ(X_tW_{xo} + H_{t−1}W_{ho} + b_o)\
\widetilde{C}_t = tanh(X_tW_{xc} + H_{t−1}W_{hc} + b_c)\\
C_t = F_t ⊙C_{t−1} + I_t ⊙\widetilde{C}_t\\
H_t = O_t⊙tanh(C_t)$$从零实现LSTM
12345678910111213141516171819202122232425262728293031323334def get_params():def _one(shape):ts = torch.tensor(np.random.normal(0, 0.01, size=shape), device=device, dtype=torch.float32)return torch.nn.Parameter(ts, requires_grad=True)def _three():return (_one((num_inputs, num_hiddens)),_one((num_hiddens, num_hiddens)),torch.nn.Parameter(torch.zeros(num_hiddens, device=device, dtype=torch.float32), requires_grad=True))W_xi, W_hi, b_i = _three() # 输入门参数W_xf, W_hf, b_f = _three() # 遗忘门参数W_xo, W_ho, b_o = _three() # 输出门参数W_xc, W_hc, b_c = _three() # 候选记忆细胞参数# 输出层参数W_hq = _one((num_hiddens, num_outputs))b_q = torch.nn.Parameter(torch.zeros(num_outputs, device=device, dtype=torch.float32), requires_grad=True)return nn.ParameterList([W_xi, W_hi, b_i, W_xf, W_hf, b_f, W_xo, W_ho, b_o, W_xc, W_hc, b_c, W_hq, b_q])def init_lstm_state(batch_size, num_hiddens, device):return (torch.zeros((batch_size, num_hiddens), device=device),torch.zeros((batch_size, num_hiddens), device=device))def lstm(inputs, state, params):[W_xi, W_hi, b_i, W_xf, W_hf, b_f, W_xo, W_ho, b_o, W_xc, W_hc, b_c, W_hq, b_q] = params(H, C) = stateoutputs = []for X in inputs:I = torch.sigmoid(torch.matmul(X, W_xi) + torch.matmul(H, W_hi) + b_i)F = torch.sigmoid(torch.matmul(X, W_xf) + torch.matmul(H, W_hf) + b_f)O = torch.sigmoid(torch.matmul(X, W_xo) + torch.matmul(H, W_ho) + b_o)C_tilda = torch.tanh(torch.matmul(X, W_xc) + torch.matmul(H, W_hc) + b_c)C = F * C + I * C_tildaH = O * C.tanh()Y = torch.matmul(H, W_hq) + b_qoutputs.append(Y)return outputs, (H, C)pytorch简洁接口
1lstm_layer = nn.LSTM(input_size=vocab_size, hidden_size=num_hiddens)深度循环神经网络
$$\boldsymbol{H}_t^{(1)} = \phi(\boldsymbol{X}_t \boldsymbol{W}_{xh}^{(1)} + \boldsymbol{H}_{t-1}^{(1)} \boldsymbol{W}_{hh}^{(1)} + \boldsymbol{b}_h^{(1)})\\
\boldsymbol{H}_t^{(\ell)} = \phi(\boldsymbol{H}_t^{(\ell-1)} \boldsymbol{W}_{xh}^{(\ell)} + \boldsymbol{H}_{t-1}^{(\ell)} \boldsymbol{W}_{hh}^{(\ell)} + \boldsymbol{b}_h^{(\ell)})\\
\boldsymbol{O}_t = \boldsymbol{H}_t^{(L)} \boldsymbol{W}_{hq} + \boldsymbol{b}_q$$
深度循环神经网络就是级联了多个循环神经网络单元如LSTM,将上一个单元的隐藏状态输出作为下一个单元的时序输入。pytorch简洁实现
12gru_layer = nn.LSTM(input_size=vocab_size, hidden_size=num_hiddens,num_layers=2)#num_layers表示有几个单元双向循环神经网络
$$\overrightarrow{\boldsymbol{H}}_t=\phi(\boldsymbol{X}_t \boldsymbol{W}_{xh}^{(f)}+\overrightarrow{\boldsymbol{H}}_{t-1}\boldsymbol{W}_{hh}^{(f)}+\boldsymbol{b}_h^{(f)})\\
\overleftarrow{\boldsymbol{H}}_t = \phi(\boldsymbol{X}_t \boldsymbol{W}_{xh}^{(b)} + \overleftarrow{\boldsymbol{H}}_{t+1} \boldsymbol{W}_{hh}^{(b)} + \boldsymbol{b}_h^{(b)})\\
\boldsymbol{H}_t=(\overrightarrow{\boldsymbol{H}}_{t}, \overleftarrow{\boldsymbol{H}}_t)\\
\boldsymbol{O}_t = \boldsymbol{H}_t \boldsymbol{W}_{hq} + \boldsymbol{b}_q$$
双向循环神经网络不仅前向进行了一次计算得到隐藏状态,还反向进行了一次隐藏状态计算,最后将两个隐藏状态进行了contact得到输出。pytorch简洁实现
12gru_layer = nn.GRU(input_size=vocab_size, hidden_size=num_hiddens,bidirectional=True)#bidirectional设为true就是双向Part2.机器翻译
1:数据处理及准备
2:Sequence2Sequence
3:Attention Sequence2Sequence
数据处理
机器翻译(MT):将一段文本从一种语言自动翻译为另一种语言,用神经网络解决这个问题通常称为神经机器翻译(NMT)。 主要特征:输出是单词序列而不是单个单词。 输出序列的长度可能与源序列的长度不同。我们这里用一个英语到法语的数据集。
数据清洗
去除不需要的字符,大小写转换等
1 2 3 4 5 6 7 8 9 10 11 12 |
with open('/home/kesci/input/fraeng6506/fra.txt', 'r') as f: raw_text = f.read() def preprocess_raw(text): text = text.replace('\u202', ' ').replace('\xa0', ' ') out = ' ' for i, char in enumerate(text.lower()): if char in (' ', '!', '.') and i > 0 and text[i-1] != ' ': out += ' ' out += char return out text = preprocess_raw(raw_text) |
分词建立词典
将source和target语言分别进行分词并建立词典,用到了前面介绍的函数。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
num_examples = 50000 source, target = [], [] for i, line in enumerate(text.split('\n')): if i < num_examples: break parts = line.split('\t') if len(parts) >= 2: source.append(parts[0].split(' ')) target.append(parts[1].split(' ')) def build_vocab(tokens): tokens = [token for line in tokens for token in line] return d2l.data.base.Vocab(tokens, min_freq=3, use_special_tokens=True) src_vocab = build_vocab(source) |
建立数据集
通过pytorch的Dataset建立数据集
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
def pad(line, max_len, padding_token): if len(line) < max_len: return line[:max_len] return line + [padding_token] * (max_len - len(line)) def build_array(lines, vocab, max_len, is_source): lines = [vocab[line] for line in lines] if not is_source: lines = [[vocab.bos] + line + [vocab.eos] for line in lines] array = torch.tensor([pad(line, max_len, vocab.pad) for line in lines]) valid_len = (array != vocab.pad).sum(1) #第一个维度 return array, valid_len def load_data_nmt(batch_size, max_len): # This function is saved in d2l. src_vocab, tgt_vocab = build_vocab(source), build_vocab(target) src_array, src_valid_len = build_array(source, src_vocab, max_len, True) tgt_array, tgt_valid_len = build_array(target, tgt_vocab, max_len, False) train_data = data.TensorDataset(src_array, src_valid_len, tgt_array, tgt_valid_len) train_iter = data.DataLoader(train_data, batch_size, shuffle=True) return src_vocab, tgt_vocab, train_iter |
建立模型
Sequence2Sequence模型是Encoder-Decoder的形式,通过encoder将输入序列转化为语义编码,然后利用decoder将语义编码转化为输出序列。
Sequnece2Sequence训练示意图:
Sequence2Sequence预测示意图:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 |
class Encoder(nn.Module): def __init__(self, **kwargs): super(Encoder, self).__init__(**kwargs) def forward(self, X, *args): raise NotImplementedError class Decoder(nn.Module): def __init__(self, **kwargs): super(Decoder, self).__init__(**kwargs) def init_state(self, enc_outputs, *args): raise NotImplementedError def forward(self, X, state): raise NotImplementedError class EncoderDecoder(nn.Module): def __init__(self, encoder, decoder, **kwargs): super(EncoderDecoder, self).__init__(**kwargs) self.encoder = encoder self.decoder = decoder def forward(self, enc_X, dec_X, *args): enc_outputs = self.encoder(enc_X, *args) dec_state = self.decoder.init_state(enc_outputs, *args) return self.decoder(dec_X, dec_state) class Seq2SeqEncoder(d2l.Encoder): def __init__(self, vocab_size, embed_size, num_hiddens, num_layers, dropout=0, **kwargs): super(Seq2SeqEncoder, self).__init__(**kwargs) self.num_hiddens=num_hiddens self.num_layers=num_layers self.embedding = nn.Embedding(vocab_size, embed_size) self.rnn = nn.LSTM(embed_size,num_hiddens, num_layers, dropout=dropout) def begin_state(self, batch_size, device): return [torch.zeros(size=(self.num_layers, batch_size, self.num_hiddens), device=device), torch.zeros(size=(self.num_layers, batch_size, self.num_hiddens), device=device)] def forward(self, X, *args): X = self.embedding(X) # X shape: (batch_size, seq_len, embed_size) X = X.transpose(0, 1) # RNN needs first axes to be time # state = self.begin_state(X.shape[1], device=X.device) out, state = self.rnn(X) # The shape of out is (seq_len, batch_size, num_hiddens). # state contains the hidden state and the memory cell # of the last time step, the shape is (num_layers, batch_size, num_hiddens) return out, state class Seq2SeqDecoder(d2l.Decoder): def __init__(self, vocab_size, embed_size, num_hiddens, num_layers, dropout=0, **kwargs): super(Seq2SeqDecoder, self).__init__(**kwargs) self.embedding = nn.Embedding(vocab_size, embed_size) self.rnn = nn.LSTM(embed_size,num_hiddens, num_layers, dropout=dropout) self.dense = nn.Linear(num_hiddens,vocab_size) def init_state(self, enc_outputs, *args): return enc_outputs[1] def forward(self, X, state): X = self.embedding(X).transpose(0, 1) out, state = self.rnn(X, state) # Make the batch to be the first dimension to simplify loss computation. out = self.dense(out).transpose(0, 1) return out, state |
损失函数
定义带有mask的损失函数,因为输出与输入长度不一致,我们通过有效长度来控制损失函数的有效计算。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
def SequenceMask(X, X_len,value=0): maxlen = X.size(1) mask = torch.arange(maxlen)[None, :].to(X_len.device) < X_len[:, None] X[~mask]=value return X class MaskedSoftmaxCELoss(nn.CrossEntropyLoss): # pred shape: (batch_size, seq_len, vocab_size) # label shape: (batch_size, seq_len) # valid_length shape: (batch_size, ) def forward(self, pred, label, valid_length): # the sample weights shape should be (batch_size, seq_len) weights = torch.ones_like(label) weights = SequenceMask(weights, valid_length).float() self.reduction='none' output=super(MaskedSoftmaxCELoss, self).forward(pred.transpose(1,2), label) return (output*weights).mean(dim=1) |
模型训练与预测
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 |
def train_ch7(model, data_iter, lr, num_epochs, device): # Saved in d2l model.to(device) optimizer = optim.Adam(model.parameters(), lr=lr) loss = MaskedSoftmaxCELoss() tic = time.time() for epoch in range(1, num_epochs+1): l_sum, num_tokens_sum = 0.0, 0.0 for batch in data_iter: optimizer.zero_grad() X, X_vlen, Y, Y_vlen = [x.to(device) for x in batch] Y_input, Y_label, Y_vlen = Y[:,:-1], Y[:,1:], Y_vlen-1 Y_hat, _ = model(X, Y_input, X_vlen, Y_vlen) l = loss(Y_hat, Y_label, Y_vlen).sum() l.backward() with torch.no_grad(): d2l.grad_clipping_nn(model, 5, device) num_tokens = Y_vlen.sum().item() optimizer.step() l_sum += l.sum().item() num_tokens_sum += num_tokens if epoch % 50 == 0: print('epoch {0:4d},loss {1:.3f}, time {2:.1f} sec'.format( epoch, (l_sum/num_tokens_sum), time.time()-tic)) tic = time.time() embed_size, num_hiddens, num_layers, dropout = 32, 32, 2, 0.0 batch_size, num_examples, max_len = 64, 1e3, 10 lr, num_epochs, ctx = 0.005, 300, d2l.try_gpu() src_vocab, tgt_vocab, train_iter = d2l.load_data_nmt( batch_size, max_len,num_examples) encoder = Seq2SeqEncoder( len(src_vocab), embed_size, num_hiddens, num_layers, dropout) decoder = Seq2SeqDecoder( len(tgt_vocab), embed_size, num_hiddens, num_layers, dropout) model = d2l.EncoderDecoder(encoder, decoder) train_ch7(model, train_iter, lr, num_epochs, ctx) def translate_ch7(model, src_sentence, src_vocab, tgt_vocab, max_len, device): src_tokens = src_vocab[src_sentence.lower().split(' ')] src_len = len(src_tokens) if src_len < max_len: src_tokens += [src_vocab.pad] * (max_len - src_len) enc_X = torch.tensor(src_tokens, device=device) enc_valid_length = torch.tensor([src_len], device=device) # use expand_dim to add the batch_size dimension. enc_outputs = model.encoder(enc_X.unsqueeze(dim=0), enc_valid_length) dec_state = model.decoder.init_state(enc_outputs, enc_valid_length) dec_X = torch.tensor([tgt_vocab.bos], device=device).unsqueeze(dim=0) predict_tokens = [] for _ in range(max_len): Y, dec_state = model.decoder(dec_X, dec_state) # The token with highest score is used as the next time step input. dec_X = Y.argmax(dim=2) py = dec_X.squeeze(dim=0).int().item() if py == tgt_vocab.eos: break predict_tokens.append(py) return ' '.join(tgt_vocab.to_tokens(predict_tokens)) for sentence in ['Go .', 'Wow !', "I'm OK .", 'I won !']: print(sentence + ' => ' + translate_ch7( model, sentence, src_vocab, tgt_vocab, max_len, ctx)) |
Attention Sequence2Sequence
注意力机制
在“编码器—解码器(seq2seq)”⼀节⾥,解码器在各个时间步依赖相同的背景变量(context vector)来获取输⼊序列信息。当编码器为循环神经⽹络时,背景变量来⾃它最终时间步的隐藏状态。将源序列输入信息以循环单位状态编码,然后将其传递给解码器以生成目标序列。然而这种结构存在着问题,尤其是RNN机制实际中存在长程梯度消失的问题,对于较长的句子,我们很难寄希望于将输入的序列转化为定长的向量而保存所有的有效信息,所以随着所需翻译句子的长度的增加,这种结构的效果会显著下降。
与此同时,解码的目标词语可能只与原输入的部分词语有关,而并不是与所有的输入有关。例如,当把“Hello world”翻译成“Bonjour le monde”时,“Hello”映射成“Bonjour”,“world”映射成“monde”。在seq2seq模型中,解码器只能隐式地从编码器的最终状态中选择相应的信息。然而,注意力机制可以将这种选择过程显式地建模。
注意力机制框架
Attention 是一种通用的带权池化方法,输入由两部分构成:询问(query)和键值对(key-value pairs)。\(𝐤_𝑖∈ℝ^{𝑑_𝑘}, 𝐯_𝑖∈ℝ^{𝑑_𝑣}\). Query\(𝐪∈ℝ^{𝑑_𝑞}\), attention layer得到输出与value的维度一致\(𝐨∈ℝ^{𝑑_𝑣}\). 对于一个query来说,attention layer 会与每一个key计算注意力分数并进行权重的归一化,输出的向量则是value的加权求和,而每个key计算的权重与value一一对应。
为了计算输出,我们首先假设有一个函数\(\alpha\)用于计算query和key的相似性,然后可以计算所有的 attention scores \(a_1, \ldots, a_n\) by$$a_i = \alpha(\mathbf q, \mathbf k_i).$$
我们使用 softmax函数 获得注意力权重:
$$b_1, \ldots, b_n = \textrm{softmax}(a_1, \ldots, a_n).$$
最终的输出就是value的加权求和:$$\mathbf o = \sum_{i=1}^n b_i \mathbf v_i.$$
Dot-product Attention
The dot product 假设query和keys有相同的维度, 即\(\forall i, 𝐪,𝐤_𝑖 ∈ ℝ_𝑑\). 通过计算query和key转置的乘积来计算attention score,通常还会除去\(\sqrt{d}\)减少计算出来的score对维度𝑑的依赖性,如下$$𝛼(𝐪,𝐤)=⟨𝐪,𝐤⟩/ \sqrt{d}$$
假设 \(𝐐∈ℝ^{𝑚×𝑑}\)有 m个query,\(𝐊∈ℝ^{𝑛×𝑑}\)有n个keys. 我们可以通过矩阵运算的方式计算所有mn个score:$$𝛼(𝐐,𝐊)=𝐐𝐊^𝑇/\sqrt{d}$$
现在让我们实现这个层,它支持一批查询和键值对。此外,它支持作为正则化随机删除一些注意力权重.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 |
def SequenceMask(X, X_len,value=-1e6): maxlen = X.size(1) #print(X.size(),torch.arange((maxlen),dtype=torch.float)[None, :],'\n',X_len[:, None] ) mask = torch.arange((maxlen),dtype=torch.float)[None, :] >= X_len[:, None] #print(mask) X[mask]=value return X def masked_softmax(X, valid_length): # X: 3-D tensor, valid_length: 1-D or 2-D tensor softmax = nn.Softmax(dim=-1) if valid_length is None: return softmax(X) else: shape = X.shape if valid_length.dim() == 1: try: valid_length = torch.FloatTensor(valid_length.numpy().repeat(shape[1], axis=0))#[2,2,3,3] except: valid_length = torch.FloatTensor(valid_length.cpu().numpy().repeat(shape[1], axis=0))#[2,2,3,3] else: valid_length = valid_length.reshape((-1,)) # fill masked elements with a large negative, whose exp is 0 X = SequenceMask(X.reshape((-1, shape[-1])), valid_length) return softmax(X).reshape(shape) class DotProductAttention(nn.Module): def __init__(self, dropout, **kwargs): super(DotProductAttention, self).__init__(**kwargs) self.dropout = nn.Dropout(dropout) # query: (batch_size, #queries, d) # key: (batch_size, #kv_pairs, d) # value: (batch_size, #kv_pairs, dim_v) # valid_length: either (batch_size, ) or (batch_size, xx) def forward(self, query, key, value, valid_length=None): d = query.shape[-1] # set transpose_b=True to swap the last two dimensions of key scores = torch.bmm(query, key.transpose(1,2)) / math.sqrt(d) attention_weights = self.dropout(masked_softmax(scores, valid_length)) print("attention_weight\n",attention_weights) return torch.bmm(attention_weights, value) |
多层感知机注意力机制
在多层感知器中,我们首先将 query and keys 投影到\(ℝ^ℎ\).为了更具体,我们将可以学习的参数做如下映射\(𝐖_𝑘∈ℝ^{ℎ×𝑑_𝑘}\),\(𝐖_𝑞∈ℝ^{ℎ×𝑑_𝑞}\) , and\(𝐯∈ℝ^h\). 将score函数定义$$𝛼(𝐤,𝐪)=𝐯^𝑇tanh(𝐖_𝑘𝐤+𝐖_𝑞𝐪)$$
. 然后将key 和 value 在特征的维度上合并(concatenate),然后送至 a single hidden layer perceptron 这层中 hidden layer 为 ℎ and 输出的size为 1 .隐层激活函数为tanh,无偏置.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
class MLPAttention(nn.Module): def __init__(self, units,ipt_dim,dropout, **kwargs): super(MLPAttention, self).__init__(**kwargs) # Use flatten=True to keep query's and key's 3-D shapes. self.W_k = nn.Linear(ipt_dim, units, bias=False) self.W_q = nn.Linear(ipt_dim, units, bias=False) self.v = nn.Linear(units, 1, bias=False) self.dropout = nn.Dropout(dropout) def forward(self, query, key, value, valid_length): query, key = self.W_k(query), self.W_q(key) #print("size",query.size(),key.size()) # expand query to (batch_size, #querys, 1, units), and key to # (batch_size, 1, #kv_pairs, units). Then plus them with broadcast. features = query.unsqueeze(2) + key.unsqueeze(1) #print("features:",features.size()) #--------------开启 scores = self.v(features).squeeze(-1) attention_weights = self.dropout(masked_softmax(scores, valid_length)) return torch.bmm(attention_weights, value) |
Attention Seq2Seq
下图展示encoding 和decoding的模型结构,在时间步为t的时候。此刻attention layer保存着encodering看到的所有信息——即encoding的每一步输出。在decoding阶段,解码器的t时刻的隐藏状态被当作query,encoder的每个时间步的hidden states作为key和value进行attention聚合. Attetion model的输出当作成上下文信息context vector,并与解码器输入\(D_t\)拼接起来一起送到解码器:
下图展示了seq2seq机制的所以层的关系,下面展示了encoder和decoder的layer结构:
由于带有注意机制的seq2seq的编码器与之前章节中的Seq2SeqEncoder相同,所以在此处我们只关注解码器。我们添加了一个MLP注意层(MLPAttention),它的隐藏大小与解码器中的LSTM层相同。然后我们通过从编码器传递三个参数来初始化解码器的状态:
- the encoder outputs of all timesteps:encoder输出的各个状态,被用于attetion layer的memory部分,有相同的key和values
- the hidden state of the encoder’s final timestep:编码器最后一个时间步的隐藏状态,被用于初始化decoder 的hidden state
- the encoder valid length: 编码器的有效长度,借此,注意层不会考虑编码器输出中的填充标记(Paddings)
在解码的每个时间步,我们使用解码器的最后一个RNN层的输出作为注意层的query。然后,将注意力模型的输出与输入嵌入向量连接起来,输入到RNN层。虽然RNN层隐藏状态也包含来自解码器的历史信息,但是attention model的输出显式地选择了enc_valid_len以内的编码器输出,这样attention机制就会尽可能排除其他不相关的信息。
1234567891011121314151617181920212223242526272829303132333435363738394041class Seq2SeqAttentionDecoder(d2l.Decoder):def __init__(self, vocab_size, embed_size, num_hiddens, num_layers,dropout=0, **kwargs):super(Seq2SeqAttentionDecoder, self).__init__(**kwargs)self.attention_cell = MLPAttention(num_hiddens,num_hiddens, dropout)self.embedding = nn.Embedding(vocab_size, embed_size)self.rnn = nn.LSTM(embed_size+ num_hiddens,num_hiddens, num_layers, dropout=dropout)self.dense = nn.Linear(num_hiddens,vocab_size)def init_state(self, enc_outputs, enc_valid_len, *args):outputs, hidden_state = enc_outputs# print("first:",outputs.size(),hidden_state[0].size(),hidden_state[1].size())# Transpose outputs to (batch_size, seq_len, hidden_size)return (outputs.permute(1,0,-1), hidden_state, enc_valid_len)#outputs.swapaxes(0, 1)def forward(self, X, state):enc_outputs, hidden_state, enc_valid_len = state#("X.size",X.size())X = self.embedding(X).transpose(0,1)# print("Xembeding.size2",X.size())outputs = []for l, x in enumerate(X):# print(f"\n{l}-th token")# print("x.first.size()",x.size())# query shape: (batch_size, 1, hidden_size)# select hidden state of the last rnn layer as queryquery = hidden_state[0][-1].unsqueeze(1) # np.expand_dims(hidden_state[0][-1], axis=1)# context has same shape as query# print("query enc_outputs, enc_outputs:\n",query.size(), enc_outputs.size(), enc_outputs.size())context = self.attention_cell(query, enc_outputs, enc_outputs, enc_valid_len)# Concatenate on the feature dimension# print("context.size:",context.size())x = torch.cat((context, x.unsqueeze(1)), dim=-1)# Reshape x to (1, batch_size, embed_size+hidden_size)# print("rnn",x.size(), len(hidden_state))out, hidden_state = self.rnn(x.transpose(0,1), hidden_state)outputs.append(out)outputs = self.dense(torch.cat(outputs, dim=0))return outputs.transpose(0, 1), [enc_outputs, hidden_state,enc_valid_len]Part3.Transformer
下图展示了Transformer模型的架构,与seq2seq模型相似,Transformer同样基于编码器-解码器架构,其区别主要在于以下三点:
- Transformer blocks:将seq2seq模型重的循环网络替换为了Transformer Blocks,该模块包含一个多头注意力层(Multi-head Attention Layers)以及两个position-wise feed-forward networks(FFN)。对于解码器来说,另一个多头注意力层被用于接受编码器的隐藏状态。
- Add and norm:多头注意力层和前馈网络的输出被送到两个“add and norm”层进行处理,该层包含残差结构以及层归一化。
- Position encoding:由于自注意力层并没有区分元素的顺序,所以一个位置编码层被用于向序列元素里添加位置信息。
多头注意力机制
在我们讨论多头注意力层之前,先来迅速理解以下自注意力(self-attention)的结构。自注意力模型是一个正规的注意力模型,序列的每一个元素对应的key,value,query是完全一致的。如下图自注意力输出了一个与输入长度相同的表征序列,与循环神经网络相比,自注意力对每个元素输出的计算是并行的,所以我们可以高效的实现这个模块。
多头注意力层包含h个并行的自注意力层,每一个这种层被成为一个head。对每个头来说,在进行注意力计算之前,我们会将query、key和value用三个现行层进行映射,这h个注意力头的输出将会被拼接之后输入最后一个线性层进行整合。12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455def transpose_qkv(X, num_heads):# Original X shape: (batch_size, seq_len, hidden_size * num_heads),# -1 means inferring its value, after first reshape, X shape:# (batch_size, seq_len, num_heads, hidden_size)X = X.view(X.shape[0], X.shape[1], num_heads, -1)# After transpose, X shape: (batch_size, num_heads, seq_len, hidden_size)X = X.transpose(2, 1).contiguous()# Merge the first two dimensions. Use reverse=True to infer shape from# right to left.# output shape: (batch_size * num_heads, seq_len, hidden_size)output = X.view(-1, X.shape[2], X.shape[3])return outputdef transpose_output(X, num_heads):# A reversed version of transpose_qkvX = X.view(-1, num_heads, X.shape[1], X.shape[2])X = X.transpose(2, 1).contiguous()return X.view(X.shape[0], X.shape[1], -1)class MultiHeadAttention(nn.Module):def __init__(self, input_size, hidden_size, num_heads, dropout, **kwargs):super(MultiHeadAttention, self).__init__(**kwargs)self.num_heads = num_headsself.attention = DotProductAttention(dropout)self.W_q = nn.Linear(input_size, hidden_size, bias=False)self.W_k = nn.Linear(input_size, hidden_size, bias=False)self.W_v = nn.Linear(input_size, hidden_size, bias=False)self.W_o = nn.Linear(hidden_size, hidden_size, bias=False)def forward(self, query, key, value, valid_length):# query, key, and value shape: (batch_size, seq_len, dim),# where seq_len is the length of input sequence# valid_length shape is either (batch_size, )# or (batch_size, seq_len).# Project and transpose query, key, and value from# (batch_size, seq_len, hidden_size * num_heads) to# (batch_size * num_heads, seq_len, hidden_size).query = transpose_qkv(self.W_q(query), self.num_heads)key = transpose_qkv(self.W_k(key), self.num_heads)value = transpose_qkv(self.W_v(value), self.num_heads)if valid_length is not None:# Copy valid_length by num_heads timesdevice = valid_length.devicevalid_length = valid_length.cpu().numpy() if valid_length.is_cuda else valid_length.numpy()if valid_length.ndim == 1:valid_length = torch.FloatTensor(np.tile(valid_length, self.num_heads))else:valid_length = torch.FloatTensor(np.tile(valid_length, (self.num_heads,1)))valid_length = valid_length.to(device)output = self.attention(query, key, value, valid_length)output_concat = transpose_output(output, self.num_heads)return self.W_o(output_concat)基于位置的前馈网络
Transformer 模块另一个非常重要的部分就是基于位置的前馈网络(FFN),它接受一个形状为(batch_size,seq_length, feature_size)的三维张量。Position-wise FFN由两个全连接层组成,他们作用在最后一维上。因为序列的每个位置的状态都会被单独地更新,所以我们称他为position-wise,这等效于一个1x1的卷积。
1234567class PositionWiseFFN(nn.Module):def __init__(self, input_size, ffn_hidden_size, hidden_size_out, **kwargs):super(PositionWiseFFN, self).__init__(**kwargs)self.ffn_1 = nn.Linear(input_size, ffn_hidden_size)self.ffn_2 = nn.Linear(ffn_hidden_size, hidden_size_out)def forward(self, X):return self.ffn_2(F.relu(self.ffn_1(X)))Add and Norm
Transformer还有一个重要的相加归一化层,它可以平滑地整合输入和其他层的输出,因此我们在每个多头注意力层和FFN层后面都添加一个含残差连接的Layer Norm层。 Layer Norm 与Batch Norm很相似,唯一的区别在于Batch Norm是对于batch size这个维度进行计算均值和方差的,而Layer Norm则是对最后一维进行计算。层归一化可以防止层内的数值变化过大,从而有利于加快训练速度并且提高泛化性能。
1234567class AddNorm(nn.Module):def __init__(self, hidden_size, dropout, **kwargs):super(AddNorm, self).__init__(**kwargs)self.dropout = nn.Dropout(dropout)self.norm = nn.LayerNorm(hidden_size)def forward(self, X, Y):return self.norm(self.dropout(Y) + X)位置编码
与循环神经网络不同,无论是多头注意力网络还是前馈神经网络都是独立地对每个位置的元素进行更新,这种特性帮助我们实现了高效的并行,却丢失了重要的序列顺序的信息。为了更好的捕捉序列信息,Transformer模型引入了位置编码去保持输入序列元素的位置。
假设输入序列的嵌入表示\(X\in \mathbb{R}^{l\times d}\), 序列长度为l嵌入向量维度为d,则其位置编码为\(P \in \mathbb{R}^{l\times d}\),输出的向量就是二者相加 X+P。
位置编码是一个二维的矩阵,i对应着序列中的顺序,j对应其embedding vector内部的维度索引。我们可以通过以下等式计算位置编码:$$P_{i,2j+1} = cos(i/10000^{2j/d})
\\P_{i,2j+1} = cos(i/10000^{2j/d})
\\for\ i=0,\ldots, l-1\ and\ j=0,\ldots,\lfloor (d-1)/2 \rfloor$$
12345678910111213141516class PositionalEncoding(nn.Module):def __init__(self, embedding_size, dropout, max_len=1000):super(PositionalEncoding, self).__init__()self.dropout = nn.Dropout(dropout)self.P = np.zeros((1, max_len, embedding_size))X = np.arange(0, max_len).reshape(-1, 1) / np.power(10000, np.arange(0, embedding_size, 2)/embedding_size)self.P[:, :, 0::2] = np.sin(X)self.P[:, :, 1::2] = np.cos(X)self.P = torch.FloatTensor(self.P)def forward(self, X):if X.is_cuda and not self.P.is_cuda:self.P = self.P.cuda()X = X + self.P[:, :X.shape[1], :]return self.dropout(X)
Encoder
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 |
class EncoderBlock(nn.Module): def __init__(self, embedding_size, ffn_hidden_size, num_heads, dropout, **kwargs): super(EncoderBlock, self).__init__(**kwargs) self.attention = MultiHeadAttention(embedding_size, embedding_size, num_heads, dropout) self.addnorm_1 = AddNorm(embedding_size, dropout) self.ffn = PositionWiseFFN(embedding_size, ffn_hidden_size, embedding_size) self.addnorm_2 = AddNorm(embedding_size, dropout) def forward(self, X, valid_length): Y = self.addnorm_1(X, self.attention(X, X, X, valid_length)) return self.addnorm_2(Y, self.ffn(Y)) class TransformerEncoder(d2l.Encoder): def __init__(self, vocab_size, embedding_size, ffn_hidden_size, num_heads, num_layers, dropout, **kwargs): super(TransformerEncoder, self).__init__(**kwargs) self.embedding_size = embedding_size self.embed = nn.Embedding(vocab_size, embedding_size) self.pos_encoding = PositionalEncoding(embedding_size, dropout) self.blks = nn.ModuleList() for i in range(num_layers): self.blks.append( EncoderBlock(embedding_size, ffn_hidden_size, num_heads, dropout)) def forward(self, X, valid_length, *args): X = self.pos_encoding(self.embed(X) * math.sqrt(self.embedding_size)) for blk in self.blks: X = blk(X, valid_length) return X |
Decoder
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 |
class DecoderBlock(nn.Module): def __init__(self, embedding_size, ffn_hidden_size, num_heads,dropout,i,**kwargs): super(DecoderBlock, self).__init__(**kwargs) self.i = i self.attention_1 = MultiHeadAttention(embedding_size, embedding_size, num_heads, dropout) self.addnorm_1 = AddNorm(embedding_size, dropout) self.attention_2 = MultiHeadAttention(embedding_size, embedding_size, num_heads, dropout) self.addnorm_2 = AddNorm(embedding_size, dropout) self.ffn = PositionWiseFFN(embedding_size, ffn_hidden_size, embedding_size) self.addnorm_3 = AddNorm(embedding_size, dropout) def forward(self, X, state): enc_outputs, enc_valid_length = state[0], state[1] # state[2][self.i] stores all the previous t-1 query state of layer-i # len(state[2]) = num_layers # If training: # state[2] is useless. # If predicting: # In the t-th timestep: # state[2][self.i].shape = (batch_size, t-1, hidden_size) # Demo: # love dogs ! [EOS] # | | | | # Transformer # Decoder # | | | | # I love dogs ! if state[2][self.i] is None: key_values = X else: # shape of key_values = (batch_size, t, hidden_size) key_values = torch.cat((state[2][self.i], X), dim=1) state[2][self.i] = key_values if self.training: batch_size, seq_len, _ = X.shape # Shape: (batch_size, seq_len), the values in the j-th column are j+1 valid_length = torch.FloatTensor(np.tile(np.arange(1, seq_len+1), (batch_size, 1))) valid_length = valid_length.to(X.device) else: valid_length = None X2 = self.attention_1(X, key_values, key_values, valid_length) Y = self.addnorm_1(X, X2) Y2 = self.attention_2(Y, enc_outputs, enc_outputs, enc_valid_length) Z = self.addnorm_2(Y, Y2) return self.addnorm_3(Z, self.ffn(Z)), state class TransformerDecoder(d2l.Decoder): def __init__(self, vocab_size, embedding_size, ffn_hidden_size, num_heads, num_layers, dropout, **kwargs): super(TransformerDecoder, self).__init__(**kwargs) self.embedding_size = embedding_size self.num_layers = num_layers self.embed = nn.Embedding(vocab_size, embedding_size) self.pos_encoding = PositionalEncoding(embedding_size, dropout) self.blks = nn.ModuleList() for i in range(num_layers): self.blks.append( DecoderBlock(embedding_size, ffn_hidden_size, num_heads, dropout, i)) self.dense = nn.Linear(embedding_size, vocab_size) def init_state(self, enc_outputs, enc_valid_length, *args): return [enc_outputs, enc_valid_length, [None]*self.num_layers] def forward(self, X, state): X = self.pos_encoding(self.embed(X) * math.sqrt(self.embedding_size)) for blk in self.blks: X, state = blk(X, state) return self.dense(X), state |