```python
import torch
import tensorflow as tf
```
First we define some parameters:
- `seq_len` (sequence length): the length of the sequence. For an RNN it can vary from batch to batch; as a first step, we define it as a constant and show how an LSTM works on a single batch. This is also called the timestep length, or timesteps.
- `input_size` (input feature size): each input element, e.g. a word in a sentence, is represented by a vector of length `input_size`. We normally call this the embedding of the actual individual datum (word).
- `batch_size` (batch size): how many copies of data are in a single batch.
- `hidden_size` (hidden units size): the number of features in the hidden state. Note that the hidden state is the output features.
```python
seq_len, batch_size, input_size, hidden_size = 17, 2, 10, 7
I_tc = torch.randn(seq_len, batch_size, input_size)       # Input for PyTorch
I_tf = tf.random.normal([batch_size, seq_len, input_size])  # Input for TensorFlow
```
Now, let’s show a simple LSTM layer in PyTorch first:
```python
lstm_tc_simple = torch.nn.LSTM(input_size, hidden_size)
O_tc_simple, (H_tc_simple, C_tc_simple) = lstm_tc_simple(I_tc)

assert O_tc_simple.shape == torch.Size((seq_len, batch_size, hidden_size))
assert H_tc_simple.shape == torch.Size((1, batch_size, hidden_size))
assert C_tc_simple.shape == torch.Size((1, batch_size, hidden_size))
```
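As an aside (a minimal sketch of our own, not part of the original example): PyTorch's LSTM also accepts an optional initial `(h_0, c_0)` state as a second argument, and defaults both to zeros when it is omitted:

```python
# Passing explicit zero initial states reproduces the default behavior.
h0 = torch.zeros(1, batch_size, hidden_size)  # (num_layers * num_directions, batch, hidden)
c0 = torch.zeros(1, batch_size, hidden_size)
O_init, (H_init, C_init) = lstm_tc_simple(I_tc, (h0, c0))
assert torch.equal(O_init, O_tc_simple)
```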
Next, another example in TensorFlow:
```python
lstm_tf_simple = tf.keras.layers.LSTM(hidden_size)
O_tf_simple = lstm_tf_simple(I_tf)

assert O_tf_simple.shape == tf.TensorShape((batch_size, hidden_size))
```
As you can see, the inputs to the two LSTMs are identical, modulo the order of the batch and time axes (the sketch below shows how to make PyTorch accept TF's batch-major layout).
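A quick sketch of our own, using PyTorch's documented `batch_first` flag:

```python
# batch_first=True makes PyTorch expect (batch, seq, feature), the same layout as TF.
lstm_tc_bf = torch.nn.LSTM(input_size, hidden_size, batch_first=True)
O_bf, (H_bf, C_bf) = lstm_tc_bf(I_tc.transpose(0, 1))  # reorder axes to batch-major
assert O_bf.shape == torch.Size((batch_size, seq_len, hidden_size))
```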
But there are several questions we can ask about the differences between the two sets of outputs:

- What do `O_tc_simple`, `H_tc_simple`, and `C_tc_simple` mean?
- What does `O_tf_simple` mean? How does it correspond to the PyTorch gang?
Let’s look at some diagrams from Colah’s post on LSTMs:

The above shows the general structure of an RNN:
> In the above diagram, a chunk of neural network, $A$, looks at some input $x_t$ and outputs a value $h_t$. A loop allows information to be passed from one step of the network to the next.
So the basic functionality of an LSTM (or any RNN in general) is converting a sequence of $x_t$ into a sequence of $h_t$.
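To make that loop concrete, here is a small sketch of our own that unrolls the recurrence by hand with `torch.nn.LSTMCell`; the stacked per-step outputs have the same shape as `O_tc_simple` from earlier:

```python
# Unroll the RNN loop manually with a single LSTM cell.
cell = torch.nn.LSTMCell(input_size, hidden_size)
h = torch.zeros(batch_size, hidden_size)  # initial hidden state
c = torch.zeros(batch_size, hidden_size)  # initial cell state
outputs = []
for t in range(seq_len):
    h, c = cell(I_tc[t], (h, c))  # one step: x_t -> h_t
    outputs.append(h)
O_manual = torch.stack(outputs)  # (seq_len, batch_size, hidden_size)
assert O_manual.shape == O_tc_simple.shape
```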
`O_tc_simple` contains the sequence of hidden states (output features) $h_t$, one per timestep, for each example in the batch.
`H_tc_simple` contains the hidden state of the most recent timestep, with shape `(num_layers * num_directions, batch_size, hidden_size)`. Since by default we constructed an LSTM with only one layer and one direction, the first dimension is just 1.
`C_tc_simple` contains the cell state. You can think of each hidden state $h_t$ as being associated with a cell state $c_t$:
In fact, we can verify that when there is only one layer and one direction, `H_tc_simple` is exactly the output features at the last timestep.
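Comparing them elementwise (a one-liner of our own that yields the output shown):

```python
print(H_tc_simple == O_tc_simple[-1])
```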
```
tensor([[[True, True, True, True, True, True, True],
         [True, True, True, True, True, True, True]]])
```
`O_tf_simple` is actually just the flattened version of `H_tc_simple`: the same `(batch_size, hidden_size)` features without the leading dimension. There is another configuration of TF's LSTM that makes it output the full set:
```python
lstm_tf_simple2 = tf.keras.layers.LSTM(hidden_size, return_sequences=True, return_state=True)
O_tf_simple2, H_tf_simple2, C_tf_simple2 = lstm_tf_simple2(I_tf)

assert O_tf_simple2.shape == tf.TensorShape((batch_size, seq_len, hidden_size))
assert H_tf_simple2.shape == tf.TensorShape((batch_size, hidden_size))
assert C_tf_simple2.shape == tf.TensorShape((batch_size, hidden_size))
```
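Mirroring the PyTorch check above, we can confirm that the returned hidden state is just the last timestep of the returned sequence (a small sketch of our own):

```python
# The final hidden state equals the output sequence at the last timestep.
assert bool(tf.reduce_all(O_tf_simple2[:, -1, :] == H_tf_simple2))
```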
Now, the correspondence between PyTorch's and TF's outputs is much clearer: `O_tf_simple2` corresponds to `O_tc_simple` (with the batch and time axes swapped), while `H_tf_simple2` and `C_tf_simple2` correspond to `H_tc_simple` and `C_tc_simple` (minus the leading dimension).
Bidirectional and Stacked LSTMs
However, it looks like PyTorch's LSTM is much more powerful. You can specify `bidirectional=True`:
```python
num_directions = 2
bilstm_tc = torch.nn.LSTM(input_size, hidden_size, bidirectional=True)
O_tc_bidir, (H_tc_bidir, C_tc_bidir) = bilstm_tc(I_tc)

assert O_tc_bidir.shape == torch.Size((seq_len, batch_size, num_directions * hidden_size))
assert H_tc_bidir.shape == torch.Size((num_directions, batch_size, hidden_size))
assert C_tc_bidir.shape == torch.Size((num_directions, batch_size, hidden_size))
```
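To see how the two directions are laid out, here is a sketch of our own, following the layout documented for PyTorch's LSTM: the last output dimension can be viewed as `(num_directions, hidden_size)`, where the forward direction finishes at the last timestep and the backward direction at the first:

```python
# Separate the concatenated forward/backward output features.
O_dirs = O_tc_bidir.view(seq_len, batch_size, num_directions, hidden_size)
assert torch.equal(O_dirs[-1, :, 0], H_tc_bidir[0])  # forward: final state at the last timestep
assert torch.equal(O_dirs[0, :, 1], H_tc_bidir[1])   # backward: final state at the first timestep
```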
Or stack multiple layers together:
```python
num_layers = 3
bi_lstm_tc_stacked = torch.nn.LSTM(input_size, hidden_size, num_layers=num_layers, bidirectional=True)
O_tc_bidir_stacked, (H_tc_bidir_stacked, C_tc_bidir_stacked) = bi_lstm_tc_stacked(I_tc)

assert O_tc_bidir_stacked.shape == torch.Size((seq_len, batch_size, num_directions * hidden_size))
assert H_tc_bidir_stacked.shape == torch.Size((num_directions * num_layers, batch_size, hidden_size))
assert C_tc_bidir_stacked.shape == torch.Size((num_directions * num_layers, batch_size, hidden_size))
```
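Note that `O_tc_bidir_stacked` only contains the top layer's outputs, while the final states of every layer are packed into the first dimension of `H_tc_bidir_stacked`. A sketch of our own using the `(num_layers, num_directions, ...)` view described in PyTorch's docs:

```python
# Unpack the (num_layers * num_directions) dimension layer by layer.
H_layers = H_tc_bidir_stacked.view(num_layers, num_directions, batch_size, hidden_size)
H_top = H_layers[-1]  # final hidden states of the top layer, both directions
assert H_top.shape == torch.Size((num_directions, batch_size, hidden_size))
```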
How do we do the same extensions in TF? `tf.keras.layers.Bidirectional` is a wrapper over an RNN layer:
```python
bilstm_tf = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(hidden_size), input_shape=I_tf.shape)
O_tf_bidir = bilstm_tf(I_tf)

assert O_tf_bidir.shape == tf.TensorShape((batch_size, 2 * hidden_size))
```
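By default the wrapper concatenates the forward and backward features, hence the `2 * hidden_size` in the last dimension. Keras also exposes a `merge_mode` argument (`'concat'`, `'sum'`, `'mul'`, `'ave'`, or `None`); a quick sketch of our own:

```python
# merge_mode='sum' adds the two directions instead of concatenating them.
bilstm_tf_sum = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(hidden_size), merge_mode='sum')
O_tf_bidir_sum = bilstm_tf_sum(I_tf)
assert O_tf_bidir_sum.shape == tf.TensorShape((batch_size, hidden_size))
```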
Stacking LSTM layers in TF requires composing them in a sequential model. Note that every layer except the last must set `return_sequences=True`, so that the next layer receives the full sequence rather than only the final state:
```python
# Stack 3 layers
model_tf_stacked = tf.keras.Sequential()
model_tf_stacked.add(tf.keras.layers.LSTM(hidden_size, return_sequences=True, input_shape=(seq_len, input_size)))
model_tf_stacked.add(tf.keras.layers.LSTM(hidden_size, return_sequences=True))
model_tf_stacked.add(tf.keras.layers.LSTM(hidden_size))
model_tf_stacked.compile(optimizer='adam', loss='mse')
model_tf_stacked.summary()
```
Model: "sequential" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= lstm_5 (LSTM) (None, 17, 7) 504 _________________________________________________________________ lstm_6 (LSTM) (None, 17, 7) 420 _________________________________________________________________ lstm_7 (LSTM) (None, 7) 420 ================================================================= Total params: 1,344 Trainable params: 1,344 Non-trainable params: 0 _________________________________________________________________
```python
O_tf_stacked = model_tf_stacked.predict(I_tf)

assert O_tf_stacked.shape == tf.TensorShape((batch_size, hidden_size))
```
Source code. Please star it if you like it! You can open an issue if you want to see more articles like this.