import torch
import tensorflow as tf

First we define some parameters:

  • seq_len (sequence length): the number of timesteps in a sequence. For an RNN it can vary from batch to batch; as a first step, we fix it to a constant and show how an LSTM works on a single batch. This is also called the timestep length, or timesteps.
  • input_size (input feature size): each element of the input, e.g. a word in a sentence, is represented by a vector of length input_size. This vector is usually called the embedding of the individual datum (word).
  • batch_size (batch size): how many sequences are processed together in a single batch.
  • hidden_size (hidden units size): the number of features in the hidden state. Note that the hidden state is also the output feature vector.
seq_len, batch_size, input_size, hidden_size = 17, 2, 10, 7

I_tc = torch.randn(seq_len, batch_size, input_size) # Input for PyTorch
I_tf = tf.random.normal([batch_size, seq_len, input_size]) # Input for TensorFlow

Simple LSTMs

Now, let’s show a simple LSTM layer in PyTorch first:

lstm_tc_simple = torch.nn.LSTM(input_size, hidden_size)
O_tc_simple, (H_tc_simple, C_tc_simple) = lstm_tc_simple(I_tc)
assert O_tc_simple.shape == torch.Size((seq_len, batch_size, hidden_size))
assert H_tc_simple.shape == torch.Size((1, batch_size, hidden_size))
assert C_tc_simple.shape == torch.Size((1, batch_size, hidden_size))
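The hidden and cell states can also be initialized explicitly by passing a second argument to the layer. As a sketch (reusing the parameters above), explicit zero states are equivalent to the default of omitting them:

```python
import torch

seq_len, batch_size, input_size, hidden_size = 17, 2, 10, 7
lstm = torch.nn.LSTM(input_size, hidden_size)
I = torch.randn(seq_len, batch_size, input_size)

# Explicit initial states, shape (num_layers * num_directions, batch, hidden)
h0 = torch.zeros(1, batch_size, hidden_size)
c0 = torch.zeros(1, batch_size, hidden_size)
O, (H, C) = lstm(I, (h0, c0))

# Omitting the states defaults to zeros, so the results match
O_default, (H_default, C_default) = lstm(I)
assert torch.allclose(O, O_default)
assert torch.allclose(H, H_default)
```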

Next, another example in TensorFlow:

lstm_tf_simple = tf.keras.layers.LSTM(hidden_size)
O_tf_simple = lstm_tf_simple(I_tf)
assert O_tf_simple.shape == tf.TensorShape((batch_size, hidden_size))

As you can see, the inputs of the two LSTMs are identical (modulo the dimension order). But there are several questions we can ask about the differences between the outputs:

  • What do O_tc_simple, H_tc_simple, and C_tc_simple mean?
  • What does O_tf_simple mean? How does it correspond to PyTorch's three outputs?

Let’s look at some diagrams from Colah’s post on LSTM:

The first diagram shows the general structure of an RNN:

In the above diagram, a chunk of neural network, $A$, looks at some input $x_t$ and outputs a value $h_t$. A loop allows information to be passed from one step of the network to the next.

So, LSTM (or any general RNN)’s basic functionality is converting a sequence of $x$ to a sequence of $h$.

O_tc_simple contains the sequence of hidden states (or output features) $h$, one per timestep, for each element in the batch.

H_tc_simple contains the hidden state for the most recent timestep, with shape (num_layers * num_directions, batch_size, hidden_size). Since by default we constructed an LSTM with only one layer and one direction, the first dimension is just 1.

Similar in shape to H_tc_simple, C_tc_simple contains the cell state. You can think of each hidden state as being associated with a cell state $c$.
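For reference, the standard LSTM update (the formulation in Colah's post, which both frameworks implement up to weight layout) computes the cell state $c_t$ and hidden state $h_t$ from the forget, input, and output gates:

$$
\begin{aligned}
f_t &= \sigma(W_f [h_{t-1}, x_t] + b_f) \\
i_t &= \sigma(W_i [h_{t-1}, x_t] + b_i) \\
\tilde{c}_t &= \tanh(W_c [h_{t-1}, x_t] + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
o_t &= \sigma(W_o [h_{t-1}, x_t] + b_o) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$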

In fact, we can verify that when there is only one layer, the last timestep of O_tc_simple equals H_tc_simple:

torch.eq(O_tc_simple[-1,:,:], H_tc_simple)
tensor([[[True, True, True, True, True, True, True],
         [True, True, True, True, True, True, True]]])

TensorFlow's O_tf_simple corresponds to H_tc_simple with the leading dimension squeezed out: it is the hidden state at the final timestep. There is another configuration of TF's LSTM that makes it output more:

lstm_tf_simple2 = tf.keras.layers.LSTM(hidden_size, return_sequences=True, return_state=True)
O_tf_simple2, H_tf_simple2, C_tf_simple2 = lstm_tf_simple2(I_tf)
assert O_tf_simple2.shape == tf.TensorShape((batch_size, seq_len, hidden_size))
assert H_tf_simple2.shape == tf.TensorShape((batch_size, hidden_size))
assert C_tf_simple2.shape == tf.TensorShape((batch_size, hidden_size))

Now, correspondence between PyTorch’s and TF’s outputs is much clearer.
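One remaining wrinkle is the dimension order. PyTorch can adopt TensorFlow's (batch, seq, feature) layout via batch_first=True; a sketch, reusing the parameters above:

```python
import torch

seq_len, batch_size, input_size, hidden_size = 17, 2, 10, 7
I_bf = torch.randn(batch_size, seq_len, input_size)  # TF-style layout

# batch_first=True reorders the input and output tensors...
lstm_bf = torch.nn.LSTM(input_size, hidden_size, batch_first=True)
O_bf, (H_bf, C_bf) = lstm_bf(I_bf)
assert O_bf.shape == torch.Size((batch_size, seq_len, hidden_size))
# ...but H and C keep the (num_layers * num_directions, batch, hidden) layout
assert H_bf.shape == torch.Size((1, batch_size, hidden_size))
```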

Bidirectional and Stacked LSTMs

However, it looks like PyTorch’s LSTM is much more powerful. You can specify two directions:

num_directions = 2
bilstm_tc = torch.nn.LSTM(input_size, hidden_size, bidirectional=True)
O_tc_bidir, (H_tc_bidir, C_tc_bidir) = bilstm_tc(I_tc)
assert O_tc_bidir.shape == torch.Size((seq_len, batch_size, num_directions * hidden_size))
assert H_tc_bidir.shape == torch.Size((num_directions, batch_size, hidden_size))
assert C_tc_bidir.shape == torch.Size((num_directions, batch_size, hidden_size))
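The last dimension of the bidirectional output concatenates forward and backward features. The forward direction finishes at the final timestep while the backward direction finishes at the first; a sketch that rebuilds the layer and checks this:

```python
import torch

seq_len, batch_size, input_size, hidden_size = 17, 2, 10, 7
I = torch.randn(seq_len, batch_size, input_size)
bilstm = torch.nn.LSTM(input_size, hidden_size, bidirectional=True)
O, (H, C) = bilstm(I)

# Forward half of the features: its final state is at the last timestep
assert torch.equal(O[-1, :, :hidden_size], H[0])
# Backward half: its final state is at the first timestep
assert torch.equal(O[0, :, hidden_size:], H[1])
```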

Or stack multiple layers together:

num_layers = 3
bi_lstm_tc_stacked = torch.nn.LSTM(input_size, hidden_size, num_layers=num_layers, bidirectional=True)
O_tc_bidir_stacked, (H_tc_bidir_stacked, C_tc_bidir_stacked) = bi_lstm_tc_stacked(I_tc)
assert O_tc_bidir_stacked.shape == torch.Size((seq_len, batch_size, num_directions * hidden_size))
assert H_tc_bidir_stacked.shape == torch.Size((num_directions * num_layers, batch_size, hidden_size))
assert C_tc_bidir_stacked.shape == torch.Size((num_directions * num_layers, batch_size, hidden_size))
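In the stacked case, H interleaves layers and directions along its first dimension (layer 0 forward, layer 0 backward, layer 1 forward, ...), while O exposes only the top layer. A sketch of how to unpack it:

```python
import torch

seq_len, batch_size, input_size, hidden_size = 17, 2, 10, 7
num_layers, num_directions = 3, 2
lstm = torch.nn.LSTM(input_size, hidden_size, num_layers=num_layers, bidirectional=True)
I = torch.randn(seq_len, batch_size, input_size)
O, (H, C) = lstm(I)

# Separate the layer and direction axes
H_view = H.view(num_layers, num_directions, batch_size, hidden_size)
# O only contains the top layer's hidden states
assert torch.equal(O[-1, :, :hidden_size], H_view[-1, 0])
assert torch.equal(O[0, :, hidden_size:], H_view[-1, 1])
```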

How to do the same extension in TF?

Bidirectional is a wrapper over an RNN layer.

bilstm_tf = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(hidden_size))
O_tf_bidir = bilstm_tf(I_tf)
assert O_tf_bidir.shape == tf.TensorShape((batch_size, 2 * hidden_size))
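To get the final states from the TF side as well, set return_state=True on the wrapped layer; Bidirectional then returns the output plus (h, c) for each direction. A sketch:

```python
import tensorflow as tf

seq_len, batch_size, input_size, hidden_size = 17, 2, 10, 7
I = tf.random.normal([batch_size, seq_len, input_size])

bilstm = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(hidden_size, return_state=True))
# Output, then the (h, c) pair for the forward and backward directions
O, h_fwd, c_fwd, h_bwd, c_bwd = bilstm(I)

assert O.shape == tf.TensorShape((batch_size, 2 * hidden_size))
assert h_fwd.shape == tf.TensorShape((batch_size, hidden_size))
assert h_bwd.shape == tf.TensorShape((batch_size, hidden_size))
```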

Stacking LSTM layers in TF requires composing them in a sequential model.

# Stack 3 layers

model_tf_stacked = tf.keras.Sequential()
model_tf_stacked.add(tf.keras.layers.LSTM(hidden_size, return_sequences=True, input_shape=(seq_len, input_size)))
model_tf_stacked.add(tf.keras.layers.LSTM(hidden_size, return_sequences=True))
model_tf_stacked.add(tf.keras.layers.LSTM(hidden_size))
model_tf_stacked.compile(optimizer='adam', loss='mse')
model_tf_stacked.summary()
Model: "sequential"
Layer (type)                 Output Shape              Param #
lstm_5 (LSTM)                (None, 17, 7)             504
lstm_6 (LSTM)                (None, 17, 7)             420
lstm_7 (LSTM)                (None, 7)                 420
Total params: 1,344
Trainable params: 1,344
Non-trainable params: 0
O_tf_stacked = model_tf_stacked.predict(I_tf)
assert O_tf_stacked.shape == tf.TensorShape((batch_size, hidden_size))

Source code. Please star it if you like it! You can open an issue if you want to see more articles like this.