# LSTM for Programmers

```
import torch
import tensorflow as tf
```

First we define some parameters:

- `seq_len` (sequence length): the length of the sequence. For an RNN it can vary *from batch to batch*. As a first step, we will define it as a constant and show how an LSTM works on a single batch. This is also called the timestep length, or timesteps.
- `input_size` (input feature size): each input element, e.g. a word in a sentence, is represented by a vector of length `input_size`. This vector is normally called the *embedding* of the actual individual data item (word).
- `batch_size` (batch size): how many copies of data are in a single batch.
- `hidden_size` (hidden units size): the number of features in the hidden state. Note that the hidden state is the *output* features.

```
seq_len, batch_size, input_size, hidden_size = 17, 2, 10, 7
I_tc = torch.randn(seq_len, batch_size, input_size) # Input for PyTorch
I_tf = tf.random.normal([batch_size, seq_len, input_size]) # Input for TensorFlow
```

## Simple LSTMs

Now, let’s show a simple LSTM layer in PyTorch first:

```
lstm_tc_simple = torch.nn.LSTM(input_size, hidden_size)
O_tc_simple, (H_tc_simple, C_tc_simple) = lstm_tc_simple(I_tc)
```

```
assert O_tc_simple.shape == torch.Size((seq_len, batch_size, hidden_size))
```

```
assert H_tc_simple.shape == torch.Size((1, batch_size, hidden_size))
```

```
assert C_tc_simple.shape == torch.Size((1, batch_size, hidden_size))
```

Next, another example in TensorFlow:

```
lstm_tf_simple = tf.keras.layers.LSTM(hidden_size)
O_tf_simple = lstm_tf_simple(I_tf)
```

```
assert O_tf_simple.shape == tf.TensorShape((batch_size, hidden_size))
```

As you can see, the inputs to the two LSTMs are identical (modulo the dimension order).
But there are several questions we can ask about the *difference* between the
outputs:

- What do `O_tc_simple`, `H_tc_simple` and `C_tc_simple` mean?
- What does `O_tf_simple` mean? How does it correspond to the PyTorch gang?

Let’s look at some diagrams from Colah’s post on LSTMs. The first shows the general structure of an RNN: a chunk of neural network, $A$, looks at some input $x_t$ and outputs a value $h_t$. A loop allows information to be passed from one step of the network to the next.

So the basic functionality of an LSTM (or any RNN in general) is converting
a *sequence* of $x$ into a *sequence* of $h$.

`O_tc_simple` contains a *sequence* of hidden states (or output features) $h$,
one per element of the batch.

`H_tc_simple` contains the hidden state for the most recent timestep,
in shape `(num_layers * num_directions, batch_size, hidden_size)`.
Since by default we constructed an LSTM with only one layer and one direction,
the first dimension is just 1.

Similar to `H_tc_simple`’s shape, `C_tc_simple` contains the *cell* state. You can think of
each hidden state as being associated with a cell state $c$.

In fact, we can verify that when there is only one layer, `O_tc_simple` contains `H_tc_simple`:

```
torch.eq(O_tc_simple[-1,:,:], H_tc_simple)
```

```
tensor([[[True, True, True, True, True, True, True],
[True, True, True, True, True, True, True]]])
```

TensorFlow’s `O_tf_simple` is actually just the counterpart of
`H_tc_simple`, with the leading dimension squeezed away. There is another
configuration of TF’s LSTM that makes it output differently:

```
lstm_tf_simple2 = tf.keras.layers.LSTM(hidden_size, return_sequences=True, return_state=True)
O_tf_simple2, H_tf_simple2, C_tf_simple2 = lstm_tf_simple2(I_tf)
```

```
assert O_tf_simple2.shape == tf.TensorShape((batch_size, seq_len, hidden_size))
```

```
assert H_tf_simple2.shape == tf.TensorShape((batch_size, hidden_size))
```

```
assert C_tf_simple2.shape == tf.TensorShape((batch_size, hidden_size))
```
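As in PyTorch, we can verify that the returned hidden state is just the last timestep of the output sequence. The following is a small self-contained sketch (it rebuilds the layer and input, so the exact values differ from the ones above):

```
import tensorflow as tf

seq_len, batch_size, input_size, hidden_size = 17, 2, 10, 7
I_tf = tf.random.normal([batch_size, seq_len, input_size])
lstm = tf.keras.layers.LSTM(hidden_size, return_sequences=True, return_state=True)
O, H, C = lstm(I_tf)
# The final hidden state is the last timestep of the output sequence
assert bool(tf.reduce_all(tf.equal(O[:, -1, :], H)))
```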

Now, correspondence between PyTorch’s and TF’s outputs is much clearer.

## Bidirectional and Stacked LSTMs

However, it looks like PyTorch’s `LSTM` is much more powerful. You can specify
two directions:

```
num_directions = 2
bilstm_tc = torch.nn.LSTM(input_size, hidden_size, bidirectional=True)
O_tc_bidir, (H_tc_bidir, C_tc_bidir) = bilstm_tc(I_tc)
```

```
assert O_tc_bidir.shape == torch.Size((seq_len, batch_size, num_directions * hidden_size))
```

```
assert H_tc_bidir.shape == torch.Size((num_directions, batch_size, hidden_size))
```

```
assert C_tc_bidir.shape == torch.Size((num_directions, batch_size, hidden_size))
```
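In the bidirectional case, `O` and `H` still overlap, but in a slightly trickier way: the forward direction’s final hidden state sits at the *last* timestep in the first half of the feature dimension, while the backward direction’s final hidden state sits at timestep 0 in the second half. A self-contained sketch of this check:

```
import torch

seq_len, batch_size, input_size, hidden_size = 17, 2, 10, 7
I = torch.randn(seq_len, batch_size, input_size)
bilstm = torch.nn.LSTM(input_size, hidden_size, bidirectional=True)
O, (H, C) = bilstm(I)
# Forward direction: last hidden state is at the last timestep,
# in the first half of the feature dimension
assert torch.equal(O[-1, :, :hidden_size], H[0])
# Backward direction: its final hidden state is at timestep 0,
# in the second half of the feature dimension
assert torch.equal(O[0, :, hidden_size:], H[1])
```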

Or stack multiple layers together:

```
num_layers = 3
bi_lstm_tc_stacked = torch.nn.LSTM(input_size, hidden_size, num_layers=num_layers, bidirectional=True)
O_tc_bidir_stacked, (H_tc_bidir_stacked, C_tc_bidir_stacked) = bi_lstm_tc_stacked(I_tc)
```

```
assert O_tc_bidir_stacked.shape == torch.Size((seq_len, batch_size, num_directions * hidden_size))
```

```
assert H_tc_bidir_stacked.shape == torch.Size((num_directions * num_layers, batch_size, hidden_size))
```

```
assert C_tc_bidir_stacked.shape == torch.Size((num_directions * num_layers, batch_size, hidden_size))
```
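Note that `O` only contains the outputs of the *last* layer, while `H` stacks the states of all layers as `(layer0-fwd, layer0-bwd, layer1-fwd, ...)`. So the last layer’s states are the last two rows of `H`, which we can check with a self-contained sketch:

```
import torch

seq_len, batch_size, input_size, hidden_size = 17, 2, 10, 7
num_layers = 3
I = torch.randn(seq_len, batch_size, input_size)
lstm = torch.nn.LSTM(input_size, hidden_size, num_layers=num_layers, bidirectional=True)
O, (H, C) = lstm(I)
# The output sequence comes from the last layer only:
assert torch.equal(O[-1, :, :hidden_size], H[-2])  # last layer, forward
assert torch.equal(O[0, :, hidden_size:], H[-1])   # last layer, backward
```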

How do we do the same extensions in TF?

`Bidirectional` is a wrapper over an RNN layer:

```
bilstm_tf = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(hidden_size), input_shape=(seq_len, input_size))
O_tf_bidir = bilstm_tf(I_tf)
```

```
assert O_tf_bidir.shape == tf.TensorShape((batch_size, 2 * hidden_size))
```
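To also get the final states out of a bidirectional TF layer, we can set `return_state=True` on the wrapped LSTM; `Bidirectional` then returns the output followed by the forward and backward states. A self-contained sketch (it builds its own layer and input):

```
import tensorflow as tf

seq_len, batch_size, input_size, hidden_size = 17, 2, 10, 7
I_tf = tf.random.normal([batch_size, seq_len, input_size])
bilstm = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(hidden_size, return_sequences=True, return_state=True))
# Returns: output, forward (h, c), backward (h, c)
O, H_fwd, C_fwd, H_bwd, C_bwd = bilstm(I_tf)
assert O.shape == (batch_size, seq_len, 2 * hidden_size)
assert H_fwd.shape == (batch_size, hidden_size)
assert H_bwd.shape == (batch_size, hidden_size)
```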

Stacking LSTM layers in TF requires composing them in a sequential model.

```
# Stack 3 layers
model_tf_stacked = tf.keras.Sequential()
model_tf_stacked.add(tf.keras.layers.LSTM(hidden_size, return_sequences=True, input_shape=(seq_len, input_size)))
model_tf_stacked.add(tf.keras.layers.LSTM(hidden_size, return_sequences=True))
model_tf_stacked.add(tf.keras.layers.LSTM(hidden_size))
model_tf_stacked.compile(optimizer='adam', loss='mse')
model_tf_stacked.summary()
```

```
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
lstm_5 (LSTM) (None, 17, 7) 504
_________________________________________________________________
lstm_6 (LSTM) (None, 17, 7) 420
_________________________________________________________________
lstm_7 (LSTM) (None, 7) 420
=================================================================
Total params: 1,344
Trainable params: 1,344
Non-trainable params: 0
_________________________________________________________________
```

```
O_tf_stacked = model_tf_stacked.predict(I_tf)
```

```
assert O_tf_stacked.shape == tf.TensorShape((batch_size, hidden_size))
```

The source code is available. Please star it if you like it! You can open an issue if you want to see more articles like this.