Generating Text with Deep Learning
Training Setup (part 1)

For each sentence, Keras expects a NumPy matrix containing one-hot vectors for each token. What's a one-hot vector? In a one-hot vector, every token in our set is represented by a `0` except for the current token, which is represented by a `1`. For example, given the vocabulary `["the", "dog", "licked", "me"]`, a one-hot vector for "dog" would look like `[0, 1, 0, 0]`.
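As a quick sketch, here's how we could build that one-hot vector for "dog" by hand with plain Python (the variable names are just for illustration):

```python
# Toy vocabulary from the example above.
vocabulary = ["the", "dog", "licked", "me"]

# Start with all zeros, then flip on the position of "dog".
one_hot = [0] * len(vocabulary)
one_hot[vocabulary.index("dog")] = 1

print(one_hot)  # [0, 1, 0, 0]
```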

In order to vectorize our data and later translate it from vectors, it’s helpful to have a features dictionary (and a reverse features dictionary) to easily translate between all the 1s and 0s and actual words. We’ll build out the following:

• a features dictionary for English
• a features dictionary for Spanish
• a reverse features dictionary for English (where the keys and values are swapped)
• a reverse features dictionary for Spanish
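A features dictionary maps each token to a unique index, and the reverse dictionary swaps the keys and values. A minimal sketch of the idea, using hypothetical toy token sets in place of the lesson's real vocabularies:

```python
# Hypothetical token sets standing in for the lesson's input/target vocabularies.
input_tokens = sorted(["the", "dog", "licked", "me"])
target_tokens = sorted(["el", "perro", "me", "lamió"])

# Features dictionaries: token -> index.
input_features_dict = {token: i for i, token in enumerate(input_tokens)}
target_features_dict = {token: i for i, token in enumerate(target_tokens)}

# Reverse features dictionaries: index -> token, for translating vectors
# back into words later.
reverse_input_features_dict = {i: token for token, i in input_features_dict.items()}
reverse_target_features_dict = {i: token for token, i in target_features_dict.items()}
```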

Once we have all of our features dictionaries set up, it’s time to vectorize the data! We’re going to need vectors to input into our encoder and decoder, as well as a vector of target data we can use to train the decoder.

Because each matrix is almost all zeros, we’ll use `numpy.zeros()` from the NumPy library to build them out.

```python
import numpy as np

encoder_input_data = np.zeros(
    (len(input_docs), max_encoder_seq_length, num_encoder_tokens),
    dtype='float32')
```

Let’s break this down:

We defined a NumPy matrix of zeros called `encoder_input_data` with two arguments:

• the shape of the matrix — in our case the number of documents (or sentences) by the maximum token sequence length (the longest sentence we want to see) by the number of unique tokens (or words)
• the data type we want — in our case NumPy’s `float32`, which can speed up our processing a bit
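To see how that shape comes together, here's a sketch that fills in a matrix like this one with one-hot values. The toy variables below are stand-ins for the ones the lesson loads from preprocessing; in the actual exercise, the filling step happens later:

```python
import numpy as np

# Hypothetical toy values for illustration only.
input_docs = [["the", "dog", "licked", "me"]]
input_features_dict = {"dog": 0, "licked": 1, "me": 2, "the": 3}
max_encoder_seq_length = 4
num_encoder_tokens = 4

encoder_input_data = np.zeros(
    (len(input_docs), max_encoder_seq_length, num_encoder_tokens),
    dtype='float32')

# Each [document, timestep, token] cell flips to 1.0 for the token
# that appears at that position in that sentence.
for doc_idx, doc in enumerate(input_docs):
    for timestep, token in enumerate(doc):
        encoder_input_data[doc_idx, timestep, input_features_dict[token]] = 1.0
```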

### Instructions

1.

Hang on… where did all that code go from the previous exercise? Don’t worry, it’s still there; we just moved it over to preprocess.py (all the necessary variables are imported at the top of script.py) to make some room for the new influx of code!

Take a look at the new code. You’ll see we’ve defined a features dictionary for our input vocabulary called `input_features_dict`.

Below `input_features_dict`, build `target_features_dict` the same way, but using the set of target tokens instead of input tokens.

2.

We’ve also built out the `reverse_input_features_dict`, which just swaps keys for values of the `input_features_dict`.

Build the `reverse_target_features_dict` in the same way, reversing the key-value pairs in `target_features_dict`.

3.

We've already built the `encoder_input_data` NumPy matrix for you.

Your task is to create the following NumPy matrices with the same arguments as `encoder_input_data`, except they should use the max sequence length for decoder sentences instead of encoder sentences, and the number of decoder tokens instead of encoder tokens:

• `decoder_input_data`: a matrix for the data we’ll pass into the decoder
• `decoder_target_data`: a matrix for the data we expect the decoder to produce

(The two new matrices you create should be identical for now.)
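A sketch of what those two matrices might look like, assuming hypothetical placeholder values for the decoder dimensions (in the exercise, `target_docs`, `max_decoder_seq_length`, and `num_decoder_tokens` come from preprocess.py):

```python
import numpy as np

# Hypothetical stand-ins for the values imported from preprocessing.
target_docs = [["el", "perro", "me", "lamió"]]
max_decoder_seq_length = 4
num_decoder_tokens = 4

# Input the decoder will receive during training.
decoder_input_data = np.zeros(
    (len(target_docs), max_decoder_seq_length, num_decoder_tokens),
    dtype='float32')

# Output we expect the decoder to produce; identical in shape (and,
# for now, in content) to decoder_input_data.
decoder_target_data = np.zeros(
    (len(target_docs), max_decoder_seq_length, num_decoder_tokens),
    dtype='float32')
```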