Generating Text with Deep Learning
Preprocessing for seq2seq

If you’re feeling a bit nervous about building this all on your own, never fear. You don’t need to start from scratch — there are a few neural network libraries at your disposal. In our case, we’ll be using TensorFlow with the Keras API to build a pretty limited English-to-Spanish translator (we’ll explain this later and you’ll get an opportunity to improve it).

We can import Keras from TensorFlow like this:

from tensorflow import keras

Also, do not worry about memorizing anything we cover here. The purpose of this lesson is for you to make sense of what each part of the code does and how you can modify it to suit your own needs. In fact, the code we’ll be using is mostly derived from Keras’s own tutorial on the seq2seq model.

First things first: preprocessing the text data. How much noise removal you do depends on your use case — do you care about casing or punctuation? For many tasks they are probably not important enough to justify the additional processing, but if your use case does call for removing them, now is the time to do it, before you move on to the rest of the pipeline.
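
For example, if casing and punctuation carry no meaning for your task, a quick cleanup might look like the sketch below (this is illustrative only and not part of the lesson’s starter code):

import re

text = "Estoy feliz."
# lowercase the text and strip punctuation characters
cleaned = re.sub(r"[^\w\s]", "", text.lower())
print(cleaned)   # estoy feliz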

We’ll need the following for our Keras implementation:

  • vocabulary sets for both our input (English) and target (Spanish) data
  • the total number of unique word tokens we have for each set
  • the maximum sentence length we’re using for each language

We also need to mark the start and end of each document (sentence) in the target samples so that the model recognizes where to begin and end its text generation (no book-long sentences for us!). One way to do this is to add "<START>" at the beginning and "<END>" at the end of each target document (in our case, our Spanish sentences). For example, "Estoy feliz." becomes "<START> Estoy feliz. <END>".
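
Putting those pieces together, a rough preprocessing sketch might look like the following. It assumes span-eng.txt pairs each English sentence with its Spanish translation on one tab-separated line and uses a simple whitespace tokenizer; the starter code in script.py may use different variable names and a regex-based tokenizer instead:

input_docs = []
target_docs = []
input_tokens = set()
target_tokens = set()

with open("span-eng.txt", "r", encoding="utf-8") as f:
    lines = f.read().split("\n")

for line in lines[:50]:   # a small slice keeps processing fast
    if "\t" not in line:
        continue
    input_doc, target_doc = line.split("\t")[:2]
    # mark where the decoder should begin and end its generation
    target_doc = "<START> " + target_doc + " <END>"
    input_docs.append(input_doc)
    target_docs.append(target_doc)
    # build the vocabulary sets (whitespace tokenization for simplicity)
    for token in input_doc.split():
        input_tokens.add(token)
    for token in target_doc.split():
        target_tokens.add(token)

# unique token counts and maximum sentence lengths for each language
num_encoder_tokens = len(input_tokens)
num_decoder_tokens = len(target_tokens)
max_encoder_seq_length = max(len(doc.split()) for doc in input_docs)
max_decoder_seq_length = max(len(doc.split()) for doc in target_docs)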

Before you dig into the instructions, read through the existing code in script.py and try to make sense of each line.

Instructions

1.

If you take a look at span-eng.txt, you’ll see that we’re working with a very tiny data set right now, which will make this a very terrible translator indeed. This is because we don’t want codecademy.com to crash on you! When you build your own translator later, you’ll be using a much larger data set, which will take a great deal more time to process.

Use string concatenation to reassign each target_doc to the value of target_doc surrounded by "<START> " and " <END>".

Then append the target_doc to target_docs.
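
If you get stuck, the two key lines inside the starter code’s loop might look roughly like this (only target_doc and target_docs come from the instructions; the surrounding loop is already written for you):

# reassign target_doc so it carries the start and end markers, then collect it
target_doc = "<START> " + target_doc + " <END>"
target_docs.append(target_doc)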

2.

Complete each for loop by adding each token to the corresponding (input or target) tokens set if it hasn’t already been added.
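
One possible shape for those loops is sketched below, assuming the sets are named input_tokens and target_tokens and that each document can be split into tokens (the starter code may already give you a token list per document, in which case you’d iterate over that instead):

for token in input_doc.split():
    if token not in input_tokens:   # the check is redundant for a Python set, but mirrors the instruction
        input_tokens.add(token)

for token in target_doc.split():
    if token not in target_tokens:
        target_tokens.add(token)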

3.

Create two new variables (see the sketch after this list):

  • num_encoder_tokens: the length of the input tokens set
  • num_decoder_tokens: the length of the target tokens set
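
Assuming the sets from the previous step are named input_tokens and target_tokens, this step boils down to two lines:

num_encoder_tokens = len(input_tokens)
num_decoder_tokens = len(target_tokens)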