ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type numpy.ndarray) with array size exceeding 4000

My code produces this error whenever "input_data" is more than about 4,000 sequences long, but I'd like to train on the full 180,000-long array. I just finished a text generation class and I'm trying to make my model generate some Eminem lyrics. It's actually not doing too badly using only about 5% of all Eminem's words (4k out of 180k).


import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

import string
import numpy as np
import pandas as pd


# Eminem lyrics https://www.kaggle.com/thaddeussegura/eminem-lyrics-from-all-albums

from urllib.request import urlopen

data = urlopen('https://storage.googleapis.com/kagglesdsdata/datasets/835677/1426970/eminem_lyrics/ALL_eminem.txt?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com%40kaggle-161607.iam.gserviceaccount.com%2F20200924%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20200924T201536Z&X-Goog-Expires=259199&X-Goog-SignedHeaders=host&X-Goog-Signature=9e8afd7dba5915b209e33905c68e93f2bfb1d3baac9456e1a0d16d1b74a0b482baa26bb6f348c2f901b46b63555b1a2bcc900c9db7d17321c27fe4578cc5d12463ca6b3e7c8998cf66a05a33b4b324dba3e48341d010f13a423debb8d1c2f52536870a9cc3ddfa72a4ca9bda874e934bcfdd21512e413e068bbd8c0a2a4042df66358d978080d164ead2f9e0edf1eee4bf66cf2f5c0aa63a5b7e9cea80ca6c211a0558aca9e7671235f105074f5f3f74abb882001acec29573c84b8ed9bf044b7233fb270a12fefe01bd40fe64b44cc0b89d54469357719d14404bb3c6033961c25af43c5c5f9c20fc090cf38fe03946058ecb9b67ebdfe4022c564480a2c73c').read().decode('utf-8')


# split into individual words
text = data.split()

# remove punctuation, make everything lowercase
import re

dataset = []
for s in text:
    s = re.sub(r'[^\w\s]', '', s).lower()
    dataset.append(s)


def tokenize_corpus(corpus, num_words=-1):
    # Fit a Tokenizer on the corpus
    if num_words > -1:
        tokenizer = Tokenizer(num_words=num_words)
    else:
        tokenizer = Tokenizer()
    tokenizer.fit_on_texts(corpus)
    return tokenizer

# Tokenize the corpus
tokenizer = tokenize_corpus(dataset)

total_words = len(tokenizer.word_index) + 1
print(total_words)


# get inputs and outputs: slide an 11-word window over the lyrics,
# using the first 10 tokens as the input and the 11th as the label
input_data = []
labels = []
for i in range(180000):  # 180000 = roughly the word count of the lyrics file
    tokens = np.array(sum(tokenizer.texts_to_sequences(dataset[i:i+11]), []))
    input_data.append(tokens[:-1])
    labels.append(tokens[-1])

input_data = np.array(input_data)
labels = np.array(labels)

#print(input_data)
#print(labels)

# One-hot encode the labels
one_hot_labels = tf.keras.utils.to_categorical(labels, num_classes=total_words)

I also tried converting 'input_data' to a tensor, changing its dtype, etc., but that only produces different kinds of errors. If I change 180000 to anything less than 4000, everything works fine.
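To see what is actually inside "input_data", a quick check like this sketch (using the variables from the script above) should show whether all the windows have the same length. If they don't, NumPy silently builds a ragged dtype=object array, which would explain exactly this conversion error:

lengths = {len(seq) for seq in input_data}
print(lengths)            # more than one value means the windows are ragged
print(input_data.dtype)   # dtype('O') is the telltale sign of this error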

If the model can't process all 180,000 sequences at once, could I break them into 45 arrays of 4,000 sequences each and train for 5-10 epochs on each?
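Something like this rough sketch is what I have in mind (chunk_size and the epoch count are placeholders, not tested):

chunk_size = 4000
for start in range(0, len(input_data), chunk_size):
    chunk_x = input_data[start:start + chunk_size]
    chunk_y = one_hot_labels[start:start + chunk_size]
    # each call continues training from the current weights
    model.fit(chunk_x, chunk_y, epochs=5, verbose=1)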

model:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Bidirectional

model = Sequential()
model.add(Embedding(total_words, 64, input_length=10))  # 64-dim embedding for each of the 10 input tokens
model.add(Bidirectional(LSTM(32)))
model.add(Dense(total_words, activation='softmax'))     # probability distribution over the vocabulary
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
history = model.fit(input_data, one_hot_labels, epochs=100, verbose=1)

The last line throws the error. Maybe I should change something in the model itself?
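One change I've been considering (just an untested sketch): switching to sparse_categorical_crossentropy, so the 180,000 labels can stay as plain integers instead of being one-hot encoded into a huge matrix:

# integer labels go straight in; no to_categorical needed
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam',
              metrics=['accuracy'])
history = model.fit(input_data, labels, epochs=100, verbose=1)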

The rest is below; the "seed_text" is just copied from the class lab:

seed_text = "im feeling chills getting these bills still while having meal"
next_words = 100
  
for _ in range(next_words):
  token_list = tokenizer.texts_to_sequences([seed_text])[0]
  token_list = pad_sequences([token_list], maxlen=10, padding='pre')
  predicted_probs = model.predict(token_list)[0]
  predicted = np.random.choice([x for x in range(len(predicted_probs))],
                               p=predicted_probs)
  output_word = ""
  for word, index in tokenizer.word_index.items():
    if index == predicted:
      output_word = word
      break
  seed_text += " " + output_word
print(seed_text)
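One small thing I noticed afterwards: the inner lookup loop could probably be replaced with the tokenizer's reverse map (index_word in tf.keras), something like:

# index_word maps token ids back to words, so no scan over word_index is needed
output_word = tokenizer.index_word.get(predicted, "")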

Please help me resolve this error, and let me know if you have any ideas for improving the model overall.


