How to train with MXNet using Parquet files

I am not sure about the best practice for training an MXNet model when the training data is in Parquet files that cannot be held in memory. I am using MXNet with Python on AWS SageMaker. Until recently I could get away with training on instances with 732 GB of memory, but now I need much more than that and have to find a feasible solution.

MXNet requires a dataset with random access (one that implements `__getitem__`). I have thousands of Parquet files generated with Spark, and I am unable to hold the data in memory (2 TB+ of training data when unpacked).
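To make the requirement concrete, here is a minimal sketch of the random-access contract that `mxnet.gluon.data.DataLoader` relies on. The class and data here are toy stand-ins (the real rows would come from the Parquet files); the point is only the `__getitem__`/`__len__` interface:

```python
# Minimal illustration of the random-access contract MXNet's DataLoader
# expects from a Dataset: __len__ plus __getitem__ with arbitrary indices.
# The in-memory list is a stand-in for data that really lives on disk.

class ToyDataset:
    def __init__(self, rows):
        self.rows = rows

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        # With shuffle=True the DataLoader asks for indices in random
        # order, which is why purely sequential Parquet readers
        # don't fit this interface directly.
        return self.rows[idx]

ds = ToyDataset([(i, i * 2) for i in range(10)])
print(len(ds), ds[3])  # → 10 (3, 6)
```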

What I am currently trying: read 10,000 rows into an array, save that to disk, and then work with chunks of 10,000 rows through a custom dataset that loads and unloads the chunk containing the idx being looked up.
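A sketch of that chunk-caching dataset, under stated assumptions: rows are pre-split into fixed-size chunks on disk, and `__getitem__` swaps in the chunk that contains `idx`, keeping only one chunk in memory at a time. `load_chunk` here is a hypothetical stand-in; in practice it would load a pre-saved array from disk or read a Parquet row group:

```python
# Chunk-caching dataset sketch: rows live in fixed-size chunks on disk,
# and only the chunk containing the requested idx is held in memory.

class ChunkedDataset:
    def __init__(self, n_rows, chunk_size, load_chunk):
        self.n_rows = n_rows
        self.chunk_size = chunk_size
        self.load_chunk = load_chunk      # chunk_id -> sequence of rows
        self._cached_id = None            # id of the chunk currently loaded
        self._cached_rows = None

    def __len__(self):
        return self.n_rows

    def __getitem__(self, idx):
        chunk_id, offset = divmod(idx, self.chunk_size)
        if chunk_id != self._cached_id:   # cache miss: unload and reload
            self._cached_rows = self.load_chunk(chunk_id)
            self._cached_id = chunk_id
        return self._cached_rows[offset]

# Stand-in loader for illustration: each "chunk" holds consecutive ints.
def fake_load(chunk_id, chunk_size=4):
    start = chunk_id * chunk_size
    return list(range(start, start + chunk_size))

ds = ChunkedDataset(n_rows=10, chunk_size=4, load_chunk=fake_load)
print(ds[0], ds[5], ds[9])  # → 0 5 9
```

Note the weakness: with shuffled indices, consecutive lookups usually land in different chunks, so nearly every access triggers a full chunk reload from disk.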

It feels wrong, so it probably is. Is there something I am missing that would make this work smarter/smoother/faster? I have spent months working on this issue now.


Content Attribution

This content was originally published by Mikkel F at Recent Questions - Stack Overflow, and is syndicated here via their RSS feed. You can read the original post over there.
