I am not sure about the best practice for training an MXNet model when the training data is in Parquet files that cannot be held in memory. I am using MXNet with Python on AWS SageMaker, and until recently I could get away with training on instances with 732 GB of memory, but now I need much more than that and have to find a feasible solution.
MXNet requires a dataset with random access (one that implements `__getitem__`). I have thousands of Parquet files generated with Spark, and I cannot hold the data in memory (2 TB+ of training data when unpacked).
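To make the requirement concrete, here is a minimal sketch of the random-access contract as I understand it: Gluon's `DataLoader` only needs an object exposing `__getitem__` and `__len__`, so any sampler can index it in arbitrary (e.g. shuffled) order. The class and variable names are my own illustration, not MXNet API:

```python
import random

class InMemoryDataset:
    """Minimal random-access dataset: the contract a Gluon
    DataLoader relies on is just __getitem__ + __len__
    (assumption: any object with these methods works)."""

    def __init__(self, rows):
        self._rows = rows

    def __len__(self):
        return len(self._rows)

    def __getitem__(self, idx):
        return self._rows[idx]

ds = InMemoryDataset([10, 20, 30, 40])

# A shuffled sampler will request indices in arbitrary order,
# which is exactly what makes sequential Parquet readers awkward.
order = list(range(len(ds)))
random.shuffle(order)
batch = [ds[i] for i in order[:2]]
```

The pain point is that this contract assumes cheap access to any single row, while Parquet readers are optimized for scanning whole files or row groups.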
I am currently reading 10,000 rows at a time into an array, saving each chunk to disk, and then working with those 10,000-row chunks through a custom dataset that loads and unloads the relevant chunk for the idx being looked up.
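For reference, the chunk-load-and-unload approach I describe looks roughly like this. It is only a sketch: `load_chunk` is a hypothetical callable standing in for whatever reads one saved 10,000-row chunk from disk, and the caching via `functools.lru_cache` is my choice to avoid re-reading the same chunk for every row:

```python
from functools import lru_cache

CHUNK_SIZE = 10000  # rows per pre-split chunk file, as in the question

class ChunkedDataset:
    """Random-access view over fixed-size row chunks stored on disk.

    Duck-typed to what a Gluon DataLoader needs (__getitem__ and
    __len__). `load_chunk(chunk_idx)` is a hypothetical callable
    returning one chunk as an indexable sequence of rows.
    """

    def __init__(self, load_chunk, num_rows, chunk_size=CHUNK_SIZE, cached_chunks=2):
        self._num_rows = num_rows
        self._chunk_size = chunk_size
        # Keep the last few chunks in memory so that mostly-sequential
        # access does not hit the disk once per row.
        self._load_chunk = lru_cache(maxsize=cached_chunks)(load_chunk)

    def __len__(self):
        return self._num_rows

    def __getitem__(self, idx):
        chunk_idx, offset = divmod(idx, self._chunk_size)
        return self._load_chunk(chunk_idx)[offset]
```

With fully shuffled sampling the cache rarely hits, which is part of why this feels wrong: every row lookup can trigger a full chunk read.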
It feels wrong, so it probably is. Is there something I am missing that would make this work smarter/smoother/faster? I have spent months on this issue now.