Monday, October 05, 2009

Large CSV files

A bug submitted last month showed that Matrex needs a lot of memory when importing large CSV files.
The CSV file mentioned in the bug has a size of 22 MBytes.
The file contains around 200000 lines. Each line has 14 fields and is around 100 characters long.

To import this file, the memory used by the Matrex process increases by 300 MBytes.
The CSV file is large, but that alone does not justify so much RAM to handle it.

I checked the code that imports CSV files in Matrex; nothing is wrong there.
Matrex uses the Java CSV library to read the CSV files, and the library itself works fine.

The file is loaded in memory row after row.
I checked how much memory is used for each loaded row. This is not easy in Java, since the standard libraries have nothing similar to a sizeof function. But I found the Javabi library, which can measure the total amount of memory used by a Java object.
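For reference, the closest thing the standard JVM offers is java.lang.instrument, and even that only returns the shallow size of a single object, so a deep measurement like the one Javabi reports still has to walk the object graph and sum the parts. A minimal sketch of that facility (my own illustration, not Javabi's code):

// The class must be packaged in an agent jar whose manifest declares
// Premain-Class: ObjectSizeAgent, and the JVM started with -javaagent:agent.jar.
import java.lang.instrument.Instrumentation;

public class ObjectSizeAgent {

    private static volatile Instrumentation instrumentation;

    // called by the JVM before main() when the agent jar is passed on -javaagent
    public static void premain(String agentArgs, Instrumentation inst) {
        instrumentation = inst;
    }

    // shallow size only: for a String[] row this counts the array itself,
    // not the String objects or their char[] buffers
    public static long shallowSizeOf(Object o) {
        return instrumentation.getObjectSize(o);
    }
}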
Each row with its fields is handled as an array of strings, which uses around 800 bytes, 8 times the original row's size.
This is because:
  • Java strings use Unicode internally, which means they take 2 bytes for each character
  • each String object needs additional memory for its own bookkeeping fields and for alignment padding
800 bytes * 200000 rows = ~160 MBytes.
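Spelling out that arithmetic as a back-of-the-envelope check (the overhead constants are my assumptions for a typical 32-bit JVM, not measured values, and alignment padding is ignored):

public class RowMemoryEstimate {

    public static void main(String[] args) {
        int fieldsPerRow = 14;
        int charsPerRow = 100;

        int objectHeader = 8;  // assumed plain object header
        int arrayHeader = 12;  // assumed array header (object header + length field)
        int reference = 4;     // assumed reference size on a 32-bit JVM

        // one char[] per field: 2 bytes per character plus an array header each
        long charBuffers = charsPerRow * 2 + fieldsPerRow * arrayHeader;
        // one String object per field: header plus its bookkeeping fields (~16 bytes)
        long stringObjects = fieldsPerRow * (objectHeader + 16);
        // the String[] holding the row: header plus one reference per field
        long rowArray = arrayHeader + fieldsPerRow * reference;

        long perRow = charBuffers + stringObjects + rowArray; // ~770 bytes
        long total = perRow * 200000L;                        // ~150 MBytes

        System.out.println(perRow + " bytes per row, about "
                + total / (1024 * 1024) + " MBytes for 200000 rows");
    }
}

The result is in the same range as the 800 bytes per row measured with Javabi.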

As far as I understood, the rest of the memory used to import the file goes to the intermediate strings that the CSV reader creates while reading it, and that stay allocated until the garbage collector frees them.

There are some solutions that could be applied to reduce the memory usage:
  1. Avoid loading all rows of the CSV file in memory: in other words, make the import editor read the displayed lines directly from the file, and only when they are actually displayed.
    I'm not sure how effective this would be, because in general the user wants to import an entire column of the file into a matrix; therefore the file, sooner or later, has to be read entirely anyway.
    Another problem with this solution is that the Java CSV library, as far as I understood, does not allow counting the rows without reading them, nor jumping to a row without reading all the intermediate ones.

  2. Read fewer fields: let the user discard some fields right at the start of the import process, so that they are never loaded into the import dialog.
    This can work, but I am not sure it would dramatically reduce the amount of memory used to load the file.

  3. Optimize the reading process so that it uses less memory: this means looking for an alternative to the Java CSV library that needs less memory to read the files (for example using CharSequence objects, which use less memory). There are alternative libraries, for example opencsv and the Ostermiller utilities; they need to be tested to see whether they really use less memory than the Java CSV library (a rough sketch combining this with the field filtering of solution 2 follows below).
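To give an idea of what I mean, here is a rough sketch combining solutions 2 and 3: it reads the file with opencsv (class and package names as I remember them, still to be verified) and keeps only the columns the user selected, copying each kept field so that it cannot keep alive any larger buffer the reader may have created internally:

import au.com.bytecode.opencsv.CSVReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;

public class SelectiveCsvImport {

    // columnsToKeep: indexes of the fields the user actually wants to import
    public static List<String[]> importColumns(String path, int[] columnsToKeep)
            throws Exception {
        CSVReader reader = new CSVReader(new FileReader(path));
        List<String[]> rows = new ArrayList<String[]>();
        try {
            String[] fields;
            while ((fields = reader.readNext()) != null) {
                String[] kept = new String[columnsToKeep.length];
                for (int i = 0; i < columnsToKeep.length; i++) {
                    // defensive copy: the discarded fields and any shared
                    // buffers become garbage as soon as the next row is read
                    kept[i] = new String(fields[columnsToKeep[i]]);
                }
                rows.add(kept);
            }
        } finally {
            reader.close();
        }
        return rows;
    }
}

Whether this really brings the 300 MBytes down significantly is exactly what has to be tested.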
I will try to apply these solutions and explain, in one of the next articles, how it went.
