Tuesday, October 27, 2009

CSV files again


The alpha release of Matrex 2.0 solves the problem I reported about importing large CSV files.

Already in previous versions Matrex used a virtual table to show imported CSV files; this means the table (grid) loads from memory only the rows it needs to display.
The next step was simple: in version 2.0 the file is not loaded into memory at all; only the rows that are displayed are actually read from the file.
In this way the memory used to import a CSV file decreased dramatically compared to the previous versions of Matrex.
To avoid losing performance with this new approach, Matrex keeps a cache of 2000 rows from the file (the 2000 rows around the last row read), so scrolling the table up and down stays fluid.
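
A minimal sketch of such a windowed cache, with assumed class and method names rather than Matrex's actual code:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical sketch, not Matrex's actual code: keep a window of 2000
    // rows around the last requested row; reload the window on a cache miss.
    public class CsvRowCache {
        private static final int WINDOW = 2000;

        private final String path;
        private List<String> cache = new ArrayList<String>();
        private int cacheStart = -1; // file row index of cache.get(0)

        public CsvRowCache(String path) {
            this.path = path;
        }

        // Returns one row, reloading the window if the row is not cached.
        public String getRow(int index) throws IOException {
            if (cacheStart < 0 || index < cacheStart
                    || index >= cacheStart + cache.size()) {
                loadWindow(Math.max(0, index - WINDOW / 2));
            }
            return cache.get(index - cacheStart);
        }

        private void loadWindow(int firstRow) throws IOException {
            List<String> rows = new ArrayList<String>(WINDOW);
            BufferedReader reader = new BufferedReader(new FileReader(path));
            try {
                for (int i = 0; i < firstRow; i++) { // skip rows before the window
                    if (reader.readLine() == null) break;
                }
                String line;
                while (rows.size() < WINDOW && (line = reader.readLine()) != null) {
                    rows.add(line);
                }
            } finally {
                reader.close();
            }
            cache = rows;
            cacheStart = firstRow;
        }
    }
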
The following picture shows the 3 levels of the CSV file import: file, memory cache, table.

[picture: file, memory cache, table]

If the table is scrolled up and down a lot, many rows may still be loaded into memory and released immediately afterwards; to avoid wasting memory because of this, Matrex calls the garbage collector directly every 50000 rows loaded.
In this way it was possible to import the 22 MBytes CSV file for which the memory problem was reported, running Matrex without any special memory options.
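
In code this amounts to little more than a counter; a minimal sketch with illustrative names (note that System.gc() is only a hint to the JVM):

    // Illustrative sketch, not Matrex's code: hint a collection every 50000 rows.
    public class RowLoadCounter {
        private static final long GC_INTERVAL = 50000;
        private long rowsLoaded = 0;

        public void onRowLoaded() {
            if (++rowsLoaded % GC_INTERVAL == 0) {
                System.gc(); // a hint only, but frees the discarded row strings sooner
            }
        }
    }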

This fix will be part of version 2.0, but I also back-ported it to version 1.3.8, which will be published in a few days.

Wednesday, October 14, 2009

Matrex 2.0 alpha

I published an alpha (unstable) version of the new Matrex 2.0, which adds the ability to use Matrex as a client/server system.
You can download it from here.
To test it as a client/server system:
  • Install this version of Matrex. It is a generic version, so remember that the first time you start it, it only downloads the SWT graphical library; it really starts only from the second launch.
  • Install the Matrex Server. The setup file, matrex_server_2_0.jar, is in the Matrex directory.
  • Execute rmiregistry, the RMI registry server; it is part of the Java Runtime Environment (JRE). Matrex Desktop and Server use RMI to communicate (see the lookup sketch after this list).
  • In the Matrex Server directory, execute matrex_server.bat (Windows) or matrex_server.sh (Linux, MacOSX...) to start the server. Check that there are no errors.
  • In the Matrex directory start Matrex.
  • Follow this to let Matrex open a server project.
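To illustrate the RMI mechanics mentioned above, a client finds a remote service through the registry roughly as follows; the binding name "MatrexServer" is an assumption, not necessarily the name the Matrex Server actually registers:

    import java.rmi.Naming;

    public class MatrexLookupExample {
        public static void main(String[] args) throws Exception {
            // Look up a remote object in the rmiregistry on the default port 1099.
            // "MatrexServer" is a hypothetical binding name.
            Object server = Naming.lookup("rmi://localhost:1099/MatrexServer");
            System.out.println("Found remote service: " + server);
        }
    }
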
You can log in as guest (password guest).
If you want to log in as a different user, you need to change the config/accounts.xml file in the Matrex Server directory, adding an account element with the userid and password and setting the encrypted attribute to false.
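
A hypothetical sketch of such an entry (only the account element and the userid, password, and encrypted attributes come from the description above; the rest of the schema may differ):

    <!-- hypothetical layout of config/accounts.xml -->
    <accounts>
      <account userid="guest" password="guest" encrypted="false"/>
      <account userid="john" password="secret" encrypted="false"/>
    </accounts>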

To become final release, Matrex 2.0 needs the following changes:
  • Fix potential issues when a single project in one server is opened concurrently by several users.
  • Some operations, like adding functions or function expressions (through the expression parser), cause the addition or update of several items in the project. These operations must therefore be done atomically, possibly using some kind of transaction (see the sketch after this list).
  • Check that the resources allocated to a client in the server are cleaned up correctly when the client disconnects.
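As a sketch of what such a transaction could look like (hypothetical types, not Matrex code): apply a batch of project updates as one unit, undoing the already-applied ones if a later one fails.

    import java.util.ArrayDeque;
    import java.util.Deque;

    // Hypothetical sketch: all-or-nothing application of several project updates.
    public class ProjectTransaction {
        public interface Update {
            void apply() throws Exception;
            void undo();
        }

        public static void applyAtomically(Iterable<Update> updates) throws Exception {
            Deque<Update> applied = new ArrayDeque<Update>();
            try {
                for (Update u : updates) {
                    u.apply();
                    applied.push(u);
                }
            } catch (Exception e) {
                while (!applied.isEmpty()) {
                    applied.pop().undo(); // roll back in reverse order
                }
                throw e;
            }
        }
    }
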
If you find problems with this alpha version, please add a comment to this article.

Monday, October 05, 2009

Large CSV files

A bug submitted last month showed that Matrex needs a lot of memory when importing large CSV files.
The CSV file mentioned in the bug has a size of 22 MBytes.
The file contains around 200000 lines. Each line has 14 fields and is around 100 characters long.

Importing this file increases the memory used by the Matrex process by 300 MBytes.
The CSV file is large, but its size does not justify that much RAM.

I checked the code that imports CSV files in Matrex; nothing is wrong.
Matrex uses the Java CSV library to read CSV files, which works fine.

The file is loaded into memory row by row.
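
Conceptually, the import loop looks like the sketch below. It uses a naive split-based reader as a stand-in for the Java CSV library (which, unlike this, handles quoting correctly), but the memory behaviour is the same: one String[] per row, and every row kept in a list.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    // Stand-in for the real import: one String[] per row, all rows in memory.
    public class NaiveCsvImport {
        public static List<String[]> readAll(String path) throws IOException {
            List<String[]> rows = new ArrayList<String[]>();
            BufferedReader reader = new BufferedReader(new FileReader(path));
            try {
                String line;
                while ((line = reader.readLine()) != null) {
                    rows.add(line.split(",", -1)); // naive: ignores CSV quoting rules
                }
            } finally {
                reader.close();
            }
            return rows;
        }
    }
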
I checked how much memory is used for each loaded row. This is not easy in Java, since nothing similar to a sizeof function exists in the standard libraries. But I found the Javabi library, which can measure the total amount of memory used by a Java object.
Each row, with its fields, is handled as an array of strings, and uses around 800 bytes: 8 times the size of the original row.
This is because:
  • Java strings use Unicode, which means that they use 2 bytes for each character
  • Strings use additional memory for their internal fields and for the alignment (padding) of those fields
800 bytes * 200000 rows = ~160 MBytes.
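
As a rough cross-check of that arithmetic, here is the estimate as a small program; the per-String overhead constant is an assumption about typical JVMs of the time, not a measured value:

    // Back-of-envelope estimate; the overhead constants are assumptions.
    public class RowMemoryEstimate {
        public static void main(String[] args) {
            int fields = 14;      // fields per row, from the bug report
            int rowChars = 100;   // characters per row
            long rows = 200000;

            int perStringOverhead = 40;    // assumed header + fields + char[] header
            long charData = 2L * rowChars; // Java strings use 2 bytes per character
            long perRow = charData + (long) fields * perStringOverhead
                    + 16;                  // assumed size of the String[] itself

            System.out.println("~" + perRow + " bytes per row, ~"
                    + (perRow * rows / (1024 * 1024)) + " MBytes in total");
        }
    }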

As far as I understand, the rest of the memory used to import the file is taken by the intermediate strings that the CSV reader creates while reading the file, which remain allocated until the garbage collector frees them.

There are some solutions that could be applied to reduce the memory use:
  1. Avoid loading all the rows of the CSV file in memory: in other words, make the import editor extract the displayed lines directly from the file, and only when they are actually displayed.
    I'm not sure about the effectiveness of this solution, because in general the user wants to import an entire column of the file into a matrix; sooner or later the file has to be read entirely.
    Another problem with this solution is that the Java CSV library, as far as I understand, allows neither counting the rows without reading them nor jumping between rows without reading all the intermediate ones (see the indexing sketch after this list).

  2. Read fewer fields: right at the start of the import process, give the user the possibility to discard some fields, so that they are not loaded in the import dialog.
    This can work, but I am not sure it would dramatically reduce the amount of memory used to load the file.

  3. Optimize the reading process so that it uses less memory: that is, look for an alternative to the Java CSV library that needs less memory to read the files (for example one based on CharSequence objects). There are alternative libraries, for example opencsv and the Ostermiller utilities; they need to be tested to see whether they actually use less memory than the Java CSV library.
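
For solution 1, one way around the missing random access would be to scan the file once, remembering the byte offset of every line, and then read single rows on demand. A minimal sketch of this idea (an assumed approach, not Matrex's code; note that RandomAccessFile.readLine does not decode multi-byte encodings):

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.ArrayList;
    import java.util.List;

    // Assumed approach: index line offsets once, then seek to any row directly.
    public class IndexedCsvFile {
        private final RandomAccessFile file;
        private final List<Long> lineOffsets = new ArrayList<Long>();

        public IndexedCsvFile(String path) throws IOException {
            file = new RandomAccessFile(path, "r");
            long offset = 0;
            while (file.readLine() != null) {
                lineOffsets.add(offset);        // start of the line just read
                offset = file.getFilePointer(); // start of the next line
            }
        }

        public int rowCount() {
            return lineOffsets.size(); // row count without keeping rows in memory
        }

        public String row(int index) throws IOException {
            file.seek(lineOffsets.get(index));
            return file.readLine();
        }
    }
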
I will try to apply these solutions and explain, in one of the next articles, what has been done.