Tuesday, October 27, 2009

CSV files again


The alpha release of Matrex 2.0 solves the problem I reported about importing large CSV files.

Previous versions of Matrex already used a virtual table to show imported CSV files: the table (grid) loads from memory only the rows that it needs to display.
The next step was simple: in version 2.0 the file is no longer loaded into memory at all; only the rows that are displayed are actually read from the file.
As a result, the memory used to import a CSV file has decreased dramatically compared to the previous versions of Matrex.
To avoid performance losses with this new version, Matrex keeps a cache of 2000 rows from the file (the 2000 rows around the last row loaded from the file); in this way scrolling the table up and down is still fluid.
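The cache idea can be sketched roughly as follows. This is only an illustration, not the Matrex code: RowSource and RowWindowCache are hypothetical names, and the sketch assumes that a row can be read from the file by its index.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical source of rows; in this sketch it stands in for the CSV file.
interface RowSource {
    int rowCount();
    String[] readRow(int index);
}

// Keeps a window of rows centered on the last row that had to be read from the source.
class RowWindowCache {
    private final RowSource source;
    private final int windowSize;
    private final Map<Integer, String[]> window = new HashMap<>();
    private int loads = 0; // how many times the window was rebuilt

    RowWindowCache(RowSource source, int windowSize) {
        this.source = source;
        this.windowSize = windowSize;
    }

    String[] getRow(int index) {
        if (index < 0 || index >= source.rowCount()) {
            throw new IndexOutOfBoundsException("row " + index);
        }
        String[] row = window.get(index);
        if (row == null) {
            // Cache miss: drop the old window and load the rows around the requested one.
            reloadAround(index);
            row = window.get(index);
        }
        return row;
    }

    private void reloadAround(int center) {
        window.clear();
        int first = Math.max(0, center - windowSize / 2);
        int last = Math.min(source.rowCount(), first + windowSize);
        for (int i = first; i < last; i++) {
            window.put(i, source.readRow(i));
        }
        loads++;
    }

    int loadCount() {
        return loads;
    }
}
```

With a window of 2000 rows, scrolling within about 1000 rows of the last loaded row is served from memory; only a jump outside the window goes back to the file.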
The following picture shows the 3 levels of the CSV file import: file, memory cache, table.




If the table is scrolled up and down a lot, many rows can still be loaded into memory and released immediately afterwards; to keep these discarded rows from piling up in memory, Matrex explicitly calls the garbage collector every 50000 rows loaded.
In this way it was possible to import data from the 22 MByte CSV file for which the memory problem was reported, running Matrex without any special memory option.
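A counter of this kind could look like the sketch below. The class name LoadCounter is hypothetical and the threshold is a parameter; note that System.gc() is only a request to the JVM, which is free to ignore it.

```java
// Requests a garbage collection after every `threshold` loaded rows.
class LoadCounter {
    private final int threshold;
    private long loadedSinceLastGc = 0;
    private int gcRequests = 0; // kept only to make the behavior observable

    LoadCounter(int threshold) {
        this.threshold = threshold;
    }

    // To be called each time a row is read from the file.
    void rowLoaded() {
        loadedSinceLastGc++;
        if (loadedSinceLastGc >= threshold) {
            System.gc(); // a hint, not a guarantee
            gcRequests++;
            loadedSinceLastGc = 0;
        }
    }

    int gcRequests() {
        return gcRequests;
    }
}
```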

This fix will be part of version 2.0, but I also back-ported it to version 1.3.8, which will be published in a few days.

Wednesday, October 14, 2009

Matrex 2.0 alpha

I published an alpha (unstable) version of the new Matrex 2.0, which adds the possibility to use Matrex as a client/server system.
You can download it from here.
To test it as a client/server system:
  • Install this version of Matrex. It is a generic version, so remember that the first time you start it, it will download the graphical library SWT. It will actually start only the second time you launch it.
  • Install the Matrex Server. The setup file is in the Matrex directory, called matrex_server_2_0.jar.
  • Execute rmiregistry. It is the RMI registry server (Matrex Desktop and Server use RMI to communicate). It is part of the Java Runtime Environment (JRE).
  • In the Matrex Server directory, execute matrex_server.bat (Windows) or matrex_server.sh (Linux, MacOSX...) to start the server. Check that there are no errors.
  • In the Matrex directory start Matrex.
  • Follow this to let Matrex open a server project.
You can log in as guest (password guest).
If you want to log in as a different user, you need to change the config/accounts.xml file in the Matrex Server directory, adding an account element with the userid and password, always setting the encrypted attribute to false.

To become a final release, Matrex 2.0 needs the following changes:
  • Fix potential issues when a single project in one server is opened concurrently by several users.
  • Some operations, like adding functions or function expressions (expression parser), cause the addition or update of several items in the project. Therefore these operations must be done atomically, possibly using some kind of transaction.
  • Check that the resources allocated to a client in the server are cleaned up correctly when the client disconnects.
If you find problems with this alpha version, please add a comment to this article.

Monday, October 05, 2009

Large CSV files

A bug submitted last month shows that Matrex needs a lot of memory when importing large CSV files.
The CSV file mentioned in the bug has a size of 22 MBytes.
The file contains around 200000 lines. Each line has 14 fields and is around 100 characters long.

To import this file, the memory used by the Matrex process increases by 300 MBytes.
The CSV file is large, but not large enough to justify using so much RAM to handle it.

I checked the code that imports CSV files in Matrex; nothing is wrong.
Matrex uses the Java CSV library to read CSV files, which works fine.

The file is loaded in memory row after row.
I checked how much memory is used for each loaded row. This is not easy in Java, since nothing similar to a sizeof function exists in the standard libraries. But I have found the Javabi library, which is able to measure the total amount of memory used by a Java object.
Each row with its fields is handled as an array of strings, which uses around 800 bytes, 8 times the original row's size.
This is because:
  • Java strings use Unicode, which means that they use 2 bytes for each character
  • Strings use additional memory for their fields and their fields alignments
800 bytes * 200000 rows = ~160 MBytes.
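The figure of around 800 bytes per row can be cross-checked with some rough arithmetic. The sketch below is only an estimate under assumptions that were not measured here: a 32-bit JVM with 8-byte object headers, 12-byte array headers, 4-byte references and 8-byte alignment, and a String made of a char[] reference plus three int fields (offset, count, hash), as in the Java 6 class library. RowMemoryEstimate is a hypothetical helper.

```java
// Back-of-the-envelope estimate of the memory used by one CSV row held as String[].
class RowMemoryEstimate {

    // Objects are assumed to be padded to a multiple of 8 bytes.
    static long align8(long bytes) {
        return (bytes + 7) / 8 * 8;
    }

    // One String: the String object itself plus its backing char[] (2 bytes per character).
    static long stringBytes(int chars, int objectHeader, int arrayHeader, int refSize) {
        long stringObject = align8(objectHeader + 3 * 4 + refSize); // offset, count, hash + char[] reference
        long charArray = align8(arrayHeader + 2L * chars);
        return stringObject + charArray;
    }

    // One row: the String[] holding the references plus all the field strings.
    static long rowBytes(int fields, int charsPerField, int objectHeader, int arrayHeader, int refSize) {
        long array = align8(arrayHeader + (long) refSize * fields);
        return array + fields * stringBytes(charsPerField, objectHeader, arrayHeader, refSize);
    }
}
```

With 14 fields of about 7 characters each, rowBytes(14, 7, 8, 12, 4) gives 856 bytes, which is in the same range as the measured value. Most of it is per-object overhead rather than character data.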

As far as I understand, the rest of the memory used to import the file is allocated to the intermediate strings that the CSV reader uses to read the file, and that remain allocated until the garbage collector frees them.

There are some solutions that could be applied to reduce the memory use:
  1. Avoid loading all rows of the CSV file in memory: in other words, make the import editor extract the displayed lines directly from the file, and let it extract the lines only when they are actually displayed.
    I'm not sure about the effectiveness of this solution, because in general the user wants to import to a matrix an entire column of the file. Therefore the file, sooner or later, has to be read entirely.
    Another problem with this solution is that the Java CSV library, as far as I understand, cannot count the rows without reading them and cannot jump between rows without reading all the intermediate rows.

  2. Read fewer fields: right at the start of the import process, give the user the possibility to discard some fields, so that they are not loaded in the import dialog.
    This can work, but I am not sure that it can dramatically reduce the amount of memory used to load the file.

  3. Optimize the reading process so that it uses less memory: this means looking for an alternative to the Java CSV library that uses less memory to read the files (for example using CharSequence objects that use less memory). There are alternative libraries, for example opencsv and the Ostermiller utilities. They need to be tested to see whether they use less memory than the Java CSV library.
I will try to apply these solutions and explain, in one of the next articles, what has been done.
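For the first solution, one way around the missing "jump to row" feature is to scan the file once, remember the offset at which each line starts, and then seek directly to a line when it is needed. The sketch below uses only java.io.RandomAccessFile from the standard library; LineIndex is a hypothetical class, and it assumes that no quoted field contains a line break (otherwise lines and CSV rows no longer match). RandomAccessFile.readLine is unbuffered and does not decode multi-byte encodings, so this shows the idea rather than a fast implementation.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

// Remembers where each line of a file starts, so that single lines can be read on demand.
class LineIndex {
    private final List<Long> offsets = new ArrayList<>();

    // One sequential pass over the file; only the offsets are kept in memory.
    LineIndex(RandomAccessFile file) throws IOException {
        file.seek(0);
        long start = 0;
        while (file.readLine() != null) {
            offsets.add(start);
            start = file.getFilePointer();
        }
    }

    int lineCount() {
        return offsets.size();
    }

    // Jumps straight to the requested line without reading the lines before it.
    String readLine(RandomAccessFile file, int line) throws IOException {
        file.seek(offsets.get(line));
        return file.readLine();
    }
}
```

The index costs one long per line, and each returned line would still have to be split into fields by a CSV parser.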

Thursday, September 17, 2009

Matrex 2.0 is not just a specification anymore

After more than two months of work, I was finally able in the last few days to get Matrex to open a project on a Matrex server.
This is how it works:

In Matrex click on the menu File->Connect. The following dialog appears:


In this case the server is on my PC, the same one on which I run Matrex, so I write localhost as server address.
Clicking on Names, the Matrex Server combo box is populated with the list of servers available on the PC with the given address.
In my case there is only one server, default.
I press OK. The login dialog appears:


Guest is the default user, with password guest. It is the user that is available by default in the server, if it has not been configured.
I write user id and password and press OK.
The remote machine tab for the localhost server appears, beside the local machine:


In the machine menu I click on Open Project. The Open Remote Project dialog appears:


Unlike with local projects, here it is only possible to choose from a list of projects. In fact, on the server side the projects are all under the same directory, projects.
I select the example project popcolorado and the project opens:



So far I have checked that it is possible to open matrix and function viewers and editors.
In the next days I will check all the project's functionalities.
As usual, the sources for the latest version of Matrex are in the Matrex subversion repository.

As soon as I have a version that is tested enough, I'll publish it as a pre-alpha.

Monday, August 24, 2009

Matrex 1.3.7

I released version 1.3.7, which fixes the bug reported as a comment to the Matrex 1.3.6 blog entry, and that I entered in the Matrex bug tracker:

Matrex seems to forget the Project settings (threading, etc.) after a restart;

This was caused by Matrex not being able to overwrite the project file.
The files are the following (in alphabetic order):

matrex_1_3_7_generic.jar for any platform
matrex_1_3_7_linux_gtk.jar for Linux
matrex_1_3_7_macosx_32.jar for MacOSX with 32-bit Java (Java 5)
matrex_1_3_7_macosx_64.jar for MacOSX with 64-bit Java (Java 6)
matrex_1_3_7_win32.jar for Windows

The bug has also been fixed in the code of the next release, 2.0.

Saturday, August 08, 2009

Client/Server: technical view

As mentioned in the previous article, I'm changing Matrex from a pure standalone desktop application to an application that can work either standalone or in a client/server architecture.
To support the client/server architecture I use the RMI protocol.

This means that the calculation engine, the one that calculates the functions and therefore generates the content of matrices, presentations, charts, will be both in the desktop application and in the server.
For this reason, the GUI has to use the objects involved in the calculation (projects, matrices, functions...) in the same way, whether they are on the client side or on the server side.
To do this, the original calculation objects (projects, matrices, functions...) are wrapped in two different new categories of objects: Local (client) and Server:



Both wrappers share the same remote interface (which extends RMI's Remote interface).

The reason I use wrappers instead of the original objects is that all the methods of an RMI business object must throw the RemoteException exception.
RemoteException is needed to understand when the server is down or there are connection problems, so I would never do without it.
On the other hand, it becomes annoying to catch it every time some code calls a method of a business object, so I want to do it only when it is strictly needed.
So I use the wrappers only in the GUI, where they are needed. The calculation engine instead uses the original objects.
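The structure described above can be sketched like this. All the names (Matrix, RemoteMatrix, LocalMatrix, ServerMatrix) and the two methods are simplified placeholders invented for the illustration, not the real Matrex classes; only Remote, RemoteException and UnicastRemoteObject come from the standard java.rmi packages.

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.server.UnicastRemoteObject;

// Original calculation object (simplified placeholder): no RMI, no checked exceptions.
class Matrix {
    private final double[][] cells;

    Matrix(int rows, int cols) {
        cells = new double[rows][cols];
    }

    double get(int row, int col) {
        return cells[row][col];
    }

    void set(int row, int col, double value) {
        cells[row][col] = value;
    }
}

// Remote interface shared by both wrappers; this is the only type the GUI sees.
interface RemoteMatrix extends Remote {
    double get(int row, int col) throws RemoteException;

    void set(int row, int col, double value) throws RemoteException;
}

// Local wrapper: a plain object that delegates to the original one.
class LocalMatrix implements RemoteMatrix {
    private final Matrix matrix;

    LocalMatrix(Matrix matrix) {
        this.matrix = matrix;
    }

    public double get(int row, int col) {
        return matrix.get(row, col);
    }

    public void set(int row, int col, double value) {
        matrix.set(row, col, value);
    }
}

// Server wrapper: exported through RMI as soon as it is constructed.
class ServerMatrix extends UnicastRemoteObject implements RemoteMatrix {
    private final Matrix matrix;

    ServerMatrix(Matrix matrix) throws RemoteException {
        super();
        this.matrix = matrix;
    }

    public double get(int row, int col) throws RemoteException {
        return matrix.get(row, col);
    }

    public void set(int row, int col, double value) throws RemoteException {
        matrix.set(row, col, value);
    }
}
```

GUI code written against RemoteMatrix has to handle RemoteException in both cases, while code that works on Matrex's original objects (Matrix in this sketch) never sees it.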

Now, why Local and Server wrappers? Why not use only Server wrappers, both on the server and on the client side?
There are several reasons:
  • Machines and projects have slightly different interfaces when they are on the client and on the server side, mainly because projects on the server side can only be saved in a specific directory, while projects on the client side can be saved in any directory of the disk.
  • The server wrappers extend UnicastRemoteObject, local wrappers don't. I don't completely understand how the Java compiler and the RMI compiler handle these objects, so I cannot be sure that they have no effect on the application's performance. Even if such effects are unavoidable for the server business objects, I don't want them on the local objects.
And why did I not use the original classes instead of the Local wrappers? Because I needed a special wrapper for the Matrix class when it is used in the GUI, and only when used in the GUI, called SafeMatrix, which makes the Matrix methods thread safe.
But this means that all the other calculation classes need to have parameters of type SafeMatrix when called by the GUI, and instead use parameters of type Matrix when called by the calculation engine. And this means that I need special wrappers that use SafeMatrix parameters: the Local wrappers.
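A thread-safe wrapper of this kind could be as simple as the sketch below. The Matrix class here is a simplified placeholder and the methods are invented for the illustration; the real SafeMatrix may work differently. Locking on the wrapped object only protects callers that go through the wrapper (or that lock on the same object), so this sketch says nothing about how the calculation engine and the GUI are coordinated in Matrex.

```java
// Simplified placeholder for the calculation class; not thread safe on its own.
class Matrix {
    private final double[][] cells;

    Matrix(int rows, int cols) {
        cells = new double[rows][cols];
    }

    double get(int row, int col) {
        return cells[row][col];
    }

    void set(int row, int col, double value) {
        cells[row][col] = value;
    }
}

// Wrapper that serializes every access to the wrapped Matrix.
class SafeMatrix {
    private final Matrix matrix;

    SafeMatrix(Matrix matrix) {
        this.matrix = matrix;
    }

    double get(int row, int col) {
        synchronized (matrix) {
            return matrix.get(row, col);
        }
    }

    void set(int row, int col, double value) {
        synchronized (matrix) {
            matrix.set(row, col, value);
        }
    }

    // Read-modify-write done under one lock, so concurrent updates are not lost.
    void add(int row, int col, double delta) {
        synchronized (matrix) {
            matrix.set(row, col, matrix.get(row, col) + delta);
        }
    }
}
```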


So, now I'm working on it. It will take some time, because in the GUI all the references to the original objects must be changed to the new remote interfaces.
Which means:
  • remote exceptions to handle.
  • utility functions to convert the original classes to the wrappers.
  • some code duplication.
  • many wrappers to write, especially for the charts, for which there is one class for each chart type.
Also, I expect to reduce the number of methods in the calculation classes to reduce the number of remote calls.

When I have something that more or less works I'll publish it as an alpha version.

Wednesday, July 22, 2009

Working on client/server

I started to work on the client/server version of Matrex, version 2.0.

This version, as explained in the specification, will give the possibility to use Matrex in two ways:
  • standalone, as today
  • connected to one or more Matrex servers.
When Matrex opens a project in a server, all calculations for that project are done in the server: Matrex acts only as a graphical interface.
One would open a project in a server:
  • to use the CPU of the PC running the server instead of the CPU of one's own PC.
  • to share the project with other people. In fact two or more Matrex clients can work on the same project in the same server at the same time, without problems.
Matrex has been written from the start to become one day a client/server system, so the GUI will not change so much: not much more than a new menu to connect to a server (I will publish some pictures as soon as I have a stable version).

The protocol used is RMI, but I will keep the possibility to use different protocols in the future. It could be nice to have a version (based on REST?) that can work on the internet through the firewalls.