Does aspose.cell for java support "event-driven" parsing

Hi,
we currently run an evaluation if Aspose.Cells for Java fulfills the business needs in our company. The two major use-cases we are interested in are the calculation of MS Excel Workbooks within Java web applications and parsing of MS Excel Workbooks. So far we used an Open Source product (Apache POI) to do this. Due to the fact this product does not support xlsx files we would like to migrate to Aspose.Cells.

We have already prototyped the calculation use-case and it worked well, despite the fact it was not possible to open an xlsx file created by Aspose.Cells (1.9.1) again (neither by MS Excel nor by Aspose). But the rest worked fine :slight_smile:

The second use-case is the parsing of very large Excel files (up to 30 MB). For this use-case we neither need direct cell access nor calculation capabilities. If we read the whole file into the memory and a Java Object Tree is created, this will consume a lot of memory and bring down the web servers (in our applications we typically have a lot of concurrent users and hence a high system load). Apache POI provides the feature ‘event-driven parsing’. This is compareable to a SAX parser where the application is notified on a new XML element and saves a lot of memory. In the case of Apache POI the application is notified if for example a “cell-event” occurs and no Workbook object is created (for the parsing use-case we don’t need direct cell access or calculation)

Is this feature also supported by Aspose.Cells or does Aspose.Cells provide other memory optimization mechanisms?

Thanks a lot,
kind regards
Gunter


Hi Gunter,

For the issue of opening xlsx file created by aspose.cells, would you please try the latest versions:

Aspose.Cells for .NET (Latest Version)
Aspose.Cells for Java (Latest Version)

If the issue still persists, please give us a template file to reproduce this issue, we will check and figure out it soon.

For large excel file, does it meet your requirement if we provide method to load and parse only one specific worksheet at a time and leave other worksheets in stream? And we will provide method to load only Styles and cell values for a workbook, ignore other data such as picture, shape, and so on, that will save memory efficiently when there are many large embedded objects.

Hi,
This would indeed help a lot. However you don’t have to provide the method right now, we’re still in an evaluation phase for Aspose.
For now I am confident that we will be able to implement the second use-case with Aspose and your help in future.

Thanks,
Gunter

Hi Gunter,

Which solution can meet your requirement more efficiently, loading only one worksheet at a time or loading only cell values without objects? As you know, we can implement that separately. For the later one, that is, loading only cell values without objects, we can provide it soon for your testing.

Hi,
After a closer look and some tests we found out that loading only one worksheet at a time will not fulfill our requirements. I don’t know if loading the cell values only will scale. The workbooks don’t have a lot of formatings or other things inside. They are just a huge data collection. The Strings in the cells are mostly different, hence we won’t profit of the string memory management done by Java. What do you think - what would be the size of a 30 MB Excel File loaded into memory (only cell-values)?

For a real scalable approach it would be nice to provide mechanisms which can parse the Excel file line by line and maybe calling a callback function or providing an abstract class which can parse an Excel line by line.

Thanks,
kind regards
Gunter


Hi Gunter,

I don't think loading cells value only can be helpful for your situation if there is just a huge data collection. Commonly an original file 30M will need about 300M memory when load it. If you need only the String values cell by cell, and need no any formula, formatting, or anything others that potentially refer to other global data such as other cells, sheets, or workbook and so on, we will look into it.

Hi,
our workbooks constist of several sheets which contain string, decimal, numeric and date data (we need the data type too - providing only the string values would not be sufficient). There are no references or formulas inside. There is also no need for formatting… So they are really just a data collection. If you could provide a scaleable function to parse such large files it would be great !

Thank you,
Regards
Gunter

Hi Gunter,

Which version of Excel file do you need to support this feature for first, xlsx file of Excel2007 or xls of Excel 2003? We will look into it soon. But for any verison of Excel file, something such as String, data type, need be read in completely as global data, otherwise it will be a huge burden for performance when fetch the value of Cell. So, compare to using existed API of Aspose.Cells such as Workbook, Worksheet and Cell, this "event-driven" feature can get obvious improvement for performance, but I am afraid it will not be so much as the "event-driven" feature of SAX parser.

Hi,
For first we would need to parse the old xls format, but later - in one year our customers will switch to Office 2007. As mentioned above we are using Apache POI. Maybe have a look how they do it (it’s open source)

Thank you,
Regards
Gunter

Hi Gunter,

Now we support parsing cells data in Event-Driven mode. Please try this attached patch. The new APIs for this mode are class LightCells, LightCell and CellHandler. LightCells is the entry to open and process a workbook, LightCell is the wrapper for data of one cell. CellHandler is the interface user should implements to process a cell. Following sample code maybe help you:

...
LightCells lightCells = LightCells.getInstance("t.xls", FileFormatType.EXCEL2003);
CellHandler handler = new CellHandlerSample();

lightCells.processWorkbook(handler);//lightCells.processWorksheet(0, handler);
...

class CellHandlerSample implements CellHandler
{
private int _sheetIndex;


/**start with a worksheet, check whether process it. If this handler need to refer to sheetIndex later in startCell() or process() method, that is, if the process need to know which worksheet is being processed, the implementation should retain the sheetIndex value here. */
public boolean startSheet(int sheetIndex)
{
_sheetIndex = sheetIndex;
return true;
}


/**start with a cell, check whether process it*/
public boolean startCell(int row, int column)
{
return true;
}


/**process current cell.
Note: In consideration of performance, the LightCell object maybe reused for all cells, so if it is needed to keep current cell's data for later use, please call LightCell.copy() and keep the cloned one*/
public void process(LightCell cell)
{
int row = cell.getRowIndex();
int col = cell.getColumnIndex();
int type = cell.getValueType();
Object val = cell.getValue();
System.out.println(_sheetIndex+"-"+row+"&"+col+":"+type+", "+val);
}
}