Re: xlsread in Octave 3.6.4

On Wed, Sep 4, 2013 at 10:07 PM, Philip Nienhuis <address@hidden> wrote:

Markus Bergholz wrote:

On Tue, Sep 3, 2013 at 11:42 PM, Philip Nienhuis <address@hidden
<mailto:address@hidden>> wrote:

<moved from help-octave to octave-maintainers ML>

Markus Bergholz wrote:

On Mon, Sep 2, 2013 at 11:38 AM, Markus Bergholz
<address@hidden <mailto:address@hidden>
<mailto:address@hidden <mailto:address@hidden>>> wrote:
On Mon, Sep 2, 2013 at 12:10 AM, Markus Bergholz
<address@hidden <mailto:address@hidden>
<mailto:address@hidden <mailto:address@hidden>>> wrote:
On Sun, Sep 1, 2013 at 11:42 PM, PhilipNienhuis
<address@hidden <mailto:address@hidden>
<mailto:address@hidden <mailto:address@hidden>>__>

wrote:
Markus Bergholz wrote

<snip>

>>>> Markus Bergholz wrote
>>>> > I haven't follow this thread and it's issue, but
i've wrote a
>>>> xlsxread
>>>> > function whitch don't need java.
>>>> > but it's very very rudimentary, works just with
linux and is a
>>>> quick&dirty
>>>> > write-down.
>>>> > furthermore, you have to remove the string-analyse
part, if your
>>>> sheet
>>>> > don't contain strings.
>>>> > but maybe it helps someone else or someone want to
improve it or
>>>> someone
>>>> > rewrite it in c/c++ as oct file, to get it even
faster than
>>>> matlab (for me
>>>> > it's still faster than the java stuff atm).

<snip, see thread on help-octave ML>

i've made a few quick and dirty changes, change to gpl
licence and
commit the broken range part too.

https://github.com/markuman/__xlsxread

<https://github.com/markuman/xlsxread>

it's now plattform indepentend and - once again - faster
than before
(~58 seconds). now it's nearly twice as fast as matlab
(~110 seconds).
enough time to waste it for ranges, strings etc in future.

here comes version 0.6 - https://github.com/markuman/__xlsxread

<https://github.com/markuman/xlsxread>

* strings and calculations are now replaced with NaN (without
any speed
losses!)
* testet with a excel 2007 and excel for mac 2011 file
(example files
are added)
* it's using now nested functions. this should be easier to
ingetrate it
in octave-io

ranges and empty columns still don't work!

Good work Markus.

Anyway, sorry to come up with a few more potential gotchas:

i know it's not finished ;)

- Interesting would be if your code properly handles merged and
hidden cells. I don't know what they look like in raw OOXML.

- Does OOXML have repeated-rows and repeated-columns "folding"?
E.g., ODS1.2 has the table:TableNumberRowsRepeated and
table:__TableNumberColumnsRepeated tags.

Yes, I think we are talking about the same. The last time I've take a
look at it is 3 month ago. That's the next step.
And it would be helpful if some others can commit some example files for
this situation. That's all i got
https://github.com/markuman/xlsxread/tree/master/example

Oh, I have several specially cooked-up spreadsheet files lying around for testing (of course I have), but only a few in OOXML format. But translating from BIFF8 (Excel'97) or ODS to OOXML is easily done.

These days I have a test script in the io package that is used to test basic functionality of spreadsheet I/O for all supported interfaces (xls, ods). I started adding more complicated test cases to it earlier this year, incl. merged cells, but that is still unfinished.

In io-1.2.x there's a io_xls_testscript.m that can invoke interface "A" to create a spreadsheet file and invokes interface "B" to read it back (I made that for OpenXLS as its write support is currently disabled). Shouldn't be hard to amend it for testing your version.

It would be really good to have a Java-free (and ActiveX-free)
spreadsheet reading capability in Octave, even if only a basic one.

Sergei suggested a Perl-based solution; but Perl would still be a
dependency, not all systems have Perl installed (e.g., Windows).

So this is obsolete now.

Well...

1. I myself like choice, and I'd like to offer users some choice too, so I'm not opposed to adding it in if it is not too much work.

2. I think/hope it'll give me a chance to learn a little bit of Perl.

3. Perl offers write support as well (according to Sergei; I didn't check).

4. Many Linux distros have Perl installed in the base system.

==> So I keep this option on my TODO list; in the "sometime, somewhere" section, that is.

i've take a look at the perl stuff, but it seems to be very slow with three nested loops. however, i'd like to know how to interact between perl and octave (any examples?).

You've made a first try for OOXML; I have a basis for decoding ODS
lying around, it doesn't work at all yet but might not need undue
amounts of attention.
You made the vital piece: unzipping the spreadsheet file to disk.

For inclusion in the OF io package (in a later stage, first try to
get your version robust and fail-safe) I'd suggest to see how the
various "interfaces" are built and called in the OF io package.

I was guessing that this was the message from the "ToDo" and "ChangeLog"
part from the README.md file.
* do the cell folding (e.g. empty columns) part
* make it more robust (there may be >100 disregarded xlsx variation)
* cleanup the code and fit it for the io package
but next week i'm on travel so there won't be any updates in the next ~2
weeks from me.

There's absolutely no hurry. I'd rather see a well-tested and robust version than a sloppy Q&D job that provokes heaps of bug reports.
Take your time.

For the time being, just keep building & testing it as a stand-alone self-contained xlsread.m
When it's time I'll help integrating it in the OF io package.

<snip>

Oh BTW another idea (that I explored in 2009 but couldn't get to
work at the time):
There is a binary (compiled) xmlread function, currently it is in
the io package. Maybe with a proper "template" it could just read
the worksheets into a struct in RAM, faster than regexp can decode
it. The missing piece is the "template". (sorry for my lack of XML
proficiency & lingo) Unfortunately that xmlread is very tersely, if
not badly documented.
However there are xml toolboxes around that could be gotten to work
in Octave.

I'll take a look at it.

Again, no hurry.

But at the moment i'm happy with this solution. It's easy and fast to
kill several needs (like t="s" and formulars like <f>) with just one
regexp command - even if it's nested.

Yep.

Okey, some good and some bad news.

first i change tactics how i read the values. now i read first the dimension and creat a corresponding matrix full of NaN.

in the second step i read all numeric values (no t="s", no <f> etc) and their position (e.g. B2). the third step is to transform B2 in the octave index style (2,2) (atm limited from A1 to ZZZ9999) and copy the value in the matrix full of NaN (vectorize, no loops. it's a small regexp inferno and one time i use cellfun (there is much space for improvements and code cleanup i guess).

This works very stable (http://p.osuv.de/index.php/Dlu1/) but it increase run time and memory usage!

E.g. for excel2007.xlsx it was "Elapsed time is 0.078457 seconds" and is now "Elapsed time is 0.17204 seconds".

But the huge example.xlsx (1234x1234) feels now like it uses java again! :'-( :D

So atm i lost one goal (speed/runtime) and reach one goal (robustness) at the same time. but hey, it's still java-free 8-)

next week i'll going to test some more sheets at work (huge, folding etc).

Thanks,

Philip

--
icq: 167498924
XMPP|Jabber: address@hidden

From:	Markus Bergholz
Subject:	Re: xlsread in Octave 3.6.4
Date:	Fri, 13 Sep 2013 23:36:27 +0200