Re: Table I/O [WAS: io-2.4.13 released]

octave-maintainers

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Table I/O [WAS: io-2.4.13 released]

From:	Andrew Janke
Subject:	Re: Table I/O [WAS: io-2.4.13 released]
Date:	Fri, 18 Oct 2019 14:58:56 -0700
User-agent:	Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:60.0) Gecko/20100101 Thunderbird/60.9.0



On 10/18/19 2:19 PM, PhilipNienhuis wrote:

apjanke-floss wrote

On 10/18/19 2:29 AM, Philip Nienhuis wrote:

Kai Torben Ohlhus wrote:

On 10/18/19 5:30 PM, PhilipNienhuis wrote:

[...]

For readTable() and friends Octave needs to have the Table class
implemented. To my knowlegde there are no definite plans for that;
there is
a prospect of a start by Markus Bergholz somewhere in (IIRC) the bug
tracker, from several years back.


Regarding this, I want to remind of Andrews project [1].


Thanks, I forgot about that. Yeah, an impressive piece of work. But,
still no Table I/O there AFAICS.
On github (or gitlab?) I saw a readTable() that effectively is a wrapper
for xlsread. Could be a temporary solution.


Thanks Kai and Philip! Table I/O is on my TODO list for Tablicious. Now
that I know there's appetite for it, I'll bump up its priority.

https://github.com/apjanke/octave-tablicious/issues/49

It looks like Forge io's spreadsheet reading is mature enough that I
could build Table I/O on top of that. And that seems like a good design:
no need for another package to re-implement basic spreadsheet I/O.

With one exception I can see: cell formatting. I need to be able to
detect which columns are formatted as dates, so they can automatically
be converted to @datetime objects. (And eventually I'd like to provide
the option to efficiently read selected columns as @categorical, but
that's much lower priority.)

Is there an efficient way in io's spreadsheet support to detect the
"type"/format of cells/columns? Or would you be willing to work with me
to add one to io? (Maybe by just exposing the cell format as yet another
output arg of xlsread()?)

Same with writing tables: I'd need a way to control cell formatting so
that dates could be formatted as dates explicitly instead of shooting
them in as strings and relying on Excel's auto-parsing functionality
(which I think only works when you're using the Excel COM interface
anyway). Would also be nice to do stuff like bold column headers and
freeze panes.


First off, sorry for a long reply.

I think, no I'm sure that all that you want there is possible. But it isn't
going to be easy.


Glad to hear!

The first thing you'll hit (at least, I hit it) is that Octave itself has no
Date or Time type. Only since classdef got implemented there may be a way
out, but for classdef objects there still is no reliable I/O to e.g., .mat
files.
This (no dedicated date or time type) is one of the reasons I left dates and
times aside for spreadsheet I/O. In fact, for file types and support
libraries that do offer date/time types I made the io package convert
date/time values into Octave datenums = doubles.

Tablicious provides a Matlab-compatible Octave @datetime class. If iocan pass me datenums (or strings) and some indication that "this cell isa date value", Tablicious can do the rest.


The mat file I/O I'll leave for upstream core Octave to sort out.

Then, when you write about cell formatting to uncover cell type I can't help
smelling "Excel". But .ods has a much richer cell type spectrum, maybe you
can largely skip formatting there.

You definitely smell Excel. That's all I'm familiar with; the .odsformat is new to me. If it exposes a true DateTime cell type indication,then probably no need for sniffing the formatting.

Furthermore, it looks like you imply something like dataframes with headers
and row id's but the spreadsheet file types I know of really only have
individual cell types inside. Spreadsheet I/O usually happens at a fairly
low level (i.e., individual cells). Formatting ("styles") might happen over
ranges but not necessarily along table paradigms.

Yep, table works like a dataframe, and contains homogeneously-typedcolumns. Converting from spreadsheet per-cell typing/formatting totable's column-oriented typing would be done with heuristics like "ifall cells in a column contain date values, store that column as a@datetime vector; if they're all numbers (detected using otherheuristics), use a primitive double vector; otherwise, use aheterogeneous cell vector".

But I know of Java support SW (e.g., jOpendocument) that might also offer
higher level I/O - the level of entire tables. ActiveX also does that
("Ranges"). I once tried such I/O with jOpenDocument but it was too
complicated for me at the time as there's also a lot of Java itself
involved, i.e., it seemed I needed to build my own .jars. (Oh, and now I
remember) javaArrays didn't work well at the time; even these days I have
the impression they're not so robust (but I may well be wrong).

When I've done this in the past for Matlab, I've used Apache POI. Itworked fine, except for performance: to be fast at reading large arraysof numerics, you need to write a small custom buffering layer in Java sothat you're not looping over per-cell read() calls in an M-code levelloop to do the type discovery heuristics and array buffering there. So Iended up needing to build my own JARs, and expect we would for Octavetoo if we wanted to Go Fast using a Java spreadsheet interface. A Rangein either ActiveX or POI (and I assume jOpendocument) doesn't work forMatlab/Octave well because the contents of each cell are themselvesindividual objects that need to be accessed with method or attributecalls, whereas to be Fast, you need to pass an entire *primitive*numeric array across the M-code/external language boundary.

But I'm happy to have it be slow for now, as long as the basics work.Then, doing accelerated spreadsheet I/O could be an enhancement for theio package some day, and if Tablicious were built on top of it, it wouldget the speedup automatically when it happened.

A bit of context:

.xls files (the old BIFF5 and BIFF8 ones) have no dedicated Date, Time or
DateTime cell type. In fact they basically only have Double, Text and
Boolean. The contents are further indicated by cell formatting; that way one
can find out -in theory!- which text cells are formulas and/or date/time. I
think this is what you referred to when mentioning "cell formatting"

Yep, that's what I meant. Dates are detected by seeing that a cell holdsa Double, and then looking at the cell format to see if it's beingpresented as a date.

BUT: AFAICS most of the Java libraries for spreadsheet I/O that I used (and
on Windows, ActiveX) shield this from you and offer at most a Formula cell
type. So DateTime etc. is harder to uncover; although in Visual Basic
(ActiveX) there is a dedicated type (see __COM__.cc in the OF windows
package).

The newer .xlsx files do have dedicated Date/Time cell types, see e.g.
__OCT_xlsx2oct__.m; the OCT interface explicitly processes them into
datenums :-)

I didn't know this! That'll make things easier. Or, rather, morecomplicated, since a robust readtable() will need to support both .xlsand .xlsx, so it'll need code for both kinds of cell type detection.

.ods has a richer type spectrum, it does have Date, Time and even Currency
cell types. But again, not all spreadsheet interfaces return all cell types.

All in all, asking to uncover spreadsheet cell types beyond Double, Boolean,
String and Formula is asking to open a big and fully stuffed can of worms
:-)   It is all file type and interface dependent.

I know it's a big can of worms and a mess, and will never be 100% Right;I've done this whole shebang before for Matlab. But dates/times are *so*important in data analysis, I want to support those out of the box inTablicious.

I don't care about Currency or other advanced types, I think. They don'tcrop up as much in data analysis, there's probably no correspondingOctave type to convert them to, and Matlab doesn't support them AFAIK,so there's no Matlab-compatibility need for them.

FYI, down in the io package NEWS file there's a table outlining which file
formats can be processed by what interfaces. That'll give a first indication
of complexity.

I am surely willing to help you out, given that I do not have not so much
time anymore for Octave these days. But yeah I think it'll be fun :-)
As the io package's spreadsheet I/O is so old (it started > 10 years ago)
and has been Just Working all that time I have to admit that my
"operational" knowledge of its innards got fairly rusty.

Thanks! I'll have a look through that table and start reading the iocode to get familiar with it.

Maybe open a task in the Task Tracker? that tracker is largely, but IMO
unduly, dormant.

Where can I find the Task Tracker? I don't see it linked from the ioOctave Forge package page.


Cheers,
Andrew

[Prev in Thread]

Current Thread

[Next in Thread]

io-2.4.13 released, Philip Nienhuis, 2019/10/17
- Re: io-2.4.13 released, Olaf Till, 2019/10/18
  - Re: io-2.4.13 released, PhilipNienhuis, 2019/10/18
    - Re: io-2.4.13 released, Kai Torben Ohlhus, 2019/10/18
    - Re: io-2.4.13 released, Philip Nienhuis, 2019/10/18
    - Re: io-2.4.13 released, Andrew Janke, 2019/10/18
    - Table I/O [WAS: io-2.4.13 released], PhilipNienhuis, 2019/10/18
    - Re: Table I/O [WAS: io-2.4.13 released], Andrew Janke <=
    - Re: Table I/O [WAS: io-2.4.13 released], Julien Bect, 2019/10/19
    - Re: Table I/O [WAS: io-2.4.13 released], Andrew Janke, 2019/10/19
    - Re: Table I/O [WAS: io-2.4.13 released], PhilipNienhuis, 2019/10/19
    - Re: Table I/O [WAS: io-2.4.13 released], Andrew Janke, 2019/10/19
    - Re: Table I/O [WAS: io-2.4.13 released], PhilipNienhuis, 2019/10/19
    - Re: Table I/O [WAS: io-2.4.13 released], Andrew Janke, 2019/10/19
    - Re: Table I/O [WAS: io-2.4.13 released], PhilipNienhuis, 2019/10/20
    - Re: io-2.4.13 released, Julien Bect, 2019/10/18
- Re: io-2.4.13 released, Andrew Janke, 2019/10/18

Prev by Date: Table I/O [WAS: io-2.4.13 released]
Next by Date: Moose etc dependencies for Forge package doco tools
Previous by thread: Table I/O [WAS: io-2.4.13 released]
Next by thread: Re: Table I/O [WAS: io-2.4.13 released]
Index(es):
- Date
- Thread