Re: [igraph] GraphML
From: Tamás Nepusz
Subject: Re: [igraph] GraphML
Date: Fri, 13 Dec 2013 13:33:00 +0100
Hi,
> It's in GraphML format and over 900MB in size. I let it run overnight
> and it's still not done. The file contains email content that I don't
> need - I'm really just after who sent an email to whom. Is there any
> way to just read this in, and ignore the rest, that might be faster?
I would preprocess the GraphML file first; in particular, remove every
<data key="body">...</data> subtree from it. Since GraphML is just plain XML,
your best bet is probably a command-line XML manipulation tool. I was told
that XMLStarlet (http://xmlstar.sourceforge.net/download.php) is quite good at
such manipulations; I haven't used it personally, but a quick glance at its
documentation suggests that you can probably achieve your goal with:
    xmlstarlet ed -N ns=http://graphml.graphdrawing.org/xmlns -d "//ns:data[@key='body']" input.graphml
(The above command line may not be entirely correct, but the idea is to
select all the "data" elements whose "key" attribute is equal to "body" and
delete them. The -N option declares the XML namespace in which the data
element is to be found.)
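If XMLStarlet is not at hand, the same select-and-delete idea can be sketched with Python's standard library. This is only an illustration under assumptions: the function name strip_data_key is made up here, the email bodies are assumed to live in <data key="body"> elements in the GraphML namespace, and the pruned tree (everything except the bodies) is assumed to fit in memory.

```python
import xml.etree.ElementTree as ET

GRAPHML_NS = "http://graphml.graphdrawing.org/xmlns"
DATA_TAG = f"{{{GRAPHML_NS}}}data"


def strip_data_key(source, dest, key="body"):
    """Copy GraphML from source to dest, dropping <data key=...> elements.

    Hypothetical helper: source and dest may be paths or binary file objects.
    """
    # Serialize with GraphML as the default namespace instead of an ns0: prefix.
    ET.register_namespace("", GRAPHML_NS)
    root = None
    # An "end" event fires once an element is complete, so its matching
    # children can be pruned right away instead of piling up in memory.
    for _event, elem in ET.iterparse(source, events=("end",)):
        for child in list(elem):
            if child.tag == DATA_TAG and child.get("key") == key:
                elem.remove(child)
        root = elem  # the last element to end is the document root
    ET.ElementTree(root).write(dest, encoding="utf-8", xml_declaration=True)
```

Calling strip_data_key("input.graphml", "stripped.graphml") would then write a copy without the bodies; both file names are placeholders.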
Note that the file downloaded from infochimps seems to start with some
metadata; I had to skip the first 1024 bytes to get to the first XML tag.
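Skipping such a prefix can be sketched in Python as follows. The helper names find_xml_start and copy_from_offset are made up for illustration, and the sketch assumes that the metadata itself contains no "<" character, so the first "<" marks the start of the XML; if that does not hold, pass the known offset (1024 in the case above) directly.

```python
import shutil


def find_xml_start(path, probe_size=4096):
    """Return the offset of the first '<' within the first probe_size bytes.

    Returns -1 if no '<' is found. Assumes the metadata has no '<' of its own.
    """
    with open(path, "rb") as f:
        return f.read(probe_size).find(b"<")


def copy_from_offset(src_path, dst_path, offset):
    """Copy src_path to dst_path, leaving out the first offset bytes."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        src.seek(offset)
        # copyfileobj streams in chunks, so a 900MB file is never fully in memory.
        shutil.copyfileobj(src, dst)
```

The resulting file should then start at the first XML tag and can be passed on to the XML preprocessing step.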
T.