gnunet-developers

Re: libextractor - key-value pairs and mime types


From: Christian Grothoff
Subject: Re: libextractor - key-value pairs and mime types
Date: Tue, 8 Feb 2022 18:00:39 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.12.0

On 2/8/22 2:38 PM, madmurphy wrote:
> Got it! I agree about your solution for the duplicate mime types.
> 
>     but until that is done, a key-value pair type would at least be
>     better than 'unknown'.
> 
> “Unknown” can continue to exist as an identifier for other cases, just
> not the key-value ones :)

Yes, of course. That's what I meant, too.
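
As a rough illustration of that split, here is a minimal sketch, not libextractor code. The enum is a local stand-in for the one in extractor.h: only the value 45 for UNKNOWN appears in this thread, the KEY_VALUE_PAIR number is a made-up placeholder, and kv_classify() is a hypothetical helper. It assumes a value counts as a pair when it contains '=' after a non-empty key.

```c
#include <stddef.h>
#include <string.h>

/* Local stand-in for enum EXTRACTOR_MetaType (assumption, not the real
   header). 45 is the value quoted in this thread for UNKNOWN; the
   KEY_VALUE_PAIR number is a placeholder, no value has been assigned. */
enum kv_metatype
{
  KV_METATYPE_UNKNOWN = 45,
  KV_METATYPE_KEY_VALUE_PAIR = 1000 /* hypothetical */
};

/* Hypothetical helper: report KEY_VALUE_PAIR when the string looks like
   "key=value" with a non-empty key, and UNKNOWN in every other case. */
enum kv_metatype
kv_classify (const char *value)
{
  const char *eq;

  if (NULL == value)
    return KV_METATYPE_UNKNOWN;
  eq = strchr (value, '=');
  if ( (NULL == eq) || (eq == value) )
    return KV_METATYPE_UNKNOWN; /* no '=' at all, or empty key */
  return KV_METATYPE_KEY_VALUE_PAIR;
}
```

With this, "chroma-format=4:2:0" would be reported as a key-value pair, while a bare string such as "avc" would still fall back to unknown.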

> Also I forgot to mention a third point:
> 
> 3. Add an |EXTRACTOR_METATYPE_NO_METATYPE = -1| to |enum
> EXTRACTOR_MetaType| (more or less like |NULL| if that was a pointer).
> Without a |EXTRACTOR_METATYPE_NO_METATYPE| a programmer is forced to
> save the |have_metatype| information in another variable. The fact that
> it is a negative number is not a problem, because as the name suggests,
> /it is not a metatype/.

Are you sure that EXTRACTOR_METATYPE_RESERVED (0) is not good enough
here? We usually use it to terminate lists, but as far as I see, it
should be what you think of as NO_METATYPE. But if there is a use-case
RESERVED doesn't cover, I'm not categorically against introducing a -1
value.
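
To illustrate what I have in mind, here is a minimal sketch under stated assumptions. The enum below is a local stand-in that only copies the three values mentioned in this thread (RESERVED = 0, MIMETYPE = 1, FILENAME = 2); struct sel_filter and sel_filter_matches() are hypothetical names, not part of libextractor. It shows 0 doubling as "no metatype selected", so no separate have_metatype flag is needed.

```c
#include <stdbool.h>

/* Local stand-in for enum EXTRACTOR_MetaType (assumption, not the real
   header); the three values are the ones quoted in this thread. */
enum sel_metatype
{
  SEL_METATYPE_RESERVED = 0,
  SEL_METATYPE_MIMETYPE = 1,
  SEL_METATYPE_FILENAME = 2
};

/* Hypothetical filter: RESERVED in 'wanted' means "no metatype chosen". */
struct sel_filter
{
  enum sel_metatype wanted;
};

/* Accept every item when no metatype was chosen, otherwise accept only
   items whose type equals the chosen one. */
bool
sel_filter_matches (const struct sel_filter *f,
                    enum sel_metatype type)
{
  if (SEL_METATYPE_RESERVED == f->wanted)
    return true;
  return f->wanted == type;
}
```

This only works as long as RESERVED itself is never a value a caller needs to select; if that is the case you have in mind, it would be the use-case for a separate -1.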

Patches welcome ;-).

-Christian

> P.S. Sorry for picking the wrong mailing list!
> 
> 
> On Tue, Feb 8, 2022 at 9:57 AM Christian Grothoff
> <grothoff@gnunet.org> wrote:
> 
>     Hi madmurphy,
> 
>     The 'correct' place for GNU libextractor discussions would be
> 
>       https://lists.gnu.org/mailman/listinfo/libextractor
> 
>     Alas, with my libextractor maintainer hat on, I would say this:
> 
>     On 2/7/22 10:01 PM, madmurphy wrote:
>     > Hi again, GNUnet people.
>     >
>     > Is this the right place to discuss libextractor? I have two
>     > points.
>     >
>     > #1 I often see something interesting. Key-value pairs are
>     > categorized as |EXTRACTOR_METATYPE_UNKNOWN|:
>     >
>     > unknown: chroma-format=4:2:0
>     > unknown: bit-depth-chroma=8
>     > unknown: colorimetry=bt709
>     > unknown: stream-format=avc
>     > unknown: stream-format=raw
>     > unknown: bit-depth-luma=8
>     > unknown: base-profile=lc
>     > unknown: mpegversion=4
>     > unknown: profile=high
>     > unknown: alignment=au
>     > unknown: parsed=true
>     > unknown: framed=true
>     > unknown: variant=iso
>     > unknown: profile=lc
>     > unknown: level=4.1
>     >
>     > But one point is that they are often numerous, and another point
>     > is that a key-value type is a really interesting metatype to have
>     > (and is not really “unknown”, since the key is self-explanatory).
>     > Would it not make sense to add an
>     > |EXTRACTOR_METATYPE_KEY_VALUE_PAIR| to the list of MetaTypes?
> 
>     We could do that. Sometimes I think it would be better to add new
>     specific LE types for some of the above, but until that is done, a
>     key-value pair type would at least be better than 'unknown'.
> 
>     > ...
>     >
>     >   /* generic attributes */
>     >   EXTRACTOR_METATYPE_UNKNOWN = 45,
>     >   EXTRACTOR_METATYPE_DESCRIPTION = 46,
>     >   EXTRACTOR_METATYPE_COPYRIGHT = 47,
>     >   EXTRACTOR_METATYPE_RIGHTS = 48,
>     >   EXTRACTOR_METATYPE_KEYWORDS = 49,
>     >   EXTRACTOR_METATYPE_ABSTRACT = 50,
>     >   EXTRACTOR_METATYPE_SUMMARY = 51,
>     >   EXTRACTOR_METATYPE_SUBJECT = 52,
>     >   EXTRACTOR_METATYPE_CREATOR = 53,
>     >   EXTRACTOR_METATYPE_FORMAT = 54,
>     >   EXTRACTOR_METATYPE_FORMAT_VERSION = 55,
>     >   *EXTRACTOR_METATYPE_KEY_VALUE_PAIR* = XXX,
>     >
>     > ...
>     >
>     > #2 I often see that files get tagged with multiple mime types
>     > according to libextractor:
>     >
>     > mimetype: video/quicktime
>     > mimetype: video/x-h264
>     > mimetype: audio/mpeg
>     > mimetype: video/mp4
> 
>     That is because different plugins (using different methods/libraries)
>     disagree on the 'correct' mime-type. Ideally, we'd identify which plugin
>     gets it wrong (and why), and unify the mime-types.
> 
>     > But that never reflects the reality, since files should have only one
>     > mime type (or at most, multiple mime types that mean the same thing).
>     > But then I see what happens with file names: there is only one
>     > |EXTRACTOR_METATYPE_GNUNET_ORIGINAL_FILENAME|, but there can be many
>     > |EXTRACTOR_METATYPE_FILENAME|s (in the case of archives, for example):
>     >
>     > EXTRACTOR_METATYPE_FILENAME = 2,
>     > ...
>     > EXTRACTOR_METATYPE_GNUNET_ORIGINAL_FILENAME = 180,
>     >
>     > Would it not make sense to do something similar for mime types?
>     > Only one “original mime type”, and an infinity of secondary mime
>     > types…?
>     >
>     > EXTRACTOR_METATYPE_MIMETYPE = 1,
>     > ...
>     > *EXTRACTOR_METATYPE_GNUNET_ORIGINAL_MIMETYPE* = XXX,
> 
>     I guess it depends. If this is for archives where files _inside_ the
>     archive are given mime-types, then a different metatype makes sense
>     (ditto for FILENAME: here we probably could have two types, one for the
>     'archive' and one for the 'contents'). But if the different mime-types
>     are all about the 'original' file, then we should rather figure out
>     which plugin gets it wrong. As for the "_GNUNET_" in the
>     "_GNUNET_ORIGINAL_FILENAME" there, IIRC this is again different because
>     that is NOT a metatype used by GNU libextractor, but one that GNUnet
>     itself generates and puts with the 'rest' of the metadata.
> 
>     > So, two simple proposals:
>     >
>     >  1. Create |EXTRACTOR_METATYPE_KEY_VALUE_PAIR|
>     >  2. Create |EXTRACTOR_METATYPE_GNUNET_ORIGINAL_MIMETYPE|
>     >
>     > What do you think? Does it make sense?
> 
>     It should definitely not be "GNUNET_ORIGINAL_MIMETYPE", and the real
>     question is what is the origin of the different mime-types. If this is
>     from an archive, maybe we should introduce
> 
>     EXTRACTOR_METATYPE_ARCHIVE_CONTENT_FILENAME
>     EXTRACTOR_METATYPE_ARCHIVE_CONTENT_MIMETYPE
> 
>     and reserve
> 
>     EXTRACTOR_METATYPE_FILENAME
>     EXTRACTOR_METATYPE_MIMETYPE
> 
>     for the top-level file. But AFAIK that won't solve your mime-type issue,
>     which should really be resolved by going over the plugins and finding
>     out why and where they disagree and picking the 'right' answer.
> 
>     My 2 cents
> 
>     Christian
> 


