[Bug-wget] Re: Thoughts on regex support


From: Matthew Woehlke
Subject: [Bug-wget] Re: Thoughts on regex support
Date: Wed, 23 Sep 2009 11:52:34 -0500
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.8.1.23) Gecko/20090825 Fedora/2.0.0.23-1.fc10 Thunderbird/2.0.0.23 Mnenhy/0.7.5.0

Micah Cowan wrote:
[stuff about regex matching]

How will you handle nested boolean expressions? Same as 'find'?

IOW, how do you do this?
[url matches foo] AND ( [domain matches bar] OR [query matches baz] )

(Obviously I am intentionally choosing an example where the 'or' part can't be easily expressed in the regex.)
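
For reference, the expression above can be modelled as a small tree of predicates. This is only a sketch in Python, not proposed wget syntax; the helper names (matches, all_of, any_of) and the use of urllib to split the URL are assumptions made for illustration.

```python
import re
from urllib.parse import urlsplit

def matches(component, pattern):
    """Return a predicate testing one URL component against a regex (unanchored)."""
    def test(url):
        parts = urlsplit(url)
        value = {
            "url": url,
            "domain": parts.hostname or "",
            "query": parts.query,
        }[component]
        return re.search(pattern, value) is not None
    return test

def all_of(*preds):
    """AND: every sub-expression must hold."""
    return lambda url: all(p(url) for p in preds)

def any_of(*preds):
    """OR: at least one sub-expression must hold."""
    return lambda url: any(p(url) for p in preds)

# [url matches foo] AND ( [domain matches bar] OR [query matches baz] )
rule = all_of(
    matches("url", "foo"),
    any_of(matches("domain", "bar"), matches("query", "baz")),
)
```

Whatever the command-line spelling ends up being ('find'-style or otherwise), it would need to be able to express this kind of nesting.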

  --no-match ':field:action=(edit|print)'

Something like 'param[eter]' or 'arg[ument]' seems more sensible to me (though as a programmer I am not the best person to ask about usability). That such URLs come from a form isn't always obvious... and in some cases it is even untrue.

  . Don't follow links for producing printer-ready output, or editing
    pages. Equivalent to --no-match ':query:(.*&)?action=print(&.*)?',
    but somewhat easier to write.

Just in case you're planning on a conversion to that regex in the code, remember that it is really:
  '^.*[?]([^&]*&)*action=print(&.*)?$'

This simplification is probably safe:
  '[?&]action=print&*.*$'

(I don't believe '&' has special meaning in a regex... it does on the RHS of a substitution, but we aren't discussing those.)
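
A quick way to compare the two patterns is to run them over a few sample URLs. The sketch below uses Python's re module, whose syntax may differ in details from whatever engine wget would use, and the example.org URLs are made up. One thing it shows: the short form also accepts a value that merely starts with 'print', which the long form rejects.

```python
import re

# The two patterns quoted above, unchanged.
FULL = re.compile(r'^.*[?]([^&]*&)*action=print(&.*)?$')
SHORT = re.compile(r'[?&]action=print&*.*$')

def check(url):
    """Return (full_matches, short_matches) for one URL."""
    return (FULL.search(url) is not None, SHORT.search(url) is not None)

samples = [
    "http://example.org/w?action=print",
    "http://example.org/w?title=a&action=print&x=1",
    "http://example.org/w?action=printable",
    "http://example.org/w?xaction=print",
]

for url in samples:
    print(url, check(url))
```

If an exact match on the value matters, ending the short form with '(&|$)' instead of '&*.*$' would be one way to get it.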

For that matter, if you support '\b', I wonder if you need "components" at all...

  Components may be combined; to match against the combination of path and
  query string, you just specify :path+query:. That could be abbreviated
  as :p+q:. Combinations are only allowed if all the components involved
  are consecutive; :domain+query: (no path) would be illegal.

I can probably figure out technical reasons for that, but it doesn't make much sense from a user perspective. Why shouldn't I be able to write:
  -z ':d,f:foo'
...and have it match both
   'http://foobar.com/'
 and
   'http://baz.org/index?title=foobar'
?

My expectation would be that it tries the match against the domain, then, if that fails, tries it against the fields/params/args/whatever.

You could support both syntaxes easily enough:
-z ':d..f:expr' # match 'expr' in the concatenation of domain through f/p/a.
-z ':p,q:expr'  # match 'expr' in path or query

(And of course you can combine the above, e.g. 'p,file..args'. Another reason to use 'args': you can use 'file' and still abbreviate to one letter.)
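
To make the two separators concrete, here is a rough Python sketch of how a selector could be expanded: ',' means "try each in turn", '..' means "join the consecutive components and try the result". The one-letter keys, the way separators are re-inserted, and the function names are all assumptions of mine, not anything wget does.

```python
import re

ORDER = ["r", "d", "p", "f", "q"]   # protocol, domain, path, file, query
SEP = {"r": "", "d": "://", "p": "/", "f": "/", "q": "?"}

def candidates(spec, parts):
    """Expand a selector such as 'd,q' or 'd..q' into the strings to test."""
    out = []
    for item in spec.split(","):
        if ".." in item:
            first, last = item.split("..")
            keys = ORDER[ORDER.index(first):ORDER.index(last) + 1]
            text = parts[keys[0]]
            for k in keys[1:]:
                if parts[k]:                    # skip empty components
                    text += SEP[k] + parts[k]
            out.append(text)
        else:
            out.append(parts[item])
    return out

def z_match(spec, pattern, parts):
    """True if the (unanchored) pattern matches any selected string."""
    return any(re.search(pattern, c) for c in candidates(spec, parts))

a = {"r": "http", "d": "foobar.com", "p": "", "f": "", "q": ""}
b = {"r": "http", "d": "baz.org", "p": "", "f": "index", "q": "title=foobar"}
```

With these, z_match("d,q", "foo", a) and z_match("d,q", "foo", b) both succeed, which is the behaviour I would expect from the ',' form.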

BTW, what exactly are the components? Is this right?

[u]rl: http://foobar.com/site/images/thumb.php?name=baz.jpg&x=64&y=64
p[r]otocol: "http"
[d]omain: "foobar.com"
[p]ath: "site/images"
[f]ile: "thumb.php"
[q]uery: "name=baz.jpg&x=64&y=64"
[a]rgs: "name=baz.jpg", "x=64", "y=64"
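
The breakdown I have in mind can be reproduced with Python's standard urllib, as a sanity check of the list above. The dictionary keys and the choice to strip the slashes around the path are my assumptions about what the components would contain.

```python
import posixpath
from urllib.parse import urlsplit

def components(url):
    """Split a URL into the components guessed at above."""
    s = urlsplit(url)
    directory, filename = posixpath.split(s.path)
    return {
        "url": url,
        "protocol": s.scheme,
        "domain": s.hostname or "",
        "path": directory.strip("/"),
        "file": filename,
        "query": s.query,
        "args": s.query.split("&") if s.query else [],
    }

c = components("http://foobar.com/site/images/thumb.php?name=baz.jpg&x=64&y=64")
for key, value in c.items():
    print(key, repr(value))
```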

(We could also have host/tld, but that seems like overkill when you can match against '^www(\.|$)' and '\.com$', respectively. Or - did I mention you should support '\b'? ;-) - '^www\b'.)

  - Avoid adding both a --match and a --no-match option, by making
    negation a flag instead (/n or something: --match 'p/ni:.*\.js'
    would reject any paths ending in any case variant of ".js").

Similar ideas:
 -z '(?!expr)'
 -z ':opts!expr' # instead of ':opts:expr'
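
One caveat on the lookahead form, sketched with Python's re (the engine wget uses may behave differently, and the file names are made up): a bare '(?!expr)' in an unanchored search succeeds at almost any position, so in practice it needs to be anchored and to cover the rest of the string.

```python
import re

def rejects_js(path):
    """True if the anchored negative lookahead excludes this path."""
    return re.search(r'^(?!.*\.js$)', path) is None

# A bare lookahead is satisfied at position 0, so it does not exclude anything.
bare = re.search(r'(?!\.js)', 'site.js') is not None

print(rejects_js('site.js'), rejects_js('site.css'), bare)
```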

Personally I think there should be a way to do inverse matches with the short option. At the same time, I don't feel strongly either way about having a long option --no-match (i.e. to have both).

  - Other anchoring options. I suspect that the many common use cases
    will begin with '.*'. We could remove the implicit anchoring, but
    then we'd probably usually want it at the end, forcing us to write
    the final '$'. That's one character versus two, but my gut tells me
    it's easier to forget anchors than it is to forget "match-any"
    patterns, which is why I lean toward implicit anchors.

IMHO: implicit anchoring violates traditional regex usage. There is probably an example of implicit anchoring somewhere, but offhand I can't think of one. (And at any rate, sed/grep sure don't use implicit anchoring.)
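
The difference between the two conventions is easy to show in Python, where re.fullmatch behaves like implicit anchoring and re.search like the sed/grep convention. The function names and sample paths below are just for illustration.

```python
import re

def implicit(pattern, text):
    """Implicitly anchored: the pattern must cover the whole string."""
    return re.fullmatch(pattern, text) is not None

def explicit(pattern, text):
    """grep/sed style: match anywhere unless the pattern has ^ or $."""
    return re.search(pattern, text) is not None

path = "scripts/site.js"
print(implicit(r"\.js", path))                  # needs a leading '.*' to match
print(implicit(r".*\.js", path))
print(explicit(r"\.js$", path))
print(explicit(r"\.js", "scripts/site.json"))   # a forgotten '$' over-matches
```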

That's inconvenient for args, but for everything else I still lean toward no implicit anchoring.

Of course, if you support '\b' (and require explicit anchoring), then it is somewhat hard to justify args (as you can just use '\bexpr\b' against query, instead of '^expr$' against args).
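
A small Python check of that claim, using the query string from my example URL (helper names are made up). The two approaches agree in the common case, though not in every case: '.' is not a word character, so '\b' also sees a boundary in the middle of 'baz.jpg'.

```python
import re

query = "name=baz.jpg&x=64&y=64"
args = query.split("&")

def match_arg(expr):
    """'^expr$' tested against each individual arg."""
    return any(re.search("^" + expr + "$", a) for a in args)

def match_query(expr):
    """'\\bexpr\\b' tested against the whole query string."""
    return re.search(r"\b" + expr + r"\b", query) is not None

print(match_arg("x=64"), match_query("x=64"))
print(match_arg("x=6"), match_query("x=6"))
print(match_arg("name=baz"), match_query("name=baz"))
```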

--
Matthew
Please do not quote my e-mail address unobfuscated in message bodies.
--
I picked up a Magic 8-Ball the other day and it said 'Outlook not so good.' I said 'Sure, but Microsoft still ships it.'
  -- Anonymous (from cluefire.net)




