Re: documentation for (web ...)

guile-devel

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: documentation for (web ...)

From:	Andy Wingo
Subject:	Re: documentation for (web ...)
Date:	Sun, 26 Dec 2010 12:45:47 -0500
User-agent:	Gnus/5.13 (Gnus v5.13) Emacs/23.2 (gnu/linux)

Heya Neil,

Happy holidays, and thanks for the comments!

On Thu 23 Dec 2010 18:51, Neil Jerram <address@hidden> writes:

> address@hidden (Ludovic Courtès) writes:
>
>> I’m not keen on the comparison to POSIX.
>
> I agree with Ludo that the comparison with POSIX isn't quite right.

Sure.  It's truthy but not true, I think ;-)  I'll revise it.

> [After reading all through: I'd say you've demonstrated that data types
> are good, but haven't shown any link with security problems, so the hook
> here remains dangling.]

Will add some context.  Here are some examples though.

  * "Cross-site scripting" (XSS) is where a user submits something to a
    web site, which is incorporated into that website (like a
    comment). For example:

        (define (bad submission)
          (string-append "<b>You entered: " submission "</b>"))

        (define (good submission)
          `(b "You entered: " ,submission))

    In the first case, the application works with text. In the second,
    it works with SXML, and something in the continuation does a
    sxml->xml on the composed result.

    Both `bad' and `good' are the same for a submission of "Hello".

    But for a submission of "<i>foo</i>", `bad' yields "<b>You entered:
    <i>foo</i></b>", which effectively treats the submission with the
    same status as the template -- you can paste in anything.

    On the other hand, `good' would produce "<b>You entered:
    &lt;i&gt;foo&lt;/i&gt;</b>", when serialized.  The submission is in
    a textual context, so it is treated as text and not as HTML.

    This seems somewhat academic, but XSS vulnerabilities occur exactly
    for this reason: treating both template and input as text, instead
    of using data types to prove certain characteristics about your web
    application (in this case, that user-submitted text will never
    become javascript, executing within your domain's privileges).

    http://en.wikipedia.org/wiki/Cross-site_scripting

  * "Cross-site request forgery" (CSRF) often involves a dynamic
    payload, generated by an XSS attack. For example I might comment on
    your web log, exploiting an XSS vulnerability, adding some
    javascript on your web site which will then be run by all
    viewers. That javascript could then perform a CSRF attack.

    Anyway, XSS is often CSRF, and the above XSS arguments apply.

    http://en.wikipedia.org/wiki/Cross-site_request_forgery

  * URL encoding attacks: decoding and encoding URLs is tricky. Using a
    separate data type for URLs with a limited number of operations on
    it can help you to make sure you are doing the right
    thing. Furthermore, using proper data types to parse path and query
    components helps avoid a number of the ad-hoc string-parsing errors
    that one might have.

    http://www.technicalinfo.net/papers/URLEmbeddedAttacks.html was the
    first hit I found on the various issues.

    I wrote a short article on URLs here:
    http://wingolog.org/archives/2010/12/23/doing-it-wrong

  * Viewing headers values as strings rather than instances of
    particular data types -> "HTTP response splitting"

    http://en.wikipedia.org/wiki/HTTP_response_splitting

    Note that we are not yet entirely "protected" from this
    issue. http://lwn.net/Articles/419350/ for a recent vulnerability.

Hey, thinking on this answer a bit: perhaps the key issue is
composability.  Strings don't compose nicely, because they are missing
information: does they need escaping or not, and if they do, what kind
of escaping?  Many vulnerabilities result from confusing this issue.
Using other data types (SXML, URI records, the request/response records)
allows the data to speak for themselves.  You can safely encode your
assumptions in types.

>>    Also, strictly speaking, a URI with a fragment is a "URI reference".
>> A fragment is typically not serialized when sending a URI over the
>> wire; that is, it is not part of the identifier of a resource.  It only
>> identifies a part of a given resource.
>
> I found that a bit tricky to understand.  I think an example of what you
> mean is that a web browser would only request the URI up to and
> excluding the #, and process the #... part itself (by scrolling to that
> point in the page).  It might help to say that.

Yes, will do.

>>  -- Function: build-uri scheme [#:userinfo] [#:host] [#:port] [#:path]
>>           [#:query] [#:fragment] [#:validate?]
>
> Why is the path arg not mandatory?

Because it has a default value: "". The path arg in "my-foo-scheme:" is
"".

>>  -- Function: declare-default-port! scheme port
>>      Declare a default port for the given URI scheme.
>> 
>>      Default ports are for printing URI objects: a default port is not
>>      printed.
>
> Does this really belong here?  Seems like mixing a bit of the `model'
> into the `view'.  I'd expect a URI without an explicit port to give
>
>   (uri-port uri) => #f

This is the case.

> and that if I do
>
>   (set! (uri-port uri) 80)

Note that we actually don't export port accessors -- you have to make a
new port.

We could export these accessors though.

> the :80 would be there in the string representation of the URI.

Quoth the RFC (3986):

    6.2.3.  Scheme-Based Normalization

       The syntax and semantics of URIs vary from scheme to scheme, as
       described by the defining specification for each scheme.
       Implementations may use scheme-specific rules, at further processing
       cost, to reduce the probability of false negatives.  For example,
       because the "http" scheme makes use of an authority component, has a
       default port of "80", and defines an empty path to be equivalent to
       "/", the following four URIs are equivalent:

          http://example.com
          http://example.com/
          http://example.com:/
          http://example.com:80/

       In general, a URI that uses the generic syntax for authority with an
       empty path should be normalized to a path of "/".  Likewise, an
       explicit ":port", for which the port is empty or the default for the
       scheme, is equivalent to one where the port and its ":" delimiter are
       elided and thus should be removed by scheme-based normalization.  For
       example, the second URI above is the normal form for the "http"
       scheme.

So this default port stuff is a poor-man's scheme-specific
normalization, to not display a port component in a serialization, if
the port is the default for the scheme.

>>  -- Function: parse-uri string
>>      Parse STRING into a URI object. Returns `#f' if the string could
>>      not be parsed.
>> 
>>  -- Function: unparse-uri uri
>>      Serialize URI to a string.
>
> Or uri->string ?  And I guess parse-uri could be string->uri.
> Cf. string->number and number->string.

Sure. I think I will add a keyword arg to switch between throwing errors
and returning #f, also.

>>  -- Function: uri-decode str [#:charset]
>>      Percent-decode the given STR, according to CHARSET.
>
> So the return value is a bytevector if CHARSET is #f, and a string if
> not?

Yes.

>>  -- Function: uri-encode str [#:charset] [#:unescaped-chars]
>>      Percent-encode any character not in UNESCAPED-CHARS.
>
> UNESCAPED-CHARS is a vector, a list, ...?

A character set, actually... Will indicate.

>>      Percent-encoding first writes out the given character to a
>
> s/character/string

Actually this is correct, it's just the docstring which is odd:
"percent-encode any _character_ not in unescaped-chars". Will reword
though.

>>  -- Function: split-and-decode-uri-path path
>>      Split PATH into its components, and decode each component,
>>      removing empty components.
>> 
>>      For example, `"/foo/bar/"' decodes to the two-element list,
>>      `("foo" "bar")'.
>
> Presumably this does % decoding too, so it would be good to give another
> example to show that.

Good idea.

>>     `parser'
>>           A procedure which takes a string and returns a parsed value.
>> 
>>     `validator'
>>           A predicate, returning `#t' iff the value is valid for this
>>           header.
>
> Maybe say something here about validator function often being very
> similar to parsing function?

They are not quite the same. A parser takes a string and produces a
Scheme value, and a validator takes a Scheme value and returns #t iff
the value is valid for that header. I will add an example.

>>  -- Function: declare-header! sym name [#:multiple?] [#:parser]
>>           [#:validator] [#:writer]
>>      Make a header declaration, as above, and register it by symbol and
>>      by name.
>
> Are the keyword args really optional?  If so, what are the defaults?

Only `multiple?' is optional. Will indicate the defaults for these and
other kwargs .They are keyword args just to allow for extensibility, and
for declare-header! invocations to read better.

> A possibly important point: what is the scope of the space in which
> these header declarations are made?  My reason for asking is that this
> infrastructure looks applicable for other HTTP-like protocols too, such
> as SIP.  But the detailed rules for a given header in SIP may be
> different from a header with the same name in HTTP, and hence different
> header-decl objects would be needed.  Therefore, even though we claim no
> other protocol support right now, perhaps we should anticipate that by
> enhancing declare-header! so as to distinguish between HTTP-space and
> other-protocol-spaces.

It's a good question. HTTP is deliberately MIME-like, but specifies a
number of important differences (see appendix 19.4 of RFC 2616).

For now, the scope is limited to HTTP headers.

> [After reading all through, I remain confused about exactly how general
> this server infrastructure is intended to be]

The ultimate intention is to allow the "web handler" stuff I mentioned
at the end of the section, and to allow the web app author to not care
very much what server is being used. To do this, we have to allow all
kinds of "server" implementations -- CGI, direct HTTP to a socket,
zeromq messages, etc. Regardless of how the request comes and the
response goes, we need to be able to recognize and parse HTTP headers
into their various appropriate data types -- and (web http) is really
the middle, here. The (web server) stuff is a higher-level abstraction
-- not a necessary abstraction, but helpful, if you can use it.

>>  -- Function: lookup-header-decl name
>>      Return the HEADER-DECL object registered for the given NAME.
>> 
>>      NAME may be a symbol or a string. Strings are mapped to headers in
>>      a case-insensitive fashion.
>> 
>>  -- Function: valid-header? sym val
>>      Returns a true value iff VAL is a valid Scheme value for the
>>      header with name SYM.
>
> Note slight inconsistency in the two above deffns: "Return" vs
> "Returns".

Which is the right one? "Return"? Will change.

>>    Now that we have a generic interface for reading and writing
>> headers, we do just that.
>> 
>>  -- Function: read-header port
>>      Reads one HTTP header from PORT. Returns two values: the header
>>      name and the parsed Scheme value.
>
> As multiple values?  Is that more helpful than as a cons?

Yes, as multiple values. The advantage is that returning multiple values
from a Scheme procedure does not cause any allocation.

>>      Returns #F for both values if the end of the message body was
>>      reached (i.e., a blank line).
>
> I'd find #<eof> more intuitive.

OK, will change.

>>  -- Function: parse-header name val
>>      Parse VAL, a string, with the parser for the header named NAME.
>> 
>>      Returns two values, the header name and parsed value. If a parser
>>      was found, the header name will be returned as a symbol. If a
>>      parser was not found, both the header name and the value are
>>      returned as strings.
>
> Again, multiple values or a cons?

Multiple values.

>>  -- Function: read-headers port
>>      Read an HTTP message from PORT, returning the headers as an
>>      ordered alist.
>
> s/Read/Read the headers of/  ?  i.e. Should the caller have already read
> the request/response line?

Indeed, the headers of. Will change.

>> The `(web http)' module defines parsers and unparsers for all headers
>> defined in the HTTP/1.1 standard.  This section describes the parsed
>> format of the various headers.
>> 
>>    We cannot describe the function of all of these headers, however, in
>> sufficient detail.
>
> I don't get the point here.

Do you mean that the reason is not apparent at this point in the
document? I don't think the intro is worded very well, and indeed it
appears to be a bit of buildup without knowing where you go... Maybe an
example in the beginning would be apropos?

Or should we give brief descriptions of the meanings of all of these
headers as well? That might be a good idea too.

>> `transfer-encoding'
>>      A param list of transfer codings.  `chunked' is the only known key.
>
> OK, why a param list rather than key-value?  How are elements in the
> second key-value list, say, different from elements in the first
> key-value list?

Well, some of these headers are quite unfortunate in their
construction.  In this case:

     Transfer-Encoding       = "Transfer-Encoding" ":" 1#transfer-coding

So really, this is a list. But:

       transfer-coding         = "chunked" | transfer-extension
       transfer-extension      = token *( ";" parameter )
       parameter               = attribute "=" value
       attribute               = token
       value                   = token | quoted-string

Given that a transfer-extension is really a toeken with a number of
parameters, the thing gets complicated. You could have:

  Transfer-Encoding: chunked,abcd,newthing;foo="bar, baz; qux";xyzzy

which is hard to parse if you do it ad-hoc.  (web http) parses it as:

   (transfer-encoding . ((chunked) ("abcd") ("newthing" ("foo . "bar, baz; 
quz") "xyzzy")))

Still complicated, but more uniform at least. Saying that `chunked' is
the only known key means that it's the only one that's translated to a
symbol; i.e. `abcd' is parsed to a string.  (This is to prevent attacks
to intern a bunch of symbols; though symbols can be gc'd in guile.)

Does that help? I'll see about replacing usages of "param list" as "list
of key-value lists", as it's probably clearer, and we can save ourselves
a definition.

>> `www-authenticate'
>>      A string.
>
> Obviously there's lots of substructure there (in WWW-Authenticate) that
> we just don't support yet.  Is there a clear compatibility story for
> if/when Guile is enhanced to parse that out?
>
> I guess yes; calling code will just need something like
>
>   (if (string? val)
>       ;; An older Guile that doesn't parse authentication fully.
>       (do-application-own-parsing)
>       ;; A newer Guile that does parse authentication.
>       (use-the-parsed-authentication-object))

That's a very good question. The problem is that if we change the parsed
representation, then old code breaks. That's why I put in the effort to
give (hopefully) good representations for most headers, to avoid that
situation -- though you appear to have caught one laziness on my part
here, and in Authorizaton, Proxy-Authenticate, and Proxy-Authorization.

So maybe the right thing to do here is just to bite the bullet, parse as
the RFC says we should, and avoid this situation.

>>  -- Function: read-request port [meta]
>>      Read an HTTP request from PORT, optionally attaching the given
>>      metadata, META.
>> 
>>      As a side effect, sets the encoding on PORT to ISO-8859-1
>>      (latin-1), so that reading one character reads one byte. See the
>>      discussion of character sets in "HTTP Requests" in the manual, for
>>      more information.
>
> That last sentence is OK for a docstring, but strange here _in_ the
> manual.

Good point.

> And, where is that discussion?

Heh, good point :)

> Hmm, I think the provision of this data type needs a bit more
> motivation.  It doesn't appear to offer much additional value, compared
> with reading or writing the components of a request individually, and on
> the other hand it appears to bake in assumptions about charsets and
> content length that might not always be true.

I probably didn't explain it very well then. A request record holds the
data from a request -- the method, uri, headers, etc. Additionally it
can be read or written. It does not actually bake in assumptions about
character sets or the like. It's simply that that HTTP protocol is
flawed in this respect, that it mixes textual and binary data. We want
to be able to read and parse requests, responses, and their headers
using string and char routines, and that's fine as the character set for
HTTP messages is restricted to a subset of the lower ASCII set. But then
the body of an HTTP message is fundamentally binary -- the
content-length is specified in bytes, not characters.

So the right way to read off a body is as a bytevector of the specified
length (potentially with chunked transfer encoding of course, though we
don't do that yet). Then if you want text, you decode using the
character set specified in the request. If you are particularly lucky
and it is a textual type and the charset is not specified, you can read
it as a latin-1 string directly, otherwise you convert. Or you can deal
with the binary data as a string.

Setting the charset on the port is a bit of a hack, but it is the right
thing to do if you are reading HTTP. And it doesn't matter what the
charset is when you read the body as it's specified in bytes anyway and
should be read in bytes (and then, possibly, decoded).

Some more organized discussion should go in the manual... but what do
you think?

>>  -- Function: extend-response r k v . additional
>>      Extend an HTTP response by setting additional HTTP headers K, V.
>>      Returns a new HTTP response.
>
> What does the ADDITIONAL arg mean?

More k-v pairs. Will note in the manual.

>>  -- Function: adapt-response-version response version
>>      Adapt the given response to a different HTTP version. Returns a
>>      new HTTP response.
>
> Interesting, and adds more value to the idea of the response object.
> Why not for the request object too - are you assuming that Guile will
> usually be acting as the HTTP server?  (Which I'm sure is correct, but
> "usually" is not "always".)

The thing is that the request initiates the transaction -- so it's the
requestor that makes the version decision. If you want to decide on
another version, presumably you do so when you build-version. But
perhaps for some sort of "request middleware", this could be
interesting.

>>   1. The `open' hook is called, to open the server. `open' takes 0 or
>>      more arguments, depending on the backend,
>
> How is that possible?  (immediate thought... perhaps it will be
> explained later)

Yes let's make sure we explain that later (we don't yet). "depending on
the backend" should make that clear.

>>   2. The `read' hook is called, to read a request from a new client.
>>      The `read' hook takes one argument, the server socket.  It should
>
> It feels surprising for the infrastructure to pass the server socket to
> the read hook.  I'd expect the infrastructure to do the `accept' itself
> and pass the client socket to the read hook.

It's the opaque "server socket object". Doing it this way has a two
advantages:

  1) Works with other socket architectures (zeromq, particularly).

  2) Allows the server to make its own implementation of keepalive
  (or not).

Particularly the latter is interesting -- the http implementation makes
a big pollset (from (ice-9 poll), not yet documented), and polls on the
server socket and all the keepalive sockets.

> Also, does the infrastructure assume that each client socket will only
> be used for one request and response, and then closed?  Would it be hard
> to remove that assumption, so that the <server-impl> idea is more
> general?

I don't think it makes that assumption, no. Client lifecycle is totally
up to the implementation. I think the deal is you check out the client,
run the handler, then always call the "write" hook, even if there was
an error in the handler. If the request fails to be read, the read hook
should not return a socket.

>>      case a default response object is constructed with those headers.
>
> What about response status?  (perhaps represented as a "status" header,
> a la modlisp)

The response object has that in it -- it will be 200 by default. If you
want another one, it's easy to build-response #:code xxx #:headers
header, no?

>>    The `(web server)' module defines a number of routines that use
>> `<server-impl>' objects to implement parts of a web server.  Given that
>> we don't expose the accessors for the various fields of a
>> `<server-impl>', indeed these routines are the only procedures with any
>> access to the impl objects. 
>
> How general is <server-impl> hoping to be?  Correspondingly, is the (web
> server) module name appropriate?
>
> To me, "web" => "http", so (web server http) is a tautological name.
> And in fact it sounds like you intend <server-impl> to cover more than
> just web/HTTP, so I suppose it should be in a module like (server),
> rather than (web server).
>
> It seems we could do with some more server impls in order to validate
> that the infrastructure is all defined correctly.  Time-permitting, I'd
> like to play with writing modlisp support for this new system, analogous
> to what I did already in guile-www.

I have written mod-lisp support. You can see it in tekuti/mod-lisp.scm,
attached here:

mod-lisp.scm
Description: (tekuti mod-lisp)

Does that document help? This is only http, not other servers.

> I think there's a one-request-per-connection assumption here, isn't
> there?

Heh, no. See the (web server http) implementation. Perhaps misnamed
though -- should it be (web server socket)? But most servers have
sockets? I chose HTTP as it is the name of the wire protocol, though
other protocols are possible that have that same semantic content
(mod-lisp for example). Other suggestions welcome.

>>      The default server implementation is `http', which accepts
>>      OPEN-PARAMS like `(#:port 8081)', among others. See "Web Server"
>>      in the manual, for more information.
>
> Last sentence should be removed from the manual version of the
> docstring.

ACK.

>> 7.3.7.1 Hello, World!
>> .....................
>
> The thunder here has been somewhat stolen by the fact that you already
> presented this example above!

True! Well, we didn't actually run it, but hey... perhaps elide from the
previous section, and point people here?

>>    Instead of returning the body as a string, here we give a
>>    procedure,
>
> Insert "Also, " before "Instead"?  Otherwise this reads as moving onto a
> new example.

Good point!

Whew! Long mail, but it's a lot of new code and docs, so the feedback is
much appreciated. I've inserted notes in my web.texi, and will poke this
shortly.

Happy holidays,

Andy
-- 
http://wingolog.org/

[Prev in Thread]

Current Thread

[Next in Thread]

documentation for (web ...), Andy Wingo, 2010/12/14
- Re: documentation for (web ...), Ludovic Courtès, 2010/12/16
  - Re: documentation for (web ...), Neil Jerram, 2010/12/23
    - Re: documentation for (web ...), Andy Wingo <=

Prev by Date: Re: documentation for (web ...)
Next by Date: crash at loading gds-client module
Previous by thread: Re: documentation for (web ...)
Next by thread: schelog repl.
Index(es):
- Date
- Thread