po4a-dev

Re: [Po4a-dev]HTML module (first revision)


From: Laurent Hausermann
Subject: Re: [Po4a-dev]HTML module (first revision)
Date: Tue, 18 Feb 2003 10:48:00 +0100
User-agent: Internet Messaging Program (IMP) 3.1

Hi all, 

> In fact, HTML being a DTD of SGML, I guess it could be easier to handle
> this format with the Sgml.pm module, which offers the whole mechanism to
> do what I wanted from HTML.pm...
> 
> You would only have to add the specific parts to HTML after the specific
> parts of docbook and debiandoc, and provided that your documents start with
> a line like
> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
> (as they should), it will work (I guess) !

OK, I'll look more closely at Sgml.pm. It seems a good idea in theory, but I
wonder whether an SGML module can parse the "peculiar" HTML written by people
who don't care about the HTML 3.2 or 4.0 DTDs :)

> > > I have developed an HTML module for po4a. It still has some bugs and
> > > it's not perfect, but I think it's a good starting point.
> > > It uses HTML::TokeParser (apt-get install libhtml-parser-perl).
> > > I sent the whole diff to Martin Quinson, not to this list.
> > OK, I committed this to the CVS, so that others can see it.

Thanks.

> > This module isn't ready to release yet in my opinion. Here are my
> > objections:
> >  * The parser you used doesn't allow retrieving the line number. Why not
> >    use the HTML::Parser module, which seems somewhat more powerful?

You are right. HTML::Parser is more powerful, but parsing HTML with it seemed
more difficult to me.
I am not an i18n expert; can you explain why the line number is so important,
especially for SGML/XML/HTML?
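
For context on the question above: PO files record where each string came
from in a "#: file:line" reference comment, so the parser has to report
positions. Below is a minimal sketch of that idea, not po4a code. It uses
Python's standard html.parser instead of the Perl modules discussed in this
thread, and the names LineTrackingParser, extract_with_lines and sample.html
are made up for the illustration.

```python
from html.parser import HTMLParser


class LineTrackingParser(HTMLParser):
    """Collect (line number, text) pairs for every non-blank text run."""

    def __init__(self):
        super().__init__()
        self.entries = []

    def handle_data(self, data):
        # getpos() reports the (line, column) where the current token starts.
        if data.strip():
            lineno, _column = self.getpos()
            self.entries.append((lineno, data.strip()))


def extract_with_lines(html):
    """Return the text runs of `html` together with their line numbers."""
    parser = LineTrackingParser()
    parser.feed(html)
    parser.close()
    return parser.entries


sample = (
    "<html>\n"
    "<body>\n"
    "<p>Hello world</p>\n"
    "<p>Second paragraph</p>\n"
    "</body>\n"
    "</html>\n"
)

# Print each string the way a PO reference comment would locate it.
# "sample.html" is a hypothetical file name.
for lineno, text in extract_with_lines(sample):
    print(f"#: sample.html:{lineno}")
    print(f'msgid "{text}"')
```

With such references, a translator or a tool can jump from a PO entry back to
the exact place in the source document.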

> >     That is to say that sentences are broken into subparts, which is BAD
> >     (see http://www.ens-lyon.fr/~mquinson/l10n.html for a rationale).

Yes, you are probably right here too. But poedit, for example, a tool that
translators may use, won't render <b> or <i> tags as bold or italic. And I
think a translator should not have to be an HTML expert. The <a> tag is too
difficult to "translate" to leave it under the translator's control.

> >   * Your version doesn't put the entry type in the po, which prevents
> >     using gettextization (see po4a(7) for more details). I quickly hacked
> >     support for that into the version in CVS, but that's not perfect yet.

Oops, I missed that point. I had a look at your "hack", but I don't see a
better way to handle gettextization. Do you have any other ideas on that point?
     
> > I suggest that:
> >   - you move to a parser that allows you to retrieve the line number (or
> >     explain to me that I'm an idiot and that this parser does allow you to
> >     retrieve the line number, and how)

I will look into the internals of HTML::TokeParser.

> >   - you look at the sgml module to see how we handle the fact that some
> >     tags delimit a paragraph (like <p>), and should be translated, and that
> >     some other tags shouldn't be touched because they don't delimit a
> >     sentence (like <b>, <i> and so on)

OK. I will see whether I can add HTML to the SGML module.
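
To make the distinction in the quoted suggestion concrete, here is a rough
sketch of the segmentation idea: tags that delimit a paragraph start a new
translation unit, while inline tags stay inside the string so the sentence is
not broken up. This is not how the sgml module is written. It is a Python
analogy built on the standard html.parser, and the BLOCK_TAGS set and the
names ParagraphExtractor and extract_units are assumptions made for the
example.

```python
from html.parser import HTMLParser

# Hypothetical classification, chosen for this sketch only.
# Every tag not listed here is treated as inline and kept in the string.
BLOCK_TAGS = {"p", "h1", "h2", "h3", "li", "td", "div"}


class ParagraphExtractor(HTMLParser):
    """Split a document into translation units at block-level tags."""

    def __init__(self):
        super().__init__()
        self.units = []
        self._buffer = []

    def _flush(self):
        # Join the buffered pieces and collapse runs of whitespace.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.units.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()  # a block tag closes the unit in progress
        else:
            # Keep inline markup verbatim inside the unit.
            self._buffer.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()
        else:
            self._buffer.append(f"</{tag}>")

    def handle_data(self, data):
        self._buffer.append(data)


def extract_units(html):
    """Return the list of translation units found in `html`."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()  # emit any trailing text after the last block tag
    return parser.units


units = extract_units(
    "<p>This is <b>very</b> important.</p>\n<p>See <i>this</i>\npage.</p>"
)
for unit in units:
    print(unit)
```

Each unit reaches the translator as a whole sentence with its <b> or <i>
markup intact, instead of as three separate fragments.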

> > Sorry, but I really can't release this module as is...
> > Anyway, thanks for your contribution, it IS a good start.

Don't be sorry: you are responsible for providing the community with good po4a
software, and I am a "mongers beginner" :)





