Jonathan McDowell noodles@earth.li wrote:
Secondly, it seems to validate just fine for http://www.andrewsavory.com/blog/index.rdf with http://feedvalidator.org/
Should escaped HTML go in description or content:encoded? It's not something I looked at too deeply when writing it, as both seem to happen in the wild (and even link and title can be tricky too).
[...] I fed Andrew's blog to Planet to see what it did and it appears to do the right thing. So it looks like a schycyroll issue. I don't see why it should need to parse HTML to cope with this, but I haven't looked at its code at all.
The XML parser doesn't like the missing </li> in the Links item.
I think Planet just prints the unescaped HTML straight into its output, so I don't see how it can reliably produce a valid page. (Schycyroll doesn't yet produce a valid page either, but the current bug stopping it has a reasonably easy fix.)
Trying to regex out all the stupid things people do with HTML isn't feasible. Parsing it, merging it and then serialising should give reliably valid pages and enable some other features. The downside is that the encoded HTML needs to be parsed. The XML parser is used anyway, so schycyroll just blindly tries that on the description (hoping most people use 1999 XHTML by now) and falls back to the text "parser" if the attempt fails.
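Roughly, the idea is something like this (a Python-ish sketch of the "try the XML parser, fall back to text" step, not the actual schycyroll code; the function name and the dummy <div> wrapper are just for illustration):

    import xml.etree.ElementTree as ET

    def parse_description(description_html):
        # description_html is the entry's description after the feed
        # itself has been parsed, i.e. a string of HTML markup.
        try:
            # Wrap in a dummy element so a fragment with several (or
            # zero) top-level elements still parses as one document.
            return ET.fromstring("<div>" + description_html + "</div>")
        except ET.ParseError:
            # Tag soup (e.g. a missing </li>): give up on the XML
            # parser and fall back to treating it as plain text.
            return None

Serialising the parsed tree back out (or escaping the whole string in the fallback case) is what makes the merged page reliably valid.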
Hope that explains why I want it to parse things.