Daniel Lemire's blog


Wget doesn't eat XML

I wanted to retrieve a local copy of my online XML course. I had instructed the technical staff to serve the XHTML files as application/xml, I believe to work around limitations of Internet Explorer. In any case, I stumbled upon a wget bug! Wget won't treat XHTML served with the MIME type application/xml as an HTML file, and hence, it won't follow the links inside it.
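The behavior can be modeled by looking at the Content-Type header the server sends: wget only parses a document for links when it believes the document is HTML. Here is a rough sketch (the exact list of MIME types wget's link extractor accepts is my assumption, not taken from the wget source):

```python
def wget_follows_links(content_type):
    # Rough model: wget only extracts links from documents it
    # considers HTML. application/xml is the case that fails here.
    # (The accepted list below is an assumption for illustration.)
    mime = content_type.split(";")[0].strip().lower()
    return mime in ("text/html", "application/xhtml+xml")

print(wget_follows_links("text/html; charset=utf-8"))  # True
print(wget_follows_links("application/xml"))           # False
```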

A deeper limitation is that wget doesn't understand XML. This means that it will not follow stylesheet references. Wget also doesn't know about JavaScript.
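For instance, an XML document typically points to its stylesheet through an xml-stylesheet processing instruction, which a crawler would have to parse itself. A minimal sketch in Python (the function name is mine):

```python
import re

def stylesheet_hrefs(xml_text):
    # pull the href out of <?xml-stylesheet type="text/xsl" href="..."?>
    return re.findall(r'<\?xml-stylesheet[^>]*href="([^"]*)"', xml_text)

doc = '<?xml-stylesheet type="text/xsl" href="course.xsl"?><root/>'
print(stylesheet_hrefs(doc))  # ['course.xsl']
```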

This meant I had to write my own scripts to recover the course. First, a bash script:

wget -m -r -l inf -v -p http://www.teluq.uquebec.ca/inf6450/index-fr.htm
find -path "*.htm" | xargs ./extracturls.py | xargs wget -m -r -l inf -v -p
find -path "*.html" | xargs ./extracturls.py | xargs wget -m -r -l inf -v -p
find -path "*.xhtml" | xargs ./extracturls.py | xargs wget -m -r -l inf -v -p
find -path "*.xml" | xargs ./extracturls.py | xargs wget -m -r -l inf -v -p
find -path "*.xml" | xargs ./extracturls.py | xargs wget -m -r -l inf -v -p

You will notice that the last line appears twice. Don't do this type of scripting at home. Bad design!

Next, I needed a Python script to extract the URLs (Perl or Ruby would do just as well):

#!/bin/env python
import re, sys

for filename in sys.argv[1:]:
    # print "from ", filename
    for line in open(filename):
        # better hope that we don't have repeated spaces!
        for m in re.findall(r"(?<=<a\shref=[\"'])(?!http)(?!javascript)[^\"']*(?=[\"'])", line) + \
                 re.findall(r"(?<=<img\ssrc=[\"'])(?!http)(?!javascript)[^\"']*(?=[\"'])", line) + \
                 re.findall(r"(?<=<quiz>).*?(?=<)", line) + \
                 re.findall(r"(?<=openwindow\(').*?(?=')", line) + \
                 re.findall(r"(?<=stylesheet href=[\"']).*?(?=[\"'])", line):
            print "http://" + re.search("www.*/", filename).group() + m

This is a pretty awful hack, but it works!
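A less fragile alternative is to use a real parser rather than line-by-line regexes. Here is a sketch using Python 3's html.parser (the class name is mine, and it only covers the href/src cases, not the quiz and openwindow patterns above):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    # collects relative href/src values instead of regex-scraping lines
    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value \
                    and not value.startswith(("http", "javascript")):
                self.urls.append(value)

parser = LinkExtractor()
parser.feed('<a href="page.htm">x</a><img src="img/pic.png"/>')
print(parser.urls)  # ['page.htm', 'img/pic.png']
```

Unlike the regex approach, the parser is indifferent to repeated spaces, attribute order, and line breaks inside tags.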

Here is a project for the tech savvy among you: extend wget so that it can parse XML!