Daniel Lemire's blog


Wget doesn't eat XML

I wanted to retrieve a local copy of my online XML course. I had instructed the technical staff to serve the XHTML files as application/xml, I believe to work around limitations of Internet Explorer. In any case, I stumbled upon a wget bug! Wget won't treat XHTML served with the MIME type application/xml as an HTML file, and hence, it won't follow the links inside it.
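The behavior can be modeled by looking at the Content-Type header the server sends: wget only parses a document for links when it believes the document is HTML. Here is a rough sketch (the exact list of MIME types wget's link extractor accepts is my assumption, not taken from the wget source):

```python
def wget_follows_links(content_type):
    # Rough model: wget only extracts links from documents it
    # considers HTML. application/xml is the case that fails here.
    # (The accepted list below is an assumption for illustration.)
    mime = content_type.split(";")[0].strip().lower()
    return mime in ("text/html", "application/xhtml+xml")

print(wget_follows_links("text/html; charset=utf-8"))  # True
print(wget_follows_links("application/xml"))           # False
```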

A deeper limitation is that wget doesn't understand XML. This means that it will not follow stylesheet references. Wget also doesn't know about JavaScript.
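For instance, an XML document typically points to its stylesheet through an xml-stylesheet processing instruction, which a crawler would have to parse itself. A minimal sketch in Python (the function name is mine):

```python
import re

def stylesheet_hrefs(xml_text):
    # pull the href out of <?xml-stylesheet type="text/xsl" href="..."?>
    return re.findall(r'<\?xml-stylesheet[^>]*href="([^"]*)"', xml_text)

doc = '<?xml-stylesheet type="text/xsl" href="course.xsl"?><root/>'
print(stylesheet_hrefs(doc))  # ['course.xsl']
```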

This meant I had to write my own scripts to recover the course. First, a bash script:

wget -m -r -l inf -v -p http://www.teluq.uquebec.ca/inf6450/index-fr.htm
find -path "*.htm" | xargs ./extracturls.py | xargs wget -m -r -l inf -v -p
find -path "*.html" | xargs ./extracturls.py | xargs wget -m -r -l inf -v -p
find -path "*.xhtml" | xargs ./extracturls.py | xargs wget -m -r -l inf -v -p
find -path "*.xml" | xargs ./extracturls.py | xargs wget -m -r -l inf -v -p
find -path "*.xml" | xargs ./extracturls.py | xargs wget -m -r -l inf -v -p

You will notice that the last line appears twice. Don't do this type of scripting at home. Bad design!

Next, I needed a Python script to extract the URLs (Perl or Ruby would do just as well):

#!/bin/env python
import re, sys

for filename in sys.argv[1:]:
    # print "from ", filename
    for line in open(filename):
        # better hope that we don't have repeated spaces!
        for m in re.findall(r"(?<=<a\shref=[\"'])(?!http)(?!javascript)[^\"']*(?=[\"'])", line) + \
                 re.findall(r"(?<=<img\ssrc=[\"'])(?!http)(?!javascript)[^\"']*(?=[\"'])", line) + \
                 re.findall(r"(?<=<quiz>).*?(?=<)", line) + \
                 re.findall(r"(?<=openwindow\(').*?(?=')", line) + \
                 re.findall(r"(?<=stylesheet href=[\"']).*?(?=[\"'])", line):
            print "http://" + re.search("www.*/", filename).group() + m

This is a pretty awful hack, but it works!
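A less fragile alternative is to use a real parser rather than line-by-line regexes. Here is a sketch using Python 3's html.parser (the class name is mine, and it only covers the href/src cases, not the quiz and openwindow patterns above):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    # collects relative href/src values instead of regex-scraping lines
    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value \
                    and not value.startswith(("http", "javascript")):
                self.urls.append(value)

parser = LinkExtractor()
parser.feed('<a href="page.htm">x</a><img src="img/pic.png"/>')
print(parser.urls)  # ['page.htm', 'img/pic.png']
```

Unlike the regex approach, the parser is indifferent to repeated spaces, attribute order, and line breaks inside tags.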

Here is a project for the tech savvy among you: extend wget so that it can parse XML!