Wget doesn't eat XML
I wanted to retrieve a local copy of my online XML course. I had instructed the technical staff to serve the XHTML files as application/xml, I believe to work around limitations of Internet Explorer. In any case, I stumbled upon a wget bug! Wget won't process XHTML served with the MIME type application/xml as an XHTML file, and hence it won't follow the links inside it.
A deeper limitation is that wget doesn't know XML. This means that it will not follow stylesheet links. Wget also doesn't know about JavaScript.
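For the curious, here is a minimal reproduction sketch (my own test harness in Python 3, not part of the original setup): it serves a tiny two-page XHTML site with the Content-Type forced to application/xml, then runs wget recursively against it. Served that way, wget saves index.xhtml without ever requesting page2.xhtml; switch the constant to text/html and the link is followed.

import subprocess
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

CONTENT_TYPE = "application/xml"  # change to "text/html" and wget follows the link

PAGES = {
    "/index.xhtml": b'<html xmlns="http://www.w3.org/1999/xhtml">'
                    b'<body><a href="page2.xhtml">next</a></body></html>',
    "/page2.xhtml": b'<html xmlns="http://www.w3.org/1999/xhtml"><body>done</body></html>',
}

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = PAGES.get(self.path)
        if body is None:
            self.send_error(404)
            return
        self.send_response(200)
        self.send_header("Content-Type", CONTENT_TYPE)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

server = HTTPServer(("127.0.0.1", 8000), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
# with application/xml, wget mirrors index.xhtml but never asks for page2.xhtml
subprocess.run(["wget", "-r", "-nv", "http://127.0.0.1:8000/index.xhtml"])
server.shutdown()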
This meant I had to write my own scripts to recover the course. First, a bash script:
wget -m -r -l inf -v -p http://www.teluq.uquebec.ca/inf6450/index-fr.htm
find . -path "*.htm" | xargs ./extracturls.py | xargs wget -m -r -l inf -v -p
find . -path "*.html" | xargs ./extracturls.py | xargs wget -m -r -l inf -v -p
find . -path "*.xhtml" | xargs ./extracturls.py | xargs wget -m -r -l inf -v -p
find . -path "*.xml" | xargs ./extracturls.py | xargs wget -m -r -l inf -v -p
find . -path "*.xml" | xargs ./extracturls.py | xargs wget -m -r -l inf -v -p
You see that the last line appears twice: each pass may download files containing URLs the previous pass missed, so I simply run it again and hope everything has been found. Don't do this type of scripting at home. Bad design! (A cleaner loop is sketched below.)
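If you are going to loop, loop until nothing changes. Here is a rough sketch (mine, in Python 3, not the original script) of the fixpoint that the duplicated line approximates: repeat the find/extract/fetch cycle until a pass discovers no new files. It assumes extracturls.py, shown below, prints one URL per line.

import pathlib
import subprocess

SEED = "http://www.teluq.uquebec.ca/inf6450/index-fr.htm"
WGET = ["wget", "-m", "-r", "-l", "inf", "-v", "-p"]
SUFFIXES = {".htm", ".html", ".xhtml", ".xml"}

subprocess.run(WGET + [SEED])
seen = set()
while True:
    files = {p for p in pathlib.Path(".").rglob("*") if p.suffix in SUFFIXES}
    new = sorted(files - seen)
    if not new:
        break  # a full pass found no new files: we are done
    seen |= files
    out = subprocess.run(["./extracturls.py"] + [str(p) for p in new],
                         capture_output=True, text=True).stdout
    urls = sorted(set(out.split()))
    if urls:
        subprocess.run(WGET + urls)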
Next, I needed a Python script to extract the URLs (Perl or Ruby would also do):
#!/bin/env python
import re, sys

for filename in sys.argv[1:]:
    file = open(filename)
    # print "from ", file
    for line in file:
        # better hope that we don't have repeated spaces!
        # grab relative targets of <a href>, <img src>, <quiz>,
        # openwindow('...') and stylesheet links; the lookaheads
        # skip absolute http:// and javascript: URLs
        for m in re.findall("(?<=<a\shref=[\"'])(?!http)(?!javascrip)[^\"']*(?=[\"'])", line) + \
                re.findall("(?<=<img\ssrc=[\"'])(?!http)(?!javascrip)[^\"']*(?=[\"'])", line) + \
                re.findall("(?<=<quiz>).*?(?=<)", line) + \
                re.findall("(?<=openwindow\(').*?(?=')", line) + \
                re.findall("(?<=stylesheet href=[\"']).*?(?=[\"'])", line):
            # the mirrored files live under www.../, so the prefix of the
            # file name gives us back the base URL
            print "http://" + re.search("www.*/", filename).group() + m
This is a pretty awful hack, but it works!
Here is a project for the tech-savvy among you: extend wget so that it can parse XML!
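In the meantime, the XML side of the job is not hard in a scripting language. Here is a rough sketch (mine, in Python 3 with the standard xml.etree; the handling of the xml-stylesheet processing instruction is deliberately crude) of the kind of link extraction an XML-aware wget would need:

import re
import sys
import xml.etree.ElementTree as ET

def xml_links(path):
    links = []
    # xml-stylesheet processing instructions live outside the element tree,
    # so fish them out of the raw text with a regex
    text = open(path).read()
    links += re.findall(r'<\?xml-stylesheet[^>]*href="([^"]*)"', text)
    # XHTML puts elements in a namespace but leaves href/src unqualified,
    # so elem.get() works; qualified attribute variants would need extra handling
    for elem in ET.parse(path).iter():
        for attr in ("href", "src"):
            value = elem.get(attr)
            if value and not value.startswith(("http", "javascript")):
                links.append(value)
    return links

for filename in sys.argv[1:]:
    for link in xml_links(filename):
        print(link)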