Daniel Lemire's blog

, 4 min read

An Amazon Web Services (AWS) 4.0 application in just a few lines

I have somewhat of a debate with my friend Yuhong about the correct way to use a Web Service. Yuhong seems to prefer SOAP. I much prefer REST. What is a REST Web Service? For the most part, a REST Web Service is really, really simple. You simply follow a URL and magically, you get back an XML file which is the answer to your query. One benefit of REST is that you can easily debug it and write quick scripts for it. What is a SOAP Web Service? I don’t know. I really don’t get it. It seems to be a very complicated way to do the same thing: instead of sending the query as a URL, you send the query as a XML file. The bad thing about it is that if it breaks, then you have no immediate way to debug it: you can’t issue a SOAP request using your browser (or maybe you can, but I just don’t know how). Now, things never break, do they? Well, that is the problem, they break often because either I’m being stupid or I don’t know what I’m doing or the people on the other side don’t know what they are doing or the people on the other side are experimenting a bit or whatever else. I find that being able to quickly debug my code is the primary feature any IT technology should have. The last thing I want from a technology is for it to be hard to debug.

Here is the problem that I solved this week-end. I have this list of artists and I want to get a list of all corresponding music albums so I can put it all into a relational (SQL) database. Assuming that your list of artists are in a file called artists_big.txt and that you want the result to be in a file called amazonresults.sql, the following does a nice job thanks to the magic of Amazon Web Services:

Yes: the code goes over because I cannot allow HTML to wrap lines (Python allows wrapping lines, but not arbitrarily so, white space is significant in Python). There is just no way around it that I know: suggestions with sample HTML code are invited.

import libxml2, urllib2, urllib, sys, re, traceback
ID=""# please enter your own ID here
uri="http://webservices.amazon.com/AWSECommerceService/2004-10-19"
url="http://webservices.amazon.com/onca/xml?Service=AWSECommerceService&SubscriptionId=%s&Operation=ItemSearch&SearchIndex=Music&Artist=%s&ItemPage=%i&ResponseGroup=Request,ItemIds,SalesRank,ItemAttributes,Reviews"
outputcontent = ["ASIN","Artist","Title","Amount","NumberOfTracks","SalesRank","AverageRating","ReleaseDate"]
input = open("artists_big.txt")
output = open("amazonresults.sql", "w")
log = open("amazonlog.txt", "w")
output.write("DROP TABLE music;nCREATE TABLE music (ASIN TEXT, Artist TEXT, Title TEXT, Amount INT, NumberOfTracks INT, SalesRank INT, AverageRating NUMERIC, ReleaseDate DATE);n")
def getNodeContentByName(node, name):
for i in node:
if (i.name==name): return i.content
return None
for artist in input:#go through all artists
print "Recovering albums for artist : ", artist
page = 1
while(True):# recover all pages
resturl = url %(ID,urllib.quote(artist),page)
log.write("Issuing REST request: "+resturl+"n")
try :
data = urllib2.urlopen(resturl).read()
except urllib2.HTTPError,e:
log.write("n")
log.write(str(traceback.format_exception(*sys.exc_info())))
log.write("n")
log.write("could not retrieve :n"+resturl+"n")
continue
try :
doc = libxml2.parseDoc(data)
except libxml2.parserError,e:
log.write("n")
log.write(str(traceback.format_exception(*sys.exc_info())))
log.write("n")
log.write("could not parse (is valid XML?):n"+data+"n")
continue
ctxt=doc.xpathNewContext()
ctxt.xpathRegisterNs("aws",uri)
isvalid = (ctxt.xpathEval("//aws:Items/aws:Request/aws:IsValid")[0].content == "True")
if not isvalid :
log.write("The query %s failed " % (resturl))
errors = ctxt.xpathEval("//aws:Error/aws:Message")
for message in errors: log.write(message.content+"n")
continue
for itemnode in ctxt.xpathEval("//aws:Items/aws:Item"):
attr = {}
for nodename in outputcontent:
content = getNodeContentByName(itemnode,nodename)
if(content <> None):
content = re.sub("'","'",content)
if(nodename == "SalesRank"):
content = re.sub(",","",content)
attr[nodename] = content
columns = "("
keys = attr.keys()
for i in range(len(keys)-1):
columns += keys[i]+","
columns+=keys[len(keys)-1]+")"
row = "("
values = attr.values()
for i in range(len(values)-1):
row+="'"+str(values[i])+"',"
row+="'"+str(values[len(values)-1])+"')"
command = "INSERT INTO music "+columns+" VALUES "+row+";n"
output.write(command)
NumberOfPages = int(ctxt.xpathEval("//aws:Items/aws:TotalPages")[0].content)
if(page >= NumberOfPages): break
page += 1
input.close()
output.close()
log.close()
print "You should now be able to run the file in postgresql. Start the postgres client doing psql, and using i amazonresults.sql in the postgresql shell."

Update : this was updated to take into account these comments from Amazon following the upgrade of AWS 4.0 from beta to release:

Service Name change: You will need to modify the Service Name parameter in your application from AWSProductData to AWSECommerceService. We realize that it may take some time to implement this change in your applications. In order to make this transition as easy as possible, we will continue supporting AWSProductData for a short time.

  1. REST/WSDL Endpoints: You will need to modify your application to connect to webservices.amazon.com instead of aws-beta.amazon.com. For other locales, the new endpoints are webservices.amazon.co.uk, webservices.amazon.de and webservices.amazon.co.jp.