I'm having trouble parsing XML sitemap data from websites to find each page's most recent update date. This is the code I'm using:
    def fetch_dates(self, response):
        sitemap = scrapy.selector.XmlXPathSelector(response)
        sitemap.register_namespace(
            # ns is just a namespace and the second param should be whatever the
            # xmlns of your sitemap is
            'ns', 'http://www.sitemaps.org/schemas/sitemap/0.9'
        )
        # this gets you a list of all the "loc" and "last modified" fields.
        locsList = sitemap.select('//ns:loc/text()').extract()
        lastModifiedList = sitemap.select('//ns:lastmod/text()').extract()
        # zip() the 2 lists together
        pageList = list(zip(locsList, lastModifiedList))
        for page in pageList:
            if os.path.exists('1url-to-date.csv'):
                append_write = 'a'
            else:
                append_write = 'w'
            with open('1url-to-date.cav', append_write) as url_f:
                url_f.write(locsList + "&,&" + lastModifiedList + "/n")
        return Item()
But it returns no dates and doesn't even write the file, so something is clearly wrong with the code. It runs without any errors, yet I get nothing back. Any suggestions on how to fix it?
What I'm ultimately looking for is a list of the HTML pages the web crawler finds, each with its last-updated date. If no date is available, I'll fall back to today's date. I also want the number of days since the last update.
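To make the desired output concrete, here is a minimal sketch of it using only the Python standard library rather than Scrapy. It assumes the sitemap uses the standard `http://www.sitemaps.org/schemas/sitemap/0.9` namespace and that every `lastmod` value starts with a full `YYYY-MM-DD` date. The names `parse_sitemap`, `write_rows`, and the file name `url-to-date.csv` are hypothetical, chosen only for illustration.

```python
import csv
import xml.etree.ElementTree as ET
from datetime import date

# Assumption: the sitemap declares the standard sitemaps.org namespace.
SITEMAP_NS = {"ns": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text, today=None):
    """Return (url, last_modified, days_since_update) for each <url> entry."""
    today = today or date.today()
    rows = []
    root = ET.fromstring(xml_text)
    for url in root.findall("ns:url", SITEMAP_NS):
        loc = url.findtext("ns:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = url.findtext("ns:lastmod", default="", namespaces=SITEMAP_NS).strip()
        # Assumption: lastmod begins with YYYY-MM-DD; any time part is dropped.
        # A missing lastmod falls back to today's date.
        modified = date.fromisoformat(lastmod[:10]) if lastmod else today
        rows.append((loc, modified.isoformat(), (today - modified).days))
    return rows


def write_rows(rows, path="url-to-date.csv"):
    """Write the rows to a CSV file with a header line."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["url", "last_modified", "days_since_update"])
        writer.writerows(rows)
```

Looping over each `<url>` element keeps every `loc` paired with its own `lastmod`. Zipping two independently extracted lists would misalign the pairs as soon as one entry has no `lastmod`.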