Python Scrapy parsing for dates

I'm having trouble parsing XML sitemap data from websites to identify their most recent update date. This is the code I'm using:

def fetch_dates(self, response):
    sitemap = scrapy.selector.XmlXPathSelector(response)
    sitemap.register_namespace(
        # ns is just a namespace and the second param should be whatever the
        # xmlns of your sitemap is
        'ns', 'http://www.sitemaps.org/schemas/sitemap/0.9'
    )

    # this gets you a list of all the "loc" and "last modified" fields.
    locsList = sitemap.select('//ns:loc/text()').extract()
    lastModifiedList = sitemap.select('//ns:lastmod/text()').extract()

    # zip() the 2 lists together
    pageList = list(zip(locsList, lastModifiedList))

    for page in pageList:
        if os.path.exists('1url-to-date.csv'):
            append_write = 'a'
        else:
            append_write = 'w'

        with open('1url-to-date.cav', append_write) as url_f:
            url_f.write(locsList + "&,&" + lastModifiedList + "/n")

    return Item()

But it's not returning any values for the dates, and it's not even writing my file, so something is clearly wrong with the code. I don't see any errors when it runs; it just finishes without producing anything. Any suggestions on how to fix it?
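
To make the intended behaviour concrete, here is a rough, untested sketch of what I think the extraction and write step should look like on a current Scrapy release (scrapy.Selector instead of the deprecated XmlXPathSelector, and the csv module writing one row per page instead of the "&,&" string concatenation). Treat it as a sketch of the goal, not as working code:

import csv
import os

import scrapy


def fetch_dates(self, response):
    # Selector replaces the old XmlXPathSelector; .xpath()/.getall() replace
    # .select()/.extract().
    sitemap = scrapy.Selector(response)
    sitemap.register_namespace(
        'ns', 'http://www.sitemaps.org/schemas/sitemap/0.9'
    )

    locs = sitemap.xpath('//ns:loc/text()').getall()
    lastmods = sitemap.xpath('//ns:lastmod/text()').getall()

    # Append if the file already exists, otherwise create it.
    mode = 'a' if os.path.exists('1url-to-date.csv') else 'w'
    with open('1url-to-date.csv', mode, newline='') as url_f:
        writer = csv.writer(url_f)
        # One row per (url, lastmod) pair, rather than writing the whole
        # lists in a single call.
        for loc, lastmod in zip(locs, lastmods):
            writer.writerow([loc, lastmod])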

What I'm ultimately looking for is a list of the HTML pages the web crawler finds, together with each page's last-updated date. If there isn't a date available, I'll use today's date, and then compute the number of days since the last update.
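
For the date handling, I have something like this small helper in mind: it turns a <lastmod> string into a days-since-update count and falls back to today's date (zero days) when the field is missing. The ISO/W3C date format is an assumption based on how sitemap lastmod values usually look:

from datetime import date, datetime


def days_since_update(lastmod):
    """Days since the page's <lastmod> date; 0 if no date is available."""
    if not lastmod:
        # No date in the sitemap: fall back to today's date, i.e. zero days.
        return 0
    # lastmod is a W3C datetime such as "2021-01-13" or
    # "2021-01-13T09:30:00+00:00"; the date is the first 10 characters.
    last_updated = datetime.strptime(lastmod[:10], "%Y-%m-%d").date()
    return (date.today() - last_updated).days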



Read more here: https://stackoverflow.com/questions/65707230/python-scrapy-parsing-for-dates

Content Attribution

This content was originally published by Meredith Abrams at Recent Questions - Stack Overflow, and is syndicated here via their RSS feed. You can read the original post over there.
