python - Stuck scraping a specific table with scrapy


So the table I'm trying to scrape can be found here: http://www.betdistrict.com/tipsters

I'm after the table titled 'June Stats'.

Here's my spider:

    from __future__ import division
    from decimal import *

    import scrapy
    import urlparse

    from ttscrape.items import ttscrapeitem


    class betdistrictspider(scrapy.Spider):
        name = "betdistrict"
        allowed_domains = ["betdistrict.com"]
        start_urls = ["http://www.betdistrict.com/tipsters"]

        def parse(self, response):
            for sel in response.xpath('//table[1]/tr'):
                item = ttscrapeitem()
                name = sel.xpath('td[@class="tipst"]/a/text()').extract()[0]
                url = sel.xpath('td[@class="tipst"]/a/@href').extract()[0]
                tipster = '<a href="' + url + '" target="_blank" rel="nofollow">' + name + '</a>'
                item['tipster'] = tipster
                won = sel.xpath('td[2]/text()').extract()[0]
                lost = sel.xpath('td[3]/text()').extract()[0]
                void = sel.xpath('td[4]/text()').extract()[0]
                tips = int(won) + int(void) + int(lost)
                item['tips'] = tips
                strike = Decimal(int(won) / tips) * 100
                strike = str(round(strike, 2))
                item['strike'] = [strike + "%"]
                profit = sel.xpath('//td[5]/text()').extract()[0]
                if profit[0] in ['+']:
                    profit = profit[1:]
                item['profit'] = profit
                yield_str = sel.xpath('//td[6]/text()').extract()[0]
                yield_str = yield_str.replace(' ', '')
                if yield_str[0] in ['+']:
                    yield_str = yield_str[1:]
                item['yield'] = '<span style="color: #40aa40">' + yield_str + '%</span>'
                item['site'] = 'bet district'
                yield item

This gives me a "list index out of range" error on the first variable (name).

However, when I rewrite my XPath selectors to start with //, e.g.:

    name = sel.xpath('//td[@class="tipst"]/a/text()').extract()[0]

the spider runs, but scrapes the first tipster over and over again.

I think it has something to do with the table not having a thead, but containing th tags within the first tr of the tbody.

Any help appreciated.

----------edit----------

In response to Lars's suggestions:

I've tried to use what you've suggested, but I still get a "list index out of range" error:

    from __future__ import division
    from decimal import *

    import scrapy
    import urlparse

    from ttscrape.items import ttscrapeitem


    class betdistrictspider(scrapy.Spider):
        name = "betdistrict"
        allowed_domains = ["betdistrict.com"]
        start_urls = ["http://www.betdistrict.com/tipsters"]

        def parse(self, response):
            for sel in response.xpath('//table[1]/tr[td[@class="tipst"]]'):
                item = ttscrapeitem()
                name = sel.xpath('a/text()').extract()[0]
                url = sel.xpath('a/@href').extract()[0]
                tipster = '<a href="' + url + '" target="_blank" rel="nofollow">' + name + '</a>'
                item['tipster'] = tipster
                yield item

Also, I'm assuming that by doing things this way, multiple loops will be required, since not all cells have the same class?

I've also tried doing things without a loop, but in that case it once again scrapes only the first tipster, multiple times :s

Thanks

When you say

    name = sel.xpath('td[@class="tipst"]/a/text()').extract()[0]

the XPath expression starts with td, so it is relative to the context node that you have in the variable sel (i.e. the current tr element in the set of tr elements that the for loop iterates over).

However, when you say

    name = sel.xpath('//td[@class="tipst"]/a/text()').extract()[0]

the XPath expression starts with //td, i.e. select all td elements anywhere in the document. This is not relative to sel, so the results will be the same on every iteration of the for loop. That's why it scrapes the first tipster over and over again.
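To see the difference without running the spider, here is a rough sketch using the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selectors. The table markup, tipster names and hrefs are made up for illustration, and ElementTree only supports a subset of XPath (it has no leading // on an element), so the "anywhere in the document" lookup is written as a search from the document root instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the tipster table (not the real page).
HTML = """
<html><body>
<table>
  <tr><th>Tipster</th><th>Won</th></tr>
  <tr><td class="tipst"><a href="/t/alice">alice</a></td><td>10</td></tr>
  <tr><td class="tipst"><a href="/t/bob">bob</a></td><td>7</td></tr>
</table>
</body></html>
"""

root = ET.fromstring(HTML)
rows = root.findall(".//table/tr")[1:]  # skip the header row for this demo

# Relative path: evaluated against the current row, so each row gives its own link.
relative = [row.find('td[@class="tipst"]/a').text for row in rows]

# Document-wide path: evaluated from the root on every iteration, so the
# current row is ignored and [0] always picks the same first match.
document_wide = [root.findall('.//td[@class="tipst"]/a')[0].text for _ in rows]

print(relative)       # ['alice', 'bob']
print(document_wide)  # ['alice', 'alice']
```

The second list is the "first tipster over and over" symptom from the question: the loop runs once per row, but the lookup never depends on the row.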

Why does the first XPath expression fail with a "list index out of range" error? Try taking the XPath expression one location step at a time, printing out the results, and you'll soon find the problem. In this case, it appears to be because the first tr child of table[1] does not have a td child (only th children). So the xpath() selects nothing, the extract() returns an empty list, and you try to reference the first item in that empty list, giving the "list index out of range" error.
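The "one location step at a time" check might look something like the sketch below. Again this uses xml.etree.ElementTree on a made-up table rather than Scrapy and the real page, and step_counts is a hypothetical helper name, so treat it as an illustration of the debugging approach only.

```python
import xml.etree.ElementTree as ET

# Hypothetical table: a header row of th cells followed by one data row.
TABLE = """
<table>
  <tr><th>Tipster</th><th>Won</th></tr>
  <tr><td class="tipst"><a href="/t/alice">alice</a></td><td>10</td></tr>
</table>
"""

table = ET.fromstring(TABLE)
header_row, data_row = table.findall("tr")


def step_counts(row):
    """Return how many nodes each successively longer path matches for a row."""
    steps = ["td", 'td[@class="tipst"]', 'td[@class="tipst"]/a']
    return [(step, len(row.findall(step))) for step in steps]


header_counts = step_counts(header_row)
data_counts = step_counts(data_row)
print(header_counts)  # every step matches 0 nodes: the header row has no td
print(data_counts)

# Indexing [0] on the empty result is what raises the error.
try:
    header_row.findall('td[@class="tipst"]/a')[0]
    error = None
except IndexError as exc:
    error = str(exc)
print(error)  # list index out of range
```

In Scrapy itself, extract_first() returns None instead of raising when nothing matches, which can make this kind of row easier to spot and skip.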

To fix this, you could change your for loop's XPath expression to loop only over those tr elements that have td children:

    for sel in response.xpath('//table[1]/tr[td]'):

You could get fancier, requiring a td of the right class:

    for sel in response.xpath('//table[1]/tr[td[@class="tipst"]]'):
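As a rough sketch of how the filtered loop behaves, the example below again uses xml.etree.ElementTree and an invented table in place of Scrapy and the real page. ElementTree implements only part of XPath, so the sketch sticks to the simpler tr[td] filter. It also shows why the paths inside the loop should keep their td step: the a element is a child of the td, not of the tr, so a bare a path relative to the row matches nothing. One loop is enough, because each cell is addressed relative to the same row.

```python
import xml.etree.ElementTree as ET

# Hypothetical table: header row of th cells, then two data rows.
TABLE = """
<table>
  <tr><th>Tipster</th><th>Won</th></tr>
  <tr><td class="tipst"><a href="/t/alice">alice</a></td><td>10</td></tr>
  <tr><td class="tipst"><a href="/t/bob">bob</a></td><td>7</td></tr>
</table>
"""

table = ET.fromstring(TABLE)

all_rows = table.findall("tr")       # includes the th-only header row
data_rows = table.findall("tr[td]")  # only rows that have a td child

tipsters = []
for row in data_rows:
    # Paths stay relative to the row and still go through the td.
    link = row.find('td[@class="tipst"]/a')
    won = int(row.find("td[2]").text)
    tipsters.append((link.text, link.get("href"), won))

# A bare "a" relative to the row finds nothing: a is a grandchild of tr.
direct_links = [len(row.findall("a")) for row in data_rows]

print(len(all_rows), len(data_rows))  # 3 2
print(tipsters)
print(direct_links)                   # [0, 0]
```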
