python - Stuck scraping a specific table with scrapy
The table I'm trying to scrape can be found here: http://www.betdistrict.com/tipsters

I'm after the table titled 'June Stats'.

Here's my spider:
from __future__ import division
from decimal import *
import scrapy
import urlparse
from ttscrape.items import TtscrapeItem

class BetDistrictSpider(scrapy.Spider):
    name = "betdistrict"
    allowed_domains = ["betdistrict.com"]
    start_urls = ["http://www.betdistrict.com/tipsters"]

    def parse(self, response):
        for sel in response.xpath('//table[1]/tr'):
            item = TtscrapeItem()
            name = sel.xpath('td[@class="tipst"]/a/text()').extract()[0]
            url = sel.xpath('td[@class="tipst"]/a/@href').extract()[0]
            tipster = '<a href="' + url + '" target="_blank" rel="nofollow">' + name + '</a>'
            item['tipster'] = tipster
            won = sel.xpath('td[2]/text()').extract()[0]
            lost = sel.xpath('td[3]/text()').extract()[0]
            void = sel.xpath('td[4]/text()').extract()[0]
            tips = int(won) + int(void) + int(lost)
            item['tips'] = tips
            strike = Decimal(int(won) / tips) * 100
            strike = str(round(strike, 2))
            item['strike'] = [strike + "%"]
            profit = sel.xpath('//td[5]/text()').extract()[0]
            if profit[0] in ['+']:
                profit = profit[1:]
            item['profit'] = profit
            yield_str = sel.xpath('//td[6]/text()').extract()[0]
            yield_str = yield_str.replace(' ', '')
            if yield_str[0] in ['+']:
                yield_str = yield_str[1:]
            item['yield'] = '<span style="color: #40aa40">' + yield_str + '%</span>'
            item['site'] = 'bet district'
            yield item
This gives me a 'list index out of range' error on the first variable (name).

However, when I rewrite the XPath selectors to start with //, e.g.:
name = sel.xpath('//td[@class="tipst"]/a/text()').extract()[0]
the spider runs, but it scrapes the first tipster over and over again.

I think this has to do with the table not having a thead, and instead containing th tags within the first tr of the tbody.
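To check that hunch I've been poking around in the scrapy shell along these lines (a rough sketch; the comments describe what I expect to see, not verified output):

# scrapy shell "http://www.betdistrict.com/tipsters"
len(response.xpath('//table[1]/tr[1]/th'))   # > 0 if the header row uses th cells
len(response.xpath('//table[1]/tr[1]/td'))   # 0 if that first row has no td cells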
Any help appreciated.
----------edit----------
In response to Lars' suggestions:

I've tried using the XPath you suggested, but I still get a list index out of range error:
from __future__ import division
from decimal import *
import scrapy
import urlparse
from ttscrape.items import TtscrapeItem

class BetDistrictSpider(scrapy.Spider):
    name = "betdistrict"
    allowed_domains = ["betdistrict.com"]
    start_urls = ["http://www.betdistrict.com/tipsters"]

    def parse(self, response):
        for sel in response.xpath('//table[1]/tr[td[@class="tipst"]]'):
            item = TtscrapeItem()
            name = sel.xpath('a/text()').extract()[0]
            url = sel.xpath('a/@href').extract()[0]
            tipster = '<a href="' + url + '" target="_blank" rel="nofollow">' + name + '</a>'
            item['tipster'] = tipster
            yield item
Also, am I right in assuming that, doing things this way, multiple loops are required, since not all the cells have the same class?

I've tried doing things without a loop, but in that case it once again scrapes the first tipster multiple times :s

Thanks
When you say

name = sel.xpath('td[@class="tipst"]/a/text()').extract()[0]

the XPath expression starts with td, and is therefore relative to the context node you have in the variable sel (i.e. the tr element in the set of tr elements your for loop iterates over).
However, when you say

name = sel.xpath('//td[@class="tipst"]/a/text()').extract()[0]

the XPath expression starts with //td, i.e. it selects td elements anywhere in the document; it is not relative to sel, so the results are the same on every iteration of the for loop. That's why it scrapes the first tipster over and over again.
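To see the difference concretely, here's a minimal, self-contained sketch (illustrative only: the HTML and the tipster names are made up, not taken from the real page):

from scrapy.http import HtmlResponse

html = b"""
<table>
  <tr><th>Tipster</th><th>Won</th></tr>
  <tr><td class="tipst"><a href="/alice">Alice</a></td><td>10</td></tr>
  <tr><td class="tipst"><a href="/bob">Bob</a></td><td>7</td></tr>
</table>
"""
response = HtmlResponse(url="http://example.com", body=html, encoding="utf-8")

for sel in response.xpath('//table[1]/tr'):
    # relative expression: searched within the current row only
    rel = sel.xpath('td[@class="tipst"]/a/text()').extract()
    # absolute expression: searched from the document root, identical on every iteration
    absolute = sel.xpath('//td[@class="tipst"]/a/text()').extract()
    print(rel, absolute)

On the header row the relative expression returns an empty list (there are no td cells), while the absolute expression returns every tipster name in the document on every pass.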
Why does the first XPath expression fail with a list index out of range error? Try taking the XPath expression one location step at a time, printing out the results, and you'll find the problem. In this case, it appears to be because the first tr child of table[1] does not have a td child (only th children). So xpath() selects nothing, extract() returns an empty list, and you try to reference the first item of an empty list, giving the list index out of range error.
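In the scrapy shell, that step-by-step check might look something like this (a sketch; the comments describe what you should expect to see if the header row really uses th cells, not captured output):

# scrapy shell "http://www.betdistrict.com/tipsters"
first_row = response.xpath('//table[1]/tr')[0]
first_row.xpath('th').extract()                          # the header cells, if any
first_row.xpath('td[@class="tipst"]').extract()          # expect [] for the header row
first_row.xpath('td[@class="tipst"]/a/text()').extract() # still []
# calling .extract()[0] on that empty list is what raises IndexError: list index out of range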
To fix this, change the loop's XPath expression so that it only loops over tr elements that have td children:
for sel in response.xpath('//table[1]/tr[td]'):
You could get fancier, requiring a td of the right class:
for sel in response.xpath('//table[1]/tr[td[@class="tipst"]]'):
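Putting that back into your spider, the loop body can keep the same row-relative expressions you already had (a sketch of the relevant part only, assuming the same TtscrapeItem fields as in your question):

def parse(self, response):
    # only iterate over rows that actually contain a td of the right class
    for sel in response.xpath('//table[1]/tr[td[@class="tipst"]]'):
        item = TtscrapeItem()
        # sel is still the tr element, so keep the td step in the relative expression
        name = sel.xpath('td[@class="tipst"]/a/text()').extract()[0]
        url = sel.xpath('td[@class="tipst"]/a/@href').extract()[0]
        item['tipster'] = '<a href="' + url + '" target="_blank" rel="nofollow">' + name + '</a>'
        yield item

Note that only the predicate on the loop expression changes; a single loop is enough, and the relative expressions inside it stay exactly as in your original spider.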