python - Stuck scraping a specific table with scrapy
The table I'm trying to scrape can be found here: http://www.betdistrict.com/tipsters
I'm after the table titled 'June Stats'.
Here's my spider:
from __future__ import division
from decimal import *
import scrapy
import urlparse
from ttscrape.items import TTScrapeItem


class BetDistrictSpider(scrapy.Spider):
    name = "betdistrict"
    allowed_domains = ["betdistrict.com"]
    start_urls = ["http://www.betdistrict.com/tipsters"]

    def parse(self, response):
        for sel in response.xpath('//table[1]/tr'):
            item = TTScrapeItem()
            name = sel.xpath('td[@class="tipst"]/a/text()').extract()[0]
            url = sel.xpath('td[@class="tipst"]/a/@href').extract()[0]
            tipster = '<a href="' + url + '" target="_blank" rel="nofollow">' + name + '</a>'
            item['tipster'] = tipster
            won = sel.xpath('td[2]/text()').extract()[0]
            lost = sel.xpath('td[3]/text()').extract()[0]
            void = sel.xpath('td[4]/text()').extract()[0]
            tips = int(won) + int(void) + int(lost)
            item['tips'] = tips
            strike = Decimal(int(won) / tips) * 100
            strike = str(round(strike, 2))
            item['strike'] = [strike + "%"]
            profit = sel.xpath('//td[5]/text()').extract()[0]
            if profit[0] in ['+']:
                profit = profit[1:]
            item['profit'] = profit
            yield_str = sel.xpath('//td[6]/text()').extract()[0]
            yield_str = yield_str.replace(' ', '')
            if yield_str[0] in ['+']:
                yield_str = yield_str[1:]
            item['yield'] = '<span style="color: #40aa40">' + yield_str + '%</span>'
            item['site'] = 'bet district'
            yield item

This gives me a 'list index out of range' error on the first variable (name).
However, when I rewrite the XPath selectors to start with //, e.g.:

name = sel.xpath('//td[@class="tipst"]/a/text()').extract()[0]

the spider runs, but it just scrapes the first tipster over and over again.
I think this has to do with the table not having a thead, and instead containing th tags within the first tr of the tbody.
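Here's a minimal, self-contained way to reproduce what I think is going on (the markup below is invented to mimic that structure, not taken from the real page):

from scrapy.selector import Selector

html = """
<table>
  <tr><th>Tipster</th><th>Won</th></tr>
  <tr><td class="tipst"><a href="/one">Tipster One</a></td><td>10</td></tr>
</table>
"""

for sel in Selector(text=html).xpath('//table[1]/tr'):
    # On the header row (th cells only) this returns [], so [0] would raise IndexError.
    print(sel.xpath('td[@class="tipst"]/a/text()').extract())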
Any help appreciated.
----------edit----------
In response to Lars's suggestions:
I've tried using what you suggested but I still get the 'list index out of range' error:
from __future__ import division
from decimal import *
import scrapy
import urlparse
from ttscrape.items import TTScrapeItem


class BetDistrictSpider(scrapy.Spider):
    name = "betdistrict"
    allowed_domains = ["betdistrict.com"]
    start_urls = ["http://www.betdistrict.com/tipsters"]

    def parse(self, response):
        for sel in response.xpath('//table[1]/tr[td[@class="tipst"]]'):
            item = TTScrapeItem()
            name = sel.xpath('a/text()').extract()[0]
            url = sel.xpath('a/@href').extract()[0]
            tipster = '<a href="' + url + '" target="_blank" rel="nofollow">' + name + '</a>'
            item['tipster'] = tipster
            yield item

Also, I'm assuming that when doing things this way, multiple loops are required, since not all cells have the same class?
I've also tried doing things without a loop, but in that case it once again scrapes the first tipster multiple times :s
Thanks
When you say

name = sel.xpath('td[@class="tipst"]/a/text()').extract()[0]

the XPath expression starts with td and is therefore relative to the context node you have in the variable sel (i.e. the tr element in the set of tr elements your for loop iterates over).
However, when you say

name = sel.xpath('//td[@class="tipst"]/a/text()').extract()[0]

the XPath expression starts with //td, i.e. it selects td elements from anywhere in the document; it is not relative to sel, so the results are the same on every iteration of the for loop. That's why it scrapes the first tipster over and over again.
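You can see the difference with a small self-contained sketch (the table here is invented for illustration; only the relative/absolute distinction matters):

from scrapy.selector import Selector

html = """
<table>
  <tr><td class="tipst"><a href="/a">Alpha</a></td></tr>
  <tr><td class="tipst"><a href="/b">Beta</a></td></tr>
</table>
"""

for sel in Selector(text=html).xpath('//table[1]/tr'):
    # Relative expression: a different result for each row.
    print(sel.xpath('td[@class="tipst"]/a/text()').extract())    # ['Alpha'], then ['Beta']
    # Absolute expression: the same result on every iteration.
    print(sel.xpath('//td[@class="tipst"]/a/text()').extract())  # ['Alpha', 'Beta'] both times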
Why does the first XPath expression fail with the 'list index out of range' error? Try taking the XPath expression one location step at a time, printing out the results, and you'll find the problem. In this case, it appears to be because the first tr child of table[1] does not have a td child (only th children). So xpath() selects nothing, extract() returns an empty list, and you try to reference the first item of an empty list, giving the 'list index out of range' error.
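For example, you could poke at it one step at a time in scrapy shell (a sketch; response is what the shell gives you for http://www.betdistrict.com/tipsters):

first_row = response.xpath('//table[1]/tr')[0]
print(first_row.extract())                                       # the row markup: th cells, no td
print(first_row.xpath('td').extract())                           # [] -- no td children
print(first_row.xpath('td[@class="tipst"]/a/text()').extract())  # [] -- so [0] raises IndexError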
To fix this, change the loop's XPath expression so that it loops only over tr elements that have td children:

for sel in response.xpath('//table[1]/tr[td]'):

You could get fancier and require a td of the right class:
for sel in response.xpath('//table[1]/tr[td[@class="tipst"]]'):
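Putting the two points together, the parse method could look roughly like this (just a sketch: it reuses TTScrapeItem and the field names from your question, keeps every inner expression relative to the row, and also drops the leading // from the td[5] and td[6] expressions, which have the same whole-document problem):

def parse(self, response):
    # Only data rows: tr elements that contain a td with class "tipst".
    for sel in response.xpath('//table[1]/tr[td[@class="tipst"]]'):
        item = TTScrapeItem()
        # All inner expressions are relative to the current row (no leading //).
        name = sel.xpath('td[@class="tipst"]/a/text()').extract()[0]
        url = sel.xpath('td[@class="tipst"]/a/@href').extract()[0]
        item['tipster'] = ('<a href="' + url + '" target="_blank" rel="nofollow">'
                           + name + '</a>')
        won = sel.xpath('td[2]/text()').extract()[0]
        lost = sel.xpath('td[3]/text()').extract()[0]
        void = sel.xpath('td[4]/text()').extract()[0]
        item['tips'] = int(won) + int(void) + int(lost)
        # lstrip('+') replaces the "if profit[0] in ['+']" check from the question.
        item['profit'] = sel.xpath('td[5]/text()').extract()[0].lstrip('+')
        item['yield'] = sel.xpath('td[6]/text()').extract()[0].replace(' ', '').lstrip('+')
        yield item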