python - How to read an ASP.NET page with BeautifulSoup?

- May 15, 2012

I am trying to scrape data from a webpage using Beautiful Soup.

I am running into problems when I try to convert the HTML document into a BeautifulSoup object.

When I run this code:

soup = BeautifulSoup(html_doc)

the error message I am getting is:

SyntaxError: Non-ASCII character '\xa9' in file c:/users/mlee/pycharmprojects/bstest/htmlparse.py on line 683, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

I believe this is because there are ASP.NET ViewState objects in the HTML, which are Base64 encoded.

Is there a suggested workaround, or do I have to use a different tool?

Also, I am interested in getting the JavaScript-generated portions of the text. Is there a better way of doing this?

Thank you!

Put the header

#!/usr/bin/env python
# -*- coding: utf-8 -*-

on the first two lines of your htmlparse.py file, and make sure PyCharm saves the file as UTF-8 encoded.

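This has nothing to do with ASP.NET or ViewState. You have non-ASCII characters in your source file, and Python refuses to read them unless the file declares its encoding.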
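The error points at line 683 of htmlparse.py, which suggests (an assumption, it is not stated in the question) that the HTML was pasted into the script itself. A minimal sketch of an alternative is to keep the HTML in a separate file and read it with an explicit encoding, so the script contains no non-ASCII characters at all. The helper name load_html and the file name page.html are hypothetical, and the sample page is written out here only so the snippet runs on its own.

```python
import io
import os
import shutil
import tempfile


def load_html(path, encoding="utf-8"):
    """Read an HTML file from disk and return its contents as text."""
    with io.open(path, "r", encoding=encoding) as handle:
        return handle.read()


# Sample page containing a non-ASCII character (the copyright sign),
# written with an escape so this source file itself stays pure ASCII.
sample = u"<html><head><title>Demo</title></head><body>\u00a9 2012</body></html>"

tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "page.html")  # hypothetical file name
with io.open(sample_path, "w", encoding="utf-8") as handle:
    handle.write(sample)

# Decoding happens when the file is read, not when the script is parsed.
html_doc = load_html(sample_path)
print(len(html_doc))

shutil.rmtree(tmp_dir)  # remove the temporary sample file
```

The resulting html_doc string can then be passed to BeautifulSoup(html_doc) as in the question.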

"I am interested in getting the JavaScript-generated portions of the text. Is there a better way of doing this?"

You might want to use Selenium WebDriver with its Python bindings for this task. Another option is PhantomJS.
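A rough sketch of the Selenium approach follows. It assumes the selenium package and a Firefox driver are installed. The function names and the choice of Firefox are illustrative and do not come from the answer. page_source returns the DOM as it stands when it is called, so pages that fill in content late may also need an explicit wait, which is not shown here.

```python
def fetch_rendered_html(driver, url):
    """Load url in an already-started WebDriver and return the current DOM as HTML."""
    driver.get(url)
    return driver.page_source


def fetch_with_firefox(url):
    """Start Firefox through Selenium, fetch the rendered page, then close the browser."""
    # Imported here so this module can be loaded without Selenium installed.
    from selenium import webdriver

    driver = webdriver.Firefox()
    try:
        return fetch_rendered_html(driver, url)
    finally:
        # Always shut the browser down, even if loading the page fails.
        driver.quit()
```

The HTML string returned by fetch_with_firefox can be handed to BeautifulSoup in the same way as a static page.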
