python - Extracting specific src attributes from script tags
I want to get the JS file names from the input content that contain the substring jquery, using re.
This is my code:
Step 1: extract the JS file names from the content.
>>> data = """ <script type="text/javascript" src="js/jquery-1.9.1.min.js"/>
... <script type="text/javascript" src="js/jquery-migrate-1.2.1.min.js"/>
... <script type="text/javascript" src="js/jquery-ui.min.js"/>
... <script type="text/javascript" src="js/abc_bsub.js"/>
... <script type="text/javascript" src="js/abc_core.js"/>
... <script type="text/javascript" src="js/abc_explore.js"/>
... <script type="text/javascript" src="js/abc_qaa.js"/>"""
>>> import re
>>> re.findall('src="js/([^"]+)"', data)
['jquery-1.9.1.min.js', 'jquery-migrate-1.2.1.min.js', 'jquery-ui.min.js', 'abc_bsub.js', 'abc_core.js', 'abc_explore.js', 'abc_qaa.js']
Step 2: keep only the JS files whose names contain the substring jquery.
>>> [ii for ii in re.findall('src="js/([^"]+)"', data) if "jquery" in ii]
['jquery-1.9.1.min.js', 'jquery-migrate-1.2.1.min.js', 'jquery-ui.min.js']
Can I fold step 2 into step 1, so that the re pattern alone gives this result?
Sure you can. One way is to use
re.findall('src="js/([^"]*jquery[^"]*)"', data)
This matches everything after "js/ up to the nearest " if that text contains jquery anywhere. If you know more about the position of jquery (for example, if it's always at the start), you can adjust the regex accordingly.
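As a rough sketch of that adjustment, assuming jquery is known to sit at the very start of the file name, the leading [^"]* can simply be dropped. The input below is a shortened copy of the question's data plus one hypothetical file, abc_jquery_plugin.js, added only to show where the two patterns differ; the variable names anywhere and at_start are likewise illustrative.

```python
import re

# Shortened copy of the question's input. abc_jquery_plugin.js is a
# hypothetical file name added here to show the difference.
data = '''<script type="text/javascript" src="js/jquery-1.9.1.min.js"/>
<script type="text/javascript" src="js/abc_jquery_plugin.js"/>
<script type="text/javascript" src="js/abc_core.js"/>'''

# jquery anywhere in the file name (the pattern from the answer).
anywhere = re.findall(r'src="js/([^"]*jquery[^"]*)"', data)

# Assumption: jquery must start the file name, so nothing may precede it.
at_start = re.findall(r'src="js/(jquery[^"]*)"', data)

print(anywhere)  # ['jquery-1.9.1.min.js', 'abc_jquery_plugin.js']
print(at_start)  # ['jquery-1.9.1.min.js']
```

Because [^"] cannot cross a double quote, neither pattern can run past the end of the src value into the next tag.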
If you want to make sure that jquery is not directly surrounded by other alphanumeric characters, use word boundary anchors:
re.findall(r'src="js/([^"]*\bjquery\b[^"]*)"', data)
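To see what the word boundaries change, here is a small comparison. The file names are hypothetical and were picked only to exercise the edge cases; loose and strict are illustrative variable names. Note that in Python's re module the underscore counts as a word character, so \b does not match between _ and j.

```python
import re

# Hypothetical file names (not from the question) chosen to hit edge cases.
data = '''<script src="js/jquery-ui.min.js"/>
<script src="js/myjquery.js"/>
<script src="js/jquerytools.js"/>
<script src="js/abc_jquery.js"/>'''

# Without word boundaries: any file name containing jquery matches.
loose = re.findall(r'src="js/([^"]*jquery[^"]*)"', data)

# With word boundaries: jquery must not touch letters, digits, or underscores.
strict = re.findall(r'src="js/([^"]*\bjquery\b[^"]*)"', data)

print(loose)   # all four file names
print(strict)  # ['jquery-ui.min.js']
```

If names like abc_jquery.js should also count, \b is too strict and an explicit character class around jquery would probably be the better fit.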