Everything matched with XPath comes back as a list and cannot be decoded
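Tools
Python 3.9
PyCharm 2020.4 Community
I am having trouble decoding the HTML data after instantiating it with etree.HTML. How should I decode the matched content and save it?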
import requests
from lxml import etree
import time
import re


class Movie(object):
    def __init__(self,times):
        self.times = int(times)
        self.headers = headers = {
            'User-Agent': (
                'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_1_0) AppleWebKit/'
                '537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36')
        }

    def get_html(self):
        for i in range(self.times):
            url = ('https://movie.douban.com/subject/30171424/reviews?start={}'
                   .format(i*20))
            print('第{}页'
                  .format(i+1))
            print(url)
            request = requests.get(url=url,headers=self.headers).text
            html_element = etree.HTML(request)
            for x in html_element:
                tree = etree.tostring(x)
                get = tree.xpath('//div/header/a/text()')
                print(get)


movie = Movie(1)
movie.get_html()
Error output
/usr/local/bin/python3.9 /Users/scrooge/PycharmProjects/pythonProject/拆弹专家评分爬虫.py
第1页
https://movie.douban.com/subject/30171424/reviews?start=0
Traceback (most recent call last):
  File "/Users/scrooge/PycharmProjects/pythonProject/拆弹专家评分爬虫.py", line 30, in <module>
    movie.get_html()
  File "/Users/scrooge/PycharmProjects/pythonProject/拆弹专家评分爬虫.py", line 26, in get_html
    get = tree.xpath('//div/header/a/text()')
AttributeError: 'bytes' object has no attribute 'xpath'
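There seems to be a misunderstanding of the lxml.etree.HTML parser here. It already returns an element, so xpath can be called on it directly, and the resulting list, authors, is the result you want. The xpath itself still needs some adjusting, though, because it actually picks up a lot of unwanted newlines and spaces.

tree is an encoded string, that is, bytes (this is what etree.tostring returns), so it has no xpath attribute or method. Try calling xpath on html_element instead.

The a node contains multiple text nodes. Inspect the DOM structure and use a suitable XPath expression.

Thanks.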
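
A minimal sketch of what the answers above describe, offered as one possible way to do it rather than the definitive fix. It assumes a small hand-written HTML sample in place of the live Douban page (the real markup may differ, so check the DOM first), and the names SAMPLE_HTML, extract_authors and save_lines are made up for illustration. It calls xpath directly on the element returned by etree.HTML, then strips whitespace and drops the blank text nodes.

```python
from lxml import etree

# Hypothetical, simplified stand-in for one page of reviews. The real
# Douban markup may differ, so adjust the XPath after inspecting the DOM.
SAMPLE_HTML = """
<html><body>
  <div class="review-item">
    <header>
      <a class="avator" href="#">
        <img src="a.png"/>
      </a>
      <a class="name" href="#">  alice
      </a>
    </header>
  </div>
  <div class="review-item">
    <header>
      <a class="avator" href="#"><img src="b.png"/></a>
      <a class="name" href="#">bob</a>
    </header>
  </div>
</body></html>
"""


def extract_authors(html_text):
    """Return the stripped, non-empty link texts found under div/header."""
    # etree.HTML returns an Element, which already has an xpath() method,
    # so there is no need to loop over it or call etree.tostring() first.
    root = etree.HTML(html_text)
    # text() yields every text node, including whitespace-only ones.
    raw = root.xpath('//div/header/a/text()')
    # The results are already str objects, so nothing has to be decoded.
    # Strip each one and drop the entries that were only whitespace.
    return [text.strip() for text in raw if text.strip()]


def save_lines(lines, path):
    """Write one entry per line as UTF-8 text."""
    with open(path, 'w', encoding='utf-8') as f:
        f.write('\n'.join(lines))


authors = extract_authors(SAMPLE_HTML)
print(authors)
```

To apply this to the real page, pass the response text from requests.get(...) to extract_authors, then call something like save_lines(authors, 'authors.txt') (the file name is only an example). If the whitespace filtering is not enough, a narrower XPath that targets only the link holding the user name is probably the cleaner fix.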