Content matched with XPath comes back as a list and cannot be decoded

Tools
Python 3.9
PyCharm 2020.4 Community

I am having trouble decoding the parsed HTML data. How should I decode the matched content and save it?

import requests
from lxml import etree
import time
import re

class Movie(object):
    def __init__(self,times):
        self.times = int(times)
        self.headers = headers = {
            'User-Agent': (
                'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_1_0) AppleWebKit/'
                '537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36')
        }

    def get_html(self):
        for i in range(self.times):
            url = ('https://movie.douban.com/subject/30171424/reviews?start={}'
                   .format(i*20))
            print('第{}页'
                  .format(i+1))
            print(url)
            request = requests.get(url=url,headers=self.headers).text
            html_element = etree.HTML(request)
            for x in html_element:
                tree = etree.tostring(x)
                get = tree.xpath('//div/header/a/text()')
                print(get)

movie = Movie(1)
movie.get_html()

Error output

/usr/local/bin/python3.9 /Users/scrooge/PycharmProjects/pythonProject/拆弹专家评分爬虫.py
第1页
https://movie.douban.com/subject/30171424/reviews?start=0
Traceback (most recent call last):
  File "/Users/scrooge/PycharmProjects/pythonProject/拆弹专家评分爬虫.py", line 30, in <module>
    movie.get_html()
  File "/Users/scrooge/PycharmProjects/pythonProject/拆弹专家评分爬虫.py", line 26, in get_html
    get = tree.xpath('//div/header/a/text()')
AttributeError: 'bytes' object has no attribute 'xpath'
Replies: 4

There seems to be a misunderstanding of how the lxml.etree.HTML parser works.

html_element = etree.HTML(request)
authors = html_element.xpath('//div/header/a/text()')

authors is the result you want.

Of course, the XPath here still needs some adjustment, because it actually picks up a lot of unwanted newlines and spaces.
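One way to deal with the stray whitespace is sketched below. The markup in SAMPLE_HTML is a made-up stand-in for the review page, not the real Douban structure, so treat the class names and layout as assumptions. The point is only that xpath() with text() already returns str values, which can be stripped and filtered without any decoding step.

```python
from lxml import etree

# Hypothetical sample markup, not the real Douban page.
SAMPLE_HTML = """
<div>
  <header>
    <a class="avatar" href="#">
      <img src="avatar.png"/>
    </a>
    <a class="name" href="#">  Alice  </a>
  </header>
</div>
"""

html_element = etree.HTML(SAMPLE_HTML)

# For a text() query, xpath() returns a list of str values.
raw = html_element.xpath('//div/header/a/text()')

# Strip each piece and drop the ones that were only whitespace.
authors = [t.strip() for t in raw if t.strip()]
print(authors)  # ['Alice']
```

The cleaned list can then be written out with an ordinary open(path, 'w', encoding='utf-8') if it needs to be saved.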

3 years ago
Scrooge (original poster), 3 years ago
Jason990420
tree = etree.tostring(x)

tree is an encoded string, i.e. bytes, so it has no xpath attribute or method. Try this:

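            # Run the XPath query on the parsed element, not on serialized bytes.
            for node in html_element.xpath('//div/header/a'):
                # node.text is None when the <a> starts with a child element.
                text = (node.text or '').strip()
                if text:
                    print(text)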
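To make the difference concrete, here is a small sketch using a made-up one-line document in place of the downloaded page. It shows that etree.HTML() gives back an element that supports xpath(), while etree.tostring() gives back bytes. Note also that iterating over an element, as in the original loop, walks its child elements, so no loop is needed just to query the page.

```python
from lxml import etree

# Hypothetical sample markup, standing in for the downloaded page.
html_element = etree.HTML(
    '<div><header><a class="name">Alice</a></header></div>'
)

# etree.HTML() returns an element, which has an xpath() method.
print(hasattr(html_element, 'xpath'))   # True
print(html_element.xpath('//a/text()'))

# etree.tostring() serializes an element to bytes by default.
serialized = etree.tostring(html_element)
print(type(serialized).__name__)        # bytes
print(hasattr(serialized, 'xpath'))     # False

# If the markup is wanted as str, decode the bytes
# or ask tostring() for unicode output directly.
as_text = serialized.decode('utf-8')
as_unicode = etree.tostring(html_element, encoding='unicode')
```

Serializing is only useful for printing or saving the markup; queries belong on the element itself.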
3 years ago
pardon110

The a node contains multiple text nodes. Inspect the DOM structure and use a suitable XPath expression:

'//div/header/a[@class="name"]/text()'
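As a rough illustration of why the predicate helps, the sketch below compares the two expressions on invented markup. The class names other than "name" and the link texts are assumptions made up for the example, so check the real page in the browser's inspector before relying on them.

```python
from lxml import etree

# Hypothetical markup: two links in the header, only one holds the author name.
SAMPLE_HTML = """
<div>
  <header>
    <a class="name" href="#">Alice</a>
    <a class="report" href="#">Report</a>
  </header>
</div>
"""

html_element = etree.HTML(SAMPLE_HTML)

# Without a predicate, the text of every <a> under the header is returned.
all_link_text = html_element.xpath('//div/header/a/text()')

# With the class predicate, only the author link matches.
names = html_element.xpath('//div/header/a[@class="name"]/text()')

print(all_link_text)  # ['Alice', 'Report']
print(names)          # ['Alice']
```

Keep in mind that @class="name" compares the whole attribute value, so an element with several classes would need something like contains(@class, "name") instead.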

3 years ago
