Python crawler gets different page source on Windows and Linux

Versions

OS: Windows 10
Editor: PyCharm
Python: 3.9.1
chrome: 89.0.4389.114
ChromeDriver 89.0.4389.23
bs4: 0.0.1
requestium: 0.1.9
selenium: 3.141.0

Code

from bs4 import BeautifulSoup
from requestium import Session

def main():
    url = "https://www.zhibo8.cc/"
    s = Session(webdriver_path="D:\\venv\\Scripts\\chromedriver.exe",
                browser='chrome',
                default_timeout=15,
                webdriver_options={'arguments': ['headless']})
    s.driver.get(url)
    # Click the second navigation link
    s.driver.find_element_by_css_selector('.c_nav a:nth-child(2)').click()
    # Get the page source
    html = s.driver.page_source
    content = BeautifulSoup(html, 'html.parser')
    lis = content.find_all(style="display: list-item;")
    for line in lis:
        print(line)
        exit()

if __name__ == '__main__':
    main()

Expected output

<li data-time="2021-04-08 15:00" id="saishi625115" label="中国女足,韩国女足,奥女足,足球" style="display: list-item;">15:00 <b>奥女足附加赛首回合 韩国女足 <img src="//duihui.duoduocdn.com/zuqiu/hanguonvzu1.png"/> <span style="font-weight: bold;">1 - 1</span> <img src="//duihui.duoduocdn.com/zuqiu/zhongguo.png"/> 中国女足</b> <a href="/zhibo/zuqiu/2021/match625115v.htm" target="_blank">CCTV5 足球直播</a> <a href="https://www.zhibo8.cc/zhibo/zuqiu/2021/match625115v.htm" target="_blank">文字</a> <a href="//www.zhibo8.cc/shouji.htm" target="_blank">手机看直播</a> <a href="http://www.188bifen.com/" target="_blank">比分</a> <a href="https://www.zhibo8.cc/zhibo/zuqiu/2021/match625115v.htm?redirect=animate" target="_blank">动画</a> <a href="http://nbaftx.wanjiashe.com/game.php?sid=56" target="_blank">NBA范特西56服</a> </li>

=============================

Problem description:

After moving the code to CentOS, the li tags no longer have the id and data-time attributes.

Versions

OS: CentOS 7.6
Editor: Vim
Python: 3.6.8
chrome: 89.0.4389.114
ChromeDriver 89.0.4389.23
bs4: 0.0.1
requestium: 0.1.9
selenium: 3.141.0

Code

from bs4 import BeautifulSoup
from requestium import Session

def main():
    url = "https://www.zhibo8.cc/"
    s = Session(webdriver_path="/usr/local/src/chromedriver",
                browser='chrome',
                default_timeout=15,
                webdriver_options={'arguments': ['headless','no-sandbox']})
    s.driver.get(url)
    s.driver.find_element_by_css_selector('.c_nav a:nth-child(2)').click()
    html = s.driver.page_source
    content = BeautifulSoup(html, 'html.parser')
    lis = content.find_all(style="display: list-item;")
    for line in lis:
        print(line)
        exit()

if __name__ == '__main__':
    main()

Problematic output

<li label="中国女足,韩国女足,奥女足,足球" style="display: list-item;">15:00 <b>奥女足附加赛首回合 韩国女足 <img src="http://duihui.duoduocdn.com/zuqiu/hanguonvzu1.png"/> - <img src="http://duihui.duoduocdn.com/zuqiu/zhongguo.png"/> 中国女足</b> <a href="/zhibo/zuqiu/2021/match625115v.htm" target="_blank">CCTV5 足球直播</a> <a href="https://www.zhibo8.cc/zhibo/zuqiu/2021/match625115v.htm" target="_blank">文字</a> <a href="http://www.188bifen.com/" target="_blank">比分</a> <a href="http://nbaftx.wanjiashe.com/game.php?sid=56" target="_blank">NBA范特西56服</a> </li>
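
To see exactly which attributes differ between the two outputs, the first li of each can be compared mechanically. The following is a minimal sketch using only the standard library; the helper names (`first_tag_attrs`, `missing_attrs`) are made up for illustration and are not part of requestium or bs4.

```python
from html.parser import HTMLParser


class FirstTagAttrs(HTMLParser):
    """Record the attributes of the first start tag with the wanted name."""

    def __init__(self, wanted_tag):
        super().__init__()
        self.wanted_tag = wanted_tag
        self.found_attrs = None

    def handle_starttag(self, tag, attrs):
        # Only the first matching tag is kept; later ones are ignored.
        if tag == self.wanted_tag and self.found_attrs is None:
            self.found_attrs = dict(attrs)


def first_tag_attrs(html, tag="li"):
    """Return the attributes of the first <tag> in html as a dict."""
    parser = FirstTagAttrs(tag)
    parser.feed(html)
    return parser.found_attrs or {}


def missing_attrs(expected_html, actual_html, tag="li"):
    """Attribute names present on the expected tag but absent on the actual one."""
    expected = first_tag_attrs(expected_html, tag)
    actual = first_tag_attrs(actual_html, tag)
    return sorted(set(expected) - set(actual))
```

Feeding the Windows output as `expected_html` and the CentOS output as `actual_html` should list the attribute names that were lost.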

requestium bs4 BeautifulSoup python

Replies: 3

charliecen

262 reputation

I originally used PhantomJS, then switched to Selenium with Chrome or Firefox. It works correctly on Windows but has problems on CentOS.

Commented 5 years ago

SilenceHL

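You could check whether the corresponding element on the zhibo8 page actually has the id and data-time attributes when the page is opened on CentOS.

Moderator, 439 reputation

@SilenceHL I have checked, and they are indeed missing. I then spoofed the user-agent as Windows NT, but I still could not get them.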

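zhibo8 may serve a different variant of the page to Linux clients, which would explain the two missing attributes. The user-agent request header is probably not the only signal; the site may identify the platform in other ways, such as the browser version. If you really need these two attributes, you could dig further into making the CentOS browser reproduce the Windows result. Overall, though, I did not see any major difference between the two outputs. A crawler has to work from what the page actually contains; it cannot extract something that is not there.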
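
One way to look for such differences is to record what the browser reports about itself on each machine and compare the two records. The sketch below is only an assumption about what might be worth checking; which signals the site actually uses is unknown. `collect_fingerprint` and `diff_fingerprints` are hypothetical helper names, and `driver` stands for a Selenium driver such as `s.driver` from the question.

```python
import json

# JavaScript evaluated in the page. Every property read here is a standard
# navigator/screen field; the selection itself is a guess.
FINGERPRINT_JS = """
return {
    userAgent: navigator.userAgent,
    platform: navigator.platform,
    languages: navigator.languages,
    webdriver: navigator.webdriver,
    screen: [screen.width, screen.height]
};
"""


def collect_fingerprint(driver):
    """Read browser properties through Selenium's execute_script."""
    return driver.execute_script(FINGERPRINT_JS)


def diff_fingerprints(first, second):
    """Return {key: (first_value, second_value)} for every key that differs."""
    keys = sorted(set(first) | set(second))
    return {key: (first.get(key), second.get(key))
            for key in keys if first.get(key) != second.get(key)}


# Illustrative placeholder values, not measurements from the real machines.
windows = {"platform": "Win32", "webdriver": True}
centos = {"platform": "Linux x86_64", "webdriver": True}
print(json.dumps(diff_fingerprints(windows, centos), ensure_ascii=False))
```

In practice the two dictionaries would come from running `collect_fingerprint(s.driver)` once on Windows and once on CentOS, saving each result, and diffing them. Note that headless Chrome reports "HeadlessChrome" in its default user-agent string, which is one difference from a regular desktop browser.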

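charliecen (author) (original poster)

@SilenceHL For now I have given up on reading the two attributes directly. Instead, I take the values from the content of other tags and attach them to each li as attributes. Also, spoofing a Windows header with Firefox and Chrome seems to cause problems; it only works in headless mode.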
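
A workaround of this kind could look roughly like the sketch below. It is not the poster's actual code. It assumes, based only on the sample output in the question, that the number in the link path (match625115v.htm) equals the numeric part of the missing id (saishi625115), and that the kick-off time is the first HH:MM after the opening li tag. The `match_date` argument is hypothetical: the date has to be obtained elsewhere on the page.

```python
import re

# Assumption: the match number in the link equals the digits of the id attribute.
MATCH_ID_RE = re.compile(r"/match(\d+)v\.htm")
# Assumption: the kick-off time directly follows the opening <li ...> tag.
KICKOFF_RE = re.compile(r"<li\b[^>]*>\s*(\d{1,2}:\d{2})")


def rebuild_attributes(li_html, match_date):
    """Return the id and data-time values for one li, or None if not found.

    match_date is a 'YYYY-MM-DD' string supplied by the caller.
    """
    match_id = MATCH_ID_RE.search(li_html)
    kickoff = KICKOFF_RE.search(li_html)
    if not match_id or not kickoff:
        return None
    return {
        "id": "saishi" + match_id.group(1),
        "data-time": "{} {}".format(match_date, kickoff.group(1)),
    }


sample = ('<li label="..." style="display: list-item;">15:00 <b>...</b> '
          '<a href="/zhibo/zuqiu/2021/match625115v.htm" target="_blank">CCTV5</a> </li>')
print(rebuild_attributes(sample, "2021-04-08"))
```

With BeautifulSoup the returned values can then be written back onto the tag, for example `line['id'] = attrs['id']`.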

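@charliecen You could try fetching the page with Selenium directly. The logic is the same, and Selenium has a larger user base. You can also parse the data with XPath, and it is worth trying different libraries. As for the content, experiment with more request headers, find out what differs between Windows and CentOS, and then write the code accordingly.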
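
A minimal sketch of that suggestion, assuming Selenium 3.141.0 as listed in the question (the `executable_path` argument is a Selenium 3 parameter). `build_chrome_arguments` and `fetch_items` are hypothetical helper names. The explicit wait is an addition of this sketch: the CentOS output also lacks the score and two links, which might mean the page scripts had not finished when `page_source` was read, but that is a guess and not a confirmed cause.

```python
def build_chrome_arguments(user_agent=None, window_size=(1920, 1080), headless=True):
    """Build a list of Chrome switches, written without leading dashes
    like the 'arguments' list in the question."""
    arguments = ["no-sandbox"]
    if headless:
        arguments.append("headless")
    if window_size:
        arguments.append("window-size={},{}".format(*window_size))
    if user_agent:
        arguments.append("user-agent=" + user_agent)
    return arguments


def fetch_items(driver_path, url, arguments, click_selector=None, timeout=15):
    """Open url with plain Selenium and return the outerHTML of each visible li."""
    # Imported here so the helper above works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    for argument in arguments:
        options.add_argument("--" + argument)
    driver = webdriver.Chrome(executable_path=driver_path, options=options)
    try:
        driver.get(url)
        if click_selector:
            driver.find_element(By.CSS_SELECTOR, click_selector).click()
        xpath = '//li[@style="display: list-item;"]'
        # Wait until at least one matching li exists before reading the DOM.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.XPATH, xpath)))
        return [li.get_attribute("outerHTML")
                for li in driver.find_elements(By.XPATH, xpath)]
    finally:
        driver.quit()
```

A call might look like `fetch_items("/usr/local/src/chromedriver", "https://www.zhibo8.cc/", build_chrome_arguments(user_agent=some_windows_ua), click_selector=".c_nav a:nth-child(2)")`, where `some_windows_ua` is a user-agent string copied from the Windows browser.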

@SilenceHL OK, thanks for the reply.
