How do I remove part of a string inside HTML code?

Hi everyone, I have a question about Python strings.
Earlier, I used Selenium and WebDriver to scrape the data I needed and wrote it to a CSV file.
The CSV file looks like this:

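Next, I want to read the CSV file, take the column that holds the HTML code, and delete part of each string.
I tried the replace method, but the approaches I found online all target specific characters rather than a range of text.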

Here is an example of the HTML code:

<div class="ContentGrid">
    香港一年GDP 都3千幾億大美金
    <br>
    2成都6百幾
    <br>
    <br>
    <br>
</div>

<div class="ContentGrid">
    <blockquote>
        <div style="color: #0000A0;">
            <blockquote>
                <div style="color: #0000A0;">
                    藍店送聖誕卡比施生有乜下場
                    <img data-icons="???" src="/faces/wonder2.gif" alt="???">
                </div>
            </blockquote>
            <br>何只聖誕卡，直情要送埋聖誕樹賀一賀佢
            <img data-icons="#hehe#" src="/faces/hehe.gif" alt="#hehe#">
        </div>
    </blockquote>
    <br>
    施生只對聖誕卡有感覺。
    <br>
    <br>
    <br>
</div>

I have a large number of div class="ContentGrid" elements, but not every one of them contains <blockquote>...</blockquote>. I need to remove every <blockquote>...</blockquote> block, including its contents, wherever one appears.

This is the result I expect:

<div class="ContentGrid">
    香港一年GDP 都3千幾億大美金
    <br>
    2成都6百幾
    <br>
    <br>
    <br>
</div>

<div class="ContentGrid">

    <br>
    施生只對聖誕卡有感覺。
    <br>
    <br>
    <br>
</div>

I hope you can help. Thank you.

html csv

fd5556

51 reputation


Jason990420

1.9k reputation / Individual @ Individual

Best answer

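# Parse the HTML with BeautifulSoup (third-party package: beautifulsoup4).
from bs4 import BeautifulSoup

with open('D:/html.txt', 'rt', encoding='utf-8') as f:
    txt = f.read()

soup = BeautifulSoup(txt, 'html.parser')

# Detach every <blockquote> element, together with everything nested inside it.
for s in soup.select('blockquote'):
    s.extract()

print(soup.prettify())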

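# Alternative using only string methods, no parser needed.
with open('D:/html.txt', 'rt', encoding='utf-8') as f:
    txt = f.read()

close_len = len('</blockquote>')
while '<blockquote>' in txt:
    end = txt.find('</blockquote>')
    # Pair the first closing tag with the nearest opening tag before it,
    # so nested blocks are removed from the inside out.
    start = txt.rfind('<blockquote>', 0, end) if end != -1 else -1
    if start == -1:
        break  # unbalanced markup: stop instead of looping forever
    txt = txt[:start] + txt[end + close_len:]

print(txt)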
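
The question asks about HTML stored in a CSV column, while the snippets above read a single text file. As a rough sketch of how the same idea could be applied per row, the example below uses only the standard library (`csv` and `html.parser`) in place of BeautifulSoup. The column name `html`, the sample rows, and the names `BlockquoteStripper` and `strip_blockquotes` are assumptions made for illustration; they do not come from the original post.

```python
import csv
import io
from html.parser import HTMLParser


class BlockquoteStripper(HTMLParser):
    """Re-emit HTML while skipping everything inside <blockquote> elements.

    Hypothetical helper: comments and doctype declarations are dropped.
    """

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.depth = 0    # number of <blockquote> elements currently open
        self.parts = []   # pieces of markup that are kept

    def handle_starttag(self, tag, attrs):
        if tag == 'blockquote':
            self.depth += 1
        elif self.depth == 0:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>: keep the raw text, emit no end tag.
        if tag != 'blockquote' and self.depth == 0:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == 'blockquote':
            self.depth = max(self.depth - 1, 0)
        elif self.depth == 0:
            self.parts.append(f'</{tag}>')

    def handle_data(self, data):
        if self.depth == 0:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.depth == 0:
            self.parts.append(f'&{name};')

    def handle_charref(self, name):
        if self.depth == 0:
            self.parts.append(f'&#{name};')


def strip_blockquotes(html):
    """Return html with every <blockquote> element and its contents removed."""
    parser = BlockquoteStripper()
    parser.feed(html)
    parser.close()
    return ''.join(parser.parts)


# Build a small in-memory CSV; the 'html' column name is an assumption.
sample_csv = io.StringIO()
writer = csv.writer(sample_csv)
writer.writerow(['id', 'html'])
writer.writerow(['1', '<div class="ContentGrid">A<br>B</div>'])
writer.writerow(['2', '<div class="ContentGrid"><blockquote><div>'
                      '<blockquote>x</blockquote>y</div></blockquote>'
                      '<br>C</div>'])
sample_csv.seek(0)

# Clean the HTML column of every row.
cleaned = [strip_blockquotes(row['html']) for row in csv.DictReader(sample_csv)]
for html in cleaned:
    print(html)
```

To use this with a real file, `sample_csv` could be replaced by a file opened with `open(path, newline='', encoding='utf-8')`. Counting depth rather than searching for the first closing tag is what lets nested quotes, like the second example in the question, be removed as a whole.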

Commented 5 years ago

