I don't know why the output is empty; I can't figure out where my code goes wrong

I want to get: Tiger, Two tigers two tigers run fast; Rabbit, Small white rabbit white and white;

import re 

s = """<div class="animal">
  <p class="name">
    <a title="Tiger"></a>
  </p>

  <p class="contents">
    Two tigers two tigers run fast
  </p>
</div>

<div class="animal">
  <p class="name">
    <a title="Rabbit"></a>
  </p>

  <p class="contents">
    Small white rabbit white and white 
  </p>
</div>"""


p = re.compile('<div class="animal".*?title="\
        (.*?)">.*?contents">(.*?)</p>', re.S)
r = p.findall(s)
print(r)

OlafChou


pardon110

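862 reputation / Developer @ 社科大

Best answer

Your regex is wrong, so it cannot match the target content. It can be rewritten to get the result you want.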

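Note the use of the non-capturing group (?:), the lazy any-character match ([\s\S] with a non-greedy quantifier), and the escaped slash in the closing p tag (Python's re does not require escaping /, but \/ is harmless).
If you need this for work, consider scraping with scrapy: you only need to understand CSS or XPath expressions, and you extract data by calling methods on its selector objects.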
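The code from this answer did not survive as text, so the following is only a minimal sketch of a pattern that uses the three features mentioned above. The exact pattern and the names html, pattern and pairs are assumptions for illustration, not the answerer's original code.

```python
import re

html = """<div class="animal">
  <p class="name">
    <a title="Tiger"></a>
  </p>

  <p class="contents">
    Two tigers two tigers run fast
  </p>
</div>

<div class="animal">
  <p class="name">
    <a title="Rabbit"></a>
  </p>

  <p class="contents">
    Small white rabbit white and white
  </p>
</div>"""

# (?:...) groups without capturing, so findall returns only the two
# captured groups. [\s\S]*? lazily matches any character, newlines
# included, so re.S is not needed. \/ is the escaped slash of </p>.
pattern = re.compile(
    r'<div class="animal">(?:[\s\S]*?)title="([^"]*)"'
    r'[\s\S]*?class="contents">\s*([\s\S]*?)\s*<\/p>'
)

pairs = pattern.findall(html)
print(pairs)
# [('Tiger', 'Two tigers two tigers run fast'),
#  ('Rabbit', 'Small white rabbit white and white')]
```

The \s* on both sides of the second group trims the indentation and line breaks around the paragraph text, so no strip() call is needed afterwards.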

Commented 6 years ago

Replies: 5

Coolest

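Trainee teaching assistant, 395 reputation

Could you be more specific? For example, which content you want to extract, your environment, and so on. Without those details, nobody can tell what you are trying to do.

Commented 6 years ago

OlafChou (original poster)

I want to get: Tiger, Two tigers two tigers run fast; Rabbit, Small white rabbit white and white;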

Coolest

Trainee teaching assistant, 395 reputation

The problem is in your regular expression:

<div class="animal".*?title="\             (.*?)">.*?contents">(.*?)</p>

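The culprit is the \ right after title=" in the pattern. In a normal (non-raw) Python string literal, a backslash at the end of a line is a line continuation: the backslash and the newline are dropped, so the backslash never reaches the regex. The indentation at the start of the next line, however, is still inside the string, so the pattern ends up requiring a run of spaces immediately after title=". The HTML has no such spaces, so findall returns an empty list.
Solution: don't break the line; put the whole regular expression on one line.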

import re

s = """<div class="animal">
  <p class="name">
    <a title="Tiger"></a>
  </p>

  <p class="contents">
    Two tigers two tigers run fast
  </p>
</div>

<div class="animal">
  <p class="name">
    <a title="Rabbit"></a>
  </p>

  <p class="contents">
    Small white rabbit white and white
  </p>
</div>"""

p = re.compile('<div class="animal".*?title="(.*?)">.*?contents">(.*?)</p>', re.S)
r = p.findall(s)
print(r)
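To see what a line break inside the pattern string actually does, here is a small sketch that compares a backslash-continued literal with adjacent string literals. The names broken, fixed and sample are made up for illustration, and the patterns are shortened versions of the one in the question.

```python
import re

# A backslash at the end of a line inside a normal string literal joins
# the two lines, but the indentation of the second line stays in the string.
broken = 'title="\
        (.*?)"'
print(repr(broken))  # the spaces after title=" are part of the pattern

# Adjacent string literals inside parentheses are concatenated by the
# compiler, so the pattern can span lines without picking up whitespace.
fixed = ('title="'
         '(.*?)"')

sample = '<a title="Tiger"></a>'
print(re.findall(broken, sample))  # []
print(re.findall(fixed, sample))   # ['Tiger']
```

Splitting a long pattern into adjacent literals keeps each line short and still produces exactly the string you wrote, which is why it is a common alternative to putting everything on one line.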

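Commented 6 years ago

Coolest

Trainee teaching assistant, 395 reputation

I suggest using XPath. Parsing HTML with re gets messy, and sometimes it fails to find the text.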
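As a rough illustration of the XPath route, the sketch below uses only the standard library's xml.etree.ElementTree, which supports a limited XPath subset. It works here only because the snippet happens to be well-formed XML; for real-world HTML a dedicated HTML parser would be the safer assumption. The wrapper element root and the variable names are made up for the example.

```python
import xml.etree.ElementTree as ET

html = """<div class="animal">
  <p class="name">
    <a title="Tiger"></a>
  </p>

  <p class="contents">
    Two tigers two tigers run fast
  </p>
</div>

<div class="animal">
  <p class="name">
    <a title="Rabbit"></a>
  </p>

  <p class="contents">
    Small white rabbit white and white
  </p>
</div>"""

# Wrap the fragment in a single root element so that the two sibling
# <div> elements parse as one XML document.
root = ET.fromstring(f"<root>{html}</root>")

pairs = []
for div in root.findall(".//div[@class='animal']"):
    # The title lives in an attribute, the contents in the element text.
    title = div.find(".//a").get("title")
    contents = div.find(".//p[@class='contents']").text.strip()
    pairs.append((title, contents))

print(pairs)
# [('Tiger', 'Two tigers two tigers run fast'),
#  ('Rabbit', 'Small white rabbit white and white')]
```

Compared with the regex, each lookup is scoped to one div element, so a missing title or paragraph in one block cannot cause a match to spill over into the next block.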

Commented 6 years ago


Jason990420

1.9k reputation / Individual @ Individual

Example

import re

s = '''
    <div class="animal">
        <p class="name">
            <a title="Tiger"></a>
        </p>
        <p class="contents">
            Two tigers two tigers run fast
        </p>
    </div>
    <div class="animal">
        <p class="name">
            <a title="Rabbit"></a>
        </p>
        <p class="contents">
            Small white rabbit white and white
        </p>
    </div>
'''
pattern = r'''
    <div class="animal">.*?
        <a title="(.*?)">.*?</a>.*?
        <p class="contents">(.*?)</p>'''
s = s.replace('\n', '')
pattern = pattern.replace('\n', '')

r = re.findall(pattern, s)
result = [f'{animal.strip()}, {contents.strip()}; ' for animal, contents in r]

print(''.join(result))

Commented 6 years ago

Coolest

It's a bit messy

Coolest

@Jason990420 It seems so; the original poster messed up the code formatting
