Notes on blocking a crawler with nginx

EdwinYang's personal blog / created 2 years ago / updated 2 years ago

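Preface:
Recently I noticed that server memory was spiking sharply during a certain time window. At first I assumed normal business traffic was responsible and upgraded the server's memory, but that did not solve the problem. (I was lazy here: without finding the real cause, I simply assumed that traffic had grown.)
I then checked the nginx logs and found some unusual requests.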

I had no idea what these were, so out of curiosity I looked them up right away.

Wow, it nearly took my server down.
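For reference, one way to surface this kind of traffic is to tally the User-Agent strings in the access log. The following is only a minimal sketch: it assumes the default "combined" log format, and both the helper name `count_user_agents` and the sample log lines are made up for illustration.

```python
import re
from collections import Counter

# Tail of a "combined" format line: "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(r'"[^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"\s*$')


def count_user_agents(lines):
    """Return a Counter of the User-Agent strings found in access-log lines."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match:  # lines in another format are skipped silently
            counts[match.group("ua")] += 1
    return counts


# Invented sample lines, for illustration only.
sample_log = [
    '203.0.113.7 - - [01/Jan/2022:10:00:00 +0800] "GET /a HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)"',
    '203.0.113.8 - - [01/Jan/2022:10:00:01 +0800] "GET /b HTTP/1.1" 200 734 "-" '
    '"Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)"',
    '198.51.100.4 - - [01/Jan/2022:10:00:02 +0800] "GET / HTTP/1.1" 200 120 "-" '
    '"curl/7.79.1"',
]

if __name__ == "__main__":
    for user_agent, hits in count_user_agents(sample_log).most_common(5):
        print(hits, user_agent)
```

For a real log, pass an open file object instead of `sample_log`. A single User-Agent dominating the count is a strong hint that one client is generating most of the load.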
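Time to fix it.
Fixing it at the nginx level
Although it is a crawler, it makes no attempt to disguise itself: every request carries a User-Agent header, and it is always the same one. That makes it easy to handle. Here is the configuration (I am using Docker):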

docker-compose

version: '3'
services:
  d_nginx:
    container_name: c_nginx
    env_file:
      - ./env_files/nginx-web.env
    image: nginx:1.20.1-alpine
    ports:
      - '80:80'
      - '81:81'
      - '443:443'
    links:
      - d_php
    volumes:
      - ./nginx/conf:/etc/nginx/conf.d
      - ./nginx/nginx.conf:/etc/nginx/nginx.conf
      - ./nginx/agent-deny.conf:/etc/nginx/agent-deny.conf
      - ./nginx/certs:/etc/nginx/certs
      - ./nginx/logs:/var/log/nginx/
      - ./www:/var/www/html

Directory layout

nginx
-----nginx.conf
-----agent-deny.conf
-----conf
----------xxxx01_server.conf
----------xxxx02_server.conf

agent-deny.conf

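# Reject known crawlers by User-Agent (case-insensitive match).
if ($http_user_agent ~* (Scrapy|AhrefsBot)) {
    return 404;
}
# Reject the exact AhrefsBot UA string or an empty User-Agent.
# The regex metacharacters ( ) + . are escaped so that they match literally.
if ($http_user_agent ~ "Mozilla/5\.0 \(compatible; AhrefsBot/7\.0; \+http://ahrefs\.com/robot/\)|^$") {
    return 403;
}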
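The full AhrefsBot UA string contains characters that are regex metacharacters: `(`, `)`, `+` and `.`. The sketch below uses Python's `re` module as a stand-in for nginx's PCRE matching (an assumption; the two engines treat these particular constructs the same way) to show what happens when the string is pasted into a pattern unescaped. The helper name `is_blocked` is hypothetical.

```python
import re

REAL_UA = "Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)"

# The UA pasted into a pattern as-is: ( ) form a group, + is a quantifier
# on the preceding space, and . matches any character.
UNESCAPED = r"Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)|^$"

# The same string with metacharacters escaped, keeping the empty-UA alternative.
ESCAPED = re.escape(REAL_UA) + r"|^$"


def is_blocked(pattern, user_agent):
    """Mimic a case-sensitive `~` test: True if the pattern matches anywhere."""
    return re.search(pattern, user_agent) is not None


if __name__ == "__main__":
    print("unescaped vs real UA :", is_blocked(UNESCAPED, REAL_UA))
    print("unescaped vs empty UA:", is_blocked(UNESCAPED, ""))
    print("escaped vs real UA   :", is_blocked(ESCAPED, REAL_UA))
```

Unescaped, the first alternative expects the text without the literal parentheses and plus sign, so it never matches the real UA and only the `^$` branch (an empty User-Agent) ever fires. Because the case-insensitive `(Scrapy|AhrefsBot)` rule already catches the bot, the practical effect of the second rule is to reject requests with no User-Agent at all.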

Then include agent-deny.conf in every server block:

server {
    include /etc/nginx/agent-deny.conf;
    listen 80;
    server_name localhost;
    root /var/www/html/xxxxx/public;
    index index.php;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $remote_addr;
    proxy_set_header REMOTE-HOST $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

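    # Maximum allowed size of a client request body (upload size limit)
    client_max_body_size 300M;

    # Client body buffer size. If it is too small, nginx cannot hold the body in memory and writes it to a temporary file, which adds I/O.
    # By default the buffer is 8k on 32-bit systems and 16k on 64-bit systems.
    #client_body_buffer_size 5M;
    # With this parameter set, server memory fluctuated quite a lot; since client uploads cannot be controlled, I decided not to set it.

    # This directive disables nginx's in-memory buffering and stores the request body in a temporary file, which contains plain-text data.
    # It can be used in the http, server and location blocks. Possible values:
    # off: do not force the body to be written to a file
    # clean: the request body is written to a file, which is removed after the request is processed
    # on: the request body is written to a file, which is not removed after the request is processed
    client_body_in_file_only clean;


    # Timeout for reading the client request body (applies between two successive reads, not to the whole transfer)
    client_body_timeout 600s;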

    location /locales {
       break;
    }

    location / {
        # Block requests that try to download the .htaccess file
        if ($request_uri = '/.htaccess') {
            return 404;
        }
        # Block requests that try to download the .gitignore file
        if ($request_uri = '/storage/.gitignore') {
            return 404;
        }
        # Block requests that try to download the web.config file
        if ($request_uri = '/web.config') {
            return 404;
        }
        try_files $uri $uri/ /index.php?$query_string;
    }

    location /oauth/token {
        # Reject GET requests to /oauth/token
        if ($request_method = 'GET') {
            return 404;
        }
        try_files $uri $uri/ /index.php?$query_string;
    }

    location /other/de {
        proxy_pass http://127.0.0.1/oauth/;
        rewrite ^/other/de(.*)$ https://www.baidu.com permanent;
    }

    location ~ \.php$ {
        try_files $uri /index.php =404;
        fastcgi_split_path_info ^(.+\.php)(/.+)$;
        fastcgi_pass d_php:9000;
        fastcgi_index index.php;
        fastcgi_param SCRIPT_FILENAME  $document_root$fastcgi_script_name;
        fastcgi_connect_timeout 300s;
        fastcgi_send_timeout 300s;
        fastcgi_read_timeout 300s;
        include fastcgi_params;
        #add_header 'Access-Control-Allow-Origin' '*';
        #add_header 'Access-Control-Allow-Methods' 'GET, POST, OPTIONS, PUT, DELETE';
        #add_header 'Access-Control-Allow-Headers' 'DNT,X-Mx-ReqToken,Keep-Alive,User-Agent,X-Requested-With,If-Modified-Since,Cache-Control,Content-Type,Authorization,token';
    }
}

With this in place, every request from AhrefsBot is rejected.

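Blocking with an Alibaba Cloud security group

Log analysis also showed that the requests came from just a few IP ranges, so as an extra layer of protection I blocked those as well (the Alibaba Cloud rule took effect fastest and worked best; you get what you pay for).
IP ranges: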

54.36.0.0
51.222.0.0
195.154.0.0

I blocked them directly in the public-network inbound direction of the security group.
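To check whether traffic really concentrates in a few ranges, client addresses can be grouped by network prefix. This is a rough sketch only: the post lists base addresses without a prefix length, so the /16 grouping below is purely an assumption, and the function name `count_by_network` and the sample addresses are invented.

```python
import ipaddress
from collections import Counter


def count_by_network(addresses, prefix_len=16):
    """Group IPv4 addresses by enclosing network and count hits per network."""
    counts = Counter()
    for addr in addresses:
        # strict=False masks the host bits: 54.36.148.10/16 -> 54.36.0.0/16
        network = ipaddress.ip_network(f"{addr}/{prefix_len}", strict=False)
        counts[str(network)] += 1
    return counts


# Invented client addresses, for illustration only.
sample_ips = [
    "54.36.148.10",
    "54.36.149.22",
    "51.222.253.5",
    "195.154.122.9",
    "203.0.113.7",
]

if __name__ == "__main__":
    for network, hits in count_by_network(sample_ips).most_common():
        print(hits, network)
```

Networks that stand out in the tally are candidates for a deny rule; the prefix length actually used in the rule should come from the provider's published ranges rather than from this grouping.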

This work is licensed under a CC license; reposts must credit the author and link to this article.


Comments: 10

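hikki

No wonder I haven't been able to scrape any data lately. I'll go change things right away.

Commented 2 years ago

EdwinYang (OP)

Ah! Really?

EdwinYang (OP)

Is that really OK?

oliver-l

Hahahahaha

不高兴就喝水

Hahaha, for every policy from above there is a countermeasure from below.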

mengdodo

This approach is nicer: out of sight, out of mind.

    # Filter out useless access-log entries such as image requests
    access_log_bypass_if ($uri ~* 'images');
    access_log_bypass_if ($uri ~* 'robots.txt');
    access_log_bypass_if ($uri ~* 'favicon.ico');
    access_log_bypass_if ($http_user_agent ~* 'SemrushBot');
    access_log_bypass_if ($http_user_agent ~* 'serpstatbot');
    access_log_bypass_if ($http_user_agent ~* 'DataForSeoBot');
    access_log_bypass_if ($http_user_agent ~* 'PetalBot');
    access_log_bypass_if ($http_user_agent ~* 'BLEXBot');
    access_log_bypass_if ($http_user_agent ~* 'DotBot');
    access_log_bypass_if ($http_user_agent ~* 'AhrefsBot');
    access_log_bypass_if ($http_user_agent ~* 'MJ12bot');
    access_log_bypass_if ($http_user_agent ~* 'fromBoce'); 
    access_log_bypass_if ($http_user_agent ~* 'Uptime-Kuma');

Commented 2 years ago

EdwinYang (OP)

That does get rid of most of the noise.

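jfpl

You can also add a robots.txt file in the site root with blocking rules, and then it will stop crawling. This crawler follows that convention too.

Commented 2 years ago

EdwinYang (OP)

I had already set User-agent: * and Disallow:, but it had no effect, and in any case robots.txt does not seem to do much against crawlers that ignore the convention.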

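leven5

They may all be junk spiders, but they still honor robots.txt. Visit the link in the UA string directly; it usually describes how to block the bot.

Commented 2 years ago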
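On the robots.txt exchange in this thread: under the robots exclusion convention, a `Disallow:` line with an empty value permits everything, which may be why that setting appeared to have no effect. The sketch below checks this with Python's standard `urllib.robotparser`; the helper name `allowed` and the example.com URL are placeholders, and whether a given crawler actually honors the file is up to the crawler.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# An empty Disallow value means "nothing is disallowed".
ALLOW_ALL = "User-agent: *\nDisallow:\n"

# Disallowing "/" for one named bot blocks that bot from the whole site.
BLOCK_AHREFS = "User-agent: AhrefsBot\nDisallow: /\n"

if __name__ == "__main__":
    url = "https://example.com/page"
    print("empty Disallow, AhrefsBot:", allowed(ALLOW_ALL, "AhrefsBot", url))
    print("Disallow: /, AhrefsBot   :", allowed(BLOCK_AHREFS, "AhrefsBot", url))
    print("Disallow: /, Googlebot   :", allowed(BLOCK_AHREFS, "Googlebot", url))
```

A rule of the second form only asks well-behaved crawlers to stay away; the nginx and security-group blocks in the post remain the enforcement layer.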
