Python Instagram crawler

Here are the concrete steps and the points to watch out for:

Notes on crawling Instagram

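  • Instagram's home page is server-side rendered, so the first 11 or 12 posts on the page are embedded in the HTML as a JSON structure (additionalData). Only the posts loaded after that come from AJAX requests.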

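  • Before 2019/06, Instagram had an anti-crawling mechanism: every request had to carry an 'X-Instagram-GIS' field in the request headers. It was computed as follows:
    1. Combine rhx_gis and queryVariables

    rhx_gis can be read from the sharedData JSON structure on the home page

    2. Take the MD5 hash of the combined string (hashStr below stands for a helper that returns the MD5 hex digest)
    e.g.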

        queryVariables = '{"id":"' + user_id + '","first":12,"after":"' +cursor+ '"}'
        print(queryVariables)
        headers['X-Instagram-GIS'] = hashStr(GIS_rhx_gis + ":" + queryVariables)
  • Since 2019/06, however, Instagram no longer checks X-Instagram-GIS, so there is no need to generate it any more. Treat the previous point as history.
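
For the historical record, the two signing steps can be sketched roughly as below. This is only an illustration built from the description in this post: the helper names build_query_variables and instagram_gis are made up here, and the rhx_gis and cursor values are placeholders, not real data.

```python
import hashlib
import json


def build_query_variables(user_id, cursor, first=12):
    """Serialize the variables compactly (no spaces), in the id/first/after shape used in this post."""
    return json.dumps(
        {"id": str(user_id), "first": first, "after": cursor},
        separators=(",", ":"),
    )


def instagram_gis(rhx_gis, query_variables):
    """Return md5("<rhx_gis>:<query_variables>") as a hex string."""
    raw = f"{rhx_gis}:{query_variables}"
    return hashlib.md5(raw.encode("utf-8")).hexdigest()


# Placeholder values for illustration only.
variables = build_query_variables("1664922478", "CURSOR")
signature = instagram_gis("0123456789abcdef", variables)
print(variables)
print(signature)
```

The result would have gone into the 'X-Instagram-GIS' header; since the check was dropped, none of this is needed for current requests.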

  • The first visit to the Instagram home page sets a few cookies. The relevant response headers look like this:

        set-cookie: rur=PRN; Domain=.instagram.com; HttpOnly; Path=/; Secure
        set-cookie: ds_user_id=11859524403; Domain=.instagram.com; expires=Mon, 15-Jul-2019 09:22:48 GMT; Max-Age=7776000; Path=/; Secure
        set-cookie: urlgen="{\"45.63.123.251\": 20473}:1hGKIi:7bh3mEau4gMVhrzWRTvtjs9hJ2Q"; Domain=.instagram.com; HttpOnly; Path=/; Secure
        set-cookie: csrftoken=Or4nQ1T3xidf6CYyTE7vueF46B73JmAd; Domain=.instagram.com; expires=Tue, 14-Apr-2020 09:22:48 GMT; Max-Age=31449600; Path=/; Secure
  • About query_hash: in general this hash needs no special handling and can simply be hard-coded.

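  • Important: send a custom header on every request, and make sure it contains a user-agent. While signing was still required, the rhx_gis signature was only accepted, and data only returned, when the user-agent was present. Remember: every single request! For example: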

    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
    }
  • Most API calls only return data when the cookie in the request headers carries a sessionid. A normal set of request headers looks like this:

        :authority: www.instagram.com
        :method: GET
        :path: /graphql/query/?query_hash=ae21d996d1918b725a934c0ed7f59a74&variables=%7B%22fetch_media_count%22%3A0%2C%22fetch_suggested_count%22%3A30%2C%22ignore_cache%22%3Atrue%2C%22filter_followed_friends%22%3Atrue%2C%22seen_ids%22%3A%5B%5D%2C%22include_reel%22%3Atrue%7D
        :scheme: https
        accept: */*
        accept-encoding: gzip, deflate, br
        accept-language: zh-CN,zh;q=0.9,en;q=0.8,la;q=0.7
        cache-control: no-cache
        cookie: mid=XI-joQAEAAHpP4H2WkiI0kcY3sxg; csrftoken=Or4nQ1T3xidf6CYyTE7vueF46B73JmAd; ds_user_id=11859524403; sessionid=11859524403%3Al965tcIRCjXmVp%3A25; rur=PRN; urlgen="{\"45.63.123.251\": 20473}:1hGKIj:JvyKtYz_nHgBsLZnKrbSq0FEfeg"
        pragma: no-cache
        referer: https://www.instagram.com/
        user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36
        x-ig-app-id: 936619743392459
        x-instagram-gis: 8f382d24b07524ad90b4f5ed5d6fccdb
        x-requested-with: XMLHttpRequest

    Pay attention to user-agent, x-ig-app-id (taken from sharedData in the HTML), x-instagram-gis, and the sessionid in the cookie.

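  • API pagination (requesting the next page), using a user's post list as the example
    A paginated AJAX request on Instagram usually carries parameters like these: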

    query_hash: a5164aed103f24b03e7b7747a2d94e3c
    variables: {
    "id":"1664922478",
    "first":12,
    "after":"AQBJ8AGqCb5c9rO-dl2Z8ojZW12jrFbYZHxJKC1hP-nJKLtedNJ6VHzKAZtAd0oeUfgJqw8DmusHbQTa5DcoqQ5E3urx0BH9NkqZFePTP1Ie7A"}

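    -- id is the user id, available from sharedData in the HTML
    -- first is the number of records to fetch per request; the maximum appears to be 50
    -- after is the pagination cursor, which records where the previous page ended

    The parameters inside variables differ between APIs and there may be more of them; only the pagination-related ones are listed here.

    The initial pagination parameters come from sharedData in the HTML: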

        # page info for the user's timeline
        page_info = js_data["entry_data"]["ProfilePage"][0]["graphql"]["user"]["edge_owner_to_timeline_media"]['page_info']
        # cursor of the next page, e.g. AQCSnXw1JsoV6LPOD2Of6qQUY7HWyXRc_CBSMWB6WvKlseC-7ibKho3Em0PEG7_EP8vwoXw5zwzsAv_mNMR8yX2uGFZ5j6YXdyoFfdbHc6942w
        cursor = page_info['end_cursor']
        # whether there is a next page
        flag = page_info['has_next_page']

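    end_cursor is the value to use for after, and has_next_page tells you whether a next page exists.
    If there is a next page, send the first paginated request. When its response comes back, keep id and first unchanged, set after to the end_cursor found in the response's page_info, rebuild variables, and send the request for the following page together with query_hash.
    Then check has_next_page in the response's page_info again and keep looping until all the data has been fetched. If you don't want everything, use the count value in the response's edge_owner_to_timeline_media, which is the user's total number of media items, to decide when to stop.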
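
The loop described above might look like the sketch below. It is an illustration, not the exact code used later in this post: build_url and iter_timeline are names invented here, fetch_json is an assumed callable that takes a URL and returns the decoded JSON, and the network is replaced by two fake pages so the loop can be run offline.

```python
import json
import urllib.parse

QUERY_HASH = "a5164aed103f24b03e7b7747a2d94e3c"  # value from the example request above


def build_url(user_id, first, after):
    """Build the graphql URL with compact, URL-encoded variables."""
    variables = json.dumps(
        {"id": str(user_id), "first": first, "after": after},
        separators=(",", ":"),
    )
    return ("https://www.instagram.com/graphql/query/?query_hash=%s&variables=%s"
            % (QUERY_HASH, urllib.parse.quote(variables)))


def iter_timeline(user_id, first_cursor, fetch_json, first=12, limit=None):
    """Yield media nodes page by page until has_next_page is False or limit is reached."""
    cursor, has_next, seen = first_cursor, True, 0
    while has_next and (limit is None or seen < limit):
        payload = fetch_json(build_url(user_id, first, cursor))
        timeline = payload["data"]["user"]["edge_owner_to_timeline_media"]
        for edge in timeline["edges"]:
            if limit is not None and seen >= limit:
                return
            seen += 1
            yield edge["node"]
        page_info = timeline["page_info"]
        cursor, has_next = page_info["end_cursor"], page_info["has_next_page"]


# Offline stand-in for the network: two fake pages keyed by cursor.
FAKE_PAGES = {
    "c0": {"edges": [{"node": {"id": "1"}}, {"node": {"id": "2"}}],
           "page_info": {"end_cursor": "c1", "has_next_page": True}},
    "c1": {"edges": [{"node": {"id": "3"}}],
           "page_info": {"end_cursor": None, "has_next_page": False}},
}


def fake_fetch_json(url):
    """Decode the cursor back out of the URL and return the matching fake page."""
    query = urllib.parse.parse_qs(urllib.parse.urlsplit(url).query)
    cursor = json.loads(query["variables"][0])["after"]
    return {"data": {"user": {"edge_owner_to_timeline_media": FAKE_PAGES[cursor]}}}


ids = [node["id"] for node in iter_timeline("1664922478", "c0", fake_fetch_json)]
print(ids)
```

Passing limit stops early without draining every page, which plays the same role as comparing against count in the text.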

  • Video posts and image posts have different data structures, so check the is_video field in the response data.
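
A small sketch of that check follows. The helper name media_source is made up for illustration, and it assumes that a video node carries video_url while every node carries display_url, which matches the fields read by the code later in this post.

```python
def media_source(node):
    """Return (kind, url) for a media node, branching on is_video."""
    if node.get("is_video"):
        # Assumption: video nodes expose video_url; fall back to the still image otherwise.
        return "video", node.get("video_url") or node.get("display_url", "")
    return "image", node.get("display_url", "")


# Hand-written sample nodes, not real API output.
print(media_source({"is_video": True, "video_url": "v.mp4", "display_url": "d.jpg"}))
print(media_source({"is_video": False, "display_url": "d.jpg"}))
```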

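  • If you crawl with a logged-in Instagram account, a valid, unexpired sessionid in the request cookie is enough to call the APIs directly, with no signature to compute.
    The simplest approach: open a browser, log in to Instagram, press F12 to inspect an XHR request, and copy the cookie from its request headers, like this: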

    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
        'cookie': 'mid=XLaW9QAEAAH0WaPDCeY490qeeNlA; csrftoken=IgcP8rj0Ish5e9uHNXhVEsTId22tw8VE; ds_user_id=11859524403; sessionid=11859524403%3A74mdddCfCqXS7I%3A15; rur=PRN; urlgen="{\"45.63.123.251\": 20473}:1hGxr6:Phc4hR68jNts4Ig9FbrZRglG4YA"'
    }

    Send headers like the ones above with every request.
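
Since everything hinges on the copied cookie actually containing a sessionid, a quick check before crawling can save some confusion. The sketch below is only a suggestion: parse_cookie_header and build_headers are hypothetical helpers, and the cookie values are placeholders.

```python
def parse_cookie_header(cookie_header):
    """Split a 'name=value; name=value' cookie string into a dict."""
    cookies = {}
    for part in cookie_header.split(";"):
        name, sep, value = part.strip().partition("=")
        if sep:  # skip fragments that have no '='
            cookies[name] = value
    return cookies


def build_headers(user_agent, cookie_header):
    """Return request headers, refusing a cookie string that lacks sessionid."""
    cookies = parse_cookie_header(cookie_header)
    if not cookies.get("sessionid"):
        raise ValueError("cookie has no sessionid; log in and copy it again")
    return {"user-agent": user_agent, "cookie": cookie_header}


# Placeholder cookie string, shaped like the one copied from the browser.
cookie = "csrftoken=TOKEN; ds_user_id=123; sessionid=123%3Aabc%3A15; rur=PRN"
headers = build_headers("Mozilla/5.0", cookie)
print(parse_cookie_header(cookie)["sessionid"])
```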

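  • Errors are logged to the ins_error_log table of the zk_flock database on 192.168.1.57. Most of them are currently "unknown ssl protocol" errors, probably because the crawl is too fast; proxies to switch between are needed.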
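
One possible way to slow down and switch proxies when such errors appear is sketched below. It is an assumption about how this could be handled, not something the crawler in this post does: fetch_with_retry is a made-up helper, fetch is any callable taking (url, proxy), and the second proxy address is hypothetical.

```python
import itertools
import time


def fetch_with_retry(fetch, url, proxies, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url, proxy), moving to the next proxy and backing off after each failure."""
    rotation = itertools.cycle(proxies)
    last_error = None
    for attempt in range(max_attempts):
        proxy = next(rotation)
        try:
            return fetch(url, proxy)
        except OSError as exc:  # ssl.SSLError and the requests exceptions derive from OSError
            last_error = exc
            if attempt < max_attempts - 1:
                sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
    raise last_error


# Offline demo: a fake fetch that fails twice, then succeeds.
calls = []


def flaky_fetch(url, proxy):
    calls.append(proxy)
    if len(calls) < 3:
        raise OSError("unknown ssl protocol")
    return "ok via " + proxy


result = fetch_with_retry(
    flaky_fetch,
    "https://www.instagram.com/",
    ["http://127.0.0.1:1087", "http://127.0.0.1:1088"],  # second address is hypothetical
    sleep=lambda seconds: None,  # skip real waiting in the demo
)
print(result)
```

With requests, fetch could be a thin wrapper that passes proxies={'http': proxy, 'https': proxy} to requests.get.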

Here is runnable code (it configures a proxy for getting past the firewall; remove it if you don't need one):

# -*- coding:utf-8 -*-
import requests
import re
import json
import urllib.parse
import hashlib
import sys

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'

BASE_URL = 'https://www.instagram.com'
ACCOUNT_MEDIAS = "https://www.instagram.com/graphql/query/?query_hash=42323d64886122307be10013ad2dcc44&variables=%s"
ACCOUNT_PAGE = 'https://www.instagram.com/%s'

proxies = {
    'http': 'http://127.0.0.1:1087',
    'https': 'http://127.0.0.1:1087',
}

# To set the proxy only once, attach it to a session; then the proxies argument does not have to be passed on every requests call
# s = requests.session()
# s.proxies = {'http': '121.193.143.249:80'}

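def get_shared_data(html=''):
    """get window._sharedData from page, return the dict loaded from the window._sharedData str
    """
    if html:
        target_text = html
    else:
        header = generate_header()
        response = requests.get(BASE_URL, proxies=proxies, headers=header)
        target_text = response.text
    # search() locates the assignment anywhere in the page; no leading wildcard needed
    regx = r"_sharedData\s*=\s*(\{.*?\});\s*</script>"
    match_result = re.search(regx, target_text, re.S)
    if match_result is None:
        # e.g. a login wall or an error page was returned instead of the expected HTML
        raise ValueError('window._sharedData not found in page')
    data = json.loads(match_result.group(1))

    return data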

# def get_rhx_gis():
#     """get the rhx_gis value from sharedData
#     """
#     share_data = get_shared_data()
#     return share_data['rhx_gis']

def get_account(user_name):
    """get the account info by username
    :param user_name:
    :return:
    """
    url = get_account_link(user_name)
    header = generate_header()
    response = requests.get(url, headers=header, proxies=proxies)
    data = get_shared_data(response.text)
    account = resolve_account_data(data)
    return account

def get_media_by_user_id(user_id, count=50, max_id=''):
    """get media info by user id
    :param user_id: numeric user id
    :param count: maximum number of media items to return
    :param max_id: cursor to start from ('' means the first page)
    :return: list of resolved media dicts
    """
    medias = []
    has_next_page = True
    while len(medias) < count and has_next_page:
        variables = json.dumps({
            'id': str(user_id),
            'first': min(count - len(medias), 50),  # one request seems to be capped at 50 items
            'after': str(max_id)
        }, separators=(',', ':'))  # compact separators: the default (', ', ': ') would add spaces
        url = get_account_media_link(variables)
        header = generate_header()
        response = requests.get(url, headers=header, proxies=proxies)

        media_json_data = json.loads(response.text)
        timeline = media_json_data['data']['user']['edge_owner_to_timeline_media']
        media_raw_data = timeline['edges']

        if not media_raw_data:
            return medias

        for item in media_raw_data:
            if len(medias) == count:
                return medias
            medias.append(general_resolve_media(item['node']))
        # move the cursor forward for the next request
        max_id = timeline['page_info']['end_cursor']
        has_next_page = timeline['page_info']['has_next_page']
    return medias

def get_media_by_url(media_url):
    response = requests.get(get_media_url(media_url), proxies=proxies, headers=generate_header())
    media_json = json.loads(response.text)
    return general_resolve_media(media_json['graphql']['shortcode_media'])

def get_account_media_link(variables):
    return ACCOUNT_MEDIAS % urllib.parse.quote(variables)

def get_account_link(user_name):
    return ACCOUNT_PAGE % user_name

def get_media_url(media_url):
    return media_url.rstrip('/') + '/?__a=1'

# def generate_instagram_gis(varibles):
#     rhx_gis = get_rhx_gis()
#     gis_token = rhx_gis + ':' + varibles
#     x_instagram_token = hashlib.md5(gis_token.encode('utf-8')).hexdigest()
#     return x_instagram_token

def generate_header(gis_token=''):
    # todo: if have session, add the session key:value to header
    header = {
        'user-agent': USER_AGENT,
    }
    if gis_token:
        header['x-instagram-gis'] = gis_token
    return header

def general_resolve_media(media):
    # a post without a caption has an empty edges list, so guard the [0] lookup
    caption_edges = media.get('edge_media_to_caption', {}).get('edges', [])
    res = {
        'id': media['id'],
        'type': media['__typename'][5:].lower(),  # strip the 'Graph' prefix, e.g. 'GraphImage' -> 'image'
        'content': caption_edges[0]['node']['text'] if caption_edges else '',
        'title': media.get('title') or '',
        'shortcode': media['shortcode'],
        'preview_url': BASE_URL + '/p/' + media['shortcode'],
        'comments_count': media['edge_media_to_comment']['count'],
        'likes_count': media['edge_media_preview_like']['count'],
        'dimensions': media.get('dimensions') or {},
        'display_url': media['display_url'],
        'owner_id': media['owner']['id'],
        'thumbnail_src': media.get('thumbnail_src') or '',
        'is_video': media['is_video'],
        'video_url': media.get('video_url') or ''  # only present on video posts
    }
    return res

def resolve_account_data(account_data):
    # every user field lives under the same nested node, so look it up once
    user = account_data['entry_data']['ProfilePage'][0]['graphql']['user']
    account = {
        'country': account_data['country_code'],
        'language': account_data['language_code'],
        'biography': user['biography'],
        'followers_count': user['edge_followed_by']['count'],
        'follow_count': user['edge_follow']['count'],
        'full_name': user['full_name'],
        'id': user['id'],
        'is_private': user['is_private'],
        'is_verified': user['is_verified'],
        'profile_pic_url': user['profile_pic_url_hd'],
        'username': user['username'],
    }
    return account

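if __name__ == '__main__':
    account = get_account('shaq')

    # 56 is more than one request can return, so this exercises pagination
    result = get_media_by_user_id(account['id'], 56)

    media = get_media_by_url('https://www.instagram.com/p/Bw3-Q2XhDMf/')

    print(len(result))
    print(result)
    print(media)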

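Packaged as a library!

Beyond this, for convenience I wrote a library and put it on GitHub. It covers many operations, and I'd appreciate it if you took a look and gave me some suggestions. If it is useful to you, stars and PRs are welcome. Thanks, everyone! -> GitHub link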

Comments: 2

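@lovecn I learned it by reading the relevant code in other open-source libraries. As for how those libraries found out, reportedly someone formatted Instagram's minified JS files and worked it out by reading the code. A real JS reverse-engineering expert.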
