统计用户留存

统计用户留存

最初用msyql统计,数据稍微多点就很慢。所以把数据同步到clickhouse再统计,会快很多。

mysql 查询统计

# mysql统计

SELECT 
    *,
    CONCAT(ROUND(100*次日留存用户/日新增用户数,2),'%')  '次日留存率',
    CONCAT(ROUND(100*三日留存用户/日新增用户数,2),'%')  '三日留存率',
    CONCAT(ROUND(100*四日留存用户/日新增用户数,2),'%')  '四日留存率',
    CONCAT(ROUND(100*五日留存用户/日新增用户数,2),'%')  '五日留存率',
    CONCAT(ROUND(100*六日留存用户/日新增用户数,2),'%')  '六日留存率',
    CONCAT(ROUND(100*七日留存用户/日新增用户数,2),'%')  '七日留存率',
    CONCAT(ROUND(100*月留存用户/日新增用户数,2),'%')  '月留存率'
FROM 
(    
    SELECT 
    t.`dt` '日期',
    t.cid '渠道',
    COUNT(DISTINCT(t.id))  '日新增用户数',
    COUNT(DISTINCT(t1.uid))  '次日留存用户',
    COUNT(DISTINCT(t2.uid))  '三日留存用户',
    COUNT(DISTINCT(t3.uid))  '四日留存用户',
    COUNT(DISTINCT(t4.uid))  '五日留存用户',
    COUNT(DISTINCT(t5.uid))  '六日留存用户',
    COUNT(DISTINCT(t6.uid))  '七日留存用户',
    COUNT(DISTINCT(t7.uid))  '月留存用户'
    FROM 
    (
        SELECT min(date(create_time)) as dt, uid, cid FROM login_record  group by date(create_time),uid,cid
    ) t 
    LEFT JOIN login_record t1 ON t.id = t1.uid AND DATEDIFF(date(t1.create_time),t.`dt`) = 1
    LEFT JOIN login_record t2 ON t.id = t2.uid AND DATEDIFF(date(t2.create_time),t.`dt`) = 2
    LEFT JOIN login_record t3 ON t.id = t3.uid AND DATEDIFF(date(t3.create_time),t.`dt`) = 3
    LEFT JOIN login_record t4 ON t.id = t4.uid AND DATEDIFF(date(t4.create_time),t.`dt`) = 4
    LEFT JOIN login_record t5 ON t.id = t5.uid AND DATEDIFF(date(t5.create_time),t.`dt`) = 5
    LEFT JOIN login_record t6 ON t.id = t6.uid AND DATEDIFF(date(t6.create_time),t.`dt`) = 6
    LEFT JOIN login_record t7 ON t.id = t7.uid AND DATEDIFF(date(t7.create_time),t.`dt`) = 30
    GROUP BY t.dt, t.cid
) p WHERE 日期 BETWEEN '2022-07-1' AND '2022-07-25'  AND 渠道 = 100;

clickhouse 查询统计

统计活跃用户留存和留存率

只查询login_record表, 先统计用户最开始登录的时间(基础日期)
左连接自己统计登录时间(相对日期)
连接条件是用户id相等
并且基础日期小于等于相对日期
相对日期-基础日期=天数
留存数:
判断天数=0, 则是当天登录数. =1, 则是次登数
次登留存率 round(次登数/当天登录数, 保留位数)

# clickhouse统计

select 
    base_date,
    cid,
    day_0,
    day_1,
    day_2,
    day_3,
    day_4,
    day_5,
    day_6,
    day_29,
    -- 保留4位小数,用于百分比展示
    round(day_1/day_0, 4) retention_2,
    round(day_2/day_0, 4) retention_3,
    round(day_3/day_0, 4) retention_4,
    round(day_4/day_0, 4) retention_5,
    round(day_5/day_0, 4) retention_6,
    round(day_6/day_0, 4) retention_7,
    round(day_29/day_0, 4) retention_30
from
(
    select 
        base_date,
        cid,
        -- 留存数,day_0: 当天登录数
        sum(case when datediff = 0 then 1 else 0 end) day_0,
        sum(case when datediff = 1 then 1 else 0 end) day_1,
        sum(case when datediff = 2 then 1 else 0 end) day_2,
        sum(case when datediff = 3 then 1 else 0 end) day_3,
        sum(case when datediff = 4 then 1 else 0 end) day_4,
        sum(case when datediff = 5 then 1 else 0 end) day_5,
        sum(case when datediff = 6 then 1 else 0 end) day_6,
        sum(case when datediff = 29 then 1 else 0 end) day_29
    from 
    (
        select
            uid,
            cid,
            base_date,
            retention_date,
            -- 后续登录时间 - 首日登录时间 = 天数
            retention_date - base_date datediff
        from
        (
            select
                uid,
                cid,
                base_date,
                retention_date
            from
            -- 表a,首日登录的用户
            (
                select
                    distinct 
                        date(create_time) base_date, 
                        uid
                from
                    login_record
                where
                    -- 需要计算留存的日期区间
                    base_date between '2022-07-01' and '2022-07-25'
                 group by 
                     base_date, uid
            ) a
            left join
            -- 表b,表连接匹配后续登录的情况,计算留存
            (
                select
                    distinct 
                        date(create_time) retention_date, 
                        uid,
                        cid
                from
                    login_record
                where
                    -- 此处不应限制最大日期,否则可能导致后续留存率为0
                    retention_date >= '2022-07-01'
                group by
                    retention_date, uid, cid
               ) b
            on 
                a.uid = b.uid
            where 
                -- 保证每条记录不会连接到在当前记录时间之前的数据,如昨天的登录
                base_date <= retention_date
            order by
                base_date
            ) c
            order by base_date
        ) d
        group by base_date, cid
        order by base_date
)

统计新增用户留存和留存率

与统计活跃用户不同,这里使用user表记录用户创建时间,与login_record表连接后计算时间差

# clickhouse 查询
select 
    base_date,
    cid,
    day_0,
    day_1,
    day_2,
    day_3,
    day_4,
    day_5,
    day_6,
    day_29,
    -- 保留4位小数,用于百分比展示
    round(day_1/day_0, 4) retention_2,
    round(day_2/day_0, 4) retention_3,
    round(day_3/day_0, 4) retention_4,
    round(day_4/day_0, 4) retention_5,
    round(day_5/day_0, 4) retention_6,
    round(day_6/day_0, 4) retention_7,
    round(day_29/day_0, 4) retention_30
from
(
    select 
        base_date,
        cid,
        -- 留存数,day_0: 当天登录数
        sum(case when datediff = 0 then 1 else 0 end) day_0,
        sum(case when datediff = 1 then 1 else 0 end) day_1,
        sum(case when datediff = 2 then 1 else 0 end) day_2,
        sum(case when datediff = 3 then 1 else 0 end) day_3,
        sum(case when datediff = 4 then 1 else 0 end) day_4,
        sum(case when datediff = 5 then 1 else 0 end) day_5,
        sum(case when datediff = 6 then 1 else 0 end) day_6,
        sum(case when datediff = 29 then 1 else 0 end) day_29
    from 
    (
        select
            uid,
            cid,
            base_date,
            retention_date,
            -- 后续登录时间 - 首日登录时间 = 天数
            retention_date - base_date datediff
        from
        (
            select
                uid,
                cid,
                base_date,
                retention_date
            from
            -- 表a,首日登录的用户
            (
                select
                        DISTINCT date(create_time) base_date, 
                        cid,
                        id as uid
                from
                    user
                where
                    -- 需要计算留存的日期区间
                    base_date between '2022-07-01' and '2022-07-25'
            ) a
            left join
            -- 表b,表连接匹配后续登录的情况,计算留存
            (
                select
                        DISTINCT  date(create_time) retention_date, 
                        uid,
                        cid
                from
                    login_record
                where
                    -- 此处不应限制最大日期,否则可能导致后续留存率为0
                    retention_date > '2022-07-01'
                group by
                    retention_date, uid, cid
               ) b
            on 
                a.uid = b.uid
            where 
                -- 保证每条记录不会连接到在当前记录时间之前的数据,如昨天的登录
                base_date <= retention_date
            order by
                base_date
        ) c
        order by base_date
    ) d
    group by base_date, cid
    order by base_date
)

结果如下:

参考地址

本作品采用《CC 协议》,转载必须注明作者和本文链接
讨论数量: 0
(= ̄ω ̄=)··· 暂无内容!

讨论应以学习和精进为目的。请勿发布不友善或者负能量的内容,与人为善,比聪明更重要!