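18.7. gc -- Garbage Collector
Purpose: Manage the memory used by Python objects.
gc exposes the underlying memory management mechanism of Python, the automatic garbage collector. The module includes functions for controlling how the collector operates and for examining the objects known to the system, whether they are pending collection or stuck in reference cycles and unable to be freed.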
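Tracing References
With gc, the incoming and outgoing references between objects can be used to find cycles in complex data structures. If a data structure is known to have a cycle, custom code can be used to examine its properties. If the cycle is in unknown code, the get_referents() and get_referrers() functions can be used to build generic debugging tools.
For example, get_referents() shows the objects referred to by the input arguments.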
gc_get_referents.py
import gc
import pprint


class Graph:

    def __init__(self, name):
        self.name = name
        self.next = None

    def set_next(self, next):
        print('Linking nodes {}.next = {}'.format(self, next))
        self.next = next

    def __repr__(self):
        return '{}({})'.format(
            self.__class__.__name__, self.name)


# Construct a graph cycle.
one = Graph('one')
two = Graph('two')
three = Graph('three')
one.set_next(two)
two.set_next(three)
three.set_next(one)

print()
print('three refers to:')
for r in gc.get_referents(three):
    pprint.pprint(r)
In this case, the Graph instance three holds references to its instance dictionary (the __dict__ attribute) and to its class.
$ python3 gc_get_referents.py
Linking nodes Graph(one).next = Graph(two)
Linking nodes Graph(two).next = Graph(three)
Linking nodes Graph(three).next = Graph(one)
three refers to:
{'name': 'three', 'next': Graph(one)}
<class '__main__.Graph'>
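The next example uses a Queue to perform a breadth-first traversal of all of the object references, looking for cycles. The items inserted into the queue are tuples containing the reference chain so far and the next object to examine. It starts with three and looks at everything it refers to. Skipping classes avoids descending into methods, modules, and so on.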
gc_get_referents_cycles.py
import gc
import pprint
import queue


class Graph:

    def __init__(self, name):
        self.name = name
        self.next = None

    def set_next(self, next):
        print('Linking nodes {}.next = {}'.format(self, next))
        self.next = next

    def __repr__(self):
        return '{}({})'.format(
            self.__class__.__name__, self.name)


# Construct a graph cycle.
one = Graph('one')
two = Graph('two')
three = Graph('three')
one.set_next(two)
two.set_next(three)
three.set_next(one)

print()

seen = set()
to_process = queue.Queue()

# Start with an empty object chain and Graph three.
to_process.put(([], three))

# Look for cycles, building the object chain for each object
# taken from the queue so the full cycle can be printed.
while not to_process.empty():
    chain, next = to_process.get()
    chain = chain[:]
    chain.append(next)
    print('Examining:', repr(next))
    seen.add(id(next))
    for r in gc.get_referents(next):
        if isinstance(r, str) or isinstance(r, type):
            # Ignore strings and classes.
            pass
        elif id(r) in seen:
            print()
            print('Found a cycle to {}:'.format(r))
            for i, link in enumerate(chain):
                print(' {}: '.format(i), end=' ')
                pprint.pprint(link)
        else:
            to_process.put((chain, r))
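The cycle in the nodes is easily found by watching for objects that have already been processed. To avoid holding references to those objects, only their id() values are cached in a set. The dictionary objects found in the cycle are the __dict__ values for the Graph instances, which hold their instance attributes.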
$ python3 gc_get_referents_cycles.py
Linking nodes Graph(one).next = Graph(two)
Linking nodes Graph(two).next = Graph(three)
Linking nodes Graph(three).next = Graph(one)
Examining: Graph(three)
Examining: {'name': 'three', 'next': Graph(one)}
Examining: Graph(one)
Examining: {'name': 'one', 'next': Graph(two)}
Examining: Graph(two)
Examining: {'name': 'two', 'next': Graph(three)}
Found a cycle to Graph(three):
0: Graph(three)
1: {'name': 'three', 'next': Graph(one)}
2: Graph(one)
3: {'name': 'one', 'next': Graph(two)}
4: Graph(two)
5: {'name': 'two', 'next': Graph(three)}
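The traversal above can also be packaged as a reusable helper. This is only a sketch under a few assumptions: find_cycle() and Node are hypothetical names, collections.deque replaces queue.Queue because no thread safety is needed, and, like the script above, it treats any object reached a second time as closing a cycle, so shared (non-cyclic) references can produce false positives.

```python
import collections
import gc


class Node:
    """Hypothetical stand-in for the Graph class used above."""

    def __init__(self, name):
        self.name = name
        self.next = None

    def __repr__(self):
        return 'Node({})'.format(self.name)


def find_cycle(start):
    """Return the reference chain that leads back to an object
    already visited, or None if no such chain is found."""
    seen = set()
    to_process = collections.deque([([], start)])
    while to_process:
        chain, obj = to_process.popleft()
        chain = chain + [obj]          # copy, so chains stay independent
        seen.add(id(obj))
        for r in gc.get_referents(obj):
            if isinstance(r, (str, type)):
                continue               # ignore strings and classes
            if id(r) in seen:
                return chain
            to_process.append((chain, r))
    return None


a = Node('a')
b = Node('b')
a.next = b
b.next = a

chain = find_cycle(a)
print(chain)
print(find_cycle(Node('solo')))
```

Depending on the interpreter version, the returned chain may or may not include the instance dictionaries between the nodes.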
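Forcing Garbage Collection
Although the garbage collector runs automatically as the interpreter executes a program, it can be triggered to run at a specific time, either when there are a lot of objects to free or when little work is happening and a collection will not hurt application performance. Trigger a collection using collect().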
gc_collect.py
import gc
import pprint


class Graph:

    def __init__(self, name):
        self.name = name
        self.next = None

    def set_next(self, next):
        print('Linking nodes {}.next = {}'.format(self, next))
        self.next = next

    def __repr__(self):
        return '{}({})'.format(
            self.__class__.__name__, self.name)


# Construct a graph cycle.
one = Graph('one')
two = Graph('two')
three = Graph('three')
one.set_next(two)
two.set_next(three)
three.set_next(one)

# Remove references to the graph nodes in this module's namespace.
one = two = three = None

# Show the effect of garbage collection.
for i in range(2):
    print('\nCollecting {} ...'.format(i))
    n = gc.collect()
    print('Unreachable objects:', n)
    print('Remaining Garbage:', end=' ')
    pprint.pprint(gc.garbage)
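In this example, the cycle is cleared the first time collection runs, because nothing refers to the Graph nodes except the nodes themselves. collect() returns the number of "unreachable" objects it found. In the output shown here the value is 6, because there are three objects plus their instance attribute dictionaries.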
$ python3 gc_collect.py
Linking nodes Graph(one).next = Graph(two)
Linking nodes Graph(two).next = Graph(three)
Linking nodes Graph(three).next = Graph(one)
Collecting 0 ...
Unreachable objects: 6
Remaining Garbage: []
Collecting 1 ...
Unreachable objects: 0
Remaining Garbage: []
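One way to confirm that collect() really frees a cycle is to watch a node through a weak reference. The sketch below is an illustration rather than part of the original examples: Node and the variable names are hypothetical, and automatic collection is switched off temporarily so the collector cannot run before the explicit call.

```python
import gc
import weakref


class Node:
    """Hypothetical stand-in for the Graph class used above."""

    def __init__(self, name):
        self.name = name
        self.next = None


was_enabled = gc.isenabled()
gc.disable()              # keep automatic collection out of the way

a = Node('a')
b = Node('b')
a.next = b
b.next = a

probe = weakref.ref(a)    # does not keep the node alive
del a, b                  # the cycle is now unreachable

# Reference counting alone cannot free the cycle.
alive_before = probe() is not None

found = gc.collect()
alive_after = probe() is not None

if was_enabled:
    gc.enable()

print(alive_before, alive_after, found)
```

The exact count returned by collect() depends on the interpreter version and on whatever other garbage exists at the time, so only the before and after state of the weak reference is dependable.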
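Finding References to Objects that Cannot be Collected
Looking for the object that holds a reference to another object is a little trickier than seeing what an object references. Because the code asking about the reference needs to hold a reference itself, some of the referrers need to be ignored. This example creates a reference cycle, then works through the Graph instances and removes the reference in the "parent" node.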
gc_get_referrers.py
import gc
import pprint


class Graph:

    def __init__(self, name):
        self.name = name
        self.next = None

    def set_next(self, next):
        print('Linking nodes {}.next = {}'.format(self, next))
        self.next = next

    def __repr__(self):
        return '{}({})'.format(
            self.__class__.__name__, self.name)

    def __del__(self):
        print('{}.__del__()'.format(self))


# Construct a graph cycle.
one = Graph('one')
two = Graph('two')
three = Graph('three')
one.set_next(two)
two.set_next(three)
three.set_next(one)

# Collecting now finds nothing to free: the nodes are still
# referenced from this module, so they are not garbage.
print()
print('Collecting...')
n = gc.collect()
print('Unreachable objects:', n)
print('Remaining Garbage:', end=' ')
pprint.pprint(gc.garbage)

# Ignore references from local variables in this module, global
# variables, and from the garbage collector's bookkeeping.
REFERRERS_TO_IGNORE = [locals(), globals(), gc.garbage]


def find_referring_graphs(obj):
    print('Looking for references to {!r}'.format(obj))
    referrers = (r for r in gc.get_referrers(obj)
                 if r not in REFERRERS_TO_IGNORE)
    for ref in referrers:
        if isinstance(ref, Graph):
            # A graph node
            yield ref
        elif isinstance(ref, dict):
            # An instance or other namespace dictionary
            for parent in find_referring_graphs(ref):
                yield parent


# Look for objects that refer to the objects in the graph.
print()
print('Clearing referrers:')
for obj in [one, two, three]:
    for ref in find_referring_graphs(obj):
        print('Found referrer:', ref)
        ref.set_next(None)
        del ref  # remove the reference so the node can be deleted
    del obj  # remove the reference so the node can be deleted

# Clear references held by gc.garbage.
print()
print('Clearing gc.garbage:')
del gc.garbage[:]

# Everything should have been freed this time.
print()
print('Collecting...')
n = gc.collect()
print('Unreachable objects:', n)
print('Remaining Garbage:', end=' ')
pprint.pprint(gc.garbage)
This sort of logic is overkill if the cycles are already understood, but for an unexplained cycle, get_referrers() can expose the unexpected relationship.
$ python3 gc_get_referrers.py
Linking nodes Graph(one).next = Graph(two)
Linking nodes Graph(two).next = Graph(three)
Linking nodes Graph(three).next = Graph(one)
Collecting...
Unreachable objects: 0
Remaining Garbage: []
Clearing referrers:
Looking for references to Graph(one)
Looking for references to {'name': 'three', 'next': Graph(one)}
Found referrer: Graph(three)
Linking nodes Graph(three).next = None
Looking for references to Graph(two)
Looking for references to {'name': 'one', 'next': Graph(two)}
Found referrer: Graph(one)
Linking nodes Graph(one).next = None
Looking for references to Graph(three)
Looking for references to {'name': 'two', 'next': Graph(three)}
Found referrer: Graph(two)
Linking nodes Graph(two).next = None
Clearing gc.garbage:
Collecting...
Unreachable objects: 0
Remaining Garbage: []
Graph(one).__del__()
Graph(two).__del__()
Graph(three).__del__()
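A smaller experiment shows the basic shape of what get_referrers() returns. This is a minimal sketch with hypothetical names (Node, holder_list, holder_dict). The result usually also contains entries such as the module namespace, which is why the example above filters its referrers, so the sketch only checks for the two containers it created.

```python
import gc


class Node:
    """Hypothetical stand-in for the Graph class used above."""

    def __init__(self, name):
        self.name = name


target = Node('target')
holder_list = [target]
holder_dict = {'item': target}

referrers = gc.get_referrers(target)

# Compare by identity to avoid accidental equality matches.
in_list = any(r is holder_list for r in referrers)
in_dict = any(r is holder_dict for r in referrers)
print(in_list, in_dict)

# The result list itself holds references to the referrers,
# so drop it once it is no longer needed.
del referrers
```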
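Collection Thresholds and Generations
The garbage collector maintains three lists of objects it sees as it runs, one for each "generation" tracked by the collector. As objects are examined in each generation, they are either collected or they age into subsequent generations, until they finally reach the stage where they are kept permanently.
The collector routines can be tuned to run at different frequencies based on the difference between the number of object allocations and deallocations between runs. When the number of allocations minus the number of deallocations is greater than the threshold for the generation, the garbage collector is run. The current thresholds can be examined with get_threshold().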
gc_get_threshold.py
import gc
print(gc.get_threshold())
The return value is a tuple with the threshold for each generation.
$ python3 gc_get_threshold.py
(700, 10, 10)
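The thresholds can be changed with set_threshold(). This example program uses a command line argument to set the threshold for generation 0 and then allocates a series of objects.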
gc_threshold.py
import gc
import pprint
import sys

try:
    threshold = int(sys.argv[1])
except (IndexError, ValueError, TypeError):
    print('Missing or invalid threshold, using default')
    threshold = 5


class MyObj:

    def __init__(self, name):
        self.name = name
        print('Created', self.name)


gc.set_debug(gc.DEBUG_STATS)

gc.set_threshold(threshold, 1, 1)
print('Thresholds:', gc.get_threshold())

print('Clear the collector by forcing a run')
gc.collect()
print()

print('Creating objects')
objs = []
for i in range(10):
    objs.append(MyObj(i))
print('Exiting')

# Turn off debugging
gc.set_debug(0)
Different threshold values cause the garbage collection sweeps to happen at different times, shown here because debugging is enabled.
$ python3 -u gc_threshold.py 5
Thresholds: (5, 1, 1)
Clear the collector by forcing a run
gc: collecting generation 2...
gc: objects in each generation: 505 2161 4858
gc: done, 0.0010s elapsed
Creating objects
gc: collecting generation 0...
gc: objects in each generation: 5 0 7323
gc: done, 0.0000s elapsed
Created 0
Created 1
gc: collecting generation 0...
gc: objects in each generation: 4 2 7323
gc: done, 0.0000s elapsed
Created 2
Created 3
Created 4
gc: collecting generation 1...
gc: objects in each generation: 6 3 7323
gc: done, 0.0000s elapsed
Created 5
Created 6
Created 7
gc: collecting generation 0...
gc: objects in each generation: 6 0 7329
gc: done, 0.0000s elapsed
Created 8
Created 9
Exiting
A smaller threshold causes the sweeps to run more frequently.
$ python3 -u gc_threshold.py 2
Thresholds: (2, 1, 1)
Clear the collector by forcing a run
gc: collecting generation 2...
gc: objects in each generation: 505 2161 4858
gc: done, 0.0010s elapsed
gc: collecting generation 0...
gc: objects in each generation: 2 0 7323
gc: done, 0.0000s elapsed
Creating objects
gc: collecting generation 0...
gc: objects in each generation: 5 0 7323
gc: done, 0.0000s elapsed
gc: collecting generation 1...
gc: objects in each generation: 3 3 7323
gc: done, 0.0000s elapsed
Created 0
Created 1
gc: collecting generation 0...
gc: objects in each generation: 4 0 7325
gc: done, 0.0000s elapsed
Created 2
gc: collecting generation 0...
gc: objects in each generation: 7 1 7325
gc: done, 0.0000s elapsed
Created 3
Created 4
gc: collecting generation 1...
gc: objects in each generation: 4 3 7325
gc: done, 0.0000s elapsed
Created 5
gc: collecting generation 0...
gc: objects in each generation: 7 0 7329
gc: done, 0.0000s elapsed
Created 6
Created 7
gc: collecting generation 0...
gc: objects in each generation: 4 2 7329
gc: done, 0.0000s elapsed
Created 8
gc: collecting generation 1...
gc: objects in each generation: 7 3 7329
gc: done, 0.0000s elapsed
Created 9
Exiting
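When experimenting with thresholds inside a larger program, it is usually worth restoring the original values afterwards. The following sketch is an illustration, not part of the original examples: it reads the thresholds and the collector's current counters with get_count(), changes only the generation 0 threshold, and then puts the saved values back. The value 100 is arbitrary.

```python
import gc

original = gc.get_threshold()
counts = gc.get_count()          # current collection counters
print('thresholds:', original)
print('counts    :', counts)

# Passing a single argument changes only the first threshold.
gc.set_threshold(100)
changed = gc.get_threshold()
print('changed   :', changed)

# Restore the values saved at the start.
gc.set_threshold(*original)
restored = gc.get_threshold()
print('restored  :', restored)
```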
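Debugging
Debugging memory leaks can be challenging. gc includes several options that expose its inner workings to make the job easier. The options are bit-flags meant to be combined and passed to set_debug() to configure the garbage collector while the program is running. Debugging information is printed to sys.stderr.
The DEBUG_STATS flag turns on statistics reporting, causing the garbage collector to report the number of objects tracked for each generation and the amount of time it took to perform the sweep.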
gc_debug_stats.py
import gc
gc.set_debug(gc.DEBUG_STATS)
gc.collect()
print('Exiting')
This example output shows separate runs of the collector: one when it is invoked explicitly, and further runs as the interpreter exits.
$ python3 gc_debug_stats.py
gc: collecting generation 2...
gc: objects in each generation: 618 1413 4860
gc: done, 0.0009s elapsed
Exiting
gc: collecting generation 2...
gc: objects in each generation: 1 0 6746
gc: done, 0.0022s elapsed
gc: collecting generation 2...
gc: objects in each generation: 113 0 6570
gc: done, 2930 unreachable, 0 uncollectable, 0.0012s elapsed
gc: collecting generation 2...
gc: objects in each generation: 0 0 3189
gc: done, 151 unreachable, 0 uncollectable, 0.0003s elapsed
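Because the debug options are bit-flags, one flag can be switched on for a single collection without disturbing whatever was already set. The sketch below assumes nothing beyond the gc functions get_debug(), set_debug(), and collect(); the variable names are made up for illustration.

```python
import gc

previous = gc.get_debug()

# Add DEBUG_STATS while keeping any flags that were already set.
gc.set_debug(previous | gc.DEBUG_STATS)
stats_on = bool(gc.get_debug() & gc.DEBUG_STATS)

gc.collect()   # statistics for this run are written to sys.stderr

# Put the flags back the way they were.
gc.set_debug(previous)
restored = gc.get_debug() == previous

print(stats_on, restored)
```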
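Enabling DEBUG_COLLECTABLE and DEBUG_UNCOLLECTABLE causes the collector to report on whether each object it examines can or cannot be collected. If seeing the objects that cannot be collected is not enough information to understand where data is being retained, enable DEBUG_SAVEALL to make gc preserve all of the unreferenced objects it finds in the garbage list.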
gc_debug_saveall.py
import gc

flags = (gc.DEBUG_COLLECTABLE |
         gc.DEBUG_UNCOLLECTABLE |
         gc.DEBUG_SAVEALL
         )

gc.set_debug(flags)


class Graph:

    def __init__(self, name):
        self.name = name
        self.next = None

    def set_next(self, next):
        self.next = next

    def __repr__(self):
        return '{}({})'.format(
            self.__class__.__name__, self.name)


class CleanupGraph(Graph):

    def __del__(self):
        print('{}.__del__()'.format(self))


# Construct a graph cycle.
one = Graph('one')
two = Graph('two')
one.set_next(two)
two.set_next(one)

# Construct another node that stands on its own.
three = CleanupGraph('three')

# Construct a graph cycle with a finalizer.
four = CleanupGraph('four')
five = CleanupGraph('five')
four.set_next(five)
five.set_next(four)

# Remove references to the graph nodes in this module's namespace.
one = two = three = four = five = None

# Force a sweep.
print('Collecting')
gc.collect()
print('Done')

# Report on what was left.
for o in gc.garbage:
    if isinstance(o, Graph):
        print('Retained: {} 0x{:x}'.format(o, id(o)))

# Reset the debug flags before exiting to avoid dumping a lot
# of extra information and making the example output confusing.
gc.set_debug(0)
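This allows the objects to be examined after garbage collection, which is helpful if, for example, the constructor cannot be changed to print the object id when each object is created.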
$ python3 -u gc_debug_saveall.py
CleanupGraph(three).__del__()
Collecting
gc: collectable <Graph 0x101fe1f28>
gc: collectable <Graph 0x103d02048>
gc: collectable <dict 0x101c92678>
gc: collectable <dict 0x101c926c0>
gc: collectable <CleanupGraph 0x103d02160>
gc: collectable <CleanupGraph 0x103d02198>
gc: collectable <dict 0x101fe73f0>
gc: collectable <dict 0x101fe7360>
CleanupGraph(four).__del__()
CleanupGraph(five).__del__()
Done
Retained: Graph(one) 0x101fe1f28
Retained: Graph(two) 0x103d02048
Retained: CleanupGraph(four) 0x103d02160
Retained: CleanupGraph(five) 0x103d02198
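The saved objects stay alive for as long as gc.garbage refers to them, so a debugging session normally ends by clearing that list. The following sketch illustrates the round trip; Node and the variable names are hypothetical, and only DEBUG_SAVEALL is enabled so nothing is printed by the collector itself.

```python
import gc


class Node:
    """Hypothetical stand-in for the Graph class used above."""

    def __init__(self, name):
        self.name = name
        self.other = None


gc.collect()                      # start from a clean slate
gc.set_debug(gc.DEBUG_SAVEALL)

a = Node('a')
b = Node('b')
a.other = b
b.other = a
del a, b                          # leave an unreachable cycle

gc.collect()                      # unreachable objects go to gc.garbage
saved_names = sorted(
    o.name for o in gc.garbage if isinstance(o, Node))
print('saved:', saved_names)

# Stop saving, drop the saved references, and collect for real.
gc.set_debug(0)
del gc.garbage[:]
gc.collect()
print('still saved:',
      any(isinstance(o, Node) for o in gc.garbage))
```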
For simplicity, DEBUG_LEAK is defined as the combination of DEBUG_COLLECTABLE, DEBUG_UNCOLLECTABLE, and DEBUG_SAVEALL.
gc_debug_leak.py
import gc

flags = gc.DEBUG_LEAK

gc.set_debug(flags)


class Graph:

    def __init__(self, name):
        self.name = name
        self.next = None

    def set_next(self, next):
        self.next = next

    def __repr__(self):
        return '{}({})'.format(
            self.__class__.__name__, self.name)


class CleanupGraph(Graph):

    def __del__(self):
        print('{}.__del__()'.format(self))


# Construct a graph cycle.
one = Graph('one')
two = Graph('two')
one.set_next(two)
two.set_next(one)

# Construct another node that stands on its own.
three = CleanupGraph('three')

# Construct a graph cycle with a finalizer.
four = CleanupGraph('four')
five = CleanupGraph('five')
four.set_next(five)
five.set_next(four)

# Remove references to the graph nodes in this module's namespace.
one = two = three = four = five = None

# Force a sweep.
print('Collecting')
gc.collect()
print('Done')

# Report on what was left.
for o in gc.garbage:
    if isinstance(o, Graph):
        print('Retained: {} 0x{:x}'.format(o, id(o)))

# Reset the debug flags before exiting to avoid dumping a lot
# of extra information and making the example output confusing.
gc.set_debug(0)
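Keep in mind that because DEBUG_SAVEALL is enabled by DEBUG_LEAK, even the unreferenced objects that would normally have been collected and deleted are retained in the garbage list.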
$ python3 -u gc_debug_leak.py
CleanupGraph(three).__del__()
Collecting
gc: collectable <Graph 0x1044e1f28>
gc: collectable <Graph 0x1044eb048>
gc: collectable <dict 0x101c92678>
gc: collectable <dict 0x101c926c0>
gc: collectable <CleanupGraph 0x1044eb160>
gc: collectable <CleanupGraph 0x1044eb198>
gc: collectable <dict 0x1044e7360>
gc: collectable <dict 0x1044e72d0>
CleanupGraph(four).__del__()
CleanupGraph(five).__del__()
Done
Retained: Graph(one) 0x1044e1f28
Retained: Graph(two) 0x1044eb048
Retained: CleanupGraph(four) 0x1044eb160
Retained: CleanupGraph(five) 0x1044eb198
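The composition of DEBUG_LEAK can be checked directly from the module constants. This short sketch is an illustration with made-up variable names; it also shows that DEBUG_STATS is not part of DEBUG_LEAK and has to be added separately if statistics are wanted.

```python
import gc

expected = (gc.DEBUG_COLLECTABLE |
            gc.DEBUG_UNCOLLECTABLE |
            gc.DEBUG_SAVEALL)

leak_matches = gc.DEBUG_LEAK == expected
includes_stats = bool(gc.DEBUG_LEAK & gc.DEBUG_STATS)

print(leak_matches, includes_stats)

# Statistics can be requested in addition to the leak flags.
leak_with_stats = gc.DEBUG_LEAK | gc.DEBUG_STATS
```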
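See also
- Standard library documentation for gc
- Python 2 to 3 porting notes for gc
- weakref -- A module for referring to an object without increasing its reference count.
- Supporting Cyclic Garbage Collection -- Background material from the Python C API documentation.
- How does Python manage memory? -- An article on Python memory management by Fredrik Lundh.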