Python 中的垃圾回收机制是如何工作的?

我将在这篇文章中讨论一下在CPython中是如何实现垃圾回收机制的。

CPython中垃圾回收的主要思路

  1. 维护引用计数器 。对于每一个对象,都有一个对于该对象的引用次数的计数器。如果这个计数器的值减为了 0 ,这就代表这个对象在程序中已经没用了,那么该对象所占用的内存就会被释放。

  2. 定期检测是否循环引用。 当引用计数器的值下降到 0 时来释放内存的机制并不适用于所有的情况。假如两个对象 AB ,其中 A 拥有对 B 的引用,B 拥有对 A 的引用。 这就称之为循环引用。在这种情况下,这两个对象也没有存在的价值了,此时 AB 都应该被垃圾回收处理。但是,这两个对象的引用计数值不为零, 所以内存会一直被占用。为了解决这个问题,CPython 通过使用算法来检测是否存在循环引用并释放循环引用中的对象。

  3. 通过启发式算法提升性能。 越晚创建的对象更可能需要被回收。 CPython引入了一个 分代回收 的概念来判断一个对象使用的相对年龄。年轻一代是指最新被创建出来的对象,而老一代则代表早前创建的对象。每个对象都确定的属于某一代。 当垃圾回收机制执行时, CPython 会优先尝试回收年轻一代的对象。CPython会定期回收老一代的对象(由启发式算法确定该回收执行的效率).

垃圾回收循环

了解CPython垃圾回收的运作周期是非常有益的。我们创建一个对象来观察垃圾回收机制的运作:

  • Python需要配置一个新的对象。为此,它调用 _PyObject_GC_Malloc,给这个对象分配内存以及将其添加到垃圾回收的第一阶段(我们称为0代)。 随即查看这个对象在0代中的数值是否超过阈值。如果确实超过阈值,而且垃圾回收机制当前没有运作,对 collect_generations 的调用随机生效进行垃圾回收。否则对象正常分配内存。

  • collect_generations 被调用,Python开始垃圾回收。这个方法算出什么阶段进行垃圾回收 (CPython默认有三代,但GC模块可以修改.。此外,年轻一代拥有低级索引, 所以0代是最年轻的一代)。Python循环所有代 (从最老到最年轻) 然后检测某一代的对象值超过阈值。如果有, 它会将所有年轻代合并到 这一代然后调用collect对这一代进行垃圾回收 。注意: Python希望最好在0代进行垃圾回收, 因为这一代拥有最年轻的对象,同样也能迭代最少。对老一代进行垃圾回收相当于收集所有对象因为对第i代的垃圾回收会使用从0到i代的所有对象。

  • collect 会对特定代进行垃圾回收。这相当于运行参考循环检测算法 (待会介绍) 然后在特定代找出一系列可得到和不可得到的对象。 这些可得到的对象会被并入下一高级的代 (也就是说,如果 collect 在第i代运行, 第i代的对象会被合并到i+1代)。对于不可获得的对象, CPython 会进行所有可能的终结器回调, 使弱ref回调,最终解除这些对象分配。

  • 最后,垃圾回收模块的内部状态会更新为collect完成它的职责。

CPython's Algorithm for Detecting Reference Cycles

Python attempts to find reference cycles within a generation. Confining the search for reference cycles to a single generation decreases the amount of work that has to be done in a single collection (if it is one of the generations that holds younger objects).

To find reference cycles, Python uses young, the pointer to the head of the list of objects for the generation that's being garbage collected, and runs the following:

update_refs(young)
subtract_refs(young)
gc_init_list(&unreachable)
move_unreachable(young,  &unreachable)

The update_refs method makes a copy of the reference count for every object in the generation so that the garbage collector can mutate its own version of the reference count without messing with the real reference count.

Then subtract_refs goes through each of the objects i in the generation being garbage collected, and decrements the reference counts on any objects j in the generation list that are referenced by object i. After this method has run, the reference count on an object in the generation should equal the number of references to that object from objects which do not belong to that generation (since all references from objects within the same generation have been removed).

Now comes the fun part. The move_unreachable method scans through the young list and moves objects with a reference count of 0 into the unreachable list and changes their reference count to GC_TENTATIVELY_UNREACHABLE. Objects with a non-zero reference count are marked as GC_REACHABLE and the objects they reference are traversed and marked as GC_REACHABLE then moved to the end of the young list so they too can be traversed later.

The reason that objects with a reference count of 0 are tentatively unreachable is as follows. Suppose object A has been marked as tentatively unreachable and is referenced by some object B. Suppose that B is in the same generation as A and is actually reachable from outside the generation, but that B comes later in the young list than A. Then A would be sent to the unreachable list when move_unreachable scans over it. However, when move_unreachable scans over B, it will notice that it's reference count is non-zero, mark it as GC_REACHABLE, and traverse B's references and mark them as reachable. Now, A has become GC_REACHABLE as well and has been moved to the end of the young list so that it's references can also be marked as GC_REACHABLE. Thus, we only know that an object is unreachable after move_unreachable has scanned over the entire young list.

Once the entire young list has been traversed, then all the items left in the unreachable list are definitely unreachable and so they can be deallocated. The items in the young list are then merged into older generations.

Performance Notes:

  • CPython's garbage collector still stops the world, but does so more infrequently than other implementations. Reference count deallocation increases the time between collections because the number of objects in a generation decreases whenever an object's reference count falls to 0 and gets deallocated. Since collections are triggered when the number of objects in a generation are above a threshold, reference count deallocation decreases the number of collections as long as there aren't too many reference cycles.
  • If the hypothesis that younger objects are the objects more likely to need to be garbage collected is true, then running the garbage collector on younger generations can significantly reduce total runtime.
  • Memory fragmentation occurs. Reference counting will naturally cause memory to get fragmented since the deallocation of an object means there will be small chunks of memory that get added back onto the heap. Some garbage collectors deal with fragmentation by copying all live objects into a different section of memory and freeing up an entire section of memory, but CPython doesn't.
  • Normal execution is slower. There is an extra check whenever an object gets allocated, referenced, or dereferenced. This means that instead of running normally, then performing a garbage collection all at once like some tracing garbage collectors do, CPython's reference count garbage collector will spread the work across normal activities and perform collection less frequently. Unfortunately, this means the total amount of work that goes into garbage collection is higher (because of the added reference count checks).
本文中的所有译文仅用于学习和交流目的,转载请务必注明文章译者、出处、和本文链接
我们的翻译工作遵照 CC 协议,如果我们的工作有侵犯到您的权益,请及时联系我们。

原文地址:https://www.quora.com/Python-programming...

译文地址:https://learnku.com/python/t/22977/how-d...

本文为协同翻译文章,如您发现瑕疵请点击「改进」按钮提交优化建议
讨论数量: 0
(= ̄ω ̄=)··· 暂无内容!

讨论应以学习和精进为目的。请勿发布不友善或者负能量的内容,与人为善,比聪明更重要!