翻译进度
30
分块数量
5
参与人数

PHP 内核:foreach 是如何工作的?

这是一篇协同翻译的文章,你可以点击『我来翻译』按钮来参与翻译。

foreach 是如何工作的?

首先声明,我知道 foreach 是什么,也知道怎么去用它。但这个问题关心的是,内核中 foreach 是如何运行的,我不想回答关于“如何使用 foreach 循环数组”的任何问题。


很长时间我都认为 foreach 是直接作用于数组本身,后来一些资料表明,它作用于数组的一个副本,那时我以为这就是真相了。但最近我又讨论了一下这件事,经过一些试验,发现我之前的想法并非完全正确。

licxisky 翻译于 1周前

让我来展示一下我的观点。下面的测试用例中我们将使用以下数组:

$array = array(1, 2, 3, 4, 5);

Test case 1:

foreach ($array as $item) {
  echo "$item\n";
  $array[] = $item;
}
print_r($array);

/* 循环中输出:    1 2 3 4 5
   循环后的$array : 1 2 3 4 5 1 2 3 4 5 */

这很清晰的表明我们不直接使用数据源 - 否则循环会一直持续下去,因此我们可以在循环中不停的推送元素到数组中。为了保证正确请看下面的测试用例:

michealzh 翻译于 1周前

Test case 2:

foreach ($array as $key => $item) {
  $array[$key + 1] = $item + 2;
  echo "$item\n";
}

print_r($array);

/* 循环中输出:    1 2 3 4 5
   循环后 $array : 1 3 4 5 6 7 */

这印证了我们的初步结论,在循环中使用的是数组的副本,否则我们将看到在循环中改变后的值。 但是...

如果我们查阅手册 手册,我们会发现下面这句话:

当 foreach 首次开始执行时,数组的内部指针自动重置为数组的第一个元素。

没错。。。这似乎表明 foreach 依赖源数组的指针。但是我们刚刚证明我们 没有使用源数组,对吧?好吧,不完全是。

michealzh 翻译于 1周前

Test case 3:

// 将数组指针移动到一个上面确保它不会影响循环
var_dump(each($array));

foreach ($array as $item) {
  echo "$item\n";
}

var_dump(each($array));

/* 输出
  array(4) {
    [1]=>
    int(1)
    ["value"]=>
    int(1)
    [0]=>
    int(0)
    ["key"]=>
    int(0)
  }
  1
  2
  3
  4
  5
  bool(false)
*/

因此,尽管我们不支持使用源数组,但是直接使用源数组指针 - 指针位于循环结束时的数组末尾证明了这一点。除非这不是真的 - 如果是,那么 test case 1 将永远循环。

PHP手册还说明:

由于 foreach 依赖与内部数组指针,因此在循环内部改变它可能导致意外的行为。

让我们找出那种 “意外行为” 是什么 (从技术上讲,任何行为都是意外的,因为我们不知道将会发生什么)。

michealzh 翻译于 1周前

Test case 4:

foreach ($array as $key => $item) {
  echo "$item\n";
  each($array);
}

/* 输出: 1 2 3 4 5 */

Test case 5:

foreach ($array as $key => $item) {
  echo "$item\n";
  reset($array);
}

/* 输出: 1 2 3 4 5 */

...意料之中,事实上它似乎支持 「复制源」 理论。

wojianduanfa_sxm_87 翻译于 1周前

The Question

What is going on here? My C-fu is not good enough for me to able to extract a proper conclusion simply by looking at the PHP source code, I would appreciate it if someone could translate it into English for me.

It seems to me that foreach works with a copy of the array, but sets the array pointer of the source array to the end of the array after the loop.

  • Is this correct and the whole story?
  • If not, what is it really doing?
  • Is there any situation where using functions that adjust the array pointer (each()reset() et al.) during a foreach could affect the outcome of the loop?

解答

foreach supports iteration over three different kinds of values:

In the following, I will try to explain precisely how iteration works in different cases. By far the simplest case is Traversable objects, as for these foreach is essentially only syntax sugar for code along these lines:

foreach ($it as $k => $v) { /* ... */ }

/* translates to: */

if ($it instanceof IteratorAggregate) {
    $it = $it->getIterator();
}
for ($it->rewind(); $it->valid(); $it->next()) {
    $v = $it->current();
    $k = $it->key();
    /* ... */
}

For internal classes, actual method calls are avoided by using an internal API that essentially just mirrors the Iterator interface on the C level.

Iteration of arrays and plain objects is significantly more complicated. First of all, it should be noted that in PHP "arrays" are really ordered dictionaries and they will be traversed according to this order (which matches the insertion order as long as you didn't use something like sort). This is opposed to iterating by the natural order of the keys (how lists in other languages often work) or having no defined order at all (how dictionaries in other languages often work).

The same also applies to objects, as the object properties can be seen as another (ordered) dictionary mapping property names to their values, plus some visibility handling. In the majority of cases, the object properties are not actually stored in this rather inefficient way. However, if you start iterating over an object, the packed representation that is normally used will be converted to a real dictionary. At that point, iteration of plain objects becomes very similar to iteration of arrays (which is why I'm not discussing plain-object iteration much in here).

So far, so good. Iterating over a dictionary can't be too hard, right? The problems begin when you realize that an array/object can change during iteration. There are multiple ways this can happen:

  • If you iterate by reference using foreach ($arr as &$v) then $arr is turned into a reference and you can change it during iteration.
  • In PHP 5 the same applies even if you iterate by value, but the array was a reference beforehand: $ref =& $arr; foreach ($ref as $v)
  • Objects have by-handle passing semantics, which for most practical purposes means that they behave like references. So objects can always be changed during iteration.

The problem with allowing modifications during iteration is the case where the element you are currently on is removed. Say you use a pointer to keep track of which array element you are currently at. If this element is now freed, you are left with a dangling pointer (usually resulting in a segfault).

There are different ways of solving this issue. PHP 5 and PHP 7 differ significantly in this regard and I'll describe both behaviors in the following. The summary is that PHP 5's approach was rather dumb and lead to all kinds of weird edge-case issues, while PHP 7's more involved approach results in more predictable and consistent behavior.

As a last preliminary, it should be noted that PHP uses reference counting and copy-on-write to manage memory. This means that if you "copy" a value, you actually just reuse the old value and increment its reference count (refcount). Only once you perform some kind of modification a real copy (called a "duplication") will be done. See You're being lied to for a more extensive introduction on this topic.

PHP 5

Internal array pointer and HashPointer

Arrays in PHP 5 have one dedicated "internal array pointer" (IAP), which properly supports modifications: Whenever an element is removed, there will be a check whether the IAP points to this element. If it does, it is advanced to the next element instead.

While foreach does make use of the IAP, there is an additional complication: There is only one IAP, but one array can be part of multiple foreach loops:

// Using by-ref iteration here to make sure that it's really
// the same array in both loops and not a copy
foreach ($arr as &$v1) {
    foreach ($arr as &$v) {
        // ...
    }
}

To support two simultaneous loops with only one internal array pointer, foreach performs the following shenanigans: Before the loop body is executed, foreach will back up a pointer to the current element and its hash into a per-foreach HashPointer. After the loop body runs, the IAP will be set back to this element if it still exists. If however the element has been removed, we'll just use wherever the IAP is currently at. This scheme mostly-kinda-sort of works, but there's a lot of weird behavior you can get out of it, some of which I'll demonstrate below.

Array duplication

The IAP is a visible feature of an array (exposed through the current family of functions), as such changes to the IAP count as modifications under copy-on-write semantics. This, unfortunately, means that foreach is in many cases forced to duplicate the array it is iterating over. The precise conditions are:

  1. The array is not a reference (is_ref=0). If it's a reference, then changes to it are supposed to propagate, so it should not be duplicated.
  2. The array has refcount>1. If refcount is 1, then the array is not shared and we're free to modify it directly.

If the array is not duplicated (is_ref=0, refcount=1), then only its refcount will be incremented (*). Additionally, if foreach by reference is used, then the (potentially duplicated) array will be turned into a reference.

Consider this code as an example where duplication occurs:

function iterate($arr) {
    foreach ($arr as $v) {}
}

$outerArr = [0, 1, 2, 3, 4];
iterate($outerArr);

Here, $arr will be duplicated to prevent IAP changes on $arr from leaking to $outerArr. In terms of the conditions above, the array is not a reference (is_ref=0) and is used in two places (refcount=2). This requirement is unfortunate and an artifact of the suboptimal implementation (there is no concern of modification during iteration here, so we don't really need to use the IAP in the first place).

(*) Incrementing the refcount here sounds innocuous, but violates copy-on-write (COW) semantics: This means that we are going to modify the IAP of a refcount=2 array, while COW dictates that modifications can only be performed on refcount=1 values. This violation results in user-visible behavior change (while a COW is normally transparent) because the IAP change on the iterated array will be observable -- but only until the first non-IAP modification on the array. Instead, the three "valid" options would have been a) to always duplicate, b) do not increment the refcountand thus allowing the iterated array to be arbitrarily modified in the loop or c) don't use the IAP at all (the PHP 7 solution).

Position advancement order

There is one last implementation detail that you have to be aware of to properly understand the code samples below. The "normal" way of looping through some data structure would look something like this in pseudocode:

reset(arr);
while (get_current_data(arr, &data) == SUCCESS) {
    code();
    move_forward(arr);
}

However foreach, being a rather special snowflake, chooses to do things slightly differently:

reset(arr);
while (get_current_data(arr, &data) == SUCCESS) {
    move_forward(arr);
    code();
}

Namely, the array pointer is already moved forward before the loop body runs. This means that while the loop body is working on element $i, the IAP is already at element $i+1. This is the reason why code samples showing modification during iteration will always unset the nextelement, rather than the current one.

Examples: Your test cases

The three aspects described above should provide you with a mostly complete impression of the idiosyncrasies of the foreach implementation and we can move on to discuss some examples.

The behavior of your test cases is simple to explain at this point:

  • In test cases 1 and 2 $array starts off with refcount=1, so it will not be duplicated by foreach: Only the refcount is incremented. When the loop body subsequently modifies the array (which has refcount=2 at that point), the duplication will occur at that point. Foreach will continue working on an unmodified copy of $array.

  • In test case 3, once again the array is not duplicated, thus foreach will be modifying the IAP of the $array variable. At the end of the iteration, the IAP is NULL (meaning iteration has done), which each indicates by returning false.

  • In test cases 4 and 5 both each and reset are by-reference functions. The $array has a refcount=2 when it is passed to them, so it has to be duplicated. As such foreach will be working on a separate array again.

Examples: Effects of current in foreach

A good way to show the various duplication behaviors is to observe the behavior of the current()function inside a foreach loop. Consider this example:

foreach ($array as $val) {
    var_dump(current($array));
}
/* Output: 2 2 2 2 2 */

Here you should know that current() is a by-ref function (actually: prefer-ref), even though it does not modify the array. It has to be in order to play nice with all the other functions like next which are all by-ref. By-reference passing implies that the array has to be separated and thus $arrayand the foreach-array will be different. The reason you get 2 instead of 1 is also mentioned above: foreach advances the array pointer before running the user code, not after. So even though the code is at the first element, foreach already advanced the pointer to the second.

Now lets try a small modification:

$ref = &$array;
foreach ($array as $val) {
    var_dump(current($array));
}
/* Output: 2 3 4 5 false */

Here we have the is_ref=1 case, so the array is not copied (just like above). But now that it is a reference, the array no longer has to be duplicated when passing to the by-ref current() function. Thus current() and foreach work on the same array. You still see the off-by-one behavior though, due to the way foreach advances the pointer.

You get the same behavior when doing by-ref iteration:

foreach ($array as &$val) {
    var_dump(current($array));
}
/* Output: 2 3 4 5 false */

Here the important part is that foreach will make $array an is_ref=1 when it is iterated by reference, so basically you have the same situation as above.

Another small variation, this time we'll assign the array to another variable:

$foo = $array;
foreach ($array as $val) {
    var_dump(current($array));
}
/* Output: 1 1 1 1 1 */

Here the refcount of the $array is 2 when the loop is started, so for once we actually have to do the duplication upfront. Thus $array and the array used by foreach will be completely separate from the outset. That's why you get the position of the IAP wherever it was before the loop (in this case it was at the first position).

Examples: Modification during iteration

Trying to account for modifications during iteration is where all our foreach troubles originated, so it serves to consider some examples for this case.

Consider these nested loops over the same array (where by-ref iteration is used to make sure it really is the same one):

foreach ($array as &$v1) {
    foreach ($array as &$v2) {
        if ($v1 == 1 && $v2 == 1) {
            unset($array[1]);
        }
        echo "($v1, $v2)\n";
    }
}

// Output: (1, 1) (1, 3) (1, 4) (1, 5)

The expected part here is that (1, 2) is missing from the output because element 1 was removed. What's probably unexpected is that the outer loop stops after the first element. Why is that?

The reason behind this is the nested-loop hack described above: Before the loop body runs, the current IAP position and hash is backed up into a HashPointer. After the loop body it will be restored, but only if the element still exists, otherwise the current IAP position (whatever it may be) is used instead. In the example above this is exactly the case: The current element of the outer loop has been removed, so it will use the IAP, which has already been marked as finished by the inner loop!

Another consequence of the HashPointer backup+restore mechanism is that changes to the IAP though reset() etc. usually do not impact foreach. For example, the following code executes as if the reset() were not present at all:

$array = [1, 2, 3, 4, 5];
foreach ($array as &$value) {
    var_dump($value);
    reset($array);
}
// output: 1, 2, 3, 4, 5

The reason is that, while reset() temporarily modifies the IAP, it will be restored to the current foreach element after the loop body. To force reset() to make an effect on the loop, you have to additionally remove the current element, so that the backup/restore mechanism fails:

$array = [1, 2, 3, 4, 5];
$ref =& $array;
foreach ($array as $value) {
    var_dump($value);
    unset($array[1]);
    reset($array);
}
// output: 1, 1, 3, 4, 5

But, those examples are still sane. The real fun starts if you remember that the HashPointerrestore uses a pointer to the element and its hash to determine whether it still exists. But: Hashes have collisions, and pointers can be reused! This means that, with a careful choice of array keys, we can make foreach believe that an element that has been removed still exists, so it will jump directly to it. An example:

$array = ['EzEz' => 1, 'EzFY' => 2, 'FYEz' => 3];
$ref =& $array;
foreach ($array as $value) {
    unset($array['EzFY']);
    $array['FYFY'] = 4;
    reset($array);
    var_dump($value);
}
// output: 1, 4

Here we should normally expect the output 1, 1, 3, 4 according to the previous rules. How what happens is that 'FYFY' has the same hash as the removed element 'EzFY', and the allocator happens to reuse the same memory location to store the element. So foreach ends up directly jumping to the newly inserted element, thus short-cutting the loop.

Substituting the iterated entity during the loop

One last odd case that I'd like to mention, it is that PHP allows you to substitute the iterated entity during the loop. So you can start iterating on one array and then replace it with another array halfway through. Or start iterating on an array and then replace it with an object:

$arr = [1, 2, 3, 4, 5];
$obj = (object) [6, 7, 8, 9, 10];

$ref =& $arr;
foreach ($ref as $val) {
    echo "$val\n";
    if ($val == 3) {
        $ref = $obj;
    }
}
/* Output: 1 2 3 6 7 8 9 10 */

As you can see in this case PHP will just start iterating the other entity from the start once the substitution has happened.

PHP 7

Hashtable iterators

If you still remember, the main problem with array iteration was how to handle removal of elements mid-iteration. PHP 5 used a single internal array pointer (IAP) for this purpose, which was somewhat suboptimal, as one array pointer had to be stretched to support multiple simultaneous foreach loops and interaction with reset() etc. on top of that.

PHP 7 uses a different approach, namely, it supports creating an arbitrary amount of external, safe hashtable iterators. These iterators have to be registered in the array, from which point on they have the same semantics as the IAP: If an array element is removed, all hashtable iterators pointing to that element will be advanced to the next element.

This means that foreach will no longer use the IAP at all. The foreach loop will be absolutely no effect on the results of current() etc. and its own behavior will never be influenced by functions like reset() etc.

Array duplication

Another important change between PHP 5 and PHP 7 relates to array duplication. Now that the IAP is no longer used, by-value array iteration will only do a refcount increment (instead of duplication the array) in all cases. If the array is modified during the foreach loop, at that point a duplication will occur (according to copy-on-write) and foreach will keep working on the old array.

In most cases, this change is transparent and has no other effect than better performance. However, there is one occasion where it results in different behavior, namely the case where the array was a reference beforehand:

$array = [1, 2, 3, 4, 5];
$ref = &$array;
foreach ($array as $val) {
    var_dump($val);
    $array[2] = 0;
}
/* Old output: 1, 2, 0, 4, 5 */
/* New output: 1, 2, 3, 4, 5 */

Previously by-value iteration of reference-arrays was special cases. In this case, no duplication occurred, so all modifications of the array during iteration would be reflected by the loop. In PHP 7 this special case is gone: A by-value iteration of an array will always keep working on the original elements, disregarding any modifications during the loop.

This, of course, does not apply to by-reference iteration. If you iterate by-reference all modifications will be reflected by the loop. Interestingly, the same is true for by-value iteration of plain objects:

$obj = new stdClass;
$obj->foo = 1;
$obj->bar = 2;
foreach ($obj as $val) {
    var_dump($val);
    $obj->bar = 42;
}
/* Old and new output: 1, 42 */

This reflects the by-handle semantics of objects (i.e. they behave reference-like even in by-value contexts).

Examples

Let's consider a few examples, starting with your test cases:

  • Test cases 1 and 2 retain the same output: By-value array iteration always keep working on the original elements. (In this case, even refcounting and duplication behavior is exactly the same between PHP 5 and PHP 7).

  • Test case 3 changes: Foreach no longer uses the IAP, so each() is not affected by the loop. It will have the same output before and after.

  • Test cases 4 and 5 stay the same: each() and reset() will duplicate the array before changing the IAP, while foreach still uses the original array. (Not that the IAP change would have mattered, even if the array was shared.)

The second set of examples was related to the behavior of current() under different reference/refcounting configurations. This no longer makes sense, as current() is completely unaffected by the loop, so its return value always stays the same.

However, we get some interesting changes when considering modifications during iteration. I hope you will find the new behavior saner. The first example:

$array = [1, 2, 3, 4, 5];
foreach ($array as &$v1) {
    foreach ($array as &$v2) {
        if ($v1 == 1 && $v2 == 1) {
            unset($array[1]);
        }
        echo "($v1, $v2)\n";
    }
}

// Old output: (1, 1) (1, 3) (1, 4) (1, 5)
// New output: (1, 1) (1, 3) (1, 4) (1, 5)
//             (3, 1) (3, 3) (3, 4) (3, 5)
//             (4, 1) (4, 3) (4, 4) (4, 5)
//             (5, 1) (5, 3) (5, 4) (5, 5) 

As you can see, the outer loop no longer aborts after the first iteration. The reason is that both loops now have entirely separate hashtable iterators, and there is no longer any cross-contamination of both loops through a shared IAP.

现在修复的另外一个奇怪的边缘现象是,当删除并且添加恰好具有相同的哈希元素时,会得到奇怪的结果:

$array = ['EzEz' => 1, 'EzFY' => 2, 'FYEz' => 3];
foreach ($array as &$value) {
    unset($array['EzFY']);
    $array['FYFY'] = 4;
    var_dump($value);
}
// Old output: 1, 4
// New output: 1, 3, 4

之前的 HashPointer 恢复机制直接跳转到新元素,因为它“看起来”和删除的元素相同(由于哈希和指针冲突)。由于我们不再依赖于哈希元素,因此不再是一个问题。

michealzh 翻译于 6天前

本文章首发在 LearnKu.com 网站上。
本文中的所有译文仅用于学习和交流目的,转载请务必注明文章译者、出处、和本文链接
我们的翻译工作遵照 CC 协议,如果我们的工作有侵犯到您的权益,请及时联系我们。

原文地址:https://stackoverflow.com/questions/1005...

译文地址:https://learnku.com/php/t/29568

参与译者:5
讨论数量: 0
(= ̄ω ̄=)··· 暂无内容!

请勿发布不友善或者负能量的内容。与人为善,比聪明更重要!