Tuesday, 18 June 2013

Weird properties of PHP's lexer and parser

There are (as of PHP 5.3.0) only two tokens which represent a single character:
  • T_NS_SEPARATOR (the namespace separator): \
  • T_CURLY_OPEN: {
    • This only occurs inside of interpolated strings, e.g. "{$foo}" lexes to: '"' T_CURLY_OPEN T_VARIABLE '}' '"'
  • Technically, there is a third, T_BAD_CHARACTER, but it is non-specific. (Edit: no longer true, according to one of the PHP devs.)
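
To see these tokens for yourself, here is a minimal sketch built on token_get_all() and token_name(). The helper name describe_tokens() is made up for illustration, and the type declarations assume PHP 7 or later.

```php
<?php
// Hypothetical helper: list the lexer's tokens for a snippet.
// Named tokens are shown by name, single-character tokens as quoted literals.
function describe_tokens(string $code): array
{
    $out = [];
    foreach (token_get_all($code) as $token) {
        if (is_array($token)) {
            $out[] = token_name($token[0]);   // e.g. T_CURLY_OPEN
        } else {
            $out[] = "'" . $token . "'";      // e.g. '}'
        }
    }
    return $out;
}

// The interpolated string from the bullet above, preceded by an open tag.
echo implode(' ', describe_tokens('<?php "{$foo}";')), "\n";
// T_OPEN_TAG '"' T_CURLY_OPEN T_VARIABLE '}' '"' ';'
```

Note that the opening brace comes back as a named token while the closing brace is a plain character.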
There are two rules in the parser which exist only to throw a special parse error, where leaving the case unspecified would have produced a generic one:
  • using isset() with something other than a variable
  • using __halt_compiler() anywhere other than the global scope (e.g., inside a function, conditional or loop)
(Shameless blog plug on this one) The closing tag ?> is implicitly converted to a semicolon. The opening tag consumes one character of whitespace (or two in the case of Windows newlines) after the literal tag, but is otherwise completely ignored by the parser. Thus, the following code is syntactically correct:
for ( $i = 0 ?><?php $i < 10 ?><?php ++$i ) echo "$i\n" ?>
And it lexes (after the first round transform) to
T_FOR '(' T_VARIABLE '=' T_LNUMBER ';' T_VARIABLE '<' T_LNUMBER ';'
          T_INC T_VARIABLE ')' T_ECHO '"' T_VARIABLE 
              T_ENCAPSED_AND_WHITESPACE '"' ';'
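
A rough way to reproduce that listing from userland is sketched below. token_get_all() reports the tags as T_OPEN_TAG and T_CLOSE_TAG, so the helper lex_after_transform() (a made-up name) applies the transform by hand: it drops open tags and whitespace and rewrites each closing tag to ';'. This imitates the behaviour described above and is not the engine's own code; it assumes PHP 7 or later.

```php
<?php
// Hypothetical helper: imitate the "first round transform" on top of token_get_all().
function lex_after_transform(string $code): array
{
    $out = [];
    foreach (token_get_all($code) as $token) {
        if (!is_array($token)) {
            $out[] = "'" . $token . "'";     // single-character token
            continue;
        }
        switch ($token[0]) {
            case T_OPEN_TAG:
            case T_WHITESPACE:
                break;                        // ignored by the parser
            case T_CLOSE_TAG:
                $out[] = "';'";               // closing tag acts as a semicolon
                break;
            default:
                $out[] = token_name($token[0]);
        }
    }
    return $out;
}

// Single quotes keep "$i\n" literal, so the lexer sees the original source text.
$code = '<?php for ( $i = 0 ?><?php $i < 10 ?><?php ++$i ) echo "$i\n" ?>';
echo implode(' ', lex_after_transform($code)), "\n";
```

The printed sequence should match the token listing above, starting with T_FOR and ending with ';'.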
The next several relate to variable interpolation syntax. For these, it helps to know the difference between a statement (if, for, while, etc.) and an expression (something with a value, like a variable, object lookup, function call, etc.).
  1. If you interpolate an array with a single element lookup and no braces, non-identifier-non-whitespace chars will be parsed as single-character tokens until either a whitespace character or closing bracket is encountered.
    • e.g., "$foo["bar$$foo]" lexes to '"' T_VARIABLE '[' '"' T_STRING '$' '$' T_STRING ']' '"'
  2. In a similar scenario to the above, if you do use a space inside the brackets, you will get an extra, empty T_ENCAPSED_AND_WHITESPACE token.
    • e.g., "$foo[ whatever here" lexes to '"' T_VARIABLE '[' T_ENCAPSED_AND_WHITESPACE T_ENCAPSED_AND_WHITESPACE '"'
  3. In the midst of complex interpolation, if you are in one of the constructs that allows full expressions, you can insert a closing tag, and it will be parsed as such. (PHP considers it the same as a ';', so it is bad syntax at that point, but it is accepted nevertheless.) Furthermore, if you then use an open tag, the lexer will remember that you were in the middle of an expression inside a string interpolation, which seems like a moment of good design and implementation (or something like it).
You can nest heredocs. Seriously. Consider the following:
echo <<<THONE
${<<<THTWO
test
THTWO
}
THONE;
You can nest them as deep as you want, which is a terrible thing to do. What is hilarious is that, while the actual PHP interpreter handles this scenario correctly, the PHP userland tokenizer, token_get_all(), cannot: it parses the remainder of the source after the innermost heredoc as one long interpolated string. (Edit: according to a person on the PHP dev team, this is fixed in 5.5.)
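
If you want to check how the tokenizer on your own PHP version copes with the nested heredoc, one rough approach is to count the heredoc start and end tokens it reports. The sketch below wraps the example above in a nowdoc so that it is handed to token_get_all() as plain text. The variable names are made up for illustration, and the assumption is that a tokenizer which handles the nesting reports two of each token.

```php
<?php
// The nested heredoc example, kept as raw source text inside a nowdoc.
$source = <<<'PHP'
<?php
echo <<<THONE
${<<<THTWO
test
THTWO
}
THONE;
PHP;

$starts = 0;
$ends = 0;
foreach (token_get_all($source) as $token) {
    if (!is_array($token)) {
        continue;                 // single-character tokens are not of interest here
    }
    if ($token[0] === T_START_HEREDOC) {
        $starts++;
    } elseif ($token[0] === T_END_HEREDOC) {
        $ends++;
    }
}

// Two starts and two ends suggest the nesting was tokenized correctly;
// fewer ends would point at the behaviour described above.
printf("heredoc starts: %d, ends: %d\n", $starts, $ends);
```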
I hope these oddities have been as amusing for you to read about here as they have been for me to discover.