Tracking down random nulls from PHP Couchbase extension
Migrating primary data stores with no downtime, without losing data can be tricky and a lot could go wrong. One of the games at Zynga, recently completed such a project to migrate our primary data store to Couchbase from Membase, an older version of Couchbase server which is Memcached with disk persistence and a simple cluster management interface. Due to our PHP version, we were using the Couchbase native PHP library 2.0.5. To guarantee data integrity during the migration, we compared data that was written to each data source. While read comparing the results, we noticed get or getFromReplica operations on integer keys returned a null randomly while the Membase keys returned an integer. The good thing was that it was easily reproducible, about 1 in 4 times.
The Couchbase C library received raw bytes over the network, which were then passed over to the Couchbase PHP SDK, which then converted the bytes into PHP variables to be consumed by a default transcoder (PHP) that processed it and returned the value to the application.
Couchbase -> Couchbase C library -> Couchbase PHP SDK -> default transcoder -> application
After some logging in the library, the problem surfaced when the default transcoder tried to parse the bytes it received from Couchbase over the network. The value printed fine in PHP, but, for some reason the PHP is_numeric check failed on it. The json_decode also returned a null, which was odd. We looked at the implementation of the default transcoder and it did turn out that it does a json_decode on the value. So here were our random nulls. One option was to not json_decode, but, that did not seem right. If it fails randomly that means something else might be wrong deep down and we could get another random nulls or bad values in other cases which we might not have detected yet.
More logging in the Couchbase library C code surfaced that the zval_value of the PHP variable had a string that was not null terminated. But, it also had a length variable with it in the same data structure, indicating that it might not need to null terminate if the structure is keeping track of variable length. But, it all depends on the implementation of the function using that structure on how it honors the length.
typedef union _zvalue_value { long lval; /* long value */ double dval; /* double value */ struct { char *val; int len; } str; HashTable *ht; /* hash table value */ zend_object_value obj; } zvalue_value;
At this point, we started looking into PHP source code and it turned out json_decode and is_numeric, both used an internal function is_numeric_string to detect/parse digits. Now, we were getting closer.
is_numeric_string started parsing the string from the pointer to the beginning of the string in the zval_value of the PHP variable and walked over it till it could find a non-digit or it hit the max length of long, and then compared the lengths of the string that was passed into the number of digits the loop found. If that matched, it returned true, else false. The problem here was if the memory right next to the last digit had an int on it, this would fail. In the real world, this could practically happen all the time. For instance: a string ‘34543’, that looked like following in memory: ‘34543098hbjhgtf’ would fail the test even if the PHP variable had the len property of zval_value set correctly. Moreover, going outside the length of the variable, the program was accessing memory that was not allocated to it, which could cause all kinds of bugs.
Couple of code snippets from is_numeric_string:
Github: https://github.com/php/php-src/blob/babeca356b657f2fe14a8be21d31eadadc5a971a/Zend/zend_operators.h#L102:L224
for (type = IS_LONG; !(digits >= MAX_LENGTH_OF_LONG && (dval || allow_errors == 1)); digits+, ptr+) {
if (ptr != str + length) { if (!allow_errors) { return 0; } if (allow_errors == -1) { zend_error(E_NOTICE, "A non well formed numeric value encountered"); } }
The solution was pretty simple for us, we passed in a flag to duplicate the bytes and it started copying the bytes to another location and adding a null termination. This should work for most cases but this does introduce an overhead of allocating additional memory and copying bytes. In our case, this overhead was not noticeable as our reads were small and we were not limited by memory because of our application load being more CPU intensive. But, if you are running low on free RAM or doing large reads, I would recommend keeping an eye on RAM usage and potential increase in response times due to copying of bytes.
This issue still exists with the new function _is_numeric_string_ex ( https://github.com/php/php-src/blob/master/Zend/zend_operators.c#L2997 ) So, there is a possibility of this bug surfacing in other forms at times with anything that uses this function internally. My recommendation will be to print the PHP variable and print the output of is_numeric for that variable, if you see a numeric string returning false, you might be running into this problem. At this point, start looking at the code or extension that is creating the variable.