Monkey patching in PHP

Tuesday, June 22. 2010

I haven't really had the chance or time to play with PHP 5.3 until recently when Ubuntu 10.04 upgraded my local installations and kind of forced me to dive into it a little. And I'm also probably the last person on the planet to notice, but namespaces in PHP 5.3 allow you to monkey-patch core PHP code.

What's monkey patching?

Monkey patching is a technique to replace functions at runtime. One of the more common applications is stubbing (or mocking) code in unit tests. For example, mocking the response from a server allows you to run a unit test in the absence of the external service. This makes your test suite more robust and possible bugs easier to squash.

Up until 5.3 monkey patching was not available in PHP — unless you used the runkit extension.

Other use cases are changing the behavior of code without forking it and maintaining a local copy, e.g. to add a feature or to apply bug fixes without modifying the original code.

Example

Here's some example code.

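<?php
namespace monkeypatch;

$str = 'your mom';

// Unqualified call: resolves to monkeypatch\strlen() defined below.
echo "This should be eight, but it's not: " . strlen($str) . "\n"; // 6
// Fully qualified call: always the built-in strlen().
echo "Now this should really be eight: " . \strlen($str) . "\n"; // 8

// Shadows the built-in strlen() for unqualified calls in this namespace.
function strlen($str) {
    return 6;
}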

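The difference between strlen() and \strlen() is that the first call uses the function we defined in the current namespace, while the leading backslash always refers to the built-in function in the global namespace. Since the namespace declaration has to be the first statement in a file (only declare is allowed before it), all functions and classes that follow are part of this namespace. If a function (or constant) is not defined in the current namespace, PHP falls back to the global one. There is no fallback to a parent namespace, and class names don't fall back at all.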
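To tie this back to the unit test use case from above, here is a minimal sketch of stubbing a built-in function. The namespace app and the function isExpired() are made up for illustration. The trick only works because isExpired() calls time() without a leading backslash.

```php
<?php
namespace app;

// Code under test: the unqualified time() call is looked up
// as app\time() first and as the built-in \time() second.
function isExpired($expiresAt)
{
    return time() >= $expiresAt;
}

// Test stub: shadows the built-in time() inside this namespace
// and freezes the clock at a known value.
function time()
{
    return 1000;
}

var_dump(isExpired(999));  // bool(true)
var_dump(isExpired(1001)); // bool(false)
```

Code that calls \time() with the backslash can't be stubbed this way, so this is no replacement for runkit in every situation.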

Other applications that come to mind would be fixing the parameter order in strstr() and in_array(), and similar! But of course I'm kidding and wouldn't suggest that really. :-)

Fin

That's all kids!

My Berlin Buzzwords 2010 recap

Wednesday, June 9. 2010

I attended Berlin Buzzwords 2010 for the last two days and aside from meeting a bunch of great people during talks, here are some take away notes from this conference:

  • I got introduced to new stuff — such as HyperTable (c++ bigtable implementation), which I had never heard of before.

  • I actually know a lot more about Hadoop, HDFS and Tika now than I did before, though I won't be able to use a lot of it soon. The HDFS talk in particular was interesting as it got rid of the bells and whistles (OMG distributed file system and replicated!!!) for me. On Hadoop: it was easy to feel a little overwhelmed.

  • No MongoDB for me.

  • Hilarious: "Localhost is local most." (by Mario Scheliga)

  • (On HDFS' issues with the NameNode:) "Highly available vs. pretty highly available."

  • A lot of people talked about scaling (in and outside of talks) without a) having any first-hand experience and/or b) a need for it. That was probably the buzzwordy part about this conference.

  • I did not learn as much about Lucene as I wanted or had planned. Primarily because the nature of the talks was a little too advanced for me. A basic introduction to Lucene/Solr's architecture and ways to scale out is still on my wish list.

  • I noticed that contributors to Apache projects like to discuss Jira issues in their talks.

  • Twitter is using Lucene/Java to scale out its (near real-time) search, but sticks to trivial types (instead of objects) to (re)gain performance.

  • Riak seems pretty cool: consistent hashing, auto-balancing, sharding — must investigate more. Also, Rusty Klophaus is a cool guy and I learned that Basho is not just a software company, but they also have a band. And riak is Indonesian and stands for something like how the water flows.

  • Cassandra looks interesting as well. Considering it's written in Java and not in Erlang, a lot of people seem to like it anyway. Also, Eric Evans is a great presenter, kudos to him. I especially liked the part where he suggested not using Cassandra for obvious reasons, but the inner geek disagreed.

  • I don't know why presentations by Nokia are like that. I'm missing a little enthusiasm about the work or project.

  • Bashing other projects sucks. Also, introducing yourself with, "We are like X but better.", makes you look shady as well.

  • Benchmarks on slides really suck. And if people still can't resist, they should have a better explanation for them.

  • Berlin Buzzwords really had a great venue.

  • Thanks mucho to the organizers — Isabel, Simon, Jan & newthinking — for an interesting conference.

For more details, head over to Rusty Klophaus:


Shopping for a CDN

Saturday, June 5. 2010

In this blog post I'll compare different CDNs with each other, on the list are:

  • Akamai (through MySpace)
  • CacheFly
  • CloudFront
  • EdgeCast (twice, through Speedyrails)
  • LimeLight Networks (through mydeo)
  • … and Amazon S3 — the pseudo CDN

Thanks to SpeedyRails, EasyBib (CacheFly, Cloudfront, S3) and mydeo for helping with these tests.

What's a CDN?

A CDN (Content Delivery Network) is a service usually offered by Tier1's or at least companies that have a so-called global network footprint.

A CDN lets you distribute your assets/content on an array of servers and the nifty technology behind it makes sure that a customer is always transparently routed to a server closer to them, thus making it faster for the client to fetch the assets.

Content, or assets, can be anything such as images, css, JavaScript or media (audio, video). My numbers focus on assets primarily, I haven't run any tests with larger media files.

An example of CDN usage: let's say I go to myspace.com. All the required assets are distributed using a CDN run by Akamai. When I browse MySpace, the JavaScript files are pulled from a server located in Frankfurt, whereas when I browse MySpace from the U.K., the files are pulled from a server in the U.K.

All of this is — as I said — transparent, which means that I don't really notice a difference when I go to the website. It should be faster though.

Performance

I'll skip over why it makes sense to use a CDN from a pure performance point of view. A much better blog article is available at the Yahoo! developer blog

When is a CDN necessary?

I wouldn't recommend getting a CDN for a blog — unless you're TechCrunch and live off of it. In my opinion this is a gray area. If you make money and your traffic is not just local (to the location of your server), consider a CDN, it's more affordable than you think.

On monitoring

Pingdom is a nifty distributed monitoring service.

What Pingdom does is the following: it allows you to set up checks (literally within minutes) and then runs the monitoring from different locations worldwide.

The advantage of multiple locations is that you know, for example, whether your website is unavailable for everyone or whether it's a local issue of a backbone provider, etc. Beyond general availability, Pingdom also gathers data on response times (average, fastest and slowest) and lets you filter on all of the above.

The current locations from which your website is monitored include Amsterdam (Netherlands), Atlanta, GA (U.S.), Chicago, IL (U.S.), Copenhagen (Denmark), Dallas, TX (U.S.), Frankfurt (Germany), Herndon, VA (U.S.), Houston, TX (U.S.), London (U.K.), Los Angeles, CA (U.S.), Montreal (Canada), New York, NY (U.S.), Stockholm (Sweden) and Tampa, FL (U.S.). In some locations, Pingdom employs multiple monitors.

The only downside I can see is that Pingdom has no footprint anywhere in Asia, South America or Africa. So in case your target demographic is in any of those places, I'd advise you to gather your own numbers.

Well, gathering your own research data might be a good idea regardless.
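If you want to gather your own numbers, a minimal sketch could look like the following. The function names and sample values are made up, and this is not how Pingdom measures anything. It only times a callback and boils a list of timings down to average, slowest and fastest.

```php
<?php
// Sketch only: function names and sample values are hypothetical.

// Time a single call of $fetch, in milliseconds.
function timeFetchMs($fetch)
{
    $start = microtime(true);
    call_user_func($fetch);
    return (microtime(true) - $start) * 1000;
}

// Reduce a list of timings to average, slowest and fastest.
function summarize(array $timingsMs)
{
    return array(
        'average' => array_sum($timingsMs) / count($timingsMs),
        'slowest' => max($timingsMs),
        'fastest' => min($timingsMs),
    );
}

// Hypothetical URL; swap in the asset you host on your CDN:
// $ms = timeFetchMs(function () {
//     file_get_contents('http://cdn.example.com/jquery.min.js');
// });

print_r(summarize(array(10.0, 20.0, 30.0)));
```

Run it from cron on a few machines in different regions and you get a rough idea, though a single location will never tell you much about a distributed network.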

Numbers

I used a minified jQuery library to compare the results of the various CDN vendors.

Amazon S3

Why do I consider S3 to be a pseudo CDN? Well, for starters, Amazon S3 is not distributed.

By nature, it shouldn't be used as a CDN. The problem is though that many people still do. Take a look at Twitter and think twice why a page takes so long to load (and the avatars are always last). There's your answer.

In order to be fair — Twitter also sometimes switches to Cloudfront (216.137.61.222) (or Akamai (213.248.124.139)?). I haven't really figured out why they don't stick to a real CDN period.

Besides, I think using Cloudfront is still not the best choice, thinking about it, they should of course use Joe Stump's project tweetimag.es (which uses EdgeCast).

Stats porn

Spoiler: 100% uptime on all of them! ;-)

But on to the stats!

Akamai

akamai-7day

  • provider: Akamai
  • 7 day period
  • average response time: 65 ms
  • slowest average response time: 289 ms
  • fastest average response time: 19 ms

Akamai is probably the most well-known CDN. The clear advantage of Akamai over others — they are everywhere. And they charge an arm and a leg for it too. ;-) (No offense meant!)

CacheFly

cachefly-7days

  • provider: CacheFly
  • 7 day period
  • average response time: 132 ms
  • slowest average response time: 1,506 ms
  • fastest average response time: 69 ms

CacheFly is another one of the older CDN providers (~11 years). Pretty nice support and lots of custom options available when you email them. On their todo is a transparent proxy (WANT).

CacheFly has never failed me in over four years.

Cloudfront

cloudfront-7day

  • provider: Amazon Cloudfront
  • 7 day period
  • average response time: 276 ms
  • slowest average response time: 1,983 ms
  • fastest average response time: 171 ms

Cloudfront is Amazon's idea of a CDN. It integrates well with Amazon S3. There's no transparent proxy option and it's not as distributed. And remember, it's all eventually consistent.

EdgeCast

EdgeCast offers two options: small and large files. Small files are a little more expensive, but it's generally suggested that they work just as well as large files. The small files option distributes your assets on SSD (Solid State Disk!).

The suggested use case is that large is for video and audio assets.

Regardless of the options, check the graphs and the numbers for some serious head scratching.

Large

edgecast-big-7days

  • provider: EdgeCast (big files)
  • 7 day period
  • average response time: 77 ms
  • slowest average response time: 987 ms
  • fastest average response time: 22 ms

Small

edgecast-small

  • provider: EdgeCast (small files)
  • 7 day period
  • average response time: 91 ms
  • slowest average response time: 1,627 ms
  • fastest average response time: 28 ms

Limelight

limelight-7days

  • provider: Limelight through MyDeo
  • 7 day period
  • average response time: 216 ms
  • slowest average response time: 1,668 ms
  • fastest average response time: 28 ms

And why is Limelight so slow? I don't think I can blame it entirely on Limelight. In contrast to other resellers, such as Speedyrails (which resells EdgeCast), MyDeo gives you a url with mydeo.com. And this domain uses Godaddy's rather crappy DNS service so I'm guessing that part of the poor performance is due to them.

Amazon S3

amazon-s3-7days

ROFLMAO LOL!!!111one

  • provider: Amazon S3
  • 7 day period
  • average response time: 534 ms
  • slowest average response time: 2,323 ms
  • fastest average response time: 331 ms

Quo vadis CDN?

My first advice to all resellers would be to get Pingdom and constantly run monitoring to make sure the system behaves as expected. Or as the product description suggests. :-)

On Pingdom itself — of course there may be issues as well (not that I noticed). But I don't think these are a factor here. I've been running these tests for almost two months now and a different 7 day time frame didn't look too different. No one performed much better or far worse.

Here are the numbers again, side by side:

Provider           Average (ms)   Slowest average (ms)   Fastest average (ms)
Akamai             65             289                    19
CacheFly           132            1,506                  69
Cloudfront         275            1,983                  171
EdgeCast (large)   77             987                    22
EdgeCast (small)   91             1,627                  28
Limelight          216            1,668                  28
Amazon S3          534            2,323                  331

Comment

Akamai is almost in a league of its own. Of all contenders they offer the best CDN hands down. If anyone reselling Akamai at a reasonable price reads this, feel free to leave a comment or email me. Of course I'd be interested.

Still, it's a little surprising that Akamai is not further ahead of EdgeCast.

Cloudfront versus others — from personal testing and also doing the math on S3 (storage, PUT, GET) with the addition of Cloudfront on top of it, I have to say that this is a pretty expensive service and probably only useful in terms of unified billing (one provider to rule them all). If this is not an issue, I suggest you find another.

CacheFly has great support, but lacks features and it's also pretty expensive compared to others.

EdgeCast vs. EdgeCast: I have to contact Speedyrails to find out if they gave me the wrong URLs or why the more expensive option did worse in these tests. That'll be interesting to figure out. Regardless of this bit, the performance is pretty stellar and the closest to Akamai.

I'll revisit Limelight and mydeo later again.

Fin

It's pretty obvious for us that we are switching from CacheFly to another CDN over the summer.

And not just because of the general performance but also because for example EdgeCast (through SpeedyRails) seems to be a lot more cost effective while offering more features and of course the much better performance at the same time.

In case there are questions, I can extract more numbers.


PHP, APC and sessions

Wednesday, May 26. 2010

Playing with redis/Rediska and sessions, I wanted to get more numbers to compare this solution to a traditional MySQL-based approach which also made me revisit the idea of a CouchDB-based session handler for Zend_Session.

Implementing this handler, I ran into a weird issue:

Fatal error: Undefined class constant 'ALLOW_ALL' in /usr/home/till/foo/trunk/library/Zend/Uri/Http.php on line 447
Call Stack
#   Time    Memory  Function    Location
1   0.7357  3914816 Foo_Session_SaveHandler_Couchdb->write( )   ../Couchdb.php:0
2   0.7358  3916600 Foo_Couchdb->query( )   ../Couchdb.php:94
3   0.7361  3969464 Zend_Http_Client->__construct( )    ../Couchdb.php:368
4   0.7361  3969544 Zend_Http_Client->setUri( ) ../Client.php:250
5   0.7362  3976568 Zend_Uri::factory( )    ../Client.php:267
6   0.7365  4003352 Zend_Uri_Http->__construct( )   ../Uri.php:130
7   0.7367  4006216 Zend_Uri_Http->valid( ) ../Http.php:154
8   0.7368  4006216 Zend_Uri_Http->validateHost( )  ../Http.php:281

The funny thing is that APC was added to the game at the same time (for apc_store() and apc_fetch(), to cache the configuration), and when I disabled it, the error disappeared.

Talking to Gopal, one of the leads of APC (btw, cheers for helping!), on IRC (#pecl@efnet), I thought at first that the issue was autoload related, and we thought the order in which the extensions are loaded might make a difference. From Rasmus' comment, I later discovered bug #16745 with a proposed workaround to use session_write_close().

On a sidenote: I'm still not sure why the error is expected behavior for some people but yet it works with some PHP and APC versions and breaks with others. From what I gathered it broke for me with 5.2.6, 5.2.11 and 5.3.2. Tried all with the latest version of APC (3.1.3p1).

Here's how I fixed it for myself

I have a Lagged_Application class to bootstrap my application. Lagged_Application is kind of like Zend_Application sans a lot of safety nets and magic. Since it does a lot less, it's also quite a bit faster. To get an idea, check out my Google Code repository (for an alas rather outdated version of it).

I added the following function to it:

<?php
class Lagged_Application
{
    // (...)

    // Write and close the session while all objects are still alive.
    public function shutdown()
    {
        session_write_close();
    }
}

My index.php looks like the following:

<?php
include 'library/Lagged/Application.php';
$app = new Lagged_Application;
$app->setEnvironment('production');
$app->bootstrap();

register_shutdown_function(array($app, 'shutdown'));

Somewhat related — shutdown() could be a good start to tear down other objects as well, when needed.
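To illustrate the tear-down idea, here's a rough sketch. Sketch_Application and onShutdown() are hypothetical names and not part of the actual Lagged_Application. The only point is the order: write the session first, then run whatever else needs to be cleaned up.

```php
<?php
// Hypothetical sketch; this is not the actual Lagged_Application code.
class Sketch_Application
{
    protected $teardown = array();

    // Register a callback to run at shutdown, after the session is written.
    public function onShutdown($callback)
    {
        $this->teardown[] = $callback;
    }

    public function shutdown()
    {
        // Write the session first, while every object is still intact.
        session_write_close();

        // Then run the registered callbacks in the order they were added.
        foreach ($this->teardown as $callback) {
            call_user_func($callback);
        }
    }
}

$app = new Sketch_Application;
$app->onShutdown(function () {
    echo "closing connections\n";
});
register_shutdown_function(array($app, 'shutdown'));
```

Keeping the callbacks in a plain array means they run in the order they were registered, which is usually what you want when one resource depends on another.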

More?

Now that this issue is fixed, I think also the infamous Fatal error: Exception thrown without a stack frame in Unknown on line 0 originates from the same issue. That is, when sessions and APC are around — but I should dig a little deeper to verify this.

All in all, it's a pretty weird issue and IM(very)HO, objects shouldn't be torn down before the session is written, or some sort of before hook should be executed to avoid this. But that's especially easy to say if you don't do C. :-)

Fin

That's all. I sure hope this saves someone else some time.


Foursquare: How private is private?

Tuesday, May 25. 2010

Location is one of my hobbies. Even though I don't map items for openstreetmap and the like, I still try out at least every location-related startup there is.

Foursquare, as you probably know, is a location-based game: you get points and badges for checking into locations. The points are aggregated into a weekly leaderboard (of penis envy) and everyone gets a fresh start every Monday morning.

Check-in

Foursquare has different check-in modes. One is the regular, where your location gets published to your friends (and also Twitter/Facebook if those are linked up) and the other is called "off the grid" — supposedly not even your friends know where you're at.

Think of a possible scenario — cheating on your diet? You can still check into McDonald's and get the points but your boyfriend wouldn't know you did it.

Downsides

Is off the grid really off the grid? Far from it.

If you play Foursquare on a more national or global scale (e.g. between cities) even though you check in off the grid, your general location is updated on your profile.

So let's say I went from Berlin to Munich and didn't want anyone else to know. I still check in off the grid at the airport in Munich (to get a stupid badge or whatever) and my Foursquare profile would not show where exactly I checked in (e.g. airport), but it would say "Till (Munich)".

From what I noticed the other week, if you checked into a venue and did off the grid, it would still show your icon on the venue's page on Foursquare, which doesn't really sound like advertised either.

So how is that check-in actually private? Well, not at all.

Location without a check-in

But wait, it gets even better!

I noticed that Foursquare's Android and Blackberry applications update your location without checking-in. From what I gathered, it's plenty to look at your friend list and sure as hell enough to scan for places around you (to get caught ;-)).

Friend list

The friend list always shows people from the city you're in. So as soon as you open the application, it displays those. Whenever you leave the city, it'll say something like, "Friends in other cities", and "Voilà!", your profile got updated.

All of this is powered by the GPS tracking in your nifty phone. Pretty cool, eh?

Scanning places

Last weekend I went to Chemnitz to attend a family thing. Even though I didn't check in anywhere, I briefly scanned to see who and what was around me. Still, my profile got updated.

Foursquare: my history

The above shows two check-ins, Jet is in Berlin (gas station) and Aral is near Dresden (another gas station — but don't worry, I just bought a magazine and took my dog for a walk — didn't need to fuel up).

And here's a shot of my profile, which clearly states I'm in Chemnitz:

Foursquare: my profile

I guess I should have expected it, but I'm still not sure if I like it. The upside is, it's pretty accurate too (note, sarcasm)!

And in case anyone has doubts: I'm sure someone with elite jedi powers from Foursquare can verify that I didn't cheat.

Share

I think the biggest misconception here is that I expected Foursquare to share my location only when I do something using the application or the website. I'd really like it to be a more active thing when my profile is populated with data. On the other hand, that's probably inconvenient as hell for … Foursquare?

Furthermore

I haven't really checked into this any further, but does anyone know if the Foursquare applications use background data?

I would like to know how much of my location is shared on a regular basis and also how granular the gathered data is, e.g. Foursquare only updates the city/country on your profile, but do they really keep latitude and longitude?

Fin

Not the usual programming bs. :-) And that's all for today.

If in doubt about your data, you should disable location based services.

The very least you can do is to learn enough about them in order to understand (and comprehend) what's happening with your data.
