Parallel Processing / Task Distribution with PHP

Last evening at Vancouver’s PHP meetup group I did a presentation on the topic of parallel processing with PHP. I received some good feedback on the talk so I’d like to share it with those of you who couldn’t attend the meetup.

First of all, quite a few people asked me what tools Saem and I used to make those cool slides. We used the open source reveal.js to build HTML5-based slides and served them with node.js. If you want to control the slides remotely from your smartphone, socket.io is the node.js module you will need.

So, what is ‘Parallel Processing’? In the context of computer science, parallel processing means “The simultaneous use of more than one CPU or processor core to execute a program or multiple computational threads” – Wikipedia. There are plenty of programming languages such as C, Java and Erlang that support multi-threading natively. Unfortunately, PHP is not one of them.

Before we start blaming PHP for its lack of multi-threading support, let’s not forget that PHP is a high-level scripting language whose original purpose was to serve stateless HTTP requests. It’s supposed to be one and done, and it runs under Apache, which already handles requests concurrently. However, as the language became more popular and capable, developers started using PHP to build complex business logic rather than just render HTML pages. It’s a natural evolution and there is nothing wrong with that, but it means more advanced capabilities are demanded of PHP, and one of them is parallel processing.

One use case at HootSuite is sending tens of thousands of scheduled social network messages at a specific time, for example at 8:00 AM. Loading all the messages scheduled for 8:00 AM and sending them out one by one would take hours, which is definitely not going to cut it for users who expect a delay of no more than a minute. So we need some form of parallel processing to send them all out simultaneously.

Doing this in PHP pretty much comes down to three approaches:

Multi-Thread
Multi-Process
Distributed

So let’s take a look at each method, its options and their pros and cons.

Multi-Thread

As I mentioned earlier, PHP does not support multithreading natively. But there is one (experimental) PHP extension called pthreads that allows you to do just that. I didn’t spend much time playing with pthreads because personally I think doing multithreading in PHP is just wrong. You can write beautiful thread-safe code, but the moment you start using other PHP extensions you have to think about whether they are thread safe, and unfortunately a lot of the popular ones aren’t!

(Update: Joe Watkins, the author of pthreads pointed out that pthreads is safe to use and my assumption is wrong. As I mentioned I didn’t play around with pthreads much and will likely stay away from multithreading in PHP, but if you are interested in trying it out, read Joe’s pthreads description here)

Multi-Process

While I consider multithreading out of the question, you can achieve parallel processing by running multiple PHP processes. I’ll show you three ways to do that:

Fork
Execute Command
Piping

Fork

PHP comes with a few process control functions that allow you to fork child processes using the underlying `fork()` system call. When `pcntl_fork` is called, a child process is created as a copy of the parent process. That means the child starts with copies of the parent’s variables and inherits its open resources, such as a MySQL connection or a file handle. You can fork multiple child processes and let them handle tasks in parallel. However, this module is not available under Apache’s mod_php, so you can only use it in CLI or FastCGI mode. Besides, having child processes share open resources with the parent is dangerous: for example, a child can close a MySQL connection without the parent knowing.

Here is a simple code example that illustrates process forking:
$workload = "some work load";
$processId = pcntl_fork();
if ($processId < 0) {
  die('Fork failed!');
} else if ($processId == 0) {
  // child starts working here
  trim($workload);
  exit(0); // child is done, don't fall through into parent code
} else {
  // parent waits for child
  pcntl_wait($status);
}
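To spread work over several children, the same pattern can be put in a loop. The following is a minimal sketch, assuming the pcntl extension is loaded and the script runs from the CLI; the message list and the `usleep()` call are hypothetical stand-ins for real work.

```php
<?php
// Stand-in workload, split into one chunk per child process.
$messages = ['msg-1', 'msg-2', 'msg-3', 'msg-4', 'msg-5', 'msg-6'];
$chunks = array_chunk($messages, 2);

$children = [];
foreach ($chunks as $index => $chunk) {
    $pid = pcntl_fork();
    if ($pid < 0) {
        fwrite(STDERR, "Fork failed for chunk $index\n");
        continue;
    }
    if ($pid === 0) {
        // Child: works on its own copy of $chunk; any changes it makes
        // to variables are invisible to the parent.
        foreach ($chunk as $message) {
            usleep(100000); // pretend to send the message
        }
        exit(0); // never fall through into the parent's loop
    }
    // Parent: remember which child got which chunk.
    $children[$pid] = $index;
}

// Parent: reap every child and record its exit code per chunk.
$exitCodes = [];
while ($children) {
    $pid = pcntl_wait($status);
    if ($pid <= 0) {
        break; // no more children, or an error
    }
    $exitCodes[$children[$pid]] = pcntl_wexitstatus($status);
    unset($children[$pid]);
}
ksort($exitCodes);
print_r($exitCodes);
```

The three chunks are processed at the same time, so the whole run takes about as long as one chunk. Each child exits explicitly, otherwise it would carry on with the parent's loop and fork children of its own.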

Execute External Command

Another way to create multiple processes is through `exec`, `shell_exec` or `passthru`, which run an external command, such as a PHP command, and create completely new and separate processes. The major differences among these three functions are:

`exec` returns the last line of the output from the new process
`shell_exec` returns all output from the new process as a string
`passthru` passes the raw output of the new process, such as binary data, straight through to the caller’s output instead of returning it

One thing to be aware of is that executing an external command is not a cheap operation, especially when you need to execute many in a loop. When you have hundreds of processes running on a server, the system will be busy dealing with context switches, which degrades overall performance.
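The difference between these functions is easiest to see side by side. This is a small sketch, assuming a Unix-like shell and the PHP CLI; the child command is a made-up one-liner that prints two lines.

```php
<?php
// Child command: the current PHP binary running a one-line script.
$command = escapeshellarg(PHP_BINARY) . ' -r ' . escapeshellarg('echo "one\ntwo\n";');

// exec() returns only the last line; all lines land in $allLines.
$lastLine = exec($command, $allLines, $exitCode);

// shell_exec() returns the complete output as one string.
$everything = shell_exec($command);

echo "exec() returned: $lastLine\n";
echo "exec() collected: " . implode(', ', $allLines) . "\n";
echo "shell_exec() returned: " . json_encode($everything) . "\n";

// passthru() returns nothing useful; it writes the child's output directly.
echo "passthru() prints:\n";
passthru($command, $passthruExitCode);
```

All three calls wait for the command to finish. To actually run commands side by side on a Unix-like system, a common approach is to redirect the output and background the command, for example by appending `> /dev/null 2>&1 &` to the command string.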

Piping

So far we’ve looked at forking and executing external commands to create processes. If you haven’t noticed yet, one problem common to both, as shown here, is that the processes can’t communicate with each other. If you want to coordinate processes while they are running, you are out of luck.

Next I’ll show you a lesser-known way to deal with multiple processes, called piping. It lets two processes send instructions to each other through STDIN and STDOUT, much like the Linux pipe (|) operator. The key PHP function is `proc_open`, which creates a new process and returns its handle. It takes a bit of effort to make it work, but it can be very powerful. Here is a code example to illustrate how piping works. The client sends ‘hello’ to the worker, the worker appends ‘world’, and sends the whole string back to the client.

// pipe_client.php: uses proc_open() to create the worker process
// and sends instructions to it
$descriptorspec = array(
  0 => array("pipe", "r"),  // stdin for worker
  1 => array("pipe", "w"),  // stdout for worker
);
$worker = proc_open("php pipe_worker.php", $descriptorspec, $pipes);
if ($worker) {
  fwrite($pipes[0], "hello");
  fclose($pipes[0]);  // tell the worker there is no more input
  while (($line = fgets($pipes[1])) !== false) {
    echo $line . "\n";
  }
  fclose($pipes[1]);
  proc_close($worker);
}

// pipe_worker.php: all it does is read instructions from STDIN
// and write the response to STDOUT
$line = fread(STDIN, 4096);
fwrite(STDOUT, "$line world");
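The same idea extends to several workers at once. Below is a rough sketch, assuming a Unix-like shell and the PHP CLI, that starts three workers and uses `stream_select` to read whichever reply is ready first. The inline worker code and the three task strings are hypothetical stand-ins for a real worker script and real tasks.

```php
<?php
// Hypothetical inline worker: read one line, upper-case it, append " world".
$workerCode = 'echo strtoupper(trim(fgets(STDIN))) . " world";';
$command = escapeshellarg(PHP_BINARY) . ' -r ' . escapeshellarg($workerCode);
$spec = [
    0 => ['pipe', 'r'], // worker's stdin
    1 => ['pipe', 'w'], // worker's stdout
];

$procs = [];
$outputs = [];
foreach (['hello', 'bonjour', 'hola'] as $id => $task) {
    $proc = proc_open($command, $spec, $pipes);
    if (!is_resource($proc)) {
        fwrite(STDERR, "Could not start worker $id\n");
        continue;
    }
    fwrite($pipes[0], $task . "\n");
    fclose($pipes[0]); // the worker sees EOF after its one task
    $procs[$id] = $proc;
    $outputs[$id] = $pipes[1];
}

$results = array_fill_keys(array_keys($outputs), '');
while ($outputs) {
    $read = array_values($outputs);
    $write = $except = null;
    // Wait up to 5 seconds for any worker to produce output or exit.
    if (!stream_select($read, $write, $except, 5)) {
        break; // error or timeout: give up in this sketch
    }
    foreach ($read as $stream) {
        $id = array_search($stream, $outputs, true);
        $chunk = fread($stream, 8192);
        if ($chunk !== false) {
            $results[$id] .= $chunk;
        }
        if (feof($stream)) {
            fclose($stream);
            unset($outputs[$id]);
        }
    }
}

foreach ($procs as $proc) {
    proc_close($proc);
}
print_r($results);
```

Closing each worker's stdin right after writing keeps the sketch simple. A longer-lived worker would keep the pipe open and read tasks in a loop.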

Distributed Parallel Processing

So far we have looked at threads, forking child processes, creating new processes, and even letting processes communicate with each other. However, these methods only apply when you are dealing with a single server. You can only scale so far by adding CPUs and cores to one server. At some point you’ll need to go beyond one server, and for that you will need distributed parallel processing.

In a distributed world, some degree of network-level programming is inevitable. Fortunately there are quite a few open source libraries and frameworks available, so we don’t have to write socket-level programs in PHP. In this post I’ll show you how Gearman and ZeroMQ fit into the picture.

Socket Programming

But before I do that, why don’t we start with some fun stuff first! Let’s write a socket program that lets two processes communicate with each other over TCP. Yes, PHP does provide a few functions that allow you to do socket programming. Again, we have a client script that sends ‘Hello’, and a worker script that takes ‘Hello’, appends ‘World’, and sends ‘Hello World’ back to the client.

// socket_client.php
// create a socket, connect to worker using TCP on port 11111
$socket = socket_create(AF_INET, SOCK_STREAM, 0);
$result = socket_connect($socket, '127.0.0.1', 11111);
// send Hello to worker
socket_write($socket, "Hello", 5);
// wait for response from worker
$result = socket_read($socket, 1024);
echo "$result\n";
socket_close($socket);

// socket_worker.php
// create TCP socket, bind to 11111
$socket = socket_create(AF_INET, SOCK_STREAM, 0); // IPv4, TCP
socket_bind($socket, '127.0.0.1', 11111);
// listen to socket
socket_listen($socket);
while (true) {
  // accept incoming TCP connection
  $ret = socket_accept($socket);
  // read the stream
  $input = socket_read($ret, 1024);
  $output = "$input World";
  // send result back to client
  socket_write($ret, $output, strlen($output));
  socket_close($ret);
}

Socket programming isn’t that hard, right? Well, don’t be fooled by the simplicity of the code above, because it is nowhere close to production ready! Think about what will happen when the worker is not up. What will happen when the worker quits in the middle of execution? What will happen when both client and worker are healthy but there is a network partition? There are a gazillion situations like this that you’ll have to handle when writing socket code, and this is exactly why you should really appreciate open source tools such as Gearman and ZeroMQ.
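To give a feel for how much code even the first of those questions costs, here is a sketch of the same client with basic failure handling. It assumes the sockets extension is loaded; the `sendHello` function name and the 2 second timeout are illustrative choices of mine, and a production client would need far more than this (retries, partial reads, reconnects).

```php
<?php
// Sends "Hello" to a worker and returns its reply, or null on any failure.
function sendHello(string $host, int $port, int $timeoutSeconds = 2): ?string
{
    $socket = socket_create(AF_INET, SOCK_STREAM, SOL_TCP);
    if ($socket === false) {
        return null;
    }

    // Don't wait forever for a worker that has hung or vanished.
    $timeout = ['sec' => $timeoutSeconds, 'usec' => 0];
    socket_set_option($socket, SOL_SOCKET, SO_RCVTIMEO, $timeout);
    socket_set_option($socket, SOL_SOCKET, SO_SNDTIMEO, $timeout);

    // Worker not up: connect fails instead of crashing the script.
    if (@socket_connect($socket, $host, $port) === false) {
        $reason = socket_strerror(socket_last_error($socket));
        fwrite(STDERR, "Connect failed: $reason\n");
        socket_close($socket);
        return null;
    }

    if (@socket_write($socket, 'Hello', 5) === false) {
        socket_close($socket);
        return null;
    }

    // Worker quit mid-request or timed out: read yields false or ''.
    $reply = @socket_read($socket, 1024);
    socket_close($socket);

    return ($reply === false || $reply === '') ? null : $reply;
}

$reply = sendHello('127.0.0.1', 11111);
echo $reply === null ? "Worker unavailable\n" : "$reply\n";
```

Even this version only reports the failure. Deciding whether to retry, queue the message, or alert someone is still left to the caller.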

Gearman

Gearman was developed and open sourced by Danga, who also created the most popular distributed caching solution, Memcached. A few things about Gearman before I show some code examples:

A Job Distribution Framework
High Performance
Language Agnostic (lots of bindings)
Download Gearman server at http://gearman.org
Download the PHP extension at http://pecl.php.net/package/gearman

We have used Gearman for a few years at HootSuite and overall we were happy with it, although we are not particularly in love with a centralized broker system for distributed processing. That’s why we started looking into ZeroMQ a while ago and began the transition from Gearman to ZeroMQ. Let’s have a look at a simple Gearman code example before we get to ZeroMQ.

Please note that to run the following code you need the Gearman daemon (gearmand) installed and running on localhost:4730 (its default port), and the Gearman PHP extension installed. The example is the same Hello World example as above.

// gearman_client.php
$gmclient = new GearmanClient();
$gmclient->addServer(); // localhost
// doNormal() blocks and waits for the worker's response (it was
// called do() before pecl/gearman 1.0.0); to run tasks in
// non-blocking mode, use doBackground()
// Also note 'getWorld' is the job name, and 'Hello ' is the job payload
$result = $gmclient->doNormal("getWorld", "Hello ");
echo "$result\n";

// gearman_worker.php
$gmworker = new GearmanWorker();
$gmworker->addServer(); // localhost
// register the handler 'getWorldFn', defined below, for the job 'getWorld'
$gmworker->addFunction("getWorld", "getWorldFn");
// loop here waiting for jobs, and let the handler deal with them
while ($gmworker->work()) {}

function getWorldFn($job) {
  return $job->workload() . "World!";
}
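The blocking call handles one job at a time. To hand several jobs to the workers at once, the client's task API can be used. The following is a sketch only, assuming the gearman extension, a reachable gearmand, and one or more copies of the worker above running; the `getWorlds` function name and the `task-N` ids are hypothetical.

```php
<?php
// Submits one 'getWorld' job per payload and collects the replies.
function getWorlds(array $payloads): array
{
    $client = new GearmanClient();
    $client->addServer(); // no arguments: the extension's default server

    $results = [];
    // Called once for every task that a worker completes.
    $client->setCompleteCallback(function (GearmanTask $task) use (&$results) {
        $results[$task->unique()] = $task->data();
    });

    foreach ($payloads as $key => $payload) {
        // The 4th argument is a unique id of our choosing, used here
        // to match each reply to its payload.
        $client->addTask('getWorld', $payload, null, "task-$key");
    }

    // Sends all queued tasks and blocks until every one has finished.
    $client->runTasks();

    return $results;
}

// Usage, once gearmand and the worker are running:
// print_r(getWorlds(['Hello ', 'Hi ', 'Hey ']));
```

With several worker processes registered for `getWorld`, the tasks are picked up in parallel. With a single worker they simply run one after another.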

ZeroMQ

ZeroMQ is the new favorite technology at HootSuite. Please read my previous blog post here to find out why! Here is a summary of ZeroMQ’s characteristics.

A C library, not an AMQP service such as RabbitMQ, ActiveMQ…
Hides Complex TCP and Socket Handling
High Performance, Reliable, Flexible
Language Agnostic (30+ bindings)
Multiple transports such as TCP, IPC, inproc, multicast
Out of box communication patterns such as Req/Rep, Pub/Sub, Pipeline
Library: http://www.zeromq.org
PHP Extension: http://php.zero.mq

ZeroMQ is one of the fastest growing technical skill sets according to LinkedIn, so if you don’t know what it is yet, definitely check it out. It’s not a silver bullet that magically solves everything, but it does make building distributed systems a lot simpler. HootSuite skipped WS/SOAP, Thrift and even REST, and uses ZeroMQ directly to build our internal SOA.

The following simple code examples illustrate how to use ZeroMQ to build a Hello World client/worker program similar to the examples above. Again, make sure you have the ZeroMQ C library and the PHP extension installed.

// zmq_client.php
// create context
$context = new ZMQContext();
// create DEALER socket http://api.zeromq.org/2-1:zmq-socket#toc6
$socket = new ZMQSocket($context, ZMQ::SOCKET_DEALER);
// client connects
$socket->connect('tcp://127.0.0.1:15000');
// send 100 Hellos
for ($i = 0; $i < 100; $i++) {
  $socket->send("$i - Hello");
  echo $socket->recv() . "\n";
  sleep(1);
}

// zmq_server.php
// create context
$context = new ZMQContext();
// create ROUTER socket, http://api.zeromq.org/2-1:zmq-socket#toc7
$socket = new ZMQSocket($context, ZMQ::SOCKET_ROUTER);
// worker binds
$socket->bind('tcp://*:15000');
// poll the socket, like an event dispatcher
$poll = new ZMQPoll();
$poll->add($socket, ZMQ::POLL_IN);
$readable = $writeable = array();
while (true) {
  $events = $poll->poll($readable, $writeable, 1000);
  foreach ($readable as $s) {
    // when there is an incoming message, deal with it:
    // $message[0] is the client identity, $message[1] the payload
    $message = $socket->recvmulti();
    $socket->sendmulti(array($message[0], $message[1] . " World!"));
  }
}

What I suggest is this: run the client code before you start the worker and see what happens. Then start the worker and see what happens. While the client is looping through its 100 iterations, stop the worker in the middle for a few seconds, then start it again and see what happens. Do this a few times. Think about what you have achieved by writing ~30 lines of code, without a single line of error handling, and also remember there isn’t any message broker such as RabbitMQ or Gearman in the middle.

—————–

Intrigued? Interested? Come build awesome products with other talented engineers at HootSuite. Apply here!

7 thoughts on “Parallel Processing / Task Distribution with PHP”

krakjoe says:

May 1, 2013 at 5:35 pm

Actually, pthreads does not require you to consider the safety of extensions. Additionally, the vast majority of PHP extensions are actually thread safe, the misinformation that they are not has been floating around the internet since the days they weren’t; I’m afraid not many people who are actually qualified to talk about such things bother to do so, so the myth persists.

1. beiercai says:
  
  May 1, 2013 at 6:52 pm
  
  @krakjoe Thanks for pointing it out, mentioned it in the blog.
  
tfry says:

September 17, 2014 at 7:41 pm

Curious to learn more about your SOA architecture, and what you used in place of the yucky WS/SOAP default. Would you be able to share some details?

Thanks,

Tim

beiercai says:

September 17, 2014 at 9:50 pm

Hi tfry

We’ve been working pretty hard in the last while building out our SOA, moving from LAMP to Scala based services connected by ZeroMQ, been working really well for us so far. Expect future blog posts for full review of our architecture, stay tuned!

tfry says:

September 19, 2014 at 2:06 pm

beiercai tfry Thanks — looking forward to seeing the posts!

ThomasCheng says:

June 21, 2015 at 12:27 am

Hi there,

Thanks for the article. I have a question.

In the article you mentioned about sharing resources between forked processes.

“Besides, having children process sharing resources with parent is dangerous, for example a child can close a MySQL connection without parent knowing.”

I read in several places, and a lot of people say that “fork()” does not share resources. Instead, it makes a memory clone of the parent thread.

http://stackoverflow.com/questions/8707339/sharing-variables-between-child-processes-in-php

So I was wondering if this is the difference in PHP versions? Or did I misunderstood something?

If you could help clarify that would be great!

BTW, just FYI, what I’m trying to do is to make a generic module which branches several processes on a cron trigger on a shared hosting environment. Each process (including the parent process, after forking, will look up the task que, and start popping out tasks one by one from the db-driven task queue I have. So I’d be having 5 processes in parallel, each processing from the same task queue one by one.

Each process can have an execution_time of 60 seconds. **So actually it’d be even better if these processes don’t depend on each other, because I know forking means if parent reaches 60 second, the children would be gone together

===

Thanks in advance!

Cheers,
Thomas
