Effortless Serverless Map Reduce
Refinery allows you to do effortless map reduce in order to spread your compute across hundreds (or thousands) of machines. This doesn't require writing any new code or integrating with any API frameworks.
If you don't care about collecting all the results from the map part of your map-reduce you should probably use a
Queue Block instead. Queues are a much more efficient and speedy way to process a large number of items when you don't care about merging all of the results immediately at the end.
Refinery makes it very easy to use a large number of machines to do concurrent computation. The ease of use can be problematic in that it also makes it easy to run up a large bill with little effort. While the cost of spinning up and using 1,000 virtual servers in Refinery is an order of magnitude cheaper than conventional hosts, it's not free. You should always carefully consider the resources being used and should calculate your costs appropriately.
Creating a Serverless Map Reduce Pipeline
A map reduce pipeline in Refinery consists of at least three parts. The following is an example Refinery diagram of this:
One Code Block to return an array of items to be distributed to workers, one Code Block to do work on a single item in the array, and one Code Block to get the results as an array.
The following are an example of the
Code Block for each:
Code Block #1: which returns an array of items to be distributed to the worker machines. This is as simple as return an array like
[1,2,3,4,5]. The following code is an example of this:
def main(block_input, backpack): return_list =  # Generate a list of 100 numbers in an array for i in range( 0, 100 ): return_list.append( i ) # Return the array return return_list
Code Block #2(connected to
Code Block #1via a
fan-outtransition): which does an operation on a single item of the array. This is the
Code Blockwhich does the computationally-intensive work you would want to use a map reduce for. A very basic example is the following which just takes an item from the above returned number array and doubles it:
def main(block_input, backpack): # Multiply the item by two and return the new value return ( block_input * 2 )
While this example is trivial (and not a good use-case for
fan-out), it demonstrates the format that the
fan-out data will use when passing data between
Code Block #3(connected to
Code Block #2via a
fan-intransition): which is passed an array of the returned values from all of the executions of
Code Block #2. In our example case, the array passed to this final code block would be
Example Fan-Out & Fan-In Execution
The above video demonstrates an execution of the example mentioned previously. Once we execute the first
Code Block the second
Code Block is fanned-out to so that it executes 100 times (1 per each item returned by the first
Block Executions debugging utility we can easily see the input data and output data to each block. The video shows that the first
Code Block returns an array of 100 numbers. The second
Code Block (connected to the first via a
fan-out) executed 100 times which each individual number as an input. Finally, the third
Code Block (connected to the first via a
fan-in) received an array of 100 return values from the second
It's important to note that you don't immediately have to
fan-in after doing a
fan-out. You can do multiple
then transitions and then eventually do a
fan-in with the results.
It's important to ensure that before doing a
fan-in you've not increased or decreased the number of parallel executions. For example if you do an
if transition and only do a
fan-in with fifty executions instead of the originally fanned-out 100 executions - that won't work. For the same reason, it's important to note that all of your executions between a
fan-out and a
fan-in must not result in an uncaught exception. This will break a
fan-in for the same reason: the number of executions coming to a
fan-in is less than the number of executions from the previous
Limits of Fan-Outs & Fan-Ins (and other considerations)
By default, your Refinery account has an execution limit of 1,000 concurrent executions across all of your projects. This is purely an account-level limit which can be raised if more capacity is required. However, a 1,000 concurrency limit does not mean that you are limited to only fanning-out to less than 1,000 items. You can still do a
fan-out of more than 1,000 items and it will work the same way. This is because, generally speaking, Refinery pipelines will only slow down when dealing with extreme load (see the exception to this in the warning below). When the 1,000 concurrent execution limit is hit the additional executions that need to be performed are simply queued up to be retried once more execution capacity is available. In the context of a
fan-out this just means that the 10,000 executions will take a bit longer to execute.
The following example demonstrates this:
Code Blockdoes a
fan-outwith an array of 10,000 items.
- While our system executes the connected
Code Block10,000 times, the default concurrency limit is hit (1,000 concurrent executions). Instead of failing, the executions are simply queued up to be retried later.
- Some of the previously-running
Code Blocksfinish executing. The queued up
Code Blockswhich couldn't execute previously are now executed since there is available capacity for them.
- Eventually all the
Code Blocksin the
fan-outfinish executing and the
Code Blockconnected via the
fan-inis executed with the results passed as input.
In this example, the entire pipeline works as expected but at a slightly slower pace.
As a reminder, if you don't need to do the
fan-in part you should utilize the
Queue Block instead. The
Queue Block offers significant advantages in terms of speed, cost, and automatic scaling.
Currently, doing a
fan-out that results in hitting your max concurrency limit (default of 1,000 concurrent executions for Refinery accounts) can cause some requests to API Endpoints to fail. This is because API Endpoints need to execute the attached
Code Blocks to respond to the web request. When the execution capacity is maxed out the
Code Block execution will fail and an error will be returned for the request. This is unique in Refinery because most pipelines will handle a maxed-out capacity situation by simply slowing down instead of breaking. In future releases of Refinery this problem will be completely fixed but as of this time it is important to note this limitation.
As a temporary workaround, multiple Refinery accounts can be used to avoid this problem. One Refinery account is used for real-time sensitive projects and one Refinery account is used for non-real-time sensitive projects.