Joins

Inner or outer joins on data, similar to database joins and cartesian products.

Not to be mistaken for flow-based joins that work on I/O channels.

TODO

class rdc.etl.transform.join.Join(join=None, is_outer=False, default_outer_join_data=None)

Join some key => value pairs, which can depend on the source hash.

This element can change the stream length: the stream grows when more than one item is joined to a row, and shrinks when no item is joined to a row in an inner join.

join(hash, channel=0)

Abstract method that must be implemented in concrete subclasses to return the data that should be joined with the given row.

The return value should be iterable, or a value that evaluates to False in a boolean test.

If the result is iterable and its length is greater than 0, the result of this transform will be a cartesian product between this method's result and the original input row.

If the result is false, or iterable but empty, the result of this transform will depend on the join type, determined by the is_outer attribute.

If is_outer == True, the transform output will be a simple union between the input row and the result of self.get_default_outer_join_data().
If is_outer == False, the row will be dropped and will not generate any output from this transform.

Default join type is inner, to preserve backward compatibility.
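
The rules above can be modelled in a few lines of plain Python. This is only a sketch of the documented behaviour, not the rdc.etl implementation: join_rows and lookup are hypothetical names, rows are modelled as plain dicts, and it assumes that joined keys overwrite keys of the same name in the input row, which the text above does not specify.

```python
def join_rows(rows, join, is_outer=False, default_outer_join_data=None):
    """Plain-Python model of the inner/outer join rules (not rdc.etl code)."""
    for row in rows:
        # Treat a false result (None, False, ...) like an empty iterable.
        joined = list(join(row) or ())
        if joined:
            # Non-empty result: emit one output row per joined item.
            for extra in joined:
                merged = dict(row)
                merged.update(extra)  # assumption: joined keys win on conflict
                yield merged
        elif is_outer:
            # Nothing to join, outer join: keep the row, merged with the defaults.
            merged = dict(row)
            merged.update(default_outer_join_data or {})
            yield merged
        # Nothing to join, inner join: the row produces no output.


def lookup(row):
    # Hypothetical lookup table keyed on the row's 'id' field.
    table = {1: [{'color': 'red'}, {'color': 'blue'}]}
    return table.get(row['id'])


rows = [{'id': 1}, {'id': 2}]
inner = list(join_rows(rows, lookup))
outer = list(join_rows(rows, lookup, is_outer=True,
                       default_outer_join_data={'color': None}))
```

With these inputs, the inner join yields two rows for id 1 and none for id 2, so the stream length happens to stay at two. The outer join additionally keeps id 2 with color set to None, giving three rows.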

Example:

>>> from rdc.etl.io import STDIN
>>> from rdc.etl.transform.join import Join
>>> from rdc.etl.transform.util import clean

>>> @Join
... def my_join(hash, channel=STDIN):
...     return ({'a':1}, {'b':2}, )

>>> map(clean, my_join({'foo': 'bar'}, {'foo': 'baz'}, ))
[H{'foo': 'bar', 'a': 1}, H{'foo': 'bar', 'b': 2}, H{'foo': 'baz', 'a': 1}, H{'foo': 'baz', 'b': 2}]