Joins
Inner or outer joins on data, similar to database joins (cartesian products).
Not to be confused with flow-based joins, which operate on I/O channels.
TODO
class rdc.etl.transform.join.Join(join=None, is_outer=False, default_outer_join_data=None)
Join some key => value pairs, which can depend on the source hash.
This element can change the stream length: a row joined with more than one item produces several output rows, while a row joined with no items produces none under an inner join.
join(hash, channel=0)
Abstract method that must be implemented in concrete subclasses to return the data that should be joined with the given row.
The return value should be an iterable, or a value that evaluates to False.
If the result is an iterable with a length greater than 0, the output of this transform is the cartesian product of the method's result and the original input row.
If the result is false, or is an iterable of length 0, the output of this transform depends on the join type, determined by the is_outer attribute.
- If is_outer == True, the transform output will be a simple union between the input row and the result of self.get_default_outer_join_data()
- If is_outer == False, this row will be dropped and will not generate any output from this transform.
Default join type is inner, to preserve backward compatibility.
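The rules above can be sketched with plain dictionaries. This is only an illustration of the described behaviour, not the library's implementation: apply_join is a hypothetical helper, it does not use rdc.etl at all, and treating missing default outer join data as an empty dict is an assumption.

```python
def apply_join(row, joined, is_outer=False, default_outer_join_data=None):
    """Illustrate the join rules on plain dicts (hypothetical helper, not rdc.etl)."""
    # A false value (None, False, empty tuple...) counts as "nothing to join".
    items = list(joined) if joined else []
    if items:
        # Non-empty result: one output row per joined item.
        return [{**row, **item} for item in items]
    if is_outer:
        # Outer join: keep the row, merged with the default data
        # (assumed here to be an empty dict when not provided).
        return [{**row, **(default_outer_join_data or {})}]
    # Inner join: the row is dropped and produces no output.
    return []


# Two joined items: the stream grows from one row to two.
grown = apply_join({'foo': 'bar'}, ({'a': 1}, {'b': 2}))

# Nothing to join, inner join (the default): the row disappears.
dropped = apply_join({'foo': 'bar'}, None)

# Nothing to join, outer join: the row is kept with the default data.
kept = apply_join({'foo': 'bar'}, [], is_outer=True,
                  default_outer_join_data={'a': None})
```

Here grown holds two rows, dropped is empty, and kept holds the single input row extended with the default data.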
Example:
>>> from rdc.etl.transform.join import Join
>>> from rdc.etl.transform.util import clean
>>> @Join
... def my_join(hash, channel=STDIN):
...     return ({'a':1}, {'b':2}, )
>>> map(clean, my_join({'foo': 'bar'}, {'foo': 'baz'}, ))
[H{'foo': 'bar', 'a': 1}, H{'foo': 'bar', 'b': 2}, H{'foo': 'baz', 'a': 1}, H{'foo': 'baz', 'b': 2}]
-