blog darkness...

words, music, food

PySpark: use generators / yield in RDD map

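It wasn't obvious to me at first, and the documentation isn't great. If you use yield in a mapper function to generate multiple output rows from a single input row, and you pass that function to RDD.map(), it won't work. map() produces exactly one output per input, so each row of the result is the generator object itself rather than the rows it yields, and the job typically dies when Spark tries to serialize those generator objects.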

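The solution is to use RDD.flatMap() instead. It iterates over whatever the mapper returns (a generator is fine) and emits each item as its own row.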
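Here's a minimal sketch of what that looks like. The function names (split_words, words_rdd, flat_map_locally) are made up for illustration, and sc is assumed to be a SparkContext you already have. The last helper is just a plain-Python stand-in so you can check the mapper without a cluster; it is not part of PySpark.

```python
from itertools import chain


def split_words(line):
    """Generator mapper: yields one output row per word in the input row."""
    for word in line.split():
        yield word


def words_rdd(sc, lines):
    """Build an RDD of words from a list of lines.

    `sc` is assumed to be an existing pyspark.SparkContext.
    """
    # flatMap iterates over the generator returned by the mapper and
    # emits each yielded item as a separate row. With .map() here,
    # each row would be a generator object instead.
    return sc.parallelize(lines).flatMap(split_words)


def flat_map_locally(func, rows):
    """Plain-Python stand-in for flatMap, for testing a mapper without Spark."""
    return list(chain.from_iterable(func(row) for row in rows))
```

With a live SparkContext, words_rdd(sc, ["a b", "c"]).collect() should hand back the individual words as rows. Keeping the mapper as an ordinary generator function also means you can test it on its own, which is what flat_map_locally is for.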
