add join #31

mrocklin · 2014-06-05T23:42:15Z

This is ugly and likely inefficient. Help.

toolz.join has some pretty complex yield stuff going on. The work here is pretty much just writing down state explicitly to handle all of that. It's pretty ugly and I assume that it's not any more efficient than the Pure Python solution. Tips appreciated.

This is ugly and likely inefficient. Help.

eriknw · 2014-06-06T05:15:38Z

Thanks for tackling this. A working implementation is better than no implementation. Still, I see what you mean when you call it ugly; there is a reason yield is awesome! The reasons cytoolz doesn't use yield is so there can be fast C bindings available to other Cython code, I think using yield was a little slower (or at least not faster than the C extension), and we were able to implement everything just fine without it. If there is ever a compelling reason to use yield (and join is probably the most compelling case so far), I think we should consider it. We lose the fast C binding, but that's an advanced feature anyway. But, since there are C bindings, it's very convenient that there are C bindings to everything, and it would be great to keep it this way.

An inner join can actually be relatively simple and clean (untested code):

    def __next__(self):
        cdef PyObject *obj = NULL
        if not self.matches:
            while obj is NULL:
                self.right = next(self.rightseq)
                key = self.rightkey(self.right)
                obj = PyDict_GetItem(self.d, key)
            self.matches = <object>obj
            self.matches.reverse()
        match = self.matches.pop()
        return (match, self.right)

I don't know if it's a good idea or faster to reverse (do we care about the order?) and pop the items from the matches list, and it would be straightforward to change it to iteration. Actually, here's a version of an inner join with iteration over matches:

    def __next__(self):
        cdef PyObject *obj
        obj = PyIter_Next(self.matches)
        if obj is not NULL:
            match = <object>obj
        else:
            # StopIteration is not considered an exception here.
            # What other exceptions can occur when iterating a list?
            # The following check is needed in general, however, and
            # is easy to forget.
            obj = PyErr_Occurred()
            if obj is not NULL:
                raise <object>obj
            while obj is NULL:
                self.right = next(self.rightseq)
                key = self.rightkey(self.right)
                obj = PyDict_GetItem(self.d, key)
            self.matches = iter(<object>obj)
            match = next(self.matches)
        return (match, self.right)

Using PyIter_Next like this can get pretty ugly. It returns NULL but sets no exception if StopIteration is encountered, but it returns NULL and sets an exception that should be checked with PyErr_Occurred() if any other exception occurs. I believe this is faster than doing try - except StopIteration though.

I haven't considered outer joins yet, but maybe you can use something from here. It would be okay for join to be a function that returned one of _join_inner, _join_outer, _join_left_outer, _join_right_outer if this makes the code easier to understand and develop. This would add more code by violating the DRY principle, so use your best judgement. I've done this before in cases where two simple things is preferable to one complicated thing.

eriknw · 2014-06-06T05:35:48Z

Hmm, the version that iterates over self.matches with PyIter_Next isn't the best example except for educational purposes. It's clearer (and I believe faster) to use an iterating index, such as:

    def __next__(self):
        cdef PyObject *obj = NULL
        if self.i == len(self.matches):
            while obj is NULL:
                self.right = next(self.rightseq)
                key = self.rightkey(self.right)
                obj = PyDict_GetItem(self.d, key)
            self.matches = <object>obj
            self.i = 0
        match = <object>PyList_GET_ITEM(self.matches, self.i)  # skip error checking
        self.i += 1
        return (match, self.right)

This is beginning to split hairs.

eriknw · 2014-06-06T14:57:01Z

A right outer join is also pretty simple:

    def __next__(self):
        cdef PyObject *obj
        if self.i == len(self.matches):
            self.right = next(self.rightseq)
            key = self.rightkey(self.right)
            obj = PyDict_GetItem(self.d, key)
            if obj is NULL:
                return (self.left_default, self.right)
            self.matches = <object>obj
            self.i = 0
        match = <object>PyList_GET_ITEM(self.matches, self.i)  # skip error checking
        self.i += 1
        return (match, self.right)

Left outer join and full outer join are the more involved cases, and it gets pretty complicated when any of the joins are merged into a single function.

Having the joins defined separately may be simpler to understand, because inner join and right outer join are so much easier to understand on their own. Some code repetition may be avoided by using the same initialization routine that is defined in a parent class. Subclassing from a single C extension class is straightforward.

Heh, I hope you're not souring on Cython, @mrocklin. You always seem to tackle ugly, difficult cases!

eriknw · 2014-06-06T14:57:30Z

cytoolz/itertoolz.pyx

+                key, matches = next(self.d_items)
+                while(key in self.seen_keys and matches):
+                    key, matches = next(self.d_items)
+                self.key = key


self.key appears to be unnecessary.

I'm not sure that this is worth it

mrocklin · 2014-06-08T14:39:49Z

OK, I've pushed up a version with different _*join clases. I'm not confident that the left outer joins can be written concisely and so I'm tempted to merge a couple of these. In particular, I expect that optimizing join will be pretty expensive from a development standpoint..

mrocklin · 2014-06-08T14:57:30Z

I've change my old join to match the full outer join and then put in a few checks to make it cover all cases. We now have two viable options, splitting the joins by type and having one mega-join.

Thoughts welcome.

mrocklin · 2014-06-08T15:03:45Z

Have you ever done line profiling in Cython? I'm curious how much overhead the if checks take up. Also generally interested where the bottlenecks are.

eriknw · 2014-06-08T16:32:44Z

I have not done line profiling in Cython yet. Doing so would have added too much time to the initial development of cytoolz, and I have not yet done a revision iteration for which I expect I will do line profiling. If you do it, let me know how it goes!

The thing I don't like about the mega-join version is its use of recursion for inner joins and left outer joins. The maximum recursion depth can be exceeded when there are many consecutive keys from rightseq that do not match any of the keys from leftseq. Such a pathological case may be common for some types of data. It is primarily for this reason that I prefer splitting the joins by type. It was great to see both options, though, so thanks!

I'm not too concerned about optimizing join from a development standpoint when the joins are separate for two reasons:

I don't think there is much room for optimization
The different cases can be optimized independently

eriknw · 2014-06-08T16:33:14Z

cytoolz/itertoolz.pyx

+
+cdef class _inner_join(_join):
+    def __iter__(self):
+        self.matches = ()


Why is this necessary?

Isn't. Removed. It was old cruft that hung around.

mrocklin · 2014-06-09T23:59:54Z

I don't think there is much room for optimization

One big case will be when leftkey or rightkey are given as non-callables, implying that there will be lots of GetItem going on. Managing code complexity in this case might be challenging, especially when we also have multiple solutions for inner/outer

mrocklin · 2014-06-10T02:51:20Z

In [1]: from toolz.curried import *

In [2]: import toolz

In [3]: import cytoolz

In [4]: small = [(i, str(i)) for i in range(100)] * 10

In [5]: big = pipe([110]*10000, map(range), concat, list)

In [6]: from cytoolz.itertoolz import _consume

In [7]: timeit _consume(toolz.join(0, small, identity, big))
1 loops, best of 3: 1.12 s per loop

In [8]: timeit _consume(cytoolz.itertoolz.join(0, small, identity, big))
1 loops, best of 3: 521 ms per loop

First, hurrah, we got the magical 2x speedup. Second, why can't I import join from cytoolz directly?

eriknw · 2014-06-10T02:57:13Z

First, hurrah, we got the magical 2x speedup.

Cool, and this is with the mega-join version, right?

Second, why can't I import join from cytoolz directly?

Add it to __all__ in itertoolz.pyx of course!

eriknw · 2014-06-10T03:04:40Z

I don't think there is much room for optimization

One big case will be when leftkey or rightkey are given as non-callables, implying that there will be lots of GetItem going on. Managing code complexity in this case might be challenging, especially when we also have multiple solutions for inner/outer

I've been considering how to best do this both in regards to the mega version and the split version. The mega version would probably just use if-else statements which may or may not be in a convenience cdef method. I think the split version can be done very efficiently with additional subclasses (of course, the mega version could too). I'll post a proof of concept class hierarchy that should be nearly optimal and have minimal impact on code duplication. Plus, the performance improvement should be well worth it (especially in functions like groupby).

mrocklin · 2014-06-10T03:24:35Z

It might be wise to try this out first on groupby just to ensure that getter functions do impose a sizable cost and so are worth optimizing out.

mrocklin · 2014-06-10T03:25:18Z

BTW, what's your schedule like these days? I'm curious how aggressively I need to budget my time here before SciPy.

eriknw · 2014-06-10T03:33:53Z

Agreed. Three subclasses--one for callable key, one for single index, and one for list of indices--will be needed for each function. This does not appeal to our Pythonic sensibilities. It is more boilerplate than we typically like to have, and the result for join will be 12 more subclasses.

Here is the approach I'm considering:

cdef class _join:
    cdef object _rightkey
    ...


cdef class _right_outer_join(_join):
    ...


cdef class _right_outer_join_key(_right_outer_join):
    cdef object rightkey

    # This must match the signature of `__cinit__` in `_join`
    def __cinit__(self,
                  object leftkey, object leftseq,
                  object rightkey, object rightseq,
                  object left_default=no_default,
                  object right_default=no_default):
        self.rightkey = rightkey


cdef class _right_outer_join_index(_right_outer_join):
    cdef inline object rightkey(self, object item):
        return item[self._rightkey]

cdef class _right_outer_join_indices(_right_outer_join):
    cdef object rightkey(self, object item): 
        # make tuple
        ...

While not exactly pretty, it does achieve efficiency while avoiding duplicating most code. Hmm, I should make sure this works. _right_outer_join may not like trying to call self.rightkey if it doesn't exist even though it will exist in a subclass.

mrocklin · 2014-06-10T03:38:58Z

A quick groupby profile. We see that getting items with a lambda takes around 20% in the toolz case, 35% in the cytoolz case.

In [1]: from toolz import groupby

In [2]: data = [(i, i % 10) for i in range(1000000)]

In [3]: prun -s cumulative groupby(lambda x: x[1], data)
         2000014 function calls in 0.489 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.015    0.015    0.489    0.489 <string>:1(<module>)
        1    0.300    0.300    0.475    0.475 itertoolz.py:56(groupby)
  1000000    0.113    0.000    0.113    0.000 <string>:1(<lambda>)
  1000000    0.061    0.000    0.061    0.000 {method 'append' of 'list' objects}
       10    0.000    0.000    0.000    0.000 itertoolz.py:81(<lambda>)
        1    0.000    0.000    0.000    0.000 {callable}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

In [4]: from cytoolz import groupby

In [5]: prun -s cumulative groupby(lambda x: x[1], data)
         1000003 function calls in 0.284 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.015    0.015    0.284    0.284 <string>:1(<module>)
        1    0.168    0.168    0.269    0.269 {cytoolz.itertoolz.groupby}
  1000000    0.101    0.000    0.101    0.000 <string>:1(<lambda>)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

mrocklin · 2014-06-10T03:40:59Z

Oh, I didn't know that cytoolz supported function inlining. That changes things. Do we need separate classes? Any chance we can move functions around dynamically at runtime? :)

mrocklin · 2014-06-10T03:43:15Z

Wait, no, I take it back. Significantly more gains seem possible. These results break my intuition gained above. I'm now less confident in my ability to infer from profiling results.

In [9]: prun -s cumulative groupby(lambda x: x[1], data)
         1000003 function calls in 0.290 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.015    0.015    0.290    0.290 <string>:1(<module>)
        1    0.171    0.171    0.276    0.276 {cytoolz.itertoolz.groupby}
  1000000    0.104    0.000    0.104    0.000 <string>:1(<lambda>)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

In [10]: prun -s cumulative groupby(itemgetter(1), data)
         3 function calls in 0.099 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.015    0.015    0.099    0.099 <string>:1(<module>)
        1    0.084    0.084    0.084    0.084 {cytoolz.itertoolz.groupby}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

eriknw · 2014-06-10T03:55:37Z

BTW, what's your schedule like these days? I'm curious how aggressively I need to budget my time here before SciPy.

Pretty flexible. I completely overdid it over Labor Day weekend: I overloaded my allergies, which had me out of commission for nearly two weeks (note to self: don't go on long bike rides when farmers are plowing fields!). I still have a lingering cough, but at least I can sleep now. Anyway, I've been catching up on other things, and life is returning to normal.

So. Schedule. benchtoolz needs about 30 minutes of work before I can push it. I've been sitting on it far longer than I intended. Although I'm busy tomorrow, I believe we can push for a new release of toolz and cytoolz relatively soon, and definitely before the SciPy conference. I will be around to help with this and do much of the work on cytoolz myself. toolz mostly needs Python 3.4 support and your streaming analytics doc page added.

My talk on functional programming in Python with PyToolz was accepted for PyOhio. This is in 6-7 weeks, and I will begin drafting it soon. Heh, I also plan to give a Lightning Talk on benchtoolz, and I may even pitch it for a sprint.

In other words, I'll be more active than I have been in recent weeks, and we can push to develop a version of join that we're both happy with.

mrocklin · 2014-06-10T04:04:37Z

Sorry to hear about the allergies. I had a mercifully brief incident earlier this week. It's alarming how bad life gets when swallowing becomes painful.

Great about the PyOhio talk! Let me know if I can help in any way.

mrocklin · 2014-06-10T04:06:07Z

and we can push to develop a version of join that we're both happy with.

Most of my concern about breaking out into multiple implementations was that I didn't have inlining in mind, and so was picturing a dozen parallel implementations of __next__. The way you've outlined it seems like a good way to go.

eriknw · 2014-06-10T04:07:26Z

Interesting profiling. How do timeit results compare?

Oh, I didn't know that cytoolz supported function inlining. That changes things. Do we need separate classes? Any chance we can move functions around dynamically at runtime? :)

C functions are not first class citizens, although pointers to C functions should work :-P

mrocklin · 2014-06-10T04:09:09Z

See #33 for further benchmark discussion

eriknw · 2014-06-10T04:40:45Z

Most of my concern about breaking out into multiple implementations was that I didn't have inlining in mind, and so was picturing a dozen parallel implementations of next. The way you've outlined it seems like a good way to go.

Oh, right, we should simply use cdef inline object rightkey(self, item) for all the variations, so ignore the extra cruft in _right_outer_join_key above. I have tested this approach with a toy class hierarchy, and it appears to work. Heh, it's too bad Cython doesn't support mixins of extension classes.

Great about the PyOhio talk! Let me know if I can help in any way.

Thanks! I'll definitely post the slides for review long before the conference. I'm wondering how much I should use (not copy/paste) from existing documentation and presentations. The examples from them are great. I think first I will need to develop the story arc for my presentation, then thread in suitable content from existing material. I want this talk to be my own, but it would be stupid to ignore available resources.

mrocklin · 2014-06-24T17:37:44Z

@eriknw thoughts on merging this (once I handle conflicts) without adding the more efficient bits? I'd like to publish the streaming analytics blogpost soon and might not have time soon to take care of the inlined solution here.

eriknw · 2014-06-24T18:13:59Z

Merging a working version soon sounds great, and maybe I can explore the inline version once other issues are addressed.

How soon do you want to publish the blogpost? I think we should push for a release of *toolz just before the blogpost, and both should preferably come before the SciPy conference.

cytoolz has almost caught up with toolz. Other than join (and the sandbox), I think the only thing missing is using indices for keys.

The two things I would like toolz to get before the next release are Python 3.4 support and the streaming analytics documentation. It would be nice to add the "Logical Operators" PR too, but this isn't a must-have.

eriknw · 2014-06-24T18:25:15Z

Also, you may want to wait until I merge #38 before resolving conflicts.

Conflicts: cytoolz/itertoolz.pyx cytoolz/tests/test_itertoolz.py

mrocklin · 2014-06-28T14:57:04Z

Conflicts resolved.

eriknw · 2014-06-28T19:27:07Z

Excellent. Merging so we can move onto other things. We can explore the more efficient approach later.

add join

add join

dd2d95f

This is ugly and likely inefficient. Help.

eriknw reviewed Jun 6, 2014
View reviewed changes

split out different joins into different classes

919a97c

I'm not sure that this is worth it

single join class with single __next__

1734d7c

eriknw reviewed Jun 8, 2014
View reviewed changes

remove loose __iter__

78f725a

add join to __init__.pxd

5867a32

mrocklin mentioned this pull request Jun 10, 2014

Performance benchmarks #33

Open

Merge branch 'master' into join

33a1068

Conflicts: cytoolz/itertoolz.pyx cytoolz/tests/test_itertoolz.py

eriknw added a commit that referenced this pull request Jun 28, 2014

Merge pull request #31 from mrocklin/join

993c3c6

add join

eriknw merged commit 993c3c6 into pytoolz:master Jun 28, 2014

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add join #31

add join #31

mrocklin commented Jun 5, 2014

eriknw commented Jun 6, 2014

eriknw commented Jun 6, 2014

eriknw commented Jun 6, 2014

eriknw Jun 6, 2014

mrocklin commented Jun 8, 2014

mrocklin commented Jun 8, 2014

mrocklin commented Jun 8, 2014

eriknw commented Jun 8, 2014

eriknw Jun 8, 2014

mrocklin Jun 8, 2014

mrocklin commented Jun 9, 2014

mrocklin commented Jun 10, 2014

eriknw commented Jun 10, 2014

eriknw commented Jun 10, 2014

mrocklin commented Jun 10, 2014

mrocklin commented Jun 10, 2014

eriknw commented Jun 10, 2014

mrocklin commented Jun 10, 2014

mrocklin commented Jun 10, 2014

mrocklin commented Jun 10, 2014

eriknw commented Jun 10, 2014

mrocklin commented Jun 10, 2014

mrocklin commented Jun 10, 2014

eriknw commented Jun 10, 2014

mrocklin commented Jun 10, 2014

eriknw commented Jun 10, 2014

mrocklin commented Jun 24, 2014

eriknw commented Jun 24, 2014

eriknw commented Jun 24, 2014

mrocklin commented Jun 28, 2014

eriknw commented Jun 28, 2014

add join #31

add join #31

Conversation

mrocklin commented Jun 5, 2014

eriknw commented Jun 6, 2014

eriknw commented Jun 6, 2014

eriknw commented Jun 6, 2014

eriknw Jun 6, 2014

Choose a reason for hiding this comment

mrocklin commented Jun 8, 2014

mrocklin commented Jun 8, 2014

mrocklin commented Jun 8, 2014

eriknw commented Jun 8, 2014

eriknw Jun 8, 2014

Choose a reason for hiding this comment

mrocklin Jun 8, 2014

Choose a reason for hiding this comment

mrocklin commented Jun 9, 2014

mrocklin commented Jun 10, 2014

eriknw commented Jun 10, 2014

eriknw commented Jun 10, 2014

mrocklin commented Jun 10, 2014

mrocklin commented Jun 10, 2014

eriknw commented Jun 10, 2014

mrocklin commented Jun 10, 2014

mrocklin commented Jun 10, 2014

mrocklin commented Jun 10, 2014

eriknw commented Jun 10, 2014

mrocklin commented Jun 10, 2014

mrocklin commented Jun 10, 2014

eriknw commented Jun 10, 2014

mrocklin commented Jun 10, 2014

eriknw commented Jun 10, 2014

mrocklin commented Jun 24, 2014

eriknw commented Jun 24, 2014

eriknw commented Jun 24, 2014

mrocklin commented Jun 28, 2014

eriknw commented Jun 28, 2014