Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add join #31

Merged
merged 6 commits into from
Jun 28, 2014
Merged

add join #31

merged 6 commits into from
Jun 28, 2014

Conversation

mrocklin
Copy link
Member

@mrocklin mrocklin commented Jun 5, 2014

This is ugly and likely inefficient. Help.

toolz.join has some pretty complex yield stuff going on. The work here is pretty much just writing down state explicitly to handle all of that. It's pretty ugly and I assume that it's not any more efficient than the Pure Python solution. Tips appreciated.

This is ugly and likely inefficient.  Help.
@eriknw
Copy link
Member

eriknw commented Jun 6, 2014

Thanks for tackling this. A working implementation is better than no implementation. Still, I see what you mean when you call it ugly; there is a reason yield is awesome! The reasons cytoolz doesn't use yield is so there can be fast C bindings available to other Cython code, I think using yield was a little slower (or at least not faster than the C extension), and we were able to implement everything just fine without it. If there is ever a compelling reason to use yield (and join is probably the most compelling case so far), I think we should consider it. We lose the fast C binding, but that's an advanced feature anyway. But, since there are C bindings, it's very convenient that there are C bindings to everything, and it would be great to keep it this way.

An inner join can actually be relatively simple and clean (untested code):

    def __next__(self):
        cdef PyObject *obj = NULL
        if not self.matches:
            while obj is NULL:
                self.right = next(self.rightseq)
                key = self.rightkey(self.right)
                obj = PyDict_GetItem(self.d, key)
            self.matches = <object>obj
            self.matches.reverse()
        match = self.matches.pop()
        return (match, self.right) 

I don't know if it's a good idea or faster to reverse (do we care about the order?) and pop the items from the matches list, and it would be straightforward to change it to iteration. Actually, here's a version of an inner join with iteration over matches:

    def __next__(self):
        cdef PyObject *obj
        obj = PyIter_Next(self.matches)
        if obj is not NULL:
            match = <object>obj
        else:
            # StopIteration is not considered an exception here.
            # What other exceptions can occur when iterating a list?
            # The following check is needed in general, however, and
            # is easy to forget.
            obj = PyErr_Occurred()
            if obj is not NULL:
                raise <object>obj
            while obj is NULL:
                self.right = next(self.rightseq)
                key = self.rightkey(self.right)
                obj = PyDict_GetItem(self.d, key)
            self.matches = iter(<object>obj)
            match = next(self.matches)
        return (match, self.right)

Using PyIter_Next like this can get pretty ugly. It returns NULL but sets no exception if StopIteration is encountered, but it returns NULL and sets an exception that should be checked with PyErr_Occurred() if any other exception occurs. I believe this is faster than doing try - except StopIteration though.

I haven't considered outer joins yet, but maybe you can use something from here. It would be okay for join to be a function that returned one of _join_inner, _join_outer, _join_left_outer, _join_right_outer if this makes the code easier to understand and develop. This would add more code by violating the DRY principle, so use your best judgement. I've done this before in cases where two simple things is preferable to one complicated thing.

@eriknw
Copy link
Member

eriknw commented Jun 6, 2014

Hmm, the version that iterates over self.matches with PyIter_Next isn't the best example except for educational purposes. It's clearer (and I believe faster) to use an iterating index, such as:

    def __next__(self):
        cdef PyObject *obj = NULL
        if self.i == len(self.matches):
            while obj is NULL:
                self.right = next(self.rightseq)
                key = self.rightkey(self.right)
                obj = PyDict_GetItem(self.d, key)
            self.matches = <object>obj
            self.i = 0
        match = <object>PyList_GET_ITEM(self.matches, self.i)  # skip error checking
        self.i += 1
        return (match, self.right)

This is beginning to split hairs.

@eriknw
Copy link
Member

eriknw commented Jun 6, 2014

A right outer join is also pretty simple:

    def __next__(self):
        cdef PyObject *obj
        if self.i == len(self.matches):
            self.right = next(self.rightseq)
            key = self.rightkey(self.right)
            obj = PyDict_GetItem(self.d, key)
            if obj is NULL:
                return (self.left_default, self.right)
            self.matches = <object>obj
            self.i = 0
        match = <object>PyList_GET_ITEM(self.matches, self.i)  # skip error checking
        self.i += 1
        return (match, self.right)

Left outer join and full outer join are the more involved cases, and it gets pretty complicated when any of the joins are merged into a single function.

Having the joins defined separately may be simpler to understand, because inner join and right outer join are so much easier to understand on their own. Some code repetition may be avoided by using the same initialization routine that is defined in a parent class. Subclassing from a single C extension class is straightforward.

Heh, I hope you're not souring on Cython, @mrocklin. You always seem to tackle ugly, difficult cases!

key, matches = next(self.d_items)
while(key in self.seen_keys and matches):
key, matches = next(self.d_items)
self.key = key
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self.key appears to be unnecessary.

I'm not sure that this is worth it
@mrocklin
Copy link
Member Author

mrocklin commented Jun 8, 2014

OK, I've pushed up a version with different _*join clases. I'm not confident that the left outer joins can be written concisely and so I'm tempted to merge a couple of these. In particular, I expect that optimizing join will be pretty expensive from a development standpoint..

@mrocklin
Copy link
Member Author

mrocklin commented Jun 8, 2014

I've change my old join to match the full outer join and then put in a few checks to make it cover all cases. We now have two viable options, splitting the joins by type and having one mega-join.

Thoughts welcome.

@mrocklin
Copy link
Member Author

mrocklin commented Jun 8, 2014

Have you ever done line profiling in Cython? I'm curious how much overhead the if checks take up. Also generally interested where the bottlenecks are.

@eriknw
Copy link
Member

eriknw commented Jun 8, 2014

I have not done line profiling in Cython yet. Doing so would have added too much time to the initial development of cytoolz, and I have not yet done a revision iteration for which I expect I will do line profiling. If you do it, let me know how it goes!

The thing I don't like about the mega-join version is its use of recursion for inner joins and left outer joins. The maximum recursion depth can be exceeded when there are many consecutive keys from rightseq that do not match any of the keys from leftseq. Such a pathological case may be common for some types of data. It is primarily for this reason that I prefer splitting the joins by type. It was great to see both options, though, so thanks!

I'm not too concerned about optimizing join from a development standpoint when the joins are separate for two reasons:

  1. I don't think there is much room for optimization
  2. The different cases can be optimized independently


cdef class _inner_join(_join):
def __iter__(self):
self.matches = ()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this necessary?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't. Removed. It was old cruft that hung around.

@mrocklin
Copy link
Member Author

mrocklin commented Jun 9, 2014

I don't think there is much room for optimization

One big case will be when leftkey or rightkey are given as non-callables, implying that there will be lots of GetItem going on. Managing code complexity in this case might be challenging, especially when we also have multiple solutions for inner/outer

@mrocklin
Copy link
Member Author

In [1]: from toolz.curried import *

In [2]: import toolz

In [3]: import cytoolz

In [4]: small = [(i, str(i)) for i in range(100)] * 10

In [5]: big = pipe([110]*10000, map(range), concat, list)

In [6]: from cytoolz.itertoolz import _consume

In [7]: timeit _consume(toolz.join(0, small, identity, big))
1 loops, best of 3: 1.12 s per loop

In [8]: timeit _consume(cytoolz.itertoolz.join(0, small, identity, big))
1 loops, best of 3: 521 ms per loop

First, hurrah, we got the magical 2x speedup. Second, why can't I import join from cytoolz directly?

@eriknw
Copy link
Member

eriknw commented Jun 10, 2014

First, hurrah, we got the magical 2x speedup.

Cool, and this is with the mega-join version, right?

Second, why can't I import join from cytoolz directly?

Add it to __all__ in itertoolz.pyx of course!

@eriknw
Copy link
Member

eriknw commented Jun 10, 2014

I don't think there is much room for optimization

One big case will be when leftkey or rightkey are given as non-callables, implying that there will be lots of GetItem going on. Managing code complexity in this case might be challenging, especially when we also have multiple solutions for inner/outer

I've been considering how to best do this both in regards to the mega version and the split version. The mega version would probably just use if-else statements which may or may not be in a convenience cdef method. I think the split version can be done very efficiently with additional subclasses (of course, the mega version could too). I'll post a proof of concept class hierarchy that should be nearly optimal and have minimal impact on code duplication. Plus, the performance improvement should be well worth it (especially in functions like groupby).

@mrocklin
Copy link
Member Author

It might be wise to try this out first on groupby just to ensure that getter functions do impose a sizable cost and so are worth optimizing out.

@mrocklin
Copy link
Member Author

BTW, what's your schedule like these days? I'm curious how aggressively I need to budget my time here before SciPy.

@eriknw
Copy link
Member

eriknw commented Jun 10, 2014

Agreed. Three subclasses--one for callable key, one for single index, and one for list of indices--will be needed for each function. This does not appeal to our Pythonic sensibilities. It is more boilerplate than we typically like to have, and the result for join will be 12 more subclasses.

Here is the approach I'm considering:

cdef class _join:
    cdef object _rightkey
    ...


cdef class _right_outer_join(_join):
    ...


cdef class _right_outer_join_key(_right_outer_join):
    cdef object rightkey

    # This must match the signature of `__cinit__` in `_join`
    def __cinit__(self,
                  object leftkey, object leftseq,
                  object rightkey, object rightseq,
                  object left_default=no_default,
                  object right_default=no_default):
        self.rightkey = rightkey


cdef class _right_outer_join_index(_right_outer_join):
    cdef inline object rightkey(self, object item):
        return item[self._rightkey]

cdef class _right_outer_join_indices(_right_outer_join):
    cdef object rightkey(self, object item): 
        # make tuple
        ...

While not exactly pretty, it does achieve efficiency while avoiding duplicating most code. Hmm, I should make sure this works. _right_outer_join may not like trying to call self.rightkey if it doesn't exist even though it will exist in a subclass.

@mrocklin
Copy link
Member Author

A quick groupby profile. We see that getting items with a lambda takes around 20% in the toolz case, 35% in the cytoolz case.

In [1]: from toolz import groupby

In [2]: data = [(i, i % 10) for i in range(1000000)]

In [3]: prun -s cumulative groupby(lambda x: x[1], data)
         2000014 function calls in 0.489 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.015    0.015    0.489    0.489 <string>:1(<module>)
        1    0.300    0.300    0.475    0.475 itertoolz.py:56(groupby)
  1000000    0.113    0.000    0.113    0.000 <string>:1(<lambda>)
  1000000    0.061    0.000    0.061    0.000 {method 'append' of 'list' objects}
       10    0.000    0.000    0.000    0.000 itertoolz.py:81(<lambda>)
        1    0.000    0.000    0.000    0.000 {callable}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

In [4]: from cytoolz import groupby

In [5]: prun -s cumulative groupby(lambda x: x[1], data)
         1000003 function calls in 0.284 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.015    0.015    0.284    0.284 <string>:1(<module>)
        1    0.168    0.168    0.269    0.269 {cytoolz.itertoolz.groupby}
  1000000    0.101    0.000    0.101    0.000 <string>:1(<lambda>)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

@mrocklin
Copy link
Member Author

Oh, I didn't know that cytoolz supported function inlining. That changes things. Do we need separate classes? Any chance we can move functions around dynamically at runtime? :)

@mrocklin
Copy link
Member Author

Wait, no, I take it back. Significantly more gains seem possible. These results break my intuition gained above. I'm now less confident in my ability to infer from profiling results.

In [9]: prun -s cumulative groupby(lambda x: x[1], data)
         1000003 function calls in 0.290 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.015    0.015    0.290    0.290 <string>:1(<module>)
        1    0.171    0.171    0.276    0.276 {cytoolz.itertoolz.groupby}
  1000000    0.104    0.000    0.104    0.000 <string>:1(<lambda>)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

In [10]: prun -s cumulative groupby(itemgetter(1), data)
         3 function calls in 0.099 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.015    0.015    0.099    0.099 <string>:1(<module>)
        1    0.084    0.084    0.084    0.084 {cytoolz.itertoolz.groupby}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

@eriknw
Copy link
Member

eriknw commented Jun 10, 2014

BTW, what's your schedule like these days? I'm curious how aggressively I need to budget my time here before SciPy.

Pretty flexible. I completely overdid it over Labor Day weekend: I overloaded my allergies, which had me out of commission for nearly two weeks (note to self: don't go on long bike rides when farmers are plowing fields!). I still have a lingering cough, but at least I can sleep now. Anyway, I've been catching up on other things, and life is returning to normal.

So. Schedule. benchtoolz needs about 30 minutes of work before I can push it. I've been sitting on it far longer than I intended. Although I'm busy tomorrow, I believe we can push for a new release of toolz and cytoolz relatively soon, and definitely before the SciPy conference. I will be around to help with this and do much of the work on cytoolz myself. toolz mostly needs Python 3.4 support and your streaming analytics doc page added.

My talk on functional programming in Python with PyToolz was accepted for PyOhio. This is in 6-7 weeks, and I will begin drafting it soon. Heh, I also plan to give a Lightning Talk on benchtoolz, and I may even pitch it for a sprint.

In other words, I'll be more active than I have been in recent weeks, and we can push to develop a version of join that we're both happy with.

@mrocklin
Copy link
Member Author

Sorry to hear about the allergies. I had a mercifully brief incident earlier this week. It's alarming how bad life gets when swallowing becomes painful.

Great about the PyOhio talk! Let me know if I can help in any way.

@mrocklin
Copy link
Member Author

and we can push to develop a version of join that we're both happy with.

Most of my concern about breaking out into multiple implementations was that I didn't have inlining in mind, and so was picturing a dozen parallel implementations of __next__. The way you've outlined it seems like a good way to go.

@eriknw
Copy link
Member

eriknw commented Jun 10, 2014

Interesting profiling. How do timeit results compare?

Oh, I didn't know that cytoolz supported function inlining. That changes things. Do we need separate classes? Any chance we can move functions around dynamically at runtime? :)

C functions are not first class citizens, although pointers to C functions should work :-P

@mrocklin
Copy link
Member Author

See #33 for further benchmark discussion

@eriknw
Copy link
Member

eriknw commented Jun 10, 2014

Most of my concern about breaking out into multiple implementations was that I didn't have inlining in mind, and so was picturing a dozen parallel implementations of next. The way you've outlined it seems like a good way to go.

Oh, right, we should simply use cdef inline object rightkey(self, item) for all the variations, so ignore the extra cruft in _right_outer_join_key above. I have tested this approach with a toy class hierarchy, and it appears to work. Heh, it's too bad Cython doesn't support mixins of extension classes.

Great about the PyOhio talk! Let me know if I can help in any way.

Thanks! I'll definitely post the slides for review long before the conference. I'm wondering how much I should use (not copy/paste) from existing documentation and presentations. The examples from them are great. I think first I will need to develop the story arc for my presentation, then thread in suitable content from existing material. I want this talk to be my own, but it would be stupid to ignore available resources.

@mrocklin
Copy link
Member Author

@eriknw thoughts on merging this (once I handle conflicts) without adding the more efficient bits? I'd like to publish the streaming analytics blogpost soon and might not have time soon to take care of the inlined solution here.

@eriknw
Copy link
Member

eriknw commented Jun 24, 2014

Merging a working version soon sounds great, and maybe I can explore the inline version once other issues are addressed.

How soon do you want to publish the blogpost? I think we should push for a release of *toolz just before the blogpost, and both should preferably come before the SciPy conference.

cytoolz has almost caught up with toolz. Other than join (and the sandbox), I think the only thing missing is using indices for keys.

The two things I would like toolz to get before the next release are Python 3.4 support and the streaming analytics documentation. It would be nice to add the "Logical Operators" PR too, but this isn't a must-have.

@eriknw
Copy link
Member

eriknw commented Jun 24, 2014

Also, you may want to wait until I merge #38 before resolving conflicts.

Conflicts:
	cytoolz/itertoolz.pyx
	cytoolz/tests/test_itertoolz.py
@mrocklin
Copy link
Member Author

Conflicts resolved.

@eriknw
Copy link
Member

eriknw commented Jun 28, 2014

Excellent. Merging so we can move onto other things. We can explore the more efficient approach later.

eriknw added a commit that referenced this pull request Jun 28, 2014
@eriknw eriknw merged commit 993c3c6 into pytoolz:master Jun 28, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants