Performance benchmarks #33

Open
mrocklin opened this issue Jun 10, 2014 · 6 comments

@mrocklin

I've been putting performance comparisons in #31. I felt they should go somewhere more permanent.

@mrocklin

Join

In [1]: from cytoolz.curried import *
In [2]: from cytoolz.itertoolz import _consume, join
In [3]: data = [(i, i % 10) for i in range(1000000)]
In [4]: names = [(1, 'one'), (2, 'two'), (3, 'three'), (4, 'four'), (5, 'five'), (6, 'six'), (7, 'seven'), (8, 'eight'), (9, 'nine'), (0, 'zero')]

In [5]: timeit _consume(join(0, names, 1, data))
10 loops, best of 3: 168 ms per loop

In [6]: import pandas
In [7]: df = pandas.DataFrame(data, columns=['a', 'b'])
In [8]: names_df = pandas.DataFrame(names, columns=['b', 'name'])

In [9]: timeit pandas.merge(names_df, df, left_on='b', right_on='b')
10 loops, best of 3: 86.8 ms per loop
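
For readers unfamiliar with the call above, here is a minimal pure-Python sketch of what `join(0, names, 1, data)` computes: a hash join that indexes the left sequence by its key and then streams the right sequence past that index. The function name `hash_join` is hypothetical and this is not the cytoolz implementation (which is written in Cython), so the timings above do not apply to it.

```python
from collections import defaultdict
from operator import itemgetter


def hash_join(leftkey, leftseq, rightkey, rightseq):
    """Yield (left_item, right_item) pairs whose keys match.

    Hypothetical illustration only; not the cytoolz code.
    """
    get_left = itemgetter(leftkey)
    get_right = itemgetter(rightkey)

    # Index the (small) left sequence by its key.
    index = defaultdict(list)
    for item in leftseq:
        index[get_left(item)].append(item)

    # Stream the (large) right sequence and look up matches.
    for item in rightseq:
        for match in index.get(get_right(item), ()):
            yield (match, item)


# Scaled-down version of the data used in the benchmark above.
names = [(1, 'one'), (2, 'two'), (0, 'zero')]
data = [(i, i % 3) for i in range(6)]
pairs = list(hash_join(0, names, 1, data))
print(pairs[0])
```

Only the left sequence is held in memory, which is why the small `names` list is passed first in the benchmark.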

@mrocklin

Getter function in groupby

How much does it cost to choose the wrong key function?

In [9]: prun -s cumulative groupby(lambda x: x[1], data)
         1000003 function calls in 0.290 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.015    0.015    0.290    0.290 <string>:1(<module>)
        1    0.171    0.171    0.276    0.276 {cytoolz.itertoolz.groupby}
  1000000    0.104    0.000    0.104    0.000 <string>:1(<lambda>)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

In [10]: prun -s cumulative groupby(itemgetter(1), data)
         3 function calls in 0.099 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.015    0.015    0.099    0.099 <string>:1(<module>)
        1    0.084    0.084    0.084    0.084 {cytoolz.itertoolz.groupby}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

@mrocklin

Looks like the profiler doesn't work well for these sorts of tasks, so here are the same comparisons with timeit:

In [11]: timeit groupby(lambda x: x[1], data)
10 loops, best of 3: 141 ms per loop

In [12]: timeit groupby(get(1), data)
10 loops, best of 3: 135 ms per loop

In [13]: timeit groupby(itemgetter(1), data)
10 loops, best of 3: 70.8 ms per loop
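
The same comparison can be reproduced outside IPython with the standard library's `timeit` module. The sketch below uses a plain-Python `groupby` written for illustration (an assumption; it is not the cytoolz version), so absolute numbers will not match those above and the gap between key functions may be smaller.

```python
import timeit
from collections import defaultdict
from operator import itemgetter


def groupby(key, seq):
    """Group items of seq into a dict of lists keyed by key(item)."""
    groups = defaultdict(list)
    for item in seq:
        groups[key(item)].append(item)
    return dict(groups)


data = [(i, i % 10) for i in range(100000)]

# Both key functions must produce identical groupings.
by_lambda = groupby(lambda x: x[1], data)
by_itemgetter = groupby(itemgetter(1), data)

# Time each key function; take the best of three single runs.
for label, key in [("lambda", lambda x: x[1]), ("itemgetter", itemgetter(1))]:
    best = min(timeit.repeat(lambda: groupby(key, data), number=1, repeat=3))
    print(f"{label:>10}: {best * 1000:.1f} ms")
```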

@eriknw

eriknw commented Jul 3, 2014

I ran the join benchmarks that you posted above using cytoolz 0.7.0. Note that groupby and join in cytoolz have both become faster for index keys since you first posted your results.

In [12]: timeit consume(join(0, names, 1, data))
10 loops, best of 3: 163 ms per loop

In [13]: timeit pandas.merge(names_df, df, left_on='b', right_on='b')
1 loops, best of 3: 197 ms per loop

@mrocklin

mrocklin commented Jul 3, 2014

The performance difference is even more pronounced on my machine. I wasn't able to make pandas noticeably faster by playing with indices.

Nice work.

@eriknw

eriknw commented Jul 3, 2014

I really like functionality comparisons between toolz and pandas. I think it is a good way to get people interested in--and to actually use--toolz (even though we still love pandas). Comparing performance as well is a bonus, which may surprise some people.
