Performance benchmarks #33

mrocklin · 2014-06-10T03:59:57Z

I've been putting performance comparisons in #31. I felt they should go somewhere more permanent.

mrocklin · 2014-06-10T04:00:42Z

Join

In [1]: from cytoolz.curried import *
In [2]: from cytoolz.itertoolz import _consume, join
In [3]: data = [(i, i % 10) for i in range(1000000)]
In [4]: names = [(1, 'one'), (2, 'two'), (3, 'three'), (4, 'four'), (5, 'five'), (6, 'six'), (7, 'seven'), (8, 'eight'), (9, 'nine'), (0, 'zero')]

In [5]: timeit _consume(join(0, names, 1, data))
10 loops, best of 3: 168 ms per loop

In [6]: import pandas
In [7]: df = pandas.DataFrame(data, columns=['a', 'b'])
In [8]: names_df = pandas.DataFrame(names, columns=['b', 'name'])

In [9]: timeit pandas.merge(names_df, df, left_on='b', right_on='b')
10 loops, best of 3: 86.8 ms per loop
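
For readers unfamiliar with what `join(0, names, 1, data)` computes, here is a rough pure-Python sketch of an inner join on positional keys. It assumes `join` behaves like a hash join that indexes the left sequence and streams the right one, yielding `(left, right)` pairs. `hash_join` is a hypothetical name for illustration, not the cytoolz implementation, and it will be much slower than the Cython version.

```python
from collections import defaultdict


def hash_join(leftkey, leftseq, rightkey, rightseq):
    """Inner join of two sequences of tuples on positional keys.

    Illustrative sketch only: index the left sequence in a dict, then
    stream the right sequence past it, yielding (left, right) pairs.
    """
    index = defaultdict(list)
    for item in leftseq:
        index[item[leftkey]].append(item)
    for item in rightseq:
        # .get avoids creating empty entries for keys with no match
        for match in index.get(item[rightkey], ()):
            yield (match, item)


# Small stand-ins for the `names` and `data` used in the benchmark
names = [(1, 'one'), (2, 'two'), (0, 'zero')]
data = [(i, i % 3) for i in range(6)]

result = list(hash_join(0, names, 1, data))
```

Each element of `result` pairs a row from `names` with a row from `data` whose second field matches, which is the same shape of work that `pandas.merge(..., left_on='b', right_on='b')` does on the two frames.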

mrocklin · 2014-06-10T04:02:02Z

Getter function in `groupby`

How much does choosing the wrong getter function cost?

In [9]: prun -s cumulative groupby(lambda x: x[1], data)
         1000003 function calls in 0.290 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.015    0.015    0.290    0.290 <string>:1(<module>)
        1    0.171    0.171    0.276    0.276 {cytoolz.itertoolz.groupby}
  1000000    0.104    0.000    0.104    0.000 <string>:1(<lambda>)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

In [10]: prun -s cumulative groupby(itemgetter(1), data)
         3 function calls in 0.099 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.015    0.015    0.099    0.099 <string>:1(<module>)
        1    0.084    0.084    0.084    0.084 {cytoolz.itertoolz.groupby}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
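
One thing worth keeping in mind when reading the two profiles above: the profiler records every call to the lambda (1000000 of them) but none of the calls to `itemgetter(1)`, so the per-call tracing overhead lands only on the lambda version. A minimal stdlib-only sketch of that effect, using a plain loop instead of `groupby` (the helper names `apply_key` and `profiled_calls` are made up for illustration):

```python
import cProfile
import pstats
from operator import itemgetter


def apply_key(key, seq):
    """Call key on every element, as groupby does internally."""
    for item in seq:
        key(item)


def profiled_calls(key, seq):
    """Return how many function calls cProfile recorded for apply_key."""
    prof = cProfile.Profile()
    prof.runcall(apply_key, key, seq)
    return pstats.Stats(prof).total_calls


data = [(i, i % 10) for i in range(10000)]

lambda_calls = profiled_calls(lambda x: x[1], data)
getter_calls = profiled_calls(itemgetter(1), data)

print('calls recorded with lambda:    ', lambda_calls)
print('calls recorded with itemgetter:', getter_calls)
```

The lambda run should report at least one recorded call per element, while the `itemgetter` run reports only a handful, so cumulative times from the profiler are not a fair comparison between the two.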

mrocklin · 2014-06-10T04:08:35Z

Looks like the profiler doesn't work well for these sorts of tasks; timeit shows a smaller gap between the lambda and itemgetter:

In [11]: timeit groupby(lambda x: x[1], data)
10 loops, best of 3: 141 ms per loop

In [12]: timeit groupby(get(1), data)
10 loops, best of 3: 135 ms per loop

In [13]: timeit groupby(itemgetter(1), data)
10 loops, best of 3: 70.8 ms per loop
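
To try the same comparison without cytoolz installed, something like the following should work with only the standard library. `simple_groupby` is a hypothetical pure-Python stand-in, not the cytoolz function, so the absolute numbers will differ from the ones above and the relative gap may too.

```python
import timeit
from collections import defaultdict
from operator import itemgetter


def simple_groupby(key, seq):
    """Group the items of seq into a dict of lists keyed by key(item)."""
    groups = defaultdict(list)
    for item in seq:
        groups[key(item)].append(item)
    return dict(groups)


data = [(i, i % 10) for i in range(100000)]

# Both getters must produce identical groupings before timing them
by_lambda = simple_groupby(lambda x: x[1], data)
by_getter = simple_groupby(itemgetter(1), data)

getters = [('lambda', lambda x: x[1]), ('itemgetter', itemgetter(1))]
for label, key in getters:
    # Best of 3 single runs, mirroring IPython's "best of 3" report
    best = min(timeit.repeat(lambda: simple_groupby(key, data),
                             number=1, repeat=3))
    print('%-10s %.1f ms' % (label, best * 1000))
```

Timings are deliberately not shown here since they depend on the machine and Python version.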

eriknw · 2014-07-03T20:16:17Z

I ran the join benchmarks that you posted above using cytoolz 0.7.0. Note that groupby and join in cytoolz have both become faster for index keys since you first posted your results.

In [12]: timeit consume(join(0, names, 1, data))
10 loops, best of 3: 163 ms per loop

In [13]: timeit pandas.merge(names_df, df, left_on='b', right_on='b')
1 loops, best of 3: 197 ms per loop

mrocklin · 2014-07-03T20:21:44Z

The performance difference is even larger on my machine. I couldn't get pandas to go noticeably faster by playing with indices.

Nice work.

eriknw · 2014-07-03T20:52:05Z

I really like functionality comparisons between toolz and pandas. I think they are a good way to get people interested in toolz, and actually using it (even though we still love pandas). Comparing performance as well is a bonus, and the results may surprise some people.

mrocklin mentioned this issue on Jun 10, 2014: add join #31 (merged)
