executemany has an option for 'bulk'-insert on Impala #96

Open. Wants to merge 3 commits into base: master.

nonsleepr
Contributor

Impala supports multi-row inserts. This pull request adds an option to use this feature.

@laserson
Contributor

I'm not sure I understand what this does. Could you explain it in a bit more detail?

@nonsleepr
Contributor Author

The idea here is to INSERT data in big chunks instead of doing it row-by-row.

@nonsleepr nonsleepr closed this Jul 28, 2015
@nonsleepr nonsleepr deleted the executemany branch July 28, 2015 16:20
@nonsleepr nonsleepr restored the executemany branch July 28, 2015 16:22
@nonsleepr nonsleepr reopened this Jul 28, 2015
@laserson
Contributor

Ah, I was unfamiliar with executemany's use as a method to insert rows. Is this something you're doing often with your Impala cluster?

That said, I'm generally uneasy about having impyla rewrite people's queries, and anyway, using an INSERT statement is generally discouraged, as opposed to writing files to HDFS and registering the tables with CREATE statements. executemany is only really implemented to comply with PEP 249. My preference would be for you to send data to Impala using Ibis. Thoughts?

@nonsleepr
Contributor Author

In my project, I'm using impyla as one of several database drivers that can be accessed via DB API 2.0. Users should be able to upload/insert small datasets/tables (probably 100 rows max) into the database. Right now this produces hundreds of files in HDFS, whereas this PR makes it possible to avoid that.

According to PEP 249:

.executemany(...)
...
Modules are free to implement this method using multiple calls to the .execute() method or by using array operations to have the database process the sequence as a whole in one call.
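
The two strategies PEP 249 allows can be sketched side by side. This is only an illustration, not the code in this pull request: the `executemany` helper and its `bulk` flag are hypothetical names, the stdlib `sqlite3` module stands in for an Impala connection, and positional `?` placeholders are assumed. The pattern is the `RE_INSERT_VALUES` regex from this PR's diff.

```python
import re
import sqlite3

# Same pattern as RE_INSERT_VALUES in this pull request's diff.
RE_INSERT_VALUES = re.compile(
    r"(.*\binsert\b.*\bvalues\b\s*)(\(.*\))\s*;?\s*", flags=re.IGNORECASE)


def executemany(cursor, operation, seq_of_parameters, bulk=False):
    """Hypothetical helper; `bulk` is an illustrative name, not impyla's API.

    Returns the number of .execute() calls that were issued.
    """
    rows = [tuple(row) for row in seq_of_parameters]
    match = RE_INSERT_VALUES.fullmatch(operation) if bulk else None
    if match is None or not rows:
        # Default path: one .execute() call per parameter row.
        for row in rows:
            cursor.execute(operation, row)
        return len(rows)
    # Bulk path: repeat the VALUES tuple once per row and flatten the parameters.
    prefix, values = match.groups()
    statement = prefix + ", ".join([values] * len(rows))
    cursor.execute(statement, [value for row in rows for value in row])
    return 1


# sqlite3 is used here only as a stand-in DB API 2.0 driver.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t (id INTEGER, name TEXT)")
data = [(1, "a"), (2, "b"), (3, "c")]
calls_default = executemany(cur, "INSERT INTO t VALUES (?, ?)", data)
calls_bulk = executemany(cur, "INSERT INTO t VALUES (?, ?)", data, bulk=True)
row_count = cur.execute("SELECT COUNT(*) FROM t").fetchone()[0]
```

With `bulk=False` the helper behaves like a plain loop over `.execute()`, so existing callers would see no change. Statements that the pattern does not match (for example, ones spanning several lines) fall back to the row-by-row path.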

The Impala docs also give the following recommendation on INSERT ... VALUES:

If you do run INSERT ... VALUES operations to load data into a staging table as one stage in an ETL pipeline, include multiple row values if possible within each VALUES clause, and use a separate database to make cleanup easier if the operation does produce many tiny files.
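
One way to follow that advice on the client side is to batch rows before building each VALUES clause. The sketch below is an assumption-laden illustration: `chunked` is a hypothetical helper and the chunk size of 40 is arbitrary, not a value recommended by Impala or by this PR.

```python
from itertools import islice


def chunked(rows, size):
    """Yield successive lists of at most `size` rows (hypothetical helper)."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


rows = [(i, "name-%d" % i) for i in range(100)]
# Row-by-row: one INSERT statement per row.
statements_row_by_row = len(rows)
# Chunked: one INSERT statement per chunk of up to 40 rows.
statements_chunked = sum(1 for _ in chunked(rows, 40))
```

Each chunk would then feed a single multi-row INSERT, so the number of statements drops from one per row to one per chunk.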

I hadn't had a chance to look at ibis until now. Interestingly, it uses impyla and hdfs (which my project is based on) and a bunch of other packages underneath. One of the goals of my project is to keep it lean and preferably pure (which is what #91 is for).

@laserson
Contributor

Ok, as long as the default is the same, shouldn't be a problem. I'll make some additional comments for changes as well.

@@ -25,6 +25,8 @@
IntegrityError, DataError, NotSupportedError)


RE_INSERT_VALUES = re.compile(r"(.*\binsert\b.*\bvalues\b\s*)(\(.*\))\s*;?\s*", flags=re.IGNORECASE)

Contributor

Nit: insert addl line per PEP8

@Gatsby-Lee

@laserson Can you merge this change into master?

@laserson
Contributor

I'm no longer involved with this project. Try a more recent committer.

@timarmstrong
Contributor

Sorry this got neglected.

This is interesting but I'm unsure if the interface is quite right. Is there a reason not to do the rewrite transparently?

@darklord1807

Hi, I achieved the same thing with #460, and it is working well.

Using

allToBeInserted.to_sql('xx', engine, if_exists='append', index=False,chunksize=2000,method='multi')
