executemany has an option for 'bulk'-insert on Impala #96

nonsleepr · 2015-07-25T03:26:03Z

Impala supports multi-row inserts. This pull request adds an option to use this feature.

laserson · 2015-07-27T01:37:36Z

I'm not sure I understand what this does. Could you explain it in a bit more detail?

nonsleepr · 2015-07-27T13:56:28Z

The idea here is to INSERT data in big chunks instead of doing it row-by-row.
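As a rough illustration of "big chunks", the sketch below splits a sequence of rows into fixed-size batches, so that each batch could be sent as one statement. The helper name `chunked` and the batch size are assumptions for illustration; they are not part of this PR or of impyla.

```python
from itertools import islice


def chunked(rows, size):
    """Yield successive lists of at most `size` rows (hypothetical helper)."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Five rows split into batches of two: the last batch is shorter.
batches = list(chunked([(1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e")], 2))
```

Each batch would then become a single INSERT rather than one INSERT per row.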

laserson · 2015-07-28T18:27:23Z

Ah, I was unfamiliar with executemany's use as a method to insert rows. Is this something you're doing often with your Impala cluster?

That said, I'm generally uneasy about having impyla rewrite people's queries, and anyway, using an INSERT statement is generally discouraged, as opposed to writing files to HDFS and registering the tables with CREATE statements. executemany is only really implemented to comply with PEP 249. My preference would be for you to send data to Impala using Ibis. Thoughts?

nonsleepr · 2015-07-29T15:12:36Z

In my project, I'm using impyla as one of several database drivers that can be accessed via DB API 2.0. Users should be able to upload/insert small datasets/tables (probably 100 rows max) into the database. Right now this produces hundreds of files in HDFS, while this PR makes it possible to avoid that.

According to PEP 249:

.executemany(...)
...
Modules are free to implement this method using multiple calls to the .execute() method or by using array operations to have the database process the sequence as a whole in one call.

The Impala docs also give the following recommendation on INSERT...VALUES:

If you do run INSERT ... VALUES operations to load data into a staging table as one stage in an ETL pipeline, include multiple row values if possible within each VALUES clause, and use a separate database to make cleanup easier if the operation does produce many tiny files.
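To make the two strategies PEP 249 allows concrete, here is a minimal sketch that inserts the same rows once per row and once as a single multi-row VALUES statement. It uses the standard-library `sqlite3` module purely as a stand-in for Impala (an assumption made so the example is runnable anywhere); the table name `staging` and both function names are hypothetical.

```python
import sqlite3


def insert_row_by_row(cur, rows):
    """One execute() call per row; returns the number of statements sent."""
    for row in rows:
        cur.execute("INSERT INTO staging (id, name) VALUES (?, ?)", row)
    return len(rows)


def insert_multi_row(cur, rows):
    """One execute() call carrying all rows in a single VALUES clause."""
    if not rows:
        return 0
    placeholders = ", ".join(["(?, ?)"] * len(rows))
    flat_params = [value for row in rows for value in row]
    cur.execute("INSERT INTO staging (id, name) VALUES " + placeholders,
                flat_params)
    return 1


rows = [(1, "a"), (2, "b"), (3, "c")]
conn = sqlite3.connect(":memory:")  # stand-in database, not Impala
cur = conn.cursor()
cur.execute("CREATE TABLE staging (id INTEGER, name TEXT)")

calls_loop = insert_row_by_row(cur, rows)
loop_result = cur.execute("SELECT id, name FROM staging ORDER BY id").fetchall()

cur.execute("DELETE FROM staging")
calls_bulk = insert_multi_row(cur, rows)
bulk_result = cur.execute("SELECT id, name FROM staging ORDER BY id").fetchall()
conn.close()
```

Both paths leave the table with the same contents; the difference is the number of statements sent, which on Impala corresponds to the number of small files written.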

I didn't have a chance to look at ibis until now. Interestingly, it uses impyla and hdfs (which my project is based on) and a bunch of other packages underneath. One of the goals of my project is to keep it lean and preferably pure (which is what #91 is for).

laserson · 2015-07-29T20:31:56Z

Ok, as long as the default is the same, shouldn't be a problem. I'll make some additional comments for changes as well.

laserson · 2015-07-29T20:34:11Z

impala/dbapi/interface.py

@@ -25,6 +25,8 @@
                          IntegrityError, DataError, NotSupportedError)


+RE_INSERT_VALUES = re.compile(r"(.*\binsert\b.*\bvalues\b\s*)(\(.*\))\s*;?\s*", flags=re.IGNORECASE)
+


Nit: insert an additional blank line per PEP 8
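For readers following the diff, the sketch below shows one way the `RE_INSERT_VALUES` pattern could be used to turn a single-row INSERT template into a multi-row one. Only the regular expression is taken from the diff; the function `build_bulk_insert`, its return shape, and the flattening of parameters are assumptions for illustration and may differ from what the PR actually does.

```python
import re

# Pattern copied from the diff above.
RE_INSERT_VALUES = re.compile(
    r"(.*\binsert\b.*\bvalues\b\s*)(\(.*\))\s*;?\s*", flags=re.IGNORECASE)


def build_bulk_insert(operation, seq_of_parameters):
    """Return (statement, flat_params) for a multi-row INSERT, or None.

    Hypothetical helper: None means the operation did not look like
    INSERT ... VALUES (...) and the caller should fall back to per-row
    execution.
    """
    rows = list(seq_of_parameters)
    match = RE_INSERT_VALUES.match(operation)
    if match is None or not rows:
        return None
    prefix, values_template = match.groups()
    # Repeat the "(%s, %s)" template once per row and flatten the parameters.
    statement = prefix + ", ".join([values_template] * len(rows))
    flat_params = [value for row in rows for value in row]
    return statement, flat_params


result = build_bulk_insert("insert into t (a, b) values (%s, %s);",
                           [(1, "x"), (2, "y")])
```

Note that `.` does not match newlines without `re.DOTALL`, so under this sketch a statement spread over several lines would not match and would take the per-row path.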

Gatsby-Lee · 2018-05-21T18:43:53Z

@laserson Can you merge this change into master?

laserson · 2018-05-21T19:36:00Z

I'm no longer involved with this project. Try a more recent committer.

timarmstrong · 2019-03-07T21:09:40Z

Sorry this got neglected.

This is interesting but I'm unsure if the interface is quite right. Is there a reason not to do the rewrite transparently?

darklord1807 · 2021-07-21T19:58:48Z

Hi, I achieved the same thing via #460, and it is working well.

Using

allToBeInserted.to_sql('xx', engine, if_exists='append', index=False, chunksize=2000, method='multi')

executemany has an option for 'bulk'-insert on Impala

1df4213

Comments to make intentions clearer

7985c27

nonsleepr closed this Jul 28, 2015

nonsleepr deleted the executemany branch July 28, 2015 16:20

nonsleepr restored the executemany branch July 28, 2015 16:22

nonsleepr reopened this Jul 28, 2015

laserson reviewed Jul 29, 2015

Rename argument; add docstring; clarify logic

6adc689

IamGianluca mentioned this pull request Sep 12, 2016

Best practice when inserting data to existing table via impyla #213

Open
