
Let more encodings through and allow registration of encodings #176

Closed
wants to merge 8 commits

Conversation

@cpcloud (Member) commented Sep 1, 2015

closes #173

@cpcloud cpcloud self-assigned this Sep 1, 2015
@cpcloud cpcloud added this to the 0.4.7 milestone Sep 1, 2015
@cpcloud (Member, Author) commented Sep 5, 2015

@mwiebe was there any particular reason that encodings were limited to A and the various versions of U* as opposed to allowing an arbitrary string?

@mwiebe (Contributor) commented Sep 5, 2015

The set allowed ascii, utf-8, and some other similar ones. The main reason to limit it to a fairly strict and small set is to keep it easier to support as a standard interface, and to catch errors as early as possible. Think of the history of HTML with IE6, and how liberal it was in accepting very arbitrary input as HTML.

@cpcloud (Member, Author) commented Sep 8, 2015

My concern is that systems don't all spell these encodings the same way, and many support additional encodings that won't conform to the existing set. MySQL is a good example of this. So, instead of special-casing every system, I think it would make sense to accept Python encodings and to give users a way to register their own encodings.
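
To make the proposal concrete, here is a minimal sketch of what such a registry could look like, leaning on Python's codecs module to normalize spellings. None of these names (register_encoding, resolve_encoding, the module-level dicts) exist in datashape; they are purely illustrative.

# Illustrative only: none of these names exist in datashape.
import codecs

# the small canonical set datashape knows about today, plus registered aliases
_canonical = {'A', 'U8', 'U16', 'U32'}
_aliases = {'ascii': 'A', 'utf-8': 'U8', 'utf-16': 'U16', 'utf-32': 'U32'}

def register_encoding(name, canonical=None):
    """Accept `name` as a string encoding, mapped onto a canonical token."""
    if canonical is None:
        # fall back to whatever Python's codec machinery knows about the name
        canonical = codecs.lookup(name).name
    _aliases[name.lower()] = canonical

def resolve_encoding(name):
    """Return the canonical encoding for `name`, or raise as datashape does today."""
    if name in _canonical:
        return name
    try:
        return _aliases[name.lower()]
    except KeyError:
        raise ValueError("Unsupported string encoding %r" % name)

# a MySQL backend could then register its own spellings up front:
register_encoding('utf8mb4_unicode_ci', canonical='U8')

Resolving every alias to a canonical token at registration time would keep the set of encodings that downstream consumers actually see small, even while accepting more spellings at the boundary.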

@mwiebe (Contributor) commented Sep 8, 2015

This sounds like a similar argument to "System X spells 32-bit integers as 'integer32' instead of 'int32', so we should extend datashape to support that." Extending datashape to allow arbitrary encoding strings seems like the opposite of what datashape is for, which is helping systems talk to each other. Isn't the point of datashape to translate different systems' type systems into a common language?
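
For contrast, a rough sketch of that alternative: the translation happens in each backend's adapter rather than in datashape itself. The table and function below are purely illustrative, not actual Blaze or datashape code.

# Illustrative only: a MySQL adapter normalizes its charset names to the
# canonical datashape encodings at the boundary; anything it cannot express
# directly (e.g. latin1) would have to be transcoded by the adapter first.
MYSQL_TO_CANONICAL = {
    'ascii': 'A',
    'utf8': 'U8',
    'utf8mb4': 'U8',
    'utf16': 'U16',
    'utf32': 'U32',
}

def canonical_encoding(mysql_charset):
    try:
        return MYSQL_TO_CANONICAL[mysql_charset.lower()]
    except KeyError:
        raise ValueError('no canonical datashape encoding for charset %r'
                         % mysql_charset)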

@cpcloud (Member, Author) commented Sep 8, 2015

@mwiebe In libdynd/libdynd#417, it seems like the following is inconsistent with what you're saying above:

Other string storage choices can be viewed as the one string type via expression types, which adapt between the representation.

Presumably the thing you're using to do the adapting would need to know what encoding to adapt to. Where would that information be stored? What would the datashape of the conversion expression be?

@mwiebe (Contributor) commented Sep 8, 2015

@cpcloud Any Unicode encoding (except UCS-2) can represent all strings, so choosing a particular one as the default representation for all normal string interaction simplifies the implementation to a single string type that can get all of the usability and performance development focus.

Yes, a string adaptor type would need to know the encoding. It also needs to know whether the string is null-terminated, whether it lives in a fixed-size memory buffer or a variable one, etc., because there are a huge number of string storage approaches out there. All of these details would have to be parameters to the adaptor type somehow. I don't know what the datashape of the conversion expression should be, but however it's spelled, I would expect DyND to use an adaptor type under the hood which converts to/from that string storage for any computation.
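
As a sketch of the parameter space being described here (not DyND's actual adaptor API), something like the following captures the details such an adaptor would have to carry:

# Illustrative only: the kinds of parameters a string storage adaptor needs,
# not DyND's actual type constructor.
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class StringStorage:
    encoding: str               # e.g. 'latin1', 'utf-8', 'utf-16-le'
    null_terminated: bool       # C-style termination vs. an explicit length
    fixed_size: Optional[int]   # byte width of a fixed buffer, or None if variable

    def decode(self, raw: bytes) -> str:
        """Adapt the stored bytes to the one canonical string type."""
        if self.null_terminated:
            raw = raw.split(b'\x00', 1)[0]
        return raw.decode(self.encoding)

# e.g. a fixed CHAR(32) column stored as latin1:
char32 = StringStorage(encoding='latin1', null_terminated=False, fixed_size=32)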

@cpcloud cpcloud modified the milestones: 0.4.7, 0.4.8 Sep 15, 2015
@kwmsmith kwmsmith modified the milestones: 0.5.0, 0.5.2 Feb 2, 2016
@kwmsmith kwmsmith modified the milestones: 0.5.2, 0.5.3 May 5, 2016
@dhirschfeld (Contributor)

Very keen to see this merged, as I'm getting a ValueError when trying to use blaze with a SQL Server database:

ValueError: Unsupported string encoding 'Latin1_General_CI_AS'

@dhirschfeld (Contributor)

Ping! Just linking some issues that demonstrate this is an actual problem people are having.

This still affects me but I've been quiet because I've resorted to monkey-patching my own version of blaze.
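
For anyone hitting the same error, the monkey-patch amounts to teaching datashape about the extra name. A rough sketch is below; it assumes the accepted encodings live in a module-level dict in datashape.coretypes (the attribute name is an assumption and may differ between versions, so check your installed copy), and the canonical token chosen for the collation is a lossy guess.

# Rough workaround sketch -- the attribute name below is an assumption, not a
# documented API; verify it against your installed datashape version.
import datashape.coretypes as ct

extra = {
    # treat the SQL Server collation name as an alias for datashape's ascii
    # encoding; this only papers over the name check, it does not change
    # how the underlying column data is actually encoded
    'Latin1_General_CI_AS': 'A',
}

if hasattr(ct, '_canonical_string_encodings'):
    ct._canonical_string_encodings.update(extra)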

@daefresh commented Sep 1, 2016

Will this be worked on? We really need it...

@skrah (Contributor) commented Sep 1, 2016

I can take a look at this. At first glance the aliases seem to be currently supported by DyND-Datashape.

@dhirschfeld (Contributor)

I think latin1 is such a common encoding that it would make sense to have it be one of the canonical encodings.

I think you'd still need the ability to register encodings, though, as there may be many non-standard names in the wild.

@skrah (Contributor) commented Sep 8, 2016

@mwiebe @dhirschfeld @cpcloud Perhaps the DyND-Datashape syntax for general type constructors could also work for unknown encodings. Example:

>>> ndt.type("Latin1_General_CI_AS[string]")
ndt.type('Latin1_General_CI_AS[string]')
>>> 
>>> ndt.type("Latin1_General_CI_AS[bytes]")
ndt.type('Latin1_General_CI_AS[bytes]')

DyND could treat such strings as an opaque blob tagged with the constructor; Blaze could do more. But they'd share the same (existing) syntax.

@mwiebe (Contributor) commented Sep 10, 2016

@skrah The syntax will definitely accept that, but it doesn't really fit the semantics defined there. Those types are pattern types with particular type variable names, which should match any concrete type constructable with those arguments. E.g. ndt.type("Latin1_General_CI_AS[string]").match("pointer[string]") should return true. (It doesn't presently, but with the nd::buffer type constructor mechanism in place I do see a way to make it work.) Ideally we'd fit these types into the current datashape semantics.

@skrah (Contributor) commented Sep 11, 2016

@mwiebe Interesting, I've always viewed this as a tag or type constructor. Indeed this matches:

>>> ndt.type("Rational[(int64, int64)]").match(ndt.type("(int64, int64)"))
True

What can you do if you don't want custom types to be matched against concrete types?
I imagine this would be the case e.g. for a rational-dtype C-extension.

@inkrement

Please add this functionality. Is there a workaround to use unsupported encodings with the current version, or do we have to wait?

@ivandir commented Nov 25, 2017

Hi,

Can we get this pull request through?
Currently 0.5.2 is the latest version of DataShape on pip, and it still has this issue: ValueError: Unsupported string encoding 'utf8mb4_unicode_ci'. @cpcloud

@cpcloud (Member, Author) commented Nov 25, 2017

@ivandir, I'm not working on datashape anymore. Sorry this lingered for so long. You're welcome to pull this down, try to merge it into your local branch, and work off of that. I don't have the bandwidth to take this up right now.

@cpcloud cpcloud closed this Nov 25, 2017

Successfully merging this pull request may close these issues.

allow an arbitrary string for encoding parameter to String type
8 participants