
Let more encodings through and allow registration of encodings #176

Closed
wants to merge 8 commits

Conversation

@cpcloud (Member) commented Sep 1, 2015

closes #173

@cpcloud cpcloud self-assigned this Sep 1, 2015
@cpcloud cpcloud added this to the 0.4.7 milestone Sep 1, 2015
@cpcloud (Member, Author) commented Sep 5, 2015

@mwiebe was there any particular reason that encodings were limited to A and the various versions of U* as opposed to allowing an arbitrary string?

@mwiebe (Contributor) commented Sep 5, 2015

The set allowed ascii, utf-8, and some other similar ones. The main reason to limit it to a fairly strict and small set is to keep it easier to support as a standard interface, and to catch errors as early as possible. Think of the history of HTML with IE6, and how liberal it was in accepting very arbitrary input as HTML.

@cpcloud (Member, Author) commented Sep 8, 2015

My concern is that systems don't all spell these encodings the same way, and many support additional encodings that won't conform to the existing set. MySQL is a good example of this. So, instead of special-casing every system, I think it would make sense to accept Python encodings and to give users a way to register their own encodings.
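
To make the proposal concrete, here is a minimal sketch of what such a registry could look like, leaning on Python's codecs module to normalize spellings. None of these names (register_encoding, resolve_encoding, the module-level dicts) exist in datashape; they are purely illustrative.

# Illustrative only: none of these names exist in datashape.
import codecs

# the small canonical set datashape knows about today, plus registered aliases
_canonical = {'A', 'U8', 'U16', 'U32'}
_aliases = {'ascii': 'A', 'utf-8': 'U8', 'utf-16': 'U16', 'utf-32': 'U32'}

def register_encoding(name, canonical=None):
    """Accept `name` as a string encoding, mapped onto a canonical token."""
    if canonical is None:
        # fall back to whatever Python's codec machinery knows about the name
        canonical = codecs.lookup(name).name
    _aliases[name.lower()] = canonical

def resolve_encoding(name):
    """Return the canonical encoding for `name`, or raise as datashape does today."""
    if name in _canonical:
        return name
    try:
        return _aliases[name.lower()]
    except KeyError:
        raise ValueError("Unsupported string encoding %r" % name)

# a MySQL backend could then register its own spellings up front:
register_encoding('utf8mb4_unicode_ci', canonical='U8')

Resolving every alias to a canonical token at registration time would keep the set of encodings that downstream consumers actually see small, even while accepting more spellings at the boundary.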

@mwiebe (Contributor) commented Sep 8, 2015

This sounds like a similar argument to "System X spells 32-bit integers as 'integer32' instead of 'int32', so we should extend datashape to support that." Extending datashape to allow arbitrary encoding strings seems like the opposite of what datashape is for, which is helping systems talk to each other. Isn't the point of datashape to translate different systems' type systems into a common language?
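
For contrast, a rough sketch of that alternative: the translation happens in each backend's adapter rather than in datashape itself. The table and function below are purely illustrative, not actual Blaze or datashape code.

# Illustrative only: a MySQL adapter normalizes its charset names to the
# canonical datashape encodings at the boundary; anything it cannot express
# directly (e.g. latin1) would have to be transcoded by the adapter first.
MYSQL_TO_CANONICAL = {
    'ascii': 'A',
    'utf8': 'U8',
    'utf8mb4': 'U8',
    'utf16': 'U16',
    'utf32': 'U32',
}

def canonical_encoding(mysql_charset):
    try:
        return MYSQL_TO_CANONICAL[mysql_charset.lower()]
    except KeyError:
        raise ValueError('no canonical datashape encoding for charset %r'
                         % mysql_charset)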

@cpcloud (Member, Author) commented Sep 8, 2015

@mwiebe In libdynd/libdynd#417, it seems like the following is inconsistent with what you're saying above:

Other string storage choices can be viewed as the one string type via expression types, which adapt between the representation.

Presumably the thing you're using to do the adapting would need to know what encoding to adapt to. Where would that information be stored? What would the datashape of the conversion expression be?

@mwiebe (Contributor) commented Sep 8, 2015

@cpcloud Any Unicode encoding (except UCS-2) can represent all strings, so choosing a particular one as the default representation for all normal string interaction simplifies the implementation to a single string type that can get all of the usability and performance development focus.

Yes, a string adaptor type would need to know the encoding. It also needs to know whether the string is null-terminated, whether it lives in a fixed-size memory buffer or a variable one, etc., because there are a huge number of string storage approaches out there. All of these details would have to be parameters to the adaptor type somehow. I don't know what the datashape of the conversion expression should be, but however it's spelled, I would expect DyND to use an adaptor type under the hood which converts to/from that string storage for any computation.
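
As a sketch of the parameter space being described here (not DyND's actual adaptor API), something like the following captures the details such an adaptor would have to carry:

# Illustrative only: the kinds of parameters a string storage adaptor needs,
# not DyND's actual type constructor.
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class StringStorage:
    encoding: str               # e.g. 'latin1', 'utf-8', 'utf-16-le'
    null_terminated: bool       # C-style termination vs. an explicit length
    fixed_size: Optional[int]   # byte width of a fixed buffer, or None if variable

    def decode(self, raw: bytes) -> str:
        """Adapt the stored bytes to the one canonical string type."""
        if self.null_terminated:
            raw = raw.split(b'\x00', 1)[0]
        return raw.decode(self.encoding)

# e.g. a fixed CHAR(32) column stored as latin1:
char32 = StringStorage(encoding='latin1', null_terminated=False, fixed_size=32)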

@cpcloud cpcloud modified the milestones: 0.4.7, 0.4.8 Sep 15, 2015
@kwmsmith kwmsmith modified the milestones: 0.5.0, 0.5.2 Feb 2, 2016
@kwmsmith kwmsmith modified the milestones: 0.5.2, 0.5.3 May 5, 2016
@dhirschfeld (Contributor)

Very keen to see this merged, as I'm getting a ValueError when trying to use blaze with a SQL Server database:

ValueError: Unsupported string encoding 'Latin1_General_CI_AS'

@dhirschfeld (Contributor)

Ping! Just linking some issues that demonstrate this is an actual problem people are having.

This still affects me but I've been quiet because I've resorted to monkey-patching my own version of blaze.
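
For anyone hitting the same error, the monkey-patch amounts to teaching datashape about the extra name. A rough sketch is below; it assumes the accepted encodings live in a module-level dict in datashape.coretypes (the attribute name is an assumption and may differ between versions, so check your installed copy), and the canonical token chosen for the collation is a lossy guess.

# Rough workaround sketch -- the attribute name below is an assumption, not a
# documented API; verify it against your installed datashape version.
import datashape.coretypes as ct

extra = {
    # treat the SQL Server collation name as an alias for datashape's ascii
    # encoding; this only papers over the name check, it does not change
    # how the underlying column data is actually encoded
    'Latin1_General_CI_AS': 'A',
}

if hasattr(ct, '_canonical_string_encodings'):
    ct._canonical_string_encodings.update(extra)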

@daefresh commented Sep 1, 2016

Will this be worked on? We really need it...

@skrah (Contributor) commented Sep 1, 2016

I can take a look at this. At first glance the aliases seem to be currently supported by DyND-Datashape.

@dhirschfeld (Contributor)

I think latin1 is such a common encoding that it would make sense to have it be one of the canonical encodings.

I think you'd still need the ability to register encodings, though, as there may be many non-standard names in the wild.

@skrah (Contributor) commented Sep 8, 2016

@mwiebe @dhirschfeld @cpcloud Perhaps the DyND-Datashape syntax for general type constructors could also work for unknown encodings. Example:

>>> ndt.type("Latin1_General_CI_AS[string]")
ndt.type('Latin1_General_CI_AS[string]')
>>> 
>>> ndt.type("Latin1_General_CI_AS[bytes]")
ndt.type('Latin1_General_CI_AS[bytes]')

DyND could treat such strings as an opaque blob tagged with the constructor; Blaze could do more. But they'd share the same (existing) syntax.

@mwiebe (Contributor) commented Sep 10, 2016

@skrah The syntax will definitely accept that, but it doesn't really fit the semantics defined there. Those types are pattern types with particular type variable names, which should match any concrete type constructable with those arguments. E.g. ndt.type("Latin1_General_CI_AS[string]").match("pointer[string]") should return true. (It doesn't presently, but with the nd::buffer type constructor mechanism in place I do see a way to make it work.) Ideally we'd fit these types into the current datashape semantics.

@skrah (Contributor) commented Sep 11, 2016

@mwiebe Interesting, I've always viewed this as a tag or type constructor. Indeed this matches:

>>> ndt.type("Rational[(int64, int64)]").match(ndt.type("(int64, int64)"))
True

What can you do if you don't want custom types to be matched against concrete types?
I imagine this would be the case e.g. for a rational-dtype C-extension.

@inkrement

Please add this functionality. Is there a workaround to use unsupported encodings with the current version, or do we have to wait?

@ivandir commented Nov 25, 2017

Hi,

Can we get this pull request through?
Currently 0.5.2 is the latest version of DataShape on pip, and it still has this issue: ValueError: Unsupported string encoding 'utf8mb4_unicode_ci'. @cpcloud

@cpcloud (Member, Author) commented Nov 25, 2017

@ivandir, I'm not working on datashape anymore. Sorry this lingered for so long. You're welcome to pull this down, try to merge it into your local branch, and work off of that. I don't have the bandwidth to take this up right now.

@cpcloud cpcloud closed this Nov 25, 2017

Successfully merging this pull request may close these issues.

allow an arbitrary string for encoding parameter to String type
8 participants