Foreign object model; round-trip preservation #16

kohlhase · 2017-10-02T13:21:15Z

davidcarlisle · 2017-10-04T19:45:40Z

It's not clear to me that there is any issue to be addressed here. The fact that a string encoded in the binary encoding has < or > in it shouldn't be an issue.

kohlhase · 2017-10-05T06:25:27Z

I tend to agree with David here. Let's ask @lars-hellstrom if he upholds his problem.

lars-hellstrom · 2017-10-05T16:32:45Z

Yes, I maintain that there is a problem! The problem is that it is not clear from the standard how foreign objects in one encoding correspond to those in another encoding.

This becomes painfully clear if you imagine writing a tool that converts between the binary and XML encodings. Should a character < in the payload of a binary encoding foreign object be turned into < or &lt; in the contents of an XML encoding OMFOREIGN object? If the latter, there is no way to put tags in those contents. If the former, you implicitly say that the payload of a binary encoding foreign object is always XML, which is equally silly.
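The two options can be tried out with Python's standard library. This is only an illustrative sketch: the payload string is a made-up example, and `as_markup` and `as_text` are hypothetical names for the two possible transcodings.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical payload of a binary-encoding foreign object, already decoded from UTF-8.
payload = "<mi>x</mi>"

# Option 1: copy "<" through as markup. This only works if the payload is well-formed XML.
as_markup = ET.fromstring(f"<OMFOREIGN>{payload}</OMFOREIGN>")

# Option 2: escape "<" as "&lt;". This always works, but the tags become plain text.
as_text = ET.fromstring(f"<OMFOREIGN>{escape(payload)}</OMFOREIGN>")

print(len(as_markup), repr(as_markup.text))  # number of child elements, direct text
print(len(as_text), repr(as_text.text))
```

With option 1 the OMFOREIGN element has one child element and no text of its own; with option 2 it has no child elements and the whole payload as text. A payload such as `a < b` would make option 1 fail to parse at all.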

The root of the problem is probably that the standard is vague about the importance of the encoding attribute, since different types of foreign objects need different translations between different OpenMath-encodings. I propose that wording to the effect of imposing the following restrictions should be added:

- If the encoding is an XML namespace, then the payload of a binary encoding foreign object is XML code (in UTF-8 encoding).
- If the encoding is not an XML namespace, then the contents of an XML encoding foreign object may only be character data. (I.e., no tags, processing instructions, or the like.)

"XML namespace" should also be considered to include the two historical strings (that I don't have the zeal to look up right now).
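A converter following the two proposed restrictions might look roughly like the sketch below. The function name `binary_to_xml_foreign` and the set `XML_ENCODINGS` are assumptions made for illustration; how "is an XML namespace" is actually decided is exactly the part the standard would have to pin down.

```python
from xml.sax.saxutils import escape, quoteattr

# Assumption: a hand-maintained set of encoding values that count as XML namespaces.
XML_ENCODINGS = {"http://www.w3.org/1998/Math/MathML"}

def binary_to_xml_foreign(encoding: str, payload: bytes) -> str:
    """Serialise a binary-encoding foreign object as an OMFOREIGN element (sketch)."""
    text = payload.decode("utf-8")  # character-based payloads are UTF-8
    if encoding in XML_ENCODINGS:
        body = text           # first restriction: the payload is XML code, copy it as markup
    else:
        body = escape(text)   # second restriction: the contents are character data only
    return f"<OMFOREIGN encoding={quoteattr(encoding)}>{body}</OMFOREIGN>"

print(binary_to_xml_foreign("text/plain", b"a < b"))
print(binary_to_xml_foreign("http://www.w3.org/1998/Math/MathML", b"<mi>x</mi>"))
```

The sketch only covers character-based payloads; arbitrary octets (for example an image) would fail at the `decode` step and need separate treatment.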

davidcarlisle · 2017-10-05T16:45:26Z

I don't really think < is special here. In the binary encoding a foreign object is just a stream of bytes of specified length. Whether it's "a" or "<" or byte 3, you can't really reliably convert it to XML, other than (say) by base64 encoding it and putting it in the XML that way, or by writing the byte stream out to a file and referencing it from the XML. I think the proposed restriction would negate any advantage of using the binary encoding. In the binary encoding you could, for example, include a PNG image inline as a foreign object, and you don't want to have to encode that as an XML-compatible string. If you then want to write the OM object as XML, you will have to base64 encode that data to get an XML-compatible string that you could put in an OMFOREIGN.
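As a rough sketch of the base64 route, using only the Python standard library: the eight bytes below are the PNG file signature standing in for a real image, and the encoding label `image/png;base64` is a made-up placeholder, not something the standard defines.

```python
import base64
import xml.etree.ElementTree as ET

# Stand-in for a binary foreign object payload: the 8-byte PNG file signature.
payload = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A])

# Wrap the octets in base64 so that they become XML-compatible character data.
elem = ET.Element("OMFOREIGN", {"encoding": "image/png;base64"})  # hypothetical label
elem.text = base64.b64encode(payload).decode("ascii")
xml_text = ET.tostring(elem, encoding="unicode")
print(xml_text)

# Reading the element back and decoding recovers the exact bytes.
restored = base64.b64decode(ET.fromstring(xml_text).text)
print(restored == payload)
```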


kohlhase · 2017-10-06T07:49:48Z

I agree with David on this.
The only real way of deciding this issue is probably to implement a round-trip tool that works (then David is right) or die trying (then there is evidence that Lars is).

lars-hellstrom · 2017-10-06T12:51:23Z

An OpenMath object is fundamentally the abstract thing described in Chapter 2, which may contain foreign objects. These abstract things then have a number of encodings, as described in the sections of Chapter 3. I have always presumed that in order to qualify as an encoding of OpenMath, a scheme must be able to encode all OpenMath objects, but it seems @davidcarlisle rather says each encoding has a different domain, because they support different sets of foreign objects?

The point about what happens to a PNG when included in an XML format is an interesting one. I had assumed that the octets should just become the characters U+0000 through U+00FF (ÿ), written as numeric character references (looks horrible, but should process OK), although some of them being forbidden maybe requires something like adding a base64-encoding layer on top of the octets instead. But that is a separate problem. My example shows there are problems already within the realm of character based formats. About the use of these in the binary encoding, the standard explicitly says:

> Character based formats (including XML based formats) should be encoded in UTF-8 to produce a stream of bytes to use as the payload of the foreign object.

Thus: we can reliably tell which sequence of characters the payload of an OpenMath-binary foreign object encodes. What is unclear is how those characters should get encoded when transcoding to OpenMath-XML. If < is encoded as &lt; then the transcoded object cannot contain tags, which is wrong for MathML. If < is encoded as < then the character sequence had to be well-formed XML already in the binary encoding, which is wrong for everything not XML.

The two bullet points I proposed are precisely about clarifying when one should do one thing and when one should do the other.

Perhaps it's easier if I try to phrase it in RNG? Right now we have for OpenMath-XML

# foreign constructor
OMFOREIGN = element OMFOREIGN {
    compound.attributes, attribute encoding {xsd:string}?,
    (omel|notom)* }

I think that should be changed to (though bear in mind that I'm guessing on RNG syntax)

# foreign constructor
OMFOREIGN =
  element OMFOREIGN {
    compound.attributes, attribute encoding {xsd:anyURI}?,
    (omel|notom)* }
  |
  element OMFOREIGN {
    compound.attributes, attribute encoding {xsd:anyMIME}?,
    text }

davidcarlisle · 2017-10-06T13:02:05Z

> The point about what happens to a PNG when included in an XML format is an interesting one. I had assumed that the octets should just become the characters U+0000 through U+00FF (ÿ), written as numeric character references (looks horrible, but should process OK),

No, it's not well-formed, so it is a fatal syntax error. (XML 1.0 doesn't allow numeric references to control characters other than tab, line feed, and carriage return; XML 1.1 allows all but 0, but no one uses that and it doesn't really help.) You can put any sequence of bytes in an XML file if you base64 encode the sequence and put that in, so I assumed that going from binary to XML that's what you'd do. It does mean that if you start with some XML-encoded foreign object in an XML OpenMath object and map it to the binary encoding and back, you may have a different encoding (base64) of the original, but it's not really ambiguous.
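The well-formedness point can be checked with any XML 1.0 parser. A small sketch using Python's bundled expat-based parser, where `parses` is a helper name introduced here for illustration:

```python
import xml.etree.ElementTree as ET

def parses(xml_text: str) -> bool:
    """Return True if the XML 1.0 parser accepts the text as well-formed."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

print(parses("<OMFOREIGN>&#255;</OMFOREIGN>"))  # U+00FF is an allowed character
print(parses("<OMFOREIGN>&#9;</OMFOREIGN>"))    # tab is one of the few allowed control characters
print(parses("<OMFOREIGN>&#1;</OMFOREIGN>"))    # U+0001 may not appear, even as a reference
print(parses("<OMFOREIGN>&#0;</OMFOREIGN>"))    # U+0000 is never allowed
```

The first two calls are accepted and the last two are rejected, so mapping each octet to the character with the same code point cannot work for arbitrary binary data.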

> I think that should be changed to (though bear in mind that I'm guessing on RNG syntax)
>
> OMFOREIGN = element OMFOREIGN { compound.attributes, attribute encoding {xsd:anyURI}?, (omel|notom)* } | element OMFOREIGN { compound.attributes, attribute encoding {xsd:anyMIME}?, text }

You can't syntactically distinguish a MIME type from a (possibly relative) URI: text/xml could be that MIME type, or it could be a relative URL to text/xml. You can of course special-case certain strings matching XML MIME types for special handling, which would help round-trip those special cases. But to be honest I don't see that it's a problem if you start with OM-XML with a foreign XHTML document inlined, map to binary (just putting the string representation inline), then map back to OM-XML, base64 encoding the string, to end up with the XHTML document in a base64-encoded-XHTML encoding. The underlying object hasn't changed; you just have a different encoding of the foreign object.


lars-hellstrom · 2017-10-06T18:18:19Z

@davidcarlisle wrote:

> You can put any sequence of bytes in an XML file if you base64 encode the sequence and put that in, so I assumed that going from binary to XML that's what you'd do.

For PNG I suppose that works OK, provided you change the encoding from image/png (in OM-binary) to maybe image/png; content-transfer-encoding=base64 (in OM-XML).

> It does mean that if you start with some XML-encoded foreign object in an XML OpenMath object and map it to the binary encoding and back, you may have a different encoding (base64) of the original, but it's not really ambiguous.

If you do that, then I strongly suspect that the elements in that foreign object will no longer show up as elements in the DOM tree. That's not what is expected when you do a round trip — the conversion operations should be inverses of each other (up to relevant equivalence), not just injective.
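The loss of element structure is easy to see in a small sketch. The encoding labels `example-xml` and `base64-encoded-example-xml` are invented for this illustration, and the "binary" step is simulated by serialising the child elements to bytes.

```python
import base64
import xml.etree.ElementTree as ET

# An OM-XML foreign object whose contents are real XML elements.
original = ET.fromstring(
    '<OMFOREIGN encoding="example-xml">'
    '<math xmlns="http://www.w3.org/1998/Math/MathML"><mi>x</mi></math>'
    "</OMFOREIGN>"
)

# Naive XML -> binary -> XML: serialise the children to bytes, then base64 them on the way back.
payload = b"".join(ET.tostring(child) for child in original)
round_tripped = ET.Element("OMFOREIGN", {"encoding": "base64-encoded-example-xml"})
round_tripped.text = base64.b64encode(payload).decode("ascii")

print(len(original))       # child elements before the round trip
print(len(round_tripped))  # child elements after the round trip
```

Before the round trip the OMFOREIGN element has one child element; afterwards it has none, only text. The information is still recoverable by decoding and re-parsing, but a consumer walking the tree no longer sees the `math` element.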

> You can't syntactically distinguish a MIME type from a (possibly relative) URI: text/xml could be that MIME type, or it could be a relative URL to text/xml.

Oh, right! I was thinking "absolute URI", not "any URI"; my bad. The idea was to catch XML namespaces — those should be absolute URIs, should they not?
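One way the "absolute URI" test could be approximated is to check for a URI scheme. This is a heuristic sketch only; `looks_like_absolute_uri` is a hypothetical helper, and it would not recognise historical encoding strings that are not URIs.

```python
from urllib.parse import urlsplit

def looks_like_absolute_uri(encoding: str) -> bool:
    """Heuristic: an absolute URI starts with a scheme such as 'http:'."""
    return bool(urlsplit(encoding).scheme)

print(looks_like_absolute_uri("http://www.w3.org/1998/Math/MathML"))  # has the scheme "http"
print(looks_like_absolute_uri("text/xml"))                            # no scheme
print(looks_like_absolute_uri("image/png"))                           # no scheme
```

Under this test a MIME type such as text/xml is still a syntactically valid relative reference, so the check separates the two cases only if encodings are required to be either absolute URIs or MIME types.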

davidcarlisle · 2017-10-07T08:02:02Z

> should be inverses of each other (up to relevant equivalence)

Well, there's the rub. I would consider an OM-XML object with a foreign object encoded as XML with encoding="foo" and another OM-XML object with the same foreign object encoded with encoding="base64-encoded-foo" as equivalent if they represent the same OM object; I think you'd consider them different.

Practically, if you do the simplest, safest thing each time, then every time you convert and convert back you will add another layer of base64 encoding. Some systems may special-case known encodings and avoid that (or in practice most systems will only handle a very limited range of foreign objects anyway), but that is a matter of software usability and I don't think we need to standardise it.
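The layering effect can be made concrete with a short sketch. The two helper functions are stand-ins invented here for "the simplest, safest" conversion in each direction; they are not taken from any OpenMath tool.

```python
import base64

def to_xml_safe(payload: bytes) -> str:
    """Binary -> XML: always base64 the octets, whatever they are."""
    return base64.b64encode(payload).decode("ascii")

def to_binary(text: str) -> bytes:
    """XML -> binary: store the character data as UTF-8 octets."""
    return text.encode("utf-8")

payload = b"<mi>x</mi>"
sizes = [len(payload)]
for _ in range(3):  # three binary -> XML -> binary round trips
    payload = to_binary(to_xml_safe(payload))
    sizes.append(len(payload))
print(sizes)

# Unwrapping the same number of layers recovers the original octets.
unwrapped = payload
for _ in range(3):
    unwrapped = base64.b64decode(unwrapped)
print(unwrapped)
```

Each round trip grows the payload by roughly a third, and a reader needs to know how many layers were added (for example from the encoding label) in order to undo them.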

kohlhase mentioned this issue Oct 2, 2017

Foreign object model; round-trip preservation OpenMath/OM3#148

Closed

kohlhase added this to the OM2 Revision 2 milestone Oct 3, 2017

kohlhase assigned davidcarlisle and kohlhase Oct 5, 2017
