Encoding issue with libxml 2.11.1, 2.11.2, 2.11.3 (OK with libxml 2.11.0) #111

Open
jcamiel opened this issue May 11, 2023 · 1 comment
jcamiel commented May 11, 2023

Hi,

I've hit a strange encoding issue that started with libxml 2.11.1 (released a week ago, see https://gitlab.gnome.org/GNOME/libxml2/-/tags), using the libxml Rust crate 0.3.2.

My sample:

  • I have the following HTML document: <data>café</data>
  • I evaluate the following XPath expression: normalize-space(//data).

Sample code:

use std::ffi::CStr;
use std::os::raw;
use libxml::parser::{Parser, ParserOptions};
use libxml::xpath::Context;

fn main() {
    let parser = Parser::default_html();
    let options = ParserOptions { encoding: Some("utf-8"), ..Default::default()};
    let data = "<data>café</data>";
    let doc = parser.parse_string_with_options(data, options).unwrap();

    let context = Context::new(&doc).unwrap();
    let result = context.evaluate("normalize-space(//data)").unwrap();

    assert_eq!(unsafe { *result.ptr }.type_, libxml::bindings::xmlXPathObjectType_XPATH_STRING);
    let value = unsafe { *result.ptr }.stringval;
    let value = value as *const raw::c_char;
    let value = unsafe { CStr::from_ptr(value) };
    let value = value.to_string_lossy();
    println!("{value}")
}

With libxml 2.11.0 the printed value is café; with libxml 2.11.1 and later it is café:

  • With libxml 2.11.0:
$ export LIBXML2=/Users/jc/Documents/Dev/libxml/libxml2-2.11.0/lib/libxml2.2.dylib
$ cargo clean && cargo run
$ café
  • With libxml 2.11.3:
$ export LIBXML2=/Users/jc/Documents/Dev/libxml/libxml2-2.11.3/lib/libxml2.2.dylib
$ cargo clean && cargo run
$ café

My impression is that the encoding value of ParserOptions is not passed through the crate correctly (note: to reproduce the bug, you have to use Parser::default_html(), not Parser::default()).

To confirm this, I've tested the "equivalent" code in plain C with libxml 2.11.3:

#include <stdio.h>
#include <string.h>
#include <libxml/HTMLparser.h>
#include <libxml/xpath.h>

int main() {
    xmlDocPtr doc = NULL;
    xmlXPathContextPtr context = NULL;
    xmlXPathObjectPtr result = NULL;

    // <data>café</data> in UTF-8 (no trailing NUL, so the length comes from sizeof):
    const char data[] = {0x3c, 0x64, 0x61, 0x74, 0x61, 0x3e, 0x63, 0x61, 0x66, (char) 0xc3, (char) 0xa9, 0x3c, 0x2f, 0x64, 0x61,
                         0x74, 0x61, 0x3e};
    doc = htmlReadMemory(data, (int) sizeof(data), NULL, "utf-8",
                         HTML_PARSE_RECOVER | HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING);

    // Creating result request
    context = xmlXPathNewContext(doc);
    result = xmlXPathEvalExpression((const unsigned char *) "normalize-space(//data)", context);
    if (result->type == XPATH_STRING) {
        printf("%s\n", result->stringval);
    }

    xmlXPathFreeObject(result);
    xmlXPathFreeContext(context);
    xmlFreeDoc(doc);
    return 0;
}
  • With libxml 2.11.0:
$ gcc -L/Users/jc/Documents/Dev/libxml/libxml2-2.11.0/lib -l xml2 test.c
$ ./a.out
$ café
  • With libxml 2.11.3:
$ gcc -L/Users/jc/Documents/Dev/libxml/libxml2-2.11.3/lib -l xml2 test.c
$ ./a.out
$ café

My suspicion is that the problem is in

pub fn parse_string_with_options<Bytes: AsRef<[u8]>>(

When I debug the following code:

   // Process encoding.
    let encoding_cstring: Option<CString> =
      parser_options.encoding.map(|v| CString::new(v).unwrap());
    let encoding_ptr = match encoding_cstring {
      Some(v) => v.as_ptr(),
      None => DEFAULT_ENCODING,
    };

    // Process url.
    let url_ptr = DEFAULT_URL;

If the parser encoding is initialized with Some("utf-8"), encoding_ptr is no longer valid just before // Process url (it points to a null char).
So the call to the htmlReadMemory binding is made with no encoding. The unsafe part of this code is at the limit of my Rust understanding, so I can't tell whether something is wrong here. I hope my issue is clear and, as I should have said at the start, thank you for your work on this crate!
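One possible explanation, offered as a sketch rather than a confirmed diagnosis: `match encoding_cstring { Some(v) => v.as_ptr(), ... }` moves the CString into `v`, which is dropped at the end of the match arm, so the returned pointer would dangle. That would be consistent with the pointer showing a null char in the debugger. The helper `encoding_ptr` below is hypothetical (it is not part of the crate), and `ptr::null()` stands in for DEFAULT_ENCODING, whose actual definition I am assuming. Borrowing the Option instead of moving it keeps the buffer alive:

```rust
use std::ffi::{CStr, CString};
use std::os::raw::c_char;
use std::ptr;

/// Hypothetical helper: returns a pointer into `encoding`, or NULL when no
/// encoding was given. Matching on a reference borrows the CString instead of
/// moving it, so the buffer lives as long as the caller's `encoding` does.
fn encoding_ptr(encoding: &Option<CString>) -> *const c_char {
    match encoding {
        Some(v) => v.as_ptr(),
        // Assumption: NULL stands in for the crate's DEFAULT_ENCODING.
        None => ptr::null(),
    }
}

fn main() {
    let encoding_cstring: Option<CString> =
        Some("utf-8").map(|v| CString::new(v).unwrap());
    let p = encoding_ptr(&encoding_cstring);

    // `encoding_cstring` is still alive here, so reading through `p` is sound.
    let seen = unsafe { CStr::from_ptr(p) }.to_str().unwrap();
    println!("{seen}");
}
```

In the quoted snippet, the equivalent change would be to match on `&encoding_cstring` and keep `encoding_cstring` in scope until after the htmlReadMemory call.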

Regards,

Jc

@jangernert
I hit this one as well. I think it is caused by libxml2 changing the default encoding used when NULL is passed, from UTF-8 to ISO-8859-1, which apparently is more correct. But it's breaking a lot of real-world use cases.

So maybe the encoding override in this crate never worked and nobody noticed since the default was utf-8 anyway?

https://gitlab.gnome.org/GNOME/libxml2/-/issues/570

@jcamiel thanks for figuring out a temporary workaround
