Tie Breaker Error #134

Open
0xSheller opened this issue Jul 11, 2024 · 0 comments · May be fixed by #135

I've been chasing this problem for a few hours and have finally traced it to its source, after ruling out my own implementation/code base.

Issue:
When two or more dialects receive exactly the same score during dialect detection, the tie breaker returns None.

See:

SimpleDialect('\t', '', ''):    P =   295149.000000     T =        0.997842     Q =   294512.000000
SimpleDialect('\t', '"', ''):   P =   128607.250000     skip.
SimpleDialect('\t', '"', '\\'): P =   295149.000000     T =        0.997842     Q =   294512.000000

Here, two dialects are tied, so they go into the break_ties_two function. I'm not sure whether the other tie breakers are affected as well:

  • break_ties_three
  • break_ties_four
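To illustrate how the equal Q values in the log lead to a two-way tie, here is a minimal, self-contained sketch. It is not CleverCSV's actual code: `pick_best` is a hypothetical stand-in for `get_best_dialects`, and dialects are modelled as plain tuples rather than `SimpleDialect` objects.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for a dialect: (delimiter, quotechar, escapechar)
Dialect = Tuple[str, str, str]


def pick_best(scores: Dict[Dialect, float]) -> List[Dialect]:
    """Return every dialect whose score equals the maximum score."""
    if not scores:
        return []
    best = max(scores.values())
    return [dialect for dialect, score in scores.items() if score == best]


# Q values copied from the log above; the skipped dialect has no Q and is omitted.
scores = {
    ("\t", "", ""): 294512.0,
    ("\t", '"', "\\"): 294512.0,
}

tied = pick_best(scores)
print(len(tied))
```

Because both candidates share the maximum score, more than one dialect survives this step and the decision is handed to the tie breaker.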

Modifying the detect function to print results:

    def detect(
        self, data: str, delimiters: Optional[List[str]] = None
    ) -> Optional[SimpleDialect]:
        """Detect the dialect using the consistency measure

        Parameters
        ----------
        data : str
            The data of the file as a string

        delimiters : iterable
            List of delimiters to consider. If None, the :func:`get_delimiters`
            function is used to automatically detect this (as described in the
            paper).

        Returns
        -------
        dialect : SimpleDialect
            The detected dialect. If no dialect could be detected, returns None.

        """
        self._cached_is_known_type.cache_clear()

        # TODO: probably some optimization there too
        dialects = get_dialects(data, delimiters=delimiters)

        # TODO: This is not thread-safe and this object can simply own a Parser
        # for each dialect and set the limit directly there (we can also cache
        # the best parsing result)
        old_limit = field_size_limit(len(data) + 1)

        scores = self.compute_consistency_scores(data, dialects)
        best_dialects = ConsistencyDetector.get_best_dialects(scores)
        result: Optional[SimpleDialect] = None
        if len(best_dialects) == 1:
            result = best_dialects[0]
        else:
            print(len(best_dialects)) # << Here
            result = tie_breaker(data, best_dialects)
            print(type(result)) # << Here
            print(result) # << Here

        field_size_limit(old_limit)
        return result

Yields:

2
<class 'NoneType'>
None

This tells me the tie breaker isn't doing its job.

Further Analysis:

  1. Ambiguous Tie Breakers: The function relies on several specific conditions to break ties between dialects, such as differences in quotechar, delimiter, or escapechar. If these conditions are not met, or if the distinctions are not sufficient to determine a clear winner (i.e., none of the predefined conditions apply or the conditions apply but lead to a tie), the function defaults to returning None.

  2. Incomplete Handling for Non-Specific Cases: The current implementation does not cover dialects that differ in ways the tie breakers do not check, or dialects whose parsed output is identical for the attributes being compared.
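One possible way to avoid the None result is a deterministic fallback for when the specific conditions don't settle the tie. The sketch below is only an assumption about what such a fallback could look like, not the change proposed in #135 and not CleverCSV code: `Dialect` is a hypothetical stand-in for `SimpleDialect`, `fallback_tie_break` is a made-up helper, and preferring the "simplest" dialect is a design choice that may not suit every file.

```python
from typing import List, NamedTuple, Optional


class Dialect(NamedTuple):
    """Hypothetical stand-in for SimpleDialect."""

    delimiter: str
    quotechar: str
    escapechar: str


def fallback_tie_break(candidates: List[Dialect]) -> Optional[Dialect]:
    """Pick the candidate with the fewest non-empty attributes.

    If several candidates are equally simple, the first one in the list
    wins, so the result is deterministic for a given input order.
    """
    if not candidates:
        return None
    return min(candidates, key=lambda d: sum(1 for field in d if field))


# The two dialects that were tied in the log above.
tied = [Dialect("\t", '"', "\\"), Dialect("\t", "", "")]
chosen = fallback_tie_break(tied)
print(chosen)
```

For the two tied dialects from the log, this picks the one with no quotechar and no escapechar. It only returns None when there are no candidates at all.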

0xSheller added a commit to 0xSheller/CleverCSV that referenced this issue Jul 11, 2024
0xSheller linked a pull request Jul 11, 2024 that will close this issue