I've been rooting out this problem for a few hours now and I've finally figured out where it's stemming from after concluding it's not in my implementation/code base.
Issue:
When detecting dialects, if two or more dialects have exactly the same score, the tie breaker returns None.
See:
SimpleDialect('\t', '', ''): P = 295149.000000 T = 0.997842 Q = 294512.000000
SimpleDialect('\t', '"', ''): P = 128607.250000 skip.
SimpleDialect('\t', '"', '\\'): P = 295149.000000 T = 0.997842 Q = 294512.000000
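As a quick sanity check on the numbers above, the Q values look like the product of P and T. That relationship is an assumption here (it is not stated in the output itself), and T is printed rounded to six places, so the product only matches Q after rounding. A minimal sketch using the two candidates that were not skipped:

```python
# P and T copied from the debug output above (T is printed rounded to 6 places).
candidates = {
    ("\t", "", ""): (295149.0, 0.997842),
    ("\t", '"', "\\"): (295149.0, 0.997842),
}


def consistency(pattern_score: float, type_score: float) -> float:
    # Assumption: Q is the product of the pattern score and the type score.
    return pattern_score * type_score


scores = {dialect: consistency(p, t) for dialect, (p, t) in candidates.items()}
best = max(scores.values())
# Every dialect whose score equals the maximum goes on to the tie breaker.
tied = [dialect for dialect, q in scores.items() if q == best]

for dialect, q in scores.items():
    print(dialect, round(q))
print("tied:", len(tied))
```

Both candidates end up with the same score, so neither can win on score alone and the decision is left entirely to the tie breaker.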
This specifically yields 2 dialects that go into the break_ties_two function. I'm unsure if this affects the other tiebreakers:
break_ties_three
break_ties_four
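One thing worth noting about the two tied dialects in the output above: they differ in two attributes at once (quotechar and escapechar). If break_ties_two only resolves pairs that differ in exactly one attribute, which is an unverified assumption about its implementation, that would explain why no branch applies. A small sketch that makes the difference explicit, using a hypothetical stand-in for SimpleDialect rather than the real class:

```python
from collections import namedtuple

# Hypothetical stand-in for SimpleDialect, used only for illustration.
Dialect = namedtuple("Dialect", ["delimiter", "quotechar", "escapechar"])


def differing_fields(a, b):
    """Return the names of the attributes on which two dialects disagree."""
    return [name for name in Dialect._fields if getattr(a, name) != getattr(b, name)]


# The two dialects that tied in the output above.
a = Dialect("\t", "", "")
b = Dialect("\t", '"', "\\")

print(differing_fields(a, b))
```

For this pair the list contains both quotechar and escapechar, so a rule that expects a single differing attribute would not match.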
Modifying the detect function to print results:
def detect(
    self, data: str, delimiters: Optional[List[str]] = None
) -> Optional[SimpleDialect]:
    """Detect the dialect using the consistency measure

    Parameters
    ----------
    data : str
        The data of the file as a string

    delimiters : iterable
        List of delimiters to consider. If None, the :func:`get_delimiters`
        function is used to automatically detect this (as described in the
        paper).

    Returns
    -------
    dialect : SimpleDialect
        The detected dialect. If no dialect could be detected, returns None.

    """
    self._cached_is_known_type.cache_clear()

    # TODO: probably some optimization there too
    dialects = get_dialects(data, delimiters=delimiters)

    # TODO: This is not thread-safe and this object can simply own a Parser
    # for each dialect and set the limit directly there (we can also cache
    # the best parsing result)
    old_limit = field_size_limit(len(data) + 1)

    scores = self.compute_consistency_scores(data, dialects)
    best_dialects = ConsistencyDetector.get_best_dialects(scores)
    result: Optional[SimpleDialect] = None
    if len(best_dialects) == 1:
        result = best_dialects[0]
    else:
        print(len(best_dialects))  # << Here
        result = tie_breaker(data, best_dialects)
        print(type(result))  # << Here
        print(result)  # << Here

    field_size_limit(old_limit)
    return result
Yields:
2
<class 'NoneType'>
None
This tells me the tie breaker isn't doing its job.
Further Analysis:
Ambiguous Tie Breakers: The function relies on several specific conditions to break ties between dialects, such as differences in quotechar, delimiter, or escapechar. If these conditions are not met, or if the distinctions are not sufficient to determine a clear winner (i.e., none of the predefined conditions apply or the conditions apply but lead to a tie), the function defaults to returning None.
Incomplete Handling for Non-Specific Cases: The current implementation does not cover cases where dialects differ in ways that are not specified or when the parsing results in identical outputs for the specific attributes being compared.
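One possible way to avoid the None result is to wrap the tie breaker with a deterministic fallback. The sketch below is only an illustration, not the library's behaviour: `Dialect`, `complexity`, `break_ties_with_fallback`, and `always_none` are hypothetical names, and preferring the dialect with the fewest special characters is an assumed policy that is only reasonable when the tied dialects parse the data identically.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Dialect:
    """Hypothetical stand-in for SimpleDialect."""

    delimiter: str
    quotechar: str
    escapechar: str


def complexity(dialect: Dialect) -> int:
    # Count how many of quotechar/escapechar are set (non-empty).
    return sum(1 for char in (dialect.quotechar, dialect.escapechar) if char)


def break_ties_with_fallback(
    data: str,
    dialects: List[Dialect],
    tie_breaker: Callable[[str, List[Dialect]], Optional[Dialect]],
) -> Optional[Dialect]:
    if not dialects:
        return None
    result = tie_breaker(data, dialects)
    if result is not None:
        return result
    # Assumed fallback policy: pick the simplest dialect. min() returns the
    # first of several equally simple dialects, so the choice is deterministic.
    return min(dialects, key=complexity)


def always_none(data: str, dialects: List[Dialect]) -> Optional[Dialect]:
    # Simulates a tie breaker for which no rule applies.
    return None


tied = [Dialect("\t", '"', "\\"), Dialect("\t", "", "")]
chosen = break_ties_with_fallback("a\tb\n", tied, always_none)
print(chosen)
```

With the two dialects from this issue, the fallback picks the one without a quotechar or escapechar. Whether that is the right default for every input is an open question.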
0xSheller added a commit to 0xSheller/CleverCSV that referenced this issue on Jul 11, 2024