log_error in validation scales poorly #440

Open
kevinschaper opened this issue Mar 21, 2023 · 1 comment
Labels
bug Something isn't working

Comments

@kevinschaper
Collaborator

monarch-kg takes about 10 hours to validate, so I'm spending some time with a profiler to see what's up.

I started with kg-phenio and found that it ran in a very reasonable amount of time, and nothing jumped out at me in the profiler as a problem.

For the last hour or so, I've been running the validator on monarch-kg, and nearly all of the compute time is being taken up by the log_error method (it was 92% when I last checked, and is now at 99.9%).

Looking at the code, my guess is that the not in checks get progressively more expensive as the number of errors goes up, particularly the last one, which scans a list of entities rather than a dict:

    def log_error(
        self,
        entity: str,
        error_type: ErrorType,
        message: str,
        message_level: MessageLevel = MessageLevel.ERROR,
    ):
        """
        Log an error to the list of such errors.
        
        :param entity: source of parse error
        :param error_type: ValidationError ErrorType,
        :param message: message string describing the error
        :param  message_level: ValidationError MessageLevel
        """
        # index errors by entity identifier
        level = message_level.name
        error = error_type.name
        
        # clean up entity name string...
        entity = str(entity).strip()

        if level not in self.errors:
            self.errors[level] = dict()

        if error not in self.errors[level]:
            self.errors[level][error] = dict()
        
        # don't record duplicate instances of error type and
        # messages for entity identifiers...
        if message not in self.errors[level][error]:
            self.errors[level][error][message] = [entity]
        else:
            if entity not in self.errors[level][error][message]:
                self.errors[level][error][message].append(entity)
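
A minimal sketch of why the final check is the likely culprit. The first three `not in` tests look up keys in dicts, which are hash lookups and stay roughly constant-time on average. The last one tests membership in a list, which compares against every entity already stored under that message. The helper names and identifiers below are made up for illustration and are not part of the project; the timings will vary by machine.

```python
import time


def dedupe_with_list(entities):
    """Mirror the pattern in log_error: scan the list before each append."""
    seen = []
    for entity in entities:
        if entity not in seen:  # linear scan, O(len(seen)) per call
            seen.append(entity)
    return seen


def dedupe_with_dict(entities):
    """Same result via hash lookups; dicts preserve insertion order."""
    seen = {}
    for entity in entities:
        if entity not in seen:  # hash lookup, O(1) on average
            seen[entity] = None
    return list(seen)


if __name__ == "__main__":
    # Hypothetical identifiers standing in for node or edge ids that all
    # trigger the same error message.
    for n in (1_000, 2_000, 4_000):
        ids = [f"EXAMPLE:{i}" for i in range(n)]

        start = time.perf_counter()
        dedupe_with_list(ids)
        list_seconds = time.perf_counter() - start

        start = time.perf_counter()
        dedupe_with_dict(ids)
        dict_seconds = time.perf_counter() - start

        print(f"n={n}: list {list_seconds:.4f}s, dict {dict_seconds:.4f}s")
```

With n distinct entities under one message, the list version performs on the order of n²/2 comparisons in total, so doubling n should roughly quadruple its time, while the dict version should grow about linearly. That would be consistent with log_error's share of runtime climbing as errors accumulate.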

kevinschaper added the bug label Mar 21, 2023

@kevinschaper
Collaborator Author

kevinschaper commented Mar 21, 2023

If I just replace the contents of log_error with pass, monarch-kg (with 7M edges and 0.8M nodes) runs in 11 minutes for me. I think that means the validation code itself doesn't have performance problems, only the code that tracks validation errors.
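
One possible direction, sketched under assumptions rather than taken from the project: keep the nested level / error / message layout, but store the entities for each message as dict keys instead of a list, so the duplicate check becomes a hash lookup. The `ErrorLog` class, the `as_lists` helper, and the enum members below are hypothetical stand-ins so the example runs on its own; the real class, `ErrorType`, and `MessageLevel` live in the project.

```python
from enum import Enum


class MessageLevel(Enum):
    """Stand-in for the project's MessageLevel enum (members assumed)."""
    ERROR = 1
    WARNING = 2


class ErrorType(Enum):
    """Stand-in for the project's ErrorType enum (member is hypothetical)."""
    EXAMPLE_ERROR = 1


class ErrorLog:
    """Hypothetical container; only log_error mirrors the issue's method."""

    def __init__(self):
        # level -> error type -> message -> {entity: None}
        self.errors = {}

    def log_error(
        self,
        entity: str,
        error_type: ErrorType,
        message: str,
        message_level: MessageLevel = MessageLevel.ERROR,
    ):
        level = message_level.name
        error = error_type.name
        entity = str(entity).strip()

        # setdefault creates each nesting level on first use.
        entities = (
            self.errors.setdefault(level, {})
            .setdefault(error, {})
            .setdefault(message, {})
        )
        # Dict keys are unique and keep insertion order, so this both
        # deduplicates and preserves first-seen order without a scan.
        entities[entity] = None

    def as_lists(self):
        """Return the original shape: message -> list of entities."""
        return {
            level: {
                error: {msg: list(ents) for msg, ents in by_msg.items()}
                for error, by_msg in by_error.items()
            }
            for level, by_error in self.errors.items()
        }


log = ErrorLog()
log.log_error(" A:1 ", ErrorType.EXAMPLE_ERROR, "missing property")
log.log_error("A:1", ErrorType.EXAMPLE_ERROR, "missing property")  # duplicate
log.log_error("B:2", ErrorType.EXAMPLE_ERROR, "missing property")
print(log.as_lists())
```

A plain set would also work if the order of entities doesn't matter. Any code that reads self.errors and expects lists would need a conversion step like `as_lists`, so whether this is a drop-in change depends on how the error report is consumed downstream.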
