log_error in validation scales poorly #440

kevinschaper · 2023-03-21T19:56:29Z

monarch-kg takes about 10 hours to validate, so I'm spending some time with a profiler to see what's up.

I started with kg-phenio and found that it ran in a very reasonable amount of time, and nothing jumped out at me in the profiler as a problem.

For the last hour or so, I've been running the validator on monarch-kg, and nearly all of the compute time is being taken up by the log_error method (it was 92% last time I checked, and is now at 99.9%).

Looking at the code, my guess would be that all of the not in checks get progressively more expensive as the number of errors goes up:

    def log_error(
        self,
        entity: str,
        error_type: ErrorType,
        message: str,
        message_level: MessageLevel = MessageLevel.ERROR,
    ):
        """
        Log an error to the list of such errors.
        
        :param entity: source of parse error
        :param error_type: ValidationError ErrorType,
        :param message: message string describing the error
        :param  message_level: ValidationError MessageLevel
        """
        # index errors by entity identifier
        level = message_level.name
        error = error_type.name
        
        # clean up entity name string...
        entity = str(entity).strip()

        if level not in self.errors:
            self.errors[level] = dict()

        if error not in self.errors[level]:
            self.errors[level][error] = dict()
        
        # don't record duplicate instances of error type and
        # messages for entity identifiers...
        if message not in self.errors[level][error]:
            self.errors[level][error][message] = [entity]
        else:
            if entity not in self.errors[level][error][message]:
                self.errors[level][error][message].append(entity)
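
One possible way to avoid the slowdown, sketched as a standalone class rather than a patch to the real validator. The `not in` checks against dicts above are hash lookups, which stay roughly constant-time on average, so the check most likely to grow with the error count is `entity not in self.errors[level][error][message]`, which scans a list. The sketch below keeps the entities for each message as dict keys (an insertion-ordered set), so the duplicate check no longer scans. `ErrorLog`, `as_lists`, and the enum members are hypothetical stand-ins, not names from the project.

```python
from collections import defaultdict
from enum import Enum


class MessageLevel(Enum):  # hypothetical stand-in for the project's enum
    ERROR = 1
    WARNING = 2


class ErrorType(Enum):  # hypothetical stand-in for the project's enum
    MISSING_NODE_PROPERTY = 1
    INVALID_CATEGORY = 2


class ErrorLog:
    """Hypothetical container: level -> error type -> message -> entities."""

    def __init__(self):
        # The innermost dict is used as an insertion-ordered set of entities.
        self.errors = defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))

    def log_error(
        self,
        entity: str,
        error_type: ErrorType,
        message: str,
        message_level: MessageLevel = MessageLevel.ERROR,
    ):
        # clean up entity name string...
        entity = str(entity).strip()
        # setdefault is a hash lookup, so duplicates are skipped without a scan
        self.errors[message_level.name][error_type.name][message].setdefault(entity, None)

    def as_lists(self):
        """Convert back to the nested dict-of-lists shape used above."""
        return {
            level: {
                error: {msg: list(entities) for msg, entities in messages.items()}
                for error, messages in by_error.items()
            }
            for level, by_error in self.errors.items()
        }
```

Anything downstream that expects lists (reporting, JSON output) would call something like `as_lists()` once at the end, so the conversion cost is paid once instead of on every logged error.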


kevinschaper · 2023-03-21T20:15:46Z

If I just replace the contents of log_error with pass, monarch-kg (w/ 7m edges, 0.8m nodes) runs in 11 minutes for me. I think that means the validation code itself doesn't have performance problems, only the code that's tracking validation errors.
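
A rough way to check that guess outside the validator is to time the two dedup strategies on synthetic data. This is only a sketch: the entity ids are made up, the function names are hypothetical, and it models the worst case where many distinct entities share one message.

```python
import time


def dedup_with_list(entities):
    """Mirrors the current pattern: scan a list before every append."""
    seen = []
    for entity in entities:
        if entity not in seen:  # linear scan, so total work grows quadratically
            seen.append(entity)
    return seen


def dedup_with_dict(entities):
    """Same result and order, but with hash lookups instead of scans."""
    seen = {}
    for entity in entities:
        seen.setdefault(entity, None)
    return list(seen)


def compare(n):
    """Return (list_seconds, dict_seconds) for n distinct synthetic ids."""
    ids = [f"EX:{i}" for i in range(n)]  # synthetic ids, not monarch-kg data
    start = time.perf_counter()
    dedup_with_list(ids)
    list_seconds = time.perf_counter() - start
    start = time.perf_counter()
    dedup_with_dict(ids)
    dict_seconds = time.perf_counter() - start
    return list_seconds, dict_seconds


if __name__ == "__main__":
    for n in (1_000, 2_000, 4_000):
        list_seconds, dict_seconds = compare(n)
        print(f"n={n}: list {list_seconds:.4f}s, dict {dict_seconds:.4f}s")
```

If the list timings grow by roughly 4x each time n doubles while the dict timings grow by roughly 2x, that would be consistent with the list membership check being the bottleneck.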

kevinschaper added the bug label Mar 21, 2023
