Audio Transcription Error - AbstractTranscriptTask - SQLITE_CANTOPEN_ISDIR #2267

Open
gfd2020 opened this issue Jul 25, 2024 · 7 comments

Comments

@gfd2020
Collaborator

gfd2020 commented Jul 25, 2024

I got the error below while transcribing audio during the processing of a UFDR. Versions 4.1.5 and master both show the same error.
Computer: 32 cores (threads). The UFDR and the output directory are on a network drive.

The same image was also being processed with OCR and showed no connection errors with the OCR database.
Looking at the OCRParser code, I noticed that it has synchronized connection control.
Could this error be happening because the audio transcription class does not have the same concurrency handling?

The actual problem does not seem to be what the error code reports (SQLITE_CANTOPEN_ISDIR), because the audio transcription SQLite file is opened correctly by the connection. I believe the exception was raised because several connections were writing to the database at the same time.

2024-07-25 10:23:21 [ERROR] [task.transcript.AbstractTranscriptTask] Unexpected exception while transcribing: 0000786-AUDIO.opus
java.io.IOException: org.sqlite.SQLiteException: [SQLITE_CANTOPEN_ISDIR] The file is really a directory (unable to open database file)
at iped.engine.task.transcript.AbstractTranscriptTask.storeTextInDb(AbstractTranscriptTask.java:169) ~[iped-engine-4.1.5.jar:?]
at iped.engine.task.transcript.AbstractTranscriptTask.process(AbstractTranscriptTask.java:404) [iped-engine-4.1.5.jar:?]
at iped.engine.task.transcript.AudioTranscriptTask.process(AudioTranscriptTask.java:41) [iped-engine-4.1.5.jar:?]
at iped.engine.task.AbstractTask.processMonitorTimeout(AbstractTask.java:277) [iped-engine-4.1.5.jar:?]
at iped.engine.task.AbstractTask.processAndSendToNextTask(AbstractTask.java:192) [iped-engine-4.1.5.jar:?]
at iped.engine.task.AbstractTask.sendToNextTask(AbstractTask.java:225) [iped-engine-4.1.5.jar:?]
at iped.engine.task.AbstractTask.processAndSendToNextTask(AbstractTask.java:205) [iped-engine-4.1.5.jar:?]
at iped.engine.task.AbstractTask.sendToNextTask(AbstractTask.java:225) [iped-engine-4.1.5.jar:?]
at iped.engine.task.AbstractTask.processAndSendToNextTask(AbstractTask.java:205) [iped-engine-4.1.5.jar:?]
at iped.engine.task.AbstractTask.sendToNextTask(AbstractTask.java:225) [iped-engine-4.1.5.jar:?]
at iped.engine.task.AbstractTask.processAndSendToNextTask(AbstractTask.java:205) [iped-engine-4.1.5.jar:?]
at iped.engine.task.AbstractTask.sendToNextTask(AbstractTask.java:225) [iped-engine-4.1.5.jar:?]
at iped.engine.task.AbstractTask.processAndSendToNextTask(AbstractTask.java:205) [iped-engine-4.1.5.jar:?]
at iped.engine.task.AbstractTask.sendToNextTask(AbstractTask.java:225) [iped-engine-4.1.5.jar:?]
at iped.engine.task.AbstractTask.processAndSendToNextTask(AbstractTask.java:205) [iped-engine-4.1.5.jar:?]
at iped.engine.task.AbstractTask.sendToNextTask(AbstractTask.java:225) [iped-engine-4.1.5.jar:?]
at iped.engine.task.AbstractTask.processAndSendToNextTask(AbstractTask.java:205) [iped-engine-4.1.5.jar:?]
at iped.engine.task.AbstractTask.sendToNextTask(AbstractTask.java:225) [iped-engine-4.1.5.jar:?]
at iped.engine.task.AbstractTask.processAndSendToNextTask(AbstractTask.java:205) [iped-engine-4.1.5.jar:?]
at iped.engine.task.AbstractTask.sendToNextTask(AbstractTask.java:225) [iped-engine-4.1.5.jar:?]
at iped.engine.task.AbstractTask.processAndSendToNextTask(AbstractTask.java:205) [iped-engine-4.1.5.jar:?]
at iped.engine.task.AbstractTask.sendToNextTask(AbstractTask.java:225) [iped-engine-4.1.5.jar:?]
at iped.engine.task.AbstractTask.processAndSendToNextTask(AbstractTask.java:205) [iped-engine-4.1.5.jar:?]
at iped.engine.task.AbstractTask.sendToNextTask(AbstractTask.java:225) [iped-engine-4.1.5.jar:?]
at iped.engine.task.AbstractTask.processAndSendToNextTask(AbstractTask.java:205) [iped-engine-4.1.5.jar:?]
at iped.engine.core.Worker.process(Worker.java:177) [iped-engine-4.1.5.jar:?]
at iped.engine.core.Worker.run(Worker.java:265) [iped-engine-4.1.5.jar:?]
Caused by: org.sqlite.SQLiteException: [SQLITE_CANTOPEN_ISDIR] The file is really a directory (unable to open database file)
at org.sqlite.core.DB.newSQLException(DB.java:1012) ~[sqlite-jdbc-3.34.0.jar:?]
at org.sqlite.core.DB.newSQLException(DB.java:1024) ~[sqlite-jdbc-3.34.0.jar:?]
at org.sqlite.core.DB.execute(DB.java:866) ~[sqlite-jdbc-3.34.0.jar:?]
at org.sqlite.core.DB.executeUpdate(DB.java:904) ~[sqlite-jdbc-3.34.0.jar:?]
at org.sqlite.jdbc3.JDBC3PreparedStatement.executeUpdate(JDBC3PreparedStatement.java:98) ~[sqlite-jdbc-3.34.0.jar:?]
at iped.engine.task.transcript.AbstractTranscriptTask.storeTextInDb(AbstractTranscriptTask.java:167) ~[iped-engine-4.1.5.jar:?]
... 26 more

@lfcnassif
Member

Hi @gfd2020,

I think the synchronization in OCRParser was used just to avoid a race condition that could cause the creation of more than 1 connection per OCR results DB, per process. If enableExternalParsing == true, several connections to the same OCR DB are created, one per parsing process.
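
To make that race condition concrete, here is a minimal sketch of a synchronized, one-connection-per-DB cache. It is not the OCRParser code: `PerDbConnectionCache` and its `opener` function are hypothetical names, and the connection type is left generic so the sketch needs no JDBC driver.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical helper, not IPED code: at most one connection object per DB path. */
final class PerDbConnectionCache<C> {
    private final Map<String, C> connections = new HashMap<>();
    private final Function<String, C> opener; // assumed factory that opens a connection

    PerDbConnectionCache(Function<String, C> opener) {
        this.opener = opener;
    }

    /**
     * synchronized makes the check-then-create step atomic, so two threads
     * asking for the same DB at the same time cannot both open a connection.
     */
    synchronized C get(String dbPath) {
        C conn = connections.get(dbPath);
        if (conn == null) {
            conn = opener.apply(dbPath);
            connections.put(dbPath, conn);
        }
        return conn;
    }
}
```

Without the `synchronized` keyword, two threads could both see `null` for the same path and each open its own connection, which is the situation described above.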

I suggest searching the processing log for the same SQLite error coming from the OCRParser. It would be logged as WARN, not ERROR, so it goes only to the log, not to the console.

The approach used in AbstractTranscriptTask is to create one connection per worker thread. I think that is fine and shouldn't cause the error you hit. I think the error was caused by the unreliable SMB protocol, or maybe by some temporary network problem. If you find the same error in the log coming from the OCRParser, that would be good confirmation that the synchronization wouldn't help.

But if you can consistently reproduce the error, and if changing AbstractTranscriptTask to use a single static connection fixes it, I agree to change it to a single static connection.
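
For testing that idea, a single static connection could look roughly like the sketch below. This is only an illustration under assumptions: `SharedTranscriptDb`, the `transcriptions` table and its columns are hypothetical and do not come from the IPED source. Only standard `java.sql` interfaces are used.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

/** Hypothetical sketch, not IPED code: one static connection shared by all worker threads. */
final class SharedTranscriptDb {
    private static final Object LOCK = new Object();
    private static Connection shared; // guarded by LOCK

    /** Registers the single connection; later calls keep the first one. */
    static void init(Connection conn) {
        synchronized (LOCK) {
            if (shared == null) {
                shared = conn;
            }
        }
    }

    /** Serializes writes so only one thread uses the connection at a time. */
    static void storeText(String id, String text) throws SQLException {
        synchronized (LOCK) {
            if (shared == null) {
                throw new IllegalStateException("init() was not called");
            }
            // Table and column names are assumptions made for this example.
            try (PreparedStatement ps = shared.prepareStatement(
                    "INSERT INTO transcriptions(id, text) VALUES(?, ?)")) {
                ps.setString(1, id);
                ps.setString(2, text);
                ps.executeUpdate();
            }
        }
    }
}
```

The trade-off is that all worker threads queue on one lock for each write, which should be cheap compared to the transcription itself but removes any parallelism at the DB level.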

@gfd2020
Collaborator Author

gfd2020 commented Jul 26, 2024

Hi @lfcnassif, thanks for the help.

> I think the synchronization in OCRParser was used just to avoid a race condition that could cause the creation of more than 1 connection per OCR results DB, per process. If enableExternalParsing == true, several connections to the same OCR DB are created, one per parsing process.

I have now turned on enableExternalParsing and the WARN (SQLITE_CANTOPEN_ISDIR) seems to happen more often. I'll check further.

> I suggest searching the processing log for the same SQLite error coming from the OCRParser. It would be logged as WARN, not ERROR, so it goes only to the log, not to the console.

Yes, OCRParser logged the same error (SQLITE_CANTOPEN_ISDIR), but only as WARN, not as ERROR.

> The approach used in AbstractTranscriptTask is to create one connection per worker thread. I think that is fine and shouldn't cause the error you hit. I think the error was caused by the unreliable SMB protocol, or maybe by some temporary network problem. If you find the same error in the log coming from the OCRParser, that would be good confirmation that the synchronization wouldn't help.

I processed this case several times and the application crashed in every run. OCRParser does not raise an ERROR, just a WARN (SQLITE_CANTOPEN_ISDIR).

Just thinking here. Maybe if the SQLite database were in the IPED temporary processing folder, this error would be mitigated, right?

> But if you can consistently reproduce the error, and if changing AbstractTranscriptTask to use a single static connection fixes it, I agree to change it to a single static connection.

I'll have to change the source code so the task uses a single static connection, right?

@lfcnassif
Member

lfcnassif commented Jul 26, 2024

Hi @gfd2020,

> I processed this case several times and the application crashed in every run. OCRParser does not raise an ERROR, just a WARN (SQLITE_CANTOPEN_ISDIR).

With enableExternalParsing = false, right?

> Just thinking here. Maybe if the SQLite database were in the IPED temporary processing folder, this error would be mitigated, right?

I think so. It was not the same SQLite error, but others were reported in the past related to sleuth.db being populated on a network share. I thought about using the temp folder for this in the past. Beyond TSK, OCR and the transcription modules, thumbnails and subitems extracted from containers are also stored in SQLite DBs in the output folder. Theoretically all of them could be affected and ideally should use the same approach (created in the output or the temporary folder). And when using --append, the DBs would need to be moved from the output to the temp folder to append the next evidence (moving the subitems DBs would take a considerable amount of time), then moved back to the output again...
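
As a rough sketch of that move-to-temp-and-back flow, assuming nothing about how IPED would actually implement it (`DbStaging`, `stageToTemp` and `moveBack` are hypothetical names), the staging step could be written with `java.nio.file.Files`:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical sketch, not IPED code: work on a local copy of a DB and move it back afterwards. */
final class DbStaging {

    /**
     * Returns the path to use in the local temp folder. If the DB already
     * exists in the output folder (the --append case), it is copied first.
     */
    static Path stageToTemp(Path outputDb, Path tempDir) throws IOException {
        Files.createDirectories(tempDir);
        Path local = tempDir.resolve(outputDb.getFileName());
        if (Files.exists(outputDb)) {
            Files.copy(outputDb, local, StandardCopyOption.REPLACE_EXISTING);
        }
        return local;
    }

    /** Moves the local DB back to the output folder; call only after all connections are closed. */
    static void moveBack(Path local, Path outputDb) throws IOException {
        Path parent = outputDb.getParent();
        if (parent != null) {
            Files.createDirectories(parent);
        }
        // Without ATOMIC_MOVE, Files.move falls back to copy-and-delete across file systems.
        Files.move(local, outputDb, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

For large DBs such as the subitems one, the copy in both directions is exactly the cost mentioned above, so this sketch only shows the mechanics, not whether the trade-off is worth it.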

> I'll have to change the source code so the task uses a single static connection, right?

Yes.

@gfd2020
Collaborator Author

gfd2020 commented Jul 26, 2024

Hi @lfcnassif,

> > I processed this case several times and the application crashed in every run. OCRParser does not raise an ERROR, just a WARN (SQLITE_CANTOPEN_ISDIR).
>
> With enableExternalParsing = false, right?

Yes.

> I think so. It was not the same SQLite error, but others were reported in the past related to sleuth.db being populated on a network share. I thought about using the temp folder for this in the past. Beyond TSK, OCR and the transcription modules, thumbnails and subitems extracted from containers are also stored in SQLite DBs in the output folder. Theoretically all of them could be affected and ideally should use the same approach (created in the output or the temporary folder). And when using --append, the DBs would need to be moved from the output to the temp folder to append the next evidence (moving the subitems DBs would take a considerable amount of time), then moved back to the output again...

Wow. That would take a lot of work.

I'll continue doing some tests...

@gfd2020
Collaborator Author

gfd2020 commented Aug 2, 2024

From the tests I did, sometimes IPED crashes and other times it doesn't...

I tried other settings in the SQLite database connection parameters.
What gave the best result was switching the connection to WAL mode, as suggested on the SQLite help page.

Source: https://www.sqlite.org/useovernet.html

If this configuration mitigates the problem of using a network connection, I would suggest perhaps adding a configuration option to activate this mode.
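
A minimal sketch of what such an option could look like, using only the standard `java.sql` API and SQLite's `PRAGMA journal_mode` statement. The class name `JournalModeConfig` and the `enableWal` flag are hypothetical and are not existing IPED settings.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

/** Hypothetical sketch, not IPED code: choose the SQLite journal mode from a config flag. */
final class JournalModeConfig {

    /** Builds the PRAGMA statement; DELETE is SQLite's default rollback journal mode. */
    static String journalModePragma(boolean enableWal) {
        return "PRAGMA journal_mode=" + (enableWal ? "WAL" : "DELETE");
    }

    /** Applies the journal mode right after the connection is opened. */
    static void apply(Connection conn, boolean enableWal) throws SQLException {
        try (Statement st = conn.createStatement()) {
            st.execute(journalModePragma(enableWal));
        }
    }
}
```

Keeping the flag off by default would leave current behavior unchanged for users who do not process on network shares.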

First test case, with approximately 4,686 candidate audio files for transcription.

Transcriptions, IPED version 4.1.5 (IPED did not crash):
4052 tuples transcoded
864 KB of sqlite file size
10 errors of [SQLITE_CANTOPEN_ISDIR] in log files

Transcriptions, IPED master with WAL mode modification:
4559 tuples transcoded
1160 KB of sqlite file size
zero errors of [SQLITE_CANTOPEN_ISDIR]

Second test case, with approximately 23,079 candidate audio files for transcription.

Transcriptions, IPED version 4.1.5 (IPED did not crash):
22918 tuples transcoded
5212 KB of sqlite file size
33 errors of [SQLITE_CANTOPEN_ISDIR] in log files

Transcriptions, IPED master with WAL mode modification:
23072 tuples transcoded
6056 KB of sqlite file size
zero errors of [SQLITE_CANTOPEN_ISDIR]
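
To compare the runs on the same scale, the stored tuples can be expressed as a share of the candidate files. The snippet below is only a small arithmetic sketch: it assumes each tuple corresponds to one candidate audio file, and the candidate counts are the approximate figures quoted above. `TranscriptionCoverage` is a hypothetical name.

```java
/** Hypothetical helper: share of candidate audio files that ended up stored in the DB. */
final class TranscriptionCoverage {

    /** Percentage of candidates with a stored tuple (assumes one tuple per file). */
    static double percent(long storedTuples, long candidates) {
        return 100.0 * storedTuples / candidates;
    }

    /** Formats one line of the comparison, rounded to two decimals. */
    static String describe(String label, long storedTuples, long candidates) {
        return String.format(java.util.Locale.ROOT, "%s: %.2f%%",
                label, percent(storedTuples, candidates));
    }
}
```

With the numbers above, this gives roughly 86.5% (4.1.5) versus 97.3% (WAL) for the first case, and roughly 99.3% versus nearly 100% for the second.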

@lfcnassif
Member

lfcnassif commented Aug 3, 2024

Hi @gfd2020. Your test results are very interesting. In the past I considered using WAL in the sleuthkit SQLite DB, but abandoned the idea (without testing) because of a statement on the official SQLite site saying WAL does not work over a network file system (https://www.sqlite.org/wal.html). The statement is a bit ambiguous, though: the SQLite site may be referring to processes on different machines accessing the same DB, and it is not clear to me whether WAL should work fine when all processes or threads are on the same computer.

Have you tried to use just 1 static connection?

@gfd2020
Collaborator Author

gfd2020 commented Aug 4, 2024

> Have you tried to use just 1 static connection?

Hi @lfcnassif.
No. I just added WAL mode to the audio transcription task and left the rest of the configuration as default.

I also ran this test on the OCR task and the result in WAL mode was worse. I believe that is because it has a slightly different implementation.

I think that tasks that already work and are stable are best left unchanged. Only for this audio task, it would be interesting if you could add this configurable option.
