
Operators


These are modules that operate on media items and help analyse text, images, video, audio, etc. They act as plugin code that is loaded only if specified in the config.yml file. Operators define the ways in which you can manipulate the data that your search engine operates on.

  • This wiki page briefly describes each operator.
  • Each operator has a unit test file, a requirements.in file, and a requirements.txt file. The requirements files list all the packages required to run the operator.

Audio Vector Embeddings (audio_vec_embedding.py)

Given an audio file, this operator finds a vector of 2048 dimensions using PANNs. PANNs are CNNs pre-trained on a large collection of audio files. They have been used for audio tagging and sound event detection, have been fine-tuned for several audio pattern recognition tasks, and have outperformed several state-of-the-art systems.
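For illustration, here is a minimal sketch of computing a 2048-dimensional PANNs embedding with the panns_inference package; the operator's own implementation may differ, and 'sample.wav' is a placeholder path.

import librosa
from panns_inference import AudioTagging

# Load audio at 32 kHz, the sample rate the pretrained PANNs checkpoint expects
(audio, _) = librosa.load('sample.wav', sr=32000, mono=True)
audio = audio[None, :]  # add a batch dimension: (1, samples)

# Downloads the default pretrained checkpoint on first use
at = AudioTagging(checkpoint_path=None, device='cpu')
(clipwise_output, embedding) = at.inference(audio)
print(embedding.shape)  # (1, 2048)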

Embeddings for vector audio search

Audio embeddings are often generated using spectrograms or other audio signal features. In the context of audio signal processing for machine learning, the process of feature extraction from spectrograms is a crucial step. Spectrograms are visual representations of the frequency content of audio signals over time. The identified features in this context encompass three specific types:

  • Mel-frequency cepstral coefficients (MFCCs): MFCCs compactly describe the short-term power spectrum of an audio signal on the mel scale.
  • Chroma features: Chroma features represent the 12 distinct pitch classes of the musical octave and are particularly useful in music-related tasks.
  • Spectral contrast: Spectral contrast focuses on the perceptual brightness of different frequency bands within an audio signal.
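As a rough sketch, these three feature types can be extracted with a library such as librosa (not necessarily what the operator itself uses); 'sample.wav' is a placeholder path.

import librosa

# Load the audio file (placeholder path) at its native sample rate
y, sr = librosa.load('sample.wav', sr=None)

# Mel-frequency cepstral coefficients
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# Chroma features (12 pitch classes)
chroma = librosa.feature.chroma_stft(y=y, sr=sr)

# Spectral contrast across frequency bands
contrast = librosa.feature.spectral_contrast(y=y, sr=sr)

print(mfccs.shape, chroma.shape, contrast.shape)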

How to Run the Test

The operator and the test file can be found in the src/core/operators folder of the codebase. The operator is named audio_vec_embedding.py and the test file is named test_audio_vec_embedding.py.

To run the test, simply run the test file:

python -m unittest test_audio_vec_embedding.py

Object Detection using YOLO (detect_objects.py)

You only look once (YOLO) is a state-of-the-art, real-time object detection system. It is trained on the COCO dataset.

We use the segmentation model of YOLO, YOLOv8-seg (https://docs.ultralytics.com/tasks/segment/#models). A code example of YOLO object detection:

from ultralytics import YOLO

# Load the pre-trained YOLOv8 nano segmentation weights
model = YOLO('yolov8n-seg.pt')
# Run inference and save the annotated image to sample_data/output
result = model.predict('path/to/your/image', save=True, imgsz=1024, conf=0.5, project='sample_data', name='output')

The output image will be saved in the sample_data/output folder and will be named output.png. This image will show bounding boxes around the detected objects along with the segmented areas.

How to Run the Test

The operator and the test file can be found in the src/core/operators folder of the codebase. The operator is named detect_objects.py and the test file is named test_detect_objects.py.

To run the test, simply run the test file:

python -m unittest test_detect_objects.py

This will initiate the test; the YOLO model's .pt file will be downloaded first. After the test runs, you should get an OK message in the terminal indicating that the test ran successfully.

The output image will be saved in the sample_data/output folder and will be named output.png. This image will show bounding boxes around the detected objects along with the segmented areas.

Tesseract OCR Operator (detect_text_in_image_tesseract.py)

To support each language, we need to install a separate Tesseract language package. Right now the operator only supports English and Hindi.

For Linux, you can follow these links to understand how and which packages to install for each language.

To extract text from an image, we pass the image through a Tesseract function like this:

data = pytesseract.image_to_string(image, lang='eng+hin', config='--psm 6 --oem 1')

Here, the config settings give Tesseract more information about the image layout and the OCR engine to use: --psm 6 assumes a single uniform block of text, and --oem 1 selects the LSTM-based engine.
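A minimal, self-contained sketch of this call is shown below; the image path is a placeholder, and the actual operator code may differ.

import pytesseract
from PIL import Image

# Open the input image (placeholder path)
image = Image.open('path/to/your/image.png')

# Extract English and Hindi text using the LSTM engine and single-block page segmentation
data = pytesseract.image_to_string(image, lang='eng+hin', config='--psm 6 --oem 1')
print(data)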

You can take a look at the operator and the test of the operator for the entire code.

How to Run the Test

The operator and the test file can be found in the src/core/operators folder of the codebase. The operator is named detect_text_in_image_tesseract.py and the test file is named test_detect_text_in_image_tesseract.py.

To run the test, simply run the test file:

python -m unittest test_detect_text_in_image_tesseract.py

The test will check whether text was extracted correctly; it fetches a sample image from the sample_data folder. You should get an OK message in the terminal indicating that the test ran successfully.

Audio Embedding Operator (LAION CLAP Model) (audio_vec_embedding_clap.py)

The LAION CLAP (Contrastive Language-Audio Pretraining) model is a sophisticated language-audio model trained using contrastive learning. This approach allows the model to learn a joint representation of audio and text modalities, enabling seamless interaction between the two.

Architecture Overview:

  • Audio Encoder:

    • The audio encoding process is handled by a Hierarchical Token-Semantic Audio Transformer (HTSAT) model, which is composed of four Swin-Transformer blocks.
    • The output of the audio encoder is a 768-dimensional vector, capturing essential audio features.
  • Text Encoder:

    • For text encoding, the model employs the RoBERTa model, which is widely recognized for its robust natural language processing capabilities.
  • Projection Layers:

    • The penultimate layer of the architecture includes two Multi-Layer Perceptron (MLP) layers with ReLU activation. These layers project the audio and text embeddings to a common 512-dimensional space, which serves as the final representation during training.

Audio Data Processing:

  • Input Specifications:
    • Each audio input is 10 seconds long, processed with a hop size of 480 and a window size of 1024.
    • The Short-Time Fourier Transform (STFT) and mel-spectrograms are computed using 64 mel-bins.
    • This preprocessing results in an audio input shape of (T = 1024, F = 64) before it is passed to the audio encoder.
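As an illustration of the architecture described above, here is a minimal sketch of obtaining CLAP audio embeddings with the laion_clap package; the checkpoint choice and 'sample.wav' path are placeholders, and the operator's implementation may differ.

import laion_clap

# Load the CLAP model and download the default pretrained checkpoint
model = laion_clap.CLAP_Module(enable_fusion=False)
model.load_ckpt()

# Compute embeddings for a list of audio files (placeholder path)
audio_embed = model.get_audio_embedding_from_filelist(x=['sample.wav'], use_tensor=False)
print(audio_embed.shape)  # (1, 512)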

How to Run the Test:

The LAION CLAP operator and test files are located in the src/core/operators folder within the codebase. The operator is named audio_vec_embedding_clap.py, and the corresponding test file is test_audio_vec_embedding_clap.py.

To run the test, use the following command:

python -m unittest test_audio_vec_embedding_clap.py

Video Embedding Operator (CLIP VL Transformer) (vid_vec_rep_clip.py and classify_video_zero_shot.py)

CLIP is a multi-modal vision and language model. CLIP uses a ViT-like transformer to get visual features and a causal language model to get the text features. Both the text and visual features are then projected into a latent space of identical dimension.

This approach allows the model to learn a joint representation of video and text modalities, enabling seamless interaction between the two and unlocking abilities like zero-shot classification.

Video Embeddings:

  • I-frames are extracted from each video to find the most important frames.
  • Each frame is encoded using CLIP's ViT to generate a 512-dimensional vector embedding.
  • The frame embeddings are then averaged (mean pooled) to produce the video's vector embedding, as in the sketch below.
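A rough sketch of this embedding step using the Hugging Face transformers CLIP API; the frame paths and checkpoint name are placeholder assumptions, and frame extraction itself is omitted.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained('openai/clip-vit-base-patch32')
processor = CLIPProcessor.from_pretrained('openai/clip-vit-base-patch32')

# Placeholder frame images; in practice these would be the extracted I-frames
frames = [Image.open(p) for p in ['frame_0.png', 'frame_1.png']]

inputs = processor(images=frames, return_tensors='pt')
with torch.no_grad():
    frame_embeddings = model.get_image_features(**inputs)  # (num_frames, 512)

# Mean pool the per-frame embeddings into a single video embedding
video_embedding = frame_embeddings.mean(dim=0)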

Zero-shot Classification:

Zero-shot classification means classifying the input into labels that the model has never seen before.

  • Frames are extracted as mentioned above, and a list of output classes is given (e.g. dog, cat, etc.).
  • Video embeddings are created using the above steps and text embeddings are created out of each of the labels.
  • A similarity score is calculated between the video embedding and each label embedding to determine the most probable label, as in the sketch below.
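A minimal, self-contained sketch of the label-scoring step, again using the transformers CLIP API; the labels, checkpoint name, and the random placeholder video embedding are illustrative assumptions only.

import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained('openai/clip-vit-base-patch32')
processor = CLIPProcessor.from_pretrained('openai/clip-vit-base-patch32')

labels = ['dog', 'cat', 'car']
# In practice this comes from the frame-averaging step above; random placeholder here
video_embedding = torch.randn(512)

text_inputs = processor(text=[f'a photo of a {label}' for label in labels],
                        return_tensors='pt', padding=True)
with torch.no_grad():
    text_embeddings = model.get_text_features(**text_inputs)  # (num_labels, 512)

# Cosine similarity between the video embedding and each label's text embedding
similarity = torch.nn.functional.cosine_similarity(video_embedding.unsqueeze(0), text_embeddings)
print(labels[similarity.argmax().item()])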

How to Run the Test:

The CLIP operators and test files are located in the src/core/operators folder within the codebase. The operators are named vid_vec_rep_clip.py and classify_video_zero_shot.py, and the corresponding test files are test_vid_vec_rep_clip.py and test_classify_video_zero_shot.py.

To run the test, use the following command:

python -m unittest test_vid_vec_rep_clip.py
python -m unittest test_classify_video_zero_shot.py

Dimension Reduction Operator (t-SNE) (dimension_reduction.py)

This operator leverages the t-SNE algorithm. The t-SNE (t-Distributed Stochastic Neighbor Embedding) model is a popular dimensionality reduction technique primarily used for visualizing high-dimensional data in a lower-dimensional space.

Initialization Parameters:

  • n_components: Number of dimensions to reduce the data to. Default is 2, which is common for visualization.
  • perplexity: A parameter influencing the balance between local and global data relationships. Default is 30.
  • learning_rate: Learning rate for the optimization process. Default is 150.
  • n_iter: Number of iterations for the optimization process. Default is 1000.
  • random_state: Seed for random number generation, ensuring reproducibility. Default is 42.
  • method: Algorithm to use for gradient calculation. Default is 'barnes_hut'.

Data Processing:

  • Input Specifications: The input should be a list of dictionaries, where each dictionary contains:

    • payload (str or any identifier): An identifier or label associated with the embedding.
    • embedding (list or numpy array): A 1D array representing a high-dimensional embedding.
    [
     {"payload": "123", "embedding": [1, 2, 3, 4, 5]},
     {"payload": "124", "embedding": [6, 7, 8, 9, 10]}
    ]
    
  • Output Specifications: The output is a list of dictionaries, where each dictionary contains:

    • payload (str or any identifier): The same identifier or label as in the input.
    • reduced_embedding (list): A 1D array of reduced dimensions for each embedding.
    [
     {"payload": "123", "reduced_embedding": [0.1, 0.2]},
     {"payload": "124", "reduced_embedding": [0.3, 0.4]}
    ]
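For illustration, a minimal sketch of this reduction using scikit-learn's TSNE with parameters matching the defaults listed above; the input data here is made up, and the operator's actual code may differ.

import numpy as np
from sklearn.manifold import TSNE

# Made-up input in the format described above
data = [
    {"payload": "123", "embedding": [1, 2, 3, 4, 5]},
    {"payload": "124", "embedding": [6, 7, 8, 9, 10]},
    {"payload": "125", "embedding": [2, 0, 1, 4, 3]},
]
embeddings = np.array([d["embedding"] for d in data], dtype=float)

# perplexity must be smaller than the number of samples, so it is lowered here
tsne = TSNE(n_components=2, perplexity=2, learning_rate=150,
            random_state=42, method='barnes_hut')
reduced = tsne.fit_transform(embeddings)

output = [{"payload": d["payload"], "reduced_embedding": reduced[i].tolist()}
          for i, d in enumerate(data)]
print(output)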
    

How to Run the Test:

The Dimension Reduction operator and test files are located in the src/core/operators folder within the codebase. The operator is named dimension_reduction.py, and the corresponding test file is test_dimension_reduction.py.

To run the test, use the following command:

python -m unittest test_dimension_reduction.py

Extensibility:

The Dimension Reduction operator is designed with an abstract base class, making it extensible for other dimensionality reduction algorithms. Currently, only t-SNE is implemented, but other techniques like PCA (Principal Component Analysis) or UMAP (Uniform Manifold Approximation and Projection) can be added with minimal changes by following the same structure. Simply implement new reduction classes adhering to the DimensionReduction interface, which includes initialize() and run() methods.
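The interface might look roughly like the hypothetical sketch below; aside from the initialize() and run() methods named above, the class and method details here are assumptions, not the actual code.

from abc import ABC, abstractmethod

class DimensionReduction(ABC):
    """Abstract interface that every reduction technique implements."""

    @abstractmethod
    def initialize(self, params: dict):
        """Set up the underlying model with its parameters."""

    @abstractmethod
    def run(self, embeddings):
        """Reduce the embeddings and return the low-dimensional result."""

class UMAPReduction(DimensionReduction):
    """Hypothetical example of plugging in UMAP via the same interface."""

    def initialize(self, params: dict):
        import umap  # assumes the umap-learn package is installed
        self.model = umap.UMAP(**params)

    def run(self, embeddings):
        return self.model.fit_transform(embeddings)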