
# Multilingual PeARS

This repository contains user-friendly code to train one's own document embedding model, using the Fruit Fly Algorithm.

The entire pipeline runs in a single command, but for those who want to know what happens in the background, here is a short summary.

## Training a sentencepiece model

Before doing anything, we need to train a tokeniser to deal with the particular language of interest. We use sentencepiece, a widely used subword tokenisation library, which splits raw text into so-called 'wordpieces' given a (learned) vocabulary of k tokens. Sentencepiece training is done in the spm directory. The result of this process is a sentencepiece model and vocabulary, stored in the spm folder under the language of interest. You can inspect the created vocabulary file to make sure everything looks right.
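
For illustration, here is a minimal sketch of this step using the sentencepiece Python package; the file paths, language code and vocabulary size are placeholders, not the repository's actual settings.

```python
import sentencepiece as spm

lang = "xx"  # hypothetical language code

# Train a sentencepiece model on a raw text corpus for the chosen language.
# Paths and vocab_size are illustrative only.
spm.SentencePieceTrainer.train(
    input=f"spm/{lang}/raw_corpus.txt",
    model_prefix=f"spm/{lang}/{lang}wiki",  # writes {lang}wiki.model and {lang}wiki.vocab
    vocab_size=8000,
    model_type="unigram",
)

# Load the trained model and split raw text into wordpieces.
sp = spm.SentencePieceProcessor(model_file=f"spm/{lang}/{lang}wiki.model")
print(sp.encode("A raw sentence to tokenize.", out_type=str))
```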

## Downloading and pre-processing Wikipedia

Running code from the datasets directory, the system downloads and pre-processes an entire Wikipedia dump, using our trained sentencepiece model. You end up with .sp files in the ./datasets/data/ directory (under the subdirectory for the language of interest), containing the original dump split into documents, cleaned of wiki markup, and tokenized.
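
As a rough sketch of the tokenization part of this step (the dump download and markup cleaning are omitted), assuming one cleaned document per line and hypothetical file paths:

```python
import sentencepiece as spm

# Hypothetical model path; see the sentencepiece step above.
sp = spm.SentencePieceProcessor(model_file="spm/xx/xxwiki.model")

def tokenize_file(in_path: str, out_path: str) -> None:
    """Turn one cleaned dump file (one document per line) into an .sp file of wordpieces."""
    with open(in_path, encoding="utf-8") as fin, open(out_path, "w", encoding="utf-8") as fout:
        for doc in fin:
            pieces = sp.encode(doc.strip(), out_type=str)
            fout.write(" ".join(pieces) + "\n")

tokenize_file("datasets/data/xx/xxwiki-part1.txt", "datasets/data/xx/xxwiki-part1.sp")
```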

## Preparing a Wikipedia training set for further training

The system extracts 50,000 Wiki articles that will be used for tuning the dimensionality reduction, clustering and fruit fly models. For Wikipedia snapshots with more than one dump file, 30,000 articles are taken from the first dump file (which usually covers a wide range of fundamental topics), and 20,000 are randomly selected from the rest of the snapshot.
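
A minimal sketch of how such a split could be assembled, assuming the pre-processed .sp files hold one article per line (the paths and helper names are hypothetical, not the repository's actual code):

```python
import glob
import random

def read_articles(path: str) -> list:
    """Assume one pre-processed article per line in each .sp file."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

sp_files = sorted(glob.glob("datasets/data/xx/*.sp"))    # hypothetical location
train_set = read_articles(sp_files[0])[:30000]           # first dump file: broad, fundamental topics

rest = [a for path in sp_files[1:] for a in read_articles(path)]
train_set += random.sample(rest, min(20000, len(rest)))  # random sample from the rest of the snapshot
print(len(train_set))
```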

## Training the clustering and Fruit Fly algorithms on the Wiki corpus

We use UMAP for dimensionality reduction and Birch for clustering. Both models are trained on the Wiki training set. We then dimensionality-reduce and cluster the Wikipedia data, file by file, using the models we have trained. In the process, we also get an interpretable representation of the UMAP clusters, generating characteristic keywords to describe each cluster. The next and final step is to put the UMAP representations through the Fruit Fly algorithm to produce binary vectors.
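A minimal sketch of the reduction and clustering stage, assuming the training articles have already been vectorized into a document-term matrix (random data stands in for it here, and the hyperparameters are illustrative rather than the tuned values):

```python
import numpy as np
import umap                      # umap-learn
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
X_train = rng.random((2000, 300))        # stand-in for the vectorized Wiki training set

umap_model = umap.UMAP(n_components=16)  # assumed output dimensionality
X_reduced = umap_model.fit_transform(X_train)

birch_model = Birch(n_clusters=None)     # assumed: keep Birch's own subclusters
cluster_labels = birch_model.fit_predict(X_reduced)
```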

The UMAP model is optimized on a purity measure involving Wikipedia categories. For each document in the training set, the nearest neighbours of that document are computed. For each neighbour that is tagged with at least one Wikipedia category also included in the target's categories, the score for that document is incremented by 1. The final score is the mean of all training documents' scores.
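In code, the purity measure described above could look roughly like this, assuming each training document comes with its reduced vector and a set of Wikipedia categories (the neighbourhood size is an assumption):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def purity_score(vectors: np.ndarray, categories: list, k: int = 20) -> float:
    """Mean number of neighbours sharing at least one Wikipedia category with the target.

    `categories` is a list of sets, one set of category names per document.
    """
    nn = NearestNeighbors(n_neighbors=k + 1).fit(vectors)
    _, idx = nn.kneighbors(vectors)            # idx[:, 0] is the document itself
    scores = []
    for i, neighbours in enumerate(idx):
        scores.append(sum(1 for j in neighbours[1:] if categories[i] & categories[j]))
    return float(np.mean(scores))
```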

The Birch model is applied to the dimensionality-reduced vectors from the best UMAP model.

Finally, the Fruit Fly Algorithm is optimized on a precision-at-k measure, using the Birch labels. For each hashed document, its 20 nearest neighbours are computed using the Hamming distance. Precision-at-k is then computed over those 20 neighbours: each neighbour with a Birch label equal to the target's label increases the score by one. As for the UMAP model, the final score is the mean over all training documents' scores.
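
A sketch of that evaluation, assuming `hashes` is a binary matrix of fruit fly hashes (one row per document) and `birch_labels` holds the cluster labels from the previous step; normalizing each document's count by k turns it into a proper precision:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def precision_at_k(hashes: np.ndarray, birch_labels: np.ndarray, k: int = 20) -> float:
    """Mean fraction of the k nearest (Hamming) neighbours sharing the target's Birch label."""
    nn = NearestNeighbors(n_neighbors=k + 1, metric="hamming").fit(hashes)
    _, idx = nn.kneighbors(hashes)             # the first neighbour is the document itself
    scores = [
        np.mean(birch_labels[neigh[1:]] == birch_labels[i])   # matches over k neighbours
        for i, neigh in enumerate(idx)
    ]
    return float(np.mean(scores))
```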

## Applying the pipeline to unseen data

Once all models are trained, we launch the pipeline to process the entire Wikipedia dump, thus testing the system on unseen documents. The result is a set of documents in (locality-sensitive) binary format. For the English Wikipedia, this comes to around 1.4 million hashes.