Basic Topic Modeling with LDA

EdwardRaff edited this page Oct 2, 2016 · 1 revision

# Introduction

Topic modeling is a common and popular method for exploring textual datasets. In particular, the Latent Dirichlet Allocation (LDA) algorithm has been incredibly popular over the years. It is by no means perfect, but it works well enough on most datasets that it has become the standard, and many extensions of it exist. JSAT currently has an implementation of this algorithm that was designed for processing larger collections. This example demonstrates how to use it and how to extract topics from a trained model.

# Code: LDA for the AP dataset

This example runs on a dataset from David Blei, the main inventor of LDA, which was used in the original paper. The dataset can be obtained from http://www.cs.columbia.edu/~blei/lda-c/ap.tgz, so you can do some comparisons on speed and output yourself. Note that this is not an exact replication: JSAT uses a different algorithm for solving LDA, and the code below does not use a stoplist.
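A stoplist simply drops high-frequency function words (like "the" or "said") from the token stream before counting, which keeps them from dominating topics. The sketch below illustrates the idea in plain Java with a tiny invented word list; in JSAT itself you would instead wrap your tokenizer in a `StopWordTokenizer` with a real stoplist.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class StopListDemo
{
    public static void main(String[] args)
    {
        //A tiny invented sample; real stoplists contain a few hundred words
        Set<String> stopWords = Set.of("the", "said", "with", "that", "from");

        List<String> tokens = Arrays.asList("the", "senate", "said", "that", "the", "budget", "passed");

        //keep only tokens that are not stop words
        List<String> kept = tokens.stream()
                .filter(t -> !stopWords.contains(t))
                .collect(Collectors.toList());

        System.out.println(kept); //prints [senate, budget, passed]
    }
}
```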

import java.io.File;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.logging.Level;
import java.util.logging.Logger;
import jsat.DataSet;
import jsat.linear.Vec;
import jsat.text.TextDataLoader;
import jsat.text.tokenizer.NaiveTokenizer;
import jsat.text.tokenizer.StopWordTokenizer;
import jsat.text.tokenizer.Tokenizer;
import jsat.text.topicmodel.OnlineLDAsvi;
import jsat.text.wordweighting.WordCount;
import jsat.text.wordweighting.WordWeighting;
import jsat.utils.IndexTable;
import jsat.utils.SystemInfo;

/**
 * This will give a basic example of using Latent Dirichlet Allocation (LDA) to
 * perform topic modeling. We will use a small dataset from the Associated Press
 * that can be obtained
 * <a href="http://www.cs.columbia.edu/~blei/lda-c/ap.tgz">here</a>. LDA is not
 * an easy optimization problem, and the stochastic implementation in JSAT can
 * lead to some different results between runs and between implementations. You
 * can compare the topics you get from this demo with the topics David Blei gets
 * <a href="http://www.cs.columbia.edu/~blei/lda-c/ap-topics.pdf">here</a> using
 * a different batch version of this algorithm.
 *
 * @author Edward Raff
 */
public class LatentDirichletAllocationAP
{

    /**
     * This class would normally be defined in another file. It extends the
     * TextDataLoader abstract class, which gives us a framework for loading
     * text datasets and converting them into bag-of-word style feature vectors.
     *
     * Look at the wikipedia example for more information about how the
     * TextLoaders work.
     */
    public static class APLoader extends TextDataLoader
    {
        File apFile;
        public APLoader(File apFile, Tokenizer tokenizer, WordWeighting weighting)
        {
            super(tokenizer, weighting);
            this.apFile = apFile;
        }

        @Override
        public void initialLoad()
        {
            try
            {
                //the AP data is stored as one document per line. This is just a heuristic extraction of what we want
                List<String> lines = Files.readAllLines(apFile.toPath(), Charset.defaultCharset());
                for(int i = 1; i < lines.size(); i++)
                {
                    if(lines.get(i-1).trim().startsWith("<TEXT>"))
                        addOriginalDocument(lines.get(i).trim());
                }
            }
            catch (IOException ex)
            {
                Logger.getLogger(LatentDirichletAllocationAP.class.getName()).log(Level.SEVERE, null, ex);
            }
        }
        
    }

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) throws IOException
    {
        //you will need to replace this path with wherever you store your copy of the dataset
        File apFile = new File("ap.txt");
        /*
         * We can use a simple tokenization algorithm for this dataset. In 
         * general, LDA can be sensitive to the vocabulary used. It is best 
         * practice to use a stop-word list, which JSAT makes easy with the 
         * StopWordTokenizer class. Since stopword lists are generally 
         * copyrighted, we don't include one in this example. 
         */
        NaiveTokenizer tokenizer = new NaiveTokenizer(true);
        tokenizer.setMinTokenLength(4);
        tokenizer.setNoDigits(true);
        
        //the LDA algorithm explicitly expects the WordCount weighting scheme
        WordWeighting wordweighting = new WordCount();
        
        //We create our dataset loader here, giving it the file path, tokenizer, and weighting scheme
        TextDataLoader apLoader = new APLoader(apFile, tokenizer, wordweighting);
        
        DataSet data = apLoader.getDataSet();//and the loader takes care of the rest!
        
        //we will use multiple-cores to process this data
        ExecutorService ex = Executors.newFixedThreadPool(SystemInfo.LogicalCores);
        
        /**
         * K is the number of topics we would like to discover. 
         * Try adjusting it and see how the results change.
         */
        int K = 100;
        OnlineLDAsvi lda = new OnlineLDAsvi();
        //the parameters set here are generally decent defaults. The OnlineLDAsvi
        //documentation includes a table of values to test that are recommended
        //by the original paper
        lda.setAlpha(1.0/K);
        lda.setEta(1.0/K);
        lda.setKappa(0.6);
        lda.setMiniBatchSize(256);
        //Because this is a small dataset, and this algorithm is intended for
        //larger corpora, we will do more than one epoch and set tau0 to a
        //small value
        lda.setEpochs(10);
        lda.setTau0(1);
        lda.model(data, K, ex);
        
        /*
         * Now we can loop through the topics LDA found and print out the top 10
         * words for each topic. The TextLoader class kept track of the index 
         * for each word, so we can print out something human readable. 
         */
        for(int k = 0; k < lda.getK(); k++)
        {
            Vec topic_k = lda.getTopicVec(k);
            /*
             * The topic array will have the ordering of the words. We will use
             * the IndexTable to get the ordering of the indices with the
             * largest coefficient values. These will be the most important
             * words for that topic.
             */
            IndexTable it = new IndexTable(topic_k.arrayCopy());
            it.reverse();
            System.out.print("Topic " + k + ": ");
            for(int i = 0; i < 10; i++)
            {
                int indx = it.index(i);
                System.out.print(apLoader.getWordForIndex(indx) + ", ");
            }
            System.out.println("\n");
        }
        
        /**
         * Now we should have a bunch of topics to look at! The topics have no
         * particular order, so you may see something slightly different.
         * 
         * Some topics will contain just common words; these are the sort of
         * thing a stopword list would help avoid. For example, I obtained:
         * Topic 1: said, with, that, after, from, they, were, when, year, years, 
         * Topic 93: that, more, will, have, than, this, people, most, only, years, 
         * 
         * Looking at the top 10 words, it's generally pretty easy to figure out
         * what a topic is about, and most should be reasonable. LDA does not
         * understand correlation between topics, so you may see some that seem
         * like repeat topics. For example, some interesting topics I get are
         * below:
         * Topic 3: gorbachev, soviet, government, moscow, communist, republic, news, union, which, official, 
         * Topic 7: offer, stock, share, takeover, acquisition, shareholders, hostile, acquire, ward, directors, 
         * Topic 8: with, drug, were, cocaine, drugs, said, enforcement, jury, arrested, grand, 
         * Topic 17: north, south, korea, korean, seoul, motion, refugee, identity, mobile, successfully, 
         * Topic 43: germany, west, german, border, poland, berlin, western, agency, czechoslovakia, polish, 
         * Topic 51: bush, budget, congress, president, billion, administration, deficit, spending, house, congressional, 
         * Topic 60: saudi, arabia, smoking, opec, continental, tobacco, appropriations, coal, transportation, parker, 
         * Topic 78: workers, union, contract, employees, strike, claims, jobs, worker, work, contracts, 
         * Topic 99: women, club, committees, amendment, cable, fees, discrimination, status, court, petition, 
         */
        
        ex.shutdown();//clean up
    }
    
}
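The topic-printing loop above relies on JSAT's `IndexTable` to order the word indices of a topic vector by weight. The same top-k selection can be sketched in dependency-free Java; the topic weights and five-word vocabulary below are made up purely for illustration.

```java
import java.util.Comparator;
import java.util.stream.IntStream;

public class TopWordsDemo
{
    public static void main(String[] args)
    {
        //Hypothetical topic weights over a 5-word vocabulary
        double[] topic = {0.05, 0.40, 0.10, 0.30, 0.15};
        String[] vocab = {"said", "soviet", "year", "moscow", "people"};

        //Order word indices by descending weight, which is what
        //IndexTable followed by reverse() accomplishes, then keep the top 3
        int[] top = IntStream.range(0, topic.length)
                .boxed()
                .sorted(Comparator.comparingDouble((Integer i) -> -topic[i]))
                .mapToInt(Integer::intValue)
                .limit(3)
                .toArray();

        for (int idx : top)
            System.out.print(vocab[idx] + ", ");
        //prints: soviet, moscow, people,
    }
}
```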