# LM-LSS Notes
## Chap 0
A sequence of characters is a document.
A set of documents is a corpus, which we call \(D\).
If we sample documents, each \(n_L\) tokens long, from a vocabulary of \(n_V\) words, then
there are \(n_V^{n_L}\) possible documents, so a fully general representation would need that many dimensions.
This shows that text is high-dimensional data.
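For a concrete sense of scale: with a modest vocabulary of \(n_V = 10^4\) words and documents of \(n_L = 100\) tokens, that is \((10^4)^{100} = 10^{400}\) possible documents, far beyond anything we could enumerate or model directly.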
Tokenization and dimension reduction address this problem: we throw away the useless
information and transform the unstructured corpus \(D\) into a useful matrix \(X\).
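A minimal sketch of the \(D \to X\) step, assuming scikit-learn is available (the toy corpus below is made up for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus D: a list of raw document strings (hypothetical examples).
D = [
    "the worker filed a complaint",
    "the employer rejected the complaint",
    "workers and employers negotiated a new contract",
]

# Tokenize and reduce dimension in one step: keep at most 1000 terms,
# drop common stop words, and weight the rest by TF-IDF.
vectorizer = TfidfVectorizer(max_features=1000, stop_words="english")
X = vectorizer.fit_transform(D)  # sparse matrix of shape (n_docs, n_terms)

print(X.shape)
print(vectorizer.get_feature_names_out())
```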
Modern NLP handles local linguistic structure well: for example, we can identify
which part of a sentence a pronoun refers to.
Dictionary-based method: we use regular expressions to capture a predefined list of
words when analysing the corpus.
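A minimal sketch of this approach, assuming a hand-written word list (the dictionary and example sentence are hypothetical):

```python
import re

# Hypothetical pre-defined list of words to capture.
dictionary = ["strike", "wage", "union", "layoff"]

# One alternation pattern with word boundaries, matched case-insensitively,
# so "wage" does not fire inside "wages".
pattern = re.compile(
    r"\b(" + "|".join(map(re.escape, dictionary)) + r")\b", re.IGNORECASE
)

def count_hits(document: str) -> dict:
    """Count how often each dictionary word appears in a document."""
    counts = {word: 0 for word in dictionary}
    for match in pattern.findall(document):
        counts[match.lower()] += 1
    return counts

print(count_hits("The union called a strike over wage cuts; the Union agreed."))
# {'strike': 1, 'wage': 1, 'union': 2, 'layoff': 0}
```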
## Chap 1
LDA is short for Latent Dirichlet Allocation, an unsupervised model that captures the
topics in a collection of documents. The input is a set of documents, each represented as a
bag of words, i.e. a document-term count matrix (standard LDA expects raw counts rather than
TF-IDF weights). The output is a set of topics, each represented as a probability distribution
over words, together with each document's mixture of topics.
We can then compute the similarity between documents as the cosine distance between their topic representations:
for example, document \(A\) contains 30% topic \(a\) and 70% topic \(b\); document \(B\) contains ... etc.
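A minimal LDA sketch using scikit-learn (the corpus and the choice of two topics are made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus with two rough themes: labor and elections.
docs = [
    "workers went on strike over wages",
    "the employer cut wages and benefits",
    "the election results were announced today",
    "voters went to the polls in the election",
]

# Standard LDA takes raw word counts, not TF-IDF weights.
counts = CountVectorizer(stop_words="english").fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta = lda.fit_transform(counts)  # (n_docs, n_topics): topic shares per doc
# lda.components_ has shape (n_topics, n_vocab): each row is an
# (unnormalized) word distribution defining one topic.

# Cosine similarity between two documents in topic space.
print(cosine_similarity(theta[0:1], theta[1:2]))
```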
We can also use a pretrained Sentence-BERT (S-BERT) model to get a vector for each sentence, then
apply a clustering technique to those vectors to obtain topics. The advantages are that embeddings
capture semantic relatedness (in LDA, "worker" and "employer" are two independent words), and that
every individual sentence gets a topic. Since clustering assigns exactly one topic per sentence while
LDA models each document as a mixture of topics, LDA is more suitable for long documents.
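A minimal sketch of the embedding route, assuming the sentence-transformers package and a small pretrained model (the model name, sentences, and cluster count are illustrative assumptions):

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

sentences = [
    "Workers demanded higher wages.",
    "The employer rejected the pay raise.",
    "The election was held on Tuesday.",
    "Voters lined up at polling stations.",
]

# Encode each sentence as a dense vector; related words such as
# "worker" and "employer" end up with nearby embeddings.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(sentences)

# Cluster the sentence vectors; each cluster is treated as one topic,
# so every sentence receives exactly one topic label.
labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(embeddings)
print(labels)  # e.g. [0 0 1 1]
```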
Once we have a topic, we can sample some sentences from it and ask ChatGPT to
summarize those sentences into a topic label.
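A hedged sketch of that last step, assuming the openai Python client with an API key in the environment; the model name, prompt wording, and sample size are assumptions, not part of the notes:

```python
import random
from openai import OpenAI

# Hypothetical: sentences previously assigned to one topic cluster.
topic_sentences = [
    "Workers demanded higher wages.",
    "The employer rejected the pay raise.",
    "The union threatened a strike.",
]

# Sample a few sentences from the topic and ask the model for a label.
sample = random.sample(topic_sentences, k=2)
client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model; any chat model would do
    messages=[{
        "role": "user",
        "content": "Summarize the common topic of these sentences "
                   "in a short label:\n" + "\n".join(sample),
    }],
)
print(response.choices[0].message.content)
```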