Values close to 1 represent very similar words, while values close to 0 represent very dissimilar words. Animation of the topic detection process in a document-word matrix.

Every column corresponds to a document, every row to a word. LSA groups both documents that use similar words and words that occur in a similar set of documents. The resulting patterns are used to detect latent components. Such a matrix is also common to standard semantic models, though it is not necessarily expressed explicitly as a matrix, since the mathematical properties of matrices are not always used.
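As a rough illustration of the layout described above, the following sketch builds a small word-by-document count matrix. The four documents are made up for this example, and raw counts are used only for simplicity; other weightings are possible.

```python
from collections import Counter

# Hypothetical toy corpus; the documents are invented for illustration.
docs = [
    "car engine",
    "automobile engine",
    "flower petal",
    "flower",
]

# Count the words of each document separately.
counts = [Counter(d.split()) for d in docs]

# Sorted vocabulary, so that row indices are reproducible.
vocab = sorted(set(w for c in counts for w in c))

# matrix[i][j] = number of times word i occurs in document j
# (every row is a word, every column is a document).
matrix = [[c[w] for c in counts] for w in vocab]

for word, row in zip(vocab, matrix):
    print(f"{word:<11}{row}")
```

Note that the rows for "car" and "automobile" share no non-zero column here, so a plain comparison of these two rows would treat the words as unrelated.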

Rank lowering mitigates the problem of identifying synonymy, as it is expected to merge the dimensions associated with terms that have similar meanings. It also partially mitigates the problem of polysemy: components of a polysemous word that point in the direction of the intended sense are reinforced by the components of words that share that sense. Conversely, components that point in other directions tend either to cancel out or, at worst, to be smaller than components in the directions corresponding to the intended sense. Among all matrices of the same reduced rank, this approximation has the minimal error (in the Frobenius norm). More importantly, the term and document vectors can now be treated as a “semantic space”. The new dimensions do not relate to any comprehensible concepts. They are a lower-dimensional approximation of the higher-dimensional space.
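The effect of rank lowering on synonyms can be sketched with a truncated singular value decomposition. This is a minimal example, assuming NumPy is available; the count matrix, the word list, and the choice k = 2 are all hypothetical.

```python
import numpy as np

# Hypothetical word-by-document counts (rows: words, columns: documents).
words = ["automobile", "car", "engine", "flower", "petal"]
X = np.array([
    [0, 1, 0, 0],   # automobile
    [1, 0, 0, 0],   # car
    [1, 1, 0, 0],   # engine
    [0, 0, 1, 1],   # flower
    [0, 0, 1, 0],   # petal
], dtype=float)

k = 2  # number of latent dimensions to keep (an arbitrary choice here)

# Thin SVD: X = U @ diag(s) @ Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Rank-k approximation: keep only the k largest singular values.
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Frobenius-norm reconstruction error of the approximation.
error = np.linalg.norm(X - X_k)

print(np.round(X_k, 2))
print("error:", round(float(error), 4))
```

In the original matrix, "car" has weight 0 in the second document, which only mentions "automobile". In the rank-2 approximation of this toy matrix the rows for "car" and "automobile" become identical, so the two words are effectively merged. The error equals the square root of the sum of the squared singular values that were discarded.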

Document and term vector representations can be clustered using traditional clustering algorithms such as k-means, with similarity measures such as cosine similarity. Given a query, view it as a mini document and compare it to your documents in the low-dimensional space. To do so, you must first translate the query into the low-dimensional space. Synonymy is the phenomenon where different words describe the same idea. Thus, a query in a search engine may fail to retrieve a relevant document that does not contain the words that appear in the query.
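The query translation described above can be sketched as follows. This is a minimal example, assuming NumPy and a hypothetical toy matrix; the helper names `fold_in` and `cosine` are made up for illustration. The query is mapped with the usual folding-in formula, q_hat = S_k^-1 U_k^T q, which maps each original document column onto its own row of V_k.

```python
import numpy as np

# Hypothetical word-by-document counts (rows: words, columns: documents).
words = ["automobile", "car", "engine", "flower", "petal"]
X = np.array([
    [0, 1, 0, 0],   # automobile
    [1, 0, 0, 0],   # car
    [1, 1, 0, 0],   # engine
    [0, 0, 1, 1],   # flower
    [0, 0, 1, 0],   # petal
], dtype=float)

k = 2  # number of latent dimensions (an arbitrary choice here)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_k, s_k = U[:, :k], s[:k]
doc_vecs = Vt[:k, :].T  # one k-dimensional row per document


def fold_in(query_words):
    """Translate a bag of query words into the k-dimensional space."""
    q = np.array([query_words.count(w) for w in words], dtype=float)
    return (U_k.T @ q) / s_k


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


q_hat = fold_in(["car"])
scores = [cosine(q_hat, d) for d in doc_vecs]

# Rank documents by similarity to the query, best match first.
for j in sorted(range(len(scores)), key=lambda j: -scores[j]):
    print(f"document {j}: {scores[j]:.3f}")
```

In this toy setup, the query "car" scores close to 1 against the second document even though that document contains only "automobile" and "engine", while the two flower documents score close to 0. The same `doc_vecs` rows could also be passed to a clustering algorithm such as k-means.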