Sunday, June 25, 2023

What Is the tsvector Data Type? - It Splits Text into Lexemes (Words) - Used for Full Text Search

Introduction
The explanation is as follows. It stores a text as words. In other words, tsvector splits the text into lexemes.
The `tsvector` data type in PostgreSQL represents a document as a sorted list of distinct lexemes (normalized words), optionally along with their positions and weights. It is usually created using the `to_tsvector` function, which takes an optional configuration name (specifying the language and text processing rules) and a text value as input.
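As a minimal sketch of the description above, the following query passes the built-in `english` configuration and a sample sentence to `to_tsvector`. The sentence and the expected result are the ones used in the PostgreSQL documentation. The second query, which attaches a weight with `setweight`, is an illustrative assumption and not part of the quoted text.

```sql
-- Convert raw text to a tsvector using the built-in 'english' configuration.
-- Stop words (a, on, it) are dropped, 'rats' is reduced to 'rat',
-- and each surviving lexeme keeps the positions where it occurred.
SELECT to_tsvector('english', 'a fat  cat sat on a mat - it ate a fat rats');
-- 'ate':9 'cat':3 'fat':2,8 'mat':7 'rat':10 'sat':4

-- Label every position in the vector with weight A (the highest of A, B, C, D).
SELECT setweight(to_tsvector('english', 'fat cat'), 'A');
```

If the configuration name is omitted, `to_tsvector` falls back to the `default_text_search_config` setting.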
The explanation is as follows. The word tsvector stands for text search vector.
tsvector is a special data type that stores text in the document format. tsvector stands for text search vector. We can use the to_tsvector function to convert arbitrary text to a tsvector, much as we typecast other data types.
The explanation of the document format is as follows.
A document is the unit of searching in a full text search system; for example, a magazine article or email message. The text search engine must be able to parse documents and store associations of lexemes (key words) with their parent document. Later, these associations are used to search for documents that contain query words.
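The search step described above can be sketched with the `@@` match operator, which tests a tsvector against a tsquery. The sample sentence and the query terms are assumptions chosen for illustration.

```sql
-- Does the document contain both 'cat' and 'rat'?
SELECT to_tsvector('english', 'a fat cat sat on a mat and ate a fat rat')
       @@ to_tsquery('english', 'cat & rat') AS matches;
-- matches = t

-- A query word that is missing from the document makes the match fail.
SELECT to_tsvector('english', 'a fat cat sat on a mat and ate a fat rat')
       @@ to_tsquery('english', 'cat & dog') AS matches;
-- matches = f
```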
lexeme vs stem
The explanation is as follows. I think a lexeme represents a set of word forms. The base form of that set is called the lemma.
A "lexeme" is a theoretical thing, a unit in the mental lexicon. You can think of it as being an entire dictionary entry, but in our mental knowledge bank of what words mean rather than a physical book.

A "stem" is a practical thing: it's the part of a word that you stick affixes onto. The stem of play, playing, plays, played, etc is play-. In English the stem usually looks like an actual word, but it doesn't have to be: in Latin, the root of the Latin words amīcus, amīcī, amīcum, amīcō, etc is amīc-, which isn't a valid Latin word on its own. So you'll sometimes find the word "lemma" used to mean "the stem, with some default affix attached to make it a real word" (in Latin, that would be amīcus).

The concept of a lexeme is pretty standard across languages. No matter what language you speak, you have some sort of mental understanding of what words mean. But the concept of a stem is very useful in some languages and nigh useless in others. It all depends how much the language uses affixes.
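To see how this plays out in PostgreSQL, here is a small sketch that assumes the built-in `english_stem` Snowball dictionary and the `english` configuration. It reuses the play / playing / plays / played example from the quote. In PostgreSQL's terminology, the stemmed form stored in the tsvector is what is called a lexeme.

```sql
-- Ask the stemming dictionary directly what a single token is reduced to.
SELECT ts_lexize('english_stem', 'playing');
-- {play}

-- All four word forms collapse into one lexeme, which keeps all four positions.
SELECT to_tsvector('english', 'play playing plays played');
-- 'play':1,2,3,4
```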
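Example
We run the following and get this output. Note that a plain cast only splits, de-duplicates and sorts the words; it does not remove stop words or stem them.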
SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector;

'a' 'and' 'ate' 'cat' 'fat' 'mat' 'on' 'rat' 'sat'
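For comparison, here is a sketch of the same sentence passed through `to_tsvector` instead of a plain cast. The `english` configuration is an assumption; any installed configuration could be used.

```sql
-- Unlike the cast, to_tsvector normalizes the text:
-- stop words (a, on, and) are removed and word positions are recorded.
SELECT to_tsvector('english', 'a fat cat sat on a mat and ate a fat rat');
-- 'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4
```

The cast is useful when the input is already normalized. For raw text, `to_tsvector` is normally the right choice.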
