PostgreSQL Çorbası: pg_trgm Module - Pattern Matching Full Text Search İçindir

Giriş
Pattern matching veya full text search başlığı altına girer. Açıklaması şöyle. trgm kısaltması trigrams anlamına gelir.

The pg_trgm module provides functions and operators for determining the similarity of alphanumeric text based on trigram matching

trigram için açıklaması şöyle. Yani 3 karakter anlamına gelir.

Postgres provides a module called pg_trgm which ships different functions to work with trigrams matching. A trigram is a group of three consecutive characters taken from a string. Based on this we can measure the similarity of two strings by counting the number of trigrams they share and estimating how similar they are on a scale between 0 and 1. For example, the string “hello” would be represented by the following set of trigrams:

“ h”, “ he”, “hel”, “ell”, “llo”, “lo “

In order to make use of this module we need to enable it using the following statement:

CREATE EXTENSION pg_trgm;

pg_trm module için açıklama şöyle. Yani diğer araçlar kadar gelişmiş olmayabilir

PostgreSQL has a lot of advanced extensions and indexes to speed up the same. But one should keep in mind that these have their own limitations in terms of speed and might not work with languages other than English/multi-byte encoding, such as UTF-8. However, these fill the critical gap where the functionalities mentioned are necessary, but we cannot go for a full-fledged search solution such as Solr/Elasticsearch.

Kurulum

Şöyle yaparız

CREATE EXTENSION pg_trgm;

tsvector Nedir?

tsvector Veri Yapısı Nedir? yazısına bakabilirsiniz

to_tsvector Metodu - Text to tsvector Type

to_tsvector metodu yazısına bakabilirsiniz

tsquery Nedir? - Searching Against the Vector

Örnek

Şöyle yaparız ve çıktı olarak şunu alırız

SELECT to_tsvector('the cat got scared by a cucumber') 
@@to_tsquery('scares')

true

Açıklaması şöyle

By now, we know that the tsvector is not a standard text column, and there is a lot of pre-processing that happens for easier search. We need to use a particular function called tsquery to search against the tsvector column.

The output as we see is simply true or false. In this case, note that we searched for the word scares, but still, the result was true. That was because of the normalization we saw earlier. The @@ operator is used for direct matches. We have other operators that perform more sophisticated matches.

Örnek

Şöyle yaparız

SELECT episode_narrative FROM storm_events WHERE
  episode_narrative_tsv @@ to_tsquery('logan & international & airport')

GIST Index Yaratma - Generalized Search Tree

Örnek

Şöyle yaparız

CREATE TABLE IF NOT EXISTS PEOPLE (
  ID VARCHAR(36) PRIMARY KEY,
  NICKNAME VARCHAR(32) UNIQUE NOT NULL,
  NAME VARCHAR(100) NOT NULL,
  BIRTHDATE VARCHAR(10) NOT NULL,
  STACK TEXT[],
  SEARCH_TRGM TEXT
);

CREATE EXTENSION IF NOT EXISTS PG_TRGM;
CREATE INDEX CONCURRENTLY IF NOT EXISTS IDX_PEAPLE_SEARCH_TGRM 
ON PEOPLE USING GIST (SEARCH_TRGM GIST_TRGM_OPS);

GIN Index Yaratma - Generalized Inverted Index

gin biraz daha hızlı bir Full Text Search index galiba.

Örnek

Şöyle yaparız

CREATE INDEX trgm_idx on storm_events USING gin(event_type GIN_TGRM_OPS);

Örnek - tsvector alanına göre index

Şöyle yaparız

ALTER TABLE storm_events ADD COLUMN episode_narrative_tsv tsvector;
    
UPDATE storm_events SET episode_narrative_tsv = to_tsvector(episode_narrative);

CREATE INDEX tsv_idx ON storm_events USING gin(episode_narrative_tsv);

operator <->

Örnek
Şöyle yaparız.

SELECT t, 'super cool' <-> t AS dist
  FROM test_trgm
  ORDER BY dist DESC LIMIT 10;

similarity metodu
Açıklaması şöyle

The SIMILARITY function returns what is called a score. We can adjust this score according to the matches we need. The default score is 0.3 And if we are not worried about the score, then we can use the below shorthand.

SELECT * FROM storm_events WHERE event_type % 'Drot'

Örnek

Tablomuzda şöyle bir veri olsun

- Hello
? Hello
| Hello
$ Hello
! Hello
!? Hello

Şöyle yaparız.

SELECT my_column, similarity('$ Hello', my_column) AS sml
FROM my_table
WHERE my_column % '$ Hello'
ORDER BY sml DESC, my_column;

Çıktı olarak 1 alırız.

show_trgm metodu
Örnek
Şöyle yaparız.

SELECT show_trgm('$ Hello');

            show_trgm            
---------------------------------
 {"  h"," he",ell,hel,llo,"lo "}
(1 row)

SELECT show_trgm('- Hello');

            show_trgm            
---------------------------------
 {"  h"," he",ell,hel,llo,"lo "}
(1 row)

word_similarity metodu
Örnek ver

strict_word_similarity metodu
Örnek ver

show_limit metodu
Örnek ver

show_limit metodu
Örnek ver

STRING_TO_ARRAY metodu

Elimizde şöyle bir tablo olsun

CREATE SEQUENCE todo_id_seq
  START WITH 1
  INCREMENT BY 10
  NO MINVALUE
  NO MAXVALUE CACHE 1;

CREATE TABLE t_todo
(
  id         BIGINT                 NOT NULL DEFAULT nextval('todo_id_seq'),
  ext_ref    CHARACTER VARYING(255) NOT NULL,
  title      CHARACTER VARYING(255) NOT NULL,
  active     BOOLEAN                NOT NULL DEFAULT TRUE,
  created_at TIMESTAMP              NOT NULL DEFAULT CURRENT_TIMESTAMP,
  updated_at TIMESTAMP              NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (id),
  UNIQUE (ext_ref),
  UNIQUE (title)
);

Şöyle yaparız

@Repository
public interface ToDoRepository extends JpaRepository<ToDo, Long> {

  @Query(value = "SELECT * FROM t_todo todo WHERE ?1 %
                  ANY(STRING_TO_ARRAY(todo.title,' '))", 
         nativeQuery = true)
 List<ToDo> findAllMatching(String partialTitle);
}

Açıklaması şöyle

One more thing…the % operator uses the default similarity threshold having a value of 0.3 but if we need to adjust it (e.g 0.5) we can use the SIMILARITY function:

Bu durumda şöyle yaparız

SELECT * FROM t_todo todo WHERE SIMILARITY(todo.title, 'Clean') > 0.4

2. Root (Kök ve Lexeme) Kelime Arama

Örnek

Elimizde şöyle bir tablo olsun

CREATE TABLE log(
   name TEXT,
   description TEXT
);

INSERT INTO log VALUES('ERROR1','Failed to retreive credentials');
INSERT INTO log VALUES('ERROR2','Fatal error. No records present. Please try again with a
 different value');
INSERT INTO log VALUES('ERROR3','Unable to connect. Credentials missing');

Arama için şöyle yaparız

SELECT * FROM log WHERE to_tsvector(description) @@ to_tsquery('miss');

Açıklaması şöyle

The to_tsvector function converts the values to their lexeme and the to_tsquery function will try to match the words.

3. Levenshtein Matching

Kurulum

Şöyle yaparız

CREATE EXTENSION fuzzystrmatch;
Örnek

Şöyle yaparız

SELECT * FROM storm_events WHERE levenshtein(event_type, 'Drot') < 4

4. Phonetic Match

Açıklaması şöyle

The fuzzystrmatch extension has another cool feature where we can search for text which does not have similar spelling but sounds the same.

Örnek

Şöyle yaparız. Telaffuzu "he" kelimesine benzeyen Heat ve Hail gibi şeyler döndürür

SELECT DISTINCT(event_type) FROM storm_events WHERE DIFFERENCE(event_type, 'he') > 2

PostgreSQL Çorbası

9 Temmuz 2019 Salı

pg_trgm Module - Pattern Matching Full Text Search İçindir

Hiç yorum yok:

Yorum Gönder