Sunday, June 25, 2023

TO_TSVECTOR Method - Text to tsvector Type - For Full Text Search

What is to_tsvector? - Text to tsvector Type
The explanation is as follows
We can use the to_tsvector to convert any arbitrary text to tsvector similar to how we typecast other data types.
The explanation is as follows
The `to_tsvector` function tokenizes the text, removes stop words, applies stemming (reducing words to their root form), and assigns weights to the tokens based on their importance. The resulting `ts_vector` object is a sorted list of lexemes with their respective positions and weights. For example, the text “The quick brown fox” in English might be represented as a `ts_vector` like this: `’brown’:3 ‘fox’:4 ‘quick’:2`. 

Example
We do it like this and get the following output
SELECT to_tsvector('the cat got scared by a cucumber');

'cat':2 'cucumb':7 'got':3 'scare':4
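The quote above also mentions the text search configuration and weights. A small sketch of both (the 'english' configuration and the A weight label are illustrative choices, not part of the original example):
-- explicit configuration: stop words are dropped, remaining words are stemmed
SELECT to_tsvector('english', 'The quick brown fox');
-- 'brown':3 'fox':4 'quick':2

-- weights (A..D) can be attached with setweight, e.g. to rank title matches above body matches
SELECT setweight(to_tsvector('english', 'The quick brown fox'), 'A');
-- 'brown':3A 'fox':4A 'quick':2A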

What Is the tsvector Data Type? - Splits Text into Lexemes (Words) - For Full Text Search

Introduction
The explanation is as follows. It stores a text as words; that is, tsvector splits the text into lexemes.
The `ts_vector` data type in PostgreSQL represents a document as a sequence of lexemes (words) along with their positions and weights. It is created using the `to_tsvector` function, which takes a configuration name (specifying the language and text processing rules) and a text value as input.
The explanation is as follows. The word tsvector stands for text search vector.
tsvector is a particular data type that stores text structure in the document format. tsvector stands for text search vector. We can use the to_tsvector to convert any arbitrary text to tsvector similar to how we typecast other data types.
The explanation of the document format is as follows
A document is the unit of searching in a full text search system; for example, a magazine article or email message. The text search engine must be able to parse documents and store associations of lexemes (key words) with their parent document. Later, these associations are used to search for documents that contain query words.
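To illustrate how these stored lexeme associations are used at query time, a minimal sketch with the @@ match operator (the query word 'cats' is an illustrative choice):
-- 'cats' is stemmed to 'cat', which matches the lexeme stored in the vector
SELECT to_tsvector('english', 'the cat got scared by a cucumber')
       @@ to_tsquery('english', 'cats');
-- returns true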
lexeme vs stem
The explanation is as follows. I think a lexeme represents a set; the base form of that set is called a lemma.
A "lexeme" is a theoretical thing, a unit in the mental lexicon. You can think of it as being an entire dictionary entry, but in our mental knowledge bank of what words mean rather than a physical book.

A "stem" is a practical thing: it's the part of a word that you stick affixes onto. The stem of play, playing, plays, played, etc is play-. In English the stem usually looks like an actual word, but it doesn't have to be: in Latin, the root of the Latin words amīcus, amīcī, amīcum, amīcō, etc is amīc-, which isn't a valid Latin word on its own. So you'll sometimes find the word "lemma" used to mean "the stem, with some default affix attached to make it a real word" (in Latin, that would be amīcus).

The concept of a lexeme is pretty standard across languages. No matter what language you speak, you have some sort of mental understanding of what words mean. But the concept of a stem is very useful in some languages and nigh useless in others. It all depends how much the language uses affixes.
Example
We do it like this and get the following output
SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector;

'a' 'and' 'ate' 'cat' 'fat' 'mat' 'on' 'rat' 'sat'
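For comparison, running the same sentence through to_tsvector instead of the plain ::tsvector cast should drop the stop words, stem the remaining words and keep their positions (a sketch; the exact output can vary with the configuration):
SELECT to_tsvector('english', 'a fat cat sat on a mat and ate a fat rat');

'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4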

Thursday, June 22, 2023

Debezium Connector transforms - route

Introduction
Based on the name of the changed table, it specifies the topic name to which the output is routed.

Example
We do it like this
{
  "name": "pg_user_data-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "tasks.max": "1",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.user": "postgres",
    "database.password": "postgres",
    "database.server.id": "184055",
    "database.server.name": "dbserver2",
    "database.include": "user_data",
    "database.dbname": "user_data",
    "database.history.kafka.bootstrap.servers": "kafka:9092",
    "database.history.kafka.topic": "schema-changes.user_data",
    "transforms": "route",
    "transforms.route.type": "org.apache.kafka.connect.transforms.RegexRouter",
    "transforms.route.regex": "([^.]+)\\.([^.]+)\\.([^.]+)",
    "transforms.route.replacement": "$3"
  }
}
Writes the public.customers.orders table to a topic named orders

Example
We do it like this
"config": {
  "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
  "database.hostname": "localhost",
  "database.port": "5432",
  "database.user": "sample_user",
  "database.password": "sample_pass",
  "database.dbname": "sample_db",
  "database.server.name": "sample_servername",
  "table.include.list": "sample_schema.sample_table",
  "topic.prefix": "sample.topic.prefix",
  "plugin.name": "pgoutput",
  "slot.name": "debezium_slot",
  "schema.include.list": "sample_schema",
  "transforms": "unwrap,reroute_topic",
  
  "transforms.reroute_topic.type": "io.debezium.transforms.ByLogicalTableRouter",
  "transforms.reroute_topic.key.enforce.uniqueness": "false",
  "transforms.reroute_topic.topic.regex": "sample_reroute_source_topic",
  "transforms.reroute_topic.topic.replacement": "sample_reroute_target_topic",
  
  "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
  "transforms.unwrap.drop.tombstones": "false",
  
  "key.converter": "org.apache.kafka.connect.json.JsonConverter",
  "key.converter.schemas.enable": "false",
  "value.converter": "org.apache.kafka.connect.json.JsonConverter",
  "value.converter.schemas.enable": "false",
  "snapshot.mode": "initial",
  "decimal.format": "NUMERIC",
  "json.output.decimal.format": "NUMERIC",
  "decimal.handling.mode": "string"
}
Here the sample_schema.sample_table table is tracked and its changes are written to the topic named sample_reroute_target_topic



JSON ->> Operator - Accesses an Object's Field, Can Be Used in a WHERE Clause

Introduction
The explanation is as follows. This operator can be used with both JSON and JSONB.
When we use ->> operator of JSONB, PostgreSQL can use B-tree or Hash index for processing the operations. ->> operator returns the value of the specified attribute in text format. 

Usage
It has two uses
1. We can access the value of a field of the object as text. The explanation is as follows
Postgres supports JSON type from v9.2. It has added many operators and functions for data manipulation of JSON. The operator -> returns JSON object field by key. The operator ->> returns JSON object field by text.
2. It can be used in a WHERE clause

Example
Suppose we have a table like this and insert some data into it
CREATE TABLE sensor_data (
    id SERIAL NOT NULL PRIMARY KEY,
    data JSON NOT NULL
);

INSERT INTO sensor_data (data)
  VALUES('{ "ip": "J10.3.2.4", "payload": {"temp": "33.5","brightness": "73"}}');
To query the data we do it like this. As output we get the ip values as text.
SELECT data->> 'ip' AS ip FROM sensor_data;
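Since -> returns JSON while ->> returns text, the two can be chained on the same sensor_data table. A small sketch using the row inserted above:
-- '->' keeps the JSON type so it can be chained; the final '->>' returns text
SELECT data -> 'payload' ->> 'temp' AS temp FROM sensor_data;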
Example - = Condition
We do it like this
SELECT orders FROM lunchorders WHERE orders ->> 'order_date' = '2020-12-11';
Example - > Condition
We do it like this
SELECT * FROM lunchorders WHERE  (orders -> 'order_details' ->> 'cost')::numeric > 4.50;
Example - LIKE
We do it like this
SELECT * FROM user_details WHERE details ->> 'alternateContacts' like '%777%';
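The quote in the introduction says PostgreSQL can use a B-tree index with the ->> operator. A minimal sketch of such an expression index on the sensor_data table above (the index name is an illustrative choice):
-- expression index so that equality filters on data ->> 'ip' can use a B-tree index scan
CREATE INDEX idx_sensor_data_ip ON sensor_data ((data ->> 'ip'));

-- a query of this shape can then be served by the index
SELECT * FROM sensor_data WHERE data ->> 'ip' = 'J10.3.2.4';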

JPA
Suppose we have a table like this
CREATE TABLE employee (
  id SERIAL PRIMARY KEY,
  name VARCHAR(255),
  data JSONB
);
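A couple of sample rows for this table, so the department queries below have something to match (the names and department values are made up for illustration):
INSERT INTO employee (name, data)
  VALUES ('Alice', '{"department": "engineering", "level": 3}'),
         ('Bob',   '{"department": "sales"}');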
Let the entity be as follows. Here we store the JSONB data as a String in Java
@Entity
@Getter
@Setter
@Table(name = "employee")
public class Employee {

  @Id
  @GeneratedValue(strategy = GenerationType.IDENTITY)
  private Long id;

  private String name;

  @Column(columnDefinition = "jsonb")
  private String data;
}
In a Spring Boot project, with @Query, i.e. native SQL, we do it like this. The JSON ->> operator is used here.
public interface EmployeeRepository extends JpaRepository<Employee, Long> {

  @Query(value = "SELECT * FROM employee WHERE data->>'department' = ?1", 
         nativeQuery = true)
  List<Employee> findByDepartment(String department);
}
With Specification we do it like this. Here hasDepartment() is a static factory that returns a Specification. Specification enables type-safe queries using the Root and CriteriaQuery objects. Instead of the JSON ->> operator, jsonb_extract_path_text(), which does the same thing, is used here.
@Repository
public interface EmployeeSpecificationRepository extends JpaRepository<Employee, Long>, 
  JpaSpecificationExecutor<Employee> {
}

public class EmployeeSpecification {

  public static Specification<Employee> hasDepartment(String department) {
    return new Specification<Employee>() {
      @Override
      public Predicate toPredicate(Root<Employee> root, CriteriaQuery<?> query,
        CriteriaBuilder cb) {
        return cb.equal(
          cb.function("jsonb_extract_path_text", String.class, root.get("data"), 
            cb.literal("department")),
          department
        );
      }
    };
  }
}
For a Controller that uses both approaches we do it like this
@RestController
public class EmployeeController {

  @Autowired
  private EmployeeRepository employeeRepository;

  @Autowired
  private EmployeeSpecificationRepository employeeSpecificationRepository;

  @GetMapping("/employees/{department}")
  public List<Employee> getByDepartment(@PathVariable String department) {
    return employeeRepository.findByDepartment(department);
  }

  @GetMapping("/employees/Specification/{department}")
  public List<Employee> getBySpecificationDepartment(@PathVariable String department) {
    Specification<Employee> spec = EmployeeSpecification.hasDepartment(department);
    return employeeSpecificationRepository.findAll(spec);
  }
}










Friday, June 9, 2023

TimescaleDB Extension

Introduction
I saw it here for the first time
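A minimal sketch of how the extension is typically enabled and its version checked (assuming TimescaleDB is installed and preloaded on the server):
CREATE EXTENSION IF NOT EXISTS timescaledb;

SELECT extname, extversion FROM pg_extension WHERE extname = 'timescaledb';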

VERSION Method

Example
We do it like this
> SELECT version();
PostgreSQL 9.6.6 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 4.8.4-2ubuntu1~14.04.3) 4.8.4, 64-bit
(1 row)

Wednesday, June 7, 2023

Debezium Connector transforms - unwrap ExtractNewRecordState - Only the Changed Record Arrives

Introduction
These two lines need to be written:
1. "transforms": "unwrap"
2. "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
This way, in the output the before field always comes as null. That is, it looks like this
{
  "source": { ... },
  "before": null,
  "after": { "id": 1, "name": "John Doe", "age": 30 },
  "op": "c",
  "ts_ms": 1654316585000
}
The explanation is as follows
By default, Debezium sends all events in an envelope that includes many pieces of information about the change captured. I’m only interested in reading the changed value here, so the command tells Kafka Connect to keep this information and discard the rest.

Example
We do it like this
"name": "source-productcategory-connector",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "your-host-ip",
        "database.port": "5432",
        "database.user": "user",
        "database.password": "password",
        "database.dbname": "AdventureWorks",
        "plugin.name": "pgoutput",
        "database.server.name": "source",
        "key.converter.schemas.enable": "false",
        "value.converter.schemas.enable": "false",
        "transforms": "unwrap",
        "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
        "value.converter": "org.apache.kafka.connect.json.JsonConverter",
        "key.converter": "org.apache.kafka.connect.json.JsonConverter",
        "table.include.list": "public.factinternetsales_streaming",
        "slot.name" : "dbz_sales_transaction_slot"
    }
Example - add.fields
We do it like this
{
  "name": "postgres-connector",  
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector", 
    "database.hostname": "localhost", 
    "database.port": "5432", 
    "database.user": "postgres",
    "topic.prefix": "postgres", 
    "database.password": "postgres", 
    "database.dbname" : "postgres", 
    "database.server.name": "postgres",
    "slot.name": "debezium",
    "plugin.name": "pgoutput",
    "table.include.list": "public.dbz_test",
    "transforms" : "unwrap",
    "transforms.unwrap.type":"io.debezium.transforms.ExtractNewRecordState",
    "transforms.unwrap.add.fields":"op,table,lsn,source.ts_ms,db",
    "transforms.unwrap.drop.tombstones":"true",
    "transforms.unwrap.delete.handling.mode":"rewrite",
    "drop.tombstones": "true"
  }
}


COLLATE

Introduction
The explanation is as follows
It's worth noting that collations can be customized and specified at different levels, including the database, schema, and individual columns. 
Column Level (within CREATE TABLE)
Example
We do it like this
CREATE TABLE employees (
  id SERIAL PRIMARY KEY,
  name TEXT COLLATE "en_US" -- Specify the collation for this column
);
Example
We do it like this
CREATE TABLE IF NOT EXISTS public.factinternetsales_streaming
(
    productkey bigint,
    customerkey bigint,
    salesterritorykey bigint,
    salesordernumber text COLLATE pg_catalog."default",
    totalproductcost double precision,
    salesamount double precision
)
Here, pg_catalog."default" asks for the collation that was specified when the database was created. If no collation was specified when the database was created, the system's default value is used.
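The quote above also mentions the database level; a collation can likewise be picked per expression. A sketch of both (the locale names are illustrative and must exist on the server):
-- database level: all collatable columns default to this collation
CREATE DATABASE hrdb LC_COLLATE 'tr_TR.utf8' LC_CTYPE 'tr_TR.utf8' TEMPLATE template0;

-- expression level: override the column's collation for a single ORDER BY
SELECT name FROM employees ORDER BY name COLLATE "C";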

Monday, June 5, 2023

Debezium Connector key.converter and value.converter Fields

Example - JsonConverter
We do it like this. Here the key and value fields are written to Kafka as JSON
$ echo '
{
    "name": "arctype-connector",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "tasks.max": "1",
        "plugin.name": "wal2json",
        "database.hostname": "db",
        "database.port": "5432",
        "database.user": "postgres",
        "database.password": "arctype",
        "database.dbname": "postgres",
        "database.server.name": "ARCTYPE",
        "key.converter": "org.apache.kafka.connect.json.JsonConverter",
        "value.converter": "org.apache.kafka.connect.json.JsonConverter",
        "key.converter.schemas.enable": "false",
        "value.converter.schemas.enable": "false",
        "snapshot.mode": "always"
    }
}
' > debezium.json

$ curl -i -X POST \
         -H "Accept:application/json" \
         -H "Content-Type:application/json" \
         127.0.0.1:8083/connectors/ \
         --data "@debezium.json"
Example - StringConverter and AvroConverter
We do it like this. Here StringConverter is used for the key and AvroConverter for the value, and the data is written to Kafka
{
  "name": "postgres-source",
  "config": {"connector.class":"io.debezium.connector.postgresql.PostgresConnector",
    "tasks.max":"1",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.user": "postgres",
    "database.password": "postgres",
    "database.dbname" : "students",
    "database.server.name": "dbserver1",
    "database.whitelist": "students",
    "database.history.kafka.bootstrap.servers": "kafka:9092",
    "database.history.kafka.topic": "schema-changes.students",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "key.converter.schemas.enable": "false",
    "value.converter.schemas.enable": "true",
    "value.converter.schema.registry.url": "http://schema-registry:8081",
    "transforms": "unwrap",
    "transforms.unwrap.type": "io.debezium.transforms.UnwrapFromEnvelope"
  }
}
The explanation for transforms.unwrap.type is as follows
The value of transforms.unwrap.type can be set to various predefined transformation types depending on the desired behavior. Some common values include:

- org.apache.kafka.connect.transforms.UnwrapFromEnvelope: This transformation is used to extract the payload from an envelope structure. It is often used when messages are wrapped in an envelope format, such as Apache Avro's envelope or Confluent's schema registry envelope.

- org.apache.kafka.connect.transforms.ExtractField$Key: This transformation extracts a specific field from the key of the message. It is useful when you want to perform further processing based on a specific key field.

- org.apache.kafka.connect.transforms.ExtractField$Value: This transformation extracts a specific field from the value of the message. It is useful when you want to extract a specific value field for further processing or downstream operations.

Please note that these are just a few examples, and the available transformation types may vary depending on the version of Kafka Connect and the connectors you are using.

When using transforms.unwrap.type, you will typically need to configure additional properties specific to the chosen transformation type, such as transforms.unwrap.field to specify the field to be extracted or manipulated.

Friday, June 2, 2023

Clustered vs Non-clustered indexes

Clustered Index
It can be thought of as roughly corresponding to the Primary Key. The explanation is as follows.
Clustered and non-clustered indexes
The primary key index is also a clustered index, and the non-primary key index is non-clustered. 
Non-Clustered Index
Created with CREATE INDEX
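A minimal sketch (the table and column names are made up for illustration):
-- the primary key index is what the quote above calls the clustered index
CREATE TABLE orders (
  id SERIAL PRIMARY KEY,
  customer_id INT
);

-- a non-clustered (secondary) index is created explicitly with CREATE INDEX
CREATE INDEX idx_orders_customer_id ON orders (customer_id);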

Thursday, June 1, 2023

Materialize - Streaming Database

Introduction
The explanation is as follows
What is Materialize?

Materialize is a streaming database for real-time applications. It allows you to work with streaming data from multiple external sources using nothing but standard SQL.

You write arbitrarily complex queries; Materialize takes care of maintaining the results automatically up to date as new data arrives.

Your applications can query Materialize to get blazing fast results which often fall in the sub-second latency range.
PostgreSQL Compatibility
The explanation is as follows
As a developer, you work with Materialize as if it were a PostgreSQL database. Materialize is wire compatible with Postgres, allowing you to use a CLI tool like psql or mzcli (a wrapper around psql) to issue ANSI-92 compatible SQL statements against it.
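For example, after connecting with psql you can issue ordinary SQL statements; a small sketch (SHOW SOURCES and SHOW VIEWS are Materialize catalog commands, listed here as an assumption):
-- run from psql connected to Materialize
SHOW SOURCES;
SHOW VIEWS;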
CREATE SOURCE
Example
We do it like this
CREATE SOURCE raw_pageviews FROM KAFKA BROKER 'kafka:9092' TOPIC 'page_views' FORMAT BYTES;

// The raw_pageviews source produces data as a single text column containing JSON.
// To extract the JSON fields for each page view event, you can use the built-in
// jsonb operators
CREATE VIEW pageviews AS
  SELECT
    (pageview_data->'user_id')::INT as user_id,
    (pageview_data->'url')::STRING as url,
    (pageview_data->'channel')::STRING as channel,
    TO_TIMESTAMP((pageview_data->'received_at')::INT) AS ts
  FROM (
    SELECT CONVERT_FROM(data, 'utf8')::jsonb AS pageview_data
    FROM raw_pageviews
);
Up to this point we have only created a non-materialized view. Next we do it like this
CREATE MATERIALIZED VIEW page_views_by_channel AS
SELECT channel,
       count(*) as pageview_count
FROM pageviews
GROUP BY channel;

// Reading the output of a materialized view
SELECT * from page_views_by_channel;
CREATE SINK
We do it like this
CREATE SINK dashboard_sink
FROM page_views_by_channel
INTO KAFKA BROKER 'localhost' TOPIC 'frank-quotes-sink'
FORMAT AVRO USING
    CONFLUENT SCHEMA REGISTRY 'http://localhost:8081';
Joining multiple data sources
We do it like this. Here pageviews and users are joined
CREATE MATERIALIZED VIEW pageviews_by_user_segment AS
SELECT
  users.is_vip,       
  pageviews.channel,       
  date_trunc('hour', pageviews.ts) as ts_hour,
  count(*) as pageview_count     
FROM users     
JOIN pageviews ON pageviews.user_id = users.id     
GROUP BY 1,2,3;