25 Haziran 2023 Pazar

TO_TSVECTOR metodu - Text to tsvector Type - Full Text Search İçindir

to_tsvector Nedir? - Text to tsvector Type
Açıklaması şöyle
We can use the to_tsvector to convert any arbitrary text to tsvector similar to how we typecast other data types.
Açıklaması şöyle
The `to_tsvector` function tokenizes the text, removes stop words, applies stemming (reducing words to their root form), and assigns weights to the tokens based on their importance. The resulting `ts_vector` object is a sorted list of lexemes with their respective positions and weights. For example, the text “The quick brown fox” in English might be represented as a `ts_vector` like this: `’brown’:3 ‘fox’:4 ‘quick’:2`. 

Örnek
Şöyle yaparız ve çıktı olarak şunu alırız
SELECT to_tsvector('the cat got scared by a cucumber'); 'car':2 'cucumb':7 'got':3 'scare':4

tsvector Veri Yapısı Nedir? - Metni Lexeme'lere (Kelimelere) Böler - Full Text Search İçindir

Giriş
Açıklaması şöyle. Bir metni kelimeler olarak saklar. Yani tsvector metni Lexeme'lere böler.
The `ts_vector` data type in PostgreSQL represents a document as a sequence of lexemes (words) along with their positions and weights. It is created using the `to_tsvector` function, which takes a configuration name (specifying the language and text processing rules) and a text value as input.
Açıklaması şöyletsvector kelimesi text search vector anlamına gelir.
tsvector is a particular data type that stores text structure in the document format. tsvector stands for text search vector. We can use the to_tsvector to convert any arbitrary text to tsvector similar to how we typecast other data types.
Document format için açıklama şöyle
A document is the unit of searching in a full text search system; for example, a magazine article or email message. The text search engine must be able to parse documents and store associations of lexemes (key words) with their parent document. Later, these associations are used to search for documents that contain query words.
lexeme vs stem
Açıklaması şöyle. Sanırım lexeme bir seti temsil ediyor. Bu setin temel haline ise lemma deniliyor.
A "lexeme" is a theoretical thing, a unit in the mental lexicon. You can think of it as being an entire dictionary entry, but in our mental knowledge bank of what words mean rather than a physical book.

A "stem" is a practical thing: it's the part of a word that you stick affixes onto. The stem of play, playing, plays, played, etc is play-. In English the stem usually looks like an actual word, but it doesn't have to be: in Latin, the root of the Latin words amīcus, amīcī, amīcum, amīcō, etc is amīc-, which isn't a valid Latin word on its own. So you'll sometimes find the word "lemma" used to mean "the stem, with some default affix attached to make it a real word" (in Latin, that would be amīcus).

The concept of a lexeme is pretty standard across languages. No matter what language you speak, you have some sort of mental understanding of what words mean. But the concept of a stem is very useful in some languages and nigh useless in others. It all depends how much the language uses affixes.
Örnek
Şöyle yaparız ve çıktı olarak şunu alırız
SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector;

'a' 'and' 'ate' 'cat' 'fat' 'mat' 'on' 'rat' 'sat'

22 Haziran 2023 Perşembe

Debezium Connector transforms - route

Giriş
Değişen tablo ismine göre çıktının yönlendirileceği topic ismini belirtir

Örnek
Şöyle yaparız
{
  "name": "pg_user_data-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "tasks.max": "1",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.user": "postgres",
    "database.password": "postgres",
    "database.server.id": "184055",
    "database.server.name": "dbserver2",
    "database.include": "user_data",
    "database.dbname": "user_data",
    "database.history.kafka.bootstrap.servers": "kafka:9092",
    "database.history.kafka.topic": "schema-changes.user_data",
    "transforms": "route",
    "transforms.route.type": "org.apache.kafka.connect.transforms.RegexRouter",
    "transforms.route.regex": "([^.]+)\\.([^.]+)\\.([^.]+)",
    "transforms.route.replacement": "$3"
  }
}
public.customers.orders tablosunu orders isimli topic'e yazar

Örnek
Şöyle yaparız
"config": {
  "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
  "database.hostname": "localhost",
  "database.port": "5432",
  "database.user": "sample_user",
  "database.password": "sample_pass",
  "database.dbname": "sample_db",
  "database.server.name": "sample_servername",
  "table.include.list": "sample_schema.sample_table",
  "topic.prefix": "sample.topic.prefix",
  "plugin.name": "pgoutput",
  "slot.name": "debezium_slot",
  "schema.include.list": "sample_schema",
  "transforms": "unwrap,reroute_topic",
  
  "transforms.reroute_topic.type": "io.debezium.transforms.ByLogicalTableRouter",
  "transforms.reroute_topic.key.enforce.uniqueness": "false",
  "transforms.reroute_topic.topic.regex": "sample_reroute_source_topic",
  "transforms.reroute_topic.topic.replacement": "sample_reroute_target_topic",
  
  "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
  "transforms.unwrap.drop.tombstones": "false",
  
  "key.converter": "org.apache.kafka.connect.json.JsonConverter",
  "key.converter.schemas.enable": "false",
  "value.converter": "org.apache.kafka.connect.json.JsonConverter",
  "value.converter.schemas.enable": "false",
  "snapshot.mode": "initial",
  "decimal.format": "NUMERIC",
  "json.output.decimal.format": "NUMERIC",
  "decimal.handling.mode": "string"
}
Burada sample_schema.sample_table tablosu takip ediliyor ve değişiklikler sample_reroute_target_topic isimli topic'e yazılıyor



JSON ->> Operator - Nesnenin Alanına Erişir WHERE Cümlesinde Kullanılabilir

Giriş
Açıklaması şöyle. Bu operatör hem JSON hem de JSONB için kullanılabilir
When we use ->> operator of JSONB, PostgreSQL can use B-tree or Hash index for processing the operations. ->> operator returns the value of the specified attribute in text format. 

Kullanım
İki kullanımı var
1. Nesnenin bir alanının değerine text olarak erişebiliriz. Açıklaması şöyle
Postgres supports JSON type from v9.2. It has added many operators and functions for data manipulation of JSON. The operator -> returns JSON object field by key. The operator ->> returns JSON object field by text.
2. WHERE cümlesinde kullanılabilir

Örnek
Elimizde şöyle bir tablo olsun ve tabloya veri girelim
CREATE TABLE sensor_data (
    id SERIAL NOT NULL PRIMARY KEY,
    data JSON NOT NULL
);

INSERT INTO sensor_data (data)
  VALUES('{ "ip": "J10.3.2.4", "payload": {"temp": "33.5","brightness": "73"}}');
Veriyi sorgulamak için şöyle yaparız. Çıktı olarak ip değerlerini text olarak alırız.
SELECT data->> 'ip' AS ip FROM sensor_data;
Örnek - = Koşulu
Şöyle yaparız
SELECT orders FROM lunchorders WHERE orders ->> 'order_date' = '2020-12-11';
Örnek - >  Koşulu
Şöyle yaparız.
SELECT * FROM lunchorders WHERE  (orders -> 'order_details' ->> 'cost')::numeric > 4.50;
Örnek - LIKE
Şöyle yaparız
SELECT * FROM user_details WHERE details ->> 'alternateContacts' like '%777%';

JPA
Elimizde şöyle bir tablo olsun
CREATE TABLE employee (
  id SERIAL PRIMARY KEY,
  name VARCHAR(255),
  data JSONB
);
Entity şöyle olsun. Burada JSONB verisini Java'da String olarak saklıyoruz
@Entity
@Getter
@Setter
@Table(name = "employee")
public class Employee {

  @Id
  @GeneratedValue(strategy = GenerationType.IDENTITY)
  private Long id;

  private String name;

  @Column(columnDefinition = "jsonb")
  private String data;
}
SpringBoot projesinde @Query yani SQL ile şöyle yaparızJSON ->> Operator kullanılıyor
public interface EmployeeRepository extends JpaRepository<Employee, Long> {

  @Query(value = "SELECT * FROM employee WHERE data->>'department' = ?1", 
         nativeQuery = true)
  List<Employee> findByDepartment(String department);
}
Specification ile şöyle yaparız. Burada hasDepartment() bir static factory ve bir tane Specification döndürülüyor. Specification Root ve CriteriaQuery nesnelerini kullanarak Type Safe Query imkanı tanıyor. Burada JSON ->> Operator yerine aynı şey olan jsonb_extract_path_text() kullanılıyor.
@Repository
public interface EmployeeSpecificationRepository extends JpaRepository<Employee, Long>, 
  JpaSpecificationExecutor<Employee> {
}

public class EmployeeSpecification {

  public static Specification<Employee> hasDepartment(String department) {
    return new Specification<Employee>() {
      @Override
      public Predicate toPredicate(Root<Employee> root, CriteriaQuery<?> query,
        CriteriaBuilder cb) {
        return cb.equal(
          cb.function("jsonb_extract_path_text", String.class, root.get("data"), 
            cb.literal("department")),
          department
        );
      }
    };
  }
}
Her iki yöntemi de kullanan Controller için şöyle yaparız
@RestController
public class EmployeeController {

  @Autowired
  private EmployeeRepository employeeRepository;

  @Autowired
  private EmployeeSpecificationRepository employeeSpecificationRepository;

  @GetMapping("/employees/{department}")
  public List<Employee> getByDepartment(@PathVariable String department) {
    return employeeRepository.findByDepartment(department);
  }

  @GetMapping("/employees/Specification/{department}")
  public List<Employee> getBySpecificationDepartment(@PathVariable String department) {
    Specification<Employee> spec = EmployeeSpecification.hasDepartment(department);
    return employeeSpecificationRepository.findAll(spec);
  }
}










9 Haziran 2023 Cuma

TimescaleDB Extension

Giriş
Açıklaması şöyle
TimescaleDB is a PostgreSQL extension optimized for handling time-series data, making it a powerful tool for data warehousing. Here’s a detailed explanation:

Partitioning
Açıklaması şöyle
TimescaleDB automatically partitions data into chunks based on time intervals. This improves query performance and simplifies data management.
Compression: Built-in compression reduces storage requirements and enhances I/O performance.

VERSION metodu

Örnek
Şöyle yaparız
> SELECT version();                                                
PostgreSQL 9.6.6 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 4.8.4-2ubuntu1~14.04.3) 4.8.4, 64-bit(1 row)

7 Haziran 2023 Çarşamba

Debezium Connector transforms - unwrap ExtractNewRecordState - Sadece Değişen Alan Gelir

Giriş
Şu iki satırı yazmak gerekir.
1. transforms": "unwrap"
2. "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
Böylece çıktıda  before alanı hep null gelir. Yani şöyledir
{
"source": { ... }, "before": null, "after": { "id": 1, "name": "John Doe", "age": 30 }, "op": "c", "ts_ms": 1654316585000 }
Açıklaması şöyle
By default, Debezium sends all events in an envelope that includes many pieces of information about the change captured. I’m only interested in reading the changed value here, so the command tells Kafka Connect to keep this information and discard the rest.

Örnek
Şöyle yaparız
"name": "source-productcategory-connector",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "your-host-ip",
        "database.port": "5432",
        "database.user": "user",
        "database.password": "password",
        "database.dbname": "AdventureWorks",
        "plugin.name": "pgoutput",
        "database.server.name": "source",
        "key.converter.schemas.enable": "false",
        "value.converter.schemas.enable": "false",
        "transforms": "unwrap",
        "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
        "value.converter": "org.apache.kafka.connect.json.JsonConverter",
        "key.converter": "org.apache.kafka.connect.json.JsonConverter",
        "table.include.list": "public.factinternetsales_streaming",
        "slot.name" : "dbz_sales_transaction_slot"
    }
Örnek - add.fields
Şöyle yaparız
{
  "name": "postgres-connector",  
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector", 
    "database.hostname": "localhost", 
    "database.port": "5432", 
    "database.user": "postgres",
    "topic.prefix": "postgres", 
    "database.password": "postgres", 
    "database.dbname" : "postgres", 
    "database.server.name": "postgres",
    "slot.name": "debezium",
    "plugin.name": "pgoutput",
    "table.include.list": "public.dbz_test",
    "transforms" : "unwrap",
    "transforms.unwrap.type":"io.debezium.transforms.ExtractNewRecordState",
    "transforms.unwrap.add.fields":"op,table,lsn,source.ts_ms,db",
    "transforms.unwrap.drop.tombstones":"true",
    "transforms.unwrap.delete.handling.mode":"rewrite",
    "drop.tombstones": "true"
  }


COLLATE

Giriş
Açıklaması şöyle
It's worth noting that collations can be customized and specified at different levels, including the database, schema, and individual columns. 
Tablo Seviyesi
Örnek
Şöyle yaparız
CREATE TABLE employees (
  id SERIAL PRIMARY KEY,
  name TEXT COLLATE "en_US" -- Specify the collation at the table level
);
Örnek
Şöyle yaparız
CREATE TABLE IF NOT EXISTS public.factinternetsales_streaming
(
    productkey bigint,
    customerkey bigint,
    salesterritorykey bigint,
    salesordernumber text COLLATE pg_catalog."default",
    totalproductcost double precision,
    salesamount double precision
)
Burada pg_catalog."default"  ile veri tabanı yaratılırken belirtilen collation kullanılması isteniyor. Eğer veri tabanı yaratılırken collation belirtilmemişse  sistemin varsayılan değeri kullanılır

5 Haziran 2023 Pazartesi

Debezium Connector key.converter ve value.converter Alanları

Örnek - JsonConverter
Şöyle yaparız. Burada key ve value alanları Json olarak Kafka'ya yazılıyor
$ echo '
{
    "name": "arctype-connector",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "tasks.max": "1",
        "plugin.name": "wal2json",
        "database.hostname": "db",
        "database.port": "5432",
        "database.user": "postgres",
        "database.password": "arctype",
        "database.dbname": "postgres",
        "database.server.name": "ARCTYPE",
        "key.converter": "org.apache.kafka.connect.json.JsonConverter",
        "value.converter": "org.apache.kafka.connect.json.JsonConverter",
        "key.converter.schemas.enable": "false",
        "value.converter.schemas.enable": "false",
        "snapshot.mode": "always"
    }
}
' > debezium.json

$ curl -i -X POST \
         -H "Accept:application/json" \
         -H "Content-Type:application/json" \
         127.0.0.1:8083/connectors/ \ccySQL Connector
Örnek - JsonConverter
Şöyle yaparız. Burada key için StringConverter, value için AvroConverter kullanılıyor ve Kafka'ya yazılıyor
{
  "name": "postgres-source",
  "config": {"connector.class":"io.debezium.connector.postgresql.PostgresConnector",
    "tasks.max":"1",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.user": "postgres",
    "database.password": "postgres",
    "database.dbname" : "students",
    "database.server.name": "dbserver1",
    "database.whitelist": "students",
    "database.history.kafka.bootstrap.servers": "kafka:9092",
    "database.history.kafka.topic": "schema-changes.students",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "key.converter.schemas.enable": "false",
    "value.converter.schemas.enable": "true",
    "value.converter.schema.registry.url": "http://schema-registry:8081",
    "transforms": "unwrap",
    "transforms.unwrap.type": "io.debezium.transforms.UnwrapFromEnvelope"
  }
}
transforms.unwrap.type için açıklama şöyle
The value of transforms.unwrap.type can be set to various predefined transformation types depending on the desired behavior. Some common values include:

- org.apache.kafka.connect.transforms.UnwrapFromEnvelope: This transformation is used to extract the payload from an envelope structure. It is often used when messages are wrapped in an envelope format, such as Apache Avro's envelope or Confluent's schema registry envelope.

- org.apache.kafka.connect.transforms.ExtractField$Key: This transformation extracts a specific field from the key of the message. It is useful when you want to perform further processing based on a specific key field.

- org.apache.kafka.connect.transforms.ExtractField$Value: This transformation extracts a specific field from the value of the message. It is useful when you want to extract a specific value field for further processing or downstream operations.

Please note that these are just a few examples, and the available transformation types may vary depending on the version of Kafka Connect and the connectors you are using.

When using transforms.unwrap.type, you will typically need to configure additional properties specific to the chosen transformation type, such as transforms.unwrap.field to specify the field to be extracted or manipulated.

2 Haziran 2023 Cuma

Clustered vs Non-clustered indexes

Clustered Index
Primary Key anlamına gelir gibi düşünülebilir. Açıklaması şöyle.  
Clustered and non-clustered indexes
The primary key index is also a clustered index, and the non-primary key index is non-clustered. 
Bazı veri tabanlarında Clustered Index ile tüm satıra erişilebiliyor. Açıklaması şöyle
... when you lookup the key, and get to a leaf page, you will find the full row with all its columns. This is what is referred to as clustered index. 
Postgres böyle değil. Açıklaması şöyle
All indexes including the primary index in Postgres are secondary, they all point to the row identifier.

The row identifier in Postgres, aka the tuple id, is the unique identifier in Postgres. It is a combination of the page number in the table file and position of the tuple in that page, making it unique.

What is fascinating about the tuple id is its not indexed, it gets you the row directly with a simple, often one, I/O.

Non-Clustered Index
CREATE INDEX ile yaratılır

1 Haziran 2023 Perşembe

Materialize - Streaming Database

Giriş
Açıklaması şöyle
What is Materialize?

Materialize is a streaming database for real-time applications. It allows you to work with streaming data from multiple external sources using nothing but standard SQL.

You write arbitrarily complex queries; Materialize takes care of maintaining the results automatically up to date as new data arrives.

Your applications can query Materialized to get blazing fast results which often falls in the sub-second latency range.
PostgreSQL Uyumluluğu
Açıklaması şöyle
As a developer, you work with Materialize as if it were a PostgreSQL database. Materialize is wire compatible with Postgres, allowing you to use a CLI tool like psql or mzcli (a wrapper around psql) to issue ANSI-92 compatible SQL statements against it.
CREATE SOURCE
Örnek
Şöyle yaparız
CREATE SOURCE raw_pageviews FROM KAFKA BROKER 'kafka:9092' TOPIC 'page_views' FORMAT BYTES;

// The raw_pageviews source produces data as a single text column containing JSON. 
// To extract the JSON fields for each page view event, you can use the built-in // jsonb operators
CREATE VIEW pageviews AS
  SELECT
    (pageview_data->'user_id')::INT as user_id,
    (pageview_data->'url')::STRING as url,
    (pageview_data->'channel')::STRING as channel,
    TO_TIMESTAMP((pageview_data->'received_at')::INT AS ts
  FROM (
    SELECT CONVERT_FROM(data, 'utf8')::jsonb AS pageview_data
    FROM pageviews
);
Buraya kadar sadece non-materialized view yarattık. Sonra şöyle yaparız
CREATE MATERIALIZED VIEW page_views_by_channel AS
SELECT channel,
       count(*) as pageview_count
FROM pageviews
GROUP BY channel;

// Reading the output of a materialized view
SELECT * from page_views_by_channel;
CREATE SINK
Şöyle yaparız
CREATE SINK dashboard_sink
FROM page_views_by_channel
INTO KAFKA BROKER 'localhost' TOPIC 'frank-quotes-sink'
FORMAT AVRO USING
    CONFLUENT SCHEMA REGISTRY 'http://localhost:8081';
Joining multiple data sources
Şöyle yaparız. Burada pageviews ve users birleştiriliyor
CREATE MATERIALIZED VIEW pageviews_by_user_segment AS
SELECT
  users.is_vip,       
  pageviews.channel,       
  date_trunc('hour', pageviews.ts) as ts_hour,
  count(*) as pageview_count     
FROM users     
JOIN pageviews ON pageviews.user_id = users.id     
GROUP BY 1,2,3;