Elasticsearch Tutorial

Jan 19, 2023

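Elasticsearch is a powerful full-text search engine that can be used for:

Website search
Log analysis
Geospatial analysis and processing
And more

To make Elasticsearch easier to work with, we also set up Kibana so that we can operate Elasticsearch through a GUI. For local development, we run everything with Docker.

The docker-compose.yml is shown below. We use version 8.5.3 of Kibana for testing.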

version: "3.9"

services:
  elasticsearch:
    build:
      context: .
      # See the Dockerfile shown further below
      dockerfile: ./docker/es/Dockerfile
    container_name: elasticsearch
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
      - xpack.security.http.ssl.enabled=false
    ports:
      - 9200:9200
      - 9300:9300
    networks:
      - elastic
  kibana:
    image: kibana:8.5.3
    container_name: kibana
    depends_on:
      - elasticsearch
    links:
      - elasticsearch:elasticsearch
    ports:
      - 5601:5601
    volumes:
      # See the kibana.yml configuration shown further below
      - ./config/kibana/kibana.yml:/usr/share/kibana/config/kibana.yml
    networks:
      - elastic

networks:
  elastic:
    driver: bridge

The Elasticsearch Dockerfile is as follows:

FROM elasticsearch:8.5.3

# Install the Chinese analyzer (smartcn)
RUN bin/elasticsearch-plugin install analysis-smartcn

The Kibana configuration file, kibana.yml, is as follows:

# To allow connections from remote users, set this parameter to a non-loopback address.
server.host: "0.0.0.0"
# The URLs of the Elasticsearch instances to use for all your queries.
elasticsearch.hosts: ["http://elasticsearch:9200"]

Once the Docker configuration is in place, start the Elasticsearch and Kibana services with the following command:

docker-compose up -d

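Once the services are up, open http://127.0.0.1:5601/ (the Kibana UI).

From here on, to make working with Elasticsearch easier, we run all DSL requests from Kibana's Dev Tools console.

With the services running, the rest of this post covers basic Elasticsearch syntax and concepts.

Elasticsearch Data Structures

Relational databases have concepts such as database, table, and row. Elasticsearch has its own counterparts:

Index: where documents are stored, similar to a database in an RDBMS
Document: a single record, similar to a row in an RDBMS
Field: a data field of a document, similar to a column in an RDBMS
DSL: the Elasticsearch query syntax, similar to SQL in an RDBMS

Unlike an RDBMS, where a schema must be defined before data can be inserted, Elasticsearch has a Dynamic Mapping feature that infers each field's data type from the data you index. (Think of a mapping as the schema in an RDBMS.)

An index's mapping can be inspected through the _mapping endpoint; it is covered in more detail later.

The syntax is: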

GET /<index name>/_mapping
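To make Dynamic Mapping more concrete, here is a minimal Python sketch of the kind of type inference it performs. It is a simplified model, not Elasticsearch's implementation: infer_mapping and infer_field_mapping are hypothetical helper names, and arrays, null values, and date detection are deliberately left out.

```python
from typing import Any


def infer_field_mapping(value: Any) -> dict:
    """Guess a mapping for one JSON value, loosely following the default dynamic mapping rules."""
    # bool must be checked before int because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return {"type": "boolean"}
    if isinstance(value, int):
        return {"type": "long"}
    if isinstance(value, float):
        return {"type": "float"}
    if isinstance(value, str):
        # Strings become text with a keyword sub-field (date detection is ignored here).
        return {
            "type": "text",
            "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
        }
    if isinstance(value, dict):
        return {"properties": infer_mapping(value)}
    raise TypeError(f"unsupported value type: {type(value).__name__}")


def infer_mapping(document: dict) -> dict:
    """Infer a mapping for every field of a document."""
    return {field: infer_field_mapping(value) for field, value in document.items()}


print(infer_mapping({"title": "測試新增", "age": 40, "active": True}))
```

The string case produces the same text plus keyword shape that GET /<index>/_mapping reports for a dynamically mapped string field.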

Elasticsearch Data Types

text
long
boolean
double
binary
keyword

First, let's add a document to Elasticsearch. There are several ways to do so:

PUT /<index>/_create/<_id>
POST /<index>/_create/<_id>
POST /<index>/_doc

For example, run the following request:

POST /courses/_doc
{
  "title": "測試新增"
}

This creates an index named courses (if it does not exist yet) and a document with a title field. The response is:

{
  "_index": "courses",
  "_id": "TiMcXIUBK5nGxF3yHKkz",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "_seq_no": 8,
  "_primary_term": 1
}

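Explanation of the response fields:

1. _index: the index that the document belongs to

2. _id: the document ID (unique within the index)

3. _version: the document version, which increases each time the document is modified

Notice that Elasticsearch generates an _id for us automatically. We can also supply our own _id by using the first or second form above. (Custom _ids are not recommended here, because a _create request fails if the _id already exists.) The _id is mainly used to look up a specific document, so to fetch the document we just created we use: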

GET /<index>/_doc/<_id>

GET /courses/_doc/TiMcXIUBK5nGxF3yHKkz

The document returned by Elasticsearch:

{
  "_index": "courses",
  "_id": "TiMcXIUBK5nGxF3yHKkz",
  "_version": 1,
  "_seq_no": 8,
  "_primary_term": 1,
  "found": true,
  "_source": {
    "title": "測試新增"
  }
}

_source: the original document body

To query all documents in an index, use the following request:

POST /<index>/_search
{
  "query": {
    "match_all": {}
  }
}

PS: most searches use POST /<index>/_search; only the request body differs.

The result looks like this:

{
  "took": 101,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 22,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
       ......
     ]
  }
}

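Explanation of the response:

1. took: time spent on the search, in milliseconds

2. timed_out: whether the search timed out

3. hits.total.value: the total number of matching documents; by default the count is accurate only up to 10,000

4. hits.total.relation: eq means the count is exact, so here there are exactly 22 documents (gte would mean the value is a lower bound)

5. max_score: the highest relevance score among the matching documents

6. hits.hits: the matching documents; only the first 10 are returned by default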
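As a small illustration of how these fields fit together, the following Python sketch summarizes a search response that has already been parsed into a dict. summarize_hits is a hypothetical helper, and the entries inside hits.hits are placeholders, because the response above elides them.

```python
def summarize_hits(response: dict) -> str:
    """Describe how many documents matched and how many came back in this page."""
    total = response["hits"]["total"]
    returned = len(response["hits"]["hits"])
    # "eq" means value is exact; "gte" means value is only a lower bound.
    qualifier = "exactly" if total["relation"] == "eq" else "at least"
    return f"{qualifier} {total['value']} matches, {returned} returned, took {response['took']} ms"


# Placeholder response shaped like the one above; the hit contents are made up.
sample = {
    "took": 101,
    "timed_out": False,
    "hits": {
        "total": {"value": 22, "relation": "eq"},
        "max_score": 1,
        "hits": [{"_id": str(i)} for i in range(10)],
    },
}
print(summarize_hits(sample))  # exactly 22 matches, 10 returned, took 101 ms
```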

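If there are more than 10 documents and you want the next 10, you need to page through the results. In short, use from and size to fetch later documents.

from: the offset of the first document to return (default 0)

size: the number of documents to return

In the example above there are 22 documents in total. To fetch documents 11 to 20: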

POST /courses/_search
{
  "query": {
    "match_all": {}
  },
  "from": 10, // from is zero-based, so the 11th document is at offset 10
  "size": 10
}
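The from/size arithmetic is easy to get wrong by one, so here is a minimal Python sketch that turns a 1-based page number into the request body. page_params and search_body are hypothetical helper names, not part of any Elasticsearch client.

```python
def page_params(page: int, page_size: int = 10) -> dict:
    """Translate a 1-based page number into Elasticsearch from/size values."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    return {"from": (page - 1) * page_size, "size": page_size}


def search_body(page: int, page_size: int = 10) -> dict:
    """Build a match_all search body for the requested page."""
    return {"query": {"match_all": {}}, **page_params(page, page_size)}


print(search_body(2))  # {'query': {'match_all': {}}, 'from': 10, 'size': 10}
```

Page 2 with a page size of 10 yields from = 10, which is the request shown above for documents 11 to 20.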

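A search almost always needs conditions. In Elasticsearch, conditions are combined with a bool query, which supports the following clause types:

must: the clause must match (affects scoring)
should: the clause is optional (affects scoring)
filter: the clause must match and is used to filter data (does not affect scoring)
must_not: the clause must not match (does not affect scoring)

An example makes the bool query clearer: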

POST /courses/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "age": 40
          }
        }
      ],
      "filter": [
        {
          "term": {
            "gender.keyword": "FEMALE"
          }
        }
      ]
    }
  }
}

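In plain terms, this query keeps the documents whose gender is FEMALE and whose age matches 40 (gender == 'FEMALE' && age == 40).

PS: Elasticsearch calls each condition object a query clause. In the example above, match: {"age": 40} is a query clause.

should deserves special attention. People tend to equate should with OR in an RDBMS, but there is a subtle difference, which the following example illustrates.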

POST /bank/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "id": 10
          }
        },
        {
          "match": {
            "id": 20
          }
        }
      ],
      "filter": [
        {
          "term": {
            "gender.keyword": "FEMALE"
          }
        }
      ]
    }
  }
}

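Many people read this as "gender is FEMALE and (id = 10 or id = 20)". In fact, because the bool query also contains a filter clause, the should clauses become optional (minimum_should_match defaults to 0): Elasticsearch returns every document whose gender is FEMALE, and a match on id = 10 or id = 20 only raises the score. To require one of the should clauses, either set minimum_should_match or nest the should clauses inside must.

With minimum_should_match: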

POST /bank/_search
{
  "query": {
    "bool": {
      "minimum_should_match": 1,
      "should": [
        {
          "match": {
            "id": 10
          }
        },
        {
          "match": {
            "id": 20
          }
        }
      ],
      "filter": [
        {
          "term": {
            "gender.keyword": "FEMALE"
          }
        }
      ]
    }
  }
}

With the should clauses nested inside must:

POST /bank/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "bool": {
            "should": [
              {
                "match": {
                  "id": 10
                }
              },
              {
                "match": {
                  "id": 20
                }
              }
            ]
          }
        }
      ],
      "filter": [
        {
          "term": {
            "gender.keyword": "FEMALE"
          }
        }
      ]
    }
  }
}

In both examples at least one should clause has to match, so the result contains only documents whose gender is FEMALE and whose id is 10 or 20.
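The interaction between should, filter, and minimum_should_match can be modeled in a few lines of Python. This is a toy model for reasoning about which documents match, not how Elasticsearch evaluates queries: scoring and must_not are left out, and bool_matches, id_is, is_female, and the sample documents are all hypothetical.

```python
from typing import Callable, Optional, Sequence

Clause = Callable[[dict], bool]


def bool_matches(
    doc: dict,
    must: Sequence[Clause] = (),
    should: Sequence[Clause] = (),
    filter_: Sequence[Clause] = (),
    minimum_should_match: Optional[int] = None,
) -> bool:
    """Toy model of bool query matching; scoring and must_not are left out."""
    if minimum_should_match is None:
        # Default: 1 when should is the only clause type present, otherwise 0.
        minimum_should_match = 1 if should and not (must or filter_) else 0
    required_ok = all(clause(doc) for clause in list(must) + list(filter_))
    should_hits = sum(1 for clause in should if clause(doc))
    return required_ok and should_hits >= minimum_should_match


def id_is(value: int) -> Clause:
    return lambda doc: doc["id"] == value


def is_female(doc: dict) -> bool:
    return doc["gender"] == "FEMALE"


# Hypothetical sample documents.
docs = [
    {"id": 10, "gender": "FEMALE"},
    {"id": 20, "gender": "MALE"},
    {"id": 30, "gender": "FEMALE"},
]
should = [id_is(10), id_is(20)]

optional = [d["id"] for d in docs if bool_matches(d, should=should, filter_=[is_female])]
required = [
    d["id"]
    for d in docs
    if bool_matches(d, should=should, filter_=[is_female], minimum_should_match=1)
]
print(optional)  # [10, 30]
print(required)  # [10]
```

In this model, should plus filter without minimum_should_match still matches document 30, which satisfies the filter but neither should clause. Setting minimum_should_match to 1 narrows the result to document 10.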

mapping

We introduced mappings earlier. A mapping can be thought of as the schema in an RDBMS.

There are two points at which we can define a mapping.

1. Define the mapping when creating the index

PUT /users
{
  "mappings": {
    "properties": {
      "id": {
        "type": "long"
      }
    }
  }
}

The example above creates a users index whose mapping has an id field of type long.

2. Add a mapping to an existing index

PUT /users/_mapping
{
  "properties": {
    "title": {
      "type": "text"
    }
  }
}

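This adds a title field of type text to the mapping of users.

Everything above either creates a mapping or adds a field to an existing one. Elasticsearch does not allow changing the data type of an existing field. If you need to change it, use reindex: in short, copy the data from index A into index B, where B has been created with the new mapping.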
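As a rough sketch of that reindex workflow, the Python below builds the two requests involved: creating the destination index with the new field type, then copying the documents with the _reindex API. change_field_type_requests is a hypothetical helper, users_v2 is a made-up index name, and changing id to keyword is only an example. Nothing is sent to a server; the requests are printed so they can be pasted into Dev Tools.

```python
import json


def change_field_type_requests(source: str, dest: str, field: str, new_type: str) -> list:
    """Return (method, path, body) tuples that copy source into dest with a new type for field."""
    create_dest = (
        "PUT",
        f"/{dest}",
        {"mappings": {"properties": {field: {"type": new_type}}}},
    )
    copy_docs = (
        "POST",
        "/_reindex",
        {"source": {"index": source}, "dest": {"index": dest}},
    )
    return [create_dest, copy_docs]


# users_v2 is a hypothetical name for the new index.
for method, path, body in change_field_type_requests("users", "users_v2", "id", "keyword"):
    print(f"{method} {path}")
    print(json.dumps(body, indent=2))
```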

To retrieve the mapping of a specific index, use:

GET /<index>/_mapping

The response looks like this:

{
  "courses": {
    "mappings": {
      "properties": {
        "title": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    }
  }
}

PS: a field can be searched only if its index mapping parameter is true, which is the default.

multi-field:

Sometimes you want one field to have more than one data type. In that case you can use multi-fields to define the additional types, as follows:

PUT <index>
{
  "mappings": {
    "properties": {
      "city": {
        "type": "text",
        "fields": {
          "raw": {
            "type": "keyword"
          }
        }
      }
    }
  }
}

This declares city as a text field that also has a keyword sub-field, accessed as city.raw. The name "raw" is arbitrary; when the sub-field is of type keyword, it is conventionally named keyword.
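A typical reason to define such a sub-field is to search the analyzed text field while sorting or aggregating on the exact keyword value. The Python sketch below builds a search body that does this for the city mapping above. city_search_body, the search term "york", and the aggregation name cities are all hypothetical.

```python
import json


def city_search_body(text: str) -> dict:
    """Full-text match on city; exact-value sort and aggregation on city.raw."""
    return {
        "query": {"match": {"city": text}},
        "sort": {"city.raw": "asc"},
        "aggs": {"cities": {"terms": {"field": "city.raw"}}},
    }


print(json.dumps(city_search_body("york"), indent=2))
```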

query DSL:

query_string
simple_query_string
match
match_phrase
match_phrase_prefix

Relation:

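Our data often contains relationships. For example, a course may have a subject area or an instructor. How do we store such related data in ES? There are several approaches.

1. inner object

This is just a regular object field, and regular objects get flattened. Suppose your data looks like this: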

{
  "user":  [
      {
       "firstname": "hello",
       "lastname": "world"
      }
  ]
}

ES converts it to:

{
 "user.firstname": "hello",
 "user.lastname": "world"
}

Flattening has a drawback: the association between fields is lost, so firstname and lastname become independent of each other. For example, suppose the data is:

{
  "user": [
    {
      "firstname": "hello",
      "lastname": "lastname"
    },
    {
      "firstname": "firstname",
      "lastname": "world"
    }
  ]
}

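If you now use match to find documents where firstname equals hello and lastname equals world, this document is returned, even though what you probably wanted were documents where firstname is hello and lastname is world inside the same JSON object: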

GET <index>/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "user.firstname": "hello"
          }
        },
        {
          "match": {
            "user.lastname": "world"
          }
        }
      ]
    }
  }
}
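The lost association can be reproduced with a small Python model of the flattening. It is only an illustration of the behavior described above, not Elasticsearch's internal representation. flatten_objects and matches_all are hypothetical helpers.

```python
def flatten_objects(field: str, objects: list) -> dict:
    """Flatten an array of objects into one list of values per dotted field name."""
    flat = {}
    for obj in objects:
        for key, value in obj.items():
            flat.setdefault(f"{field}.{key}", []).append(value)
    return flat


def matches_all(flat: dict, conditions: dict) -> bool:
    """True when every dotted field contains its wanted value, whichever object it came from."""
    return all(wanted in flat.get(name, []) for name, wanted in conditions.items())


users = [
    {"firstname": "hello", "lastname": "lastname"},
    {"firstname": "firstname", "lastname": "world"},
]
flat = flatten_objects("user", users)
print(flat)
# {'user.firstname': ['hello', 'firstname'], 'user.lastname': ['lastname', 'world']}
print(matches_all(flat, {"user.firstname": "hello", "user.lastname": "world"}))  # True
```

The match succeeds even though no single user object has both values, because after flattening each field is just a bag of values.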

2. nested

Mapping the field as type nested solves the lost-association problem of inner objects described above.

To use nested with the example above, define the mapping as follows:

PUT <index>
{
  "mappings": {
    "properties": {
      "user": {
        "type": "nested",
        "properties": {
          "firstname": {
            "type": "keyword"
          },
          "lastname": {
            "type": "keyword"
          }
        }
      }
    }
  }
}

Because the field is mapped as nested, queries against it must use a nested query, as follows:

GET /<index>/_search
{
  "query": {
    "nested": {
      "path": "user",
      "query": {
        "bool": {
          "must": [
            { "term": { "user.firstname": "hello" } },
            { "term": { "user.lastname": "world" } }
          ]
        }
      }
    }
  }
}

path is required here. It specifies which nested field the search starts from.
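Conceptually, a nested query checks its conditions against each object under path separately. The Python sketch below models that per-object matching. It is a simplified illustration that assumes every condition name starts with the path prefix. nested_matches and the sample documents are hypothetical.

```python
def nested_matches(doc: dict, path: str, conditions: dict) -> bool:
    """True only when a single object under path satisfies every condition."""
    prefix = path + "."
    # Strip the "user." prefix so conditions can be compared with each object's own keys.
    wanted = {name[len(prefix):]: value for name, value in conditions.items()}
    return any(
        all(obj.get(key) == value for key, value in wanted.items())
        for obj in doc.get(path, [])
    )


conditions = {"user.firstname": "hello", "user.lastname": "world"}

split_doc = {
    "user": [
        {"firstname": "hello", "lastname": "lastname"},
        {"firstname": "firstname", "lastname": "world"},
    ]
}
same_object_doc = {"user": [{"firstname": "hello", "lastname": "world"}]}

print(nested_matches(split_doc, "user", conditions))        # False
print(nested_matches(same_object_doc, "user", conditions))  # True
```

The document whose values are split across two user objects no longer matches; only a document with both values in the same object does.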

Reference:

https://kucw.github.io/blog/2018/6/elasticsearch-nested/

3. parent & child

References:

https://myapollo.com.tw/zh-tw/docker-elasticsearch/

Written by Gary Ng
