Elasticsearch

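A distributed search and analytics engine built on Apache Lucene. It exposes a RESTful API and is commonly used for full-text search, log analysis, querying monitoring data, and near real-time data exploration.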

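Licensing note: starting with Elasticsearch / Kibana 7.11 in 2021, Elastic stopped providing an Apache 2.0 distribution and moved to the Elastic License 2.0 and SSPL. From September 2024, Elastic also added AGPLv3 as a license option for part of the source code. Elastic's official distributions continue to be provided under the Elastic License.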

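It is suited to large volumes of structured, semi-structured, and unstructured data, but it is fundamentally neither a Data Warehouse nor simply a Data Lake. It is closer to a "searchable index layer", or Search Serving Layer.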


Core Features

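1. Inverted Index
Text is split into tokens, and a "term → documents" index is built from them. For full-text search, a term index built with an inverted index and an analyzer is usually a better fit than running SQL LIKE '%keyword%' directly, since a leading-wildcard LIKE generally cannot use an ordinary B-tree index and falls back to scanning rows. If the database itself uses a full-text index, however, the comparison depends on index design and query requirements.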
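
To make the "term → documents" idea concrete, here is a minimal sketch in plain Python. The `tokenize` function is a toy stand-in for an analyzer (lowercase, split on non-alphanumerics), and all function names are made up for this illustration; none of this is Lucene's actual implementation.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Toy analyzer: lowercase and split on non-alphanumeric characters."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def build_inverted_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Map each term to the set of document IDs that contain it."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index: dict[str, set[int]], query: str) -> set[int]:
    """Return IDs of documents that contain every query term (AND semantics)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    1: "Elasticsearch is a distributed search engine",
    2: "Lucene is a search library",
    3: "Kibana visualizes data",
}
index = build_inverted_index(docs)
print(sorted(search(index, "search engine")))  # [1]
print(sorted(search(index, "Search")))         # [1, 2]
```

A lookup touches only the posting sets of the query terms, rather than scanning every document's text.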

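2. Sharded Distributed Architecture
Elasticsearch splits an index into multiple shards, each of which is itself a Lucene index. Shards can be distributed across multiple nodes to scale storage, write, and query capacity. Replica shards provide fault tolerance and additional read capacity.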
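
As a rough illustration of how documents end up on different shards, the sketch below uses the simplified rule "hash the routing value, then take it modulo the number of primary shards". It rests on assumptions: `pick_shard` is a hypothetical helper, and `zlib.crc32` is only a stand-in hash chosen for this example, not the hash function Elasticsearch uses.

```python
import zlib


def pick_shard(routing: str, num_primary_shards: int) -> int:
    """Simplified routing: hash the routing value, then take the modulus.

    zlib.crc32 is a stand-in for this sketch; it is not the hash function
    Elasticsearch uses.
    """
    return zlib.crc32(routing.encode("utf-8")) % num_primary_shards


num_primary_shards = 3
shards: dict[int, list[str]] = {n: [] for n in range(num_primary_shards)}

# By default the routing value is the document ID.
for doc_id in ["order-1", "order-2", "order-3", "order-4", "order-5"]:
    shards[pick_shard(doc_id, num_primary_shards)].append(doc_id)

for shard_num, doc_ids in shards.items():
    print(shard_num, doc_ids)
```

Because the shard number depends on the modulus, changing the number of primary shards would move existing documents, which is one reason the primary shard count is normally fixed when an index is created.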

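3. Near Real-Time Search
A document is not searchable the instant it is written; it becomes searchable after the next refresh. By default, Elasticsearch refreshes roughly once per second (and only for indices that have received a search request in the last 30 seconds), so new data is usually searchable within about 1 second.

A refresh is neither a flush nor a commit; it only makes new segments visible to search.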
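
The visibility gap between writing and searching can be modeled with a toy class. This is a sketch under simplifying assumptions: `ToyIndex` is a hypothetical in-memory model, not how Lucene segments or the Elasticsearch client actually work. It only shows that indexed documents stay invisible to search until a refresh happens.

```python
class ToyIndex:
    """Toy model of near real-time search: writes are buffered until refresh."""

    def __init__(self) -> None:
        self._buffer: list[dict] = []      # indexed but not yet searchable
        self._searchable: list[dict] = []  # visible to search

    def index(self, doc: dict) -> None:
        """Accept a document; it is not searchable yet."""
        self._buffer.append(doc)

    def refresh(self) -> None:
        """Make buffered documents visible to search."""
        self._searchable.extend(self._buffer)
        self._buffer.clear()

    def search(self, field: str, value: str) -> list[dict]:
        """Return searchable documents whose field equals the given value."""
        return [d for d in self._searchable if d.get(field) == value]


idx = ToyIndex()
idx.index({"level": "error", "msg": "disk full"})
print(idx.search("level", "error"))  # [] (not refreshed yet)
idx.refresh()
print(idx.search("level", "error"))  # [{'level': 'error', 'msg': 'disk full'}]
```

In a real cluster the refresh cadence is governed by the `index.refresh_interval` setting; the toy model simply calls `refresh()` by hand.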

Relationship to ETL

Where Elasticsearch sits in a data flow depends on the scenario.

Log / Event Analysis

In the ELK Stack, data usually enters the system quickly. Logstash acts as a streaming, event-based ingest pipeline: it ingests, parses, filters, transforms, and enriches events, then outputs them to Elasticsearch:

Logstash      → Extract + Transform
Elasticsearch → Index + Search
Kibana        → Visualize

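This scenario looks more like "collect large volumes of data first → build the index → query and analyze later in Kibana".

Even so, it is not a full Data Lake: Elasticsearch has mappings, an index schema, and a query-optimized design, so it is not suited to being a permanent archive layer for raw data.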
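
The parse, filter, transform, and enrich stages described for Logstash can be sketched in plain Python. This is not Logstash configuration. The log format, the `SERVICE_OWNERS` lookup table, and every function name below are assumptions invented for the example.

```python
import re
from datetime import datetime, timezone
from typing import Optional

# Hypothetical log format: "<ISO timestamp> <LEVEL> <service> <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\S+) (?P<level>[A-Z]+) (?P<service>\S+) (?P<message>.*)$"
)
SERVICE_OWNERS = {"checkout": "payments-team"}  # hypothetical lookup table


def parse(line: str) -> Optional[dict]:
    """Parse one raw line into named fields, or None if it does not match."""
    m = LINE_RE.match(line.strip())
    return m.groupdict() if m else None


def keep(event: dict) -> bool:
    """Filter: drop DEBUG events."""
    return event["level"] != "DEBUG"


def transform(event: dict) -> dict:
    """Normalize the timestamp to UTC and lowercase the level."""
    ts = datetime.fromisoformat(event["ts"]).astimezone(timezone.utc)
    return {
        "@timestamp": ts.isoformat(),
        "level": event["level"].lower(),
        "service": event["service"],
        "message": event["message"],
    }


def enrich(event: dict) -> dict:
    """Add an owner field from the lookup table."""
    return {**event, "owner": SERVICE_OWNERS.get(event["service"], "unknown")}


def pipeline(lines: list[str]) -> list[dict]:
    """Run every line through parse -> filter -> transform -> enrich."""
    docs = []
    for line in lines:
        event = parse(line)
        if event is None or not keep(event):
            continue
        docs.append(enrich(transform(event)))
    return docs


lines = [
    "2024-05-01T12:00:00+08:00 ERROR checkout payment gateway timeout",
    "2024-05-01T12:00:01+08:00 DEBUG checkout cache hit",
    "not a log line",
]
for doc in pipeline(lines):
    print(doc)
```

Only the first line survives: the DEBUG event is filtered out and the malformed line fails to parse. The resulting dictionaries are the kind of documents that would then be sent to Elasticsearch for indexing.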

Product Search / Business Data

Use cases such as product search, article search, and site search usually need an ETL step first:

Data sources
→ Cleaning
→ Field normalization
→ Synonym / tokenization / weighting design
→ Write to Elasticsearch
→ Search API

Here Elasticsearch acts more like a Search Serving Layer.
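
The ETL steps above can be sketched as a small Python script that cleans source rows and emits a newline-delimited JSON body in the shape accepted by the Elasticsearch bulk API (an action line followed by a document line). The source row layout, the `SYNONYMS` table, the `products` index name, and the helper names are all assumptions for illustration. In practice synonyms are often configured in the analyzer rather than applied during ETL.

```python
import json

SYNONYMS = {"tee": "t-shirt", "tshirt": "t-shirt"}  # hypothetical synonym table


def normalize(row: dict) -> dict:
    """Clean whitespace and standardize field names and types."""
    return {
        "id": str(row["sku"]),
        "title": " ".join(row["name"].split()),
        "category": row["category"].strip().lower(),
        "price": float(row["price"]),
    }


def expand_synonyms(doc: dict) -> dict:
    """Add canonical terms for any title word found in the synonym table."""
    terms = doc["title"].lower().split()
    extra = sorted({SYNONYMS[t] for t in terms if t in SYNONYMS})
    return {**doc, "search_terms": extra}


def to_bulk_body(docs: list[dict], index_name: str) -> str:
    """Build an NDJSON body: one action line, then one document line, per doc."""
    out = []
    for doc in docs:
        out.append(json.dumps({"index": {"_index": index_name, "_id": doc["id"]}}))
        out.append(json.dumps({k: v for k, v in doc.items() if k != "id"}))
    return "\n".join(out) + "\n"


rows = [
    {"sku": 1001, "name": "  Cotton   Tee ", "category": " Apparel", "price": "19.90"},
]
docs = [expand_synonyms(normalize(r)) for r in rows]
body = to_bulk_body(docs, "products")
print(body)
```

The script stops at building the request body; sending it to a cluster and exposing a search API on top are left out.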

Differences Among the Three

Data Lake       → raw data, schema-on-read, long-term storage
Elasticsearch   → indexed data, has mappings, used for search and near real-time analysis
Data Warehouse  → structured data, schema-on-write, used for BI / reporting / OLAP analysis

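Elasticsearch is neither a data warehouse nor a pure data lake; it is an indexing system that turns data into a form that can be searched and aggregated quickly. It supports aggregations and dashboard queries, but it is not designed for complex OLAP, long-term historical reporting, cross-table joins, or strict transactional consistency.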
