Skip to content

Term查询

Term Query / Range Query / Exists Query / Prefix Query / Wildcard Query

特点:

  • 对输入不做分词, 因此应检索keyword
  • 会为每个包含该词项的文档进行相关度算分
  • 可用constant score将查询转换成一个filtering, 避免算分, 并利用缓存, 提高性能
json
# term查询
GET movies/_search
{
  "query": {
    "term": {
      "title.keyword": {
        "value": "2012"
      }
    }
  }
}

# 利用constant_score转为filter
GET movies/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "title.keyword": "2012"
        }
      }
    }
  }
}

全文查询

Match Query / Match Phrase Query / Query String Query

特点:

  • 索引和搜索时都会分词, 查询字符串会先分词生成一个共查询的词项列表
  • 会算分
json
# 全文查询
POST movies/_search
{
  "query": {
    "match": {
      "title": {
        "query": "Matrix reloaded",
        "operator": "and"
      }
    }
  }
}

POST movies/_search
{
  "query": {
    "match": {
      "title": {
        "query": "Matrix reloaded",
        "minimum_should_match": 2
      }
    }
  }
}

POST movies/_search
{
  "query": {
    "match_phrase": {
      "title": {
        "query": "Matrix reloaded",
        "slop": 1
      }
    }
  }
}

结构化搜索

结构化搜索是指对结构化数据的搜索, 其中日期, 布尔类型和数字都是结构化的, 文本也可以是结构化的.

json
# 对布尔类型, 数字类型直接term查询, 可以用constant_score转filter, 避免算分

# 当term查询的字段是数组类型时, 是包含关系, 而不是等值
# 如果要求等值, 则需在索引中加一个计数字段, 然后通过bool查询叠加条件实现

#数字类型 terms
POST products/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "terms": {
          "price": [
            "20",
            "30"
          ]
        }
      }
    }
  }
}

#字符类型 terms
POST products/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "terms": {
          "productID.keyword": [
            "QQPX-R-3956-#aD8",
            "JODL-X-1937-#pV7"
          ]
        }
      }
    }
  }
}

#数字 Range 查询
GET products/_search
{
    "query" : {
        "constant_score" : {
            "filter" : {
                "range" : {
                    "price" : {
                        "gte" : 20,
                        "lte"  : 30
                    }
                }
            }
        }
    }
}

# 日期 range
POST products/_search
{
    "query" : {
        "constant_score" : {
            "filter" : {
                "range" : {
                    "date" : {
                      "gte" : "now-1y"
                    }
                }
            }
        }
    }
}

#exists查询
POST products/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "exists": {
          "field": "date"
        }
      }
    }
  }
}

# 查询不存在date字段的文档
POST products/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "bool": {
          "must_not": {
            "exists": {
              "field": "date"
            }
          }
        }
      }
    }
  }
}

相关性算分

搜索的相关性算分, 描述了一个文档和查询语句匹配的程度. 打分的本质是排序, 需要把最符合用户需求的文档排在前面, es5后默认的相关性算分采用BM25.

  • 词频TF(Term Frequency): 检索词在一篇文档中出现的频率=检索词出现次数/文档总字数

    • 度量一条查询与结果文档相关性的简单方法, 是将搜索中每一个词的TF进行相加
    • Stop Word除外, 比如"的", "地", 不贡献相关度, 不应考虑它们的TF
  • 逆文档频率IDF, 词在所有文档中出现的越少, 值越大

    • DF: 检索词在所有文档中出现的频率
    • IDF: Inverse Document Frequency=log(全部文档数/检索词出现过的文档总数)
  • TF-IDF: 本质上将TF求和变成了加权求和

比如搜索"区块链的应用", TF-IDF=TF(区块链)*IDF(区块链)+TF(的)*IDF(的)+TF(应用)*IDF(应用)

1745061546541

1745061552042

json
# 可以打开explain了解算分
POST movies/_search
{
  "explain": true, 
  "query": {
    "match": {
      "title": {
        "query": "Matrix reloaded",
        "operator": "and"
      }
    }
  }
}

# 通过Boosting来控制相关度
	# boost>1, 提升相关性
	# 0<boost<1, 降低相关性
	# boost<0, 贡献负分

# 1. 设置索引以及mapping时可以设置字段的boost
# 2. 查询时可以设置, 如下
POST testscore/_search
{
    "query": {
        "boosting" : {
            "positive" : {
                "term" : {
                    "content" : "elasticsearch"
                }
            },
            "negative" : {
                 "term" : {
                     "content" : "like"
                }
            },
            "negative_boost" : 0.2
        }
    }
}

Query&Filtering与多字符串多字段查询

  • Query: 查询, 有算分
  • Filter: 过滤, 不需要算分, 可以利用缓存, 提升性能

bool查询, 一个或多个查询子句组合

  • 子查询可以任意顺序
  • 可以嵌套多个
  • 如果没有must条件, 则should必须至少满足一个

1745061565222

json
# bool查询
POST /products/_search
{
  "query": {
    "bool" : {
      "must" : {
        "term" : { "price" : "30" }
      },
      "filter": {
        "term" : { "avaliable" : "true" }
      },
      "must_not" : {
        "range" : {
          "price" : { "lte" : 10 }
        }
      },
      "should" : [
        { "term" : { "productID.keyword" : "JODL-X-1937-#pV7" } },
        { "term" : { "productID.keyword" : "XHDK-A-1293-#fJ3" } }
      ],
      "minimum_should_match" :1
    }
  }
}

#嵌套,实现了 should not 逻辑
POST /products/_search
{
  "query": {
    "bool": {
      "must": {
        "term": {
          "price": "30"
        }
      },
      "should": [
        {
          "bool": {
            "must_not": {
              "term": {
                "avaliable": "false"
              }
            }
          }
        }
      ],
      "minimum_should_match": 1
    }
  }
}

# bool查询嵌套的层级会影响算分
POST /animals/_search
{
  "query": {
    "bool": {
      "should": [
        { "term": { "text": "brown" }},
        { "term": { "text": "red" }},
        { "term": { "text": "quick"   }},
        { "term": { "text": "dog"   }}
      ]
    }
  }
}

POST /animals/_search
{
  "query": {
    "bool": {
      "should": [
        { "term": { "text": "quick" }},
        { "term": { "text": "dog"   }},
        {
          "bool":{
            "should":[
               { "term": { "text": "brown" }},
                 { "term": { "text": "brown" }},
            ]
          }

        }
      ]
    }
  }
}

# Boosting Query
POST news/_search
{
  "query": {
    "boosting": {
      "positive": {
        "match": {
          "content": "apple"
        }
      },
      "negative": {
        "match": {
          "content": "pie"
        }
      },
      "negative_boost": 0.5
    }
  }
}

单字符串多字段查询, Dis Max Query

总结: 这种场景还是要用dis_max+tie_breaker合适

json
# 文档1
PUT /blogs/_doc/1
{
    "title": "Quick brown rabbits",
    "body":  "Brown rabbits are commonly seen."
}

# 文档2
PUT /blogs/_doc/2
{
    "title": "Keeping pets healthy",
    "body":  "My quick brown fox eats rabbits on a regular basis."
}

# 上面两篇文档
# 单字符串多字段查询如下示例, 通过bool查询来实现
# 当bool查询的条件为should时, 此时会简单的将所有匹配的得分相加, 
# 由于brown在第一篇文档中出现两次, 因此文档1优先于文档2, 但语义上文档2的brown fox才更接近
POST /blogs/_search
{
    "query": {
        "bool": {
            "should": [
                { "match": { "title": "Brown fox" }},
                { "match": { "body":  "Brown fox" }}
            ]
        }
    }
}

# 此时, 使用dis_max查询即可达到效果, dis_max会简单返回匹配更高者的分数, 得到文档2优先于1
POST blogs/_search
{
    "query": {
        "dis_max": {
            "queries": [
                { "match": { "title": "Brown fox" }},
                { "match": { "body":  "Brown fox" }}
            ]
        }
    }
}

# 另一种场景, 当查询的字段是Quick pets时, dis_max算法会使得2个文档评分一样
POST blogs/_search
{
    "query": {
        "dis_max": {
            "queries": [
                { "match": { "title": "Quick pets" }},
                { "match": { "body":  "Quick pets" }}
            ]
        }
    }
}

# 这时, 可以引入tie_breaker, tie_breker决定了剩余匹配的贡献度, 
# 引入后匹配了2个单词的文档2评分就会高于文档1
POST blogs/_search
{
    "query": {
        "dis_max": {
            "queries": [
                { "match": { "title": "Quick pets" }},
                { "match": { "body":  "Quick pets" }}
            ],
            "tie_breaker": 0.2
        }
    }
}

单字符串多字段查询: Multi Match

三种场景:

  1. 最佳字段(Best Fields)

当字段之间互相竞争, 又相互关联, 评分来自最匹配字段

补充: 效果就相当于上面的dis_max query

json
POST blogs/_search
{
    "query": {
        "dis_max": {
            "queries": [
                { "match": { "title": "Quick pets" }},
                { "match": { "body":  "Quick pets" }}
            ],
            "tie_breaker": 0.2
        }
    }
}

POST blogs/_search
{
  "query": {
    "multi_match": {
      "type": "best_fields",
      "query": "Quick pets",
      "fields": ["title","body"],
      "tie_breaker": 0.2,
      "minimum_should_match": "20%"
    }
  }
}
  1. 多数字段(Most Fields)

处理英文内容时, 常见手段是, 在主字段(English Analyzer), 抽取词干, 加入同义词, 以匹配更多的文档, 相同的文本, 加入子字段(Standard Analyzer), 以提供更加精确的匹配, 其他字段作为匹配文档提高相关度的信号, 匹配字段越多则越好

json
# standard分词器不会对单词进行处理
GET /_analyze
{
  "analyzer":"standard",
  "text":"barking dogs"
}
------------------------
{
  "tokens" : [
    {
      "token" : "barking",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "dogs",
      "start_offset" : 8,
      "end_offset" : 12,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}

# english分词器会去除单词的时态, 单复数等特征
GET /_analyze
{
  "analyzer":"english",
  "text":"barking dogs"
}
------------------------
{
  "tokens" : [
    {
      "token" : "bark",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "dog",
      "start_offset" : 8,
      "end_offset" : 12,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}

案例

json
# 现有titles这样的索引, 对title字段设定了english分词器
PUT /titles
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "english"
      }
    }
  }
}

POST titles/_bulk
{ "index": { "_id": 1 }}
{ "title": "My dog barks" }
{ "index": { "_id": 2 }}
{ "title": "I see a lot of barking dogs on the road " }

# 写入上面两篇文档后, 有如下搜索, 由于两篇文档都有俩单词匹配, 且第一篇文档更短, 
# 因此文档1优先于文档2, 但是明显文档2才是我们想要的
GET titles/_search
{
  "query": {
    "match": {
      "title": "barking dogs"
    }
  }
}

# 解决方式, 修改mapping, 对title字段, 添加一个使用standard分词器的子字段std
DELETE /titles
PUT /titles
{
    "settings": { "number_of_shards": 1 },
    "mappings": {
        "my_type": {
            "properties": {
                "title": {
                    "type":     "string",
                    "analyzer": "english",
                    "fields": {
                        "std":   {
                            "type":     "string",
                            "analyzer": "standard"
                        }
                    }
                }
            }
        }
    }
}

# 再使用most_fields的multi_match查询, 此时文档2的title以及子字段都有匹配, 因此优先于文档1
GET /titles/_search
{
   "query": {
        "multi_match": {
            "query":  "barking dogs",
            "type":   "most_fields",
            "fields": [ "title", "title.std" ]
        }
    }
}

专业解释如下:

1745061579737

  1. 混合字段(Cross Field)

对于某些实体, 例如人名, 地址, 图书信息, 需要在多个字段中确定信息, 单个字段只能作为整体的一部分, 希望在任何这些列出的字段中找到尽可能多的词

json
# Cross Field的multi_match常用于跨字段查询, 比如地区字段如下
# 这种情形的另一个解决方式是使用copy_to, 新增一个字段, 但是会占用更多索引空间
# 此外使用Cross Field还可以设定每个字段的权重
PUT address/_doc/1
{
  "street": "5 Poland Street",
  "city": "London",
  "country" : "UK",
  "postcode": "W1V 3DG"
}


POST address/_search
{
  "query": {"multi_match": {
    "query": "Poland Street W1V",
    "type": "cross_fields",
    "operator": "and", 
    "fields": ["street","city","country","postcode"]
  }}
}