This discussion is a continuation of Tools and Technologies involved in modern search architecture. Please use the link to access Part I for a quick recap, as topics mentioned there may be referred to here.
In a search implementation, once you're done with the challenges of crawling the content and building an inverted index to quickly access matching documents, two more challenges await:
- Returning relevant results to the user: Relevance Score
- Delivering results to the user: SERP (Search Engine Results Page)
Relevance Score:
Relevance scoring is used by search engines to identify the order in which the documents should appear. Here's a documented definition:
Relevance scoring uses the Boolean model to find matching documents and a formula called the practical scoring function to calculate relevance. This formula borrows concepts from TF-IDF (term frequency/inverse document frequency) and the vector space model but adds more modern features like a coordination factor, field length normalization, and term or query clause boosting.
It's interesting how all the above algorithms are used to provide the most relevant documents. Let's try to decipher each algorithm and its role in simple terms:
Boolean Model:
This model applies AND, OR and NOT as expressed in the query to find all documents that match the conditions as in example below:
ironman AND hulk AND antman AND ( captain america OR thor )
The above query will return documents that contain ironman, hulk, and antman, plus either captain america or thor.
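The Boolean model maps naturally onto set operations over an inverted index. Here is a minimal Python sketch (the toy index and document IDs are made up for illustration): each term's postings are a set of document IDs, and AND/OR become intersection/union.

```python
# Toy inverted index: term -> set of document IDs containing it.
inverted_index = {
    "ironman": {1, 2, 3},
    "hulk": {1, 2},
    "antman": {1, 2, 4},
    "captain america": {2, 5},
    "thor": {1, 6},
}

def postings(term):
    """Return the set of documents containing the term (empty if unseen)."""
    return inverted_index.get(term, set())

# ironman AND hulk AND antman AND (captain america OR thor)
matches = (
    postings("ironman")
    & postings("hulk")
    & postings("antman")
    & (postings("captain america") | postings("thor"))
)
print(sorted(matches))  # → [1, 2]
```

Note that the Boolean model only decides *which* documents match; it says nothing about their order. That is what the scoring factors below are for.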
TF-IDF:
Once you have a list of matching documents, they need to be presented in a useful order, ranked by relevance. Not every document will contain all the terms in the query, some terms are more important than others, and the relevance score of a document depends on the weight of each query term that appears in it. The weight of a query term is calculated from three factors: term frequency, inverse document frequency, and the field-length norm.
Term Frequency(TF):
TF specifies how often the term appears in the document. The more often, the higher the weight.
tf(t in d) = √frequency(t)
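The formula can be sketched directly in Python (the example document is made up for illustration): count the term's occurrences and take the square root, so repeated occurrences keep raising the weight, but with diminishing returns.

```python
import math

def tf(term, document_tokens):
    """Term-frequency weight: square root of the raw occurrence count."""
    return math.sqrt(document_tokens.count(term))

doc = "ironman fights hulk and ironman wins".split()
print(tf("ironman", doc))  # sqrt(2) ≈ 1.414
print(tf("hulk", doc))     # sqrt(1) = 1.0
```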
If you want to ignore term frequency and only find documents that contain the term, you can disable term frequencies in the field mapping by setting index_options to docs, which disables both term frequencies and term positions.
PUT /my_index { "mappings": { "doc": { "properties": { "text": { "type": "string", "index_options": "docs" } } } } }
Inverse Document Frequency(IDF):
IDF specifies how often the term appears across all documents in the collection. The more often, the lower the weight, i.e., common terms like "and" or "the" are effectively ignored in the relevance calculation, as they appear in most documents.
idf(t) = 1 + log ( numDocs / (docFreq + 1))
numDocs → number of documents in the index.
docFreq → number of documents that contain the term.
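A quick Python sketch of the formula (the document counts are made up for illustration) shows why rare terms dominate: a term found in only a handful of documents gets a much larger weight than one found nearly everywhere.

```python
import math

def idf(num_docs, doc_freq):
    """idf(t) = 1 + log(numDocs / (docFreq + 1))"""
    return 1 + math.log(num_docs / (doc_freq + 1))

# In a 1000-document index:
print(idf(1000, 9))    # rare term, 1 + log(100) ≈ 5.61
print(idf(1000, 999))  # term in every document, weight = 1.0
```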
Field-length norm:
With field-length normalization, a term match found in a field with a low number of total terms is more important than a match found in a field with a large number of terms. For example, a term that appears in the title field will carry more weight than the same term appearing in the body field.
norm(d) = 1/√numTerms
numTerms → number of terms in the field
*field-length norm is important for full-text search
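The norm is again a one-liner in Python (the field lengths below are made up for illustration), and it makes the title-vs-body intuition concrete: the same match contributes roughly ten times more in a 5-term title than in a 500-term body.

```python
import math

def field_length_norm(num_terms):
    """norm(d) = 1 / sqrt(numTerms)"""
    return 1 / math.sqrt(num_terms)

print(field_length_norm(5))    # short title field, ≈ 0.447
print(field_length_norm(500))  # long body field,  ≈ 0.045
```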
All three factors (term frequency, inverse document frequency, and field-length norm) are calculated and stored at index time, and together they are used to calculate the weight of a single term in a particular document.
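Putting the three factors together, a per-term weight can be sketched as their product. This is a simplification: the real practical scoring function also applies boosts, query normalization, and (historically) a coordination factor, but the core tf * idf * norm shape is the same. The example inputs below are made up for illustration.

```python
import math

def term_weight(term_freq, num_docs, doc_freq, field_num_terms):
    """Simplified per-term weight combining the three index-time factors:
    tf(t in d) * idf(t) * norm(d)."""
    tf = math.sqrt(term_freq)
    idf = 1 + math.log(num_docs / (doc_freq + 1))
    norm = 1 / math.sqrt(field_num_terms)
    return tf * idf * norm

# A term appearing twice in a 10-term field, rare across 1000 documents:
print(term_weight(2, 1000, 9, 10))  # ≈ 2.51
```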