ESS - 3. Tools and Technologies involved in modern search architecture - Part II


This discussion is a continuation of Tools and Technologies involved in modern search architecture. Please use the link to access Part I for a quick recap, as the topics covered there may be referred to here.

In a search implementation, once you're done with the challenges of crawling the content and building an inverted index to quickly access the matching documents, two more challenges await:

  1. Relevant results to the user - Relevance Score
  2. Deliver results to the user - SERP

Relevance Score:

Relevance scoring is used by search engines to identify the order in which the documents should appear. Here's a documented definition:

Relevance scoring uses the Boolean model to find matching documents and a formula called the practical scoring function to calculate relevance. This formula borrows concepts from TF-IDF (term frequency/inverse document frequency) and the vector space model but adds more modern features like a coordination factor, field length normalization, and term or query clause boosting.

It's interesting how all the above algorithms are used together to provide the most relevant documents. Let's try to decipher some of them and their roles in simple terms:

Boolean Model:
This model applies the AND, OR and NOT conditions expressed in the query to find all documents that match, as in the example below:
    ironman AND hulk AND antman AND ( captain america OR thor )
The above query returns documents that contain ironman, hulk and antman, and either captain america or thor.
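
To make the Boolean model concrete, here is a minimal sketch in Python that treats each term's posting list as a set, so AND becomes set intersection and OR becomes set union (NOT would be set difference). The tiny index and its document ids are made up for illustration, and a real engine works on sorted posting lists rather than plain sets.

```python
# Toy inverted index: term -> set of ids of documents containing it.
# The data is hypothetical and exists only for this illustration.
index = {
    "ironman": {1, 2, 3},
    "hulk": {1, 2, 4},
    "antman": {1, 2},
    "captain america": {2, 5},
    "thor": {3},
}

def postings(term):
    """Return the set of documents containing the term (empty if unknown)."""
    return index.get(term, set())

def boolean_match():
    """Evaluate: ironman AND hulk AND antman AND (captain america OR thor)."""
    either = postings("captain america") | postings("thor")  # OR -> union
    # AND -> intersection of all required posting sets
    return postings("ironman") & postings("hulk") & postings("antman") & either

print(sorted(boolean_match()))  # prints [2]
```

Note that the Boolean model only decides whether a document matches. It says nothing about the order of the matches, which is where scoring comes in.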

TF-IDF:
Once you have a list of matching documents, they need to be presented in a useful order, ranked by relevance. Not every document contains all the terms in the query, and some terms are more important than others, so the relevance score of a document depends on the weight of each query term that appears in it. The weight of a query term is calculated from three factors: term frequency, inverse document frequency and the field-length norm.

Term Frequency (TF):
TF specifies how often the term appears in the document. The more often, the higher the weight.
    tf(t in d) = √frequency(t)
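
As a rough sketch of term frequency in Python: the snippet below counts raw occurrences and also shows the damped value, since Lucene's classic TF-IDF similarity takes the square root of the raw count so that the tenth occurrence adds less than the first. The whitespace tokenizer and the function names are assumptions made for illustration, not how a real analyzer works.

```python
import math
from collections import Counter

def raw_frequency(term, text):
    """Count occurrences of `term` in `text` using naive whitespace tokens."""
    tokens = text.lower().split()
    return Counter(tokens)[term.lower()]

def tf(term, text):
    """Damped term frequency: square root of the raw count."""
    return math.sqrt(raw_frequency(term, text))

doc = "Hulk smash hulk angry hulk big hulk"
print(raw_frequency("hulk", doc))  # prints 4
print(tf("hulk", doc))             # prints 2.0
```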
If you only care whether a document contains the term and not how often, you can disable term frequencies in the field mapping by setting index_options to docs, which disables both term frequencies and term positions:
PUT /my_index
{
  "mappings": {
    "doc": {
      "properties": {
        "text": {
          "type":          "string",
          "index_options": "docs" 
        }
      }
    }
  }
}

Inverse Document Frequency (IDF):
IDF reflects how often the term appears across all documents in the collection. The more often, the lower the weight, i.e., common terms like "and" or "the" contribute little to the relevance calculation, as they appear in most documents.
  idf(t) = 1 + log ( numDocs / (docFreq + 1))
numDocs → number of documents in the index.
docFreq → number of documents that contain the term.
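
The IDF formula translates directly into a few lines of Python. This is only a sketch: the function and parameter names are mine, and it assumes the natural logarithm, which is what Lucene's classic similarity uses.

```python
import math

def idf(num_docs, doc_freq):
    """Inverse document frequency: 1 + ln(numDocs / (docFreq + 1))."""
    return 1 + math.log(num_docs / (doc_freq + 1))

# A rare term (in 9 of 1000 docs) versus a very common one (in 999 of 1000).
print(idf(1000, 9))    # roughly 5.6
print(idf(1000, 999))  # prints 1.0
```

The rare term ends up with more than five times the weight of the common one, which is the behaviour described above.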

Field-length norm:
With field-length normalization, a term match found in a field with few total terms is considered more important than a match found in a field with many terms. For example, a term that appears in a title field will usually carry more weight than the same term appearing in a body field, because titles are typically much shorter.

norm(d) = 1/√numTerms
numTerms → number of terms in the field

* The field-length norm is important for full-text search.
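
Here is the same norm as a small Python sketch, comparing a short title-like field with a longer body-like field. The field sizes are invented for the example.

```python
import math

def field_length_norm(num_terms):
    """Field-length norm: 1 / sqrt(number of terms in the field)."""
    return 1 / math.sqrt(num_terms)

print(field_length_norm(4))    # short field, e.g. a 4-term title: prints 0.5
print(field_length_norm(100))  # longer field, e.g. a 100-term body: prints 0.1
```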

All three factors (term frequency, inverse document frequency and field-length norm) are calculated and stored at index time, and together they are used to calculate the weight of a single term in a particular document.
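
To see how the three factors interact, the sketch below multiplies them into a single term weight. Treat the plain product as a simplifying assumption for illustration: as far as I understand, Lucene's practical scoring function additionally squares the IDF and applies boosts, a coordination factor and query normalization, none of which are modelled here.

```python
import math

def term_weight(freq, num_docs, doc_freq, num_terms):
    """Simplified weight of one term in one field: tf * idf * norm."""
    tf = math.sqrt(freq)                            # term frequency (damped)
    idf = 1 + math.log(num_docs / (doc_freq + 1))   # inverse document frequency
    norm = 1 / math.sqrt(num_terms)                 # field-length norm
    return tf * idf * norm

# Same term, same index, one occurrence: a 4-term title vs a 400-term body.
title_weight = term_weight(freq=1, num_docs=1000, doc_freq=9, num_terms=4)
body_weight = term_weight(freq=1, num_docs=1000, doc_freq=9, num_terms=400)
print(title_weight > body_weight)  # prints True
```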
