Semantic Multilingual Search Implementation in Hansken

The Hansken Team is always looking for new inspiration, and academia is one of our routes to new knowledge. As a student, you can join the Hansken Team through an internship. Below is one of the directions that can inspire you to design your own research or assignment. Contact us for more information.

Project Description

This project focuses on enhancing the search functionality within Hansken by implementing semantic search. The goal is to break down text streams from digital traces (e.g. documents, emails, chats) into text blocks, calculate an embedding vector for each block, and store these vectors in Elasticsearch. Users' search queries are converted into embedding vectors in the same way, which allows users to find text blocks that are semantically similar to their query, potentially in a different language. A possible continuation of the project is to explore how language models, such as those behind ChatGPT, can be used for retrieval-augmented generation (RAG): answering users' search queries on the basis of the retrieved text blocks.
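The pipeline described above (split into blocks, embed, store, search by vector similarity) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Hansken implements it: the names split_into_blocks, toy_embed and BlockIndex are hypothetical, an in-memory list stands in for Elasticsearch, and the hashed character-trigram embedding is a placeholder that is neither semantic nor multilingual. A real system would swap in a multilingual embedding model at that point.

```python
import math
import zlib
from typing import Callable, List, Tuple

Vector = List[float]


def split_into_blocks(text: str, max_words: int = 50) -> List[str]:
    """Split a text stream into blocks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def toy_embed(text: str, dims: int = 256) -> Vector:
    """Placeholder embedding: hashed character trigrams, L2-normalised.

    ASSUMPTION: this only captures surface overlap. It is NOT semantic and
    NOT multilingual; a real embedding model would be called here instead.
    """
    vec = [0.0] * dims
    padded = f"  {text.lower()}  "
    for i in range(len(padded) - 2):
        trigram = padded[i:i + 3]
        vec[zlib.crc32(trigram.encode("utf-8")) % dims] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class BlockIndex:
    """In-memory stand-in for a vector index (hypothetical, not Elasticsearch)."""

    def __init__(self, embed: Callable[[str], Vector]) -> None:
        self.embed = embed
        self.entries: List[Tuple[str, Vector]] = []

    def add_text(self, text: str, max_words: int = 50) -> None:
        """Split text into blocks and store each block with its vector."""
        for block in split_into_blocks(text, max_words):
            self.entries.append((block, self.embed(block)))

    def search(self, query: str, k: int = 3) -> List[Tuple[float, str]]:
        """Return the k stored blocks most similar to the query."""
        query_vec = self.embed(query)
        scored = [(cosine(query_vec, vec), block) for block, vec in self.entries]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


index = BlockIndex(toy_embed)
index.add_text("The shipment arrives at the harbour on Friday night.")
index.add_text("Remember to buy milk and bread for the weekend.")
for score, block in index.search("harbour shipment on Friday", k=1):
    print(f"{score:.2f}  {block}")
```

Because the embedding function is passed in as a parameter, the same index and search code would work unchanged once the placeholder is replaced by a real model, and the list of entries could be replaced by a vector field in a search engine.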

Why This Project Is Interesting

As the volume of digital data keeps growing, so does the need for search that goes beyond keyword matching. This project enables searches that match on meaning and context rather than exact wording, including across languages, which can change the way information is found and analyzed.

Skills

  • NLP and machine learning, focusing on embedding vectors
  • Elasticsearch optimization
  • Programming in Python
  • Knowledge of retrieval-augmented generation (RAG) techniques