
Chroma(Vector DB) and Sentence Transformer

IP_DataScientist 2023. 5. 30.

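Let's figure out how Chroma works.

Questions:

Is paragraph or sentence similarity measured at the moment text is put into the vector store?
When the retriever retrieves, what is used to measure similarity between the query and the stored sentences?
Sentence similarity looks better with a Sentence Transformer; if OpenAI embeddings are used instead, how is the similarity computed?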

Reference: LangChain + Chroma

https://github.com/chroma-core/chroma

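DeepL translation:

LangChain - the AI-native developer toolkit

We started LangChain to build a modular, flexible framework for developing AI-native applications. A few use cases that came to mind immediately were chatbots, question-answering services, and agents. Thousands of developers are now hacking on, tinkering with, and building all kinds of LLM-powered applications using LangChain's flexible, easy-to-use framework.

One of the core components of these applications is embeddings, together with the vector stores that hold and work with them.

One pain point we found with many existing vector stores is that they often require connecting to an external server that stores the embeddings. That is fine for moving an application into production, but it makes prototyping an application locally somewhat tricky.

The best solution for a local vector store had been FAISS, but many community members pointed out that it has tricky dependencies that cause installation problems.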

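Chroma - the AI-native vector store

Chroma was founded to build tools that leverage the power of embeddings. Embeddings are the AI-native way to represent any kind of data, which makes them a natural fit for all kinds of AI-powered tools and algorithms.

While exploring what model embeddings make possible, the Chroma team needed an easy-to-use, performant, and lightweight vector store that could handle modern AI workloads.

Several vector database solutions already existed, but we found that most of them were geared toward other use cases and access patterns, such as large-scale semantic search. They were also often cumbersome to set up and run, especially in development environments.

In short, the Chroma team could not find what it needed, so it built its own.

Chroma is a vector store and embedding database designed from the ground up to make it easy to build AI applications with embeddings. Everything you need to get started is built in (https://docs.trychroma.com/getting-started), and it runs on your machine with just pip install chromadb!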

https://github.com/hwchase17/chroma-langchain

the AI-native open-source embedding database

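I confirmed that when no embedding function is supplied to Chroma, it falls back to a Sentence Transformer model.

In the small comparison below on Korean text, the Sentence Transformer sRoBERTa model gave more accurate results than OpenAI's embedding model.
- For Korean, jhgan/ko-sroberta-multitask performed better.

Chroma's default SentenceTransformer model

Using the all-MiniLM-L6-v2 model, with results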

import chromadb
client = chromadb.Client()

collection = client.create_collection("sample_collection")

# Add docs to the collection. Can also update and delete. Row-based API coming soon!
collection.add(
    documents= ["자기자본영업이익율, 자기자본현금영업이익율", "자산 대비 금융비용 가산 순이익 비율, 자본 대비 세전순이익 비율", "한 소녀가 머리를 스타일링 하고 있다"], # we embed for you, or bring your own
    metadatas=[{"source": "notion"}, {"source": "google-docs"}, {"source": "google-docs"}], # filter on arbitrary metadata!
    ids=["doc1", "doc2", "doc3"], # must be unique for each doc 

)

results1 = collection.query(
    query_texts=["문장이 유사하지 않을 수록 거리가 멀게 나오는 듯 그러면 이건 거리가 짧은게 우선순위인지 궁금", "자기자본 영업이익율", "자기자본영업이익율", "자기자본영업이익율, 자기자본현금영업이익율", "한 무리의 남자들이 해면에서 축구를 한다"],
    n_results=3,
    # On an exact match the distance converges to ~0
    
    # where={"metadata_field": "is_equal_to_this"}, # optional filter
    # where_document={"$contains":"search_string"}  # optional filter
)  
results1

"""
{'ids': [['doc2', 'doc1', 'doc3'],
  ['doc1', 'doc2', 'doc3'],
  ['doc1', 'doc2', 'doc3'],
  ['doc1', 'doc2', 'doc3'],
  ['doc3', 'doc2', 'doc1']],
 'embeddings': None,
 'documents': [['자산 대비 금융비용 가산 순이익 비율, 자본 대비 세전순이익 비율',
   '자기자본영업이익율, 자기자본현금영업이익율',
   '한 소녀가 머리를 스타일링 하고 있다'],
  ['자기자본영업이익율, 자기자본현금영업이익율',
   '자산 대비 금융비용 가산 순이익 비율, 자본 대비 세전순이익 비율',
   '한 소녀가 머리를 스타일링 하고 있다'],
  ['자기자본영업이익율, 자기자본현금영업이익율',
   '자산 대비 금융비용 가산 순이익 비율, 자본 대비 세전순이익 비율',
   '한 소녀가 머리를 스타일링 하고 있다'],
  ['자기자본영업이익율, 자기자본현금영업이익율',
   '자산 대비 금융비용 가산 순이익 비율, 자본 대비 세전순이익 비율',
   '한 소녀가 머리를 스타일링 하고 있다'],
  ['한 소녀가 머리를 스타일링 하고 있다',
   '자산 대비 금융비용 가산 순이익 비율, 자본 대비 세전순이익 비율',
   '자기자본영업이익율, 자기자본현금영업이익율']],
 'metadatas': [[{'source': 'google-docs'},
   {'source': 'notion'},
   {'source': 'google-docs'}],
  [{'source': 'notion'}, {'source': 'google-docs'}, {'source': 'google-docs'}],
  [{'source': 'notion'}, {'source': 'google-docs'}, {'source': 'google-docs'}],
  [{'source': 'notion'}, {'source': 'google-docs'}, {'source': 'google-docs'}],
  [{'source': 'google-docs'},
   {'source': 'google-docs'},
   {'source': 'notion'}]],
 'distances': [[0.2731316387653351, 0.3064711093902588, 0.4529026448726654],
  [0.13952328264713287, 0.17610591650009155, 0.7605348229408264],
  [0.13648229837417603, 0.18283022940158844, 0.7422823905944824],
  [0.0, 0.1587597280740738, 0.9638125896453857],
  [0.3948799669742584, 1.083803653717041, 1.2342554330825806]]}
"""
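One way to read these numbers: as far as I know, Chroma's default distance for a collection is squared L2, and all-MiniLM-L6-v2 produces unit-length embeddings. Neither point is stated in the output above, so treat both as assumptions. Under them, squared L2 distance d and cosine similarity are related by d = 2 - 2·cos, so each reported distance can be converted into a cosine similarity. A minimal sketch (the helper names are made up for illustration):

```python
import math

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_from_squared_l2(distance):
    """Only valid when both vectors have unit length: d = 2 - 2*cos."""
    return 1.0 - distance / 2.0

# Sanity check of the identity on two unit vectors 60 degrees apart.
a = [1.0, 0.0]
b = [math.cos(math.pi / 3), math.sin(math.pi / 3)]
identity_gap = abs(cosine_from_squared_l2(squared_l2(a, b)) - cosine_similarity(a, b))

# Distances reported above for the query "자기자본 영업이익율" (doc1, doc2, doc3).
reported = [0.13952328264713287, 0.17610591650009155, 0.7605348229408264]
implied_cosine = [cosine_from_squared_l2(d) for d in reported]
print(implied_cosine)
```

Read this way, a smaller distance means a higher similarity, so the first id in each result list is the best match, and an exact match (distance 0) corresponds to cosine similarity 1.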

OpenAI Embedding

Using the text-embedding-ada-002 model, with results

# OpenAI embedding
import os

import chromadb
from chromadb.utils import embedding_functions
from dotenv import load_dotenv

# Load OPENAI_API_KEY from a local .env file
load_dotenv("./.env")

openai_ef = embedding_functions.OpenAIEmbeddingFunction(
                api_key=os.environ["OPENAI_API_KEY"],
                model_name="text-embedding-ada-002"
            )

client = chromadb.Client()

collection = client.create_collection("sample_collection", embedding_function=openai_ef)

# Add docs to the collection. Can also update and delete. Row-based API coming soon!
collection.add(
    documents= ["자기자본영업이익율, 자기자본현금영업이익율", "자산 대비 금융비용 가산 순이익 비율, 자본 대비 세전순이익 비율", "한 소녀가 머리를 스타일링 하고 있다"], # we embed for you, or bring your own
    metadatas=[{"source": "notion"}, {"source": "google-docs"}, {"source": "google-docs"}], # filter on arbitrary metadata!
    ids=["doc1", "doc2", "doc3"], # must be unique for each doc 
)

results2 = collection.query(
    # Each sentence is used as a query and compared against the documents in the collection
    query_texts=["문장이 유사하지 않을 수록 거리가 멀게 나오는 듯 그러면 이건 거리가 짧은게 우선순위인지 궁금", "자기자본 영업이익율", "자기자본영업이익율", "자기자본영업이익율, 자기자본현금영업이익율", "한 무리의 남자들이 해면에서 축구를 한다"], 
    n_results=3,
    # On an exact match the distance converges to ~0

    # where={"metadata_field": "is_equal_to_this"}, # optional filter
    # where_document={"$contains":"search_string"}  # optional filter
)  
results2

"""
{'ids': [['doc3', 'doc1', 'doc2'],
  ['doc1', 'doc2', 'doc3'],
  ['doc1', 'doc2', 'doc3'],
  ['doc1', 'doc2', 'doc3'],
  ['doc3', 'doc1', 'doc2']],
 'embeddings': None,
 'documents': [['한 소녀가 머리를 스타일링 하고 있다',
   '자기자본영업이익율, 자기자본현금영업이익율',
   '자산 대비 금융비용 가산 순이익 비율, 자본 대비 세전순이익 비율'],
  ['자기자본영업이익율, 자기자본현금영업이익율',
   '자산 대비 금융비용 가산 순이익 비율, 자본 대비 세전순이익 비율',
   '한 소녀가 머리를 스타일링 하고 있다'],
  ['자기자본영업이익율, 자기자본현금영업이익율',
   '자산 대비 금융비용 가산 순이익 비율, 자본 대비 세전순이익 비율',
   '한 소녀가 머리를 스타일링 하고 있다'],
  ['자기자본영업이익율, 자기자본현금영업이익율',
   '자산 대비 금융비용 가산 순이익 비율, 자본 대비 세전순이익 비율',
   '한 소녀가 머리를 스타일링 하고 있다'],
  ['한 소녀가 머리를 스타일링 하고 있다',
   '자기자본영업이익율, 자기자본현금영업이익율',
   '자산 대비 금융비용 가산 순이익 비율, 자본 대비 세전순이익 비율']],
 'metadatas': [[{'source': 'google-docs'},
   {'source': 'notion'},
   {'source': 'google-docs'}],
  [{'source': 'notion'}, {'source': 'google-docs'}, {'source': 'google-docs'}],
  [{'source': 'notion'}, {'source': 'google-docs'}, {'source': 'google-docs'}],
  [{'source': 'notion'}, {'source': 'google-docs'}, {'source': 'google-docs'}],
  [{'source': 'google-docs'},
   {'source': 'notion'},
   {'source': 'google-docs'}]],
 'distances': [[0.44236963987350464, 0.4673040211200714, 0.4736778438091278],
  [0.06233500689268112, 0.2052769958972931, 0.45922377705574036],
  [0.05751027166843414, 0.20472542941570282, 0.4568145275115967],
  [0.0, 0.152279332280159, 0.4780732989311218],
  [0.3532763421535492, 0.48527613282203674, 0.5043538808822632]]}
"""

Sentence Transformer Embedding

Using the jhgan/ko-sroberta-multitask model, with results

# Sentence Transformer embedding (the distances separate related and unrelated sentences noticeably better)

import chromadb
from chromadb.utils import embedding_functions

# embedder = SentenceTransformer("jhgan/ko-sroberta-multitask")
sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="jhgan/ko-sroberta-multitask")

client = chromadb.Client()

collection = client.create_collection("sample_collection", embedding_function=sentence_transformer_ef)
# embedding_function=embedder.encode()
# Add docs to the collection. Can also update and delete. Row-based API coming soon!
collection.add(
    documents= ["자기자본영업이익율, 자기자본현금영업이익율", "자산 대비 금융비용 가산 순이익 비율, 자본 대비 세전순이익 비율", "한 소녀가 머리를 스타일링 하고 있다"], # we embed for you, or bring your own
    metadatas=[{"source": "notion"}, {"source": "google-docs"}, {"source": "google-docs"}], # filter on arbitrary metadata!
    ids=["doc1", "doc2", "doc3"], # must be unique for each doc 

)

results3 = collection.query(
    query_texts=["문장이 유사하지 않을 수록 거리가 멀게 나오는 듯 그러면 이건 거리가 짧은게 우선순위인지 궁금", "자기자본 영업이익율", "자기자본영업이익율", "자기자본영업이익율, 자기자본현금영업이익율", "한 무리의 남자들이 해면에서 축구를 한다"],
    n_results=3,
    # On an exact match the distance is 0
    # where={"metadata_field": "is_equal_to_this"}, # optional filter
    # where_document={"$contains":"search_string"}  # optional filter
    
)  
results3

"""
{'ids': [['doc2', 'doc1', 'doc3'],
  ['doc1', 'doc2', 'doc3'],
  ['doc1', 'doc2', 'doc3'],
  ['doc1', 'doc2', 'doc3'],
  ['doc2', 'doc1', 'doc3']],
 'embeddings': None,
 'documents': [['자산 대비 금융비용 가산 순이익 비율, 자본 대비 세전순이익 비율',
   '자기자본영업이익율, 자기자본현금영업이익율',
   '한 소녀가 머리를 스타일링 하고 있다'],
  ['자기자본영업이익율, 자기자본현금영업이익율',
   '자산 대비 금융비용 가산 순이익 비율, 자본 대비 세전순이익 비율',
   '한 소녀가 머리를 스타일링 하고 있다'],
  ['자기자본영업이익율, 자기자본현금영업이익율',
   '자산 대비 금융비용 가산 순이익 비율, 자본 대비 세전순이익 비율',
   '한 소녀가 머리를 스타일링 하고 있다'],
  ['자기자본영업이익율, 자기자본현금영업이익율',
   '자산 대비 금융비용 가산 순이익 비율, 자본 대비 세전순이익 비율',
   '한 소녀가 머리를 스타일링 하고 있다'],
  ['자산 대비 금융비용 가산 순이익 비율, 자본 대비 세전순이익 비율',
   '자기자본영업이익율, 자기자본현금영업이익율',
   '한 소녀가 머리를 스타일링 하고 있다']],
 'metadatas': [[{'source': 'google-docs'},
   {'source': 'notion'},
   {'source': 'google-docs'}],
  [{'source': 'notion'}, {'source': 'google-docs'}, {'source': 'google-docs'}],
  [{'source': 'notion'}, {'source': 'google-docs'}, {'source': 'google-docs'}],
  [{'source': 'notion'}, {'source': 'google-docs'}, {'source': 'google-docs'}],
  [{'source': 'google-docs'},
   {'source': 'notion'},
   {'source': 'google-docs'}]],
 'distances': [[186.68692016601562, 193.07351684570312, 242.72315979003906],
  [34.74972915649414, 70.28058624267578, 238.38934326171875],
  [22.45892333984375, 70.1590347290039, 266.7791748046875],
  [0.0, 52.723785400390625, 253.9215850830078],
  [244.60269165039062, 247.07911682128906, 303.5185241699219]]}
"""
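The distances here are in the hundreds rather than between 0 and roughly 1. A likely explanation, which I have not verified against the model itself, is that jhgan/ko-sroberta-multitask does not normalize its output vectors, and squared L2 distance grows with the square of the vector length. The toy example below uses made-up 2-D vectors, not real embeddings. It shows that scaling both vectors by 10 multiplies the squared L2 distance by 100 while the cosine distance stays the same, which is why raw distances from different embedding models are not directly comparable.

```python
import math

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 - cosine similarity; independent of vector length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Two unit-length toy vectors and the same pair scaled by 10.
a = [0.6, 0.8]
b = [0.8, 0.6]
scale = 10.0
a_big = [scale * x for x in a]
b_big = [scale * x for x in b]

l2_small = squared_l2(a, b)
l2_big = squared_l2(a_big, b_big)
l2_ratio = l2_big / l2_small  # grows with scale squared

cos_small = cosine_distance(a, b)
cos_big = cosine_distance(a_big, b_big)  # unaffected by the scaling

print(l2_small, l2_big, l2_ratio)
print(cos_small, cos_big)
```

If cosine distance is what you want, Chroma lets you pick the distance when a collection is created. To my knowledge this is done by passing metadata={"hnsw:space": "cosine"} to create_collection (the other values being "l2" and "ip"), but confirm that against the documentation for your installed version.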

Inside Chroma

# Excerpt from Chroma.py (apparently LangChain's Chroma wrapper, langchain.vectorstores.Chroma)
# It provides similarity search

    def similarity_search(
        self,
        query: str,
        k: int = 4,
        filter: Optional[Dict[str, str]] = None,
        **kwargs: Any,
    ) -> List[Document]:
        """Run similarity search with Chroma.

        Args:
            query (str): Query text to search for.
            k (int): Number of results to return. Defaults to 4.
            filter (Optional[Dict[str, str]]): Filter by metadata. Defaults to None.

        Returns:
            List[Document]: List of documents most similar to the query text.
        """
        docs_and_scores = self.similarity_search_with_score(query, k, filter=filter)
        return [doc for doc, _ in docs_and_scores]

    def similarity_search_by_vector(
        self,
        embedding: List[float],
        k: int = 4,
        filter: Optional[Dict[str, str]] = None,
        **kwargs: Any,
    ) -> List[Document]:
        """Return docs most similar to embedding vector.
        Args:
            embedding (str): Embedding to look up documents similar to.
            k (int): Number of Documents to return. Defaults to 4.
            filter (Optional[Dict[str, str]]): Filter by metadata. Defaults to None.
        Returns:
            List of Documents most similar to the query vector.
        """
        results = self.__query_collection(
            query_embeddings=embedding, n_results=k, where=filter
        )
        return _results_to_docs(results)

    def similarity_search_with_score(
        self,
        query: str,
        k: int = 4,
        filter: Optional[Dict[str, str]] = None,
        **kwargs: Any,
    ) -> List[Tuple[Document, float]]:
        """Run similarity search with Chroma with distance.

        Args:
            query (str): Query text to search for.
            k (int): Number of results to return. Defaults to 4.
            filter (Optional[Dict[str, str]]): Filter by metadata. Defaults to None.

        Returns:
            List[Tuple[Document, float]]: List of documents most similar to the query
                text with distance in float.
        """
        if self._embedding_function is None:
            results = self.__query_collection(
                query_texts=[query], n_results=k, where=filter
            )
        else:
            query_embedding = self._embedding_function.embed_query(query)
            results = self.__query_collection(
                query_embeddings=[query_embedding], n_results=k, where=filter
            )

        return _results_to_docs_and_scores(results)
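
Reading the excerpt, nothing is compared when documents are added; only their vectors are stored. The comparison happens at query time: the query is embedded with the same embedding function, its distance to the stored vectors is measured, and the k closest documents are returned. The sketch below imitates that flow with a brute-force search. It is not how Chroma is implemented (as far as I know Chroma uses an approximate HNSW index); `TinyVectorStore` and `toy_embed` are hypothetical names, and the letter-count embedding is only a stand-in for a real model.

```python
from typing import Callable, List, Tuple

Vector = List[float]

def toy_embed(text: str) -> Vector:
    """Stand-in embedding: counts of a few letters. Not a real model."""
    return [float(text.count(ch)) for ch in "abcde"]

def squared_l2(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

class TinyVectorStore:
    """Hypothetical in-memory store that mimics add() and query()."""

    def __init__(self, embed: Callable[[str], Vector]) -> None:
        self._embed = embed
        self._ids: List[str] = []
        self._vectors: List[Vector] = []

    def add(self, ids: List[str], documents: List[str]) -> None:
        # Insertion only embeds and stores; no similarity is computed here.
        for doc_id, doc in zip(ids, documents):
            self._ids.append(doc_id)
            self._vectors.append(self._embed(doc))

    def query(self, text: str, n_results: int = 3) -> List[Tuple[str, float]]:
        # Query time: embed the query, then measure its distance to every vector.
        q = self._embed(text)
        scored = [(doc_id, squared_l2(q, v))
                  for doc_id, v in zip(self._ids, self._vectors)]
        scored.sort(key=lambda pair: pair[1])  # smallest distance first
        return scored[:n_results]

store = TinyVectorStore(toy_embed)
store.add(["doc1", "doc2", "doc3"], ["aab", "abb", "cde"])
hits = store.query("aab", n_results=2)
print(hits)
```

The same reasoning applies regardless of which embedding function is plugged in: the embedding model decides where the vectors land, and the store's distance function decides how "close" is measured.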

Sentence Transformer

Semantic Textual Similarity — Sentence-Transformers documentation
