nomic-ai/CodeRankEmbed

TEXT GENERATION | Concurrency Cost: 1 | Model Size: 0.1B | Quant: BF16 | Ctx Length: 32k | Published: Nov 3, 2024 | License: mit | Architecture: Transformer | 0.1K | Open Weights | Cold

CodeRankEmbed by nomic-ai is a 137-million-parameter bi-encoder model designed for code retrieval, with an 8192-token context length. On the reported code retrieval benchmarks, it outperforms other open-source and proprietary code embedding models. It generates embeddings for both code snippets and queries, so relevant code can be found efficiently and accurately.

CodeRankEmbed: High-Performance Code Retrieval

CodeRankEmbed is a 137-million-parameter bi-encoder model developed by nomic-ai for code retrieval. It supports an 8192-token context length, which makes it suitable for longer code snippets and complex queries. The model is initialized from Arctic-Embed-M-Long and contrastively fine-tuned with InfoNCE loss on CoRNStack, a dataset of 21 million examples.
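
To make the contrastive objective concrete, here is a minimal sketch of the InfoNCE loss for a single query, written in plain Python. It is an illustration of the general technique, not the training code used for CodeRankEmbed. The function name `info_nce_loss`, the temperature of 0.05, and the example similarity values are all assumptions chosen for the example.

```python
import math

def info_nce_loss(sim_to_positive, sims_to_negatives, temperature=0.05):
    """InfoNCE loss for one query.

    sim_to_positive: similarity between the query and its matching code.
    sims_to_negatives: similarities between the query and non-matching code.
    temperature: scaling factor (0.05 is an assumed value, not from the source).
    """
    logits = [sim_to_positive / temperature]
    logits += [s / temperature for s in sims_to_negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    # Negative log-probability assigned to the positive pair.
    return log_denominator - logits[0]

# Hypothetical similarity scores: easy negatives vs. hard negatives.
easy = info_nce_loss(0.9, [0.1, 0.0, -0.2])
hard = info_nce_loss(0.9, [0.85, 0.8, 0.7])
print(f"easy negatives: {easy:.4f}, hard negatives: {hard:.4f}")
```

The loss shrinks as the positive pair scores higher than the negatives, which is what pushes matching query and code embeddings together during fine-tuning.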

Key Capabilities

  • State-of-the-art Code Embedding: Achieves an MRR of 77.9 on CSN (CodeSearchNet) and an NDCG@10 of 60.1 on CoIR, outperforming models like CodeSage, Jina-Code-v2, CodeT5+, OpenAI-Ada-002, and Voyage-Code-002.
  • Efficient Code Search: Designed to generate embeddings for both queries and code, enabling effective semantic search for relevant code.
  • Long Context Support: Accepts inputs of up to 8192 tokens, so longer code snippets and queries can be embedded whole.
  • Integration with Re-rankers: Can be combined with re-rankers like CodeRankLLM for enhanced retrieval quality.
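
For readers unfamiliar with the two metrics quoted above, the following sketch shows how MRR and NDCG@10 are commonly computed. It is a simplified illustration rather than the benchmark harness behind the reported numbers. The function names are hypothetical, and `ndcg_at_k` computes the ideal ordering from the retrieved list only, which is an assumption that differs from evaluations that know every relevant document in the corpus. Benchmark tables usually report these values multiplied by 100.

```python
import math

def mean_reciprocal_rank(ranked_relevance):
    """ranked_relevance: one list per query of 0/1 flags in ranked order."""
    total = 0.0
    for flags in ranked_relevance:
        for rank, relevant in enumerate(flags, start=1):
            if relevant:
                # Only the first relevant result counts for this query.
                total += 1.0 / rank
                break
    return total / len(ranked_relevance)

def ndcg_at_k(gains, k=10):
    """gains: graded relevance of the retrieved items, in ranked order."""
    def dcg(values):
        # Linear gain with a log2 position discount, truncated at k.
        return sum(g / math.log2(i + 2) for i, g in enumerate(values[:k]))
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

# Two hypothetical queries: first hit at rank 2, then at rank 1.
mrr = mean_reciprocal_rank([[0, 1, 0], [1, 0, 0]])
ndcg = ndcg_at_k([0, 1])
print(f"MRR = {mrr:.3f}, NDCG@10 = {ndcg:.3f}")
```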

Usage Notes

  • Queries must include the task instruction prefix: "Represent this query for searching relevant code".
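
The sketch below shows where the prefix fits in a simple search loop: the instruction is prepended to the query only, while code snippets are embedded as-is. Everything here except the prefix string is an assumption for illustration. In particular, the `": "` separator between prefix and query is assumed, and `toy_embed` is a hypothetical stand-in that counts keywords so the example runs without downloading the model. In a real setup it would be replaced by a call to the model's encoder.

```python
import math

QUERY_PREFIX = "Represent this query for searching relevant code"

def format_query(text):
    # Assumption: prefix and query are joined with ": ".
    return f"{QUERY_PREFIX}: {text}"

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_code(query, snippets, embed):
    """Return (score, snippet) pairs sorted by descending cosine similarity."""
    query_vec = embed(format_query(query))   # prefix applied to the query
    scored = [(cosine(query_vec, embed(s)), s) for s in snippets]  # no prefix on code
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

# Hypothetical stand-in for the real encoder: counts a few keywords.
VOCAB = ["factorial", "sort", "http"]

def toy_embed(text):
    lowered = text.lower()
    return [float(lowered.count(word)) for word in VOCAB]

snippets = [
    "def bubble_sort(items): return sorted(items)",
    "def factorial(n): return 1 if n < 2 else n * factorial(n - 1)",
]
results = rank_code("compute the factorial of n", snippets, toy_embed)
for score, snippet in results:
    print(f"{score:.2f}  {snippet}")
```

Keeping the prefix in a single helper such as `format_query` makes it harder to forget on the query side or to apply by mistake on the code side.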

Good For

  • Developers building intelligent code search engines.
  • Systems requiring high-accuracy code retrieval from large repositories.
  • Applications needing to find relevant code snippets based on natural language queries.