{"passport":{"unfragile":{"@version":"1.0","version":"2026-05","artifact":{"id":"pypi_pypi-rank-bm25","slug":"pypi-rank-bm25","name":"rank-bm25","type":"repo","url":"https://github.com/dorianbrown/rank_bm25","page_url":"https://unfragile.ai/pypi-rank-bm25","categories":["data-analysis","documentation"],"tags":[],"pricing":{"model":"open_source","free":true,"starting_price":null},"status":"active","verified":false},"capabilities":[{"id":"pypi_pypi-rank-bm25__cap_0","uri":"capability://search.retrieval.bm25okapi.probabilistic.document.ranking.with.standard.parameters","name":"bm25okapi probabilistic document ranking with standard parameters","description":"Implements the canonical BM25 (Best Matching 25) algorithm using the Okapi variant, which scores document relevance to queries through a probabilistic ranking function that combines term frequency, inverse document frequency, and document length normalization. The implementation accepts pre-tokenized document corpora and queries, computing relevance scores via numpy-based matrix operations on term statistics (document frequencies, term positions, corpus-wide IDF values). 
Initialization computes IDF values across the entire corpus once, then get_scores() applies the BM25 formula with tunable k1 (term saturation) and b (length normalization) parameters to generate per-document relevance scores.","intents":["I need to rank documents by relevance to a search query without building a full search engine","I want to implement semantic search using BM25 as a baseline before adding neural ranking","I need to retrieve top-N most relevant documents from a corpus for a given query","I'm building a recommendation system that scores items based on keyword matching"],"best_for":["Information retrieval engineers building search systems","NLP practitioners prototyping ranking pipelines","Teams implementing hybrid search (BM25 + dense retrieval)","Developers building lightweight search without Elasticsearch/Solr"],"limitations":["Requires pre-tokenized input — no built-in text preprocessing (stemming, lowercasing, stopword removal)","Stateless scoring — no learned ranking or personalization across queries","Memory scales linearly with corpus size and vocabulary size; corpus must fit in RAM","No support for phrase queries, boolean operators, or field-specific weighting","IDF computation is corpus-specific; adding new documents requires recomputing IDF values"],"requires":["Python 3.6+","numpy (any recent version)","Pre-tokenized documents as list of lists (e.g., [['word1', 'word2'], ['word3', 'word4']])","Pre-tokenized query as list of strings"],"input_types":["list of lists of strings (tokenized corpus)","list of strings (tokenized query)"],"output_types":["numpy array of float scores (one per document)","list of tuples (document_index, score) when using get_top_n()"],"categories":["search-retrieval","data-processing-analysis"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"pypi_pypi-rank-bm25__cap_1","uri":"capability://search.retrieval.bm25l.length.normalized.document.ranking.for.variable.length.documents","name":"bm25l length-normalized 
document ranking for variable-length documents","description":"Implements the BM25L variant, which modifies the standard BM25 formula to normalize document length more aggressively, addressing the bias toward longer documents that can occur with standard BM25. The algorithm adjusts the length normalization component by using a different formula that prevents saturation effects when documents vary significantly in length. Like BM25Okapi, it computes corpus-wide IDF once during initialization and applies the modified scoring formula during get_scores(), but the length normalization parameter b has different semantics and impact compared to the standard variant.","intents":["I need to rank documents fairly when my corpus has highly variable document lengths (e.g., tweets vs. research papers)","I want to reduce the advantage given to longer documents in BM25 scoring","I'm ranking short-form content (abstracts, snippets) and need to prevent length bias"],"best_for":["Search systems over heterogeneous document collections (mixed lengths)","Short-form content ranking (social media, abstracts, snippets)","Teams comparing multiple BM25 variants to find optimal ranking"],"limitations":["Length normalization is more aggressive than BM25Okapi, which may under-reward relevant longer documents in some domains","No empirical guidance on when to use BM25L vs. 
BM25Okapi; requires domain-specific tuning","Same preprocessing and memory constraints as BM25Okapi","Parameter b still requires manual tuning; no automatic optimization"],"requires":["Python 3.6+","numpy","Pre-tokenized documents and queries"],"input_types":["list of lists of strings (tokenized corpus)","list of strings (tokenized query)"],"output_types":["numpy array of float scores","list of top-N documents when using get_top_n()"],"categories":["search-retrieval","data-processing-analysis"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"pypi_pypi-rank-bm25__cap_2","uri":"capability://search.retrieval.bm25.enhanced.term.frequency.handling.with.saturation.control","name":"bm25+ enhanced term frequency handling with saturation control","description":"Implements the BM25+ variant, which refines the term frequency component of standard BM25 by adding a constant delta to it, so that any document containing a query term receives a non-zero minimum contribution for that term. This addresses a limitation in BM25Okapi where length normalization can over-penalize very long documents, leaving a long matching document scored little higher than one that lacks the term. 
The implementation maintains the same initialization and scoring interface as other variants but applies a modified formula during get_scores() that ensures monotonic improvement with term frequency.","intents":["I need to rank documents where term frequency should always contribute positively to relevance","I want to avoid the counter-intuitive behavior where very high term frequencies reduce scores","I'm tuning BM25 parameters and want a variant with better theoretical properties"],"best_for":["Ranking systems where term frequency saturation is problematic","Teams comparing multiple BM25 variants for empirical performance","Researchers implementing BM25+ from academic literature"],"limitations":["Empirical improvements over BM25Okapi are modest and dataset-dependent","No clear guidance on when BM25+ outperforms other variants in practice","Same preprocessing and memory constraints as other BM25 variants","Parameter tuning still required; the additional constant term may need adjustment"],"requires":["Python 3.6+","numpy","Pre-tokenized documents and queries"],"input_types":["list of lists of strings (tokenized corpus)","list of strings (tokenized query)"],"output_types":["numpy array of float scores","list of tuples (document_index, score)"],"categories":["search-retrieval","data-processing-analysis"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"pypi_pypi-rank-bm25__cap_3","uri":"capability://data.processing.analysis.corpus.wide.idf.computation.with.lazy.initialization","name":"corpus-wide idf computation with lazy initialization","description":"Computes inverse document frequency (IDF) statistics across the entire tokenized corpus during algorithm initialization, storing term-to-IDF mappings that are reused across all subsequent queries. 
The implementation iterates through the corpus once to count document frequencies per term, then applies a smoothed, variant-specific IDF formula (for BM25Okapi, log((N - df + 0.5) / (df + 0.5)), where N is corpus size and df is document frequency) to generate a lookup table. This one-time computation cost is amortized across multiple queries, but requires that the corpus is static: adding new documents necessitates recomputing IDF values for the entire corpus.","intents":["I need to compute IDF statistics once and reuse them across many queries","I want to understand which terms are rare vs. common in my corpus","I'm building a ranking system where IDF values should be corpus-specific, not pre-computed"],"best_for":["Static or slowly-changing document collections","Batch query processing where amortizing initialization cost is beneficial","Systems where corpus-specific IDF is important for ranking quality"],"limitations":["IDF values are corpus-specific and must be recomputed if corpus changes","No incremental IDF updates; adding a single document requires full recomputation","Memory usage scales with vocabulary size (number of unique terms across corpus)","No support for pre-computed or external IDF values"],"requires":["Python 3.6+","numpy","Complete tokenized corpus available at initialization time"],"input_types":["list of lists of strings (tokenized corpus)"],"output_types":["Internal IDF lookup table (dict or numpy array)"],"categories":["data-processing-analysis","search-retrieval"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"pypi_pypi-rank-bm25__cap_4","uri":"capability://search.retrieval.top.n.document.retrieval.with.sorted.ranking.results","name":"top-n document retrieval with sorted ranking results","description":"Provides a get_top_n() method that scores all documents in the corpus against a query and returns the top N results sorted by relevance score in descending order. 
The implementation calls get_scores() internally to compute relevance for all documents, then sorts the scores with numpy argsort and returns the N highest-scoring entries from the documents list supplied by the caller (the documents themselves, not index/score tuples). This convenience method eliminates the need for users to manually sort and filter results, providing a common retrieval pattern in a single function call.","intents":["I need to retrieve the top 10 most relevant documents for a query","I want a simple API that returns ranked results without manual sorting","I'm building a search interface that displays top-N results to users"],"best_for":["Search interfaces displaying ranked results","Batch retrieval systems needing top-N results","Rapid prototyping where convenience methods reduce boilerplate"],"limitations":["Scores all documents before filtering: O(corpus_size) complexity regardless of N","No pagination support; must retrieve all top-N results at once","No support for tie-breaking or secondary ranking criteria","Returns the matching documents without their scores or indices; callers who need those must call get_scores() and sort the result themselves"],"requires":["Python 3.6+","numpy","Pre-tokenized query","Documents list aligned with the indexed corpus (returned items are drawn from it)"],"input_types":["list of strings (tokenized query)","list of documents in the same order and of the same length as the indexed corpus","integer N (number of results)"],"output_types":["list of the top-N documents, highest score first"],"categories":["search-retrieval"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"pypi_pypi-rank-bm25__cap_5","uri":"capability://search.retrieval.tunable.bm25.parameters.k1.b.for.algorithm.customization","name":"tunable bm25 parameters (k1, b) for algorithm customization","description":"Exposes k1 (term saturation parameter) and b (length normalization parameter) as configurable hyperparameters during algorithm initialization, allowing users to customize the ranking behavior without modifying the library code. 
The k1 parameter controls how quickly term frequency saturates (higher k1 = slower saturation, more weight on term frequency), while b controls the degree of length normalization (b=0 disables length normalization, b=1 applies full normalization). These parameters are stored as instance variables and applied during get_scores() computation, enabling empirical tuning for specific domains or datasets.","intents":["I need to tune BM25 parameters to optimize ranking for my specific domain","I want to experiment with different k1 and b values to find the best configuration","I'm comparing BM25 variants and need to adjust parameters for fair comparison"],"best_for":["Teams optimizing ranking quality for specific domains","Researchers comparing BM25 variants with different parameter settings","Systems where default parameters don't work well"],"limitations":["No automatic parameter tuning — requires manual experimentation or grid search","No guidance on parameter ranges or typical values for different domains","Parameter sensitivity varies across datasets; optimal values are domain-specific","No validation of parameter values — invalid settings may produce unexpected results"],"requires":["Python 3.6+","numpy","Understanding of BM25 parameters and their effects"],"input_types":["float k1 (typical range 1.2-2.0)","float b (typical range 0.0-1.0)"],"output_types":["Modified ranking scores based on parameter values"],"categories":["search-retrieval","data-processing-analysis"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"pypi_pypi-rank-bm25__cap_6","uri":"capability://data.processing.analysis.minimal.dependency.architecture.with.numpy.only.computation","name":"minimal dependency architecture with numpy-only computation","description":"Implements all BM25 algorithms using only numpy for numerical operations, avoiding heavy dependencies on full IR frameworks (Elasticsearch, Solr) or machine learning libraries (scikit-learn, TensorFlow). 
The library uses numpy arrays for efficient vector operations (IDF lookups, score computation) and basic Python data structures (lists, dicts) for corpus management. This design choice minimizes installation overhead and allows the library to be embedded in larger systems without dependency conflicts, though it sacrifices some performance optimizations available in specialized IR libraries.","intents":["I need a lightweight BM25 implementation that doesn't require heavy dependencies","I want to embed ranking into a Python application without pulling in Elasticsearch","I'm building a system where dependency management is critical"],"best_for":["Lightweight applications and microservices","Embedded systems or edge deployments with limited dependencies","Prototyping and research where minimal setup is important","Teams avoiding Elasticsearch/Solr for cost or complexity reasons"],"limitations":["Performance is slower than optimized, index-based search engines such as Lucene","No built-in indexing structures (inverted indices, B-trees) for faster retrieval","Corpus must fit entirely in memory; no disk-based or distributed indexing","No support for advanced IR features (phrase queries, field weighting, faceting)"],"requires":["Python 3.6+","numpy (single external dependency)"],"input_types":["Standard Python data structures (lists, strings)"],"output_types":["numpy arrays and standard Python types"],"categories":["data-processing-analysis","search-retrieval"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"pypi_pypi-rank-bm25__cap_7","uri":"capability://data.processing.analysis.preprocessing.agnostic.tokenization.interface","name":"preprocessing-agnostic tokenization interface","description":"Accepts pre-tokenized documents and queries as input, leaving all text preprocessing (lowercasing, stemming, stopword removal, punctuation handling) to the caller. 
The library makes no assumptions about tokenization strategy and works with any tokenization scheme the user provides, whether simple whitespace splitting, sophisticated NLP pipelines (spaCy, NLTK), or domain-specific tokenizers. This design maximizes flexibility but requires users to implement preprocessing themselves, making the library a pure ranking algorithm rather than an end-to-end search solution.","intents":["I need to use BM25 with my custom tokenization and preprocessing pipeline","I want to experiment with different preprocessing strategies without changing the ranking algorithm","I'm integrating BM25 into a larger NLP system with existing preprocessing"],"best_for":["Teams with existing preprocessing pipelines","Researchers experimenting with different tokenization strategies","Systems where preprocessing is domain-specific or language-specific","Integration into larger NLP systems"],"limitations":["Users must implement all preprocessing — no built-in lowercasing, stemming, or stopword removal","No guidance on preprocessing best practices for different domains","Preprocessing quality directly impacts ranking quality; poor preprocessing degrades results","Inconsistent preprocessing between corpus and queries will produce poor results"],"requires":["Python 3.6+","Pre-tokenized documents and queries (user-provided)","Consistent tokenization strategy across corpus and queries"],"input_types":["list of lists of strings (pre-tokenized corpus)","list of strings (pre-tokenized query)"],"output_types":["Ranking scores based on provided tokenization"],"categories":["data-processing-analysis","search-retrieval"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"pypi_pypi-rank-bm25__cap_8","uri":"capability://search.retrieval.stateless.query.scoring.with.consistent.interface.across.variants","name":"stateless query scoring with consistent interface across variants","description":"Provides a uniform get_scores(tokenized_query) method across all BM25 variants 
(Okapi, L, Plus) that takes a pre-tokenized query and returns a numpy array of relevance scores, one per document in the corpus. The method is stateless — it does not modify internal state or cache results — and produces deterministic scores given the same query and corpus. All variants share this interface, allowing users to swap implementations without changing calling code, though the underlying scoring formulas differ.","intents":["I need to score a query against all documents in my corpus","I want to compare different BM25 variants by swapping implementations","I'm building a ranking pipeline that needs consistent scoring interface"],"best_for":["Comparative studies of BM25 variants","Pluggable ranking systems where algorithm can be swapped","Batch scoring of multiple queries"],"limitations":["Scores all documents every time — no caching or incremental updates","No support for query expansion, relevance feedback, or personalization","Scores are absolute values without normalization across queries","No ranking explanation or feature attribution"],"requires":["Python 3.6+","numpy","Pre-tokenized query"],"input_types":["list of strings (pre-tokenized query)"],"output_types":["numpy array of float scores (one per document)"],"categories":["search-retrieval","data-processing-analysis"],"confidence":0.5,"matches":0,"success_rate":0}],"trust":{"score":27,"verified":false,"data_access_risk":"high","permissions":["Python 3.6+","numpy (any recent version)","Pre-tokenized documents as list of lists (e.g., [['word1', 'word2'], ['word3', 'word4']])","Pre-tokenized query as list of strings","numpy","Pre-tokenized documents and queries","Complete tokenized corpus available at initialization time","Pre-tokenized query","Original corpus (to map indices back to documents)","Understanding of BM25 parameters and their effects"],"failure_modes":["Requires pre-tokenized input — no built-in text preprocessing (stemming, lowercasing, stopword removal)","Stateless scoring — no learned 
ranking or personalization across queries","Memory scales linearly with corpus size and vocabulary size; corpus must fit in RAM","No support for phrase queries, boolean operators, or field-specific weighting","IDF computation is corpus-specific; adding new documents requires recomputing IDF values","Length normalization is more aggressive than BM25Okapi, which may under-reward relevant longer documents in some domains","No empirical guidance on when to use BM25L vs. BM25Okapi — requires domain-specific tuning","Same preprocessing and memory constraints as BM25Okapi","Parameter b still requires manual tuning; no automatic optimization","Empirical improvements over BM25Okapi are modest and dataset-dependent","builder identity is not verified yet","no observed match outcomes yet"],"rank_breakdown":{"adoption":0.05,"quality":0.28,"ecosystem":0.49999999999999994,"match_graph":0.25,"freshness":1,"weights":{"adoption":0.3,"quality":0.2,"ecosystem":0.15,"match_graph":0.3,"freshness":0.05}},"observed_outcomes":{"matches":0,"success_rate":0,"avg_confidence":0,"top_intents":[],"last_matched_at":null},"maintenance":{"status":"active","updated_at":"2026-06-17T09:51:05.295Z","last_scraped_at":"2026-05-03T15:20:18.280Z","last_commit":null},"community":{"stars":null,"forks":null,"weekly_downloads":null,"model_downloads":null,"model_likes":null}},"distribution":{"claim_url":"https://unfragile.ai/submit?claim=pypi-rank-bm25","compare_url":"https://unfragile.ai/compare?artifact=pypi-rank-bm25"}},"signature":"UApeQFoQUSqw/iM4Sbg4XeqPx6BCl/JgdHjTURIAO7S36AUlgg6QYEXNLcoBzi4cxKSPg56tBPGb/doWkXgCCQ==","signedAt":"2026-06-18T01:04:32.522Z","signedBy":"unfragile.ai","version":1},"_links":{"self":"https://unfragile.ai/api/v1/passport/pypi-rank-bm25","artifact":"https://unfragile.ai/pypi-rank-bm25","verify":"https://unfragile.ai/api/v1/verify?slug=pypi-rank-bm25","publicKey":"https://unfragile.ai/api/v1/trust-passport-public-key","spec":"https://unfragile.ai/trust","schema":"https://unfragile.
ai/schema.json","docs":"https://unfragile.ai/docs"}}