declarative pipeline composition for nlp workflows
Constructs NLP processing pipelines by declaratively composing named components (tagger, parser, NER, textcat, etc.) in an INI-style `.cfg` configuration file with no hidden defaults. Each component processes Doc objects sequentially, enabling reproducible, version-controlled NLP workflows. The configuration specifies component order, hyperparameters, batch sizes, and GPU allocation, making training runs fully transparent and auditable.
Unique: Uses explicit INI-style configuration files with a 'no hidden defaults' philosophy, making every training decision visible and version-controllable. Unlike frameworks that embed hyperparameters in code, spaCy separates configuration from logic, enabling non-developers to modify pipelines and researchers to track experimental variations precisely.
vs alternatives: Offers more explicit, auditable pipeline composition than NLTK or TextBlob (which embed defaults in code), and is lighter-weight than full ML frameworks like Hugging Face Transformers for composing pure NLP tasks.
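A minimal sketch of the config workflow, assuming spaCy v3.x: compose a pipeline in Python, then export the fully explicit config that would otherwise be written by hand and passed to `spacy train`.

```python
import spacy

# A blank English pipeline; components are attached by registered name.
nlp = spacy.blank("en")
nlp.add_pipe("tagger")
nlp.add_pipe("textcat")

# Component order is recorded explicitly in the generated config.
print(nlp.config["nlp"]["pipeline"])  # ['tagger', 'textcat']

# Persist the complete config for version control; `spacy train config.cfg`
# consumes the same file, so a training run is reproducible from this artifact.
nlp.config.to_disk("./config.cfg")
```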
multi-language linguistic analysis with pre-trained pipelines
Provides 84 pre-trained statistical and transformer-based pipelines across 25 languages, enabling immediate tokenization, POS tagging, dependency parsing, lemmatization, and NER without training. Pipelines are language-specific (e.g., `en_core_web_sm`, `de_core_news_md`) and optimized for speed via Cython-based tokenization and efficient memory management. Supports both CPU-based statistical models and GPU-accelerated transformer models (BERT, etc.) for higher accuracy.
Unique: Combines Cython-optimized statistical models with optional transformer support in a unified API, enabling developers to swap between speed and accuracy without rewriting code. Pre-trained models are language-specific and optimized for production use, not research; includes 84 models across 25 languages with transparent accuracy metrics.
vs alternatives: Faster than Hugging Face Transformers for pure linguistic analysis (tokenization, POS tagging, parsing) thanks to the Cython implementation and statistical models; more language coverage than NLTK; more production-focused than research-oriented toolkits such as Stanza.
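For example, after downloading a pipeline with `python -m spacy download en_core_web_sm`, the full analysis is one call:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion.")

# Tokenization, POS tags, and dependency labels from a single pass.
for token in doc:
    print(token.text, token.pos_, token.dep_)

# Named entities are annotated on the same Doc object.
print([(ent.text, ent.label_) for ent in doc.ents])
```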
span categorization for multi-span classification
Categorizes arbitrary text spans (not just named entities) into user-defined categories via the trainable span categorizer (`spancat`) component. Unlike NER, which predicts non-overlapping entity boundaries, the span categorizer classifies candidate spans proposed by a configurable suggester function (e.g., all n-grams up to a fixed length) or supplied by earlier annotation, and it supports overlapping spans and multiple categories per span. Enables tasks like aspect-based sentiment analysis, attribute extraction, or fine-grained entity typing.
Unique: Provides span-level classification as a component distinct from NER, enabling fine-grained categorization of candidate spans. Supports overlapping spans and multiple categories per span, unlike NER, which assumes non-overlapping entity boundaries.
vs alternatives: More flexible than NER for overlapping or fine-grained classification; simpler than building custom span classification models; integrates into pipeline unlike standalone classifiers.
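A short sketch of the data structure the component populates: the `spancat` component itself requires training, so this example assigns overlapping spans manually to show the representation (`"sc"` is the component's default span group key).

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("The battery life is great but the screen is dim.")

# Overlapping spans with user-defined labels, stored in a named span group.
doc.spans["sc"] = [
    Span(doc, 1, 3, label="ASPECT"),    # "battery life"
    Span(doc, 1, 5, label="POSITIVE"),  # "battery life is great" (overlaps)
]

for span in doc.spans["sc"]:
    print(span.text, span.label_)
```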
sentence segmentation and boundary detection
Segments text into sentences by marking sentence boundaries on Doc objects. The rule-based `sentencizer` component splits on sentence-final punctuation (periods, question marks, exclamation marks), while the trainable `senter` component and the dependency parser handle ambiguous cases (e.g., abbreviations like 'Dr.' or 'U.S.'). Boundaries are exposed via `token.is_sent_start` and `doc.sents`, enabling downstream components to process sentences independently, and custom segmentation rules can be added via component configuration.
Unique: Integrates sentence segmentation into the pipeline as a configurable component, enabling custom segmentation rules without code changes. Supports both rule-based and neural models for boundary detection.
vs alternatives: More accurate than simple regex-based splitting; the trainable model resolves abbreviation ambiguity better than NLTK's default splitter; integrates into the pipeline unlike standalone segmenters.
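A minimal sketch: the rule-based `sentencizer` splits on configurable sentence-final punctuation; swapping in the trainable `senter` component (same pipeline API) is the route for ambiguous boundaries.

```python
import spacy

nlp = spacy.blank("en")
# Rule-based splitter with explicit punctuation characters; for ambiguous
# abbreviations, the trainable "senter" component is the better fit.
nlp.add_pipe("sentencizer", config={"punct_chars": [".", "!", "?"]})

doc = nlp("The demo worked. Everyone was pleased!")
for sent in doc.sents:
    print(sent.text)

# Boundaries are also exposed per token.
print([token.is_sent_start for token in doc])
```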
project templates and end-to-end workflow scaffolding
Provides pre-built project templates for common NLP tasks (NER, text classification, relation extraction, etc.) that can be cloned and customized. Templates include directory structure, configuration files, training scripts, and evaluation code, enabling developers to start with a working end-to-end workflow rather than building from scratch. Templates are version-controlled and can be extended with custom components or data.
Unique: Provides end-to-end project templates with configuration, training scripts, and evaluation code, enabling developers to start with a working workflow. Templates are version-controlled and can be customized without losing template updates.
vs alternatives: More complete than code snippets; enables faster project setup than building from scratch; standardizes project structure across teams.
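The workflow is driven by the `spacy project` CLI; a sketch of a typical sequence, scripted from Python here to keep the example self-contained (the template name `pipelines/ner_demo` is an example from the explosion/projects repository):

```python
import subprocess

# Clone a template into ./ner_demo, fetch the data assets it declares,
# and run the full workflow defined in its project.yml.
subprocess.run(["python", "-m", "spacy", "project", "clone", "pipelines/ner_demo"], check=True)
subprocess.run(["python", "-m", "spacy", "project", "assets"], cwd="ner_demo", check=True)
subprocess.run(["python", "-m", "spacy", "project", "run", "all"], cwd="ner_demo", check=True)
```

In practice these are usually typed directly into a shell rather than wrapped in subprocess calls.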
visualization of linguistic annotations
Provides built-in visualizers (displaCy) for displaying linguistic annotations (dependency trees, named entities, span groups) as HTML in the browser or inline in Jupyter notebooks. Visualizers render Doc objects with color-coded entities, dependency arcs, and span annotations, enabling debugging and explanation of model predictions. Supports custom styling and filtering of visualizations.
Unique: Provides built-in visualizers for dependency trees and NER that render directly in Jupyter notebooks or as interactive HTML, enabling quick inspection without external tools. Visualizers are tightly integrated with spaCy's Doc objects.
vs alternatives: More integrated than external visualization tools; simpler than building custom visualizations; supports Jupyter notebooks for interactive exploration.
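For instance, with any downloaded pipeline that includes a parser and NER:

```python
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Autonomous cars shift insurance liability toward manufacturers.")

# In a Jupyter notebook, render() displays inline; style="dep" draws
# dependency arcs, style="ent" highlights named entities.
displacy.render(doc, style="dep")

# Outside notebooks, serve() hosts the same HTML on a local port.
# displacy.serve(doc, style="ent")
```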
model packaging and deployment
Packages trained spaCy pipelines as distributable Python packages (wheels, tarballs) that can be installed via pip. Enables versioning, dependency management, and easy deployment to production environments. Packaged models include all trained components, configuration, and metadata; can be installed as `pip install spacy-model-name` and loaded via `spacy.load()`. Supports model versioning and compatibility checking.
Unique: Provides built-in model packaging as Python packages, enabling trained pipelines to be versioned, distributed, and installed via pip. Models include all components and configuration; no separate model files required.
vs alternatives: Simpler than manual model serialization; enables version control and dependency management; integrates with Python packaging ecosystem.
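A sketch of the packaging round trip (paths and the package name are examples, not fixed conventions):

```python
# Build an installable package from a trained pipeline directory; run
# from a shell:
#   python -m spacy package ./training/model-best ./packages --build wheel
#   pip install ./packages/<name>-<version>/dist/<name>-<version>-py3-none-any.whl
import spacy

# After installation, the pipeline loads by package name, and spaCy
# checks version compatibility from the package metadata.
nlp = spacy.load("en_demo_pipeline")  # hypothetical package name
```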
llm integration for few-shot and zero-shot tasks
Integrates large language models (via the `spacy-llm` extension package) for few-shot and zero-shot NLP tasks without requiring training data. LLMs act as components in the pipeline, enabling tasks like entity extraction, text classification, and relation extraction using natural language prompts instead of labeled training data.
Unique: Integrates LLMs as pipeline components via the `spacy-llm` package, enabling few-shot and zero-shot NLP tasks without training data. LLM outputs are converted to structured spaCy annotations (entities, classifications, etc.).
vs alternatives: Faster to prototype than training custom models because no labeled data required, but slower and more expensive than pretrained models for production use due to LLM API latency and costs.
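A hedged sketch with `spacy-llm` (the task and model registry names follow its documentation but vary across versions; an OpenAI key is assumed in the `OPENAI_API_KEY` environment variable for this particular model choice):

```python
import spacy

nlp = spacy.blank("en")
# "llm" is the factory registered by the spacy-llm package. No labeled
# training data is involved; the task defines the prompt and the parsing
# of the model's response.
nlp.add_pipe("llm", config={
    "task": {"@llm_tasks": "spacy.NER.v3", "labels": ["PERSON", "ORG", "LOC"]},
    "model": {"@llm_models": "spacy.GPT-3-5.v2"},
})

doc = nlp("Ada Lovelace met Charles Babbage in London.")
# The model's free-text answer is parsed back into structured annotations.
print([(ent.text, ent.label_) for ent in doc.ents])
```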
+9 more capabilities