Multi Person Skeletal Tracking And Pose Detection In Single Video

1

MediaPipe | Framework | 60/100

via “pose landmark detection for body keypoint tracking”

Google's cross-platform on-device ML framework with pre-built solutions.

Unique: Provides a 33-point full-body skeleton with 3D coordinates (depth estimated from a single RGB camera) and per-landmark visibility scores, optimized for on-device inference on mobile and web platforms. The pose pipeline is two-stage: a person detector locates the body, then a landmark model predicts the keypoints within that region.

vs others: Faster and more mobile-friendly than OpenPose or MediaPipe's legacy Pose solution, and estimates 3D coordinates without a depth camera. However, it is oriented toward single-person pose and works best with the full body in view, unlike dedicated multi-person pose systems.
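
To illustrate how per-landmark visibility scores might be consumed downstream, the sketch below filters landmarks by a visibility threshold and converts normalized coordinates to pixels. The `Landmark` class, the helper names, and the 0.5 threshold are hypothetical stand-ins chosen for illustration; they are not MediaPipe's API.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Landmark:
    """Hypothetical container mirroring one pose landmark."""
    x: float           # horizontal position, normalized to [0, 1]
    y: float           # vertical position, normalized to [0, 1]
    z: float           # estimated relative depth
    visibility: float  # confidence in [0, 1] that the point is visible


def visible_landmarks(
    landmarks: List[Landmark], threshold: float = 0.5
) -> List[Tuple[int, Landmark]]:
    """Return (index, landmark) pairs whose visibility meets the threshold."""
    return [(i, lm) for i, lm in enumerate(landmarks) if lm.visibility >= threshold]


def to_pixels(lm: Landmark, width: int, height: int) -> Tuple[int, int]:
    """Convert normalized coordinates to integer pixel coordinates."""
    return int(lm.x * width), int(lm.y * height)
```

Keeping the original index alongside each landmark lets later stages still map a surviving point back to its joint in the 33-point skeleton.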

2

MS COCO (Common Objects in Context) | Dataset | 60/100

via “human keypoint detection annotation with standardized joint coordinate system”

330K images with object detection, segmentation, and captions.

Unique: A standardized 17-keypoint skeleton with explicit per-keypoint visibility flags supports evaluation of pose estimation under occlusion; because keypoints are annotated on the same person instances as the segmentation masks and bounding boxes, accuracy can be analyzed per joint within each person instance.

vs others: Differs from Human3.6M in kind rather than raw size: Human3.6M has more frames (3.6M, versus 330K images here) but was captured in a controlled studio, whereas COCO covers in-the-wild scenes with many people per image. Covers far more annotated people than MPII, and the visibility flags let evaluation account for occlusion explicitly.
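
As a concrete illustration of the 17-keypoint format with visibility flags, the sketch below unpacks a COCO-style `keypoints` array, which stores 17 (x, y, v) triples as a flat list of 51 numbers, where v is 0 for not labeled, 1 for labeled but not visible, and 2 for labeled and visible. The function names are hypothetical helpers, not part of any COCO tooling.

```python
from typing import Dict, List, Tuple

# COCO person keypoint names, in annotation order.
COCO_KEYPOINT_NAMES = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]


def parse_coco_keypoints(flat: List[float]) -> Dict[str, Tuple[float, float, int]]:
    """Map each keypoint name to its (x, y, visibility) triple."""
    if len(flat) != 3 * len(COCO_KEYPOINT_NAMES):
        raise ValueError("expected 51 values (17 keypoints x 3)")
    return {
        name: (flat[3 * i], flat[3 * i + 1], int(flat[3 * i + 2]))
        for i, name in enumerate(COCO_KEYPOINT_NAMES)
    }


def visible_names(parsed: Dict[str, Tuple[float, float, int]]) -> List[str]:
    """Names of keypoints flagged as labeled and visible (v == 2)."""
    return [name for name, (_, _, v) in parsed.items() if v == 2]
```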

3

Segment Anything 2 | Model | 59/100

via “multi-object video segmentation with independent prompt-per-object tracking”

Meta's foundation model for visual segmentation.

Unique: Maintains independent memory buffers per tracked object, allowing the same cross-frame attention mechanism to operate on object-specific feature sequences. This design avoids global memory conflicts and enables flexible object-level prompting without requiring a unified object registry.

vs others: More flexible than traditional multi-object tracking (MOT) methods because it doesn't require pre-computed detections or appearance models; instead, it directly propagates semantic masks, handling appearance changes and occlusions through learned attention patterns.

4

YOLOv8 | Repository | 58/100

via “pose estimation with keypoint detection and visualization”

Real-time object detection, segmentation, and pose.

Unique: Implements pose estimation as a native task variant using the same training/inference pipeline as detection, with specialized keypoint loss functions and OKS metrics, enabling pose analysis without separate pose estimation models

vs others: More integrated than standalone pose estimation models (OpenPose, MediaPipe) because pose estimation is native to YOLO, and more flexible than single-person pose estimators because multi-person pose detection is supported
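
The OKS (object keypoint similarity) metric mentioned above can be sketched as follows. This follows the COCO keypoint evaluation definition, with the standard per-keypoint sigmas; how YOLOv8 computes it internally is not stated here, so treat the `oks` function as an illustrative assumption rather than the repository's code.

```python
import math
from typing import List, Tuple

# Per-keypoint falloff constants from the COCO keypoint evaluation.
COCO_SIGMAS = [
    0.026, 0.025, 0.025, 0.035, 0.035, 0.079, 0.079, 0.072, 0.072,
    0.062, 0.062, 0.107, 0.107, 0.087, 0.087, 0.089, 0.089,
]


def oks(
    pred: List[Tuple[float, float]],
    gt: List[Tuple[float, float, int]],
    area: float,
) -> float:
    """Object keypoint similarity between predicted and ground-truth keypoints.

    pred: (x, y) per keypoint; gt: (x, y, visibility) per keypoint;
    area: ground-truth object area in square pixels.
    Only keypoints with visibility > 0 (labeled) contribute.
    """
    total, count = 0.0, 0
    for (px, py), (gx, gy, v), sigma in zip(pred, gt, COCO_SIGMAS):
        if v <= 0:
            continue  # unlabeled keypoints are ignored
        d2 = (px - gx) ** 2 + (py - gy) ** 2
        k2 = (2.0 * sigma) ** 2
        total += math.exp(-d2 / (2.0 * area * k2))
        count += 1
    return total / count if count else 0.0
```

A score of 1.0 means every labeled keypoint was predicted exactly; the score falls off faster for keypoints with small sigmas (eyes, nose) than for those with large ones (hips).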

5

Detectron2 | Repository | 58/100

via “keypoint detection with multi-person pose estimation”

Meta's modular object detection platform on PyTorch.

Unique: Implements keypoint detection by predicting per-keypoint heatmaps on RoI-aligned features for each detected person, which makes multi-person pose estimation possible, unlike single-person estimators that assume one person per image.

vs others: Associates keypoints with a detected person by construction, whereas bottom-up methods such as OpenPose must group keypoints into people after detecting them; can be more efficient than top-down pipelines with separate detection and pose models because keypoint prediction is integrated into the detection pipeline.
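
The idea of reading a keypoint out of a per-RoI heatmap can be sketched in a few lines. This is a simplified illustration, assuming a plain argmax over the heatmap grid mapped back into the person's bounding box; `decode_heatmap` is a hypothetical helper and does not reproduce Detectron2's actual post-processing.

```python
from typing import List, Tuple


def decode_heatmap(
    heatmap: List[List[float]],
    box: Tuple[float, float, float, float],
) -> Tuple[float, float, float]:
    """Return (x, y, score) in image coordinates for one keypoint.

    heatmap: 2D grid of scores covering the RoI.
    box: the RoI as (x0, y0, x1, y1) in image coordinates.
    """
    rows, cols = len(heatmap), len(heatmap[0])
    # Locate the highest-scoring cell in the grid.
    best_r, best_c = max(
        ((r, c) for r in range(rows) for c in range(cols)),
        key=lambda rc: heatmap[rc[0]][rc[1]],
    )
    x0, y0, x1, y1 = box
    # Map the cell center back into the bounding box.
    x = x0 + (best_c + 0.5) * (x1 - x0) / cols
    y = y0 + (best_r + 0.5) * (y1 - y0) / rows
    return x, y, heatmap[best_r][best_c]
```

Because each heatmap belongs to one detected box, the decoded keypoint is tied to that person without a separate grouping step.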

6

Deepseek v4 people | Model | 45/100

via “multi-person tracking”

Deepseek v4 people

Unique: Combines advanced tracking algorithms with real-time processing capabilities, setting it apart from traditional tracking systems that may not handle occlusions effectively.

vs others: More effective in maintaining identity across frames than simpler tracking systems that lose track during occlusions.

7

yolov10s | Model | 42/100

via “video object tracking via frame-by-frame detection with optional temporal smoothing”

Object-detection model. 223,706 downloads.

Unique: YOLOv10's improved detection consistency (lower false positive flicker) across frames compared to YOLOv8 reduces tracking ID switches, making it more suitable for video tracking pipelines without requiring temporal smoothing.

vs others: Simpler than 3D detection models (which require temporal context) for 2D video tracking; more flexible than end-to-end tracking models (which require retraining) since tracking algorithm can be swapped independently.
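
Tracking by frame-by-frame detection, with the tracker kept separate from the detector, can be sketched with a greedy IoU matcher. This is a minimal illustration under stated assumptions: boxes are (x1, y1, x2, y2) tuples from any detector, `GreedyIoUTracker` is a hypothetical class, and unmatched tracks are dropped immediately rather than kept alive through occlusion.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


class GreedyIoUTracker:
    """Assigns persistent IDs by matching each frame's boxes to the previous frame's."""

    def __init__(self, iou_threshold: float = 0.3) -> None:
        self.iou_threshold = iou_threshold
        self.tracks: Dict[int, Box] = {}
        self._next_id = 0

    def update(self, detections: List[Box]) -> Dict[int, Box]:
        # Score every (track, detection) pair and consider the best overlaps first.
        pairs = sorted(
            ((iou(box, det), tid, j)
             for tid, box in self.tracks.items()
             for j, det in enumerate(detections)),
            reverse=True,
        )
        assigned: Dict[int, Box] = {}
        used_dets = set()
        for score, tid, j in pairs:
            if score < self.iou_threshold:
                break  # remaining pairs overlap too little to match
            if tid in assigned or j in used_dets:
                continue
            assigned[tid] = detections[j]
            used_dets.add(j)
        # Detections that matched no existing track start new tracks.
        for j, det in enumerate(detections):
            if j not in used_dets:
                assigned[self._next_id] = det
                self._next_id += 1
        self.tracks = assigned
        return assigned
```

Because `update` only consumes boxes, the matcher can be replaced by a different tracking algorithm without touching or retraining the detector.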

8

DINO-X | MCP Server | 36/100

via “human pose keypoint estimation with 17-point skeletal representation”

Advanced computer vision and object detection MCP server powered by DINO-X, enabling AI agents to analyze images, detect objects, identify keypoints, and perform visual understanding tasks.

Unique: Integrates DINO-X's pose estimation model through MCP, exposing 17-point COCO keypoint format with per-keypoint confidence scores. The architecture allows LLM agents to reason about human pose without requiring separate pose estimation infrastructure.

vs others: Simpler integration than OpenPose or MediaPipe for MCP-based workflows, with unified authentication and transport through the DINO-X platform rather than managing multiple vision libraries.

9

LivePortrait | Web App | 27/100

via “real-time facial landmark detection and tracking”

LivePortrait — AI demo on HuggingFace

Unique: Implements temporal smoothing through a learned motion model rather than post-hoc filtering, reducing jitter while preserving fast expression changes by predicting landmark positions based on optical flow and previous frame history

vs others: Achieves lower latency than MediaPipe for video processing and higher accuracy than traditional Dlib-based methods because it uses modern transformer architectures with temporal context aggregation

10

SadTalker | Web App | 25/100

via “real-time facial landmark detection and tracking”

SadTalker — AI demo on HuggingFace

Unique: Uses a lightweight, pre-trained landmark detector (MediaPipe) that runs efficiently on CPU or GPU, with temporal smoothing via Kalman filtering to reduce jitter. Landmarks are automatically converted to 3D pose estimates using weak-perspective projection, enabling downstream 3D animation tasks.

vs others: Faster and more robust than traditional computer vision approaches (Dlib, OpenFace) because it uses modern deep learning with pre-trained weights, achieving real-time performance on mobile devices while maintaining accuracy.
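
Temporal smoothing with a Kalman filter, as mentioned in this entry, can be illustrated with a generic scalar filter applied to one landmark coordinate. This is a textbook constant-position filter, not SadTalker's code; the class name and the two variance defaults are assumptions chosen for illustration.

```python
from typing import List, Optional


class ScalarKalman:
    """Constant-position Kalman filter for a single coordinate."""

    def __init__(self, process_var: float = 1e-3, measurement_var: float = 1e-1) -> None:
        self.q = process_var      # how much the true value may drift per frame
        self.r = measurement_var  # how noisy each measurement is
        self.x: Optional[float] = None  # current estimate
        self.p = 1.0              # variance of the current estimate

    def update(self, z: float) -> float:
        """Fold in one measurement and return the smoothed estimate."""
        if self.x is None:
            self.x = z  # initialize from the first measurement
            return self.x
        self.p += self.q                    # predict: uncertainty grows
        gain = self.p / (self.p + self.r)   # weight given to the new measurement
        self.x += gain * (z - self.x)       # correct toward the measurement
        self.p *= (1.0 - gain)              # uncertainty shrinks after correction
        return self.x


def smooth(values: List[float]) -> List[float]:
    """Smooth a sequence of per-frame measurements of one coordinate."""
    kf = ScalarKalman()
    return [kf.update(v) for v in values]
```

In practice one filter would be kept per coordinate of each landmark; raising `process_var` makes the output follow fast motion more closely at the cost of more residual jitter.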

11

FacePoke_CLONE-THIS-REPO-TO-USE-IT | Web App | 23/100

via “facial landmark detection and tracking”

FacePoke_CLONE-THIS-REPO-TO-USE-IT — AI demo on HuggingFace

Unique: Integrates landmark detection directly into the HuggingFace Spaces inference pipeline, leveraging Gradio's built-in video input handling and model caching to avoid redundant model loads across requests

vs others: More accessible than raw OpenCV/dlib implementations because it abstracts model loading and preprocessing; faster iteration than building custom PyTorch models because it uses pre-trained weights from HuggingFace Model Hub

12

Movmi | Web App

via “multi-person skeletal tracking and pose detection in single video”

Unique: Automatically detects and separates multiple people in a single video without manual per-person segmentation, enabling efficient capture of group scenes and interactions; outputs distinct FBX files per person, allowing independent character animation and reuse in different contexts

vs others: More efficient than filming each character separately and manually synchronizing animations; more accessible than professional mocap studios which require controlled environments and marker placement on each actor; more flexible than pose libraries which are limited to single-character poses

13

PoseTracker API | API

via “real-time single-person skeletal pose estimation from video stream”

Unique: Hardware-agnostic approach eliminates dependency on OptiTrack, Vicon, or Kinect systems by running inference on standard webcams; freemium tier removes upfront hardware investment barrier that traditionally gates motion capture access to well-funded studios

vs others: Dramatically cheaper deployment than traditional mocap (no marker suits, cameras, or calibration) but lacks the sub-millimeter accuracy and multi-person tracking of enterprise systems like OptiTrack

14

DeepMotion | Product

via “body-pose-estimation-from-video”

15

Move AI | Product

via “markerless body pose estimation”

16

QuickMagic | Product

via “real-time human pose estimation from video”

17

Rokoko Video | Product

via “video-to-skeleton-tracking”

18

Plask | Product

via “ai-pose-estimation-and-joint-tracking”

19

Meshcapade | Product

via “multi-person tracking in group footage”

20

Voxel51 | Product

via “real-time video object detection and tracking”
