https://github.com/google/sentencepiece

Topics: natural-language-processing, neural-machine-translation, word-segmentation

Last synced: about 1 month ago

Repository metadata:

Unsupervised text tokenizer for Neural Network-based text generation.


Committers metadata

Last synced: about 1 month ago

Total Commits: 751
Total Committers: 93
Avg Commits per committer: 8.075
Development Distribution Score (DDS): 0.463

Commits in past year: 68
Committers in past year: 13
Avg Commits per committer in past year: 5.231
Development Distribution Score (DDS) in past year: 0.647

| Name | Email | Commits |
| --- | --- | --- |
| Taku Kudo | t****u@g****m | 403 |
| Taku Kudo | t****0 | 187 |
| dependabot[bot] | 4****] | 19 |
| resec | r****9@g****m | 10 |
| Pedro Kaj Kjellerup Nacht | p****t@g****m | 6 |
| Tetsuo Kiso | t****9@g****m | 4 |
| laurentsimon | l****n@g****m | 3 |
| Kentaro Hayashi | h****i@c****m | 3 |
| Dr. Christoph Mittendorf | 3****s | 3 |
| Darío Hereñú | m****a@g****m | 3 |
| Yasuhiro Matsumoto | m****p@g****m | 3 |
| TSUCHIYA Masatoshi | t****a@t****p | 3 |
| Nagico | n****o@q****m | 3 |
| Matthew Mistele | m****e@s****u | 3 |
| Kentaro Hayashi | k****s@g****m | 3 |
| Kashif Rasul | k****l@z****e | 3 |
| H. Vetinari | h****i@g****m | 3 |
| stephantul | s****l@g****m | 2 |
| mark | e****r@g****m | 2 |
| Michal Fojtak | m****k@s****z | 2 |
| Lee Dongjin | d****n@a****g | 2 |
| Julius Frost | j****t@g****m | 2 |
| Jaepil Jeong | z****h@g****m | 2 |
| Graham Neubig | n****g@g****m | 2 |
| Christopher Hong | c****4@g****m | 2 |
| Ryan Schmidt | g****t@r****m | 2 |
| Julius Frost | 3****t | 2 |
| Guillaume Klein | g****n | 2 |
| Aleksey Morozov | 3****v | 2 |
| A2va | 4****a | 2 |
and 63 more...

Issue and Pull Request metadata

Last synced: about 1 month ago


Package metadata

pypi: sentencepiece

SentencePiece python wrapper

  • Homepage: https://github.com/google/sentencepiece
  • Documentation: https://sentencepiece.readthedocs.io/
  • Licenses: Apache
  • Latest release: 0.2.0 (published 10 months ago)
  • Last Synced: 2024-11-10T23:35:46.713Z (about 1 month ago)
  • Versions: 32
  • Dependent Packages: 802
  • Dependent Repositories: 18,074
  • Downloads: 22,525,768 Last month
  • Docker Downloads: 26,551,887
  • Rankings:
    • Dependent packages count: 0.039%
    • Dependent repos count: 0.059%
    • Downloads: 0.074%
    • Average: 0.509%
    • Stargazers count: 0.574%
    • Docker downloads count: 0.655%
    • Forks count: 1.652%
  • Maintainers (1)
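
The PyPI wrapper above exposes the trained-model encode/decode API directly from Python. A minimal round-trip sketch, assuming a previously trained model file `m.model` (the path and sample text are placeholders):

```python
import sentencepiece as spm

# Load a trained SentencePiece model (m.model is a placeholder path).
sp = spm.SentencePieceProcessor(model_file="m.model")

# Encode raw text into subword pieces or ids; no pre-tokenization is required.
pieces = sp.encode("This is a test.", out_type=str)
ids = sp.encode("This is a test.", out_type=int)

# Decoding restores the original text, including whitespace.
print(pieces)          # e.g. ['▁This', '▁is', '▁a', '▁test', '.']
print(sp.decode(ids))  # 'This is a test.'
```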
pypi: tf-sentencepiece

SentencePiece Encode/Decode ops for TensorFlow

  • Homepage: https://github.com/google/sentencepiece
  • Documentation: https://tf-sentencepiece.readthedocs.io/
  • Licenses: Apache
  • Latest release: 0.1.92 (published over 4 years ago)
  • Last Synced: 2024-11-10T23:35:49.881Z (about 1 month ago)
  • Versions: 15
  • Dependent Packages: 1
  • Dependent Repositories: 30
  • Downloads: 5,643 Last month
  • Docker Downloads: 31
  • Rankings:
    • Stargazers count: 0.292%
    • Forks count: 1.255%
    • Average: 2.378%
    • Dependent repos count: 2.678%
    • Dependent packages count: 3.24%
    • Docker downloads count: 3.331%
    • Downloads: 3.469%
  • Maintainers (1)
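
tf-sentencepiece shipped the encode/decode ops as TensorFlow graph operations, but it has not been released since 0.1.92; the equivalent functionality is now typically accessed through TensorFlow Text's SentencepieceTokenizer. A rough sketch against that successor API, assuming `tensorflow_text` is installed and `m.model` is a placeholder model path:

```python
import tensorflow as tf
import tensorflow_text as tf_text

# The tokenizer takes a serialized model proto rather than a file path.
model_proto = tf.io.gfile.GFile("m.model", "rb").read()  # placeholder path
tokenizer = tf_text.SentencepieceTokenizer(model=model_proto, out_type=tf.int32)

ids = tokenizer.tokenize(["Hello world.", "This is a test."])  # RaggedTensor of ids
text = tokenizer.detokenize(ids)                               # tensor of strings
print(ids)
print(text)
```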
go: github.com/google/sentencepiece

  • Homepage:
  • Documentation: https://pkg.go.dev/github.com/google/sentencepiece#section-documentation
  • Licenses: apache-2.0
  • Latest release: v0.2.0 (published 10 months ago)
  • Last Synced: 2024-11-10T23:36:11.977Z (about 1 month ago)
  • Versions: 23
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Rankings:
    • Stargazers count: 0.792%
    • Forks count: 0.898%
    • Average: 3.717%
    • Dependent repos count: 4.802%
    • Dependent packages count: 8.376%
conda: sentencepiece

SentencePiece is an unsupervised text tokenizer and detokenizer mainly for Neural Network-based text generation systems where the vocabulary size is predetermined prior to the neural model training. SentencePiece implements subword units (e.g., byte-pair-encoding (BPE) [[Sennrich et al.](http://www.aclweb.org/anthology/P16-1162)] and unigram language model [[Kudo](https://arxiv.org/abs/1804.10959)]) with the extension of direct training from raw sentences. SentencePiece allows us to make a purely end-to-end system that does not depend on language-specific pre/postprocessing.

  • Homepage: https://github.com/google/sentencepiece/
  • Licenses: Apache-2.0
  • Latest release: 0.1.96 (published almost 3 years ago)
  • Last Synced: 2024-11-10T23:37:23.222Z (about 1 month ago)
  • Versions: 4
  • Dependent Packages: 14
  • Dependent Repositories: 25
  • Rankings:
    • Stargazers count: 4.073%
    • Dependent packages count: 4.472%
    • Forks count: 5.081%
    • Average: 5.217%
    • Dependent repos count: 7.242%
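
The description above notes that SentencePiece trains subword models (BPE or unigram) directly from raw sentences. A minimal training sketch with the Python package, where the corpus file, model prefix, vocabulary size, and model type are illustrative values:

```python
import sentencepiece as spm

# Train directly from raw, untokenized text; one sentence per line.
# model_type can be "unigram" (the default) or "bpe"; all values here are placeholders.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="m",
    vocab_size=8000,
    model_type="unigram",
)
# Produces m.model and m.vocab, usable with the processor shown earlier.
```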
spack: sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation. This is the C++ package.

  • Homepage: https://github.com/google/sentencepiece
  • Licenses: (none listed)
  • Latest release: 0.1.91 (published over 2 years ago)
  • Last Synced: 2024-10-29T08:31:17.574Z (about 2 months ago)
  • Versions: 2
  • Dependent Packages: 1
  • Dependent Repositories: 0
  • Rankings:
    • Dependent repos count: 0.0%
    • Stargazers count: 1.584%
    • Forks count: 3.029%
    • Average: 8.17%
    • Dependent packages count: 28.067%
  • Maintainers (1)
homebrew: sentencepiece

Unsupervised text tokenizer and detokenizer

  • Homepage: https://github.com/google/sentencepiece
  • Licenses: Apache-2.0
  • Latest release: 0.2.0 (published 10 months ago)
  • Last Synced: 2024-11-10T23:35:55.658Z (about 1 month ago)
  • Versions: 5
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 82 Last month
  • Rankings:
    • Forks count: 2.538%
    • Stargazers count: 3.08%
    • Dependent packages count: 18.349%
    • Average: 23.074%
    • Dependent repos count: 29.299%
    • Downloads: 62.104%
conda: sentencepiece

SentencePiece is an unsupervised text tokenizer and detokenizer mainly for Neural Network-based text generation systems where the vocabulary size is predetermined prior to the neural model training. SentencePiece implements subword units (e.g., byte-pair-encoding (BPE) [[Sennrich et al.](http://www.aclweb.org/anthology/P16-1162)] and unigram language model [[Kudo](https://arxiv.org/abs/1804.10959)]) with the extension of direct training from raw sentences. SentencePiece allows us to make a purely end-to-end system that does not depend on language-specific pre/postprocessing.

  • Homepage: https://github.com/google/sentencepiece/
  • Licenses: Apache-2.0
  • Latest release: 0.1.99 (published over 1 year ago)
  • Last Synced: 2024-10-29T09:44:02.846Z (about 2 months ago)
  • Versions: 3
  • Dependent Packages: 2
  • Dependent Repositories: 25
  • Rankings:
    • Stargazers count: 9.813%
    • Forks count: 11.469%
    • Average: 23.076%
    • Dependent repos count: 30.083%
    • Dependent packages count: 40.938%
conda: libsentencepiece

SentencePiece is an unsupervised text tokenizer and detokenizer mainly for Neural Network-based text generation systems where the vocabulary size is predetermined prior to the neural model training. SentencePiece implements subword units (e.g., byte-pair-encoding (BPE) [[Sennrich et al.](http://www.aclweb.org/anthology/P16-1162)] and unigram language model [[Kudo](https://arxiv.org/abs/1804.10959)]) with the extension of direct training from raw sentences. SentencePiece allows us to make a purely end-to-end system that does not depend on language-specific pre/postprocessing.

  • Homepage: https://github.com/google/sentencepiece/
  • Licenses: Apache-2.0
  • Latest release:
  • Last Synced: 2024-11-10T23:36:16.883Z (about 1 month ago)
  • Versions: 1
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Rankings:
    • Dependent packages count: 51.245%
    • Average: 53.792%
    • Dependent repos count: 56.338%
conda: sentencepiece-spm

SentencePiece is an unsupervised text tokenizer and detokenizer mainly for Neural Network-based text generation systems where the vocabulary size is predetermined prior to the neural model training. SentencePiece implements subword units (e.g., byte-pair-encoding (BPE) [[Sennrich et al.](http://www.aclweb.org/anthology/P16-1162)] and unigram language model [[Kudo](https://arxiv.org/abs/1804.10959)]) with the extension of direct training from raw sentences. SentencePiece allows us to make a purely end-to-end system that does not depend on language-specific pre/postprocessing.

  • Homepage: https://github.com/google/sentencepiece/
  • Licenses: Apache-2.0
  • Latest release:
  • Last Synced: 2024-10-29T09:45:11.090Z (about 2 months ago)
  • Versions: 1
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Rankings:
    • Dependent packages count: 51.245%
    • Average: 53.792%
    • Dependent repos count: 56.338%
conda: sentencepiece-python

SentencePiece is an unsupervised text tokenizer and detokenizer mainly for Neural Network-based text generation systems where the vocabulary size is predetermined prior to the neural model training. SentencePiece implements subword units (e.g., byte-pair-encoding (BPE) [[Sennrich et al.](http://www.aclweb.org/anthology/P16-1162)] and unigram language model [[Kudo](https://arxiv.org/abs/1804.10959)]) with the extension of direct training from raw sentences. SentencePiece allows us to make a purely end-to-end system that does not depend on language-specific pre/postprocessing.

  • Homepage: https://github.com/google/sentencepiece/
  • Licenses: Apache-2.0
  • Latest release:
  • Last Synced: 2024-11-10T23:36:25.944Z (about 1 month ago)
  • Versions: 1
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Rankings:
    • Dependent packages count: 51.245%
    • Average: 53.792%
    • Dependent repos count: 56.338%