https://github.com/google/sentencepiece
natural-language-processing neural-machine-translation word-segmentation
Last synced: about 1 month ago
Repository metadata:
Unsupervised text tokenizer for Neural Network-based text generation.
- Host: GitHub
- URL: https://github.com/google/sentencepiece
- Owner: google
- License: apache-2.0
- Created: 2017-03-07T10:03:48.000Z (almost 8 years ago)
- Default Branch: master
- Last Pushed: 2024-11-01T15:36:25.000Z (about 2 months ago)
- Last Synced: 2024-11-05T21:03:12.495Z (about 2 months ago)
- Topics: natural-language-processing, neural-machine-translation, word-segmentation
- Language: C++
- Homepage:
- Size: 24.5 MB
- Stars: 10,254
- Watchers: 127
- Forks: 1,172
- Open Issues: 37
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
Owner metadata:
- Name: Google
- Login: google
- Email: [email protected]
- Kind: organization
- Description: Google ❤️ Open Source
- Website: https://opensource.google/
- Location:
- Twitter: GoogleOSS
- Company:
- Icon url: https://avatars.githubusercontent.com/u/1342004?v=4
- Repositories: 2445
- Last Synced at: 2023-04-09T05:37:45.829Z
- Profile URL: https://github.com/google
- Sponsor URL:
Committers metadata:
Last synced: about 1 month ago
Total Commits: 751
Total Committers: 93
Avg Commits per committer: 8.075
Development Distribution Score (DDS): 0.463
Commits in past year: 68
Committers in past year: 13
Avg Commits per committer in past year: 5.231
Development Distribution Score (DDS) in past year: 0.647
Name | Email | Commits |
---|---|---|
Taku Kudo | t****u@g****m | 403 |
Taku Kudo | t****0 | 187 |
dependabot[bot] | 4****] | 19 |
resec | r****9@g****m | 10 |
Pedro Kaj Kjellerup Nacht | p****t@g****m | 6 |
Tetsuo Kiso | t****9@g****m | 4 |
laurentsimon | l****n@g****m | 3 |
Kentaro Hayashi | h****i@c****m | 3 |
Dr. Christoph Mittendorf | 3****s | 3 |
Darío Hereñú | m****a@g****m | 3 |
Yasuhiro Matsumoto | m****p@g****m | 3 |
TSUCHIYA Masatoshi | t****a@t****p | 3 |
Nagico | n****o@q****m | 3 |
Matthew Mistele | m****e@s****u | 3 |
Kentaro Hayashi | k****s@g****m | 3 |
Kashif Rasul | k****l@z****e | 3 |
H. Vetinari | h****i@g****m | 3 |
stephantul | s****l@g****m | 2 |
mark | e****r@g****m | 2 |
Michal Fojtak | m****k@s****z | 2 |
Lee Dongjin | d****n@a****g | 2 |
Julius Frost | j****t@g****m | 2 |
Jaepil Jeong | z****h@g****m | 2 |
Graham Neubig | n****g@g****m | 2 |
Christopher Hong | c****4@g****m | 2 |
Ryan Schmidt | g****t@r****m | 2 |
Julius Frost | 3****t | 2 |
Guillaume Klein | g****n | 2 |
Aleksey Morozov | 3****v | 2 |
A2va | 4****a | 2 |
and 63 more... |
Issue and Pull Request metadata:
Last synced: about 1 month ago
Package metadata:
- Total packages: 10
Total downloads:
- homebrew: 82 last month
- pypi: 22,531,411 last month
- Total docker downloads: 26,551,918
- Total dependent packages: 820 (may contain duplicates)
- Total dependent repositories: 18,156 (may contain duplicates)
- Total versions: 87
- Total maintainers: 2
pypi: sentencepiece
SentencePiece python wrapper (an encode/decode sketch follows this entry)
- Homepage: https://github.com/google/sentencepiece
- Documentation: https://sentencepiece.readthedocs.io/
- Licenses: Apache
- Latest release: 0.2.0 (published 10 months ago)
- Last Synced: 2024-11-10T23:35:46.713Z (about 1 month ago)
- Versions: 32
- Dependent Packages: 802
- Dependent Repositories: 18,074
- Downloads: 22,525,768 Last month
- Docker Downloads: 26,551,887
Rankings:
- Dependent packages count: 0.039%
- Dependent repos count: 0.059%
- Downloads: 0.074%
- Average: 0.509%
- Stargazers count: 0.574%
- Docker downloads count: 0.655%
- Forks count: 1.652%
- Maintainers (1)
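As a rough illustration of what the `sentencepiece` Python wrapper listed above provides, here is a minimal encode/decode sketch. It assumes an already-trained model file; the path `m.model` and the sample sentence are hypothetical, not part of the package metadata.

```python
import sentencepiece as spm

# Load a previously trained SentencePiece model (path is hypothetical).
sp = spm.SentencePieceProcessor(model_file="m.model")

# Encode raw text into subword pieces and into integer ids.
pieces = sp.encode("This is a test.", out_type=str)
ids = sp.encode("This is a test.", out_type=int)

# Decoding the ids restores the original text, whitespace included.
print(pieces)
print(ids)
print(sp.decode(ids))
```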
pypi: tf-sentencepiece
SentencePiece Encode/Decode ops for TensorFlow
- Homepage: https://github.com/google/sentencepiece
- Documentation: https://tf-sentencepiece.readthedocs.io/
- Licenses: Apache
- Latest release: 0.1.92 (published over 4 years ago)
- Last Synced: 2024-11-10T23:35:49.881Z (about 1 month ago)
- Versions: 15
- Dependent Packages: 1
- Dependent Repositories: 30
- Downloads: 5,643 Last month
- Docker Downloads: 31
Rankings:
- Stargazers count: 0.292%
- Forks count: 1.255%
- Average: 2.378%
- Dependent repos count: 2.678%
- Dependent packages count: 3.24%
- Docker downloads count: 3.331%
- Downloads: 3.469%
- Maintainers (1)
go: github.com/google/sentencepiece
- Homepage:
- Documentation: https://pkg.go.dev/github.com/google/sentencepiece#section-documentation
- Licenses: apache-2.0
- Latest release: v0.2.0 (published 10 months ago)
- Last Synced: 2024-11-10T23:36:11.977Z (about 1 month ago)
- Versions: 23
- Dependent Packages: 0
- Dependent Repositories: 1
Rankings:
- Stargazers count: 0.792%
- Forks count: 0.898%
- Average: 3.717%
- Dependent repos count: 4.802%
- Dependent packages count: 8.376%
conda: sentencepiece
SentencePiece is an unsupervised text tokenizer and detokenizer mainly for Neural Network-based text generation systems where the vocabulary size is predetermined prior to the neural model training. SentencePiece implements subword units (e.g., byte-pair-encoding (BPE) [[Sennrich et al.](http://www.aclweb.org/anthology/P16-1162)] and unigram language model [[Kudo](https://arxiv.org/abs/1804.10959)]) with the extension of direct training from raw sentences. SentencePiece allows us to make a purely end-to-end system that does not depend on language-specific pre/postprocessing. A brief training sketch follows this entry.
- Homepage: https://github.com/google/sentencepiece/
- Licenses: Apache-2.0
- Latest release: 0.1.96 (published almost 3 years ago)
- Last Synced: 2024-11-10T23:37:23.222Z (about 1 month ago)
- Versions: 4
- Dependent Packages: 14
- Dependent Repositories: 25
Rankings:
- Stargazers count: 4.073%
- Dependent packages count: 4.472%
- Forks count: 5.081%
- Average: 5.217%
- Dependent repos count: 7.242%
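To make the description in this conda entry concrete, here is a hedged sketch of training a subword model directly from raw sentences with the Python API and then segmenting text with it. The corpus file, model prefix, and vocabulary size are illustrative values, not taken from the package metadata; `model_type` may be `"unigram"` (the default) or `"bpe"`, matching the two algorithms named in the description.

```python
import sentencepiece as spm

# Train directly from raw sentences: corpus.txt holds one sentence per line
# (hypothetical file). No language-specific pre-tokenization is required.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="m",
    vocab_size=8000,
    model_type="unigram",  # or "bpe"
)

# The trainer writes m.model and m.vocab; load the model and segment a sentence.
sp = spm.SentencePieceProcessor(model_file="m.model")
print(sp.encode("Hello world.", out_type=str))
```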
spack: sentencepiece
Unsupervised text tokenizer for Neural Network-based text generation. This is the C++ package.
- Homepage: https://github.com/google/sentencepiece
- Licenses: []
- Latest release: 0.1.91 (published over 2 years ago)
- Last Synced: 2024-10-29T08:31:17.574Z (about 2 months ago)
- Versions: 2
- Dependent Packages: 1
- Dependent Repositories: 0
Rankings:
- Dependent repos count: 0.0%
- Stargazers count: 1.584%
- Forks count: 3.029%
- Average: 8.17%
- Dependent packages count: 28.067%
- Maintainers (1)
homebrew: sentencepiece
Unsupervised text tokenizer and detokenizer
- Homepage: https://github.com/google/sentencepiece
- Licenses: Apache-2.0
- Latest release: 0.2.0 (published 10 months ago)
- Last Synced: 2024-11-10T23:35:55.658Z (about 1 month ago)
- Versions: 5
- Dependent Packages: 0
- Dependent Repositories: 1
- Downloads: 82 Last month
Rankings:
- Forks count: 2.538%
- Stargazers count: 3.08%
- Dependent packages count: 18.349%
- Average: 23.074%
- Dependent repos count: 29.299%
- Downloads: 62.104%
conda: sentencepiece
SentencePiece is an unsupervised text tokenizer and detokenizer mainly for Neural Network-based text generation systems where the vocabulary size is predetermined prior to the neural model training. SentencePiece implements subword units (e.g., byte-pair-encoding (BPE) [[Sennrich et al.](http://www.aclweb.org/anthology/P16-1162)] and unigram language model [[Kudo](https://arxiv.org/abs/1804.10959)]) with the extension of direct training from raw sentences. SentencePiece allows us to make a purely end-to-end system that does not depend on language-specific pre/postprocessing.
- Homepage: https://github.com/google/sentencepiece/
- Licenses: Apache-2.0
- Latest release: 0.1.99 (published over 1 year ago)
- Last Synced: 2024-10-29T09:44:02.846Z (about 2 months ago)
- Versions: 3
- Dependent Packages: 2
- Dependent Repositories: 25
Rankings:
- Stargazers count: 9.813%
- Forks count: 11.469%
- Average: 23.076%
- Dependent repos count: 30.083%
- Dependent packages count: 40.938%
conda: libsentencepiece
SentencePiece is an unsupervised text tokenizer and detokenizer mainly for Neural Network-based text generation systems where the vocabulary size is predetermined prior to the neural model training. SentencePiece implements subword units (e.g., byte-pair-encoding (BPE) [[Sennrich et al.](http://www.aclweb.org/anthology/P16-1162)] and unigram language model [[Kudo](https://arxiv.org/abs/1804.10959)]) with the extension of direct training from raw sentences. SentencePiece allows us to make a purely end-to-end system that does not depend on language-specific pre/postprocessing.
conda: sentencepiece-spm
SentencePiece is an unsupervised text tokenizer and detokenizer mainly for Neural Network-based text generation systems where the vocabulary size is predetermined prior to the neural model training. SentencePiece implements subword units (e.g., byte-pair-encoding (BPE) [[Sennrich et al.](http://www.aclweb.org/anthology/P16-1162)] and unigram language model [[Kudo](https://arxiv.org/abs/1804.10959)]) with the extension of direct training from raw sentences. SentencePiece allows us to make a purely end-to-end system that does not depend on language-specific pre/postprocessing.
conda: sentencepiece-python
SentencePiece is an unsupervised text tokenizer and detokenizer mainly for Neural Network-based text generation systems where the vocabulary size is predetermined prior to the neural model training. SentencePiece implements subword units (e.g., byte-pair-encoding (BPE) [[Sennrich et al.](http://www.aclweb.org/anthology/P16-1162)]) and unigram language model [[Kudo](https://arxiv.org/abs/1804.109590)]) with the extension of direct training from raw sentences. SentencePiece allows us to make a purely end-to-end system that does not depend on language-specific pre/postprocessing.