hscspring/bytepiece-rs

bytepiece

Implementation of Su's bytepiece.

Bytepiece is a new tokenization method that treats UTF-8 bytes as the unigrams when processing text. It needs little preprocessing, which makes it purer and more language-independent.
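To make the byte-level idea concrete, here is a minimal sketch using only the Python standard library. It does not call bytepiece or this crate; the helper names `to_byte_units` and `from_byte_units` are hypothetical and only illustrate that any text, in any language, reduces to integers in the range 0 to 255 and can be recovered losslessly.

```python
def to_byte_units(text: str) -> list[int]:
    """Return the UTF-8 bytes of `text` as a list of integers (0..255).

    Hypothetical helper for illustration; not part of rs_bytepiece.
    """
    return list(text.encode("utf-8"))


def from_byte_units(units: list[int]) -> str:
    """Rebuild the original string from its UTF-8 byte values."""
    return bytes(units).decode("utf-8")


if __name__ == "__main__":
    text = "今天天气不错"
    units = to_byte_units(text)
    # Each of these CJK characters occupies 3 bytes in UTF-8.
    print(len(text), len(units))
    # The byte sequence round-trips back to the same text.
    print(from_byte_units(units) == text)
```

Because the base alphabet is just the 256 possible byte values, no language-specific normalization or word segmentation is needed before tokenization; a tokenizer built on top of this merges frequent byte sequences into larger pieces.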

Bindings

Quick Example using Python

from rs_bytepiece import Tokenizer

tokenizer = Tokenizer()
output = tokenizer.encode("今天天气不错")
print(output)
# [40496, 45268, 39432]