---
license: cc0-1.0
datasets:
- go_emotions
pipeline_tag: sentence-similarity
---

### Model Description

Machine learning models such as [tensorflow-compress](https://www.mattmahoney.net/dc/text.html), which uses an LSTM to compress text, achieve remarkable compression ratios with very little code to maintain.  
This model was trained with *dynamic sapient technology*: a SentencePiece unigram tokenizer fitted on the [go_emotions](https://huggingface.co/datasets/go_emotions) dataset. It compresses these bit strings much better than run-length encoding (RLE).  
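For a concrete point of comparison, here is a minimal RLE baseline. This is a naive scheme assumed for illustration, not the author's benchmark: it stores one byte per run of identical bits.

```py
from itertools import groupby

def rle_size_bits(bit_text):
	# Runs of 0s and 1s alternate, so storing each run's length is enough.
	# One byte per run; runs longer than 255 bits are split into chunks.
	return 8 * sum(-(-len(list(run)) // 255) for _, run in groupby(bit_text))

print(rle_size_bits("0001000000100000"))  # 5 runs -> 40 bits for a 16-bit input
```

Applied to the 384-bit demo string below (13 isolated ones, hence 27 runs), this baseline needs 27 bytes, while the tokenizer gets by with 13 byte-sized ids.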

- **Developed by:** Ziv Arin
- **Model type:** Sentence similarity lossless compression
- **License:** CC0-1.0

### Demo

Example bitarray (384-bit): 000000000000000000000010000000000000000000000000000000100010010000000000000000000000000000000000000100000000000000000000000000000000000000000000000000000000000000000000000001000000000000000000000000000000000000000000000000000000000000000000100000000000000000000000000100000000000000000000000000100000000000000000000000000000000000000100000000000001000000000000000000000000001000001000  
Compressed (208-bit): 1ab2ed09d7a9617206894e0608 (45.83% space-saving efficiency)  
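
The efficiency figure follows directly from the two sizes, assuming the 208-bit count treats the 26-character hex string as one byte per character:

```py
original_bits = 384
compressed_bits = 26 * 8  # 26 hex characters counted as one byte each = 208 bits
print(f"space saving: {1 - compressed_bits / original_bits:.2%}")  # 45.83%
```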

[The notebook](https://huggingface.co/baiango/384_bit_comp/blob/main/384_bit_comp.ipynb):
```py
import sentencepiece as spm
import numpy as np
from collections import Counter

bpe_processor = spm.SentencePieceProcessor(model_file='384_bit_comp.model')

def encode_id(bit_text):
	# Tokenize the bit string, then shift ids down by 3 to skip SentencePiece's
	# default special tokens (<unk>, <s>, </s>) so every piece id fits in one byte.
	encoded_pieces = bpe_processor.encode_as_pieces(bit_text)
	encoded_ids = [bpe_processor.piece_to_id(s) - 3 for s in encoded_pieces]
	assert all(id_ <= 255 for id_ in encoded_ids)  # every id must fit in a single byte
	string_ids = "".join([format(id_, "02x") for id_ in encoded_ids])  # two hex chars per id
	return string_ids

def decode_id(hex_string):
	# Reverse the encoding: read the hex string as bytes, undo the -3 shift,
	# and map each id back to its piece.
	u8_array = np.frombuffer(bytes.fromhex(hex_string), dtype='<u1') + 3
	encoded_tokens = [bpe_processor.id_to_piece(int(id_)) for id_ in u8_array]
	return encoded_tokens

# Encode text
new_sentence = "000000000000000000000010000000000000000000000000000000100010010000000000000000000000000000000000000100000000000000000000000000000000000000000000000000000000000000000000000001000000000000000000000000000000000000000000000000000000000000000000100000000000000000000000000100000000000000000000000000100000000000000000000000000000000000000100000000000001000000000000000000000000001000001000"
encoded_tokens = bpe_processor.encode_as_pieces(new_sentence)
encoded_ids = encode_id(new_sentence)
decoded_tokens = decode_id(encoded_ids)

print("length:", len(encoded_tokens))
print("encoded_tokens:", encoded_tokens)
print("encoded_ids:", encoded_ids)
print("same?:", encoded_tokens == decoded_tokens)

count = Counter(encoded_tokens)
print("count:", count)
```
Output:
```
length: 13
encoded_tokens: ['▁0000000', '0000000000000001000000000000000000000', '00000000001000100', '1000000', '00000000000000000000000000000001000000000000000000000000000000000000000000000000000000', '00000000000000000001000000000000000000000000000000000', '0000000000000000000000000000000001000', '00000000000000000000000100000000000000000', '00000000010', '0000000000000000000000000000000000000100', '00000000000100000000000000000', '00000000010', '00001000']
encoded_ids: 1ab2ed09d7a9617206894e0608
same?: True
count: Counter({'00000000010': 2, '▁0000000': 1, '0000000000000001000000000000000000000': 1, '00000000001000100': 1, '1000000': 1, '00000000000000000000000000000001000000000000000000000000000000000000000000000000000000': 1, '00000000000000000001000000000000000000000000000000000': 1, '0000000000000000000000000000000001000': 1, '00000000000000000000000100000000000000000': 1, '0000000000000000000000000000000000000100': 1, '00000000000100000000000000000': 1, '00001000': 1})
```
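
Since the compression is lossless, the original bit string can be recovered by joining the decoded pieces and stripping SentencePiece's `▁` word-start marker. Continuing from the snippet above:

```py
# Rebuild the bit string from the decoded pieces; '▁' only marks the start of the text.
restored = "".join(decoded_tokens).replace("▁", "")
print("round trip ok?:", restored == new_sentence)  # expected: True
```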

## Bias, Risks, and Limitations

It doesn't have any sentient bias, only algorithmic bias; don't worry, it's not a living thing.  
The model does not compress strings with fewer zeros (i.e., denser bit patterns) as well.  
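
One way to probe this limitation, reusing the demo objects above, is to tokenize a dense random bit string of the same length and compare the piece count with the 13 pieces of the sparse example (exact numbers will vary with the seed):

```py
import random

random.seed(0)
dense = "".join(random.choice("01") for _ in range(384))  # roughly half ones
dense_pieces = bpe_processor.encode_as_pieces(dense)
print("sparse pieces:", len(encoded_tokens))  # 13 for the demo string
print("dense pieces: ", len(dense_pieces))    # expect far more, so little or no saving
```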

## Environmental Impact
- **Hardware Type:** Intel Core i5-9300H
- **Hours used:** 3