diff --git "a/Pandas-profile-report-of-the-dataset.html" "b/Pandas-profile-report-of-the-dataset.html"
new file mode 100644
--- /dev/null
+++ "b/Pandas-profile-report-of-the-dataset.html"
@@ -0,0 +1,7354 @@
+Formosan Dataset

Overview

Dataset statistics

Number of variables: 6
Number of observations: 139023
Missing cells: 0
Missing cells (%): 0.0%
Duplicate rows: 0
Duplicate rows (%): 0.0%
Total size in memory: 3.6 MiB
Average record size in memory: 27.0 B

Variable types

Categorical: 5
Numeric: 1

Warnings

Ab has a high cardinality: 138472 distinct values (High cardinality)
Ch has a high cardinality: 118291 distinct values (High cardinality)
Lang_Ch is highly correlated with Lang_En (High correlation)
Lang_En is highly correlated with Lang_Ch (High correlation)
Ab is uniformly distributed (Uniform)

Reproduction

Analysis started: 2021-05-08 07:18:30.970771
Analysis finished: 2021-05-08 07:18:45.182861
Duration: 14.21 seconds
Software version: pandas-profiling v2.12.0
Download configuration: config.yaml

Variables

Lang_En
Categorical

HIGH CORRELATION

Distinct: 16
Distinct (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 136.6 KiB
Rukai
15036 
Bunun
13382 
Atayal
11289 
Puyuma
10359 
Amis
9978 
Other values (11)
78979 

Length

Max length: 10
Median length: 6
Mean length: 5.783891874
Min length: 4

Characters and Unicode

Total characters: 804094
Distinct characters: 27
Distinct categories: 2
Distinct scripts: 1
Distinct blocks: 1
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.
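As a rough illustration of how the "Most occurring categories" tables can be derived, the sketch below counts Unicode general categories with Python's standard `unicodedata` module. The helper name `category_counts` and the two sample strings are assumptions for illustration only; this is not the report's own implementation, and the standard library offers no direct lookup for scripts or blocks, so only categories are shown.

```python
import unicodedata
from collections import Counter


def category_counts(values):
    """Count Unicode general categories over every character in the given strings."""
    return Counter(unicodedata.category(ch) for value in values for ch in value)


# Hypothetical two-value sample: one Lang_En value and one Lang_Ch value.
sample = ["Rukai", "魯凱_霧台"]
counts = category_counts(sample)

# Two-letter codes map to the names used in the report:
# 'Lu' = Uppercase Letter, 'Ll' = Lowercase Letter,
# 'Lo' = Other Letter,     'Pc' = Connector Punctuation
for code, count in sorted(counts.items()):
    print(code, count)
```

Applied to a whole column, the same counting would give totals comparable to the "Lowercase Letter" and "Uppercase Letter" rows reported for Lang_En.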

Unique

Unique: 0
Unique (%): 0.0%

Sample

1st rowSakizaya
2nd rowSakizaya
3rd rowSakizaya
4th rowSakizaya
5th rowSakizaya
ValueCountFrequency (%)
Rukai15036
10.8%
Bunun13382
 
9.6%
Atayal11289
 
8.1%
Puyuma10359
 
7.5%
Amis9978
 
7.2%
Kavalan9444
 
6.8%
Thao8777
 
6.3%
Seediq8025
 
5.8%
Paiwan8009
 
5.8%
Yami7867
 
5.7%
Other values (6)36857
26.5%
Histogram of lengths of the category
ValueCountFrequency (%)
rukai15036
10.8%
bunun13382
 
9.6%
atayal11289
 
8.1%
puyuma10359
 
7.5%
amis9978
 
7.2%
kavalan9444
 
6.8%
thao8777
 
6.3%
seediq8025
 
5.8%
paiwan8009
 
5.8%
yami7867
 
5.7%
Other values (6)36857
26.5%

Most occurring characters

ValueCountFrequency (%)
a189565
23.6%
u85836
 
10.7%
i68931
 
8.6%
n59909
 
7.5%
y34769
 
4.3%
k34272
 
4.3%
m28204
 
3.5%
S26728
 
3.3%
s22017
 
2.7%
A21267
 
2.6%
Other values (17)232596
28.9%

Most occurring categories

ValueCountFrequency (%)
Lowercase Letter665071
82.7%
Uppercase Letter139023
 
17.3%

Most frequent character per category

ValueCountFrequency (%)
a189565
28.5%
u85836
12.9%
i68931
 
10.4%
n59909
 
9.0%
y34769
 
5.2%
k34272
 
5.2%
m28204
 
4.2%
s22017
 
3.3%
l20733
 
3.1%
o19503
 
2.9%
Other values (9)101332
15.2%
ValueCountFrequency (%)
S26728
19.2%
A21267
15.3%
T19085
13.7%
P18368
13.2%
K17290
12.4%
R15036
10.8%
B13382
9.6%
Y7867
 
5.7%

Most occurring scripts

ValueCountFrequency (%)
Latin804094
100.0%

Most frequent character per script

ValueCountFrequency (%)
a189565
23.6%
u85836
 
10.7%
i68931
 
8.6%
n59909
 
7.5%
y34769
 
4.3%
k34272
 
4.3%
m28204
 
3.5%
S26728
 
3.3%
s22017
 
2.7%
A21267
 
2.6%
Other values (17)232596
28.9%

Most occurring blocks

ValueCountFrequency (%)
ASCII804094
100.0%

Most frequent character per block

ValueCountFrequency (%)
a189565
23.6%
u85836
 
10.7%
i68931
 
8.6%
n59909
 
7.5%
y34769
 
4.3%
k34272
 
4.3%
m28204
 
3.5%
S26728
 
3.3%
s22017
 
2.7%
A21267
 
2.6%
Other values (17)232596
28.9%

Lang_Ch
Categorical

HIGH CORRELATION

Distinct: 43
Distinct (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 137.5 KiB
魯凱_霧台
11015 
布農_郡群
10446 
噶瑪蘭
9444 
 
8777
泰雅_賽考利克
 
8350
Other values (38)
90991 

Length

Max length: 8
Median length: 5
Mean length: 4.223373111
Min length: 1

Characters and Unicode

Total characters: 587146
Distinct characters: 80
Distinct categories: 2
Distinct scripts: 2
Distinct blocks: 2
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Unique

Unique: 0
Unique (%): 0.0%

Sample

1st row撒奇萊雅
2nd row撒奇萊雅
3rd row撒奇萊雅
4th row撒奇萊雅
5th row撒奇萊雅
ValueCountFrequency (%)
魯凱_霧台11015
 
7.9%
布農_郡群10446
 
7.5%
噶瑪蘭9444
 
6.8%
8777
 
6.3%
泰雅_賽考利克8350
 
6.0%
達悟7867
 
5.7%
卡那卡那富7846
 
5.6%
卑南_南王7700
 
5.5%
賽夏6895
 
5.0%
賽德克_德固達雅6599
 
4.7%
Other values (33)54084
38.9%
Histogram of lengths of the category
ValueCountFrequency (%)
魯凱_霧台11015
 
7.9%
布農_郡群10446
 
7.5%
噶瑪蘭9444
 
6.8%
8777
 
6.3%
泰雅_賽考利克8350
 
6.0%
達悟7867
 
5.7%
卡那卡那富7846
 
5.6%
卑南_南王7700
 
5.5%
賽夏6895
 
5.0%
賽德克_德固達雅6599
 
4.7%
Other values (33)54084
38.9%

Most occurring characters

ValueCountFrequency (%)
_76078
 
13.0%
25782
 
4.4%
24114
 
4.1%
23270
 
4.0%
19707
 
3.4%
16480
 
2.8%
16375
 
2.8%
15692
 
2.7%
15560
 
2.7%
15349
 
2.6%
Other values (70)338739
57.7%

Most occurring categories

ValueCountFrequency (%)
Other Letter511068
87.0%
Connector Punctuation76078
 
13.0%

Most frequent character per category

ValueCountFrequency (%)
25782
 
5.0%
24114
 
4.7%
23270
 
4.6%
19707
 
3.9%
16480
 
3.2%
16375
 
3.2%
15692
 
3.1%
15560
 
3.0%
15349
 
3.0%
15167
 
3.0%
Other values (69)323572
63.3%
ValueCountFrequency (%)
_76078
100.0%

Most occurring scripts

ValueCountFrequency (%)
Han511068
87.0%
Common76078
 
13.0%

Most frequent character per script

ValueCountFrequency (%)
25782
 
5.0%
24114
 
4.7%
23270
 
4.6%
19707
 
3.9%
16480
 
3.2%
16375
 
3.2%
15692
 
3.1%
15560
 
3.0%
15349
 
3.0%
15167
 
3.0%
Other values (69)323572
63.3%
ValueCountFrequency (%)
_76078
100.0%

Most occurring blocks

ValueCountFrequency (%)
CJK511068
87.0%
ASCII76078
 
13.0%

Most frequent character per block

ValueCountFrequency (%)
25782
 
5.0%
24114
 
4.7%
23270
 
4.6%
19707
 
3.9%
16480
 
3.2%
16375
 
3.2%
15692
 
3.1%
15560
 
3.0%
15349
 
3.0%
15167
 
3.0%
Other values (69)323572
63.3%
ValueCountFrequency (%)
_76078
100.0%

Ab
Categorical

HIGH CARDINALITY
UNIFORM

Distinct: 138472
Distinct (%): 99.6%
Missing: 0
Missing (%): 0.0%
Memory size: 1.1 MiB
na.
 
4
anema azua?
 
4
sinsi, nana ku walri .
 
4
Satokien ako to romi’ami’ad .
 
3
su sinsi timadju?
 
3
Other values (138467)
139005 

Length

Max length: 486
Median length: 37
Mean length: 39.69009444
Min length: 1

Characters and Unicode

Total characters: 5517836
Distinct characters: 154
Distinct categories: 18
Distinct scripts: 4
Distinct blocks: 10
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Unique

Unique: 137968
Unique (%): 99.2%
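The "Distinct" and "Unique" figures are easy to confuse. The sketch below assumes that "distinct" means the number of different values and "unique" means the number of values occurring exactly once, which is consistent with the counts shown for Ab; the function name `distinct_and_unique` and the toy list are hypothetical.

```python
from collections import Counter


def distinct_and_unique(values):
    """Return (distinct, unique, distinct share, unique share) for a list of values."""
    n = len(values)
    if n == 0:
        raise ValueError("need at least one value")
    counts = Counter(values)
    distinct = len(counts)                                  # different values
    unique = sum(1 for c in counts.values() if c == 1)      # values seen exactly once
    return distinct, unique, distinct / n, unique / n


# Toy column: "na." is repeated, so it is distinct but not unique.
toy = ["na.", "na.", "anema azua?", "sgagay ta la!"]
print(distinct_and_unique(toy))

# Percentages recomputed from the counts reported for Ab (139023 observations).
distinct_pct = round(138472 / 139023 * 100, 1)
unique_pct = round(137968 / 139023 * 100, 1)
print(distinct_pct, unique_pct)
```

Under that reading, the gap between 138472 distinct and 137968 unique values corresponds to sentences that appear more than once.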

Sample

1st rowmalalikid ku niyazu' i waluay a bulad.
2nd rowkaudadan a demiad milalupela' kita.
3rd rowi buyubuyu'an ku aadupen a mauzip.
4th rowu aam ku sakalanam tu sananal.
5th rowaamen nu miaamay ku tubah ni Bunga!
ValueCountFrequency (%)
na.4
 
< 0.1%
anema azua? 4
 
< 0.1%
sinsi, nana ku walri .4
 
< 0.1%
Satokien ako to romi’ami’ad .3
 
< 0.1%
su sinsi timadju? 3
 
< 0.1%
nana ku matra. 3
 
< 0.1%
sgagay ta la! 3
 
< 0.1%
imu, muruma’ ku lra .3
 
< 0.1%
tatelraw nu ’arevu? 3
 
< 0.1%
nu mukuwa ku i takesiyan zi nu muruma’ ku mu .3
 
< 0.1%
Other values (138462)138990
> 99.9%
Histogram of lengths of the category
ValueCountFrequency (%)
a24357
 
2.7%
ku17678
 
1.9%
na17666
 
1.9%
ka17234
 
1.9%
tu15594
 
1.7%
i10176
 
1.1%
o8694
 
0.9%
7670
 
0.8%
su7317
 
0.8%
ta6945
 
0.8%
Other values (140997)782473
85.4%

Most occurring characters

ValueCountFrequency (%)
a1008449
18.3%
800736
14.5%
i418441
 
7.6%
n384737
 
7.0%
u357291
 
6.5%
k239393
 
4.3%
m208199
 
3.8%
s177487
 
3.2%
t177300
 
3.2%
l157855
 
2.9%
Other values (144)1587948
28.8%

Most occurring categories

ValueCountFrequency (%)
Lowercase Letter4362309
79.1%
Space Separator800740
 
14.5%
Other Punctuation258804
 
4.7%
Uppercase Letter64443
 
1.2%
Final Punctuation21452
 
0.4%
Dash Punctuation7056
 
0.1%
Initial Punctuation567
 
< 0.1%
Open Punctuation524
 
< 0.1%
Close Punctuation522
 
< 0.1%
Modifier Symbol508
 
< 0.1%
Other values (8)911
 
< 0.1%

Most frequent character per category

ValueCountFrequency (%)
a1008449
23.1%
i418441
 
9.6%
n384737
 
8.8%
u357291
 
8.2%
k239393
 
5.5%
m208199
 
4.8%
s177487
 
4.1%
t177300
 
4.1%
l157855
 
3.6%
e143962
 
3.3%
Other values (30)1089195
25.0%
ValueCountFrequency (%)
M10030
15.6%
S9783
15.2%
R6910
10.7%
T4639
 
7.2%
P4543
 
7.0%
I3602
 
5.6%
A3561
 
5.5%
K3304
 
5.1%
N2524
 
3.9%
O2038
 
3.2%
Other values (18)13509
21.0%
ValueCountFrequency (%)
13
21.3%
10
16.4%
5
 
8.2%
5
 
8.2%
4
 
6.6%
4
 
6.6%
4
 
6.6%
1
 
1.6%
1
 
1.6%
1
 
1.6%
Other values (13)13
21.3%
ValueCountFrequency (%)
.114962
44.4%
'66659
25.8%
,35630
 
13.8%
?21850
 
8.4%
!11300
 
4.4%
:6242
 
2.4%
;1051
 
0.4%
/597
 
0.2%
"266
 
0.1%
69
 
< 0.1%
Other values (10)178
 
0.1%
ValueCountFrequency (%)
187
19.9%
069
15.8%
846
10.5%
245
10.3%
942
9.6%
539
8.9%
333
 
7.6%
728
 
6.4%
428
 
6.4%
620
 
4.6%
ValueCountFrequency (%)
(485
92.6%
28
 
5.3%
6
 
1.1%
[5
 
1.0%
ValueCountFrequency (%)
)484
92.7%
27
 
5.2%
6
 
1.1%
]5
 
1.0%
ValueCountFrequency (%)
800736
> 99.9%
 3
 
< 0.1%
 1
 
< 0.1%
ValueCountFrequency (%)
^497
97.8%
˄10
 
2.0%
´1
 
0.2%
ValueCountFrequency (%)
1
33.3%
1
33.3%
1
33.3%
ValueCountFrequency (%)
32
54.2%
22
37.3%
5
 
8.5%
ValueCountFrequency (%)
́43
84.3%
̄7
 
13.7%
̅1
 
2.0%
ValueCountFrequency (%)
518
91.4%
49
 
8.6%
ValueCountFrequency (%)
20910
97.5%
542
 
2.5%
ValueCountFrequency (%)
ʼ2
50.0%
ˆ2
50.0%
ValueCountFrequency (%)
=27
81.8%
~6
 
18.2%
ValueCountFrequency (%)
-7056
100.0%
ValueCountFrequency (%)
_263
100.0%

Most occurring scripts

ValueCountFrequency (%)
Latin4426752
80.2%
Common1090972
 
19.8%
Han61
 
< 0.1%
Inherited51
 
< 0.1%

Most frequent character per script

ValueCountFrequency (%)
a1008449
22.8%
i418441
 
9.5%
n384737
 
8.7%
u357291
 
8.1%
k239393
 
5.4%
m208199
 
4.7%
s177487
 
4.0%
t177300
 
4.0%
l157855
 
3.6%
e143962
 
3.3%
Other values (58)1153638
26.1%
ValueCountFrequency (%)
800736
73.4%
.114962
 
10.5%
'66659
 
6.1%
,35630
 
3.3%
?21850
 
2.0%
20910
 
1.9%
!11300
 
1.0%
-7056
 
0.6%
:6242
 
0.6%
;1051
 
0.1%
Other values (50)4576
 
0.4%
ValueCountFrequency (%)
13
21.3%
10
16.4%
5
 
8.2%
5
 
8.2%
4
 
6.6%
4
 
6.6%
4
 
6.6%
1
 
1.6%
1
 
1.6%
1
 
1.6%
Other values (13)13
21.3%
ValueCountFrequency (%)
́43
84.3%
̄7
 
13.7%
̅1
 
2.0%

Most occurring blocks

ValueCountFrequency (%)
ASCII5463197
99.0%
IPA Ext30735
 
0.6%
Punctuation22049
 
0.4%
None1332
 
< 0.1%
Latin Ext Additional394
 
< 0.1%
CJK61
 
< 0.1%
Diacriticals51
 
< 0.1%
Modifier Letters14
 
< 0.1%
Box Drawing2
 
< 0.1%
Arrows1
 
< 0.1%

Most frequent character per block

ValueCountFrequency (%)
a1008449
18.5%
800736
14.7%
i418441
 
7.7%
n384737
 
7.0%
u357291
 
6.5%
k239393
 
4.4%
m208199
 
3.8%
s177487
 
3.2%
t177300
 
3.2%
l157855
 
2.9%
Other values (76)1533309
28.1%
ValueCountFrequency (%)
20910
94.8%
542
 
2.5%
518
 
2.3%
49
 
0.2%
28
 
0.1%
2
 
< 0.1%
ValueCountFrequency (%)
é720
54.1%
á103
 
7.7%
ē79
 
5.9%
69
 
5.2%
í67
 
5.0%
ú67
 
5.0%
45
 
3.4%
43
 
3.2%
28
 
2.1%
27
 
2.0%
Other values (17)84
 
6.3%
ValueCountFrequency (%)
1
50.0%
1
50.0%
ValueCountFrequency (%)
˄10
71.4%
ʼ2
 
14.3%
ˆ2
 
14.3%
ValueCountFrequency (%)
ʉ29406
95.7%
ɨ1329
 
4.3%
ValueCountFrequency (%)
13
21.3%
10
16.4%
5
 
8.2%
5
 
8.2%
4
 
6.6%
4
 
6.6%
4
 
6.6%
1
 
1.6%
1
 
1.6%
1
 
1.6%
Other values (13)13
21.3%
ValueCountFrequency (%)
́43
84.3%
̄7
 
13.7%
̅1
 
2.0%
ValueCountFrequency (%)
1
100.0%
ValueCountFrequency (%)
394
100.0%

Ch
Categorical

HIGH CARDINALITY

Distinct: 118291
Distinct (%): 85.1%
Missing: 0
Missing (%): 0.0%
Memory size: 1.1 MiB
那個人很勤勞嗎?
 
74
下雨了!你帶著雨傘嗎?
 
72
今天熱嗎?
 
71
你們天天來這裡吃晚餐嗎?
 
71
你有幾個兄弟姊妹?
 
71
Other values (118286)
138664 

Length

Max length: 128
Median length: 11
Mean length: 12.14047316
Min length: 1

Characters and Unicode

Total characters: 1687805
Distinct characters: 4437
Distinct categories: 18
Distinct scripts: 5
Distinct blocks: 15
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Unique

Unique: 112797
Unique (%): 81.1%

Sample

1st row八月份是部落的豐年祭。
2nd row下雨天我們一起去撿天使的眼淚。
3rd row動物生存在山裡。
4th row早餐吃的是稀飯。
5th row乞丐去向Bunga乞討地瓜!
ValueCountFrequency (%)
那個人很勤勞嗎?74
 
0.1%
下雨了!你帶著雨傘嗎?72
 
0.1%
今天熱嗎?71
 
0.1%
你們天天來這裡吃晚餐嗎?71
 
0.1%
你有幾個兄弟姊妹?71
 
0.1%
那間房子很大嗎?69
 
< 0.1%
他們天天看電視嗎?69
 
< 0.1%
在下雨嗎?68
 
< 0.1%
那張椅子很重嗎?66
 
< 0.1%
她的衣服是紅色的嗎?66
 
< 0.1%
Other values (118281)138326
99.5%
Histogram of lengths of the category
ValueCountFrequency (%)
88
 
0.1%
81
 
0.1%
那個人很勤勞嗎?74
 
0.1%
subali73
 
0.1%
元。73
 
0.1%
下雨了!你帶著雨傘嗎?72
 
0.1%
你們天天來這裡吃晚餐嗎?71
 
0.1%
你有幾個兄弟姊妹?71
 
0.1%
今天熱嗎?71
 
0.1%
他們天天看電視嗎?69
 
< 0.1%
Other values (118731)140906
99.5%

Most occurring characters

ValueCountFrequency (%)
103122
 
6.1%
63472
 
3.8%
48642
 
2.9%
36223
 
2.1%
26784
 
1.6%
22800
 
1.4%
20704
 
1.2%
20680
 
1.2%
20651
 
1.2%
20331
 
1.2%
Other values (4427)1304396
77.3%

Most occurring categories

ValueCountFrequency (%)
Other Letter1410000
83.5%
Other Punctuation182776
 
10.8%
Lowercase Letter60147
 
3.6%
Uppercase Letter10715
 
0.6%
Open Punctuation9507
 
0.6%
Close Punctuation9410
 
0.6%
Space Separator2976
 
0.2%
Decimal Number1661
 
0.1%
Final Punctuation390
 
< 0.1%
Dash Punctuation60
 
< 0.1%
Other values (8)163
 
< 0.1%

Most frequent character per category

ValueCountFrequency (%)
63472
 
4.5%
48642
 
3.4%
26784
 
1.9%
22800
 
1.6%
20704
 
1.5%
20680
 
1.5%
20651
 
1.5%
20331
 
1.4%
19094
 
1.4%
18573
 
1.3%
Other values (4276)1128269
80.0%
ValueCountFrequency (%)
P1512
14.1%
T1285
12.0%
A1248
11.6%
S1001
9.3%
K763
 
7.1%
B743
 
6.9%
M634
 
5.9%
U605
 
5.6%
Y489
 
4.6%
L357
 
3.3%
Other values (22)2078
19.4%
ValueCountFrequency (%)
a14231
23.7%
u6096
10.1%
i5769
 
9.6%
n5378
 
8.9%
y2711
 
4.5%
s2400
 
4.0%
g2376
 
4.0%
l2369
 
3.9%
k2164
 
3.6%
w2012
 
3.3%
Other values (20)14641
24.3%
ValueCountFrequency (%)
103122
56.4%
36223
 
19.8%
17466
 
9.6%
10484
 
5.7%
?4392
 
2.4%
'1916
 
1.0%
1845
 
1.0%
/1612
 
0.9%
.1530
 
0.8%
1052
 
0.6%
Other values (18)3134
 
1.7%
ValueCountFrequency (%)
(5745
60.4%
2821
29.7%
835
 
8.8%
[69
 
0.7%
17
 
0.2%
11
 
0.1%
3
 
< 0.1%
2
 
< 0.1%
2
 
< 0.1%
1
 
< 0.1%
ValueCountFrequency (%)
)5682
60.4%
2793
29.7%
829
 
8.8%
]69
 
0.7%
17
 
0.2%
11
 
0.1%
3
 
< 0.1%
2
 
< 0.1%
2
 
< 0.1%
1
 
< 0.1%
ValueCountFrequency (%)
0471
28.4%
1301
18.1%
2200
12.0%
5169
 
10.2%
9105
 
6.3%
495
 
5.7%
393
 
5.6%
889
 
5.4%
774
 
4.5%
664
 
3.9%
ValueCountFrequency (%)
18
62.1%
5
 
17.2%
2
 
6.9%
2
 
6.9%
1
 
3.4%
1
 
3.4%
ValueCountFrequency (%)
=19
59.4%
9
28.1%
~2
 
6.2%
1
 
3.1%
1
 
3.1%
ValueCountFrequency (%)
2806
94.3%
 122
 
4.1%
 48
 
1.6%
ValueCountFrequency (%)
-53
88.3%
4
 
6.7%
3
 
5.0%
ValueCountFrequency (%)
9
45.0%
7
35.0%
4
20.0%
ValueCountFrequency (%)
232
59.5%
158
40.5%
ValueCountFrequency (%)
^2
66.7%
´1
33.3%
ValueCountFrequency (%)
34
97.1%
1
 
2.9%
ValueCountFrequency (%)
ˋ35
100.0%
ValueCountFrequency (%)
­1
100.0%
ValueCountFrequency (%)
8
100.0%

Most occurring scripts

ValueCountFrequency (%)
Han1409938
83.5%
Common206935
 
12.3%
Latin70862
 
4.2%
Bopomofo62
 
< 0.1%
Unknown8
 
< 0.1%

Most frequent character per script

ValueCountFrequency (%)
63472
 
4.5%
48642
 
3.4%
26784
 
1.9%
22800
 
1.6%
20704
 
1.5%
20680
 
1.5%
20651
 
1.5%
20331
 
1.4%
19094
 
1.4%
18573
 
1.3%
Other values (4272)1128207
80.0%
ValueCountFrequency (%)
103122
49.8%
36223
 
17.5%
17466
 
8.4%
10484
 
5.1%
(5745
 
2.8%
)5682
 
2.7%
?4392
 
2.1%
2821
 
1.4%
2806
 
1.4%
2793
 
1.3%
Other values (78)15401
 
7.4%
ValueCountFrequency (%)
a14231
20.1%
u6096
 
8.6%
i5769
 
8.1%
n5378
 
7.6%
y2711
 
3.8%
s2400
 
3.4%
g2376
 
3.4%
l2369
 
3.3%
k2164
 
3.1%
w2012
 
2.8%
Other values (52)25356
35.8%
ValueCountFrequency (%)
58
93.5%
2
 
3.2%
1
 
1.6%
1
 
1.6%
ValueCountFrequency (%)
8
100.0%

Most occurring blocks

ValueCountFrequency (%)
CJK1409259
83.5%
None178616
 
10.6%
ASCII98392
 
5.8%
CJK Compat Ideographs679
 
< 0.1%
Punctuation533
 
< 0.1%
Small Forms125
 
< 0.1%
IPA Ext74
 
< 0.1%
Bopomofo62
 
< 0.1%
Modifier Letters35
 
< 0.1%
Box Drawing9
 
< 0.1%
Other values (5)21
 
< 0.1%

Most frequent character per block

ValueCountFrequency (%)
63472
 
4.5%
48642
 
3.5%
26784
 
1.9%
22800
 
1.6%
20704
 
1.5%
20680
 
1.5%
20651
 
1.5%
20331
 
1.4%
19094
 
1.4%
18573
 
1.3%
Other values (4202)1127528
80.0%
ValueCountFrequency (%)
103122
57.7%
36223
 
20.3%
17466
 
9.8%
10484
 
5.9%
2821
 
1.6%
2793
 
1.6%
1845
 
1.0%
1052
 
0.6%
835
 
0.5%
834
 
0.5%
Other values (31)1141
 
0.6%
ValueCountFrequency (%)
a14231
 
14.5%
u6096
 
6.2%
i5769
 
5.9%
(5745
 
5.8%
)5682
 
5.8%
n5378
 
5.5%
?4392
 
4.5%
2806
 
2.9%
y2711
 
2.8%
s2400
 
2.4%
Other values (78)43182
43.9%
ValueCountFrequency (%)
ˋ35
100.0%
ValueCountFrequency (%)
58
93.5%
2
 
3.2%
1
 
1.6%
1
 
1.6%
ValueCountFrequency (%)
232
43.5%
158
29.6%
94
17.6%
34
 
6.4%
11
 
2.1%
3
 
0.6%
1
 
0.2%
ValueCountFrequency (%)
7
100.0%
ValueCountFrequency (%)
99
79.2%
13
 
10.4%
7
 
5.6%
4
 
3.2%
1
 
0.8%
1
 
0.8%
ValueCountFrequency (%)
9
100.0%
ValueCountFrequency (%)
ʉ62
83.8%
ɨ12
 
16.2%
ValueCountFrequency (%)
8
100.0%
ValueCountFrequency (%)
73
 
10.8%
67
 
9.9%
50
 
7.4%
44
 
6.5%
42
 
6.2%
40
 
5.9%
32
 
4.7%
30
 
4.4%
22
 
3.2%
19
 
2.8%
Other values (60)260
38.3%
ValueCountFrequency (%)
1
100.0%
ValueCountFrequency (%)
4
100.0%
ValueCountFrequency (%)
1
100.0%

From
Categorical

Distinct: 5
Distinct (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 136.1 KiB
詞典
103864 
生活會話
12892 
句型
10452 
九階教材
 
6088
文法
 
5727

Length

Max length: 4
Median length: 2
Mean length: 2.273048345
Min length: 2

Characters and Unicode

Total characters: 316006
Distinct characters: 14
Distinct categories: 1
Distinct scripts: 1
Distinct blocks: 1
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Unique

Unique: 0
Unique (%): 0.0%

Sample

1st row詞典
2nd row詞典
3rd row詞典
4th row詞典
5th row詞典
ValueCountFrequency (%)
詞典103864
74.7%
生活會話12892
 
9.3%
句型10452
 
7.5%
九階教材6088
 
4.4%
文法5727
 
4.1%
Histogram of lengths of the category
ValueCountFrequency (%)
詞典103864
74.7%
生活會話12892
 
9.3%
句型10452
 
7.5%
九階教材6088
 
4.4%
文法5727
 
4.1%

Most occurring characters

ValueCountFrequency (%)
103864
32.9%
103864
32.9%
12892
 
4.1%
12892
 
4.1%
12892
 
4.1%
12892
 
4.1%
10452
 
3.3%
10452
 
3.3%
6088
 
1.9%
6088
 
1.9%
Other values (4)23630
 
7.5%

Most occurring categories

ValueCountFrequency (%)
Other Letter316006
100.0%

Most frequent character per category

ValueCountFrequency (%)
103864
32.9%
103864
32.9%
12892
 
4.1%
12892
 
4.1%
12892
 
4.1%
12892
 
4.1%
10452
 
3.3%
10452
 
3.3%
6088
 
1.9%
6088
 
1.9%
Other values (4)23630
 
7.5%

Most occurring scripts

ValueCountFrequency (%)
Han316006
100.0%

Most frequent character per script

ValueCountFrequency (%)
103864
32.9%
103864
32.9%
12892
 
4.1%
12892
 
4.1%
12892
 
4.1%
12892
 
4.1%
10452
 
3.3%
10452
 
3.3%
6088
 
1.9%
6088
 
1.9%
Other values (4)23630
 
7.5%

Most occurring blocks

ValueCountFrequency (%)
CJK316006
100.0%

Most frequent character per block

ValueCountFrequency (%)
103864
32.9%
103864
32.9%
12892
 
4.1%
12892
 
4.1%
12892
 
4.1%
12892
 
4.1%
10452
 
3.3%
10452
 
3.3%
6088
 
1.9%
6088
 
1.9%
Other values (4)23630
 
7.5%

word_counts
Real number (ℝ≥0)

Distinct: 51
Distinct (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 6.58742798
Minimum: 1
Maximum: 89
Zeros: 0
Zeros (%): 0.0%
Negative: 0
Negative (%): 0.0%
Memory size: 1.1 MiB

Quantile statistics

Minimum: 1
5-th percentile: 3
Q1: 5
median: 6
Q3: 8
95-th percentile: 12
Maximum: 89
Range: 88
Interquartile range (IQR): 3

Descriptive statistics

Standard deviation: 3.09127209
Coefficient of variation (CV): 0.4692684458
Kurtosis: 13.47000465
Mean: 6.58742798
Median Absolute Deviation (MAD): 2
Skewness: 2.008589441
Sum: 915804
Variance: 9.555963133
Monotonicity: Not monotonic
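Several of the word_counts statistics are simple functions of one another. The sketch below recomputes them from the figures reported above as a consistency check; the variable names are illustrative and the inputs are copied from this report, not recomputed from the dataset.

```python
# Figures copied from the word_counts section of this report.
n_observations = 139023
total_sum = 915804
std_dev = 3.09127209
q1, q3 = 5, 8
minimum, maximum = 1, 89

mean = total_sum / n_observations      # Mean = Sum / number of observations
cv = std_dev / mean                    # Coefficient of variation = std / mean
variance = std_dev ** 2                # Variance = std squared
iqr = q3 - q1                          # Interquartile range = Q3 - Q1
value_range = maximum - minimum        # Range = Maximum - Minimum

print(f"mean={mean:.8f} cv={cv:.10f} variance={variance:.9f}")
print(f"iqr={iqr} range={value_range}")
```

The recomputed values agree with the reported mean, CV, variance, IQR and range up to rounding.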
Histogram with fixed size bins (bins=50)
ValueCountFrequency (%)
523536
16.9%
622566
16.2%
718690
13.4%
417842
12.8%
813567
9.8%
310447
7.5%
99231
 
6.6%
105867
 
4.2%
113906
 
2.8%
23409
 
2.5%
Other values (41)9962
7.2%
ValueCountFrequency (%)
11164
 
0.8%
23409
 
2.5%
310447
7.5%
417842
12.8%
523536
16.9%
ValueCountFrequency (%)
891
< 0.1%
631
< 0.1%
572
< 0.1%
521
< 0.1%
491
< 0.1%

Interactions

Correlations

Pearson's r

Pearson's correlation coefficient (r) is a measure of linear correlation between two variables. Its value lies between -1 and +1, with -1 indicating total negative linear correlation, 0 indicating no linear correlation, and +1 indicating total positive linear correlation. Furthermore, r is invariant under separate changes in location and scale of the two variables, implying that for a linear function the angle to the x-axis does not affect r.

To calculate r for two variables X and Y, one divides the covariance of X and Y by the product of their standard deviations.
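A minimal sketch of the calculation described above, in plain Python. The function name `pearson_r` and the sample data are assumptions for illustration; this is not the code the report itself runs.

```python
import math


def pearson_r(xs, ys):
    """Covariance of xs and ys divided by the product of their standard deviations."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # The 1/n factors of the covariance and the standard deviations cancel,
    # so plain sums of deviations are sufficient.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if dev_x == 0 or dev_y == 0:
        raise ValueError("r is undefined when a variable is constant")
    return cov / (dev_x * dev_y)


xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]
r = pearson_r(xs, ys)
# Shifting and rescaling one variable leaves r unchanged.
r_rescaled = pearson_r(xs, [3 * y + 7 for y in ys])
print(r, r_rescaled)
```

The second call illustrates the invariance under changes in location and scale mentioned above.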

Spearman's ρ

Spearman's rank correlation coefficient (ρ) is a measure of monotonic correlation between two variables, and is therefore better at catching nonlinear monotonic correlations than Pearson's r. Its value lies between -1 and +1, with -1 indicating total negative monotonic correlation, 0 indicating no monotonic correlation, and +1 indicating total positive monotonic correlation.

To calculate ρ for two variables X and Y, one divides the covariance of the rank variables of X and Y by the product of their standard deviations.
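The rank-based calculation can be sketched as follows: replace each value by its rank, then apply the covariance-over-standard-deviations formula to the ranks. The helper names `average_ranks` and `spearman_rho` are hypothetical, and assigning tied values their average rank is an assumption about tie handling.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Pearson correlation computed on the rank variables of xs and ys."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in rx))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in ry))
    if dev_x == 0 or dev_y == 0:
        raise ValueError("rho is undefined when a variable is constant")
    return cov / (dev_x * dev_y)


# A nonlinear but strictly increasing relationship is perfectly monotonic.
xs = [1, 2, 3, 4, 5]
ys = [x ** 2 for x in xs]
rho = spearman_rho(xs, ys)
print(rho)
```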

Kendall's τ

Similar to Spearman's rank correlation coefficient, the Kendall rank correlation coefficient (τ) measures ordinal association between two variables. Its value lies between -1 and +1, with -1 indicating total negative correlation, 0 indicating no correlation, and +1 indicating total positive correlation.

To calculate τ for two variables X and Y, one determines the number of concordant and discordant pairs of observations. τ is given by the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs.
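The pair-counting definition can be sketched literally as below. This form, which divides by the total number of pairs, is often called tau-a; statistical libraries commonly apply a tie-adjusted variant instead, so values may differ on data with ties. The function name `kendall_tau_a` and the sample data are assumptions for illustration.

```python
from itertools import combinations


def kendall_tau_a(xs, ys):
    """(concordant pairs - discordant pairs) / total number of pairs."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1      # both variables move in the same direction
        elif direction < 0:
            discordant += 1      # the variables move in opposite directions
        # pairs tied in either variable count as neither
    total_pairs = n * (n - 1) // 2
    return (concordant - discordant) / total_pairs


# Six pairs in total: five concordant, one discordant (the last two y values).
tau = kendall_tau_a([1, 2, 3, 4], [1, 2, 4, 3])
print(tau)
```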

Phik (φk)

Phik (φk) is a new and practical correlation coefficient that works consistently between categorical, ordinal and interval variables, captures non-linear dependency and reverts to the Pearson correlation coefficient in case of a bivariate normal input distribution. The Phik project provides extensive documentation.

Cramér's V (φc)

Cramér's V is an association measure for nominal random variables. The coefficient ranges from 0 to 1, with 0 indicating independence and 1 indicating perfect association. The empirical estimators used for Cramér's V have been proved to be biased, even for large samples. We use a bias-corrected measure proposed by Bergsma in 2013.
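A sketch of a bias-corrected Cramér's V follows, using the correction commonly attributed to Bergsma (2013): the chi-squared based φ² and the table dimensions are each shrunk by a term in 1/(n-1). The function name `cramers_v_corrected` and the toy data are hypothetical, and the report's own implementation may differ in detail.

```python
import math
from collections import Counter


def cramers_v_corrected(xs, ys):
    """Bias-corrected Cramér's V for two equally long sequences of nominal values."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    joint = Counter(zip(xs, ys))
    row_totals, col_totals = Counter(xs), Counter(ys)
    r, k = len(row_totals), len(col_totals)

    # Pearson chi-squared statistic over the full contingency table.
    chi2 = 0.0
    for a, row_n in row_totals.items():
        for b, col_n in col_totals.items():
            expected = row_n * col_n / n
            observed = joint.get((a, b), 0)
            chi2 += (observed - expected) ** 2 / expected

    phi2 = chi2 / n
    phi2_corr = max(0.0, phi2 - (k - 1) * (r - 1) / (n - 1))
    r_corr = r - (r - 1) ** 2 / (n - 1)
    k_corr = k - (k - 1) ** 2 / (n - 1)
    denom = min(r_corr - 1, k_corr - 1)
    if denom <= 0:
        return 0.0  # a constant variable carries no association
    return math.sqrt(phi2_corr / denom)


# Each English label maps to exactly one Chinese label: perfect association.
en = ["Rukai"] * 50 + ["Bunun"] * 50
ch = ["魯凱"] * 50 + ["布農"] * 50
v_perfect = cramers_v_corrected(en, ch)
# Every combination equally frequent: independence.
v_independent = cramers_v_corrected(["a", "a", "b", "b"], ["c", "d", "c", "d"])
print(v_perfect, v_independent)
```

A value near 1 for a pair such as Lang_En and Lang_Ch would be in line with the high-correlation warning raised for those two variables.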

Missing values

A simple visualization of nullity by column.
The nullity matrix is a data-dense display that lets you quickly pick out patterns in data completion.

Sample

First rows

Row | Lang_En | Lang_Ch | Ab | Ch | From | word_counts
0 | Sakizaya | 撒奇萊雅 | malalikid ku niyazu' i waluay a bulad. | 八月份是部落的豐年祭。 | 詞典 | 7
1 | Sakizaya | 撒奇萊雅 | kaudadan a demiad milalupela' kita. | 下雨天我們一起去撿天使的眼淚。 | 詞典 | 5
2 | Sakizaya | 撒奇萊雅 | i buyubuyu'an ku aadupen a mauzip. | 動物生存在山裡。 | 詞典 | 6
3 | Sakizaya | 撒奇萊雅 | u aam ku sakalanam tu sananal. | 早餐吃的是稀飯。 | 詞典 | 6
4 | Sakizaya | 撒奇萊雅 | aamen nu miaamay ku tubah ni Bunga! | 乞丐去向Bunga乞討地瓜! | 詞典 | 7
5 | Sakizaya | 撒奇萊雅 | miaam ku miaamay tu hemay. | 乞丐常常來討飯。 | 詞典 | 5
6 | Sakizaya | 撒奇萊雅 | katuud ku miaamay i Taypak. | 臺北市有很多乞丐。 | 詞典 | 5
7 | Sakizaya | 撒奇萊雅 | misaaam kaku tu sakalanam nu niyam. | 我要煮我們早餐要吃的稀飯。 | 詞典 | 6
8 | Sakizaya | 撒奇萊雅 | sapisaaam kina dangah. | 這是煮稀飯的大鍋。 | 詞典 | 3
9 | Sakizaya | 撒奇萊雅 | kau baduwac nu pabuy ku pacamul tu sasaaamen. | 用豬的排骨來熬稀飯。 | 詞典 | 8

Last rows

Row | Lang_En | Lang_Ch | Ab | Ch | From | word_counts
139013 | Bunun | 布農_郡群 | Inaak kaviaz hai, kuzamian tantungu. | 我的朋友到我們的地方作客。 | 詞典 | 5
139014 | Bunun | 布農_郡群 | Izamian tu sinsusuaz hai, matalbuh amin. | 我們的農作物都很肥碩。 | 詞典 | 6
139015 | Bunun | 布農_郡群 | pinitsanavan. | 在我們這裡吃晚餐吧。 | 詞典 | 1
139016 | Bunun | 布農_郡群 | Mali hai, mazaum aupa ukaan is-aang. | 氣球軟軟的,因為沒有氣。 | 詞典 | 6
139017 | Bunun | 布農_郡群 | Ukaan saikin mas zikaang pishasibang. | 我沒有時間玩。 | 詞典 | 5
139018 | Bunun | 布農_郡群 | Asa tu kapimaupa mas sinpatupa tu zikaang. | 要遵守約定的時間。 | 詞典 | 7
139019 | Bunun | 布農_郡群 | Isia makazavan tu hanian, uvaaz hai, supahan mas zungzung. | 寒冷的天氣裡,小孩子鼻涕很多。 | 詞典 | 9
139020 | Bunun | 布農_郡群 | Zungzung hai, maduhtaz. | 鼻涕是黏的。 | 詞典 | 3
139021 | Bunun | 布農_郡群 | Maza hazam hai, pandu sia lukis tu zuszus. | 鳥兒停棲在樹梢。 | 詞典 | 8
139022 | Bunun | 布農_郡群 | Mazima saikin maun mas kinal-ing tu lili tu zuszus. | 我喜歡吃炒過貓的嫩芽。 | 詞典 | 9
\ No newline at end of file