honicky committed on
Commit 266a133 · verified · 1 Parent(s): 7c03541

Add model documentation

Files changed (1)
  1. README_model.md +55 -0
README_model.md ADDED
@@ -0,0 +1,55 @@
+ ---
+ language: en
+ tags:
+ - log-analysis
+ - pythia
+ - hdfs
+ license: mit
+ datasets:
+ - honicky/log-analysis-hdfs-preprocessed
+ metrics:
+ - cross-entropy
+ - perplexity
+ base_model: EleutherAI/pythia-70m
+ ---
+
+ # pythia-70m-hdfs-logs
+
+ Fine-tuned Pythia-70m model for HDFS log analysis, specifically for anomaly detection.
+
+ ## Model Description
+
+ This model is fine-tuned from `EleutherAI/pythia-70m` for analyzing HDFS log sequences. It is designed to learn and predict patterns in
+ HDFS log data so that we can detect anomalies using the perplexity of a log sequence. The HDFS dataset is handy because it has labels,
+ so we can use it to validate that the model can detect anomalies.
+
+ We will use this model to understand the ability of a small model to detect anomalies in a specific dataset. We will study model scale
+ and experiment with tokenization, initialization, dataset size, etc. to find a configuration that is minimal in size and fast, but can
+ still detect anomalies effectively. We will then attempt to build a model that is more robust to different log formats.
+
+ - Hugging Face model: [honicky/pythia-14m-hdfs-logs](https://huggingface.co/honicky/pythia-14m-hdfs-logs)
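
The description above says anomalies are detected from the perplexity of a log sequence. A minimal sketch of that scoring step follows. It assumes the per-token natural-log probabilities have already been obtained from the model, and the names `perplexity`, `is_anomalous` and the threshold value are hypothetical, not taken from the repository.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood), using natural logs."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


def is_anomalous(token_logprobs: Sequence[float], threshold: float) -> bool:
    """Flag a sequence whose perplexity exceeds a (hypothetical) threshold."""
    return perplexity(token_logprobs) > threshold


# Toy inputs: in practice these would be the model's log-probabilities
# for each token of one HDFS block's event sequence.
normal_seq = [math.log(0.9)] * 8
odd_seq = [math.log(0.9)] * 6 + [math.log(0.01)] * 2

print(is_anomalous(normal_seq, threshold=2.0))  # False
print(is_anomalous(odd_seq, threshold=2.0))     # True
```

In practice the threshold would presumably be tuned on the labeled HDFS data, for example by picking the value that gives the best trade-off between precision and recall.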
+
+ ## Training Details
+ - Base model: EleutherAI/pythia-70m
+ - Dataset: HDFS_v1 (https://zenodo.org/records/8196385/files/HDFS_v1.zip?download=1), plus preprocessed data at `honicky/log-analysis-hdfs-preprocessed`
+ - Batch size: 16
+ - Max sequence length: 405 tokens
+ - Learning rate: 0.0001
+ - Training steps: 2000
+ - Weights and Biases run: https://wandb.ai/honicky/log-analysis-pythia/runs/jomdv9lz
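
To put the hyperparameters above in perspective, the sketch below collects them in a small config object and computes an upper bound on the number of tokens seen during training. It assumes that 16 is the effective batch size (no gradient accumulation, single device) and that every sequence fills the 405-token limit, so the real figure is likely lower. `TrainingConfig` is a hypothetical name, not the training script's own class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    # Values copied from the Training Details list.
    batch_size: int = 16
    max_seq_len: int = 405
    learning_rate: float = 1e-4
    train_steps: int = 2000

    def max_tokens_per_step(self) -> int:
        """Upper bound: every sequence in the batch is at full length."""
        return self.batch_size * self.max_seq_len

    def max_tokens_seen(self) -> int:
        """Upper bound on tokens processed over the whole run."""
        return self.max_tokens_per_step() * self.train_steps


config = TrainingConfig()
print(config.max_tokens_per_step())  # 6480
print(config.max_tokens_seen())      # 12960000
```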
+
+
+ ## Special Tokens
+ - Added `<|sep|>` token for event ID separation
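
As an illustration of how a separator token like this might be used, the sketch below joins a block's event IDs into a single string and splits it back. The README does not document the exact sequence layout, so the event ID spelling (`E5`, `E22`, ...) and the absence of whitespace around the separator are assumptions, and both helper functions are hypothetical.

```python
from typing import List

SEP_TOKEN = "<|sep|>"  # the special token named in the README


def join_event_ids(event_ids: List[str]) -> str:
    """Join event IDs into one string, separated by the special token."""
    return SEP_TOKEN.join(event_ids)


def split_event_ids(text: str) -> List[str]:
    """Inverse of join_event_ids; an empty string yields an empty list."""
    return text.split(SEP_TOKEN) if text else []


# Hypothetical event sequence for one HDFS block.
block_events = ["E5", "E22", "E5", "E11"]
encoded = join_event_ids(block_events)
print(encoded)  # E5<|sep|>E22<|sep|>E5<|sep|>E11
```

Registering the separator as a single special token presumably keeps the tokenizer from splitting it into several sub-word pieces, which would otherwise use up part of the limited context window.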
+
+ ## Intended Use
+ This model is intended for:
+ - Analyzing HDFS log sequences
+ - Detecting anomalies in log patterns
+ - Understanding system behavior through log analysis
+
+ ## Limitations
+ - The model is trained specifically on HDFS logs and may not generalize to other log formats
+ - Input is limited to a context window of 405 tokens
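
One possible way to work within the 405-token limit is to split a longer token sequence into consecutive windows and score each window separately. This is only a sketch of that idea, not something the README describes. `split_into_windows` is a hypothetical helper, and non-overlapping windows lose any context that crosses a window boundary.

```python
from typing import List, Sequence

MAX_SEQ_LEN = 405  # context limit stated in the README


def split_into_windows(
    token_ids: Sequence[int], window: int = MAX_SEQ_LEN
) -> List[List[int]]:
    """Split token IDs into consecutive, non-overlapping windows."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        list(token_ids[start:start + window])
        for start in range(0, len(token_ids), window)
    ]


# Toy example: 1000 placeholder token IDs.
long_sequence = list(range(1000))
windows = split_into_windows(long_sequence)
print([len(w) for w in windows])  # [405, 405, 190]
```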
+
+