---
language: en
thumbnail: https://bagdeabhishek.github.io/twitterAnalysis_files/networkfin.jpg
tags:
- India
- politics
- tweets
- BJP
- Congress
- AAP
- pytorch
- gpt2
- lm-head
- text-generation
license: apache-2.0
datasets:
- IndianPolitics
---
# Indian Political Tweets LM

## Model description
Note: This model is based on GPT-2. If you want a bigger model based on GPT2-medium and fine-tuned on the same data, please take a look at the [IndianPoliticalTweetsLMMedium](https://huggingface.co/bagdaebhishek/IndianPoliticalTweetsLMMedium) model.

This is a GPT-2 language model with an LM head, fine-tuned on tweets crawled from handles that belong predominantly to Indian politics. For more information about the crawled data, see this [blog](https://bagdeabhishek.github.io/twitterAnalysis) post.
## Intended uses & limitations
This fine-tuned model can be used to generate tweets related to Indian politics.
#### How to use
```python
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("bagdaebhishek/IndianPoliticalTweetsLM")
model = AutoModelForCausalLM.from_pretrained("bagdaebhishek/IndianPoliticalTweetsLM")

# Build a text-generation pipeline and generate a continuation of the prompt
text_generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
init_sentence = "India will always be"
print(text_generator(init_sentence))
```
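Generation can be steered with the standard sampling arguments that the pipeline forwards to `generate`; the values below are illustrative, not settings used for this model:

```python
# Sample several candidate tweets; all parameter values here are illustrative
outputs = text_generator(
    init_sentence,
    max_length=60,           # cap on total sequence length, in tokens
    do_sample=True,          # sample instead of greedy decoding
    top_k=50,                # consider only the 50 most likely next tokens
    top_p=0.95,              # nucleus (top-p) sampling
    num_return_sequences=3,  # return three independent samples
)
for out in outputs:
    print(out["generated_text"])
```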
#### Limitations and bias
1. The tweets used to train the model were not manually labelled, so the generated text may not always be in English. I cleaned the data to remove non-English tweets, but the model may still generate "Hinglish" text, so no assumptions should be made about the language of the generated text.
2. I took care to remove tweets from Twitter handles that are not very influential, but since the data is not curated by hand there may be artefacts like "-sent via NamoApp" etc.
3. Like any language model trained on real-world data, this model also exhibits some biases which, unfortunately, are a part of the political discourse on Twitter. Please keep this in mind while using the output of this model.
## Training data
I used the pre-trained gpt2 model from the Hugging Face Transformers repository and fine-tuned it on a custom dataset crawled from Twitter. The method used to identify the political handles is described in detail in a [blog](https://bagdeabhishek.github.io/twitterAnalysis) post. I used tweets from both the pro-BJP and anti-BJP clusters mentioned in the blog.
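The exact training script is not part of this card; a minimal sketch of causal-LM fine-tuning with the Transformers `Trainer`, assuming a plain-text file with one tweet per line (the file name and hyperparameters are placeholders), could look like this:

```python
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    TextDataset,
    Trainer,
    TrainingArguments,
)

# Illustrative sketch only: file name and hyperparameters are assumptions
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

train_dataset = TextDataset(
    tokenizer=tokenizer,
    file_path="english_political_tweets.txt",  # hypothetical one-tweet-per-line file
    block_size=128,
)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt2-indian-political-tweets", num_train_epochs=1),
    data_collator=collator,
    train_dataset=train_dataset,
)
trainer.train()
```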
## Training procedure
For pre-processing, I removed tweets from handles that are not very influential in their cluster. I did this by computing eigenvector centrality on the Twitter graph and pruning handles whose centrality fell below a certain threshold. This threshold was set manually after experimenting with different values.
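A minimal sketch of this kind of pruning with networkx, assuming an undirected interaction graph (the edge list, graph construction, and threshold below are illustrative placeholders, not the values used for the actual dataset):

```python
import networkx as nx

# Illustrative only: edges and threshold are placeholders
edges = [
    ("@user_a", "@handle_x"),  # e.g. user_a retweeted or mentioned handle_x
    ("@user_b", "@handle_x"),
    ("@user_a", "@handle_y"),
]
G = nx.Graph(edges)

# Score each handle by eigenvector centrality and keep only influential ones
centrality = nx.eigenvector_centrality(G, max_iter=1000)
THRESHOLD = 0.1  # set manually after experimentation; placeholder value
influential = {h for h, c in centrality.items() if c >= THRESHOLD}
print(influential)
```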
I then separated the tweets from these handles by language and trained the LM on the English tweets from both clusters.
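One way to do this language split, sketched here with the langdetect package (an assumption; the card does not specify the tooling actually used):

```python
from langdetect import detect, LangDetectException

def is_english(tweet: str) -> bool:
    """Best-effort check that a tweet is English according to langdetect."""
    try:
        return detect(tweet) == "en"
    except LangDetectException:  # raised for empty or undecidable text
        return False

# Toy example: keep only the tweets detected as English
tweets = ["India will always be a great country", "यह एक हिंदी ट्वीट है"]
english_tweets = [t for t in tweets if is_english(t)]
print(english_tweets)
```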
### Hardware
1. GPU: GTX 1080Ti
2. CPU: Ryzen 3900x
3. RAM: 32GB
This model took roughly 36 hours to fine-tune. | |