	---
	license: mit
	language: en
	widget:
	- text: >-
	"I neutral with that I am the life of the party and strongly agree with that I sympathize with others' feelings and agree with that I get chores done right away and agree with that I have frequent mood swings and agree with that I have a vivid imagination and strongly agree with that I do not talk a lot and strongly disagree with that I am not interested in other people's problems and strongly disagree with that I often forget to put things back in their proper place and disagree with that I am relaxed most of the time and disagree with that I am not interested in abstract ideas and agree with that I talk to a lot of different people at parties and strongly disagree with that I feel others' emotions and agree with that I like order and strongly agree with that I get upset easily and neutral with that I have difficulty understanding abstract ideas and strongly disagree with that I keep in the background and strongly disagree with that I am not really interested in others and strongly disagree with that I make a mess of things and disagree with that I seldom feel blue and disagree with that I am not have a good imagination"
	- text: >-
	"I strongly disagree with that I am the life of the party and strongly disagree with that I sympathize with others' feelings and strongly disagree with that I get chores done right away and agree with that I have frequent mood swings and strongly disagree with that I have a vivid imagination and agree with that I do not talk a lot and disagree with that I am not interested in other people's problems and disagree with that I often forget to put things back in their proper place and agree with that I am relaxed most of the time and strongly agree with that I am not interested in abstract ideas and strongly agree with that I talk to a lot of different people at parties and disagree with that I feel others' emotions and strongly agree with that I like order and agree with that I get upset easily and neutral with that I have difficulty understanding abstract ideas and strongly agree with that I keep in the background and strongly agree with that I am not really interested in others and strongly agree with that I make a mess of things and strongly agree with that I seldom feel blue and disagree with that I am not have a good imagination"
	- text: >-
	"I disagree with that I am the life of the party and strongly agree with that I sympathize with others' feelings and agree with that I get chores done right away and strongly agree with that I have frequent mood swings and agree with that I have a vivid imagination and strongly agree with that I do not talk a lot and strongly disagree with that I am not interested in other people's problems and strongly disagree with that I often forget to put things back in their proper place and strongly agree with that I am relaxed most of the time and agree with that I am not interested in abstract ideas and agree with that I talk to a lot of different people at parties and agree with that I feel others' emotions and agree with that I like order and strongly disagree with that I get upset easily and strongly disagree with that I have difficulty understanding abstract ideas and strongly disagree with that I keep in the background and agree with that I am not really interested in others and strongly disagree with that I make a mess of things and strongly disagree with that I seldom feel blue and strongly disagree with that I am not have a good imagination"
	library_name: transformers
	pipeline_tag: text-classification
	tags:
	- Personality
	- Personality Classification
	- The Mini Big Five Assessment
	- Personality Type (OCEAN)
	- Agreeableness
	- Conscientiousness
	- Extraversion
	- Neuroticism
	- Openness
	---
	<!-- # Model Card for Model ID

	<!-- Provide a quick summary of what the model is/does.).

	## Model Details
	I fine-tuned this model depending on the dataset (raw data) from Kaggle, which contains 1,015,342 questionnaire answers collected online by Open Psychometrics.
	[Dataset_Link](https://www.kaggle.com/datasets/tunguz/big-five-personality-test)
	Firstly: I preprocess the mentioned data on colab to prepare it for my goal (Know the personality type based on the text as input;
	each personality type as a percent as output) and to make it easier for the users,
	I relied on the Mini Big Five personality assessment (20 questions) only to make the assessment easier and more convenient.
	At the beginning, I selected only the 20 questions from the 50, then calculated the type of personality using rule-based (if-then) statement as a new column.
	Then I concatenated between the 20 questions to one new column "text" and the other column is "label" ~ "type", I trained the base model "microsoft/MiniLM-L12-H384-uncased"
	I fine-tuned my model using colab (free GPU) with the below hyperparameters:

	Fine-tuning hyper-parameters
	learning_rate = 5e-6
	batch_size = 64
	max_seq_length = 275
	num_train_epochs = 3

	-->

	## Model Description

	<!-- Provide a longer summary of what this model is. -->
* Data drives many of the most significant developments in the modern corporate sector, and it is becoming increasingly important to everyday individual activities.
* To predict the Big Five personality traits, I applied transfer learning to [microsoft/MiniLM-L12-H384-uncased](https://huggingface.co/microsoft/MiniLM-L12-H384-uncased).
* Fine-tuning on a carefully selected personality dataset taught the model the relationship between the input text and the personality traits.
* With transfer learning, the fine-tuned [microsoft/MiniLM-L12-H384-uncased](https://huggingface.co/microsoft/MiniLM-L12-H384-uncased) model reaches an accuracy above 97%, compared with 65% for my previous model, which was based on roberta-large.
* The fine-tuned model estimates an individual's Big Five personality traits from their input text with high accuracy.
* This experiment demonstrates that the Big Five personality traits can be predicted from text, and it shows the strength of transfer learning in machine learning.

	- Developed by: [Nasser Elsaman]
	- Model type: [Text/ Personality Classification Model]
	- Language (NLP): [English]
	- License: [MIT]
	- Finetuned from model: [microsoft/MiniLM-L12-H384-uncased](https://huggingface.co/microsoft/MiniLM-L12-H384-uncased)
- Goal of this fine-tuned model: This model is for educational and research purposes only; the author is not responsible for any other use.

	# Uses:-
	## Direct Use:
* Individuals can use the model directly to gain insight into their own personality traits: they enter text and receive predictions for the Big Five personality traits. This model is for educational and research purposes only; the author is not responsible for any other use.

	## Downstream Use:
	* This model is designed for later usage or fine-tuning for certain needs. It was created as a stand-alone personality prediction finetuned model.

	## Out-of-Scope Use:
	* Use this model with caution when making significant choices about people in fields like employment, education, or law.

	## Biases, Risks, and Limitations:
* The personality prediction model, like any machine learning model, has limitations and potential biases that should be considered.

	## Generalisations:
* The model predicts personality traits using patterns learned from a specific dataset. Its results may not generalise to people from diverse ethnic or cultural backgrounds who are underrepresented in the training data.

	## Ethical considerations:
	* Personality prediction models should be utilised responsibly, with the awareness that personality features do not define a person's value or talents. It is critical to avoid forming unjust judgements or discriminating against someone based on their expected personality characteristics.

	## Privacy concerns:
	* The model is based on user-provided input text, which may include sensitive or confidential information. Users should be cautious while giving personal information and maintain the security of their data.

	## Recommendations:
	* To reduce the dangers and limits associated with personality prediction models, the following guidelines are proposed:

	## Awareness and Education:
	* Users should understand the model's limits and potential biases. Increase awareness that personality traits are multifaceted and cannot be fully represented by a single model or text analysis.

	## Avoid Stereotypes and Discrimination:
* Users should be cautious when making judgements or drawing conclusions based primarily on predicted personality traits. Personality predictions should not be used to discriminate against people or to reinforce stereotypes. This model is for educational and research purposes only.

	## Contextual Interpretation:
	* Place the model's predictions in context and evaluate extra information about the individual beyond the input text.

	## Data Privacy and Security:
	* Ensure that user data is processed securely and in accordance with privacy legislation. Users should be cautious while giving personal information.

	## Promote Ethical Use:
	* Encourage the proper use of personality prediction models while discouraging abuse or harmful uses.

	* It is crucial to highlight that the preceding recommendations are generic principles; additional context-specific recommendations should be made depending on the individual use case and the ethical issues.

	## How to Get Started with the Model:

	Use the code below to get started with the model.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "Nasserelsaman/microsoft-finetuned-personality"


def personality_detection(text, threshold=0.05, endpoint=1.0):
    # Replace the placeholder with your own Hugging Face access token.
    token = "Write_Your_HUG_Access_token_Id_Here"
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, token=token)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, token=token)

    inputs = tokenizer(text, truncation=True, padding=True, return_tensors="pt")

    # Run a single forward pass; gradients are not needed for inference.
    with torch.no_grad():
        logits = model(**inputs).logits

    # One raw score per trait, as a NumPy array.
    predictions = logits.squeeze().numpy()

    # Clamp the scores: values below the threshold become 0.05 (shown as 5%),
    # values above the endpoint become 1.0 (shown as 100%).
    predictions[predictions < threshold] = threshold
    predictions[predictions > endpoint] = endpoint

    label_names = ['Agreeableness', 'Conscientiousness', 'Extraversion', 'Neuroticism', 'Openness']
    result = {label_names[i]: f"{predictions[i]*100:.0f}%" for i in range(len(label_names))}

    return result
```

	<!-- ## Training Details

	[Explained above]

	### Training Data

	<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

	<!-- [Explained above]


	#### Preprocessing Steps [optional]

	[Explained above]


	#### Training Hyperparameters

	- Training regime: [Explained above] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->

	<!-- #### Speeds, Sizes, Times [optional]

	<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->

	<!-- [Explained above] -->

	## Results

The fine-tuned model reaches an accuracy above 97%; you can [test it yourself in my Streamlit app](https://personality-assessment.streamlit.app/).

	The personality_detection function returns a dictionary containing the predicted personality traits based on the given input text.
	The dictionary contains the following personality traits with their corresponding predicted values:

1) Agreeableness: A value between 5% and 100% that represents the predicted agreeableness trait.
2) Conscientiousness: A value between 5% and 100% that represents the predicted conscientiousness trait.
3) Extraversion: A value between 5% and 100% that represents the predicted extraversion trait.
4) Neuroticism: A value between 5% and 100% that represents the predicted neuroticism trait.
5) Openness: A value between 5% and 100% that represents the predicted openness trait.
### Please note that I set the minimum value to 5% rather than 0%, because 0% has no meaning, whereas 5% means that the user scores low on this trait. The maximum value of 100% means that this trait is the dominant one.
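The 5% floor and 100% ceiling described above amount to clamping each score to the range [0.05, 1.0] before formatting it as a percentage. The following is a minimal sketch of that step in isolation; the helper name `to_percentages` and the sample scores are made up for illustration and do not come from the model.

```python
import numpy as np


def to_percentages(scores, low=0.05, high=1.0):
    """Clamp each score to [low, high] and format it as a whole-number percentage."""
    clipped = np.clip(np.asarray(scores, dtype=float), low, high)
    return [f"{value * 100:.0f}%" for value in clipped]


# Hypothetical raw scores, one per trait (not real model output).
raw_scores = [-0.30, 0.02, 0.06, 1.40, 0.57]
print(to_percentages(raw_scores))
```

Scores below 0.05 are reported as 5%, scores above 1.0 as 100%, and everything in between is rounded to the nearest percent.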

	## Example:-
text_input = """Strongly Agree with that I am the life of the party.
Disagree with that I sympathize with others’ feelings.
Strongly Agree with that I get chores done right away.
Disagree with that I have frequent mood swings.
Disagree with that I have a vivid imagination.
Neutral with that I don’t talk a lot.
Strongly Disagree with that I am not interested in other people’s problems.
Neutral with that I often forget to put things back in their proper place.
Strongly Agree with that I am relaxed most of the time.
Neutral with that I am not interested in abstract ideas.
Strongly Agree with that I talk to a lot of different people at parties.
Agree with that I feel others’ emotions.
Disagree with that I like order.
Strongly Agree with that I get upset easily.
Neutral with that I have difficulty understanding abstract ideas.
Strongly Disagree with that I keep in the background.
Agree with that I am not really interested in others.
Strongly Disagree with that I make a mess of things.
Strongly Agree with that I seldom feel blue.
Strongly Disagree with that I do not have a good imagination."""

	personality_prediction = personality_detection(text_input)
	print(personality_prediction)

	## Output:-
{
"Agreeableness": "5%",
"Conscientiousness": "5%",
"Extraversion": "6%",
"Neuroticism": "100%",
"Openness": "5%"
}
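The example input follows a regular pattern: one sentence per questionnaire item, of the form `<response> with that <statement>.` A small helper can assemble that text from structured answers. This is only a sketch; `build_input_text` is a hypothetical name, and the newline separator is an assumption taken from the layout of the example rather than a documented requirement of the model.

```python
def build_input_text(answers):
    """Join (response, statement) pairs into the sentence pattern used in the example.

    `answers` is a list of tuples such as ("Strongly Agree", "I am the life of the party").
    """
    sentences = [f"{response} with that {statement}." for response, statement in answers]
    return "\n".join(sentences)


# First few items only, for brevity; a full input has all 20 statements.
answers = [
    ("Strongly Agree", "I am the life of the party"),
    ("Strongly Agree", "I get chores done right away"),
    ("Disagree", "I have frequent mood swings"),
]
print(build_input_text(answers))
```

The resulting string can then be passed to `personality_detection` in place of a hand-written `text_input`.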

### * [My streamlit app](https://personality-assessment.streamlit.app/) also shows the result as a spider graph.

	## Epochs:
* Training ran for only 3 epochs, and accuracy was already high by the second epoch, with no sign of overfitting:

| Epoch | Training Loss | Validation Loss | Accuracy |
| ------------- |:-------------:|:-------------:|:-------------:|
| 1 | 0.626600 | 0.188280 | 0.945493 |
| 2 | 0.166500 | 0.095803 | 0.970488 |
| 3 | 0.104300 | 0.074864 | 0.976524 |

** Please note the following points, which explain my results and why the model is not overfitting:
1) With a large training dataset of 360,855 samples, good performance can be achieved in a few epochs, because the model has enough data to learn from.
2) Starting from a pretrained language model such as [microsoft/MiniLM-L12-H384-uncased](https://huggingface.co/microsoft/MiniLM-L12-H384-uncased) gives a stronger initialization and therefore faster convergence than random initialization. The model has 12 layers, a hidden size of 384, 12 attention heads, and 33M parameters, and it is 2.7x faster than BERT-Base. For more details on this point, see the paper "[MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers](https://arxiv.org/abs/2002.10957)".
3) The training and validation losses are both low after only 2-3 epochs, and the validation loss stays below the training loss in every epoch, which indicates that the model fits the data without overfitting.
4) Accuracy rises and loss falls steadily from epoch to epoch. This smooth, continuous learning shows none of the spikes that may indicate overfitting.
5) The high accuracy of 97.65% after only three epochs is understandable given the large sample size, the pretrained weights, and the smooth learning curves.

	## Evaluation Metrics:
	* The evaluation metrics used are: Accuracy, Recall and F1-score.
	1) Accuracy:- Training data (0.976524) - Test data (0.9765239651189411)
	2) Recall:- {'recall': 0.9765239651189411}
	3) F1-score:- {'f1': 0.9765239651189411}
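Accuracy, recall and F1 above agree to every digit. One possible explanation, which is an assumption about how the metrics were computed and is not stated in this card, is micro-averaging: when every sample has exactly one true label and one predicted label, micro-averaged precision, recall and F1 all equal accuracy. The sketch below demonstrates this on made-up labels using plain Python.

```python
def micro_metrics(y_true, y_pred):
    """Return accuracy plus micro-averaged recall and F1 for single-label predictions."""
    labels = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    # Pool true positives, false positives and false negatives over all classes.
    for label in labels:
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1
            elif pred == label:
                fp += 1
            elif truth == label:
                fn += 1
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return {"accuracy": accuracy, "recall": recall, "f1": f1}


# Hypothetical labels: 6 of the 8 predictions are correct.
y_true = ["E", "A", "C", "N", "O", "E", "A", "C"]
y_pred = ["E", "A", "C", "N", "E", "E", "C", "C"]
print(micro_metrics(y_true, y_pred))
```

Each wrong prediction counts once as a false positive (for the predicted class) and once as a false negative (for the true class), so the pooled precision and recall are both the fraction of correct predictions. Macro-averaged or per-class scores would generally differ from accuracy and from each other.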

	## Model Summary:
	1) Model_Name: [microsoft/MiniLM-L12-H384-uncased](https://huggingface.co/microsoft/MiniLM-L12-H384-uncased)
2) Dataset Size: 360,855 rows, after data cleaning and class balancing of the [Kaggle personality dataset](https://www.kaggle.com/datasets/tunguz/big-five-personality-test)
	3) Num_Of_Epochs: 3
	4) Tokenizer Max_length: 275
	5) Batch size: 64
	6) Learning Rate: 5e-6
	7) Software: Google Colab with free GPU
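The settings above can be collected in one place, which also makes it easy to estimate the length of a training run. This is a sketch only: `FineTuneConfig` and `optimizer_steps` are hypothetical names. The step count also assumes that all 360,855 rows are used for training, with no gradient accumulation and a single device, none of which this card states.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters listed in the model summary."""
    base_model: str = "microsoft/MiniLM-L12-H384-uncased"
    learning_rate: float = 5e-6
    batch_size: int = 64
    max_seq_length: int = 275
    num_train_epochs: int = 3


def optimizer_steps(num_rows, config):
    """Return (steps per epoch, total steps), counting a final partial batch as one step."""
    steps_per_epoch = math.ceil(num_rows / config.batch_size)
    return steps_per_epoch, steps_per_epoch * config.num_train_epochs


config = FineTuneConfig()
# Assumption: every row of the cleaned dataset is a training sample.
print(optimizer_steps(360_855, config))  # (5639, 16917)
```

If part of the data was held out for validation or testing, the real number of steps would be proportionally smaller.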

#### * The final fine-tuned model is [Nasserelsaman/microsoft-finetuned-personality](https://huggingface.co/Nasserelsaman/microsoft-finetuned-personality). Please note that this model is for educational and research purposes only; the author is not responsible for any other use.

	#### * Model streamlit app: [Personality_Assessment](https://personality-assessment.streamlit.app/)

	#### * This project is based on [The Mini IPIP personality measure](https://www.researchgate.net/publication/7014171_The_Mini-IPIP_Scales_Tiny-yet-Effective_Measures_of_the_Big_Five_Factors_of_Personality)
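For reference, the published Mini-IPIP scale scores its 20 items on a 1 to 5 scale, reverse-scores items 6 to 10 and 15 to 20, and averages four items per factor. The sketch below applies that standard keying to a list of responses given in the questionnaire order used in this card. It illustrates the published scale only and is not necessarily how the training labels for this model were derived. `score_mini_ipip` is a hypothetical helper, and the scale's fifth factor (Intellect/Imagination) is labelled Openness here to match this card.

```python
LIKERT = {
    "strongly disagree": 1,
    "disagree": 2,
    "neutral": 3,
    "agree": 4,
    "strongly agree": 5,
}

# Items cycle through the five factors in this order (item 1 -> Extraversion, ...).
TRAITS = ["Extraversion", "Agreeableness", "Conscientiousness", "Neuroticism", "Openness"]

# Negatively worded items are reverse-scored (6 - value).
REVERSE_KEYED = {6, 7, 8, 9, 10, 15, 16, 17, 18, 19, 20}


def score_mini_ipip(responses):
    """Return the mean score (1.0 to 5.0) per trait for 20 ordered Likert responses."""
    if len(responses) != 20:
        raise ValueError("expected exactly 20 responses")
    totals = {trait: 0 for trait in TRAITS}
    for item_number, response in enumerate(responses, start=1):
        value = LIKERT[response.lower()]
        if item_number in REVERSE_KEYED:
            value = 6 - value
        totals[TRAITS[(item_number - 1) % 5]] += value
    # Each trait is measured by four items.
    return {trait: total / 4 for trait, total in totals.items()}


print(score_mini_ipip(["Neutral"] * 20))
```

Answering "Neutral" to every item yields 3.0 on all five traits, the midpoint of the scale.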

	## Model Card Authors:
	[Prepared by:- Nasser Elsaman](https://elsamaninfo.wordpress.com)