nehulagrawal committed on
Commit
d336dbd
1 Parent(s): 0e1c4f1

Upload 4 files

Files changed (4)
  1. json_helper.py +68 -0
  2. main.py +12 -0
  3. readme.md +129 -0
  4. requirement.txt +3 -0
json_helper.py ADDED
@@ -0,0 +1,68 @@
+ from langchain_community.llms import Ollama
+
+ json_content = """{
+     "name": "",
+     "email": "",
+     "phone_1": "",
+     "phone_2": "",
+     "address": "",
+     "city": "",
+     "linkedin": "",
+     "professional_experience_in_years": "",
+     "highest_education": "",
+     "is_fresher": "yes/no",
+     "is_student": "yes/no",
+     "skills": ["", ""],
+     "applied_for_profile": "",
+     "education": [
+         {
+             "institute_name": "",
+             "year_of_passing": "",
+             "score": ""
+         },
+         {
+             "institute_name": "",
+             "year_of_passing": "",
+             "score": ""
+         }
+     ],
+     "professional_experience": [
+         {
+             "organisation_name": "",
+             "duration": "",
+             "profile": ""
+         },
+         {
+             "organisation_name": "",
+             "duration": "",
+             "profile": ""
+         }
+     ]
+ }"""
+ # json_content is interpolated into the prompt as a plain value, so it uses single braces.
+ # InputData groups the prompt builder and the model factory; both are called on the class.
+ class InputData:
+     @staticmethod
+     def input_data(text):
+         prompt = f"""Extract relevant information from the following resume text and fill the provided JSON template. Ensure all keys in the template are present in the output, even if the value is empty or unknown. If a specific piece of information is not found in the text, use 'Not provided' as the value.
+
+         Resume text:
+         {text}
+
+         JSON template:
+         {json_content}
+
+         Instructions:
+         1. Carefully analyze the resume text.
+         2. Extract relevant information for each field in the JSON template.
+         3. If a piece of information is not explicitly stated, make a reasonable inference based on the context.
+         4. Ensure all keys from the template are present in the output JSON.
+         5. Format the output as a valid JSON string.
+
+         Output the filled JSON template only, without any additional text or explanations."""
+
+         return prompt
+
+     @staticmethod
+     def llm():
+         return Ollama(model="llama3")
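The prompt asks the model to return only the filled JSON template, but nothing in this file checks that it did. A minimal sketch of a post-processing step, assuming the reply may wrap the JSON in extra text or a code fence: `parse_reply` and `EXPECTED_KEYS` are hypothetical names that are not part of this commit, and the key names follow the field list in the README.

```python
import json

# Top-level keys, as listed under "Supported Output Fields" in the README
# (assumption: adjust if the template in json_helper.py spells them differently).
EXPECTED_KEYS = (
    "name", "email", "phone_1", "phone_2", "address", "city",
    "highest_education", "is_fresher", "is_student",
    "professional_experience_in_years", "skills", "linkedin",
    "applied_for_profile", "education", "professional_experience",
)


def parse_reply(reply: str) -> dict:
    """Extract the outermost JSON object from a model reply and fill missing keys."""
    start = reply.find("{")
    end = reply.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model reply")
    # json.loads raises json.JSONDecodeError (a ValueError) on malformed JSON.
    data = json.loads(reply[start:end + 1])
    for key in EXPECTED_KEYS:
        # Same placeholder the prompt asks the model to use.
        data.setdefault(key, "Not provided")
    return data
```

Calling `parse_reply(llm.invoke(...))` would then give a dictionary with every expected key, or raise an error that the caller can log or retry on.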
main.py ADDED
@@ -0,0 +1,12 @@
+ from pdfminer.high_level import extract_text
+ from json_helper import InputData  # not aliased to `input`, which would shadow the builtin
+
+ def extract_text_from_pdf(pdf_path):
+     return extract_text(pdf_path)
+
+ text = extract_text_from_pdf(r"/home/ml2/Desktop/resume/Resume1.pdf")
+
+ llm = InputData.llm()
+
+ data = llm.invoke(InputData.input_data(text))
+ print(data)
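The script above reads one hard-coded PDF path. A minimal sketch of taking the path from the command line instead, using only the standard library: `parse_args` is a hypothetical helper, not part of this commit.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse the resume path from the command line (argv=None reads sys.argv)."""
    parser = argparse.ArgumentParser(
        description="Extract structured fields from a resume PDF."
    )
    parser.add_argument("pdf_path", type=Path, help="path to the resume PDF")
    return parser.parse_args(argv)
```

With this in place, `main.py` could call `extract_text_from_pdf(str(parse_args().pdf_path))` and be run as `python main.py Resume1.pdf`.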
readme.md ADDED
@@ -0,0 +1,129 @@
+ ---
+ tags:
+ - resume
+ - extractor
+ - resume extractor
+ - extract
+ - pdf
+ - cv parser
+ - pdf extraction
+ - document analysis
+ - unstructured document
+ - DataProcessing
+ - TextToJSON
+ - resume parser
+ - resume information extractor
+ - resume data extraction
+ ---
+ # PDF Resume Information Extractor
+
+ ## Description
+
+ This Python script extracts information from PDF resumes and converts it into structured JSON using a language model served through Ollama (`llama3` by default). It automates resume parsing so that HR departments, recruiters, and other organizations can process large volumes of applications more efficiently.
+
+ The Resume Information Extractor combines PDF text extraction with a prompted language model to turn unstructured resume text into a structured record. Because the model reads the whole resume in context, it can also fill in fields that are implied rather than stated outright.
+
+ The tool is meant to handle a variety of resume formats and styles. A predefined JSON template keeps the output structure consistent, which makes it easier to integrate with other systems or databases. Both the template and the underlying model can be changed by editing `json_helper.py`.
+
+ We invite you to explore the tool and its data extraction capabilities. If you would like to use it or collaborate, please reach out to us or contribute to the project on GitHub. Your feedback drives our continued improvement of data extraction and document analysis.
+
+ - **Developed by:** FODUU AI
+ - **Model type:** Extraction
+ - **Task:** Resume Parsing and Information Extraction
+
+ ### Supported Output Fields
+
+ The exact fields depend on the JSON template, which includes the following:
+ ['name', 'email', 'phone_1', 'phone_2', 'address', 'city', 'highest_education', 'is_fresher', 'is_student', 'professional_experience_in_years', 'skills', 'linkedin', 'applied_for_profile', 'education', 'professional_experience']
+
+
+ ## Uses
+
+ ### Direct Use
+
+ The Resume Information Extractor can be used directly to parse resume PDFs and extract structured information. It is particularly useful for HR departments, recruitment agencies, or any organization that deals with large volumes of resumes.
+
+ ### Downstream Use
+
+ The extracted information can be used for various downstream tasks such as candidate matching, resume scoring, or populating applicant tracking systems.
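As one illustration of candidate matching, the `skills` field of the extracted record can be compared against a list of required skills. This is only a sketch under stated assumptions: `skill_match_score` is a hypothetical helper, not part of this repository, and it assumes the model's output has already been parsed into a Python dictionary.

```python
def skill_match_score(extracted: dict, required_skills: list) -> float:
    """Return the fraction of required skills present in the extracted 'skills' list."""
    skills = extracted.get("skills")
    # The model may return "Not provided" (a string) instead of a list.
    if not isinstance(skills, list):
        skills = []
    have = {s.strip().lower() for s in skills if isinstance(s, str)}
    want = {s.strip().lower() for s in required_skills}
    if not want:
        return 0.0
    return len(have & want) / len(want)
```

Exact, case-insensitive string matching is deliberately simple. It will miss synonyms such as "JS" and "JavaScript", so a real matcher would need some normalization on top.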
+
+ ### Out-of-Scope Use
+
+ The tool is not designed for tasks unrelated to resume parsing or for processing documents that are not resumes.
+
+ ## Risks and Limitations
+
+ The Resume Information Extractor has some limitations and risks, including:
+
+ - Performance may vary based on the format and structure of the input resume.
+ - The quality of extraction depends on the capabilities of the underlying model served through Ollama.
+ - It may struggle with highly unconventional resume formats or non-English resumes.
+ - The tool does not verify the accuracy of the information in the resume.
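Because the prompt allows the model to infer values, one cheap safeguard is to check that contact fields in the output actually appear in the resume text. A minimal sketch, assuming the output has been parsed into a dictionary: `unsupported_fields` and its default field list are hypothetical and not part of this repository.

```python
def unsupported_fields(extracted: dict, resume_text: str,
                       fields=("name", "email", "phone_1", "phone_2", "linkedin")) -> list:
    """List fields whose extracted value does not occur verbatim in the resume text."""
    # Lowercase and collapse whitespace so line breaks in the PDF text do not matter.
    haystack = " ".join(resume_text.lower().split())
    flagged = []
    for field in fields:
        value = extracted.get(field)
        if not isinstance(value, str):
            continue
        needle = " ".join(value.lower().split())
        if not needle or needle == "not provided":
            continue
        if needle not in haystack:
            flagged.append(field)
    return flagged
```

A flagged field is not necessarily wrong, since the model may have reformatted a phone number, but it is a good candidate for manual review.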
+
+ ### Recommendations
+
+ Users should be aware of the tool's limitations and potential risks. For critical applications, manually verify the extracted information. Further testing and validation are advised for specific use cases to evaluate its performance accurately.
+
+ ## How to Get Started with the Model
+
+ To begin using the Resume Information Extractor, follow these steps:
+
+ 1. Install the required packages:
+    ```bash
+    pip install langchain-community pdfminer.six ollama
+    ```
+
+ 2. Ensure you have Ollama set up and running with the "llama3" model.
+
+ 3. Use the tool in your Python script:
+
+    ```python
+    from pdfminer.high_level import extract_text
+    from json_helper import InputData
+
+    def extract_text_from_pdf(pdf_path):
+        return extract_text(pdf_path)
+
+    text = extract_text_from_pdf(r"/home/ml2/Desktop/resume/Resume1.pdf")  # replace with the path to your own PDF
+
+    llm = InputData.llm()
+    data = llm.invoke(InputData.input_data(text))
+
+    print(data)
+    ```
+
+ ## Objective
+
+ The tool sends the resume text to a language model served through Ollama, which fills in the JSON template. The model architecture depends on which Ollama model is selected (`llama3` in `json_helper.py`).
+
+
+ ## Contact
+
+ For inquiries and contributions, please contact us at [email protected].
+
+ ```bibtex
+ @contact{foduu,
+   author = {Foduu},
+   title = {Resume Extractor},
+   year = {2024}
+ }
+
+ ```
+
+ ---
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
requirement.txt ADDED
@@ -0,0 +1,3 @@
+ langchain-community
+ pdfminer.six
+ ollama