Upload folder using huggingface_hub
- .gradio/cached_examples/23/log.csv +4 -0
- .gradio/certificate.pem +31 -0
- .gradio/flagged/log.csv +6 -0
- .idea/.gitignore +8 -0
- .idea/JStage_RAG.iml +8 -0
- .idea/inspectionProfiles/Project_Default.xml +14 -0
- .idea/inspectionProfiles/profiles_settings.xml +6 -0
- .idea/misc.xml +7 -0
- .idea/modules.xml +8 -0
- .idea/vcs.xml +6 -0
- .idea/workspace.xml +111 -0
- README.md +2 -8
- app.py +128 -0
- github/s2orc/.gitignore +114 -0
- github/s2orc/README.md +152 -0
- github/s2orc/assets/logo.svg +23 -0
- github/s2orc/data/metadata/sample.jsonl +0 -0
- github/s2orc/data/pdf_parses/sample.jsonl +0 -0
- github/s2orc/requirements.txt +2 -0
- github/s2orc/setup.py +15 -0
- requirements.txt +2 -0
- sample.jsonl +0 -0
.gradio/cached_examples/23/log.csv
ADDED
@@ -0,0 +1,4 @@
component 0,timestamp
"[{""role"": ""user"", ""metadata"": null, ""content"": ""\u3053\u3093\u306b\u3061\u306f"", ""options"": null}, {""role"": ""assistant"", ""metadata"": null, ""content"": ""\u3053\u3093\u306b\u3061\u306f... \u308f\u304b\u3063\u305f\uff01\u3053\u3061\u3089\u306f\u3042\u306a\u305f\u306e\u8cea\u554f\u306b\u95a2\u9023\u3059\u308b\u8ad6\u6587\u3067\u3059\uff1a\n- Turkey\u2014Europe's Bridge to the Middle East: None\n- Hypertensive heart disease.: None\n- Revisiting the Dynamics of Forest Area Change: A Panel Data Assessment: None\n- Pompe disease: PAS-positive lymphocyte vacuoles as diagnostic screening test: None\n- Belly Dancers' Reunion: Some people called Elna a hermit, but she did still answer her door. She might only peek through a crack, but a knock could rouse her out of her darkly comfortable basement. Maybe she thought Harry would drop by. Or the Arabic drummer. Whatever it was that kept her answering, one day she heard a substantial commotion at her side door. She put a bookmark in her novel and climbed out of her basement, head down, aware of the lint on the stairs and the black scum on the bannister. Pausing at the top, she saw two faces on the other side of the door's glass pane. Two large women were peering in, looking curiously at Elna in her cable knit sweater and wool skirt spattered with turquoise and pink flecks. Because she was startled to see such girth on her narrow porch, such an abundance of cheeks, breasts, and upper arms, she opened her door wider than usual. \""We've been sent to you,\"" one of the women said through the screen door. \""We need help,\"" the other interrupted. The two were like water trying to rush through a slow drain pipe. \""There's a hundred dollar prize,\"" Elna managed to decipher from the chaos on her porch. \""For what?\"" she asked. \""Somebody at our office is a neighbor of yours. We told him we'd decided to be exotic dancers at the Silver Zephyr's Amateur Night. 
He said you used to belly dance and might have costumes.\""\n\n\u3069\u3046\u601d\u3044\u307e\u3059\u304b\uff1f"", ""options"": null}]",2025-05-08 00:59:28.230599
"[{""role"": ""user"", ""metadata"": null, ""content"": ""LLM\u95a2\u9023\u306e\u8ad6\u6587\u3092\u63a2\u3057\u305f\u3044"", ""options"": null}, {""role"": ""assistant"", ""metadata"": null, ""content"": ""LLM\u95a2\u9023\u306e\u8ad6\u6587\u3092\u63a2\u3057\u305f\u3044... \u308f\u304b\u3063\u305f\uff01\u3053\u3061\u3089\u306f\u3042\u306a\u305f\u306e\u8cea\u554f\u306b\u95a2\u9023\u3059\u308b\u8ad6\u6587\u3067\u3059\uff1a\n- Managing Ethnicity in African Politics: None\n- Lethal drug interaction: Isoniazid and methylxanthines: Nonlethal doses of methylxanthines (caffeine or theophyline) produced dose-dependent lethality in rats pretreated with isoniazid. Isoniazid pretreatment did not alter theophylline concentration in blood or brain, suggesting that the drug interaction was not due to altered distribution or metabolism of theophylline. Death was associated with tonic-clonic seizures and pulmonary congestion. The toxicity of the drug combination was blocked by the anticonvulsants diazepam, barbital, and trimethadione, but not by chlorpromazine, a sedative drug which lacks anticonvulsant activity. Thus, there is a fatal drug interaction between isoniazid and theophylline which may be due to convulsions that trigger a shock lung syndrome.\n- An orbital sequence design for lunar missions combining elliptic orbital segments with constraints: None\n- Electrons for Neutrinos: Lepton Energy Reconstruction in the Resonance Excitation Region: None\n- SYMPOSIUM: ocular injuries.: None\n\n\u3069\u3046\u601d\u3044\u307e\u3059\u304b\uff1f"", ""options"": null}]",2025-05-08 00:59:30.122987
"[{""role"": ""user"", ""metadata"": null, ""content"": ""I want to know which is the SOTA model for MedQA."", ""options"": null}, {""role"": ""assistant"", ""metadata"": null, ""content"": ""I want to know which is the SOTA model for MedQA.... \u308f\u304b\u3063\u305f\uff01\u3053\u3061\u3089\u306f\u3042\u306a\u305f\u306e\u8cea\u554f\u306b\u95a2\u9023\u3059\u308b\u8ad6\u6587\u3067\u3059\uff1a\n- Characterization of sources of 2\u03c0 phase discontinuity in speckle interferograms: One of the most successful phase-unwrapping algorithms uses branch cuts to join discontinuity sources that mark the beginning or the end of a 2\u03c0 phase discontinuity. Here, using phase-stepping speckle interferometry, we verify that these sources coincide with points of very low or zero modulus and that the displacement of sources as a result of speckle decorrelation between measurements of two phase maps leads to closely spaced dipole pairs of sources in the phase-difference map. By measuring the movement of sources at high magnification, we find that the length distribution of correct branch cuts needed to unwrap a phase-difference map is approximately Gaussian. This provides a theoretical justification for unwrapping with the set of branch cuts that minimizes the sum of squares of cut lengths.\n- Highly Enantioselective Conjugate Addition-Cyclization Cascade Reaction of Malonates with o-Hydroxycinnamaldehydes: Asymmetric Synthesis of 4-Substituted Chromanols: The asymmetric organocatalytic conjugate addition\u2013cyclization reaction of malonates with o-hydroxycinnamaldehydes, which affords 4-substituted chroman-2-ols, has been established using a diphenylprolinol trimethylsilyl (TMS) ether as organocatalyst. The desired products were obtained with good to excellent yields and high enantioselectivities (up to >99% ee). 
Synthetically useful chroman derivatives were formed after subsequent reactions.\n- Prospective validation of noninvasive prenatal testing on whole genome level (conference abstract - article in Slovak): None\n- Ilhan New, Soldier for the Modern Nation: Recovering a Protestant Martial Alternative to Korean Hegemonic Masculinity: Twentieth century Korean hegemonic masculinity has validated the right to employ violence for the benefit of the nation unchecked by any higher ethical concerns. This arose in the early twentieth century in reaction to a crisis of Korean masculinity, identified by the first Korean nationalists. The supposed pernicious effects of Confucianism created the crisis by making men effete. This in turn \u201cled\u201d Korea to lose its independence. While scholars have recognized alternative Korean masculinities arising since the 1990s, including Catholic masculinities, they have overlooked a Protestant martial masculinity personified by Ilhan New [1895\u20131971]. New was a much lauded business pioneer, but his military career has not been analyzed in terms of its place in the history of masculinity. Mentored by a leading Protestant nationalist, New personified in mid-century an alternative Protestant martial masculinity, which created soldiers fighting for the nation under the discipline of a conventional military, bound by Protestant norms.\n- Single-fraction versus hypofractionated stereotactic radiosurgery for medium-sized brain metastases of 2.5 to 3 cm: Purpose ::: Given recently suggested utility of hypofractionated stereotactic radiosurgery (SRS) in treating large brain metastases (BMs) > 3 cm, we sought to prospectively control tumor size variable to investigate the efficacy and safety of hypofractionated SRS for medium-sized BMs (2.5 to 3 cm) compared with single-fraction SRS.\n\n\u3069\u3046\u601d\u3044\u307e\u3059\u304b\uff1f"", ""options"": null}]",2025-05-08 00:59:35.776590
.gradio/certificate.pem
ADDED
@@ -0,0 +1,31 @@
-----BEGIN CERTIFICATE-----
MIIFazCCA1OgAwIBAgIRAIIQz7DSQONZRGPgu2OCiwAwDQYJKoZIhvcNAQELBQAw
TzELMAkGA1UEBhMCVVMxKTAnBgNVBAoTIEludGVybmV0IFNlY3VyaXR5IFJlc2Vh
cmNoIEdyb3VwMRUwEwYDVQQDEwxJU1JHIFJvb3QgWDEwHhcNMTUwNjA0MTEwNDM4
WhcNMzUwNjA0MTEwNDM4WjBPMQswCQYDVQQGEwJVUzEpMCcGA1UEChMgSW50ZXJu
ZXQgU2VjdXJpdHkgUmVzZWFyY2ggR3JvdXAxFTATBgNVBAMTDElTUkcgUm9vdCBY
MTCCAiIwDQYJKoZIhvcNAQEBBQADggIPADCCAgoCggIBAK3oJHP0FDfzm54rVygc
h77ct984kIxuPOZXoHj3dcKi/vVqbvYATyjb3miGbESTtrFj/RQSa78f0uoxmyF+
0TM8ukj13Xnfs7j/EvEhmkvBioZxaUpmZmyPfjxwv60pIgbz5MDmgK7iS4+3mX6U
A5/TR5d8mUgjU+g4rk8Kb4Mu0UlXjIB0ttov0DiNewNwIRt18jA8+o+u3dpjq+sW
T8KOEUt+zwvo/7V3LvSye0rgTBIlDHCNAymg4VMk7BPZ7hm/ELNKjD+Jo2FR3qyH
B5T0Y3HsLuJvW5iB4YlcNHlsdu87kGJ55tukmi8mxdAQ4Q7e2RCOFvu396j3x+UC
B5iPNgiV5+I3lg02dZ77DnKxHZu8A/lJBdiB3QW0KtZB6awBdpUKD9jf1b0SHzUv
KBds0pjBqAlkd25HN7rOrFleaJ1/ctaJxQZBKT5ZPt0m9STJEadao0xAH0ahmbWn
OlFuhjuefXKnEgV4We0+UXgVCwOPjdAvBbI+e0ocS3MFEvzG6uBQE3xDk3SzynTn
jh8BCNAw1FtxNrQHusEwMFxIt4I7mKZ9YIqioymCzLq9gwQbooMDQaHWBfEbwrbw
qHyGO0aoSCqI3Haadr8faqU9GY/rOPNk3sgrDQoo//fb4hVC1CLQJ13hef4Y53CI
rU7m2Ys6xt0nUW7/vGT1M0NPAgMBAAGjQjBAMA4GA1UdDwEB/wQEAwIBBjAPBgNV
HRMBAf8EBTADAQH/MB0GA1UdDgQWBBR5tFnme7bl5AFzgAiIyBpY9umbbjANBgkq
hkiG9w0BAQsFAAOCAgEAVR9YqbyyqFDQDLHYGmkgJykIrGF1XIpu+ILlaS/V9lZL
ubhzEFnTIZd+50xx+7LSYK05qAvqFyFWhfFQDlnrzuBZ6brJFe+GnY+EgPbk6ZGQ
3BebYhtF8GaV0nxvwuo77x/Py9auJ/GpsMiu/X1+mvoiBOv/2X/qkSsisRcOj/KK
NFtY2PwByVS5uCbMiogziUwthDyC3+6WVwW6LLv3xLfHTjuCvjHIInNzktHCgKQ5
ORAzI4JMPJ+GslWYHb4phowim57iaztXOoJwTdwJx4nLCgdNbOhdjsnvzqvHu7Ur
TkXWStAmzOVyyghqpZXjFaH3pO3JLF+l+/+sKAIuvtd7u+Nxe5AW0wdeRlN8NwdC
jNPElpzVmbUq4JUagEiuTDkHzsxHpFKVK7q4+63SM1N95R1NbdWhscdCb+ZAJzVc
oyi3B43njTOQ5yOf+1CceWxG1bQVs5ZufpsMljq4Ui0/1lvh+wjChP4kqKOJ2qxq
4RgqsahDYVvTH9w7jXbyLeiNdd8XM2w9U/t7y0Ff/9yi0GE44Za4rF2LN9d11TPA
mRGunUHBcnWEvgJBQl9nJEiU0Zsnvgc/ubhPgXRR4Xq37Z0j4r7g1SgEEzwxA57d
emyPxgcYxn/eR44/KJ4EBs+lVDR3veyJm+kXQ99b21/+jh5Xos1AnX5iItreGCc=
-----END CERTIFICATE-----
.gradio/flagged/log.csv
ADDED
@@ -0,0 +1,6 @@
conversation,index,value,flag,timestamp
"[{""role"": ""user"", ""metadata"": null, ""content"": ""hi"", ""options"": null}, {""role"": ""assistant"", ""metadata"": null, ""content"": ""You typed: hi"", ""options"": null}]",1,Like,2025-05-07 16:07:57.747389
"[{""role"": ""user"", ""metadata"": null, ""content"": ""hi"", ""options"": null}, {""role"": ""assistant"", ""metadata"": null, ""content"": ""You typed: hi"", ""options"": null}]",1,Spam,2025-05-07 16:08:02.536048
"[{""role"": ""user"", ""metadata"": null, ""content"": ""hi"", ""options"": null}, {""role"": ""assistant"", ""metadata"": null, ""content"": ""You typed: hi"", ""options"": null}]",1,Inappropriate,2025-05-07 16:08:03.301466
"[{""role"": ""user"", ""metadata"": null, ""content"": ""hi"", ""options"": null}, {""role"": ""assistant"", ""metadata"": null, ""content"": ""You typed: hi"", ""options"": null}]",1,Like,2025-05-07 16:08:05.084130
"[{""role"": ""user"", ""metadata"": null, ""content"": ""LLM\u95a2\u9023\u306e\u8ad6\u6587\u3092\u63a2\u3057\u305f\u3044"", ""options"": null}, {""role"": ""assistant"", ""metadata"": null, ""content"": ""LLM\u95a2\u9023\u306e\u8ad6\u6587\u3092\u63a2\u3057\u305f\u3044... \u308f\u304b\u3063\u305f\uff01\u3053\u3061\u3089\u306f\u3042\u306a\u305f\u306e\u8cea\u554f\u306b\u95a2\u9023\u3059\u308b\u8ad6\u6587\u3067\u3059\uff1a\n- Glucocorticoid Receptor Activation Lowers the Threshold for NMDA-Receptor-Dependent Homosynaptic Long-Term Depression in the Hippocampus Through Activation of Voltage-Dependent Calcium Channels: Coussens, Christine M., D. Steven Kerr, and Wickliffe C. Abraham. Glucocorticoid receptor activation lowers the threshold for NMDA-receptor-dependent homosynaptic long-term depression in the hippoc...\n- Summary of the Discussions: None\n- VizieR Online Data Catalog: Sou323 ICRF reference sample (Liu+, 2017): None\n\n\u3069\u3046\u601d\u3044\u307e\u3059\u304b\uff1f"", ""options"": null}]",1,Like,2025-05-09 16:48:21.327291
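The flagged log above stores each conversation as a JSON string inside a quoted CSV field. A minimal sketch of reading such a file back, assuming only the layout visible in the rows above: the helper name `parse_flag_log` is hypothetical, and fields are read by position because the sample rows carry four values under a five-column header.

```python
import csv
import io
import json

# Header and first data row copied from the flagged log shown above.
sample = (
    'conversation,index,value,flag,timestamp\n'
    '"[{""role"": ""user"", ""metadata"": null, ""content"": ""hi"", ""options"": null}, '
    '{""role"": ""assistant"", ""metadata"": null, ""content"": ""You typed: hi"", ""options"": null}]",'
    '1,Like,2025-05-07 16:07:57.747389\n'
)


def parse_flag_log(text):
    """Return the header and a list of (conversation, remaining_fields) pairs."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    parsed = []
    for fields in reader:
        # The csv module undoes the doubled quotes; the first field is then plain JSON.
        conversation = json.loads(fields[0])
        parsed.append((conversation, fields[1:]))
    return header, parsed


header, parsed = parse_flag_log(sample)
conversation, rest = parsed[0]
print(header)
print(conversation[-1]["content"], rest)
```

Reading by position avoids guessing which header name each of the trailing values belongs to.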
.idea/.gitignore
ADDED
@@ -0,0 +1,8 @@
# Default ignored files
/shelf/
/workspace.xml
# Editor-based HTTP Client requests
/httpRequests/
# Datasource local storage ignored files
/dataSources/
/dataSources.local.xml
.idea/JStage_RAG.iml
ADDED
@@ -0,0 +1,8 @@
<?xml version="1.0" encoding="UTF-8"?>
<module type="PYTHON_MODULE" version="4">
  <component name="NewModuleRootManager">
    <content url="file://$MODULE_DIR$" />
    <orderEntry type="inheritedJdk" />
    <orderEntry type="sourceFolder" forTests="false" />
  </component>
</module>
.idea/inspectionProfiles/Project_Default.xml
ADDED
@@ -0,0 +1,14 @@
<component name="InspectionProjectProfileManager">
  <profile version="1.0">
    <option name="myName" value="Project Default" />
    <inspection_tool class="PyPackageRequirementsInspection" enabled="true" level="WARNING" enabled_by_default="true">
      <option name="ignoredPackages">
        <value>
          <list size="1">
            <item index="0" class="java.lang.String" itemvalue="omegaconf" />
          </list>
        </value>
      </option>
    </inspection_tool>
  </profile>
</component>
.idea/inspectionProfiles/profiles_settings.xml
ADDED
@@ -0,0 +1,6 @@
<component name="InspectionProjectProfileManager">
  <settings>
    <option name="USE_PROJECT_PROFILE" value="false" />
    <version value="1.0" />
  </settings>
</component>
.idea/misc.xml
ADDED
@@ -0,0 +1,7 @@
<?xml version="1.0" encoding="UTF-8"?>
<project version="4">
  <component name="Black">
    <option name="sdkName" value="rag" />
  </component>
  <component name="ProjectRootManager" version="2" project-jdk-name="rag" project-jdk-type="Python SDK" />
</project>
.idea/modules.xml
ADDED
@@ -0,0 +1,8 @@
<?xml version="1.0" encoding="UTF-8"?>
<project version="4">
  <component name="ProjectModuleManager">
    <modules>
      <module fileurl="file://$PROJECT_DIR$/.idea/JStage_RAG.iml" filepath="$PROJECT_DIR$/.idea/JStage_RAG.iml" />
    </modules>
  </component>
</project>
.idea/vcs.xml
ADDED
@@ -0,0 +1,6 @@
<?xml version="1.0" encoding="UTF-8"?>
<project version="4">
  <component name="VcsDirectoryMappings">
    <mapping directory="$PROJECT_DIR$/github/s2orc" vcs="Git" />
  </component>
</project>
.idea/workspace.xml
ADDED
@@ -0,0 +1,111 @@
<?xml version="1.0" encoding="UTF-8"?>
<project version="4">
  <component name="AutoImportSettings">
    <option name="autoReloadType" value="SELECTIVE" />
  </component>
  <component name="ChangeListManager">
    <list default="true" id="c853a62f-55db-44a1-bce5-2e4d5d00c3a8" name="Changes" comment="" />
    <option name="SHOW_DIALOG" value="false" />
    <option name="HIGHLIGHT_CONFLICTS" value="true" />
    <option name="HIGHLIGHT_NON_ACTIVE_CHANGELIST" value="false" />
    <option name="LAST_RESOLUTION" value="IGNORE" />
  </component>
  <component name="FileTemplateManagerImpl">
    <option name="RECENT_TEMPLATES">
      <list>
        <option value="Python Script" />
      </list>
    </option>
  </component>
  <component name="Git.Settings">
    <option name="RECENT_GIT_ROOT_PATH" value="$PROJECT_DIR$/github/s2orc" />
  </component>
  <component name="MarkdownSettingsMigration">
    <option name="stateVersion" value="1" />
  </component>
  <component name="ProjectColorInfo">{
  "associatedIndex": 2
}</component>
  <component name="ProjectId" id="2wkzRAQAXupWpP41JeH3GR1AJCi" />
  <component name="ProjectViewState">
    <option name="hideEmptyMiddlePackages" value="true" />
    <option name="showLibraryContents" value="true" />
  </component>
  <component name="PropertiesComponent">{
  "keyToString": {
    "Python.app.executor": "Run",
    "RunOnceActivity.OpenProjectViewOnStart": "true",
    "RunOnceActivity.ShowReadmeOnStart": "true",
    "git-widget-placeholder": "master",
    "last_opened_file_path": "/Users/jiangjunfeng/Desktop/coding/JStage_RAG",
    "node.js.detected.package.eslint": "true",
    "node.js.detected.package.tslint": "true",
    "node.js.selected.package.eslint": "(autodetect)",
    "node.js.selected.package.tslint": "(autodetect)",
    "nodejs_package_manager_path": "npm",
    "vue.rearranger.settings.migration": "true"
  }
}</component>
  <component name="RecentsManager">
    <key name="CopyFile.RECENT_KEYS">
      <recent name="$PROJECT_DIR$" />
    </key>
  </component>
  <component name="RunManager">
    <configuration name="app" type="PythonConfigurationType" factoryName="Python" temporary="true" nameIsGenerated="true">
      <module name="JStage_RAG" />
      <option name="ENV_FILES" value="" />
      <option name="INTERPRETER_OPTIONS" value="" />
      <option name="PARENT_ENVS" value="true" />
      <envs>
        <env name="PYTHONUNBUFFERED" value="1" />
      </envs>
      <option name="SDK_HOME" value="" />
      <option name="WORKING_DIRECTORY" value="$PROJECT_DIR$" />
      <option name="IS_MODULE_SDK" value="true" />
      <option name="ADD_CONTENT_ROOTS" value="true" />
      <option name="ADD_SOURCE_ROOTS" value="true" />
      <EXTENSION ID="PythonCoverageRunConfigurationExtension" runner="coverage.py" />
      <option name="SCRIPT_NAME" value="$PROJECT_DIR$/app.py" />
      <option name="PARAMETERS" value="" />
      <option name="SHOW_COMMAND_LINE" value="false" />
      <option name="EMULATE_TERMINAL" value="false" />
      <option name="MODULE_MODE" value="false" />
      <option name="REDIRECT_INPUT" value="false" />
      <option name="INPUT_FILE" value="" />
      <method v="2" />
    </configuration>
    <recent_temporary>
      <list>
        <item itemvalue="Python.app" />
      </list>
    </recent_temporary>
  </component>
  <component name="SharedIndexes">
    <attachedChunks>
      <set>
        <option value="bundled-python-sdk-d68999036c7f-b11f5e8da5ad-com.jetbrains.pycharm.pro.sharedIndexes.bundled-PY-233.14475.56" />
      </set>
    </attachedChunks>
  </component>
  <component name="SpellCheckerSettings" RuntimeDictionaries="0" Folders="0" CustomDictionaries="0" DefaultDictionary="application-level" UseSingleDictionary="true" transferred="true" />
  <component name="TaskManager">
    <task active="true" id="Default" summary="Default task">
      <changelist id="c853a62f-55db-44a1-bce5-2e4d5d00c3a8" name="Changes" comment="" />
      <created>1746600349656</created>
      <option name="number" value="Default" />
      <option name="presentableId" value="Default" />
      <updated>1746600349656</updated>
      <workItem from="1746600368067" duration="9690000" />
      <workItem from="1746762687197" duration="1020000" />
      <workItem from="1746776819364" duration="198000" />
    </task>
    <servers />
  </component>
  <component name="TypeScriptGeneratedFilesManager">
    <option name="version" value="3" />
  </component>
  <component name="com.intellij.coverage.CoverageDataManagerImpl">
    <SUITE FILE_PATH="coverage/JStage_RAG$app.coverage" NAME="app Coverage Results" MODIFIED="1746776823255" SOURCE_PROVIDER="com.intellij.coverage.DefaultCoverageFileProvider" RUNNER="coverage.py" COVERAGE_BY_TEST_ENABLED="true" COVERAGE_TRACING_ENABLED="false" WORKING_DIRECTORY="$PROJECT_DIR$" />
  </component>
</project>
README.md
CHANGED
@@ -1,12 +1,6 @@
 ---
-title:
-
-colorFrom: yellow
-colorTo: blue
+title: JStage_RAG
+app_file: app.py
 sdk: gradio
 sdk_version: 5.29.0
-app_file: app.py
-pinned: false
 ---
-
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
app.py
ADDED
@@ -0,0 +1,128 @@
import time
import random
import ujson as json
from typing import List
from dataclasses import dataclass
import gradio as gr


@dataclass
class Paper:
    paper_id: str
    title: str
    abstract: str
    authors: List[str] = None
    year: int = None
    doi: str = None


def load_database(filename):
    # Each line of the file is one JSON object describing a paper (JSONL).
    database = []
    with open(filename, "r", encoding="utf-8") as f:
        for line in f:
            json_data = json.loads(line)

            data_point = Paper(
                paper_id=json_data["paper_id"],
                title=json_data["title"],
                abstract=json_data["abstract"],
                authors=json_data.get("authors", []),
                year=json_data.get("year", None),
                doi=json_data.get("doi", None)
            )

            database.append(data_point)
    return database


class S2ORCRAGPipeline:
    def __init__(
            self,
            s2orc_filename,
            model=lambda x: x,
    ):
        self.s2orc_filename = s2orc_filename
        self.database = load_database(s2orc_filename)
        self.model = model

    def retrieve_top_k(
            self,
            query: str,
            topk=5
    ):
        # Fake: a random sample, seeded by the query length so results are repeatable
        random.seed(len(query) + topk)
        return random.sample(self.database, topk)

        # Real
        # TODO: DB-team


    def generate_response(
            self,
            query,
            retrieved_papers,
    ):
        # Fake
        # Japanese: "Got it! Here are the papers related to your question:"
        response = f"{query}... わかった!こちらはあなたの質問に関連する論文です:\n"
        for paper in retrieved_papers:
            response += f"- {paper.title}: {paper.abstract}\n"

        # Japanese: "What do you think?"
        response += "\nどう思いますか?\n"

        response = self.model(response)

        return response

        # Real
        # TODO: Generation-team

    def __call__(
            self,
            query
    ):
        # First, retrieve papers from the database
        retrieved_papers = self.retrieve_top_k(query, topk=3)

        # Second, generate a response from the query and the retrieved papers
        response = self.generate_response(query, retrieved_papers)

        return response

    def slow_echo(self, message, history):
        # Stream the response one character at a time
        output = self.__call__(query=message)
        for i in range(len(output)):
            time.sleep(0.001)
            yield output[: i + 1]


if __name__ == "__main__":
    # load from S2ORC
    example_filename = "sample.jsonl"

    pipeline = S2ORCRAGPipeline(
        s2orc_filename=example_filename,
        model=lambda x: x
    )

    # Japanese: "Hello! What papers would you like to look for today?"
    initial_messages = [{"role": "assistant", "content": "こんにちは〜今日は何の論文を探したいですか?"}]

    demo = gr.ChatInterface(
        pipeline.slow_echo,
        chatbot=gr.Chatbot(
            value=initial_messages,
            type="messages",
            resizable=True, height=700,
            placeholder="こんにちは〜今日は何の論文を探したいですか?"
        ),
        type="messages",
        flagging_mode="manual",
        flagging_options=["Like", "Spam", "Inappropriate", "Other"],
        title="LLMC S2ORC 論文検索 (+RAG)",  # Japanese: "LLMC S2ORC paper search (+RAG)"
        description="",
        save_history=True,
        examples=["こんにちは", "LLM関連の論文を探したい", "Find Suzuki's papers on graphene from 2019 to 2021 in Surface Science Journal."],
    )

    demo.launch(debug=True, share=True)  # share=True fails on the NII network

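The `retrieve_top_k` method in `app.py` above is a placeholder that samples papers at random and leaves the real retrieval as a TODO. One possible stand-in is sketched below, assuming plain token-overlap scoring is acceptable for a demo. This is not the project's planned retrieval method; `retrieve_top_k_by_overlap` and `tokenize` are hypothetical names, and `Paper` is a reduced copy of the dataclass from `app.py`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Paper:
    # Reduced copy of the Paper dataclass in app.py (only the fields used here).
    paper_id: str
    title: str
    abstract: Optional[str] = None


def tokenize(text: str) -> set:
    # Lowercase whitespace tokenization; punctuation stays attached to tokens.
    return set(text.lower().split())


def retrieve_top_k_by_overlap(database: List[Paper], query: str, topk: int = 3) -> List[Paper]:
    """Rank papers by how many query tokens occur in the title or abstract."""
    query_tokens = tokenize(query)

    def score(paper: Paper) -> int:
        doc_tokens = tokenize(f"{paper.title} {paper.abstract or ''}")
        return len(query_tokens & doc_tokens)

    # sorted() is stable, so papers with equal scores keep their database order.
    ranked = sorted(database, key=score, reverse=True)
    # Slicing returns fewer items when the database is smaller than topk.
    return ranked[:topk]


papers = [
    Paper("1", "Hypertensive heart disease", None),
    Paper("2", "Large language models for medical question answering",
          "We evaluate language models on MedQA."),
    Paper("3", "Managing Ethnicity in African Politics", None),
]
top = retrieve_top_k_by_overlap(papers, "language models MedQA", topk=2)
print([p.paper_id for p in top])
```

Whitespace tokenization does not segment Japanese text, so a query such as the Japanese examples in `app.py` would need a proper tokenizer or an embedding-based retriever instead.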
github/s2orc/.gitignore
ADDED
@@ -0,0 +1,114 @@

# vscode
*.vscode

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/

.idea/

setup.sh

output/*
github/s2orc/README.md
ADDED
@@ -0,0 +1,152 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
<p align="center">
|
2 |
+
<img src="https://raw.githubusercontent.com/allenai/s2orc/master/assets/logo.svg" alt="Logo of S2ORC, pronounced stork" width="20%">
|
3 |
+
</p>
|
4 |
+
|
5 |
+
|
6 |
+
# S2ORC: The Semantic Scholar Open Research Corpus
|
7 |
+
|
8 |
+
S2ORC is a general-purpose corpus for NLP and text mining research over scientific papers.
|
9 |
+
|
10 |
+
* **[Download instructions](#download-instructions)**.
|
11 |
+
* S2ORC was developed by [Kyle Lo](https://kyleclo.github.io/) and [Lucy Lu Wang](https://llwang.net/) at the [Allen Institute for AI](https://allenai.org/). It is now being maintained as a product offering by the API team at [Semantic Scholar](https://www.semanticscholar.org/product/api).
|
12 |
+
* S2ORC is released under the [ODC-By 1.0](https://opendatacommons.org/licenses/by/1-0/). By using S2ORC, you agree to the terms in the license.
|
13 |
+
* Please cite [our ACL 2020 paper](https://www.aclweb.org/anthology/2020.acl-main.447) if you use S2ORC for your project. See the [BibTeX](#citation). You can also watch [our 12 min ACL 2020 talk](https://slideslive.com/38929131/s2orc-the-semantic-scholar-open-research-corpus).
|
14 |
+
|
15 |
+
|
16 |
+
|
17 |
+
## News and Releases
|
18 |
+
|
19 |
+
⭐ **S2ORC now available through S2 API**
|
20 |
+
|
21 |
+
It's Jan 2023; happy new year! After years of managing S2ORC as a research project, it has now been adopted as a core dataset offering through the [Semantic Scholar Public API](https://www.semanticscholar.org/product/api). Please look for the instructions under "Bulk Dataset" for download!
|
22 |
+
|
23 |
+
S2ORC is now available through the [Semantic Scholar Public API](https://www.semanticscholar.org/product/api) as a "Bulk Dataset". It is continuously being rebuilt so if you access it through there, you'll get access to **new** papers as well!
|
24 |
+
|
25 |
+
**Software Release: 2021-02-01**
|
26 |
+
|
27 |
+
- Released [s2orc-doc2json](https://github.com/allenai/s2orc-doc2json) to support parsing of PDF and LaTeX to JSON format.
|
28 |
+
|
29 |
+
|
30 |
+
**S2ORC Release: 2020-07-05**
|
31 |
+
|
32 |
+
- Released a new version of S2ORC containing papers up until 2020-04-14, bringing full text coverage from 8M to 12M.
|
33 |
+
- Lifted some paper filters to be more lenient toward papers that don't have sufficient amount of text. This brought total paper count to 136M from 81M.
|
34 |
+
- Updated the schema to keep paper metadata and parsed paper text separate.
|
35 |
+
- Fixed major bugs such as (i) missing section names, (ii) inline citation mention links that don't resolve to bibliographies, and (iii) unpredictable typing in certain metadata fields.
|
36 |
+
- Omitted LaTeX parses from this release. They will be added in a subsequent release. Part of the dataset schema change is to accommodate incremental releases (e.g. LaTeX-only release without having to re-run PDF parsing).
|
37 |
+
|
38 |
+
- Feb 2023 update: We no longer support access to this version and recommend that everyone access S2ORC through the Semantic Scholar Public API instead. If you must use this version and need assistance, please contact Kyle and Lucy.
|
39 |
+
|
40 |
+
|
41 |
+
**Project Status: 2020-04-07**
|
42 |
+
|
43 |
+
- S2ORC has been accepted to ACL 2020!
|
44 |
+
- We've changed the name of the project to S2ORC. We will update the [preprint](https://arxiv.org/abs/1911.02782) shortly with the new name.
|
45 |
+
- The [BibTeX citation](#citation) has also been changed to reflect this.
|
46 |
+
- Feb 2023 update: We no longer support access to this version and recommend that everyone access S2ORC through the Semantic Scholar Public API instead. If you must use this version and need assistance, please contact Kyle and Lucy.
|
47 |
+
|
48 |
+
|
49 |
+
**S2ORC Release: 2019-09-28**
|
50 |
+
|
51 |
+
- Statistics: 81M+ paper nodes; 73M+ gold abstracts; 8M+ full text papers
|
52 |
+
- Due to release bugs (e.g. missing section names), we no longer recommend usage of this version. If you must use this version and need assistance, please contact Kyle and Lucy.
|
53 |
+
|
54 |
+
|
55 |
+
## Download instructions
|
56 |
+
|
57 |
+
The original S2ORC dataset files were refactored into multiple datasets available through [the Semantic Scholar APIs](https://api.semanticscholar.org/) (See detailed documentation [here](https://api.semanticscholar.org/api-docs/datasets)).
|
58 |
+
|
59 |
+
Once you obtain an API key from [Semantic Scholar Public API](https://www.semanticscholar.org/product/api), you should be able to access these bulk dumps like so:
|
60 |
+
```
|
61 |
+
import json
|
62 |
+
import os
|
63 |
+
import re
|
64 |
+
import requests
|
65 |
+
import wget
|
66 |
+
from tqdm import tqdm
|
67 |
+
|
68 |
+
# modify these
|
69 |
+
API_KEY = "..."
|
70 |
+
DATASET_NAME = "s2orc"
|
71 |
+
LOCAL_PATH = "/my/local/path/for/s2orc/"
|
72 |
+
os.makedirs(LOCAL_PATH, exist_ok=True)
|
73 |
+
|
74 |
+
# get latest release's ID
|
75 |
+
response = requests.get("https://api.semanticscholar.org/datasets/v1/release/latest").json()
|
76 |
+
RELEASE_ID = response["release_id"]
|
77 |
+
print(f"Latest release ID: {RELEASE_ID}")
|
78 |
+
|
79 |
+
# get the download links for the s2orc dataset; needs to pass API key through `x-api-key` header
|
80 |
+
# download via wget. this can take a while...
|
81 |
+
response = requests.get(f"https://api.semanticscholar.org/datasets/v1/release/{RELEASE_ID}/dataset/{DATASET_NAME}/", headers={"x-api-key": API_KEY}).json()
|
82 |
+
for url in tqdm(response["files"]):
|
83 |
+
    match = re.match(r"https://ai2-s2ag\.s3\.amazonaws\.com/staging/(.*)/s2orc/(.*)\.gz(.*)", url)
|
84 |
+
    assert match.group(1) == RELEASE_ID
|
85 |
+
    SHARD_ID = match.group(2)
|
86 |
+
    wget.download(url, out=os.path.join(LOCAL_PATH, f"{SHARD_ID}.gz"))
|
87 |
+
print("Downloaded all shards.")
|
88 |
+
```
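
Once the shards are on disk, they can be streamed record by record instead of being loaded whole. The following is a minimal sketch, not part of the official tooling: it assumes each downloaded `.gz` shard is gzip-compressed JSON Lines (one JSON object per line), and `iter_shard_records` and `count_records` are hypothetical helper names introduced here for illustration.

```python
import gzip
import json
from pathlib import Path
from typing import Iterator


def iter_shard_records(shard_path: str) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a gzipped shard."""
    # Assumption: the shard is gzip-compressed JSON Lines (one JSON object per line).
    with gzip.open(shard_path, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            line = line.strip()
            if line:  # skip blank lines rather than failing on them
                yield json.loads(line)


def count_records(local_path: str) -> int:
    """Count records across every `*.gz` shard in a download directory."""
    return sum(
        1
        for shard in sorted(Path(local_path).glob("*.gz"))
        for _ in iter_shard_records(str(shard))
    )
```

Calling `count_records(LOCAL_PATH)` after the download finishes gives a quick sanity check. Because records are yielded one at a time, memory use stays roughly constant regardless of shard size.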
|
89 |
+
|
90 |
+
For questions, feature requests, or bug reports, please search existing issues on [the s2-folks GitHub repo](https://github.com/allenai/s2-folks/issues?q=is%3Aissue) before creating [a new issue](https://github.com/allenai/s2-folks/issues/new).
|
91 |
+
|
92 |
+
|
93 |
+
## Contact us
|
94 |
+
|
95 |
+
The best way to contact us is through email. Don't hesitate to reach out about anything; we've helped a lot of people get started with the dataset, which can be a bit daunting given its size.
|
96 |
+
|
97 |
+
**Email:** Please include `{kylel, lucyw, rodneyk` on all correspondence.
|
98 |
+
|
99 |
+
**Twitter:** [@kylelostat](https://twitter.com/kylelostat), [@lucyluwang](https://twitter.com/lucyluwang)
|
100 |
+
|
101 |
+
**Give us Feedback:** Totally optional, but we'd love to hear how you're using this dataset and any feedback for improving it. Send us an email or open a GitHub issue.
|
102 |
+
|
103 |
+
**Report issues:**
|
104 |
+
|
105 |
+
S2ORC is now being maintained by the S2 API product team. For questions, feature requests, or bug reports, please search existing issues on [the s2-folks GitHub repo](https://github.com/allenai/s2-folks/issues?q=is%3Aissue) before creating [a new issue](https://github.com/allenai/s2-folks/issues/new).
|
106 |
+
|
107 |
+
|
108 |
+
## FAQ
|
109 |
+
|
110 |
+
#### What's the difference between [S2ORC](https://arxiv.org/abs/1911.02782) and [S2AG](https://dl.acm.org/doi/fullHtml/10.1145/3487553.3527147)?
|
111 |
+
At a high level:
|
112 |
+
|
113 |
+
- S2AG is everything that is covered in the literature graph, including Nodes (i.e. papers, authors) and Edges (i.e. citations, authorship). A `paper` in S2AG is represented by a bundle of Metadata, such as the Title, Authors, Year, Venue, Abstract, etc. You can download different releases of S2AG via [the Semantic Scholar APIs](https://api.semanticscholar.org/) (See detailed documentation [here](https://api.semanticscholar.org/api-docs/datasets)).
|
114 |
+
|
115 |
+
- S2ORC is the machine-readable **full text** of each paper, which we derive using models run on the paper's PDF. The original S2ORC dataset files are no longer available for download. They were refactored into multiple datasets available through [the Semantic Scholar APIs](https://api.semanticscholar.org/) (See detailed documentation [here](https://api.semanticscholar.org/api-docs/datasets)).
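
Both bullets point to the same datasets API, so the listing URL for a dataset can be built from the path pattern used in the download instructions. The sketch below makes two assumptions: that the `release/{release_id}/dataset/{dataset_name}/` pattern shown for `s2orc` carries over to the other datasets, and that a release ID can be used as a plain path segment. `dataset_files_url` is a hypothetical helper, not part of any Semantic Scholar client.

```python
from urllib.parse import quote

# Base path taken from the download instructions in this README.
DATASETS_API = "https://api.semanticscholar.org/datasets/v1"


def dataset_files_url(release_id: str, dataset_name: str) -> str:
    """Return the URL that lists the download links for one dataset in one release."""
    # Percent-encode both path segments so unusual names cannot break the URL.
    release = quote(release_id, safe="")
    dataset = quote(dataset_name, safe="")
    return f"{DATASETS_API}/release/{release}/dataset/{dataset}/"


# "2023-01-03" is a placeholder release ID, not a claim about a real release.
print(dataset_files_url("2023-01-03", "s2orc"))
```

As in the download script, a request to the resulting URL still needs the API key in the `x-api-key` header.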
|
116 |
+
|
117 |
+
If you're unsure what to use or cite, please email us and we'd be happy to discuss your project with you.
|
118 |
+
|
119 |
+
#### I have an old version of S2ORC. How is it different from the version of S2ORC from the S2 API?
|
120 |
+
|
121 |
+
- Original S2ORC was a research project with its own code. The current S2ORC is a reimplementation of the ideas from the research project within the Semantic Scholar data pipeline. As a result, the two versions can differ in low-level implementation details.
|
122 |
+
|
123 |
+
- Current S2ORC is maintained by a different team than the original researchers.
|
124 |
+
|
125 |
+
|
126 |
+
- Original S2ORC was released under a non-commercial license. The current S2ORC is released under an ODC-By 1.0 license. We ask that users take care to double-check whether their intended usage of S2ORC and its underlying contents is permissible under this license.
|
127 |
+
|
128 |
+
|
129 |
+
## License
|
130 |
+
|
131 |
+
S2ORC is currently released through the [Semantic Scholar Public API](https://www.semanticscholar.org/product/api) under the [ODC-By 1.0](https://opendatacommons.org/licenses/by/1-0/) license. By using S2ORC, you agree to its usage terms.
|
132 |
+
|
133 |
+
|
134 |
+
|
135 |
+
## Citation
|
136 |
+
|
137 |
+
If using this dataset, please cite:
|
138 |
+
|
139 |
+
```
|
140 |
+
@inproceedings{lo-wang-2020-s2orc,
|
141 |
+
title = "{S}2{ORC}: The Semantic Scholar Open Research Corpus",
|
142 |
+
author = "Lo, Kyle and Wang, Lucy Lu and Neumann, Mark and Kinney, Rodney and Weld, Daniel",
|
143 |
+
booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
|
144 |
+
month = jul,
|
145 |
+
year = "2020",
|
146 |
+
address = "Online",
|
147 |
+
publisher = "Association for Computational Linguistics",
|
148 |
+
url = "https://www.aclweb.org/anthology/2020.acl-main.447",
|
149 |
+
doi = "10.18653/v1/2020.acl-main.447",
|
150 |
+
pages = "4969--4983"
|
151 |
+
}
|
152 |
+
```
|
github/s2orc/assets/logo.svg
ADDED
|
github/s2orc/data/metadata/sample.jsonl
ADDED
The diff for this file is too large to render.
See raw diff
|
|
github/s2orc/data/pdf_parses/sample.jsonl
ADDED
The diff for this file is too large to render.
See raw diff
|
|
github/s2orc/requirements.txt
ADDED
@@ -0,0 +1,2 @@
|
|
|
|
|
|
|
1 |
+
boto3
|
2 |
+
tqdm
|
github/s2orc/setup.py
ADDED
@@ -0,0 +1,15 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
#!/usr/bin/python
|
2 |
+
import setuptools
|
3 |
+
|
4 |
+
setuptools.setup(
|
5 |
+
name='s2orc',
|
6 |
+
version='0.1',
|
7 |
+
packages=setuptools.find_packages(),
|
8 |
+
install_requires=[
|
9 |
+
],
|
10 |
+
tests_require=[
|
11 |
+
],
|
12 |
+
zip_safe=False,
|
13 |
+
test_suite='py.test',
|
14 |
+
entry_points='',
|
15 |
+
)
|
requirements.txt
ADDED
@@ -0,0 +1,2 @@
|
|
|
|
|
|
|
1 |
+
ujson==5.10.0
|
2 |
+
gradio==5.29.0
|
sample.jsonl
ADDED
The diff for this file is too large to render.
See raw diff
|
|