Goran Glavaš
committed on
Commit · cf27868
Parent(s): f6fa5b0
Code, binary, data, and README
Browse files
- README.txt +60 -0
- binary/graphseg.jar +3 -0
- data/manifestos-gold-segmented/61320_200411.txt +0 -0
- data/manifestos-gold-segmented/61320_200811.txt +0 -0
- data/manifestos-gold-segmented/61320_201211.txt +0 -0
- data/manifestos-gold-segmented/61620_200411.txt +0 -0
- data/manifestos-gold-segmented/61620_200811.txt +0 -0
- data/manifestos-gold-segmented/61620_201211.txt +0 -0
- data/manifestos-original-clean/61320_200411.txt +0 -0
- data/manifestos-original-clean/61320_200811.txt +0 -0
- data/manifestos-original-clean/61320_201211.txt +0 -0
- data/manifestos-original-clean/61620_200411.txt +0 -0
- data/manifestos-original-clean/61620_200811.txt +0 -0
- data/manifestos-original-clean/61620_201211.txt +0 -0
- source/pom.xml +85 -0
- source/src/config.properties +3 -0
- source/src/edu/uma/nlp/graphseg/ClusteringHandler.java +206 -0
- source/src/edu/uma/nlp/graphseg/GraphHandler.java +134 -0
- source/src/edu/uma/nlp/graphseg/IOHandler.java +33 -0
- source/src/edu/uma/nlp/graphseg/STSHandler.java +37 -0
- source/src/edu/uma/nlp/graphseg/Start.java +122 -0
- source/src/edu/uma/nlp/graphseg/preprocessing/Annotation.java +36 -0
- source/src/edu/uma/nlp/graphseg/preprocessing/AnnotationType.java +14 -0
- source/src/edu/uma/nlp/graphseg/preprocessing/AnnotatorChain.java +35 -0
- source/src/edu/uma/nlp/graphseg/preprocessing/AnnotatorType.java +11 -0
- source/src/edu/uma/nlp/graphseg/preprocessing/Document.java +110 -0
- source/src/edu/uma/nlp/graphseg/preprocessing/IAnnotator.java +9 -0
- source/src/edu/uma/nlp/graphseg/preprocessing/NamedEntityAnnotation.java +88 -0
- source/src/edu/uma/nlp/graphseg/preprocessing/NamedEntityTokenAnnotation.java +38 -0
- source/src/edu/uma/nlp/graphseg/preprocessing/NamedEntityType.java +18 -0
- source/src/edu/uma/nlp/graphseg/preprocessing/PartOfSpeechAnnotation.java +69 -0
- source/src/edu/uma/nlp/graphseg/preprocessing/SentenceAnnotation.java +66 -0
- source/src/edu/uma/nlp/graphseg/preprocessing/StanfordAnnotator.java +142 -0
- source/src/edu/uma/nlp/graphseg/preprocessing/TokenAnnotation.java +104 -0
- source/src/edu/uma/nlp/graphseg/semantics/InformationContent.java +77 -0
- source/src/edu/uma/nlp/graphseg/semantics/SemanticSimilarity.java +252 -0
- source/src/edu/uma/nlp/graphseg/semantics/WordVectorSpace.java +151 -0
- source/src/edu/uma/nlp/graphseg/utils/ApplicationConfiguration.java +49 -0
- source/src/edu/uma/nlp/graphseg/utils/IOHelper.java +385 -0
- source/src/edu/uma/nlp/graphseg/utils/MemoryStorage.java +26 -0
- source/src/edu/uma/nlp/graphseg/utils/VectorOperations.java +45 -0
README.txt
ADDED
@@ -0,0 +1,60 @@
About
========

GraphSeg is a tool for semantic/topical segmentation of text that employs semantic relatedness and a graph-based algorithm to identify semantically coherent segments in text.
Segmentation is performed at the sentence level (there are no intra-sentential segment beginnings/ends).

Content
========

This repository contains:

(1) the Java source code (as a Maven project);
(2) the ready-to-use binary version of the tool (graphseg.jar in the /binary folder);
(3) the dataset of political manifestos manually annotated with segments (used for evaluation in the research paper that the GraphSeg tool accompanies).

Usage
========

The following command with four arguments runs the GraphSeg tool:

java -jar graphseg.jar <input-folder-path> <output-folder-path> <relatedness-threshold> <minimal-segment-size>

The arguments (all mandatory) to be provided are:

(1) <input-folder-path> is the path to the folder (directory) containing the raw text documents that need to be topically/semantically segmented;
(2) <output-folder-path> is the path to the folder in which the semantically/topically segmented input documents are to be stored;
(3) <relatedness-threshold> is the value of the relatedness threshold (a decimal number) used in the construction of the relatedness graph: larger values yield a larger number of smaller segments, whereas smaller values yield a smaller number of coarser segments;
(4) <minimal-segment-size> defines the minimal segment size m (in number of sentences); GraphSeg will not produce segments containing fewer than m sentences.

Example command:

java -jar graphseg.jar /home/seg-input /home/seg-output 0.25 3

Credit
========

In case you use GraphSeg in your research, please give appropriate credit to our work by citing the following publication:

@InProceedings{glavavs-nanni-ponzetto:2016:*SEM,
  author    = {Glava\v{s}, Goran and Nanni, Federico and Ponzetto, Simone Paolo},
  title     = {Unsupervised Text Segmentation Using Semantic Relatedness Graphs},
  booktitle = {Proceedings of the Fifth Joint Conference on Lexical and Computational Semantics},
  month     = {August},
  year      = {2016},
  address   = {Berlin, Germany},
  publisher = {Association for Computational Linguistics},
  pages     = {125--130},
  url       = {http://anthology.aclweb.org/S16-2016}
}

Contact
========

Please address all questions about the GraphSeg tool and the *SEM publication to:

Dr. Goran Glavaš
Data and Web Science Group
University of Mannheim

Email: [email protected]
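For programmatic use, the pipeline behind the command line can also be driven directly from Java. The sketch below is illustrative only: it mirrors the Start class included in this commit, the class name SegmentOneDocument and the variable rawText are placeholders of mine, and it assumes the stopwords, word vectors, and information content have already been loaded via MemoryStorage and ApplicationConfiguration exactly as Start.main does.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.jgrapht.UndirectedGraph;
import org.jgrapht.graph.DefaultEdge;

import edu.uma.nlp.graphseg.ClusteringHandler;
import edu.uma.nlp.graphseg.GraphHandler;
import edu.uma.nlp.graphseg.preprocessing.Document;
import edu.uma.nlp.graphseg.preprocessing.StanfordAnnotator;

public class SegmentOneDocument {
    public static void main(String[] args) {
        // Assumes stopwords, word vectors, and information content were loaded
        // beforehand, exactly as Start.main does via ApplicationConfiguration.
        String rawText = "...";  // placeholder: the raw document to segment

        // sentence-split the document
        StanfordAnnotator annotator = new StanfordAnnotator();
        annotator.setStanfordAnnotators(new ArrayList<String>(Arrays.asList("tokenize", "ssplit")));
        Document doc = new Document();
        doc.setText(rawText);
        annotator.annotate(doc);

        // re-annotate every sentence as its own snippet (with POS tags and lemmas)
        annotator.setStanfordAnnotators(new ArrayList<String>(Arrays.asList("tokenize", "ssplit", "pos", "lemma")));
        List<Document> snippets = new ArrayList<Document>();
        for (int i = 0; i < doc.getSentences().size(); i++) {
            Document snippet = new Document(doc.getSentences().get(i).getText());
            annotator.annotate(snippet);
            snippet.setId(String.valueOf(i));
            snippets.add(snippet);
        }

        // relatedness graph -> maximal cliques -> linear segments
        UndirectedGraph<Integer, DefaultEdge> graph = GraphHandler.constructGraph(snippets, 0.25);
        List<List<Integer>> cliques = GraphHandler.getAllCliques(graph);
        List<List<Integer>> segments = new ClusteringHandler()
                .getSequentialClusters(cliques, GraphHandler.getAllSimilarities(), 3);
    }
}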
binary/graphseg.jar
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:83ac4ce85663bd97072a2fad76349bf923d1869b7acd7a67f797e6c16a1a47b2
size 350762888
data/manifestos-gold-segmented/61320_200411.txt
ADDED
The diff for this file is too large to render. See raw diff

data/manifestos-gold-segmented/61320_200811.txt
ADDED
The diff for this file is too large to render. See raw diff

data/manifestos-gold-segmented/61320_201211.txt
ADDED
The diff for this file is too large to render. See raw diff

data/manifestos-gold-segmented/61620_200411.txt
ADDED
The diff for this file is too large to render. See raw diff

data/manifestos-gold-segmented/61620_200811.txt
ADDED
The diff for this file is too large to render. See raw diff

data/manifestos-gold-segmented/61620_201211.txt
ADDED
The diff for this file is too large to render. See raw diff

data/manifestos-original-clean/61320_200411.txt
ADDED
The diff for this file is too large to render. See raw diff

data/manifestos-original-clean/61320_200811.txt
ADDED
The diff for this file is too large to render. See raw diff

data/manifestos-original-clean/61320_201211.txt
ADDED
The diff for this file is too large to render. See raw diff

data/manifestos-original-clean/61620_200411.txt
ADDED
The diff for this file is too large to render. See raw diff

data/manifestos-original-clean/61620_200811.txt
ADDED
The diff for this file is too large to render. See raw diff

data/manifestos-original-clean/61620_201211.txt
ADDED
The diff for this file is too large to render. See raw diff
source/pom.xml
ADDED
@@ -0,0 +1,85 @@
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>edu.uma.nlp.graphseg</groupId>
  <artifactId>graphseg</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <name>graphseg</name>
  <description>Textual segmentation using graph-based algorithm using semantic relatedness</description>
  <build>
    <sourceDirectory>src</sourceDirectory>
    <resources>
      <resource>
        <directory>src</directory>
        <excludes>
          <exclude>**/*.java</exclude>
        </excludes>
      </resource>
    </resources>
    <plugins>
      <plugin>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.3</version>
        <configuration>
          <source>1.8</source>
          <target>1.8</target>
        </configuration>
      </plugin>
      <plugin>
        <artifactId>maven-assembly-plugin</artifactId>
        <configuration>
          <archive>
            <manifest>
              <mainClass>edu.uma.nlp.graphseg.Start</mainClass>
            </manifest>
          </archive>
          <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
          </descriptorRefs>
        </configuration>
        <executions>
          <execution>
            <id>make-assembly</id>
            <!-- bind to the packaging phase -->
            <phase>package</phase>
            <goals>
              <goal>single</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
  <dependencies>
    <dependency>
      <groupId>org.jgrapht</groupId>
      <artifactId>jgrapht-core</artifactId>
      <version>0.9.1</version>
    </dependency>
    <dependency>
      <groupId>org.javatuples</groupId>
      <artifactId>javatuples</artifactId>
      <version>1.2</version>
    </dependency>
    <dependency>
      <groupId>commons-io</groupId>
      <artifactId>commons-io</artifactId>
      <version>2.4</version>
    </dependency>
    <dependency>
      <groupId>org.apache.commons</groupId>
      <artifactId>commons-lang3</artifactId>
      <version>3.4</version>
    </dependency>
    <dependency>
      <groupId>edu.stanford.nlp</groupId>
      <artifactId>stanford-corenlp</artifactId>
      <version>3.5.2</version>
    </dependency>
    <dependency>
      <groupId>edu.stanford.nlp</groupId>
      <artifactId>stanford-corenlp</artifactId>
      <version>3.5.2</version>
      <classifier>models</classifier>
    </dependency>
  </dependencies>
</project>
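Since the assembly plugin above binds the jar-with-dependencies descriptor to the package phase and declares edu.uma.nlp.graphseg.Start as the main class, rebuilding the self-contained, runnable jar from the /source folder should (assuming a standard Maven installation) amount to:

mvn package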
source/src/config.properties
ADDED
@@ -0,0 +1,3 @@
inf-cont-path=C:/Goran/Corpora/unigram-freqs-english.txt
word-vec-path=C:/Goran/Corpora/WordVectors/glove-vectors-6b-200d.txt
stop-words-path=C:/Goran/Corpora/stopwords.txt
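These three keys are read at startup through ApplicationConfiguration (see Start.java below in this commit), so before running from source the Windows-style paths above have to be repointed at local copies of the English unigram-frequency list, the word-vector file, and the stopword list. The consuming code in Start.main, quoted for orientation:

    List<String> stopwords = IOHelper.getAllLines(ApplicationConfiguration.config.getValue("stop-words-path"));
    MemoryStorage.setWordVectorSpace(new WordVectorSpace());
    MemoryStorage.getWordVectorSpace().load(ApplicationConfiguration.config.getValue("word-vec-path"), null);
    MemoryStorage.setInformationContent(new InformationContent(ApplicationConfiguration.config.getValue("inf-cont-path"), 1));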
source/src/edu/uma/nlp/graphseg/ClusteringHandler.java
ADDED
@@ -0,0 +1,206 @@
package edu.uma.nlp.graphseg;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.stream.Collectors;

public class ClusteringHandler {

    public List<List<Integer>> getSequentialClusters(List<List<Integer>> cliques, Map<Integer, Map<Integer, Double>> allSimilarities, int largestTooSmallClusterSize)
    {
        List<List<Integer>> sequentialClusters = new ArrayList<List<Integer>>();

        System.out.println("Merging cliques...");
        mergeCliques(cliques, sequentialClusters);
        System.out.println("Merging singletons...");
        mergeSingletons(cliques, sequentialClusters, allSimilarities);
        System.out.println("Merging too small sequences...");
        mergeTooSmallSequences(sequentialClusters, allSimilarities, largestTooSmallClusterSize);

        return sequentialClusters;
    }

    private void mergeCliques(List<List<Integer>> cliques, List<List<Integer>> sequentialClusters)
    {
        boolean change = true;
        while(change)
        {
            change = false;
            for(List<Integer> clique : cliques)
            {
                for(int i = 0; i < clique.size() - 1; i++)
                {
                    for(int j = i+1; j < clique.size(); j++)
                    {
                        int ind = i;
                        int jond = j;
                        Optional<List<Integer>> existingClusterFirst = sequentialClusters.stream().filter(sc -> sc.contains(clique.get(ind))).findFirst();
                        Optional<List<Integer>> existingClusterSecond = sequentialClusters.stream().filter(sc -> sc.contains(clique.get(jond))).findFirst();

                        // Both nodes from the clique already placed in clusters
                        if (existingClusterFirst.isPresent() && existingClusterSecond.isPresent())
                        {
                            continue;
                        }

                        // Neither of the nodes is in the cluster
                        else if (!existingClusterFirst.isPresent() && !existingClusterSecond.isPresent())
                        {
                            // if these are consecutive sentences, we make a new cluster
                            if (Math.abs(clique.get(i) - clique.get(j)) == 1)
                            {
                                List<Integer> newCluster = new ArrayList<Integer>();
                                newCluster.add(Math.min(clique.get(i), clique.get(j)));
                                newCluster.add(Math.max(clique.get(i), clique.get(j)));

                                int insertIndex = -1;
                                for(int k = 0; k < sequentialClusters.size(); k++)
                                {
                                    if (newCluster.get(newCluster.size() - 1) < sequentialClusters.get(k).get(0))
                                    {
                                        insertIndex = k;
                                        break;
                                    }
                                }

                                if (insertIndex >= 0) sequentialClusters.add(insertIndex, newCluster);
                                else sequentialClusters.add(newCluster);

                                change = true;
                            }
                        }

                        // one node is in one cluster, the other isn't
                        else
                        {
                            List<Integer> cluster = existingClusterFirst.isPresent() ? existingClusterFirst.get() : existingClusterSecond.get();
                            int node = existingClusterFirst.isPresent() ? clique.get(j) : clique.get(i);

                            if ((node == cluster.get(0) - 1) || (node == cluster.get(cluster.size()-1) + 1))
                            {
                                cluster.add(node);
                                cluster.sort((e1, e2) -> e1 < e2 ? -1 : (e1 > e2 ? 1 : 0));

                                change = true;
                            }
                        }
                    }
                }
            }
        }
    }

    private List<Integer> computeSingletons(List<List<Integer>> cliques, List<List<Integer>> sequentialClusters)
    {
        List<Integer> singletons = new ArrayList<Integer>();
        for(List<Integer> c : cliques)
        {
            for(int n : c)
            {
                if (!sequentialClusters.stream().anyMatch(sc -> sc.contains(n))) singletons.add(n);
            }
        }

        singletons = singletons.stream().distinct().collect(Collectors.toList());
        singletons.sort((s1, s2) -> s1 < s2 ? -1 : (s1 > s2 ? 1 : 0));
        return singletons;
    }

    private void mergeTooSmallSequences(List<List<Integer>> sequentialClusters, Map<Integer, Map<Integer, Double>> allSimilarities, int largestSmallCluster)
    {
        boolean change = true;
        while(change)
        {
            change = false;
            Optional<List<Integer>> firstSmallCluster = sequentialClusters.stream().filter(c -> c.size() <= largestSmallCluster).findFirst();
            if (firstSmallCluster.isPresent())
            {
                int i = sequentialClusters.indexOf(firstSmallCluster.get());
                double similarityPrevious = (i == 0) ? 0 : averageClusterSimilarity(sequentialClusters.get(i-1), sequentialClusters.get(i), allSimilarities);
                double similarityNext = (i == (sequentialClusters.size() - 1)) ? 0 : averageClusterSimilarity(sequentialClusters.get(i), sequentialClusters.get(i+1), allSimilarities);

                List<Integer> clusterToMergeWith = (similarityPrevious > similarityNext) ? sequentialClusters.get(i-1) : sequentialClusters.get(i+1);
                List<Integer> newCluster = new ArrayList<Integer>();
                newCluster.addAll(clusterToMergeWith);
                newCluster.addAll(sequentialClusters.get(i));
                newCluster.sort((i1, i2) -> i1 > i2 ? 1 : (i1 < i2 ? -1 : 0));

                sequentialClusters.add((similarityPrevious > similarityNext) ? i-1 : i, newCluster);
                sequentialClusters.remove(firstSmallCluster.get());
                sequentialClusters.remove(clusterToMergeWith);

                change = true;
            }
        }
    }

    private double averageClusterSimilarity(List<Integer> first, List<Integer> second, Map<Integer, Map<Integer, Double>> allSimilarities)
    {
        double sum = 0;
        for(int i = 0; i < first.size(); i++)
        {
            for(int j = 0; j < second.size(); j++)
            {
                sum += allSimilarities.get(Math.min(first.get(i), second.get(j))).get(Math.max(first.get(i), second.get(j)));
            }
        }
        return sum / ((double)(first.size() * second.size()));
    }

    private void mergeSingletons(List<List<Integer>> cliques, List<List<Integer>> sequentialClusters, Map<Integer, Map<Integer, Double>> allSimilarities)
    {
        List<Integer> singletons = computeSingletons(cliques, sequentialClusters);

        while(singletons.size() > 0)
        {
            if (singletons.size() % 10 == 0) System.out.println("Remaining singletons: " + singletons.size());

            int node = singletons.get(0);
            Optional<List<Integer>> previousNodeCluster = sequentialClusters.stream().filter(sc -> sc.contains(node - 1)).findFirst();
            Optional<List<Integer>> nextNodeCluster = sequentialClusters.stream().filter(sc -> sc.contains(node + 1)).findFirst();

            double similarityPrevious = node == 0 ? -1.0 : (previousNodeCluster.isPresent() ? similarityNodeCluster(node, previousNodeCluster.get(), allSimilarities) : allSimilarities.get(node - 1).get(node));
            double similarityNext = node == allSimilarities.size() ? -1.0 : (nextNodeCluster.isPresent() ? similarityNodeCluster(node, nextNodeCluster.get(), allSimilarities) : allSimilarities.get(node).get(node + 1));

            boolean previous = similarityPrevious >= similarityNext;
            boolean mergeWithCluster = previous ? previousNodeCluster.isPresent() : nextNodeCluster.isPresent();

            if (mergeWithCluster)
            {
                if (previous) previousNodeCluster.get().add(node);
                else nextNodeCluster.get().add(0, node);
            }
            else
            {
                List<Integer> newCluster = new ArrayList<Integer>();
                newCluster.add(previous ? node - 1 : node);
                newCluster.add(previous ? node : node + 1);

                int insertIndex = -1;

                for(int k = 0; k < sequentialClusters.size(); k++)
                {
                    if (newCluster.get(newCluster.size() - 1) < sequentialClusters.get(k).get(0))
                    {
                        insertIndex = k;
                        break;
                    }
                }

                if (insertIndex >= 0) sequentialClusters.add(insertIndex, newCluster);
                else sequentialClusters.add(newCluster);
            }

            singletons = computeSingletons(cliques, sequentialClusters);
        }
    }

    private double similarityNodeCluster(int node, List<Integer> cluster, Map<Integer, Map<Integer, Double>> allSimilarities)
    {
        double average = 0;
        for(Integer n2 : cluster) average += allSimilarities.get(Math.min(node, n2)).get(Math.max(node, n2));
        return average / ((double)cluster.size());
    }
}
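To make the merge phases above concrete, here is a small hand-traced run (toy input of mine, not from the paper). Given five sentences 0-4 and maximal cliques {0,1}, {1,2}, {3,4}: mergeCliques turns the consecutive pair {0,1} into a cluster, extends it to [0,1,2] through the shared node of {1,2}, and opens a second cluster [3,4]; no singletons remain, and with a size parameter of 1 no cluster is small enough to absorb:

    cliques            : {0,1}, {1,2}, {3,4}
    after mergeCliques : [0,1,2], [3,4]
    singletons         : none
    final segments     : [0,1,2], [3,4]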
source/src/edu/uma/nlp/graphseg/GraphHandler.java
ADDED
@@ -0,0 +1,134 @@
package edu.uma.nlp.graphseg;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

import org.jgrapht.UndirectedGraph;
import org.jgrapht.alg.BronKerboschCliqueFinder;
import org.jgrapht.alg.ConnectivityInspector;
import org.jgrapht.alg.KuhnMunkresMinimalWeightBipartitePerfectMatching;
import org.jgrapht.generate.SimpleWeightedBipartiteGraphMatrixGenerator;
import org.jgrapht.generate.WeightedGraphGeneratorAdapter;
import org.jgrapht.graph.DefaultEdge;
import org.jgrapht.graph.DefaultWeightedEdge;
import org.jgrapht.graph.SimpleGraph;
import org.jgrapht.graph.SimpleWeightedGraph;

import edu.uma.nlp.graphseg.preprocessing.Document;
import edu.uma.nlp.graphseg.preprocessing.TokenAnnotation;
import edu.uma.nlp.graphseg.utils.MemoryStorage;

public class GraphHandler {

    private static List<String> stopwords;
    public static void setStopwords(List<String> stwrds)
    {
        stopwords = stwrds;
    }

    private static Map<Integer, Map<Integer, Double>> allSimilarities;
    public static Map<Integer, Map<Integer, Double>> getAllSimilarities()
    {
        return allSimilarities;
    }

    public static UndirectedGraph<Integer, DefaultEdge> constructGraph(List<Document> snippets, double similarityTreshold)
    {
        int localizationSize = 100;
        allSimilarities = new HashMap<Integer, Map<Integer, Double>>();

        UndirectedGraph<Integer, DefaultEdge> graph = new SimpleGraph<Integer, DefaultEdge>(DefaultEdge.class);

        snippets.forEach(s -> graph.addVertex(Integer.parseInt(s.getId())));
        for(int i = 0; i < snippets.size() - 1; i++)
        {
            if (i % 10 == 0) System.out.println("Constructing graph, outer loop " + i + "/" + snippets.size());
            allSimilarities.put(i, new HashMap<Integer, Double>());

            for(int j = i + 1; j < Math.min(snippets.size(), i + localizationSize); j++)
            {
                List<TokenAnnotation> contentTokenFirst = snippets.get(i).getTokens().stream().filter(t -> t.getPartOfSpeech().isContent() && !stopwords.contains(t.getLemma().toLowerCase())).collect(Collectors.toList());
                List<TokenAnnotation> contentTokenSecond = snippets.get(j).getTokens().stream().filter(t -> t.getPartOfSpeech().isContent() && !stopwords.contains(t.getLemma().toLowerCase())).collect(Collectors.toList());

                if (contentTokenFirst.size() == 0 || contentTokenSecond.size() == 0)
                {
                    allSimilarities.get(i).put(j, 0.0);
                    continue;
                }

                // preparing for bipartite graph min matching
                double[][] dissimilarities = new double[Math.max(contentTokenFirst.size(), contentTokenSecond.size())][Math.max(contentTokenFirst.size(), contentTokenSecond.size())];
                List<Integer> firstPartition = new ArrayList<Integer>();
                List<Integer> secondPartition = new ArrayList<Integer>();

                for(int k = 0; k < Math.max(contentTokenFirst.size(), contentTokenSecond.size()); k++)
                {
                    for(int l = 0; l < Math.max(contentTokenFirst.size(), contentTokenSecond.size()); l++)
                    {
                        if (k >= contentTokenFirst.size() || l >= contentTokenSecond.size())
                        {
                            dissimilarities[k][l] = 1;
                        }
                        else
                        {
                            double icFactor = Math.max(MemoryStorage.getInformationContent().getRelativeInformationContent(contentTokenFirst.get(k).getLemma().toLowerCase()), MemoryStorage.getInformationContent().getRelativeInformationContent(contentTokenSecond.get(l).getLemma().toLowerCase()));
                            double simTokens = MemoryStorage.getWordVectorSpace().similarity(contentTokenFirst.get(k).getLemma().toLowerCase(), contentTokenSecond.get(l).getLemma().toLowerCase());
                            if (simTokens < 0) simTokens = 0;

                            dissimilarities[k][l] = 1 - icFactor * simTokens;
                        }
                    }
                }
                for(int z = 0; z < Math.max(contentTokenFirst.size(), contentTokenSecond.size()); z++)
                {
                    firstPartition.add(z);
                    secondPartition.add(z + Math.max(contentTokenFirst.size(), contentTokenSecond.size()));
                }

                double bmScore = minimumAverageBipartiteGraphMatchingScore(dissimilarities, firstPartition, secondPartition) - (Math.abs(contentTokenFirst.size() - contentTokenSecond.size()));
                double similarityNonNormalized = Math.min(contentTokenFirst.size(), contentTokenSecond.size()) - bmScore;
                double similarity = ((similarityNonNormalized / contentTokenFirst.size()) + (similarityNonNormalized / contentTokenSecond.size())) / 2.0;

                //double similarity = SemanticSimilarity.greedyAlignmentOverlapFScore(snippets.get(i).getTokens(), snippets.get(j).getTokens(), MemoryStorage.getWordVectorSpace(), MemoryStorage.getInformationContent(), true);
                allSimilarities.get(i).put(j, similarity);

                if (similarity > similarityTreshold)
                {
                    graph.addEdge(Integer.parseInt(snippets.get(i).getId()), Integer.parseInt(snippets.get(j).getId()));
                }
            }
        }

        return graph;
    }

    public static double minimumAverageBipartiteGraphMatchingScore(double[][] dissimilarities, List<Integer> firstPartition, List<Integer> secondPartition)
    {
        SimpleWeightedGraph<Integer, DefaultWeightedEdge> bipartiteGraph = new SimpleWeightedGraph<>(DefaultWeightedEdge.class);
        WeightedGraphGeneratorAdapter<Integer, DefaultWeightedEdge, Integer> generator =
            new SimpleWeightedBipartiteGraphMatrixGenerator<Integer, DefaultWeightedEdge>()
                .first  (firstPartition)
                .second (secondPartition)
                .weights(dissimilarities);

        generator.generateGraph(bipartiteGraph, null, null);
        KuhnMunkresMinimalWeightBipartitePerfectMatching<Integer, DefaultWeightedEdge> bipartiteMatching = new KuhnMunkresMinimalWeightBipartitePerfectMatching<Integer, DefaultWeightedEdge>(bipartiteGraph, firstPartition, secondPartition);

        return bipartiteMatching.getMatchingWeight();
    }

    public static List<List<Integer>> getAllCliques(UndirectedGraph<Integer, DefaultEdge> graph)
    {
        BronKerboschCliqueFinder<Integer, DefaultEdge> finder = new BronKerboschCliqueFinder<Integer, DefaultEdge>(graph);
        return finder.getAllMaximalCliques().stream().map(x -> x.stream().collect(Collectors.toList())).collect(Collectors.toList());
    }

    public static List<List<Integer>> getAllConnectedComponents(UndirectedGraph<Integer, DefaultEdge> graph)
    {
        ConnectivityInspector<Integer, DefaultEdge> finder = new ConnectivityInspector<Integer, DefaultEdge>(graph);
        return finder.connectedSets().stream().map(x -> x.stream().collect(Collectors.toList())).collect(Collectors.toList());
    }
}
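Reading the similarity that constructGraph stores for a sentence pair (notation mine, not from the source): with m and n content tokens in the two sentences, the dissimilarity matrix is padded to size max(m, n) with unit-cost dummy cells, so the minimal perfect matching of weight W necessarily spends exactly |m - n| on forced dummy assignments. Subtracting that and flipping cost back into similarity recovers the information-content-weighted similarity mass s of the optimal word alignment, which is then normalized against both sentence lengths:

    bmScore = W - |m - n|           (drop the forced dummy-cell cost)
    s       = min(m, n) - bmScore   (aligned similarity mass: sum of icFactor * simTokens)
    sim     = (s/m + s/n) / 2       (average of the two per-sentence normalizations)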
source/src/edu/uma/nlp/graphseg/IOHandler.java
ADDED
@@ -0,0 +1,33 @@
package edu.uma.nlp.graphseg;

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.util.List;

public class IOHandler {
    public static void writeSegmentation(List<String> rawLines, List<List<Integer>> segmentation, String path)
    {
        try {
            File fout = new File(path);
            FileOutputStream fos;
            fos = new FileOutputStream(fout);
            BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(fos));

            for(int i = 0; i < segmentation.size(); i++)
            {
                for(int j = 0; j < segmentation.get(i).size(); j++)
                {
                    bw.write(rawLines.get(segmentation.get(i).get(j)) + "\n");
                }
                bw.write("==========\n");
            }
            bw.close();
        }
        catch (Exception e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
}
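The files written by writeSegmentation are plain text: the sentences of each segment appear on consecutive lines, and every segment is closed by a delimiter line of ten '=' characters, e.g. (illustrative sentences):

    First sentence of the first segment.
    Second sentence of the first segment.
    ==========
    Only sentence of the second segment.
    ==========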
source/src/edu/uma/nlp/graphseg/STSHandler.java
ADDED
@@ -0,0 +1,37 @@
package edu.uma.nlp.graphseg;

import java.util.ArrayList;
import java.util.List;

import org.javatuples.Triplet;

import edu.uma.nlp.graphseg.preprocessing.Document;
import edu.uma.nlp.graphseg.semantics.InformationContent;
import edu.uma.nlp.graphseg.semantics.SemanticSimilarity;
import edu.uma.nlp.graphseg.semantics.WordVectorSpace;


public class STSHandler {
    public static List<Triplet<Document, Document, Double>> getSemanticSimilarities(List<Document> snippets, double simTreshold, WordVectorSpace vectorSpace, InformationContent informationContent)
    {
        List<Triplet<Document, Document, Double>> similarityGraph = new ArrayList<Triplet<Document, Document, Double>>();

        for(int i = 0; i < snippets.size() - 1; i++)
        {
            System.out.println("Outer loop: " + String.valueOf(i+1) + "/" + String.valueOf(snippets.size() - 1));
            for(int j = i + 1; j < snippets.size(); j++)
            {
                //if (j % 100 == 0) System.out.println("Inner loop: " + String.valueOf(j+1) + "/" + String.valueOf(snippets.size()));

                double similarity = SemanticSimilarity.greedyAlignmentOverlapFScore(snippets.get(i).getTokens(), snippets.get(j).getTokens(), vectorSpace, informationContent, true);
                if (similarity > simTreshold)
                {
                    similarityGraph.add(new Triplet<Document, Document, Double>(snippets.get(i), snippets.get(j), similarity));
                }
            }
        }

        similarityGraph.sort((i1, i2) -> i1.getValue2() > i2.getValue2() ? -1 : (i1.getValue2() < i2.getValue2() ? 1 : 0));
        return similarityGraph;
    }
}
source/src/edu/uma/nlp/graphseg/Start.java
ADDED
@@ -0,0 +1,122 @@
package edu.uma.nlp.graphseg;

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

import org.apache.commons.io.FileUtils;
import org.jgrapht.UndirectedGraph;
import org.jgrapht.graph.DefaultEdge;

import edu.uma.nlp.graphseg.preprocessing.Document;
import edu.uma.nlp.graphseg.preprocessing.StanfordAnnotator;
import edu.uma.nlp.graphseg.semantics.InformationContent;
import edu.uma.nlp.graphseg.semantics.SemanticSimilarity;
import edu.uma.nlp.graphseg.semantics.WordVectorSpace;
import edu.uma.nlp.graphseg.utils.ApplicationConfiguration;
import edu.uma.nlp.graphseg.utils.IOHelper;
import edu.uma.nlp.graphseg.utils.MemoryStorage;

public class Start {

    public static void main(String[] args) throws NumberFormatException, IOException {
        // TODO Auto-generated method stub

        // checking the arguments
        if (args.length < 4)
        {
            System.out.println("USAGE: java -jar graphseg.jar <input-dir> <output-dir> <rel-treshold> <min-segment>");
            return;
        }

        File inputDirFile = new File(args[0]);
        File outputDirFile = new File(args[1]);

        if (!inputDirFile.exists() || !outputDirFile.exists() || !inputDirFile.isDirectory() || !outputDirFile.isDirectory())
        {
            System.out.println("USAGE: java -jar graphseg.jar <input-dir> <output-dir> <rel-treshold (double, <0,1>)> <min-segment (int)>");
            return;
        }

        double treshold = 0;
        try
        {
            treshold = Double.parseDouble(args[2]);
            if (treshold < 0 || treshold > 1)
            {
                throw new UnsupportedOperationException();
            }
        }
        catch(NumberFormatException ex)
        {
            System.out.println("USAGE: java -jar graphseg.jar <input-dir> <output-dir> <rel-treshold (double, <0,1>)> <min-segment (int)>");
            return;
        }

        int minseg = 0;
        try
        {
            minseg = Integer.parseInt(args[3]);
            if (minseg < 1)
            {
                throw new UnsupportedOperationException();
            }
        }
        catch(NumberFormatException ex)
        {
            System.out.println("USAGE: java -jar graphseg.jar <input-dir> <output-dir> <rel-treshold (double, <0,1>)> <min-segment (int, >=1)>");
            return;
        }

        List<String> stopwords = IOHelper.getAllLines(ApplicationConfiguration.config.getValue("stop-words-path"));
        MemoryStorage.setWordVectorSpace(new WordVectorSpace());
        MemoryStorage.getWordVectorSpace().load(ApplicationConfiguration.config.getValue("word-vec-path"), null);

        MemoryStorage.setInformationContent(new InformationContent(ApplicationConfiguration.config.getValue("inf-cont-path"), 1));


        SemanticSimilarity.setStopwords(stopwords);
        GraphHandler.setStopwords(stopwords);

        StanfordAnnotator annotator = new StanfordAnnotator();


        for(Path file : Files.walk(Paths.get(args[0])).filter(x -> (new File(x.toString()).isFile())).collect(Collectors.toList()))
        {
            System.out.println("Segmenting file: " + file.toString());

            annotator.setStanfordAnnotators(new ArrayList<String>(Arrays.asList("tokenize", "ssplit")));

            String content = FileUtils.readFileToString(new File(file.toString()));
            Document doc = new Document();
            doc.setText(content);
            annotator.annotate(doc);

            annotator.setStanfordAnnotators(new ArrayList<String>(Arrays.asList("tokenize", "ssplit", "pos", "lemma")));

            List<Document> snippets = new ArrayList<Document>();
            for(int i = 0; i < doc.getSentences().size(); i++)
            {
                Document snippet = new Document(doc.getSentences().get(i).getText());
                annotator.annotate(snippet);
                snippet.setId(String.valueOf(i));
                snippets.add(snippet);
            }

            UndirectedGraph<Integer, DefaultEdge> graph = GraphHandler.constructGraph(snippets, treshold);
            System.out.println("Computing cliques...");
            List<List<Integer>> cliques = GraphHandler.getAllCliques(graph);

            ClusteringHandler clusterer = new ClusteringHandler();
            System.out.println("Constructing linear segments...");
            List<List<Integer>> sequentialClusters = clusterer.getSequentialClusters(cliques, GraphHandler.getAllSimilarities(), minseg);
            IOHandler.writeSegmentation(doc.getSentences().stream().map(x -> x.getText()).collect(Collectors.toList()), sequentialClusters, args[1] + (args[1].endsWith("/") ? "" : "/") + file.getFileName().toString());
        }
    }
}
source/src/edu/uma/nlp/graphseg/preprocessing/Annotation.java
ADDED
@@ -0,0 +1,36 @@
package edu.uma.nlp.graphseg.preprocessing;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;

public class Annotation {

    protected HashMap<AnnotationType, List<Annotation>> childAnnotations;

    public Annotation()
    {
        childAnnotations = new HashMap<AnnotationType, List<Annotation>>();
    }

    public List<Annotation> getChildAnnotations(AnnotationType type)
    {
        if (childAnnotations.containsKey(type)) return childAnnotations.get(type);
        else return new ArrayList<Annotation>();
    }

    public void addChildAnnotation(Annotation annotation, AnnotationType type)
    {
        if (!childAnnotations.containsKey(type)) childAnnotations.put(type, new ArrayList<Annotation>());
        childAnnotations.get(type).add(annotation);
    }

    public void removeChildAnnotation(Annotation annotation)
    {
        if (childAnnotations.containsKey(annotation))
        {
            childAnnotations.remove(annotation);
        }
    }

}
source/src/edu/uma/nlp/graphseg/preprocessing/AnnotationType.java
ADDED
@@ -0,0 +1,14 @@
package edu.uma.nlp.graphseg.preprocessing;

public enum AnnotationType {
    Corpus,
    Document,
    SentenceAnnotation,
    TokenAnnotation,
    MorphologyAnnotation,
    PartOfSpeechAnnotation,
    NamedEntityAnnotation,
    NamedEntityTokenAnnotation,
    ChunkAnnotation,
    DependencyAnnotation
}
source/src/edu/uma/nlp/graphseg/preprocessing/AnnotatorChain.java
ADDED
@@ -0,0 +1,35 @@
package edu.uma.nlp.graphseg.preprocessing;

import java.util.ArrayList;
import java.util.List;

public class AnnotatorChain {

    private List<IAnnotator> chain;

    public AnnotatorChain()
    {
    }


    public AnnotatorChain(List<IAnnotator> annotators)
    {
        chain = annotators;
    }

    public AnnotatorChain addAnnotator(IAnnotator annotator)
    {
        if (chain == null) chain = new ArrayList<IAnnotator>();
        chain.add(annotator);
        return this;
    }

    public void apply(Annotation textUnit)
    {
        for (int i = 0; i < chain.size(); i++)
        {
            chain.get(i).annotate(textUnit);
        }
    }

}
source/src/edu/uma/nlp/graphseg/preprocessing/AnnotatorType.java
ADDED
@@ -0,0 +1,11 @@
package edu.uma.nlp.graphseg.preprocessing;

public enum AnnotatorType
{
    SentenceSplitter,
    Tokenizer,
    POSTagger,
    Morphology,
    NamedEntityExtractor,
    Chunker
}
source/src/edu/uma/nlp/graphseg/preprocessing/Document.java
ADDED
@@ -0,0 +1,110 @@
package edu.uma.nlp.graphseg.preprocessing;

import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

public class Document extends Annotation {

    private String id;

    public String getId() {
        return id;
    }

    public void setId(String id) {
        this.id = id;
    }

    private String path;

    public String getPath() {
        return path;
    }

    public void setPath(String path) {
        this.path = path;
    }

    private String text;

    public String getText() {
        return text;
    }

    public void setText(String text) {
        this.text = text;
    }


    // Lazy loading

    private List<TokenAnnotation> tokens;

    public List<TokenAnnotation> getTokens() {
        if (tokens == null) tokens = loadTokens();
        return tokens;
    }

    private List<SentenceAnnotation> sentences;

    public List<SentenceAnnotation> getSentences() {
        if (sentences == null) sentences = loadSentences();
        return sentences;
    }

    private List<NamedEntityAnnotation> namedEntities;

    public List<NamedEntityAnnotation> getNamedEntities() {
        if (namedEntities == null) namedEntities = loadNamedEntities();
        return namedEntities;
    }

    private List<TokenAnnotation> loadTokens()
    {
        if (getSentences() != null)
        {
            List<TokenAnnotation> toks = new ArrayList<TokenAnnotation>();
            for (int i = 0; i < sentences.size(); i++)
            {
                toks.addAll(sentences.get(i).getTokens());
            }

            toks.sort((t1, t2) -> (t1.getStartPosition() < t2.getStartPosition()) ? -1 : ((t1.getStartPosition() > t2.getStartPosition()) ? 1 : 0));
            return toks;
        }
        else return null;

    }

    private List<SentenceAnnotation> loadSentences()
    {
        if (childAnnotations.containsKey(AnnotationType.SentenceAnnotation))
            return childAnnotations.get(AnnotationType.SentenceAnnotation).stream().map(x -> (SentenceAnnotation)x).collect(Collectors.toList());
        else return null;
    }

    private List<NamedEntityAnnotation> loadNamedEntities()
    {
        if (childAnnotations.containsKey(AnnotationType.NamedEntityAnnotation))
            return childAnnotations.get(AnnotationType.NamedEntityAnnotation).stream().map(x -> (NamedEntityAnnotation)x).collect(Collectors.toList());
        else return null;
    }

    // Ctors

    public Document()
    {
    }

    public Document(String text)
    {
        this.text = text;
    }

    public Document(String id, String text)
    {
        this.id = id;
        this.text = text;
    }
}
source/src/edu/uma/nlp/graphseg/preprocessing/IAnnotator.java
ADDED
@@ -0,0 +1,9 @@
package edu.uma.nlp.graphseg.preprocessing;

import java.util.List;

public interface IAnnotator
{
    void annotate(Annotation textUnit);
    List<Annotation> annotate(String text);
}
source/src/edu/uma/nlp/graphseg/preprocessing/NamedEntityAnnotation.java
ADDED
@@ -0,0 +1,88 @@
package edu.uma.nlp.graphseg.preprocessing;

import java.util.List;
import java.util.stream.Collectors;

import org.apache.commons.lang3.StringUtils;

public class NamedEntityAnnotation extends Annotation {

    private NamedEntityType namedEntityType;

    public NamedEntityType getNamedEntityType() {
        return namedEntityType;
    }

    public void setNamedEntityType(NamedEntityType namedEntityType) {
        this.namedEntityType = namedEntityType;
    }

    private String text;

    public String getText() {
        if ((text == null || StringUtils.isEmpty(text)) && getTokens().size() > 0)
        {
            text = "";
            for(int i = 0; i < tokens.size(); i++)
            {
                text += tokens.get(i).getText() + " ";
            }
            text = text.trim();
        }
        return text;
    }

    public void setText(String text) {
        this.text = text;
    }

    private int startPosition;

    public int getStartPosition() {
        return startPosition;
    }

    public void setStartPosition(int startPosition) {
        this.startPosition = startPosition;
    }

    public Boolean isPerson()
    {
        return namedEntityType == NamedEntityType.Person;
    }

    public Boolean isLocation()
    {
        return namedEntityType == NamedEntityType.Location;
    }

    public Boolean isOrganization()
    {
        return namedEntityType == NamedEntityType.Organization;
    }

    private List<TokenAnnotation> tokens;
    public List<TokenAnnotation> getTokens()
    {
        if (tokens == null) tokens = loadTokens();
        return tokens;
    }

    private List<TokenAnnotation> loadTokens()
    {
        if (childAnnotations.containsKey(AnnotationType.TokenAnnotation)) return childAnnotations.get(AnnotationType.TokenAnnotation).stream().map(x -> (TokenAnnotation)x).collect(Collectors.toList());
        else return null;
    }

    public NamedEntityAnnotation(NamedEntityType type)
    {
        namedEntityType = type;
    }

    public NamedEntityAnnotation(NamedEntityType type, String text, int startPosition)
    {
        this.namedEntityType = type;
        this.text = text;
        this.startPosition = startPosition;
    }
}
source/src/edu/uma/nlp/graphseg/preprocessing/NamedEntityTokenAnnotation.java
ADDED
@@ -0,0 +1,38 @@
package edu.uma.nlp.graphseg.preprocessing;

public class NamedEntityTokenAnnotation extends Annotation {

    private String namedEntityLabel;

    public String getNamedEntityLabel() {
        return namedEntityLabel;
    }

    public void setNamedEntityLabel(String namedEntityLabel) {
        this.namedEntityLabel = namedEntityLabel;
    }

    public NamedEntityTokenAnnotation()
    {
    }

    public NamedEntityTokenAnnotation(String label)
    {
        namedEntityLabel = label;
    }

    public Boolean constitutesNamedEntity()
    {
        return startsNamedEntity() || insideNamedEntity();
    }

    public Boolean startsNamedEntity()
    {
        return namedEntityLabel.startsWith("B");
    }

    public Boolean insideNamedEntity()
    {
        return namedEntityLabel.startsWith("I");
    }
}
source/src/edu/uma/nlp/graphseg/preprocessing/NamedEntityType.java
ADDED
@@ -0,0 +1,18 @@
package edu.uma.nlp.graphseg.preprocessing;

public enum NamedEntityType {
    Person,
    Location,
    Organization,
    Money,
    Percentage,
    Date,
    Time,
    Ordinal,
    Percent,
    Number,
    Set,
    Duration,
    Misc,
    None
}
source/src/edu/uma/nlp/graphseg/preprocessing/PartOfSpeechAnnotation.java
ADDED
@@ -0,0 +1,69 @@
package edu.uma.nlp.graphseg.preprocessing;

import java.util.Arrays;
import java.util.List;

public class PartOfSpeechAnnotation extends Annotation {

    private String tag;

    public String getTag() {
        return tag;
    }

    public void setTag(String tag) {
        this.tag = tag;
    }

    private String chunkTag;

    public String getChunkTag() {
        return chunkTag;
    }

    public void setChunkTag(String chunkTag) {
        this.chunkTag = chunkTag;
    }

    private String coarseTag;
    public String getCoarseTag() {
        if (tag != null) coarseTag = loadCoarsePoSTag();
        return coarseTag;
    }

    private String loadCoarsePoSTag()
    {
        if (isNoun()) return "N";
        else if (isVerb()) return "V";
        else if (isAdjective()) return "J";
        else if (isAdverb()) return "R";
        else return "O";
    }

    public Boolean isContent()
    {
        List<String> otherContentPOS = Arrays.asList("CD", "FW", "MD", "SYM", "UH");
        return isNoun() || isVerb() || isAdjective() || isAdverb() || otherContentPOS.contains(tag);
    }

    public Boolean isNoun()
    {
        return tag != null && tag.startsWith("N");
    }

    public Boolean isVerb()
    {
        return tag != null && tag.startsWith("V");
    }

    public Boolean isAdjective()
    {
        return tag != null && tag.startsWith("J");
    }

    public Boolean isAdverb()
    {
        return tag != null && tag.startsWith("RB");
    }

}
source/src/edu/uma/nlp/graphseg/preprocessing/SentenceAnnotation.java
ADDED
@@ -0,0 +1,66 @@
package edu.uma.nlp.graphseg.preprocessing;

import java.util.List;
import java.util.stream.Collectors;

public class SentenceAnnotation extends Annotation {

    // Fields & properties

    private String text;

    public String getText() {
        return text;
    }

    public void setText(String text) {
        this.text = text;
    }

    private int startPosition;

    public int getStartPosition() {
        return startPosition;
    }

    public void setStartPosition(int startPosition) {
        this.startPosition = startPosition;
    }

    public int getEndPosition()
    {
        return text != null ? startPosition + text.length() - 1 : startPosition;
    }

    // Lazy properties

    private List<TokenAnnotation> tokens;
    public List<TokenAnnotation> getTokens()
    {
        if (tokens == null) tokens = loadTokens();
        return tokens;
    }

    public void setTokens(List<TokenAnnotation> tokens)
    {
        this.tokens = tokens;
    }

    private List<TokenAnnotation> loadTokens()
    {
        if (childAnnotations.containsKey(AnnotationType.TokenAnnotation)) return childAnnotations.get(AnnotationType.TokenAnnotation).stream().map(x -> (TokenAnnotation)x).collect(Collectors.toList());
        else return null;
    }


    // Ctors
    public SentenceAnnotation()
    {
    }

    public SentenceAnnotation(String text, int startPosition)
    {
        this.text = text;
        this.startPosition = startPosition;
    }
}
source/src/edu/uma/nlp/graphseg/preprocessing/StanfordAnnotator.java
ADDED
@@ -0,0 +1,142 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
package edu.uma.nlp.graphseg.preprocessing;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations.CharacterOffsetBeginAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.NamedEntityTagAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TextAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

public class StanfordAnnotator implements IAnnotator {

    private List<String> stanfordAnnotators;
    private String stanfordAnnotatorsString;

    public void setStanfordAnnotators(List<String> stanfordAnnotators) {
        this.stanfordAnnotators = stanfordAnnotators;
        // comma-separated annotator list expected by the CoreNLP "annotators" property
        stanfordAnnotatorsString = String.join(", ", this.stanfordAnnotators);
    }

    @Override
    public void annotate(Annotation textUnit)
    {
        if (!(textUnit instanceof Document))
            throw new UnsupportedOperationException("Only whole documents can be processed by Stanford's CoreNLP pipeline");

        Properties props = new Properties();
        props.setProperty("annotators", stanfordAnnotatorsString);

        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        edu.stanford.nlp.pipeline.Annotation docAnnotation = new edu.stanford.nlp.pipeline.Annotation(((Document) textUnit).getText());
        pipeline.annotate(docAnnotation);

        List<CoreMap> sentences = docAnnotation.get(SentencesAnnotation.class);

        for (CoreMap stanfordSentence : sentences)
        {
            SentenceAnnotation sentence = new SentenceAnnotation();
            sentence.setText(stanfordSentence.get(TextAnnotation.class));
            sentence.setStartPosition(stanfordSentence.get(CharacterOffsetBeginAnnotation.class));

            for (CoreLabel stanfordToken : stanfordSentence.get(TokensAnnotation.class))
            {
                TokenAnnotation token = new TokenAnnotation(stanfordToken.get(TextAnnotation.class));
                token.setStartPosition(stanfordToken.beginPosition());
                token.setSentenceIndex(stanfordToken.sentIndex());

                if (stanfordAnnotators.contains("lemma"))
                {
                    token.setLemma(stanfordToken.lemma());
                }

                if (stanfordAnnotators.contains("pos"))
                {
                    PartOfSpeechAnnotation posAnnotation = new PartOfSpeechAnnotation();
                    posAnnotation.setTag(stanfordToken.get(edu.stanford.nlp.ling.CoreAnnotations.PartOfSpeechAnnotation.class));
                    token.addChildAnnotation(posAnnotation, AnnotationType.PartOfSpeechAnnotation);
                }

                if (stanfordAnnotators.contains("ner"))
                {
                    NamedEntityTokenAnnotation neta = new NamedEntityTokenAnnotation(stanfordToken.get(NamedEntityTagAnnotation.class));
                    token.addChildAnnotation(neta, AnnotationType.NamedEntityTokenAnnotation);
                }

                sentence.addChildAnnotation(token, AnnotationType.TokenAnnotation);
            }

            // linking contiguous token-level NE labels into whole named-entity annotations
            if (stanfordAnnotators.contains("ner"))
            {
                List<NamedEntityAnnotation> nes = new ArrayList<NamedEntityAnnotation>();
                NamedEntityAnnotation ne = null;
                for (int i = 0; i < sentence.getTokens().size(); i++)
                {
                    String neLabel = sentence.getTokens().get(i).getNamedEntityLabel().getNamedEntityLabel();
                    String neLabelPrevious = i > 0 ? sentence.getTokens().get(i - 1).getNamedEntityLabel().getNamedEntityLabel() : "O";

                    if (neLabel.compareTo("O") == 0)
                    {
                        if (ne != null)
                        {
                            nes.add(ne);
                            ne = null;
                        }
                    }
                    else if (neLabel.compareTo(neLabelPrevious) != 0)
                    {
                        // close the open entity first, so adjacent entities of different types are both kept
                        if (ne != null) nes.add(ne);

                        NamedEntityType type = Arrays.stream(NamedEntityType.values()).filter(e -> e.name().equalsIgnoreCase(neLabel)).findAny().orElse(null);
                        if (type == null)
                        {
                            throw new UnsupportedOperationException("Unknown named entity type!");
                        }

                        ne = new NamedEntityAnnotation(type);
                        ne.setStartPosition(sentence.getTokens().get(i).getStartPosition());
                        ne.addChildAnnotation(sentence.getTokens().get(i), AnnotationType.TokenAnnotation);
                    }
                    else
                    {
                        ne.addChildAnnotation(sentence.getTokens().get(i), AnnotationType.TokenAnnotation);
                    }
                }
                if (ne != null) nes.add(ne);

                nes.forEach(n -> textUnit.addChildAnnotation(n, AnnotationType.NamedEntityAnnotation));
            }

            textUnit.addChildAnnotation(sentence, AnnotationType.SentenceAnnotation);
        }

        // coreference crosses sentence borders
        if (stanfordAnnotators.contains("dcoref"))
        {
            // TODO: coref annotations
        }
    }

    @Override
    public List<Annotation> annotate(String text)
    {
        Document document = new Document(text);
        annotate(document);

        return new ArrayList<Annotation>(Arrays.asList(document));
    }
}
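A minimal usage sketch for the annotator above (the annotator names are the standard CoreNLP ones the class checks for; the snippet itself is illustrative and not part of the commit):

    StanfordAnnotator annotator = new StanfordAnnotator();
    annotator.setStanfordAnnotators(Arrays.asList("tokenize", "ssplit", "pos", "lemma", "ner"));
    Document doc = new Document("Angela Merkel visited Paris on Monday.");
    annotator.annotate(doc); // sentence, token, POS, lemma, and NE annotations are attached to doc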
source/src/edu/uma/nlp/graphseg/preprocessing/TokenAnnotation.java
ADDED
@@ -0,0 +1,104 @@
package edu.uma.nlp.graphseg.preprocessing;

public class TokenAnnotation extends Annotation {

    // Fields & properties

    private String text;
    public String getText() {
        return text;
    }
    public void setText(String text) {
        this.text = text;
    }

    private String lemma;
    public String getLemma() {
        return lemma;
    }
    public void setLemma(String lemma) {
        this.lemma = lemma;
    }

    private int startPosition;
    public int getStartPosition() {
        return startPosition;
    }
    public void setStartPosition(int startPosition) {
        this.startPosition = startPosition;
    }

    public int getEndPosition() {
        return text != null ? startPosition + text.length() - 1 : startPosition;
    }

    private int startPositionSentence;
    public int getStartPositionSentence() {
        return startPositionSentence;
    }
    public void setStartPositionSentence(int startPositionSentence) {
        this.startPositionSentence = startPositionSentence;
    }

    public int getEndPositionSentence() {
        return text != null ? startPositionSentence + text.length() - 1 : startPositionSentence;
    }

    private int sentenceIndex;
    public int getSentenceIndex() {
        return sentenceIndex;
    }
    public void setSentenceIndex(int sentenceIndex) {
        this.sentenceIndex = sentenceIndex;
    }

    // Lazily loaded properties

    private PartOfSpeechAnnotation partOfSpeech;
    public PartOfSpeechAnnotation getPartOfSpeech()
    {
        if (partOfSpeech == null) partOfSpeech = loadPartOfSpeech();
        return partOfSpeech;
    }

    private NamedEntityTokenAnnotation namedEntityLabel;
    public NamedEntityTokenAnnotation getNamedEntityLabel()
    {
        if (namedEntityLabel == null) namedEntityLabel = loadTokenNELabel();
        return namedEntityLabel;
    }

    private PartOfSpeechAnnotation loadPartOfSpeech()
    {
        if (!childAnnotations.containsKey(AnnotationType.PartOfSpeechAnnotation)) this.addChildAnnotation(new PartOfSpeechAnnotation(), AnnotationType.PartOfSpeechAnnotation);
        return ((PartOfSpeechAnnotation) (getChildAnnotations(AnnotationType.PartOfSpeechAnnotation).get(0)));
    }

    private NamedEntityTokenAnnotation loadTokenNELabel()
    {
        if (!childAnnotations.containsKey(AnnotationType.NamedEntityTokenAnnotation)) return null;
        else return ((NamedEntityTokenAnnotation) (childAnnotations.get(AnnotationType.NamedEntityTokenAnnotation).get(0)));
    }

    public TokenAnnotation(String text, int startPosition, int startPositionSentence, int sentenceIndex)
    {
        this.text = text;
        this.startPosition = startPosition;
        this.startPositionSentence = startPositionSentence;
        this.sentenceIndex = sentenceIndex;
    }

    public TokenAnnotation(String text)
    {
        this.text = text;
    }
}
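Note that the end positions above are inclusive (start + length - 1). A tiny illustrative check:

    TokenAnnotation t = new TokenAnnotation("text", 10, 0, 0);
    int end = t.getEndPosition(); // 13: offset of the last character, not one past it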
source/src/edu/uma/nlp/graphseg/semantics/InformationContent.java
ADDED
@@ -0,0 +1,77 @@
package edu.uma.nlp.graphseg.semantics;

import java.io.IOException;
import java.io.InputStream;
import java.io.UnsupportedEncodingException;
import java.util.HashMap;
import java.util.List;

import edu.uma.nlp.graphseg.utils.IOHelper;

public class InformationContent {
    private HashMap<String, Integer> frequencies = new HashMap<String, Integer>();
    private double sumFrequencies = 0;
    private double minFreq = 1;
    private double divideFactor = 1;

    public InformationContent(String path, double divideFactor)
    {
        this.divideFactor = divideFactor;

        frequencies = IOHelper.loadCounts(path);
        sumFrequencies = frequencies.values().stream().mapToDouble(x -> ((double) x) / divideFactor).sum();
        minFreq = frequencies.values().stream().mapToDouble(x -> ((double) x) / divideFactor).min().getAsDouble();
        if (minFreq == 0) minFreq = 1;
    }

    public InformationContent(InputStream stream, double divideFactor) throws UnsupportedEncodingException, IOException
    {
        this.divideFactor = divideFactor;

        frequencies = IOHelper.loadCounts(stream);
        sumFrequencies = frequencies.values().stream().mapToDouble(x -> ((double) x) / divideFactor).sum();
        minFreq = frequencies.values().stream().mapToDouble(x -> ((double) x) / divideFactor).min().getAsDouble();
        if (minFreq == 0) minFreq = 1;
    }

    public InformationContent(HashMap<String, Integer> frequenciesDictionary, double divideFactor)
    {
        this.divideFactor = divideFactor;

        frequencies = frequenciesDictionary;
        sumFrequencies = frequencies.values().stream().mapToDouble(x -> ((double) x) / divideFactor).sum();
        minFreq = frequencies.values().stream().mapToDouble(x -> ((double) x) / divideFactor).min().getAsDouble();
        if (minFreq == 0) minFreq = 1; // guard added for consistency with the other constructors
    }

    // IC(w) = -log(((freq(w) + minFreq) / divideFactor) / sumFrequencies); unseen words get the maximum IC
    public double getInformationContent(String word)
    {
        if (frequencies.containsKey(word.toLowerCase())) return (-1) * Math.log(((((double) frequencies.get(word.toLowerCase())) + minFreq) / divideFactor) / sumFrequencies);
        else return (-1) * Math.log((minFreq / divideFactor) / sumFrequencies);
    }

    public double getRelativeInformationContent(String word)
    {
        double maxInfCont = (-1) * Math.log((minFreq / divideFactor) / sumFrequencies);
        double infCont = frequencies.containsKey(word.toLowerCase()) ? (-1) * Math.log(((((double) frequencies.get(word.toLowerCase())) + minFreq) / divideFactor) / sumFrequencies) : maxInfCont;

        return infCont / maxInfCont;
    }

    public double getLogRelativeInformationContent(String word)
    {
        double maxInfCont = (-1) * Math.log((minFreq / divideFactor) / sumFrequencies);
        double infCont = frequencies.containsKey(word.toLowerCase()) ? (-1) * Math.log(((((double) frequencies.get(word.toLowerCase())) + minFreq) / divideFactor) / sumFrequencies) : maxInfCont;

        return Math.log(infCont) / Math.log(maxInfCont);
    }

    public double getInformationContent(List<String> phrase)
    {
        double ic = 1;
        for (String w : phrase)
        {
            ic *= getInformationContent(w);
        }
        return ic;
    }
}
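The information content above is a smoothed negative log relative frequency: frequent words score near zero, rare or unseen words score high. A minimal sketch of the same computation on toy counts (hypothetical values, not from the shipped frequency files):

    HashMap<String, Integer> counts = new HashMap<String, Integer>();
    counts.put("the", 1000);        // frequent word
    counts.put("segmentation", 2);  // rare word
    InformationContent ic = new InformationContent(counts, 1.0);
    System.out.println(ic.getInformationContent("the"));          // 0.0: -ln((1000+2)/1002)
    System.out.println(ic.getInformationContent("segmentation")); // ~5.52: -ln((2+2)/1002)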
source/src/edu/uma/nlp/graphseg/semantics/SemanticSimilarity.java
ADDED
@@ -0,0 +1,252 @@
package edu.uma.nlp.graphseg.semantics;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.stream.Collectors;

import org.javatuples.Triplet;

import edu.uma.nlp.graphseg.preprocessing.TokenAnnotation;
import edu.uma.nlp.graphseg.utils.VectorOperations;

public class SemanticSimilarity {

    private static List<String> stopwords;
    public static void setStopwords(List<String> stwrds)
    {
        stopwords = stwrds;
    }

    public static double greedyAlignmentOverlapFScore(List<TokenAnnotation> firstPhrase, List<TokenAnnotation> secondPhrase, WordVectorSpace vectorSpace, InformationContent informationContent, Boolean contentWordsOnly)
    {
        return greedyAlignmentOverlap(firstPhrase, secondPhrase, vectorSpace, informationContent, contentWordsOnly).getValue2();
    }

    public static double greedyAlignmentOverlapPrecision(List<TokenAnnotation> firstPhrase, List<TokenAnnotation> secondPhrase, WordVectorSpace vectorSpace, InformationContent informationContent, Boolean contentWordsOnly)
    {
        return greedyAlignmentOverlap(firstPhrase, secondPhrase, vectorSpace, informationContent, contentWordsOnly).getValue0();
    }

    public static double greedyAlignmentOverlapRecall(List<TokenAnnotation> firstPhrase, List<TokenAnnotation> secondPhrase, WordVectorSpace vectorSpace, InformationContent informationContent, Boolean contentWordsOnly)
    {
        return greedyAlignmentOverlap(firstPhrase, secondPhrase, vectorSpace, informationContent, contentWordsOnly).getValue1();
    }

    // returns (precision, recall, F1) of the greedy word alignment between the two phrases
    public static Triplet<Double, Double, Double> greedyAlignmentOverlap(List<TokenAnnotation> firstPhrase, List<TokenAnnotation> secondPhrase, WordVectorSpace vectorSpace, InformationContent informationContent, Boolean contentWordsOnly)
    {
        List<TokenAnnotation> firstPhraseCopy = new ArrayList<TokenAnnotation>();
        List<TokenAnnotation> secondPhraseCopy = new ArrayList<TokenAnnotation>();
        if (contentWordsOnly)
        {
            firstPhraseCopy.addAll(firstPhrase.stream().filter(x -> x.getPartOfSpeech().isContent()).collect(Collectors.toList()));
            secondPhraseCopy.addAll(secondPhrase.stream().filter(x -> x.getPartOfSpeech().isContent()).collect(Collectors.toList()));
        }
        else
        {
            firstPhraseCopy.addAll(firstPhrase);
            secondPhraseCopy.addAll(secondPhrase);
        }

        if (stopwords != null && stopwords.size() > 0)
        {
            firstPhraseCopy = firstPhraseCopy.stream().filter(t -> !stopwords.contains(t.getLemma().toLowerCase()) && !stopwords.contains(t.getText().toLowerCase())).collect(Collectors.toList());
            secondPhraseCopy = secondPhraseCopy.stream().filter(t -> !stopwords.contains(t.getLemma().toLowerCase()) && !stopwords.contains(t.getText().toLowerCase())).collect(Collectors.toList());
        }

        // greedily pair the most similar remaining tokens until one phrase is exhausted
        List<Double> pairSimilarities = new ArrayList<Double>();
        while (firstPhraseCopy.size() > 0 && secondPhraseCopy.size() > 0)
        {
            double maxSim = -1;
            TokenAnnotation firstToken = null;
            TokenAnnotation secondToken = null;
            for (TokenAnnotation nf : firstPhraseCopy)
            {
                for (TokenAnnotation ns : secondPhraseCopy)
                {
                    double sim = vectorSpace.similarity(nf.getText().toLowerCase(), ns.getText().toLowerCase());
                    if (sim < 0) sim = 0;

                    if (sim > maxSim)
                    {
                        firstToken = nf;
                        secondToken = ns;
                        maxSim = sim;
                    }
                }
            }

            if (informationContent != null)
            {
                pairSimilarities.add(maxSim * Math.max(informationContent.getInformationContent(firstToken.getText().toLowerCase()), informationContent.getInformationContent(secondToken.getText().toLowerCase())));
            }
            else pairSimilarities.add(maxSim);

            firstPhraseCopy.remove(firstToken);
            secondPhraseCopy.remove(secondToken);
        }

        double precision = 0;
        double recall = 0;
        double overlap = pairSimilarities.stream().mapToDouble(s -> s).sum();

        if (informationContent != null)
        {
            double infContentFirst = contentWordsOnly ?
                firstPhrase.stream().filter(x -> x.getPartOfSpeech().isContent()).mapToDouble(t -> informationContent.getInformationContent(t.getText().toLowerCase())).sum() :
                firstPhrase.stream().mapToDouble(t -> informationContent.getInformationContent(t.getText().toLowerCase())).sum();

            double infContentSecond = contentWordsOnly ?
                secondPhrase.stream().filter(x -> x.getPartOfSpeech().isContent()).mapToDouble(t -> informationContent.getInformationContent(t.getText().toLowerCase())).sum() :
                secondPhrase.stream().mapToDouble(t -> informationContent.getInformationContent(t.getText().toLowerCase())).sum();

            precision = overlap / infContentFirst;
            recall = overlap / infContentSecond;
        }
        else
        {
            precision = overlap / firstPhrase.size();
            recall = overlap / secondPhrase.size();
        }

        double fScore = 0;
        if (precision == 0 && recall == 0) fScore = 0;
        else fScore = (2 * precision * recall) / (precision + recall);
        if (Double.isNaN(fScore)) fScore = 0;

        return new Triplet<Double, Double, Double>(precision, recall, fScore);
    }

    public static double embeddingSumSimilarity(List<TokenAnnotation> first, List<TokenAnnotation> second, WordVectorSpace vectorSpace, int embeddingLength, Boolean content, List<InformationContent> infContents)
    {
        double[] embeddingFirst = new double[embeddingLength];
        double[] embeddingSecond = new double[embeddingLength];

        if (content)
        {
            first = first.stream().filter(x -> x.getPartOfSpeech().isContent()).collect(Collectors.toList());
            second = second.stream().filter(x -> x.getPartOfSpeech().isContent()).collect(Collectors.toList());
        }

        addWeightedEmbeddings(first, embeddingFirst, vectorSpace, infContents);
        addWeightedEmbeddings(second, embeddingSecond, vectorSpace, infContents);

        double res;
        try {
            res = VectorOperations.cosine(embeddingFirst, embeddingSecond);
        } catch (Exception e) {
            e.printStackTrace();
            res = 0;
        }
        if (Double.isNaN(res)) res = 0;

        return res;
    }

    // adds the IC-weighted embedding of each token to the aggregate vector
    private static void addWeightedEmbeddings(List<TokenAnnotation> tokens, double[] aggregate, WordVectorSpace vectorSpace, List<InformationContent> infContents)
    {
        tokens.forEach(x ->
        {
            double[] wordEmbedding = vectorSpace.getEmbedding(x.getText().trim());
            if (wordEmbedding == null)
            {
                wordEmbedding = vectorSpace.getEmbedding(x.getText().trim().toLowerCase());
            }
            if (wordEmbedding != null)
            {
                double ic = 1;
                for (InformationContent inco : infContents)
                {
                    ic *= inco.getInformationContent(x.getText().trim().toLowerCase());
                }
                // copy before scaling so the shared embedding stored in the vector space is not mutated
                double[] weighted = Arrays.copyOf(wordEmbedding, wordEmbedding.length);
                VectorOperations.multiply(weighted, ic);
                VectorOperations.addVector(aggregate, weighted);
            }
        });
    }

    public static double averagePhraseGreedyAlignmentOverlap(List<List<TokenAnnotation>> firstPhrases, List<List<TokenAnnotation>> secondPhrases, WordVectorSpace vectorSpace, InformationContent informationContent, Boolean contentWordsOnly)
    {
        double sum = 0;
        double counter = 0;

        for (List<TokenAnnotation> fp : firstPhrases) {
            for (List<TokenAnnotation> sp : secondPhrases) {
                sum += greedyAlignmentOverlapFScore(fp, sp, vectorSpace, informationContent, contentWordsOnly);
                counter++;
            }
        }

        double score = sum / counter;
        if (Double.isNaN(score) || Double.isInfinite(score)) return 0;
        else return score;
    }

    public static double maxPhraseGreedyAlignmentOverlap(List<List<TokenAnnotation>> firstPhrases, List<List<TokenAnnotation>> secondPhrases, WordVectorSpace vectorSpace, InformationContent informationContent, Boolean contentWordsOnly)
    {
        double maxSim = 0;

        for (List<TokenAnnotation> fp : firstPhrases) {
            for (List<TokenAnnotation> sp : secondPhrases) {
                double sim = greedyAlignmentOverlapFScore(fp, sp, vectorSpace, informationContent, contentWordsOnly);
                if (sim > maxSim)
                {
                    maxSim = sim;
                }
            }
        }
        return maxSim;
    }

    public static int numSufficientlySimilarPhrasesGreedyAlignmentOverlap(List<List<TokenAnnotation>> firstPhrases, List<List<TokenAnnotation>> secondPhrases, double threshold, WordVectorSpace vectorSpace, InformationContent informationContent, Boolean contentWordsOnly)
    {
        int counter = 0;

        for (List<TokenAnnotation> fp : firstPhrases) {
            for (List<TokenAnnotation> sp : secondPhrases) {
                double sim = greedyAlignmentOverlapFScore(fp, sp, vectorSpace, informationContent, contentWordsOnly);
                if (sim >= threshold)
                {
                    counter++;
                }
            }
        }
        return counter;
    }

    public static HashMap<String, Double> allToAllSimilarity(WordVectorSpace vectorSpace, List<String> vocabulary)
    {
        HashMap<String, Double> similarities = new HashMap<String, Double>();
        for (int i = 0; i < vocabulary.size() - 1; i++)
        {
            if (i % 100 == 0) System.out.println("Outer loop: " + (i + 1) + "/" + (vocabulary.size() - 1));
            for (int j = i + 1; j < vocabulary.size(); j++)
            {
                double sim = vectorSpace.similarity(vocabulary.get(i), vocabulary.get(j));
                if (sim >= -1) // similarity() returns -2 for out-of-vocabulary words
                {
                    similarities.put(vocabulary.get(i).compareTo(vocabulary.get(j)) < 0 ? vocabulary.get(i) + "<=>" + vocabulary.get(j) : vocabulary.get(j) + "<=>" + vocabulary.get(i), sim);
                }
            }
        }
        return similarities;
    }
}
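A minimal sketch of invoking the greedy alignment overlap above (firstSentence and secondSentence are assumed to be SentenceAnnotation instances produced by StanfordAnnotator; the snippet is illustrative):

    List<TokenAnnotation> s1 = firstSentence.getTokens();
    List<TokenAnnotation> s2 = secondSentence.getTokens();
    double fScore = SemanticSimilarity.greedyAlignmentOverlapFScore(
            s1, s2, MemoryStorage.getWordVectorSpace(), MemoryStorage.getInformationContent(), true);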
source/src/edu/uma/nlp/graphseg/semantics/WordVectorSpace.java
ADDED
@@ -0,0 +1,151 @@
package edu.uma.nlp.graphseg.semantics;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;

import org.javatuples.Pair;

import edu.uma.nlp.graphseg.utils.VectorOperations;

public class WordVectorSpace {

    private HashMap<String, double[]> embeddings;
    private int dimension;

    public int getDimension() {
        return dimension;
    }

    // loads textual embeddings ("word v1 v2 ... vn" per line); if filters is non-null, only its words are kept
    public void load(String path, HashMap<String, Integer> filters) throws FileNotFoundException, IOException
    {
        embeddings = new HashMap<String, double[]>();

        try (BufferedReader br = new BufferedReader(new FileReader(path))) {
            String line;
            int counter = 0;
            while ((line = br.readLine()) != null) {
                try
                {
                    String[] split = line.trim().split("\\s+");

                    if (filters == null || filters.containsKey(split[0].toLowerCase()))
                    {
                        dimension = split.length - 1;

                        if (!embeddings.containsKey(split[0])) embeddings.put(split[0], new double[split.length - 1]);
                        for (int i = 1; i < split.length; i++)
                        {
                            embeddings.get(split[0])[i - 1] = Double.parseDouble(split[i]);
                        }
                    }
                    counter++;
                    if (counter % 1000 == 0)
                    {
                        System.out.println("Loading vectors... " + counter);
                    }
                }
                catch (Exception e)
                {
                    System.out.println("Error processing line!");
                }
            }
        }
    }

    public void save(String path) throws Exception
    {
        BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(new File(path))));

        embeddings.forEach((key, value) -> {
            try {
                writer.write(key + " ");
                for (int i = 0; i < value.length; i++)
                {
                    writer.write(String.valueOf(value[i]) + " ");
                }
                writer.newLine();
            } catch (IOException e) {
                e.printStackTrace();
            }
        });
        writer.close();
    }

    // cosine similarity of two words; returns -2 if either word is missing from the vector space
    public double similarity(String word1, String word2)
    {
        if (word1.compareTo(word2) == 0) return 1;
        if (embeddings.containsKey(word1) && embeddings.containsKey(word2))
        {
            try {
                return VectorOperations.cosine(embeddings.get(word1), embeddings.get(word2));
            } catch (Exception e) {
                return -2;
            }
        }
        else return -2;
    }

    public double[] getEmbedding(String word)
    {
        if (embeddings.containsKey(word)) return embeddings.get(word);
        else return null;
    }

    public List<Pair<String, Double>> getMostSimilar(String word, int numMostSimilar)
    {
        List<Pair<String, Double>> mostSimilar = new ArrayList<Pair<String, Double>>();
        if (embeddings.containsKey(word))
        {
            embeddings.forEach((key, val) -> {
                if (!key.trim().equals(word)) // string equality, not the reference comparison "!="
                {
                    double sim;
                    try {
                        sim = VectorOperations.cosine(embeddings.get(word), val);
                    } catch (Exception e) {
                        sim = -2;
                    }
                    if (mostSimilar.size() < numMostSimilar)
                    {
                        mostSimilar.add(new Pair<String, Double>(key, sim));
                        mostSimilar.sort((x, y) -> x.getValue1() > y.getValue1() ? -1 : (x.getValue1() < y.getValue1() ? 1 : 0));
                    }
                    else if (sim > mostSimilar.get(mostSimilar.size() - 1).getValue1())
                    {
                        mostSimilar.set(mostSimilar.size() - 1, new Pair<String, Double>(key, sim));
                        mostSimilar.sort((x, y) -> x.getValue1() > y.getValue1() ? -1 : (x.getValue1() < y.getValue1() ? 1 : 0));
                    }
                }
            });

            return mostSimilar;
        }
        else return null;
    }
}
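A minimal load-and-query sketch (the path is a placeholder; the expected file format, one word followed by its vector components per line, follows the parser above):

    WordVectorSpace space = new WordVectorSpace();
    space.load("/path/to/word-vectors.txt", null); // null filter: keep every word
    double sim = space.similarity("government", "parliament");
    if (sim >= -1) System.out.println("cosine: " + sim); // -2 would signal an out-of-vocabulary word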
source/src/edu/uma/nlp/graphseg/utils/ApplicationConfiguration.java
ADDED
@@ -0,0 +1,49 @@
package edu.uma.nlp.graphseg.utils;

import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

public class ApplicationConfiguration {

    public static ApplicationConfiguration config = new ApplicationConfiguration();

    private Properties prop;

    public ApplicationConfiguration()
    {
        prop = new Properties();
        InputStream inStream = getClass().getClassLoader().getResourceAsStream("config.properties");

        if (inStream != null)
        {
            try
            {
                prop.load(inStream);
            }
            catch (IOException e) {
                e.printStackTrace();
            }
            finally
            {
                try
                {
                    inStream.close();
                }
                catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }

    public String getValue(String key)
    {
        if (prop != null)
        {
            return prop.getProperty(key);
        }
        else return null;
    }
}
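Lookups go through the shared config instance backed by the bundled config.properties; the key below is hypothetical and only illustrates the call:

    String value = ApplicationConfiguration.config.getValue("some.key"); // null if the key is absent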
source/src/edu/uma/nlp/graphseg/utils/IOHelper.java
ADDED
@@ -0,0 +1,385 @@
package edu.uma.nlp.graphseg.utils;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UnsupportedEncodingException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

import org.apache.commons.io.FileUtils;
import org.apache.commons.lang3.StringUtils;

public class IOHelper {
    public static List<String> getAllLines(String path)
    {
        try {
            return FileUtils.readLines(new File(path));
        } catch (IOException e) {
            System.out.println("File not found or error reading the file: " + path);
            System.out.println(e.getMessage());
            return null;
        }
    }

    public static List<String> getAllLinesWithoutEmpty(String path)
    {
        try {
            List<String> allLines = Files.readAllLines(Paths.get(path));
            List<String> noEmpty = new ArrayList<String>();

            for (int i = 0; i < allLines.size(); i++)
            {
                if (!StringUtils.isEmpty(allLines.get(i).trim()))
                {
                    noEmpty.add(allLines.get(i));
                }
            }

            return noEmpty;

        } catch (IOException e) {
            System.out.println("File not found or error reading the file: " + path);
            return null;
        }
    }

    public static void writeLines(List<String> lines, String path)
    {
        StringBuilder builder = new StringBuilder();
        for (int i = 0; i < lines.size(); i++)
        {
            builder.append(lines.get(i) + "\n");
        }

        try {
            FileUtils.writeStringToFile(new File(path), builder.toString());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static void writeCounts(Map<String, Integer> dictionary, String path, Boolean ordered)
    {
        writeCounts(dictionary.entrySet().stream().collect(Collectors.toList()), path, ordered);
    }

    public static void writeCounts(List<Map.Entry<String, Integer>> entries, String path, Boolean ordered)
    {
        try {
            BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(new File(path)), "UTF-8"));

            // descending by count
            if (ordered) entries.sort((i1, i2) -> i1.getValue() > i2.getValue() ? -1 : (i2.getValue() > i1.getValue() ? 1 : 0));

            for (int i = 0; i < entries.size(); i++)
            {
                bw.write(entries.get(i).getKey() + " " + entries.get(i).getValue() + "\n");
            }

            bw.close();

        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static void writeScores(Map<String, Double> dictionary, String path, Boolean orderedDescending, Map<String, Integer> additionalData, Boolean mweUnderscore)
    {
        writeScores(dictionary.entrySet().stream().collect(Collectors.toList()), path, orderedDescending, additionalData, mweUnderscore);
    }

    public static void writeScores(List<Map.Entry<String, Double>> entries, String path, Boolean orderedDescending, Map<String, Integer> additionalData, Boolean mweUnderscore)
    {
        try {
            BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(new File(path)), "UTF-8"));

            entries.sort((i1, i2) -> i1.getValue() > i2.getValue() ? (orderedDescending ? -1 : 1) : (i2.getValue() > i1.getValue() ? (orderedDescending ? 1 : -1) : 0));

            for (int i = 0; i < entries.size(); i++)
            {
                String line = "";
                if (mweUnderscore)
                {
                    // join multi-word expressions with underscores
                    line = String.join("_", entries.get(i).getKey().split("\\s+")) + " ";
                }
                else line = entries.get(i).getKey() + " ";

                line += String.valueOf(entries.get(i).getValue());

                if (additionalData != null)
                {
                    line += " " + String.valueOf(additionalData.get(entries.get(i).getKey()));
                }
                bw.write(line.trim() + "\n");
            }

            bw.close();

        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static HashMap<String, Integer> loadCounts(String path)
    {
        HashMap<String, Integer> dict = new HashMap<String, Integer>();
        List<String> lines = getAllLines(path);

        for (int i = 0; i < lines.size(); i++)
        {
            String[] split = lines.get(i).split("\\s+");
            if (!dict.containsKey(split[0]))
            {
                dict.put(split[0], Integer.parseInt(split[1]));
            }
        }
        return dict;
    }

    public static HashMap<String, Integer> loadCounts(InputStream stream) throws UnsupportedEncodingException, IOException
    {
        HashMap<String, Integer> dict = new HashMap<String, Integer>();
        try (BufferedReader br = new BufferedReader(new InputStreamReader(stream, "UTF-8"))) {
            for (String line; (line = br.readLine()) != null; ) {
                if (StringUtils.isNotEmpty(line.trim()))
                {
                    String[] split = line.split("\\s+");
                    if (!dict.containsKey(split[0]))
                    {
                        dict.put(split[0], Integer.parseInt(split[1]));
                    }
                }
            }
        }
        return dict;
    }

    public static HashMap<String, Double> loadScores(String path)
    {
        HashMap<String, Double> dict = new HashMap<String, Double>();
        List<String> lines = getAllLines(path);

        for (int i = 0; i < lines.size(); i++)
        {
            String[] split = lines.get(i).split("\\s+");
            if (!dict.containsKey(split[0]))
            {
                dict.put(split[0], Double.parseDouble(split[1]));
            }
        }
        return dict;
    }

    public static void peekTopLines(String inputPath, String outputPath, int numLines)
    {
        List<String> lines = new ArrayList<String>();
        try (BufferedReader br = new BufferedReader(new FileReader(inputPath))) {
            for (int i = 0; i < numLines; i++) {
                lines.add(br.readLine());
            }
            IOHelper.writeLines(lines, outputPath);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static Map<String, Double> loadScoresLineByLine(String path)
    {
        Map<String, Double> dict = Collections.synchronizedMap(new HashMap<String, Double>());

        try (BufferedReader br = new BufferedReader(new FileReader(path))) {
            for (String line; (line = br.readLine()) != null; ) {
                if (StringUtils.isNotEmpty(line.trim()))
                {
                    String[] split = line.split("\\s+");
                    if (!dict.containsKey(split[0]))
                    {
                        dict.put(split[0], Double.parseDouble(split[1]));
                    }
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }

        return dict;
    }

    public static Map<String, Integer> loadRanks(String path)
    {
        Map<String, Integer> dict = Collections.synchronizedMap(new HashMap<String, Integer>());

        try (BufferedReader br = new BufferedReader(new FileReader(path))) {
            int counter = 0;
            for (String line; (line = br.readLine()) != null; ) {
                if (StringUtils.isNotEmpty(line.trim()))
                {
                    counter++;
                    String[] split = line.split("\\s+");
                    if (!dict.containsKey(split[0]))
                    {
                        dict.put(split[0], counter);
                    }
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }

        return dict;
    }

    public static Map<String, Double> loadScoresLineByLine(String path, double threshold, Boolean sorted)
    {
        Map<String, Double> dict = Collections.synchronizedMap(new HashMap<String, Double>());

        try (BufferedReader br = new BufferedReader(new FileReader(path))) {
            for (String line; (line = br.readLine()) != null; ) {
                if (StringUtils.isNotEmpty(line.trim()))
                {
                    String[] split = line.split("\\s+");
                    if (!dict.containsKey(split[0]))
                    {
                        Double score = Double.parseDouble(split[1]);
                        if (score >= threshold) dict.put(split[0], score);
                        // for a file sorted descending by score, everything below the threshold can be skipped
                        else if (sorted)
                        {
                            return dict;
                        }
                    }
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }

        return dict;
    }

    public static Map<String, Double> loadScoresLineByLine(String path, int topN)
    {
        Map<String, Double> dict = Collections.synchronizedMap(new HashMap<String, Double>());

        try (BufferedReader br = new BufferedReader(new FileReader(path))) {
            for (int i = 0; i < topN; i++) {
                String line = br.readLine();
                if (line == null) break; // the file has fewer than topN lines
                if (StringUtils.isNotEmpty(line.trim()))
                {
                    String[] split = line.split("\\s+");
                    if (!dict.containsKey(split[0]))
                    {
                        dict.put(split[0], Double.parseDouble(split[1]));
                    }
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }

        return dict;
    }

    public static HashMap<String, String> loadMappings(String path)
    {
        HashMap<String, String> dict = new HashMap<String, String>();
        List<String> lines = getAllLines(path);

        for (int i = 0; i < lines.size(); i++)
        {
            String[] split = lines.get(i).split("\\s+");
            if (!dict.containsKey(split[0]))
            {
                dict.put(split[0], split[1]);
            }
        }
        return dict;
    }

    public static HashMap<String, List<String>> loadMultiMappings(String path)
    {
        HashMap<String, List<String>> dict = new HashMap<String, List<String>>();
        List<String> lines = getAllLines(path);

        for (int i = 0; i < lines.size(); i++)
        {
            String[] split = lines.get(i).split("\\s+");
            if (!dict.containsKey(split[0]))
            {
                dict.put(split[0], new ArrayList<String>());
            }
            dict.get(split[0]).add(split[1]);
        }
        return dict;
    }

    public static void writeStringToFile(String content, String path) throws IOException
    {
        FileUtils.writeStringToFile(new File(path), content, "UTF-8");
    }
}
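A small usage sketch for the count helpers above (paths are placeholders; each line of a counts file is "&lt;word&gt; &lt;count&gt;", as the parser expects):

    HashMap<String, Integer> counts = IOHelper.loadCounts("/path/to/frequencies.txt");
    IOHelper.writeCounts(counts, "/path/to/frequencies-sorted.txt", true); // true: sort descending by count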
source/src/edu/uma/nlp/graphseg/utils/MemoryStorage.java
ADDED
@@ -0,0 +1,26 @@
package edu.uma.nlp.graphseg.utils;

import edu.uma.nlp.graphseg.semantics.InformationContent;
import edu.uma.nlp.graphseg.semantics.WordVectorSpace;

// application-wide holder for the loaded word vector space and information content model
public class MemoryStorage {

    private static WordVectorSpace wordVectorSpace;

    public static WordVectorSpace getWordVectorSpace() {
        return wordVectorSpace;
    }
    public static void setWordVectorSpace(WordVectorSpace wordVectorSpace) {
        MemoryStorage.wordVectorSpace = wordVectorSpace;
    }

    private static InformationContent informationContent;

    public static InformationContent getInformationContent() {
        return informationContent;
    }
    public static void setInformationContent(InformationContent informationContent) {
        MemoryStorage.informationContent = informationContent;
    }
}
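Typical use is to load the heavy resources once and share them through this holder (illustrative, continuing the WordVectorSpace sketch above):

    MemoryStorage.setWordVectorSpace(space);
    WordVectorSpace shared = MemoryStorage.getWordVectorSpace(); // same instance everywhere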
source/src/edu/uma/nlp/graphseg/utils/VectorOperations.java
ADDED
@@ -0,0 +1,45 @@
package edu.uma.nlp.graphseg.utils;

public class VectorOperations {
    public static double cosine(double[] vector, double[] otherVector) throws Exception
    {
        if (vector.length != otherVector.length)
        {
            throw new UnsupportedOperationException("Vectors are of different length");
        }

        double dp = 0;
        double sum1 = 0;
        double sum2 = 0;

        for (int i = 0; i < vector.length; i++)
        {
            dp += vector[i] * otherVector[i];
            sum1 += vector[i] * vector[i];
            sum2 += otherVector[i] * otherVector[i];
        }

        return dp / (Math.sqrt(sum1) * Math.sqrt(sum2));
    }

    public static void multiply(double[] vector, double factor)
    {
        for (int i = 0; i < vector.length; i++) vector[i] *= factor;
    }

    public static double[] sumVectors(double[] vector, double[] otherVector)
    {
        if (vector.length != otherVector.length) throw new UnsupportedOperationException("Vectors are of different length");

        double[] result = new double[vector.length];
        // the inputs are left untouched (the original "result[i] = vector[i] += otherVector[i]" mutated the first argument)
        for (int i = 0; i < vector.length; i++) result[i] = vector[i] + otherVector[i];

        return result;
    }

    public static void addVector(double[] vector, double[] otherVector)
    {
        if (vector.length != otherVector.length) throw new UnsupportedOperationException("Vectors are of different length");
        for (int i = 0; i < vector.length; i++) vector[i] += otherVector[i];
    }
}
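A quick sanity check for the cosine above, hand-computable as cos((1,0),(1,1)) = 1/sqrt(2) (cosine declares throws Exception, so the call must sit in a method that propagates or catches it):

    double sim = VectorOperations.cosine(new double[]{1, 0}, new double[]{1, 1}); // ~0.7071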