writinwaters committed on
Commit
4b50c07
·
1 Parent(s): 25a190f

Updated parser_config description (#3104)


### What problem does this PR solve?



### Type of change


- [x] Documentation Update

api/http_api_reference.md CHANGED
@@ -78,7 +78,7 @@ curl --request POST \
78
  - `"chunk_method"`: (*Body parameter*), `enum<string>`
79
  The chunking method of the dataset to create. Available options:
80
  - `"naive"`: General (default)
81
- - `"manual`: Manual
82
  - `"qa"`: Q&A
83
  - `"table"`: Table
84
  - `"paper"`: Paper
@@ -88,16 +88,23 @@ curl --request POST \
88
  - `"picture"`: Picture
89
  - `"one"`: One
90
  - `"knowledge_graph"`: Knowledge Graph
91
- - `"email"`: Email
92
 
93
  - `"parser_config"`: (*Body parameter*), `object`
94
- The configuration settings for the dataset parser, a JSON object containing the following attributes:
95
- - `"chunk_token_count"`: Defaults to `128`.
96
- - `"layout_recognize"`: Defaults to `true`.
97
- - `"html4excel"`: Indicates whether to convert Excel documents into HTML format. Defaults to `false`.
98
- - `"delimiter"`: Defaults to `"\n!?。;!?"`.
99
- - `"task_page_size"`: Defaults to `12`. For PDF only.
100
- - `"raptor"`: Raptor-specific settings. Defaults to: `{"use_raptor": false}`.
101
 
102
  ### Response
103
 
@@ -256,7 +263,6 @@ curl --request PUT \
256
  - `"picture"`: Picture
257
  - `"one"`: One
258
  - `"knowledge_graph"`: Knowledge Graph
259
- - `"email"`: Email
260
 
261
  ### Response
262
 
@@ -511,13 +517,22 @@ curl --request PUT \
511
  - `"picture"`: Picture
512
  - `"one"`: One
513
  - `"knowledge_graph"`: Knowledge Graph
514
- - `"email"`: Email
515
  - `"parser_config"`: (*Body parameter*), `object`
516
- The parsing configuration for the document:
517
- - `"chunk_token_count"`: Defaults to `128`.
518
- - `"layout_recognize"`: Defaults to `true`.
519
- - `"delimiter"`: Defaults to `"\n!?。;!?"`.
520
- - `"task_page_size"`: Defaults to `12`. For PDF only.
521
 
522
  ### Response
523
 
 
78
  - `"chunk_method"`: (*Body parameter*), `enum<string>`
79
  The chunking method of the dataset to create. Available options:
80
  - `"naive"`: General (default)
81
+ - `"manual"`: Manual
82
  - `"qa"`: Q&A
83
  - `"table"`: Table
84
  - `"paper"`: Paper
 
88
  - `"picture"`: Picture
89
  - `"one"`: One
90
  - `"knowledge_graph"`: Knowledge Graph
 
91
 
92
  - `"parser_config"`: (*Body parameter*), `object`
93
+ The configuration settings for the dataset parser. The attributes in this JSON object vary with the selected `"chunk_method"`:
94
+ - If `"chunk_method"` is `"naive"`, the `"parser_config"` object contains the following attributes:
95
+ - `"chunk_token_count"`: Defaults to `128`.
96
+ - `"layout_recognize"`: Defaults to `true`.
97
+ - `"html4excel"`: Indicates whether to convert Excel documents into HTML format. Defaults to `false`.
98
+ - `"delimiter"`: Defaults to `"\n!?。;!?"`.
99
+ - `"task_page_size"`: Defaults to `12`. For PDF only.
100
+ - `"raptor"`: Raptor-specific settings. Defaults to: `{"use_raptor": false}`.
101
+ - If `"chunk_method"` is `"qa"`, `"manual"`, `"paper"`, `"book"`, `"laws"`, or `"presentation"`, the `"parser_config"` object contains the following attribute:
102
+ - `"raptor"`: Raptor-specific settings. Defaults to: `{"use_raptor": false}`.
103
+ - If `"chunk_method"` is `"table"` or `"one"`, `"parser_config"` is an empty JSON object.
104
+ - If `"chunk_method"` is `"knowledge_graph"`, the `"parser_config"` object contains the following attributes:
105
+ - `"chunk_token_count"`: Defaults to `128`.
106
+ - `"delimiter"`: Defaults to `"\n!?。;!?"`.
107
+ - `"entity_types"`: Defaults to `["organization","person","location","event","time"]`
108
 
109
  ### Response
110
 
 
263
  - `"picture"`: Picture
264
  - `"one"`: One
265
  - `"knowledge_graph"`: Knowledge Graph
 
266
 
267
  ### Response
268
 
 
517
  - `"picture"`: Picture
518
  - `"one"`: One
519
  - `"knowledge_graph"`: Knowledge Graph
 
520
  - `"parser_config"`: (*Body parameter*), `object`
521
+ The configuration settings for the dataset parser. The attributes in this JSON object vary with the selected `"chunk_method"`:
522
+ - If `"chunk_method"` is `"naive"`, the `"parser_config"` object contains the following attributes:
523
+ - `"chunk_token_count"`: Defaults to `128`.
524
+ - `"layout_recognize"`: Defaults to `true`.
525
+ - `"html4excel"`: Indicates whether to convert Excel documents into HTML format. Defaults to `false`.
526
+ - `"delimiter"`: Defaults to `"\n!?。;!?"`.
527
+ - `"task_page_size"`: Defaults to `12`. For PDF only.
528
+ - `"raptor"`: Raptor-specific settings. Defaults to: `{"use_raptor": false}`.
529
+ - If `"chunk_method"` is `"qa"`, `"manual"`, `"paper"`, `"book"`, `"laws"`, or `"presentation"`, the `"parser_config"` object contains the following attribute:
530
+ - `"raptor"`: Raptor-specific settings. Defaults to: `{"use_raptor": false}`.
531
+ - If `"chunk_method"` is `"table"` or `"one"`, `"parser_config"` is an empty JSON object.
532
+ - If `"chunk_method"` is `"knowledge_graph"`, the `"parser_config"` object contains the following attributes:
533
+ - `"chunk_token_count"`: Defaults to `128`.
534
+ - `"delimiter"`: Defaults to `"\n!?。;!?"`.
535
+ - `"entity_types"`: Defaults to `["organization","person","location","event","time"]`
536
 
537
  ### Response
538
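
The per-method defaults added to `http_api_reference.md` above can be sketched as a lookup table that fills in the `"chunk_method"` and `"parser_config"` body parameters. This is a minimal illustration, not code from the repository: `DEFAULT_PARSER_CONFIG` and `build_parser_fields` are hypothetical names, the values are transcribed from the "+" lines of this diff, and chunk methods the diff does not describe (for example `"picture"`) are rejected rather than guessed.

```python
import copy
import json

# Shared default for the methods that only accept RAPTOR settings.
RAPTOR_OFF = {"raptor": {"use_raptor": False}}

# Defaults transcribed from the parser_config description in the diff above.
DEFAULT_PARSER_CONFIG = {
    "naive": {
        "chunk_token_count": 128,
        "layout_recognize": True,
        "html4excel": False,
        "delimiter": "\n!?。;!?",
        "task_page_size": 12,  # documented as PDF only
        "raptor": {"use_raptor": False},
    },
    "qa": RAPTOR_OFF,
    "manual": RAPTOR_OFF,
    "paper": RAPTOR_OFF,
    "book": RAPTOR_OFF,
    "laws": RAPTOR_OFF,
    "presentation": RAPTOR_OFF,
    "table": {},
    "one": {},
    "knowledge_graph": {
        "chunk_token_count": 128,
        "delimiter": "\n!?。;!?",
        "entity_types": ["organization", "person", "location", "event", "time"],
    },
}


def build_parser_fields(chunk_method="naive", overrides=None):
    """Hypothetical helper: return the two body fields as a dict.

    Overrides are applied shallowly, so passing "raptor" replaces the
    whole nested object rather than merging into it.
    """
    if chunk_method not in DEFAULT_PARSER_CONFIG:
        raise ValueError(f"no documented parser_config for {chunk_method!r}")
    # Deep-copy so callers can never mutate the shared defaults.
    config = copy.deepcopy(DEFAULT_PARSER_CONFIG[chunk_method])
    config.update(overrides or {})
    return {"chunk_method": chunk_method, "parser_config": config}


if __name__ == "__main__":
    fields = build_parser_fields("knowledge_graph", {"chunk_token_count": 256})
    print(json.dumps(fields, ensure_ascii=False, indent=2))
```

The returned dict can be merged into whatever other body parameters the request needs before it is serialized with `json.dumps`.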
 
api/python_api_reference.md CHANGED
@@ -75,16 +75,31 @@ The chunking method of the dataset to create. Available options:
75
  - `"picture"`: Picture
76
  - `"one"`: One
77
  - `"knowledge_graph"`: Knowledge Graph
78
- - `"email"`: Email
79
 
80
  #### parser_config
81
 
82
- The parser configuration of the dataset. A `ParserConfig` object contains the following attributes:
83
-
84
- - `chunk_token_count`: Defaults to `128`.
85
- - `layout_recognize`: Defaults to `True`.
86
- - `delimiter`: Defaults to `"\n!?。;!?"`.
87
- - `task_page_size`: Defaults to `12`.
88
 
89
  ### Returns
90
 
@@ -225,7 +240,6 @@ A dictionary representing the attributes to update, with the following keys:
225
  - `"picture"`: Picture
226
  - `"one"`: One
227
  - `"knowledge_graph"`: Knowledge Graph
228
- - `"email"`: Email
229
 
230
  ### Returns
231
 
@@ -296,11 +310,6 @@ Updates configurations for the current document.
296
  A dictionary representing the attributes to update, with the following keys:
297
 
298
  - `"display_name"`: `str` The name of the document to update.
299
- - `"parser_config"`: `dict[str, Any]` The parsing configuration for the document:
300
- - `"chunk_token_count"`: Defaults to `128`.
301
- - `"layout_recognize"`: Defaults to `True`.
302
- - `"delimiter"`: Defaults to `'\n!?。;!?'`.
303
- - `"task_page_size"`: Defaults to `12`.
304
  - `"chunk_method"`: `str` The parsing method to apply to the document.
305
  - `"naive"`: General
306
  - `"manual"`: Manual
@@ -313,7 +322,27 @@ A dictionary representing the attributes to update, with the following keys:
313
  - `"picture"`: Picture
314
  - `"one"`: One
315
  - `"knowledge_graph"`: Knowledge Graph
316
- - `"email"`: Email
317
 
318
  ### Returns
319
 
@@ -412,7 +441,6 @@ A `Document` object contains the following attributes:
412
  - `thumbnail`: The thumbnail image of the document. Defaults to `None`.
413
  - `dataset_id`: The dataset ID associated with the document. Defaults to `None`.
414
  - `chunk_method`: The chunk method name. Defaults to `"naive"`.
415
- - `parser_config`: `ParserConfig` Configuration object for the parser. Defaults to `{"pages": [[1, 1000000]]}`.
416
  - `source_type`: The source type of the document. Defaults to `"local"`.
417
  - `type`: Type or category of the document. Defaults to `""`. Reserved for future use.
418
  - `created_by`: `str` The creator of the document. Defaults to `""`.
@@ -430,6 +458,27 @@ A `Document` object contains the following attributes:
430
  - `"DONE"`
431
  - `"FAIL"`
432
  - `status`: `str` Reserved for future use.
433
 
434
  ### Examples
435
 
 
75
  - `"picture"`: Picture
76
  - `"one"`: One
77
  - `"knowledge_graph"`: Knowledge Graph
 
78
 
79
  #### parser_config
80
 
81
+ The parser configuration of the dataset. A `ParserConfig` object's attributes vary based on the selected `"chunk_method"`:
82
+
83
+ - `chunk_method`=`"naive"`:
84
+ `{"chunk_token_num":128,"delimiter":"\\n!?;。;!?","html4excel":False,"layout_recognize":True,"raptor":{"use_raptor":False}}`
85
+ - `chunk_method`=`"qa"`:
86
+ `{"raptor": {"use_raptor": False}}`
87
+ - `chunk_method`=`"manual"`:
88
+ `{"raptor": {"use_raptor": False}}`
89
+ - `chunk_method`=`"table"`:
90
+ `None`
91
+ - `chunk_method`=`"paper"`:
92
+ `{"raptor": {"use_raptor": False}}`
93
+ - `chunk_method`=`"book"`:
94
+ `{"raptor": {"use_raptor": False}}`
95
+ - `chunk_method`=`"laws"`:
96
+ `{"raptor": {"use_raptor": False}}`
97
+ - `chunk_method`=`"presentation"`:
98
+ `{"raptor": {"use_raptor": False}}`
99
+ - `chunk_method`=`"one"`:
100
+ `None`
101
+ - `chunk_method`=`"knowledge_graph"`:
102
+ `{"chunk_token_num":128,"delimiter":"\\n!?;。;!?","entity_types":["organization","person","location","event","time"]}`
103
 
104
  ### Returns
105
 
 
240
  - `"picture"`: Picture
241
  - `"one"`: One
242
  - `"knowledge_graph"`: Knowledge Graph
 
243
 
244
  ### Returns
245
 
 
310
  A dictionary representing the attributes to update, with the following keys:
311
 
312
  - `"display_name"`: `str` The name of the document to update.
313
  - `"chunk_method"`: `str` The parsing method to apply to the document.
314
  - `"naive"`: General
315
  - `"manual"`: Manual
 
322
  - `"picture"`: Picture
323
  - `"one"`: One
324
  - `"knowledge_graph"`: Knowledge Graph
325
+ - `"parser_config"`: `dict[str, Any]` The parsing configuration for the document. Its attributes vary based on the selected `"chunk_method"`:
326
+ - `chunk_method`=`"naive"`:
327
+ `{"chunk_token_num":128,"delimiter":"\\n!?;。;!?","html4excel":False,"layout_recognize":True,"raptor":{"use_raptor":False}}`
328
+ - `chunk_method`=`"qa"`:
329
+ `{"raptor": {"use_raptor": False}}`
330
+ - `chunk_method`=`"manual"`:
331
+ `{"raptor": {"use_raptor": False}}`
332
+ - `chunk_method`=`"table"`:
333
+ `None`
334
+ - `chunk_method`=`"paper"`:
335
+ `{"raptor": {"use_raptor": False}}`
336
+ - `chunk_method`=`"book"`:
337
+ `{"raptor": {"use_raptor": False}}`
338
+ - `chunk_method`=`"laws"`:
339
+ `{"raptor": {"use_raptor": False}}`
340
+ - `chunk_method`=`"presentation"`:
341
+ `{"raptor": {"use_raptor": False}}`
342
+ - `chunk_method`=`"one"`:
343
+ `None`
344
+ - `chunk_method`=`"knowledge_graph"`:
345
+ `{"chunk_token_num":128,"delimiter":"\\n!?;。;!?","entity_types":["organization","person","location","event","time"]}`
346
 
347
  ### Returns
348
 
 
441
  - `thumbnail`: The thumbnail image of the document. Defaults to `None`.
442
  - `dataset_id`: The dataset ID associated with the document. Defaults to `None`.
443
  - `chunk_method`: The chunk method name. Defaults to `"naive"`.
 
444
  - `source_type`: The source type of the document. Defaults to `"local"`.
445
  - `type`: Type or category of the document. Defaults to `""`. Reserved for future use.
446
  - `created_by`: `str` The creator of the document. Defaults to `""`.
 
458
  - `"DONE"`
459
  - `"FAIL"`
460
  - `status`: `str` Reserved for future use.
461
+ - `parser_config`: `ParserConfig` Configuration object for the parser. Its attributes vary based on the selected `chunk_method`:
462
+ - `chunk_method`=`"naive"`:
463
+ `{"chunk_token_num":128,"delimiter":"\\n!?;。;!?","html4excel":False,"layout_recognize":True,"raptor":{"use_raptor":False}}`
464
+ - `chunk_method`=`"qa"`:
465
+ `{"raptor": {"use_raptor": False}}`
466
+ - `chunk_method`=`"manual"`:
467
+ `{"raptor": {"use_raptor": False}}`
468
+ - `chunk_method`=`"table"`:
469
+ `None`
470
+ - `chunk_method`=`"paper"`:
471
+ `{"raptor": {"use_raptor": False}}`
472
+ - `chunk_method`=`"book"`:
473
+ `{"raptor": {"use_raptor": False}}`
474
+ - `chunk_method`=`"laws"`:
475
+ `{"raptor": {"use_raptor": False}}`
476
+ - `chunk_method`=`"presentation"`:
477
+ `{"raptor": {"use_raptor": False}}`
478
+ - `chunk_method`=`"one"`:
479
+ `None`
480
+ - `chunk_method`=`"knowledge_graph"`:
481
+ `{"chunk_token_num":128,"delimiter": "\\n!?;。;!?","entity_types":["organization","person","location","event","time"]}`
482
 
483
  ### Examples
484
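
Because the `parser_config` keys documented in `python_api_reference.md` above differ by `chunk_method`, a caller may want to check a configuration before passing it on. The following is a rough client-side sketch only: `ALLOWED_KEYS` and `unexpected_keys` are hypothetical names that are not part of the SDK, and the key sets are transcribed from the top-level keys listed in this diff (`None` for `"table"` and `"one"` is treated as "no keys").

```python
# Top-level parser_config keys per chunk method, transcribed from the
# Python reference text in this diff. Not an official schema.
RAPTOR_ONLY = frozenset({"raptor"})

ALLOWED_KEYS = {
    "naive": frozenset(
        {"chunk_token_num", "delimiter", "html4excel", "layout_recognize", "raptor"}
    ),
    "qa": RAPTOR_ONLY,
    "manual": RAPTOR_ONLY,
    "paper": RAPTOR_ONLY,
    "book": RAPTOR_ONLY,
    "laws": RAPTOR_ONLY,
    "presentation": RAPTOR_ONLY,
    "table": frozenset(),
    "one": frozenset(),
    "knowledge_graph": frozenset({"chunk_token_num", "delimiter", "entity_types"}),
}


def unexpected_keys(chunk_method, parser_config):
    """Return, sorted, the keys not documented for this chunk method.

    Raises KeyError for chunk methods the diff does not describe.
    """
    allowed = ALLOWED_KEYS[chunk_method]
    # A parser_config of None (documented for "table" and "one") has no keys.
    return sorted(set(parser_config or {}) - allowed)


if __name__ == "__main__":
    print(unexpected_keys("qa", {"raptor": {}, "delimiter": "\n"}))
```

An empty result means every supplied key appears in the documented list for that method; it says nothing about whether the values themselves are valid.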