@@ -1,199 +1,1017 @@
 ---
-library_name: transformers
-tags: []
 ---
-# Model Card for Model ID
-<!-- Provide a quick summary of what the model is/does. -->
-## Model Details
-### Model Description
-<!-- Provide a longer summary of what this model is. -->
-This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
-- **Developed by:** [More Information Needed]
-- **Funded by [optional]:** [More Information Needed]
-- **Shared by [optional]:** [More Information Needed]
-- **Model type:** [More Information Needed]
-- **Language(s) (NLP):** [More Information Needed]
-- **License:** [More Information Needed]
-- **Finetuned from model [optional]:** [More Information Needed]
-### Model Sources [optional]
-<!-- Provide the basic links for the model. -->
-- **Repository:** [More Information Needed]
-- **Paper [optional]:** [More Information Needed]
-- **Demo [optional]:** [More Information Needed]
-## Uses
-<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
-### Direct Use
-<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
-[More Information Needed]
-### Downstream Use [optional]
-<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
-[More Information Needed]
-### Out-of-Scope Use
-<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
-[More Information Needed]
-## Bias, Risks, and Limitations
-<!-- This section is meant to convey both technical and sociotechnical limitations. -->
-[More Information Needed]
-### Recommendations
-<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
-Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
-## How to Get Started with the Model
-Use the code below to get started with the model.
-[More Information Needed]
-## Training Details
-### Training Data
-<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-[More Information Needed]
-### Training Procedure
-<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
-#### Preprocessing [optional]
-[More Information Needed]
-#### Training Hyperparameters
-- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
-#### Speeds, Sizes, Times [optional]
-<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
-[More Information Needed]
-## Evaluation
-<!-- This section describes the evaluation protocols and provides the results. -->
-### Testing Data, Factors & Metrics
-#### Testing Data
-<!-- This should link to a Dataset Card if possible. -->
-[More Information Needed]
-#### Factors
-<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-[More Information Needed]
-#### Metrics
-<!-- These are the evaluation metrics being used, ideally with a description of why. -->
-[More Information Needed]
-### Results
-[More Information Needed]
-#### Summary
-## Model Examination [optional]
-<!-- Relevant interpretability work for the model goes here -->
-[More Information Needed]
-## Environmental Impact
-<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
-Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
-- **Hardware Type:** [More Information Needed]
-- **Hours used:** [More Information Needed]
-- **Cloud Provider:** [More Information Needed]
-- **Compute Region:** [More Information Needed]
-- **Carbon Emitted:** [More Information Needed]
-## Technical Specifications [optional]
-### Model Architecture and Objective
-[More Information Needed]
-### Compute Infrastructure
-[More Information Needed]
-#### Hardware
-[More Information Needed]
-#### Software
-[More Information Needed]
-## Citation [optional]
-<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-**BibTeX:**
-[More Information Needed]
-**APA:**
-[More Information Needed]
-## Glossary [optional]
-<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
-[More Information Needed]
-## More Information [optional]
-[More Information Needed]
-## Model Card Authors [optional]
-[More Information Needed]
-## Model Card Contact
-[More Information Needed]

 ---
+license: cc-by-4.0
+language:
+- az
+base_model:
+- FacebookAI/xlm-roberta-base
+pipeline_tag: text-classification
+tags:
+- rag
+- chunker
+- semantic
 ---
+# Azerbaijani Semantic Text Chunker
+Semantic text chunking model for the Azerbaijani language, based on XLM-RoBERTa.
+The model is trained specifically on Azerbaijani text and combines semantic boundary detection, strict length control, and sliding-window processing for documents of any size.
+## Features
+- **Semantic Chunking**: Understands context and meaning, not just sentence boundaries
+- **Azerbaijani Specialized**: Native support for Azerbaijani language nuances
+- **Strict Length Control**: Enforces maximum chunk size in tokens and/or characters
+- **Sliding Window Processing**: Handles documents of unlimited length (2K, 10K, 50K+ tokens)
+- **Multiple Strategies**: Optimal, conservative, and aggressive chunking modes
+- **Flexible Limits**: Supports token limits, character limits, or both combined
+- **Detailed Analytics**: Comprehensive chunking process diagnostics
+- **RAG-Ready**: Produces chunks sized for vector databases and retrieval systems
+## Performance
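+Boundary-detection results on Azerbaijani text segmentation. Because boundaries are rare, accuracy is dominated by non-boundary positions; precision, recall, and F1 are the more informative figures: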
+| Metric | Score |
+|--------|-------|
+| **Precision** | 0.7850 |
+| **Recall** | 0.6485 |
+| **F1-Score** | 0.7102 |
+| **Accuracy** | 0.9980 |
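
The F1 score in the table can be cross-checked against the reported precision and recall, since F1 is their harmonic mean. A minimal sketch follows; the helper name `f1_score` is illustrative and is not part of the model's code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


reported_f1 = 0.7102                     # value from the table
computed_f1 = f1_score(0.7850, 0.6485)   # precision and recall from the table

# The two values should agree up to rounding of the reported figures
print(f"computed F1 = {computed_f1:.4f}, reported F1 = {reported_f1:.4f}")
```
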
+## Technical Architecture
+### Sliding Window Processing
+The model uses a sliding window to handle texts longer than the 512-token context window. With a window of 510 tokens (512 minus 2 special tokens) and a stride of 255 tokens (50% overlap), a 2048-token text is covered by eight windows (token ranges are end-exclusive):
+- **Window 1**: tokens 0-510
+- **Window 2**: tokens 255-765 (255-token overlap)
+- **Window 3**: tokens 510-1020 (255-token overlap)
+- **Window 4**: tokens 765-1275 (255-token overlap)
+- **Window 5**: tokens 1020-1530 (255-token overlap)
+- **Window 6**: tokens 1275-1785 (255-token overlap)
+- **Window 7**: tokens 1530-2040 (255-token overlap)
+- **Window 8**: tokens 1785-2048 (final window, 263 tokens)
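
The window layout above follows directly from the window size and the stride. A minimal sketch, assuming end-exclusive token ranges and a final window clipped to the text length; the function name `sliding_windows` is hypothetical and is not taken from the model's code.

```python
from typing import List, Tuple


def sliding_windows(n_tokens: int, window: int = 510, stride: int = 255) -> List[Tuple[int, int]]:
    """Return (start, end) token spans, end-exclusive, that cover n_tokens."""
    if n_tokens <= 0:
        return []
    spans = []
    start = 0
    while True:
        end = min(start + window, n_tokens)  # clip the last window
        spans.append((start, end))
        if end == n_tokens:                  # the text is fully covered
            break
        start += stride
    return spans


# Print the layout for a 2048-token text
for number, (start, end) in enumerate(sliding_windows(2048), start=1):
    print(f"Window {number}: tokens {start}-{end}")
```

Because the stride is half the window size, every token except those in the first and last half-window is seen by two windows.
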
+**Key advantages:**
+- **No length limitations**: Process documents of any size
+- **Intelligent overlap**: Reduces the chance of missing a semantic boundary that falls near a window edge
+- **Prediction fusion**: Combines results using maximum confidence scores
+- **Linear complexity**: O(N) processing time, growing linearly with document length N
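
Prediction fusion with maximum confidence can be sketched as follows. This is an illustration under assumptions, not the model's actual code: each window is assumed to yield one boundary probability per token, and `fuse_max` and `boundaries_above` are hypothetical helper names. The 0.12 default mirrors the `THRESHOLD` value used in the usage code.

```python
from typing import List, Tuple


def fuse_max(n_tokens: int, window_scores: List[Tuple[int, List[float]]]) -> List[float]:
    """Merge per-window boundary probabilities, keeping the maximum per token.

    window_scores holds (start_token, scores) pairs, one per window.
    """
    fused = [0.0] * n_tokens
    for start, scores in window_scores:
        for i, score in enumerate(scores):
            pos = start + i
            if pos < n_tokens:
                fused[pos] = max(fused[pos], score)
    return fused


def boundaries_above(fused: List[float], threshold: float = 0.12) -> List[int]:
    """Token positions whose fused score reaches the threshold (>= is an assumption)."""
    return [i for i, score in enumerate(fused) if score >= threshold]


# Toy example: two 3-token windows that overlap on token 2
windows = [(0, [0.10, 0.90, 0.20]), (2, [0.60, 0.30, 0.05])]
fused = fuse_max(5, windows)
print(fused)
print(boundaries_above(fused))
```

Taking the maximum means a boundary detected confidently in either window survives, even if the other window saw it with less context.
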
+### Length Control Algorithm
+- **Strict enforcement**: Never exceeds specified token/character limits
+- **Smart boundary detection**: Prefers natural separators (spaces, punctuation)
+- **Fallback mechanisms**: Intelligent splitting when no semantic boundary is found
+- **Combined limits**: Supports both token AND character limits simultaneously
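
The preference order described above (natural separators first, hard cut last) can be sketched for a character limit. This is a simplified illustration, not the model's implementation; `split_with_char_limit` is a hypothetical name, and the order "sentence terminator, then space, then hard cut" is an assumption about what "natural separators" means here.

```python
from typing import List


def split_with_char_limit(text: str, max_chars: int) -> List[str]:
    """Split text so that no chunk exceeds max_chars characters."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    chunks = []
    rest = text.strip()
    while len(rest) > max_chars:
        window = rest[:max_chars]
        # 1) Prefer to cut right after the last sentence terminator
        cut = max(window.rfind(p) for p in ".!?") + 1
        # 2) Otherwise cut at the last space
        if cut <= 0:
            cut = window.rfind(" ")
        # 3) Otherwise fall back to a hard cut at the limit
        if cut <= 0:
            cut = max_chars
        chunks.append(rest[:cut].strip())
        rest = rest[cut:].strip()
    if rest:
        chunks.append(rest)
    return chunks


sample = "One two three. Four five six seven. Eight."
for chunk in split_with_char_limit(sample, 20):
    print(len(chunk), repr(chunk))
```

A token limit could be enforced the same way by measuring each candidate chunk with the tokenizer instead of `len`.
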
+## Quick Start
+### Installation
+```bash
+# Install requirements
+pip install torch transformers numpy pandas
+```
+### Basic Usage
+```python
+import torch
+import torch.nn.functional as F
+from transformers import AutoTokenizer, AutoModelForTokenClassification
+import numpy as np
+from typing import List, Optional, Tuple
+# Global priority constants
+PRIORITY_TOKENS = "tokens"
+PRIORITY_CHARS = "chars"
+PRIORITY_STRICT = "strict"
+# Global configuration variables for easy modification
+MODEL_PATH = "LocalDoc/semantic_chunker"
+MAX_CHUNK_TOKENS = 400  # Single parameter for chunk size
+TARGET_TOKENS = 256      # Single parameter for target size
+MAX_CHUNK_CHARS = None  # No character limit by default
+THRESHOLD = 0.12
+PRIORITY = PRIORITY_STRICT
+# Dynamic calculation constants (avoid hardcoding)
+MIN_LENGTH_DIVISOR = 8      # target_tokens // MIN_LENGTH_DIVISOR for min chunk length
+MARGIN_RATIO = 0.25         # 25% margin for max_tokens validation
+MIN_MARGIN = 20             # Minimum margin when validating max_tokens
+ROUGH_TOKEN_MULTIPLIER = 1.3  # Rough token estimation multiplier
+LONG_TEXT_THRESHOLD = 400   # Threshold for long text processing
+BERT_MAX_LENGTH = 512       # Standard BERT model limit
+SEARCH_RANGE = 50           # Character search range for boundaries
+SENTENCE_SEARCH_RANGE = 100 # Range for sentence boundary search
+SPACE_SEARCH_RANGE = 50     # Range for space boundary search
+OPTIMAL_TARGET_RATIO = 0.9  # 90% of max as safety margin for merging
+GOOD_SIZE_MIN_RATIO = 0.7   # 70% of target for minimum good size
+MERGE_THRESHOLD_RATIO = 0.5 # 50% of target for merge threshold
+MERGE_ATTEMPT_RATIO = 0.6   # 60% of target for merge attempts
+OPTIMAL_MAX_RATIO = 1.1     # 110% of max for slight overflow
+MIN_ACCEPTABLE_RATIO = 0.3  # 30% of target for minimum acceptable size
+MAX_MERGE_RATIO = 1.5       # 150% of target for maximum merge size
+# Sample text for testing
+#SAMPLE_TEXT = """Azərbaycan, Qafqazın incisi olaraq, zəngin tarixi, unikal mədəniyyəti və möhtəşəm təbiəti ilə hər zaman diqqət çəkmişdir. Bu torpaqlar minillər boyu müxtəlif sivilizasiyaların qovuşağında yerləşmiş, onların izlərini özündə yaşatmışdır. Şərqlə Qərb arasında bir körpü rolunu oynayan Azərbaycan, özünün çoxəsrlik dövlətçilik ənənəsi ilə fəxr edir. Ölkənin paytaxtı Bakı, Xəzər dənizinin sahilində yerləşən, Qədim Şəhər (İçərişəhər) kimi UNESCO Dünya İrs Siyahısına daxil edilmiş tarixi məkanları və müasir memarlıq abidələri ilə bir araya gətirən canlı bir metropoldur. Müasir Bakının simvollarından olan Heydər Əliyev Mərkəzi, incəsənət, mədəniyyət və təhsilin mərkəzi kimi fəaliyyət göstərir, onun innovativ dizaynı ilə dünyanın diqqətini cəlb edir. Azərbaycanın təbiəti də heyrətamiz dərəcədə müxtəlifdir. Böyük Qafqaz dağlarının əzəməti, Kiçik Qafqazın mənzərəli vadiləri, Kür-Araz ovalığının bərəkəti və Xəzər dənizinin sahil zolağı ölkəyə özünəməxsus gözəllik qatır. Qax rayonundakı Laza kəndi, dağların əhatəsində yerləşən, təbii gözəlliyi ilə seçilən bir turizm mərkəzidir. Bu ərazilərdə füsunkar meşələr, bol sulu çaylar, şəlalələr və müalicəvi mineral bulaqlar mövcuddur. Göygöl Milli Parkı, Azərbaycanın ən gözəl təbii guşələrindən biridir, onun saf gölü və ətrafındakı meşələr, nadir bitki və heyvan növlərinə ev sahibliyi edir. Texnologiyanın sürətli inkişafı dövründə Azərbaycan da bu qlobal trenddən kənarda qalmayıb. Son illərdə ölkədə rəqəmsal transformasiya prosesləri sürətlənmiş, informasiya-kommunikasiya texnologiyaları (İKT) sektoruna böyük investisiyalar qoyulmuşdur. Bakıda "Hi-Tech Park" kimi müasir texnoloji mərkəzlər fəaliyyət göstərir, innovativ startaplara dəstək verilir. Süni intellekt (AI), maşın təlimi (ML) və böyük verilənlər (Big Data) kimi sahələr prioritet istiqamətlər olaraq müəyyən edilmişdir. Ölkənin gələcəyi üçün əsaslı önəm kəsb edən bu sahələr, təhsil və elmi araşdırmalarla sıx əlaqədardır. 
Universitetlərdə və elmi institutlarda İKT-nin müxtəlif istiqamətləri üzrə tədqiqatlar aparılır, gənc kadrlar hazırlanır. Virtual və artırılmış reallıq texnologiyaları da təhsil, mədəniyyət və turizm sahələrində tətbiq olunmağa başlanmışdır. "Azercosmos" ASC, peyk texnologiyaları sahəsində ölkənin potensialını artırır, bu da telekommunikasiya, naviqasiya və yerin uzaqdan müşahidəsi kimi sahələrdə əhəmiyyətli rol oynayır. Azərbaycanın tarixi boyu bir çox mühüm hadisələrə şahidlik etmişdir. Qədim zamanlarda Albaniya dövlətinin yaranması, İpək Yolunun bir hissəsinin bu ərazidən keçməsi, orta əsrlərdə Şirvanşahlar və Səfəvilər kimi güclü dövlətlərin hökmranlığı ölkənin siyasi və mədəni inkişafına təsir göstərmişdir. 1918-ci ildə Şərqdə ilk demokratik respublika olan Azərbaycan Xalq Cümhuriyyətinin qurulması, bu torpaqların müstəqillik uğrunda mübarizəsinin parlaq nümunəsidir. Sovet hakimiyyəti illərindəki çətinliklərə baxmayaraq, Azərbaycan 1991-ci ildə yenidən müstəqilliyini qazanmışdır. Ulu Öndər Heydər Əliyevin müstəqil Azərbaycanın təməlini qoyması və Prezident İlham Əliyevin rəhbərliyi ilə ölkənin inkişaf etməsi, bu günümüzdə də davam edən uğurlu siyasətin nəticəsidir. Qarabag müharibəsi və 2020-ci ildəki Vətən Müharibəsi, Azərbaycanın ərazi bütövlüyünün təmin edilməsində həlledici rol oynamışdır. Bu qələbələr, xalqımızın birliyini və gücünü bir daha təsdiqlədi. Gələcəyə baxış, Azərbaycan üçün böyük potensial vəd edir. Ölkə, enerji resursları baxımından zəngin olmaqla yanaşı, tranzit və logistika mərkəzi kimi də strateji əhəmiyyət kəsb edir. "Asrın Müqaviləsi"nin imzalanması, Xəzər dənizinin neft və qaz yataqlarının dünya bazarına çıxarılmasında mühüm rol oynamışdır. "Şərq-Qərb" və "Şimal-Cənub" beynəlxalq nəqliyyat dəhlizlərinin inkişafı, Azərbaycanın regionda aparıcı logistika mərkəzinə çevrilməsinə şərait yaradır. Yaşıl enerji və dayanıqlı inkişaf konsepsiyaları da ölkənin gələcək strategiyasının əsasını təşkil edir. 
Günəş və külək enerjisinin inkişafına böyük diqqət yetirilir, iqlim dəyişikliyinin təsirlərini azaltmaq üçün səylər artırılır. Təhsil sisteminin təkmilləşdirilməsi, gənclərin beynəlxalq standartlara uyğun kadr kimi yetişdirilməsi, elmi-texniki potensialın gücləndirilməsi gələcək uğurların təminatıdır. Şəhərlərin yenidən qurulması, infrastrukturun modernləşdirilməsi, sosial sahələrin inkişafı, bütün bunlar ölkənin gələcək tərəqqisi üçün atılan addımlardır. Azərbaycan, Avropa ilə inteqrasiya, beynəlxalq əməkdaşlığın gücləndirilməsi və regional sabitliyin təmin edilməsi istiqamətində də aktiv fəaliyyət göstərir. Sülh və əməkdaşlıq siyasəti, Azərbaycanın beynəlxalq aləmdəki mövqeyini daha da gücləndirir. İnformasiya Texnologiyaları Universiteti Azərbaycan Respublikasında informasiya cəmiyyəti quruculuğu istiqaməti üzrə yüksək hazırlığa malik kadr potensialının formalaşdırılmasını təmin edən ali təhsil müəssisəsi idi. Azərbaycan Respublikasının Prezidentinin Sərəncamı ilə 1 fevral 2013-cü ildə yaradılmışdır.[2] İnformasiya Texnologiyaları Universitetinin və Azərbaycan Respublikası Xarici İşlər Nazirliyinin Diplomatik Akademiyasının əsasında, 13 yanvar 2014-cü ildə "ADA" Universiteti yaradılması ilə fəaliyyəti başa çatmışdır.[3]Universitetin tarixi 2006-cı ildən başlayır. Belə ki, 2006-cı ilin martında Azərbaycan Xarici İşlər Nazirliyinin nəzdində Azərbaycan Diplomatik Akademiyası adı altında ali təhsil müəssisəsi yaradılmışdır. Azərbaycan Respublikası Prezidentinin 13 yanvar 2014-cü il tarixli Sərəncamı ilə Azərbaycan Respublikası Xarici İşlər Nazirliyinin Diplomatik Akademiyasının və Azərbaycan Respublikasında İnformasiya Texnologiyaları Universitetinin əsasında "ADA" Universiteti yaradılmışdır[4].Əsas məqsədi diplomatiya, ictimai münasibətlər, biznes, informasiya texnologiyaları və sistem mühəndisliyi üzrə qlobal liderlər hazırlamaqdır. 
Akademiyanın əsasını qoyan rektor, Azərbaycan Xarici İşlər Nazirinin müavini və Azərbaycanın ABŞ-dəki sabiq səfiri Hafiz Paşayevdir.[5]2012-ci ildə Bakı şəhərində "Dədə Qorqud" parkı yaxınlığında yerləşən yeni "Green and Smart" kampusuna köçmüşdür. 2009-cu ildən magistr pilləsi, 2011-ci ildən bakalavr pilləsində ali təhsil verir. XIV əsrdə Rəşidəddin Fəzlullah ibn Əbil-Xeyrə Əli Həmədaninin qələmə aldığı Cəmi ət-Təvarix (Tarix toplusu) adlı əsərinin "Mujallad-i Awwal" (Birinci Kitabı: Monqol tarixi)in "Bab-i Awwal" (Birinci Bölüm: Türk ve Monqol qəbilələrinin tarixi)ində monqolların yaradılış dastanı olaraq qeyd edilmiş əfsanə,[5][6][7] 17. yüzildə Şibanın nəvələrindən və Xivə xanlığının xanı olan Əbulqazi Bahadır xanın qələmə aldığı Şəcərəyi Türk adlı əsərdə də monqolların yaradılış dastanı olaraq qeyd edilmişdir, lakin bəzi mənbələrə görə də Türk dastanıdır.[6][7] Bəhsi keçən hər iki tarixi mənbədə Nekuz (Nüküz) və Qiyan (Kıyan) adlı qardaşlar ilə xanımları tatarlar tərəfindən məğlub edildikdən sonra Ərgənəqon (Farsca:ارگنه قون; Ergene Qon) adı verilən dar və sıldırım bir yerə getmiş, 400 ildə sülaləsi çoxalıb Ərgənəqondan çıxmşdır. Ərgənəqondan çıxdıqları zaman yol göstərənin Börteçine olduğu düşünülməkdədir.[7]Ancaq Göytürklərin diriliş dastanı ilə olan oxşarlıqları səbəb göstərərək Türklərə aid bir dastan olduğunu iddia edən tədqiqatçılar da var.[7][8] Talat Sait Halman isə mifoloji bir varlıq olan Bozqurdun müdafiəsi sayəsində soyunun tükənmə təhlükəsindən qurtulan və yenə Bozqurtlar sayəsində dağlarla əhatə olunmuş Ərgənəqon vadisindən çıxan bir Türk toplumunun hekayəsindən bəhs edildiyini iddia edir.[9] Digər görüşlərə görə isə Türklər və monqollar arasında bənzər olan əfsanələr vardır.[10] Əfsanə bəzən də Novruz ilə əlaqələndirilir.[11]"""
+SAMPLE_TEXT = """Azərbaycan Respublikası, Qafqaz regionunda yerləşən, unikal coğrafi mövqeyi, zəngin tarixi və çoxşaxəli mədəniyyəti ilə hər zaman özünəməxsus yer tutmuşdur. Bu torpaqlar minillər ərzində müxtəlif sivilizasiyaların, dövlətlərin və mədəniyyətlərin təsirinə məruz qalmış, özündə dərin izlər buraxmışdır. Şərqlə Qərb arasında körpü rolunu oynayan Azərbaycan, Şirvanşahlar, Səfəvilər, Qaraqoyunlular, Ağqoyunlular kimi güclü dövlətlərin mərkəzi olmuş, İpək Yolu üzərində strateji əhəmiyyət kəsb etmişdir. Müstəqilliyini bərpa etdikdən sonra, xüsusilə son illərdə, ölkə dinamik inkişaf yolu keçərək regionda aparıcı dövlətlərdən birinə çevrilmişdir. Bakı şəhəri, Azərbaycanın paytaxtı olaraq, Xəzər dənizinin sahilində yerləşən, qədimlik və müasirlik vəhdətini özündə yaşadan möhtəşəm bir məkandır. UNESCO Dünya İrs Siyahısına daxil edilmiş Qədim Şəhər (İçərişəhər) kompleksi, orta əsr memarlığının nadir nümunələrini özündə əks etdirir. Şirvanşahlar Sarayı, Qız Qalası kimi tarixi abidələr buranın qədim tarixindən xəbər verir. Eyni zamanda, Heydər Əliyev Mərkəzi, Alov Qüllələri, Bakı Abadlıq Kompleksi kimi müasir memarlıq inciləri şəhərə müasir və unikal görkəm qatır. Bu müasir tikililər, innovativ dizaynları və texnoloji həlləri ilə diqqət çəkir. Azərbaycanın təbiəti də onun mədəni zənginliyi qədər heyranedici və müxtəlifdir. Ölkə ərazisi, Böyük Qafqaz dağ sisteminin cənub-şərq yamaclarından Kiçik Qafqaz silsiləsinə, Kür-Araz ovalığından Lənkəran ovalığına qədər geniş diapazonda müxtəlif relyef formalarını əhatə edir. Bu coğrafi müxtəliflik, müxtəlif iqlim tiplərinin və zəngin biomüxtəlifliyin yaranmasına səbəb olmuşdur. Qəbələ, Şəki, Qax kimi şimal rayonları dağlıq və meşəlik əraziləri, bol bulaqları, ecazkar mənzərələri ilə tanınır. Lənkəran və Astara rayonları subtropik iqlimə malik olub, rütubətli meşələri və unikal flora və faunası ilə seçilir. 
Göygöl Milli Parkı, Şahdağ Milli Parkı kimi qorunan ərazilər, ölkənin təbiətini qorumaq, həm də elmi tədqiqatlar aparmaq üçün əhəmiyyətli mərkəzlərdir. Göygölün özü, onun ətrafındakı meşələr və dağlar, həm də bir çox əfsanə və rəvayətlərə məskən olmuşdur. Elm və texnologiya sahəsində Azərbaycanın nailiyyətləri də xüsusi qeyd olunmalıdır. Ölkədə İKT sektorunun inkişafına böyük önəm verilir. Rəqəmsal transformasiya, süni intellekt (AI), maşın təlimi (ML), böyük verilənlər (Big Data) texnologiyalarının tətbiqi prioritet istiqamətlərdir. Bakıda fəaliyyət göstərən "Hi-Tech Park" və "Sumqayıt Kimya Sənaye Parkı" kimi texnoloji və sənaye parkları, innovativ layihələrin həyata keçirilməsi, yeni texnologiyaların yaradılması və tətbiqi üçün əlverişli mühit yaradır. "Azercosmos" Açıq Səhmdar Cəmiyyəti, peyk texnologiyaları sahəsində ölkənin potensialını artıraraq, telekommunikasiya, naviqasiya, telekommunikasiya, yerin uzaqdan müşahidəsi və informasiya təhlükəsizliyi kimi sahələrdə mühüm rol oynayır. Universitetlərdə, xüsusilə də ADA Universiteti, İnformasiya Texnologiyaları Universiteti (indiki ADA Universitetinin tərkibində) kimi təhsil müəssisələrində İKT sahələri üzrə tədqiqatlar aparılır, yüksək ixtisaslı kadrlar hazırlanır. Virtual və artırılmış reallıq texnologiyaları da təhsil, mədəniyyət və turizm sahələrində geniş tətbiq olunur. Bu sahələrin inkişafı ölkənin gələcək dayanıqlı inkişafı üçün əsasdır. Azərbaycan tarixi boyu bir çox mühüm hadisələrə şahidlik etmişdir. Qədim dövrlərdə Midiya, Atropatena, Qafqaz Albaniyası kimi dövlətlərin yaranması, sonrakı dövrlərdə Arran, Şirvan, Gəncə xanlıqlarının mövcudiyyəti ölkənin dövlətçilik ənənəsini gücləndirmişdir. 1918-ci ildə Şərqdə ilk demokratik respublika olan Azərbaycan Xalq Cümhuriyyətinin qurulması, bu torpaqların azadlıqsevərliyinin və müstəqillik uğrunda mübarizəsinin parlaq bir nümunəsidir. 
1920-ci ildə Aprel işğalından sonra Sovet hakimiyyəti qurulsa da, xalq heç vaxt öz milli kimliyindən və dövlətçilik arzularından vaz keçməmişdir. 1991-ci ildə Ümummilli Lider Heydər Əliyevin müdrik siyasəti və xalqın iradəsi sayəsində Azərbaycan yenidən müstəqilliyini qazanmışdır. Sonrakı illərdə, Prezident İlham Əliyevin rəhbərliyi altında ölkə sosial-iqtisadi, siyasi və hərbi sahələrdə böyük nailiyyətlər əldə etmişdir. 2020-ci ildə aparılan Vətən Müharibəsi nəticəsində ölkənin ərazi bütövlüyü tam təmin olunmuş, doğma torpaqlarımız 30 illik işğaldan azad edilmişdir. Bu qələbə, Azərbaycan xalqının birliyinin, gücünün və iradəsini bütün dünyaya bir daha sübut etmişdir. Azərbaycanın gələcəkə baxışı, həm ölkə daxilində, həm də beynəlxalq aləmdə özünü göstərir. Ölkə, enerji resursları (neft və qaz) ixracatçısı olmaqla yanaşı, yeni tranzit və logistika mərkəzi kimi də strateji əhəmiyyətini artırmışdır. "Asrın Müqaviləsi" və sonrakı neft-qaz layihələri ölkə iqtisadiyyatının inkişafına böyük təkan vermişdir. Bakı-Tbilisi-Qars dəmir yolu, "Şərq-Qərb" və "Şimal-Cənub" beynəlxalq nəqliyyat dəhlizlərinin mühüm hissəsi kimi, Azərbaycanın Avrasiya məkanında rolunu gücləndirmişdir. Yaşıl enerji və dayanıqlı inkişaf konsepsiyaları da ölkənin gələcək strategiyasının əsasını təşkil edir. Günəş və külək enerjisi potensialından səmərəli istifadə etmək, emissiyaları azaltmaq və iqlim dəyişikliyinin mənfi təsirlərini minimuma endirmək üçün ciddi səylər göstərilir. Təhsil sisteminin modernləşdirilməsi, gənclərin beynəlxalq səviyyədə rəqabətədavamlı kadr kimi yetişdirilməsi, elmi-texniki potensialın artırılması ölkənin gələcək uğurları üçün prioritetdir. Şəhərlərin, xüsusilə də regionların sosial-iqtisadi inkişafına yönəlmiş dövlət proqramları, infrastruktur layihələri, sosial təminatın gücləndirilməsi, bütün bunlar Azərbaycanın davamlı tərəqqisinin təminatıdır. 
Beynəlxalq əməkdaşlığın genişləndirilməsi, Avropa İttifaqı və digər beynəlxalq təşkilatlarla əlaqələrin möhkəmləndirilməsi, regional sabitliyin və təhlükəsizliyin təmin edilməsində aktiv rol oynamaq, Azərbaycanın xarici siyasətinin əsas istiqamətləridir. Sülhə və əməkdaşlığa əsaslanan bu siyasət, ölkənin beynəlxalq aləmdəki nüfuzunu daha da artırır. Tarixi və mədəni irsimizin qorunması da dövlətimizin prioritetlərindəndir. Naxçıvan Muxtar Respublikasında yerləşən Əshabi-Kəhf, Şəki Xan Sarayı, Qobustan qaya təsvirləri, Qarabağın mədəniyyət abidələri, milli musiqimizin (muğam) UNESCO tərəfindən qeyri-maddi mədəni irs siyahısına daxil edilməsi, Azərbaycanın zəngin mədəniyyətini dünyaya tanıtmaq istiqamətində atılan mühüm addımlardır. Qarabağın azad olunmasından sonra, işğaldan ziyan dəymiş mədəniyyət obyektlərinin, dini məbədlərin, xüsusilə də Şuşa şəhərinin bərpası və qorunması işlərinə başlanılmışdır. Şuşa, Azərbaycanın mədəniyyət paytaxtı elan edilmişdir və bu, onun milli-mənəvi dəyərlər sistemimizdəki xüsusi yerini bir daha təsdiqləyir. Arxeoloji qazıntılar, tarixi abidələrin restavrasiyası, muzeylərin fəaliyyətinin təkmilləşdirilməsi, yeni mədəniyyət mərkəzlərinin yaradılması, gənclərin milli mədəniyyətimizə marağının artırılması istiqamətində davamlı işlər aparılır. Bu işlər, gələcək nəsillərə örnək olaraq, ölkəmizin unikal mədəni irsinin qorunub saxlanılmasına və təbliğinə xidmət edir. Elmi tədqiqatlar sahəsində də böyük irəliləyişlər müşahidə olunur. Milli Elmlər Akademiyası (AMEA) və müxtəlif ali təhsil müəssisələrinin alimləri, fundamental və tətbiqi elmlərin müxtəlif sahələrində mühüm nailiyyətlər əldə edirlər. Fizika, kimya, biologiya, texnika elmləri, humanitar və sosial elmlər sahələrində aparılan tədqiqatlar, ölkənin elmi potensialını artırmaqla yanaşı, həm də qlobal elmi proseslərə töhfə verir. 
Xüsusilə nanotexnologiyalar, materialşünaslıq, süni intellekt, bioteknologiya, ekologiya və yer elmləri kimi müasir istiqamətlərdə aparılan tədqiqatlar, prioritet sahələr olaraq müəyyən edilmişdir. Gənc alimlərin hazırlanması, elmi-tədqiqat işlərinə cəlb edilməsi, beynəlxalq elmi əməkdaşlığın gücləndirilməsi, elmi jurnal və konfransların təşkili, müasir innovativ ideyaların reallaşdırılması üçün qrant proqramlarının maliyyələşdirilməsi, bütün bunlar elmin inkişafına dövlət qayğısının bariz nümunələridir. Akademik təqaüdlərin verilməsi, elmi müəssisələrin maddi-texniki bazasının gücləndirilməsi, beynəlxalq reytinqli jurnallarda məqalələrin dərc olunması üçün şərait yaradılması, həmçinin elmi nəticələrin sənaye və iqtisadiyyatla inteqrasiyası, ölkənin elmi-texniki potensialını gücləndirir. Tarixi Azərbaycan torpaqları, həm də zəngin folklora, ədəbiyyata və incəsənətə malikdir. Dədə Qorqud dastanları, Nizami Gəncəvinin, Füzulinin, Vahid, Nəsimi kimi dahi şairlərin əsərləri, Azərbaycan ədəbiyyatının qızıl fondunu təşkil edir. Muğam ifaçılıq sənəti, xalçaçılıq, metal üzərində işləmə sənəti, memarlıq məktəbləri, milli rəqs və musiqi janrları – bütün bunlar Azərbaycanın dünya mədəniyyətinə verdiyi töhfələrdir. Bu dəyərlərin qorunması, təbliği və gələcək nəsillərə ötürülməsi, dövlətimizin qarşısında duran mühüm vəzifələrdəndir. Mədəniyyət Nazirliyinin fəaliyyəti, muzeylərin, teatrların, musiqi kollektivlərinin işinin təkmilləşdirilməsi, xarici ölkələrdə Azərbaycan mədəniyyətinin təbliğ edilməsi, müxtəlif mədəni tədbirlərin – festivalların, sərgilərin, konsertlərin – təşkili, bütün bunlar ölkəmizin mədəni həyatının zənginliyini göstərir."""
+class AzerbaijaniTextChunker:
+    """Optimized Azerbaijani text chunker with single parameter configuration"""
+    def __init__(self, model_path: str = MODEL_PATH):
+        self.model_path = model_path
+        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+        self.tokenizer = None
+        self.model = None
+        self.is_loaded = False
+    def load_model(self) -> bool:
+        """Load the trained model"""
+        try:
+            self.tokenizer = AutoTokenizer.from_pretrained(
+                self.model_path, use_fast=True, trust_remote_code=False
+            )
+            self.model = AutoModelForTokenClassification.from_pretrained(
+                self.model_path, trust_remote_code=False
+            )
+            self.model.to(self.device)
+            self.model.eval()
+            self.is_loaded = True
+            return True
+        except Exception as e:
+            print(f"Error loading model: {e}")
+            return False
+    def chunk(self,
+              text: str,
+              max_chunk_tokens: Optional[int] = None,
+              target_tokens: Optional[int] = None,
+              max_chunk_chars: Optional[int] = None,
+              threshold: Optional[float] = None,
+              priority: Optional[str] = None) -> List[str]:
+        """
+        Chunk text into semantic segments with optimal sizing
+        Args:
+            text: Input text to chunk
+            max_chunk_tokens: Maximum tokens per chunk (default: from global config)
+            target_tokens: Target optimal size in tokens (default: from global config)
+            max_chunk_chars: Maximum characters per chunk (default: from global config)
+            threshold: Confidence threshold for splitting (default: from global config)
+            priority: Which limit to prioritize when both are set (default: from global config)
+        Returns:
+            List of optimally-sized text chunks
+        """
+        if not self.is_loaded:
+            if not self.load_model():
+                return [text]
+        # Use global configuration as defaults
+        if max_chunk_tokens is None:
+            max_chunk_tokens = MAX_CHUNK_TOKENS
+        if target_tokens is None:
+            target_tokens = TARGET_TOKENS
+        if max_chunk_chars is None:
+            max_chunk_chars = MAX_CHUNK_CHARS
+        if threshold is None:
+            threshold = THRESHOLD
+        if priority is None:
+            priority = PRIORITY
+        # Dynamic validation and minimum length calculation
+        dynamic_min_length = max(10, target_tokens // MIN_LENGTH_DIVISOR)
+        if not text or len(text.strip()) < dynamic_min_length:
+            return [text.strip()] if text.strip() else []
+        text = text.strip()
+        # Ensure max >= target with reasonable margin
+        if max_chunk_tokens < target_tokens:
+            margin = max(target_tokens * MARGIN_RATIO, MIN_MARGIN)
+            max_chunk_tokens = target_tokens + int(margin)
+        # Step 1: Get all semantic boundaries from the model
+        semantic_boundaries = self._get_semantic_boundaries_fixed(text, threshold)
+        # Step 2: Split text at semantic boundaries (no limits yet)
+        initial_chunks = self._split_at_boundaries(text, semantic_boundaries)
+        # Step 3: Optimize chunk sizes with dynamic parameters
+        optimized_chunks = self._optimize_chunk_sizes(
+            initial_chunks, max_chunk_tokens, target_tokens
+        )
+        # Step 4: Apply additional limits if specified
+        if max_chunk_chars is not None:
+            final_chunks = self._apply_char_limits(optimized_chunks, max_chunk_chars)
+        else:
+            final_chunks = optimized_chunks
+        return self._clean_and_validate_chunks(final_chunks, target_tokens, max_chunk_tokens)
+    def _get_semantic_boundaries_fixed(self, text: str, threshold: float) -> List[int]:
+        """Get all semantic boundaries without sequence length warnings"""
+        # Check text length first and use chunked approach if needed
+        rough_token_count = len(text.split()) * ROUGH_TOKEN_MULTIPLIER
+        if rough_token_count > LONG_TEXT_THRESHOLD:
+            return self._get_boundaries_for_long_text(text, threshold)
+        # For shorter texts, use standard approach with truncation
+        full_encoding = self.tokenizer(
+            text,
+            return_tensors="pt",
+            return_offsets_mapping=True,
+            add_special_tokens=True,
+            truncation=True,  # Enable truncation to avoid warnings
+            max_length=BERT_MAX_LENGTH,   # Standard BERT limit
+            padding=False
+        )
+        input_ids = full_encoding['input_ids'][0]
+        offset_mapping = full_encoding['offset_mapping'][0]
+        # Remove special tokens (<s> at start, </s> at end for XLM-RoBERTa)
+        content_input_ids = input_ids[1:-1]
+        content_offsets = offset_mapping[1:-1]
+        if len(content_input_ids) == 0:
+            return []
+        # Process with single window since we truncated
+        boundaries = self._process_single_window_safe(
+            text, input_ids, content_offsets, threshold
+        )
+        # Remove duplicates and sort
+        boundaries = sorted(set(boundaries))
+        # Remove boundaries at text start/end
+        boundaries = [b for b in boundaries if 0 < b < len(text)]
+        return boundaries
+    def _get_boundaries_for_long_text(self, text: str, threshold: float) -> List[int]:
+        """Handle long texts by processing in semantic chunks"""
+        # Split text into rough sections first (by sentences).
+        # findall keeps the terminating punctuation and whitespace, so the pieces
+        # concatenate back to `text` and current_pos stays a true character offset.
+        import re
+        sentences = re.findall(r'[^.!?]*[.!?]+|[^.!?]+$', text)
+        all_boundaries = []
+        current_pos = 0
+        # Process each section that fits in model limits
+        current_section = ""
+        for sentence in sentences:
+            test_section = current_section + sentence
+            rough_tokens = len(test_section.split()) * ROUGH_TOKEN_MULTIPLIER
+            if rough_tokens > LONG_TEXT_THRESHOLD and current_section:
+                # Process current section
+                section_boundaries = self._process_text_section(
+                    current_section, current_pos, threshold
+                )
+                all_boundaries.extend(section_boundaries)
+                # Start new section
+                current_pos += len(current_section)
+                current_section = sentence
+            else:
+                current_section = test_section
+        # Process remaining section
+        if current_section:
+            section_boundaries = self._process_text_section(
+                current_section, current_pos, threshold
+            )
+            all_boundaries.extend(section_boundaries)
+        return sorted(list(set(all_boundaries)))
+    def _process_text_section(self, section: str, offset: int, threshold: float) -> List[int]:
+        """Process a section of text that fits in model limits"""
+        try:
+            # Tokenize section (should be safe now)
+            encoding = self.tokenizer(
+                section,
+                return_tensors="pt",
+                return_offsets_mapping=True,
+                add_special_tokens=True,
+                truncation=True,
+                max_length=BERT_MAX_LENGTH
+            )
+            input_ids = encoding['input_ids'][0]
+            offset_mapping = encoding['offset_mapping'][0]
+            # Remove special tokens
+            content_offsets = offset_mapping[1:-1]
+            # Get model predictions
+            with torch.no_grad():
+                outputs = self.model(encoding['input_ids'].to(self.device))
+                probabilities = F.softmax(outputs.logits, dim=-1)
+                # Extract B-CHUNK probabilities
+                b_chunk_id = self.model.config.label2id.get("B-CHUNK", 1)
+                chunk_probs = probabilities[0, 1:-1, b_chunk_id].cpu().numpy()
+            # Find boundaries
+            boundaries = []
+            for i, (prob, char_offset) in enumerate(zip(chunk_probs, content_offsets)):
+                if prob > threshold:
+                    char_start = char_offset[0].item()
+                    char_end = char_offset[1].item()
+                    # Find clean boundary in full text
+                    boundary_pos = self._find_clean_boundary_global(
+                        section, char_start, char_end, offset
+                    )
+                    if boundary_pos is not None and boundary_pos > offset:
+                        boundaries.append(boundary_pos)
+            return boundaries
+        except Exception as e:
+            print(f"Warning: Error processing section: {e}")
+            return []
+    def _find_clean_boundary_global(self, section: str, char_start: int,
+                                   char_end: int, global_offset: int) -> Optional[int]:
+        """Find clean boundary and return global position"""
+        local_boundary = self._find_clean_boundary(section, char_start, char_end)
+        if local_boundary is not None:
+            global_pos = global_offset + local_boundary
+            # Additional validation: keep only positions strictly inside this section's global span
+            if global_pos > 0 and global_pos < len(section) + global_offset:
+                return global_pos
+        return None
+    def _process_single_window_safe(self, text: str, input_ids: torch.Tensor,
+                                   content_offsets: torch.Tensor, threshold: float) -> List[int]:
+        """Process text that fits in a single window safely"""
+        try:
+            # Get model prediction
+            with torch.no_grad():
+                input_ids_batch = input_ids.unsqueeze(0).to(self.device)
+                outputs = self.model(input_ids_batch)
+                probabilities = F.softmax(outputs.logits, dim=-1)
+                # Extract B-CHUNK probabilities for content tokens only
+                b_chunk_id = self.model.config.label2id.get("B-CHUNK", 1)
+                chunk_probs = probabilities[0, 1:-1, b_chunk_id].cpu().numpy()
+            # Find boundaries
+            boundaries = []
+            for i, (prob, offset) in enumerate(zip(chunk_probs, content_offsets)):
+                if prob > threshold:
+                    char_start = offset[0].item()
+                    char_end = offset[1].item()
+                    # Find clean boundary position
+                    boundary_pos = self._find_clean_boundary(text, char_start, char_end)
+                    if boundary_pos is not None and boundary_pos > 0:
+                        boundaries.append(boundary_pos)
+            return boundaries
+        except Exception as e:
+            print(f"Warning: Error in single window processing: {e}")
+            return []
+    def _find_clean_boundary(self, text: str, char_start: int, char_end: int) -> Optional[int]:
+        """Find a clean boundary near the predicted position, prioritizing sentence starts"""
+        # Ensure positions are within text bounds
+        char_start = max(0, min(char_start, len(text) - 1))
+        char_end = max(0, min(char_end, len(text)))
+        # Search range around the token
+        search_start = max(0, char_start - SEARCH_RANGE)
+        search_end = min(len(text), char_end + SEARCH_RANGE)
+        # Priority 1: Look for sentence endings followed by a capital letter or digit (forward search)
+        for i in range(char_start, search_end):
+            if i < len(text) and text[i] in '.!?':
+                # Look for the start of next sentence
+                boundary = i + 1
+                # Skip whitespace
+                while boundary < len(text) and text[boundary] in ' \t\n':
+                    boundary += 1
+                # Check if next character is uppercase (start of sentence)
+                if boundary < len(text) and (text[boundary].isupper() or text[boundary].isdigit()):
+                    return boundary
+        # Priority 2: Search backwards for sentence endings followed by a capital letter or digit
+        for i in range(char_start - 1, search_start - 1, -1):
+            if i >= 0 and text[i] in '.!?':
+                boundary = i + 1
+                # Skip whitespace
+                while boundary < len(text) and text[boundary] in ' \t\n':
+                    boundary += 1
+                # Check if next character is uppercase (start of sentence)
+                if boundary < len(text) and (text[boundary].isupper() or text[boundary].isdigit()):
+                    return boundary
+        # Priority 3: Word boundaries (spaces) only if followed by capital
+        for i in range(char_start, search_end):
+            if i < len(text) and text[i] in ' \t':
+                boundary = i + 1
+                while boundary < len(text) and text[boundary] in ' \t':
+                    boundary += 1
+                # Only use if followed by capital letter
+                if boundary < len(text) and text[boundary].isupper():
+                    return boundary
+        # Fallback: use the end of the token
+        fallback_pos = char_end if char_end <= len(text) else len(text)
+        return fallback_pos
+    def _split_at_boundaries(self, text: str, boundaries: List[int]) -> List[str]:
+        """Split text at boundaries ensuring no gaps or overlaps and proper sentence starts"""
+        if not boundaries:
+            return [text]
+        chunks = []
+        start = 0
+        for boundary in boundaries:
+            # Ensure boundary is within text
+            boundary = min(boundary, len(text))
+            if start < boundary:
+                chunk = text[start:boundary].strip()
+                if chunk:  # Only add non-empty chunks
+                    chunks.append(chunk)
+            start = boundary
+        # Add remaining text
+        if start < len(text):
+            remaining = text[start:].strip()
+            if remaining:
+                chunks.append(remaining)
+        return chunks
+    def _optimize_chunk_sizes(self, chunks: List[str], max_tokens: int, target_tokens: int) -> List[str]:
+        """Fully dynamic chunk size optimization"""
+        if not chunks:
+            return []
+        # Calculate dynamic thresholds based on target
+        good_size_min = target_tokens * GOOD_SIZE_MIN_RATIO
+        good_size_max = max_tokens  # Use actual max
+        merge_threshold = target_tokens * MERGE_THRESHOLD_RATIO
+        optimized = []
+        i = 0
+        while i < len(chunks):
+            current_chunk = chunks[i]
+            current_tokens = self._count_tokens(current_chunk)
+            # If chunk is in good size range, keep it
+            if good_size_min <= current_tokens <= good_size_max:
+                optimized.append(current_chunk)
+                i += 1
+                continue
+            # If chunk is too large, split it
+            if current_tokens > good_size_max:
+                split_chunks = self._split_large_chunk_dynamic(
+                    current_chunk, max_tokens, target_tokens
+                )
+                optimized.extend(split_chunks)
+                i += 1
+                continue
+            # If chunk is too small, try to merge with next chunks
+            if current_tokens < merge_threshold:
+                merged_chunk, chunks_consumed = self._merge_small_chunks_dynamic(
+                    chunks[i:], max_tokens, target_tokens
+                )
+                optimized.append(merged_chunk)
+                i += chunks_consumed
+                continue
+            # Default: keep the chunk (it's in acceptable range)
+            optimized.append(current_chunk)
+            i += 1
+        return optimized
+    def _count_tokens(self, text: str) -> int:
+        """Count tokens in text"""
+        if not text.strip():
+            return 0
+        encoding = self.tokenizer(text, add_special_tokens=False)
+        return len(encoding['input_ids'])
+    def _merge_small_chunks_dynamic(self, chunks: List[str], max_tokens: int, target_tokens: int) -> Tuple[str, int]:
+        """Dynamic merging based on target size"""
+        if not chunks:
+            return "", 0
+        merged = chunks[0]
+        consumed = 1
+        merged_tokens = self._count_tokens(merged)
+        # Dynamic target: aim for target_tokens but stop before max_tokens
+        optimal_target = min(target_tokens, max_tokens * OPTIMAL_TARGET_RATIO)
+        # Try to merge with following chunks
+        for i in range(1, len(chunks)):
+            candidate = merged + " " + chunks[i]
+            candidate_tokens = self._count_tokens(candidate)
+            # Stop if we exceed max tokens
+            if candidate_tokens > max_tokens:
+                break
+            # Merge and continue
+            merged = candidate
+            merged_tokens = candidate_tokens
+            consumed += 1
+            # Stop if we reached optimal target
+            if merged_tokens >= optimal_target:
+                break
+        return merged, consumed
+    def _split_large_chunk_dynamic(self, chunk: str, max_tokens: int, target_tokens: int) -> List[str]:
+        """Dynamic splitting based on target size"""
+        result = []
+        remaining = chunk
+        while remaining:
+            current_tokens = self._count_tokens(remaining)
+            # If remaining fits in max size, add it
+            if current_tokens <= max_tokens:
+                result.append(remaining)
+                break
+            # Find optimal split point (prefer target_tokens, but respect max_tokens)
+            optimal_split = min(target_tokens, max_tokens * OPTIMAL_TARGET_RATIO)
+            split_pos = self._find_optimal_split_position(remaining, optimal_split, max_tokens)
+            if split_pos > 0 and split_pos < len(remaining):
+                chunk_part = remaining[:split_pos].strip()
+                if chunk_part:
+                    result.append(chunk_part)
+                remaining = remaining[split_pos:].strip()
+            else:
+                # Fallback: force split at max_tokens
+                split_pos = self._find_token_split_position(remaining, max_tokens)
+                if split_pos > 0:
+                    chunk_part = remaining[:split_pos].strip()
+                    if chunk_part:
+                        result.append(chunk_part)
+                    remaining = remaining[split_pos:].strip()
+                else:
+                    # Last resort: take the remaining text
+                    result.append(remaining)
+                    break
+        return [r for r in result if r.strip()]
+    def _find_optimal_split_position(self, text: str, target_tokens: int, max_tokens: int) -> int:
+        """Find optimal split position aiming for target_tokens but not exceeding max_tokens"""
+        # Binary search for position closest to target_tokens
+        left, right = 0, len(text)
+        best_pos = 0
+        best_tokens = 0
+        while left <= right:
+            mid = (left + right) // 2
+            test_text = text[:mid]
+            if not test_text.strip():
+                left = mid + 1
+                continue
+            tokens = self._count_tokens(test_text)
+            if tokens <= max_tokens:
+                # This position is valid, check if it's closer to target
+                if abs(tokens - target_tokens) < abs(best_tokens - target_tokens) or best_tokens == 0:
+                    best_pos = mid
+                    best_tokens = tokens
+                if tokens < target_tokens:
+                    left = mid + 1  # Try to get closer to target
+                else:
+                    break  # We've reached or exceeded target
+            else:
+                right = mid - 1
+        # Refine to find clean boundary
+        if best_pos > 0:
+            clean_pos = self._find_char_split_position(text, best_pos)
+            return clean_pos
+        return best_pos
+    def _find_token_split_position(self, text: str, max_tokens: int) -> int:
+        """Find a good position to split text within token limit"""
+        # Binary search approach for accurate token-based splitting
+        left, right = 0, len(text)
+        best_pos = 0
+        while left <= right:
+            mid = (left + right) // 2
+            # Test if text[:mid] fits within token limit
+            test_text = text[:mid]
+            if not test_text.strip():
+                left = mid + 1
+                continue
+            encoding = self.tokenizer(test_text, add_special_tokens=False)
+            token_count = len(encoding['input_ids'])
+            if token_count <= max_tokens:
+                best_pos = mid
+                left = mid + 1
+            else:
+                right = mid - 1
+        return best_pos
+    def _find_char_split_position(self, text: str, max_chars: int) -> int:
+        """Find a good position to split text within character limit"""
+        if max_chars >= len(text):
+            return len(text)
+        # Look for sentence endings before the limit
+        search_start = max(0, max_chars - SENTENCE_SEARCH_RANGE)
+        for i in range(min(max_chars, len(text)) - 1, search_start, -1):
+            if i < len(text) and text[i] in '.!?':
+                # Skip forward past any whitespace
+                boundary = i + 1
+                while boundary < len(text) and text[boundary] in ' \t\n':
+                    boundary += 1
+                return min(boundary, len(text))
+        # Look for spaces before the limit
+        search_start = max(0, max_chars - SPACE_SEARCH_RANGE)
+        for i in range(min(max_chars, len(text)) - 1, search_start, -1):
+            if i < len(text) and text[i] in ' \t':
+                return i + 1
+        # Fallback: split at the limit
+        return min(max_chars, len(text))
+    def _apply_char_limits(self, chunks: List[str], max_chars: int) -> List[str]:
+        """Apply character limits to optimized chunks"""
+        result = []
+        for chunk in chunks:
+            if len(chunk) <= max_chars:
+                result.append(chunk)
+            else:
+                # Split by character limit
+                sub_chunks = self._split_by_chars(chunk, max_chars)
+                result.extend(sub_chunks)
+        return result
+    def _split_by_chars(self, chunk: str, max_chars: int) -> List[str]:
+        """Split chunk by character limit only"""
+        result = []
+        remaining = chunk
+        while remaining:
+            if len(remaining) <= max_chars:
+                result.append(remaining)
+                break
+            # Find a good split point within char limit
+            split_pos = self._find_char_split_position(remaining, max_chars)
+            if split_pos > 0 and split_pos < len(remaining):
+                result.append(remaining[:split_pos].strip())
+                remaining = remaining[split_pos:].strip()
+            else:
+                # Fallback: force split at limit
+                result.append(remaining[:max_chars].strip())
+                remaining = remaining[max_chars:].strip()
+        return [r for r in result if r.strip()]
+    def _clean_and_validate_chunks(self, chunks: List[str], target_tokens: int, max_tokens: int) -> List[str]:
+        """Dynamic cleaning and validation - fully adaptive to target size"""
+        if not chunks:
+            return []
+        # Dynamic thresholds based on target
+        dynamic_min_length = max(10, target_tokens // MIN_LENGTH_DIVISOR)
+        merge_threshold = target_tokens * MERGE_ATTEMPT_RATIO
+        optimal_max = max_tokens * OPTIMAL_MAX_RATIO
+        # Remove very small chunks first
+        cleaned = [chunk.strip() for chunk in chunks if len(chunk.strip()) >= dynamic_min_length]
+        if not cleaned:
+            return chunks  # Return original if all chunks are removed
+        # Dynamic merging for better chunk sizes
+        final = []
+        i = 0
+        while i < len(cleaned):
+            current = cleaned[i]
+            current_tokens = self._count_tokens(current)
+            # If current chunk is small, try to merge
+            if current_tokens < merge_threshold:
+                merged_successfully = False
+                # Try to merge with previous chunk if it exists and is not too large
+                if final:
+                    prev_tokens = self._count_tokens(final[-1])
+                    combined = final[-1] + " " + current
+                    combined_tokens = self._count_tokens(combined)
+                    if combined_tokens <= optimal_max and combined_tokens <= target_tokens * MAX_MERGE_RATIO:
+                        final[-1] = combined
+                        merged_successfully = True
+                # If couldn't merge with previous, try to merge with next chunks
+                if not merged_successfully and i + 1 < len(cleaned):
+                    merged_chunk, chunks_consumed = self._merge_small_chunks_dynamic(
+                        cleaned[i:], max_tokens, target_tokens
+                    )
+                    final.append(merged_chunk)
+                    i += chunks_consumed
+                    continue
+                # If still couldn't merge, add as is (unless it's too small)
+                if not merged_successfully:
+                    min_acceptable = target_tokens * MIN_ACCEPTABLE_RATIO
+                    if current_tokens >= min_acceptable:
+                        final.append(current)
+                    elif final:  # Try one more time to merge with previous
+                        prev_tokens = self._count_tokens(final[-1])
+                        combined = final[-1] + " " + current
+                        combined_tokens = self._count_tokens(combined)
+                        if combined_tokens <= optimal_max:
+                            final[-1] = combined
+                        else:
+                            final.append(current)  # Keep as separate chunk
+                    else:
+                        final.append(current)  # First chunk, keep it
+            else:
+                final.append(current)
+            i += 1
+        return final
+def chunk_azerbaijani_text(text: str,
+                          model_path: str = MODEL_PATH,
+                          max_chunk_tokens: Optional[int] = MAX_CHUNK_TOKENS,
+                          target_tokens: Optional[int] = TARGET_TOKENS,
+                          max_chunk_chars: Optional[int] = MAX_CHUNK_CHARS,
+                          threshold: float = THRESHOLD,
+                          priority: str = PRIORITY) -> List[str]:
+    """
+    Convenience wrapper that chunks Azerbaijani text using the module-level default configuration
+    Args:
+        text: Input text to chunk
+        model_path: HuggingFace model path
+        max_chunk_tokens: Maximum tokens per chunk
+        target_tokens: Target optimal size in tokens
+        max_chunk_chars: Maximum characters per chunk
+        threshold: Confidence threshold for splitting
+        priority: Which limit to prioritize when both are set
+    Returns:
+        List of optimally-sized text chunks
+    """
+    chunker = AzerbaijaniTextChunker(model_path)
+    return chunker.chunk(text, max_chunk_tokens, target_tokens, max_chunk_chars, threshold, priority)
+def main():
+    """Main function demonstrating the simplified chunker"""
+    print("=== SIMPLIFIED AZERBAIJANI TEXT CHUNKER ===\n")
+    # Show current configuration
+    print("CURRENT CONFIGURATION:")
+    print(f"Model path: {MODEL_PATH}")
+    print(f"Max tokens: {MAX_CHUNK_TOKENS}")
+    print(f"Target tokens: {TARGET_TOKENS}")
+    print(f"Max chars: {MAX_CHUNK_CHARS}")
+    print(f"Threshold: {THRESHOLD}")
+    print(f"Priority: {PRIORITY}")
+    print("="*60 + "\n")
+    # Basic chunking with global settings
+    print(f"Chunking with global settings (target ~{TARGET_TOKENS} tokens, max {MAX_CHUNK_TOKENS}):")
+    chunks = chunk_azerbaijani_text(SAMPLE_TEXT)
+    total_tokens = 0
+    text_preview_main = 80  # Preview length for main chunks
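+    # Load the tokenizer once, before the loop, rather than once per chunk
+    try:
+        from transformers import AutoTokenizer
+        tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-base')
+    except Exception:
+        tokenizer = None  # Token counts are reported as "N/A" below
+    for i, chunk in enumerate(chunks, 1):
+        if tokenizer is not None:
+            tokens = tokenizer(chunk, add_special_tokens=False)['input_ids']
+            token_count = len(tokens)
+            total_tokens += token_count
+        else:
+            token_count = "N/A"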
+        print(f"{i}. [{len(chunk)} chars, {token_count} tokens]")
+        print(f"   {chunk[:text_preview_main]}{'...' if len(chunk) > text_preview_main else ''}\n")
+    print(f"Total chunks: {len(chunks)}")
+    if total_tokens > 0:
+        print(f"Total tokens: {total_tokens}")
+        print(f"Average tokens per chunk: {total_tokens/len(chunks):.1f}")
+    print("="*60 + "\n")
+# Example usage
+if __name__ == "__main__":
+    main()
+```
+```bash
+# Results
+Chunking with global settings (target ~256 tokens, max 400):
+1. [1247 chars, 288 tokens]
+   Azərbaycan Respublikası, Qafqaz regionunda yerləşən, unikal coğrafi mövqeyi, zən...
+2. [1007 chars, 247 tokens]
+   Azərbaycanın təbiəti də onun mədəni zənginliyi qədər heyranedici və müxtəlifdir....
+3. [1027 chars, 197 tokens]
+   Rəqəmsal transformasiya, süni intellekt (AI), maşın təlimi (ML), böyük verilənlə...
+4. [1128 chars, 222 tokens]
+   Azərbaycan tarixi boyu bir çox mühüm hadisələrə şahidlik etmişdir. Qədim dövrlər...
+5. [1194 chars, 256 tokens]
+   Azərbaycanın gələcəkə baxışı, həm ölkə daxilində, həm də beynəlxalq aləmdə özünü...
+6. [1412 chars, 277 tokens]
+   Beynəlxalq əməkdaşlığın genişləndirilməsi, Avropa İttifaqı və digər beynəlxalq t...
+7. [1224 chars, 229 tokens]
+   Elmi tədqiqatlar sahəsində də böyük irəliləyişlər müşahidə olunur. Milli Elmlər ...
+8. [857 chars, 179 tokens]
+   Tarixi Azərbaycan torpaqları, həm də zəngin folklora, ədəbiyyata və incəsənətə m...
+Total chunks: 8
+Total tokens: 1895
+Average tokens per chunk: 236.9
+```
+# Quick Configuration for Chunking
+## Global Parameters
+| Parameter           | Value | Description                                                   |
+|--------------------|-------|---------------------------------------------------------------|
+| MAX_CHUNK_TOKENS   | 300   | Maximum tokens per chunk (must be < model context window)     |
+| TARGET_TOKENS      | 200   | Target optimal chunk size                                     |
+| THRESHOLD          | 0.12  | Semantic boundary confidence (0.10–0.20)                      |
+---
+## Configuration for 384-token Embedding Models
+### Recommended Settings
+| Parameter           | Value | Notes                                           |
+|--------------------|-------|--------------------------------------------------|
+| MAX_CHUNK_TOKENS   | 300   | 78% of context window (384 × 0.78)               |
+| TARGET_TOKENS      | 200   | Two-thirds of max tokens                         |
+| THRESHOLD          | 0.12  | Balanced segmentation                            |
+### Conservative Settings
+| Parameter           | Value | Notes                                           |
+|--------------------|-------|--------------------------------------------------|
+| MAX_CHUNK_TOKENS   | 256   | ~67% of context window                          |
+| TARGET_TOKENS      | 180   | ~70% of max tokens                              |
+| THRESHOLD          | 0.15  | Fewer, larger chunks                            |
+### Aggressive Settings
+| Parameter           | Value | Notes                                           |
+|--------------------|-------|--------------------------------------------------|
+| MAX_CHUNK_TOKENS   | 320   | 83% of context window                           |
+| TARGET_TOKENS      | 220   | ~69% of max tokens                              |
+| THRESHOLD          | 0.10  | More boundaries, finer segmentation             |
+---
+## Key Rules
+- MAX_CHUNK_TOKENS should be 20–25% less than the embedding model’s context window
+- TARGET_TOKENS should be 60–80% of MAX_CHUNK_TOKENS
+- THRESHOLD controls granularity:
+  - Lower → more chunks
+  - Higher → fewer chunks
+- Always test with your specific embedding model and adjust as needed
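The sizing rules above can be sketched as a small helper. This is only an illustration: `ChunkSettings`, `derive_settings`, and the default `headroom` and `target_ratio` values are assumptions chosen to fall inside the stated 20–25% and 60–80% ranges, not part of the chunker itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkSettings:
    """Hypothetical container for the three global parameters."""
    max_chunk_tokens: int
    target_tokens: int
    threshold: float


def derive_settings(context_window: int,
                    headroom: float = 0.22,
                    target_ratio: float = 2 / 3,
                    threshold: float = 0.12) -> ChunkSettings:
    """Apply the key rules to an embedding model's context window.

    headroom: fraction of the context window left unused (rule: 20-25%).
    target_ratio: TARGET_TOKENS as a fraction of MAX_CHUNK_TOKENS (rule: 60-80%).
    """
    if context_window <= 0:
        raise ValueError("context_window must be positive")
    if not 0.0 < headroom < 1.0:
        raise ValueError("headroom must be between 0 and 1")
    if not 0.0 < target_ratio <= 1.0:
        raise ValueError("target_ratio must be in (0, 1]")
    max_tokens = round(context_window * (1 - headroom))
    target = round(max_tokens * target_ratio)
    return ChunkSettings(max_tokens, target, threshold)


settings = derive_settings(384)
print(settings.max_chunk_tokens, settings.target_tokens, settings.threshold)
```

With the defaults, a 384-token window gives 300 and 200, matching the recommended settings table. Treat the output as a starting point and tune it against your own embedding model.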
+---
+## Expected Results
+- Average chunk size: ~TARGET_TOKENS ± 20%
+- All chunks: ≤ MAX_CHUNK_TOKENS
+- Semantic boundaries preserved
+- No text loss or duplication
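One way to check these expectations on your own output is a small validation pass. The sketch below makes several assumptions: `check_chunks` and `whitespace_token_count` are hypothetical names, the default token counter is a crude whitespace stand-in (pass your real tokenizer's counter instead), and the loss check compares text with whitespace collapsed because chunks are stripped and re-joined with spaces.

```python
from typing import Callable, List, Sequence


def whitespace_token_count(text: str) -> int:
    """Stand-in token counter (assumption): counts whitespace-separated words."""
    return len(text.split())


def check_chunks(source: str,
                 chunks: Sequence[str],
                 max_chunk_tokens: int,
                 count_tokens: Callable[[str], int] = whitespace_token_count) -> List[str]:
    """Return human-readable problems; an empty list means every check passed."""
    problems = []
    # Size check: no chunk may exceed the token limit.
    for index, chunk in enumerate(chunks, 1):
        tokens = count_tokens(chunk)
        if tokens > max_chunk_tokens:
            problems.append(
                f"chunk {index} has {tokens} tokens (limit {max_chunk_tokens})"
            )
    # Loss/duplication check: chunks joined in order must equal the source,
    # ignoring differences in whitespace.
    normalized_source = " ".join(source.split())
    normalized_chunks = " ".join(" ".join(chunks).split())
    if normalized_source != normalized_chunks:
        problems.append("chunks do not reassemble into the source text")
    return problems


sample = "One two three.  Four five six.\nSeven eight."
print(check_chunks(sample, ["One two three.", "Four five six.", "Seven eight."], 3))
```

Returning a list of problems rather than raising makes it easy to log every violation for a document in one pass.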
+# Use Cases
+## Perfect for RAG Systems
+- **Vector Databases**: Ensure chunks fit embedding model limits
+- **Search Applications**: Optimal chunk sizes for retrieval
+- **Question Answering**: Maintain semantic coherence
+## Document Processing
+- **Academic Papers**: Respect section and paragraph boundaries
+- **Legal Documents**: Maintain clause integrity
+- **News Articles**: Preserve story flow and context
+## Content Management
+- **CMS Integration**: Automatic content segmentation
+- **API Limits**: Respect external service constraints
+- **Storage Optimization**: Consistent chunk sizes for databases
+---
+# Chunking Strategies
+## Optimal Strategy (Default)
+- **Threshold**: 0.13
+- **Best for**: General purpose, balanced precision/recall
+- **Typical output**: 3–5 chunks for medium texts
+## Conservative Strategy
+- **Threshold**: 0.3
+- **Best for**: Longer chunks, fewer segments
+- **Typical output**: 1–3 chunks for medium texts
+## Aggressive Strategy
+- **Threshold**: 0.05
+- **Best for**: Fine-grained segmentation
+- **Typical output**: 5–10 chunks for medium texts
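The three strategies differ only in the threshold they pass to the chunker, so they can be captured as named presets. In this sketch the threshold values come from the list above, while the preset names and the `threshold_for` helper are illustrative assumptions.

```python
from typing import Dict

# Thresholds copied from the strategy descriptions; the keys are illustrative names.
STRATEGY_THRESHOLDS: Dict[str, float] = {
    "optimal": 0.13,
    "conservative": 0.3,
    "aggressive": 0.05,
}


def threshold_for(strategy: str) -> float:
    """Look up the boundary threshold for a named strategy (hypothetical helper)."""
    try:
        return STRATEGY_THRESHOLDS[strategy.lower()]
    except KeyError:
        known = ", ".join(sorted(STRATEGY_THRESHOLDS))
        raise ValueError(
            f"unknown strategy {strategy!r}; expected one of: {known}"
        ) from None


# Typical use with the wrapper defined earlier in this card:
# chunks = chunk_azerbaijani_text(text, threshold=threshold_for("conservative"))
print(threshold_for("aggressive"))
```

A lower threshold accepts more predicted boundaries and so produces more, smaller chunks; a higher one produces fewer, larger chunks.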
+---
+## CC BY 4.0 License — What It Allows
+Under the **Creative Commons Attribution 4.0 International (CC BY 4.0)** license, you are free to use, modify, and distribute the model, including for commercial purposes, as long as you give proper credit to the original creator.
+For more information, please refer to the <a target="_blank" href="https://creativecommons.org/licenses/by/4.0/deed.en">CC BY 4.0 license</a>.
+## Contact
+For more information, questions, or issues, please contact LocalDoc at [[email protected]].