first try finetune deepseek-r1-distill-qwen-1.5b by java. needed comparation on leaderboard )

Strategic Methodology for Staged Artificial Intelligence Learning: From Architectural Thinking to Multilinguistic Understanding

Modern approaches to the development and training of artificial intelligence systems increasingly turn to biological analogies, particularly to the process of human cognitive development. The presented research proposes a comprehensive methodology for multi-phase training of AI systems that mimics the sequential process of human learning – from basic architectural concepts to complex language competencies. This methodology is based on the principle that artificial intelligence should first form a structural understanding of information and basic thinking skills before immersing itself in specialized knowledge domains. Such an approach potentially contributes to the formation of deeper contextual understanding and more effective reasoning abilities in artificial intelligence systems.

Theoretical Foundations of Progressive Artificial Intelligence Learning

Human learning is characterized by the gradual accumulation of knowledge, where each new concept builds on the foundation of previous understandings. Similarly, the proposed methodology views AI as a developing system that must sequentially build competencies and integrate new knowledge into an already formed cognitive structure. A key aspect of this approach is not simply the mechanical accumulation of information, but the formation in the AI system of the ability to understand the relationships between different areas of knowledge, which is critical for the development of true reasoning abilities. Artificial intelligence systems trained according to the traditional model with a one-time presentation of the entire data array often demonstrate a superficial understanding of concepts and a limited ability to transfer knowledge between domains.

Unlike traditional methods, the proposed approach provides for the formation of an architectural understanding of the subject area as the primary stage of learning, creating a kind of "mental map" of knowledge, on which more specialized concepts are subsequently superimposed. This method allows the system to develop not only a functional understanding of individual components of information but also a deeper awareness of their structural relationships. The gradual increase in the complexity of the studied material, combined with the rethinking of previously acquired knowledge in a new context, contributes to the development of more flexible and adaptive intelligence.

It is important to note that this methodology not only imitates the external aspects of human learning but also seeks to reproduce the internal cognitive processes underlying understanding and reasoning. This includes the formation of abstract models, categorization of information, pattern recognition, and the development of metacognitive abilities – awareness of one's own cognitive process. In an ideal scenario, an AI system should not only consume information but also actively interact with it, forming its own "notes" and conclusions, which contributes to a deeper assimilation of knowledge.

Cognitive Architecture of Multi-Phase Learning

The proposed cognitive architecture for AI learning is based on five sequential phases, each with its specific function in the formation of the system's intellectual abilities. This architecture reflects the hierarchical nature of knowledge – from general structural principles to specific implementations and language competencies. Each phase involves passing through approximately three epochs of training, which ensures sufficient depth of material assimilation before moving to the next level of complexity.

The first phase – "Architecture" – focuses on forming basic ideas about the structural principles of information organization. At this stage, the system gets acquainted with architectural instructions, frameworks, and UML diagrams, which lays the foundation for understanding the organizational aspects of software systems. Training materials at this stage include specialized datasets on software architecture, documentation on architectural patterns, and system descriptions in PlantUML notation. This allows AI to form a primary understanding of abstract structures and relationships, which will become the basis for further learning.

The second phase – "Computer Science" – expands the system's understanding through the study of more specific aspects of computer science, including algorithms, data structures, and system design. At this stage, AI becomes familiar with the fundamental principles of software development, algorithmic approaches to problem-solving, and basic concepts of computational theory. An important aspect of this phase is the formation of an understanding of trade-offs between different approaches to problem-solving, which develops the system's ability for analytical thinking and evaluation of the effectiveness of various solutions.

The third and fourth phases are devoted to specific programming languages – C++ and Java, respectively. These phases involve immersion in the specifics of syntax, semantics, and implementation features of various programming concepts in specific languages. It is important to note that at these stages, AI does not simply study syntactic constructions but also develops an understanding of idiomatic approaches to programming in each language, stylistic features, and optimal practices.

The final fifth phase introduces the understanding of natural languages, which allows the system to integrate technical knowledge with language competencies. This phase is of particular importance for creating AI systems capable of effectively communicating with humans and translating between technical and everyday language. The inclusion of the Russian language in the training materials at this stage contributes to the formation of multilinguistic competencies, which is critical for the global application of AI systems.

Integration of Metadata and Contextual Information

A special role in the proposed methodology is played by the quality and diversity of metadata accompanying the training materials. Proper field mapping and extraction of meaningful metadata significantly increase the effectiveness of training, providing a deeper understanding of the context and structure of the information being studied. Key types of metadata include: identification of language type (programming or natural), abstract syntax tree (AST) of code, design patterns used in the code, and contextual information about the problems being solved.

The proposed approach provides not only for the superficial annotation of data but also for a deep analysis of their structure and relationships. For example, for program code, it is recommended to extract not just the full AST, but also simplified representations of the program structure – lists of fields, methods, declarations, which helps the AI system to form a structural understanding of the code. Integration of different representations of the same material – for example, source code together with its intermediate representations (IR in the case of LLVM or bytecode for JVM) – allows the system to form a deeper understanding of the internal mechanisms of program operation.

Contextual information, such as a description of the task solved by the code or the purpose of an architectural pattern, plays a critical role in forming a functional understanding of the material being studied. This approach allows AI not just to memorize syntactic constructions but to understand their applicability in specific scenarios, which contributes to the development of a deeper conceptual understanding of the subject area.

Multilinguistic Aspects in Artificial Intelligence Training

An important component of the proposed methodology is the integration of a multilinguistic approach, where the same concept is presented in several natural languages. This approach contributes to the formation of more flexible language associations in the AI system and a better understanding of universal conceptual connections that do not depend on specific linguistic expression. For experimental purposes, the use of two natural languages appears optimal, allowing the system to form cross-linguistic associations without overly complicating the learning process.

The practical implementation of the multilinguistic approach may include presenting code comments in several languages, describing architectural patterns using terminology from different linguistic traditions, and including texts in different languages describing the same concepts in the training dataset. This approach is especially valuable for creating models with multilinguistic abilities that can effectively serve users speaking different languages and perform quality translation of technical documentation between languages.

The multilinguistic approach also contributes to the development of a more abstract understanding of concepts in the AI system, not tied to specific language constructions. This is especially important for technical fields, where there are significant differences in terminology between different linguistic traditions, which can create barriers to international knowledge exchange. An AI system trained on materials in different languages is potentially capable of overcoming these barriers, providing more effective communication between specialists from different countries.

Tokenization Optimization for Technical Domains

The process of tokenization – breaking text into minimal semantic units for processing – plays a critical role in the formation of AI language models. In the context of training for technical domains, a specific approach to tokenization is proposed, prioritizing keywords from programming languages and specialized notations, such as PlantUML. This allows the system to more effectively recognize and process program constructions, which is critical for the correct understanding of code.

The proposed tokenization strategy reflects the hierarchy of priorities in technical training: first technical vocabulary and syntactic constructions, then elements of natural language. This approach provides more accurate modeling of technical domains and creates a foundation for the subsequent integration of technical knowledge with natural language contexts. Technical terms and constructions of programming languages should be considered as integral lexical units, not broken down into smaller components, which helps the system maintain their semantic integrity.

Special attention in the tokenization strategy is paid to the processing of multilingual texts, where it is necessary to take into account differences in grammatical structures and lexical composition of different languages. The optimal approach appears to be the use of subword tokenizers capable of efficiently processing texts in different languages without excessively increasing the vocabulary. It is important to ensure a balanced representation of technical terminology in different languages to avoid bias towards the dominant language.

Dataset Structure and Field Mapping Strategies

The proposed methodology includes a detailed description of the structure of training datasets for each phase of training and field mapping strategies for the effective integration of different types of data. Each dataset should contain not only the main content (code, text, diagrams) but also a rich set of metadata providing context and structural understanding. Proper organization and annotation of training data play a critical role in forming a holistic understanding of the studied area in the AI system.

For the architectural training phase, it is recommended to use datasets containing architectural instructions, framework descriptions, and UML diagrams with detailed annotations explaining their structure and purpose. Field mapping for this phase should provide a clear separation between descriptive text, structural elements, and metadata. The inclusion of various types of architectural representations – from high-level descriptions to detailed diagrams – contributes to the formation of a multi-level understanding of architectural principles.

For phases related to programming, it is recommended to use datasets containing quality code examples with detailed comments and descriptions of the tasks being solved. Field mapping for these phases should ensure the integration of source code with its structural representations (AST), intermediate forms (IR, bytecode), and contextual information. This allows the AI system to form a multidimensional understanding of program code, including both syntactic and semantic aspects.

Special attention is paid to the language learning phase, where datasets should include diverse texts in several languages, covering both general topics and specialized technical areas. Field mapping for this phase should ensure correct identification of the text language and its thematic orientation. The inclusion of parallel texts in different languages describing the same concepts contributes to the formation of cross-linguistic associations and a deeper conceptual understanding.

Cyclical Learning and Competency Expansion

The proposed methodology provides for the possibility of cyclical repetition of the learning process, where after passing through all five phases, the system can start a new learning cycle on more complex material or with an emphasis on certain aspects. This approach allows gradually deepening and expanding the knowledge of the AI system, based on the already formed conceptual base. Cyclical learning imitates the natural process of human cognition, where periodic return to basic concepts with a new level of understanding contributes to the formation of deeper and more integrated knowledge.

In the process of subsequent learning cycles, the system begins to develop the ability for qualitative assessment of the studied material – to distinguish "well-designed" and "poorly designed" code, identify effective and ineffective architectural solutions, recognize patterns and anti-patterns of design. This ability for critical analysis represents the highest level of cognitive competencies, indicating a deep understanding of the subject area and the principles underlying it.

An important aspect of cyclical learning is the active interaction of the system with the studied material, when AI not only passively consumes information but also forms its own conclusions, generalizations, and "notes." This approach contributes to the development of metacognitive abilities – awareness of one's own process of cognition and the ability to direct learning independently. In an ideal scenario, the system should not only assimilate the proposed information but also formulate questions, identify gaps in knowledge, and seek ways to eliminate them.

Expanding the Scope of Methodology Application

Although the presented methodology is initially focused on training AI in the field of programming with a focus on C++ and Java, its principles are universal and can be adapted for other programming languages and technical domains. The key value of the approach lies in the sequential transition from architectural understanding to specific implementations, which is applicable to any programming language and technical discipline. The flexibility of the methodology provides the possibility of creating specialized AI assistants for various technology stacks while maintaining a general approach to the formation of fundamental understanding.

Potential areas for expanding the methodology include: training AI for the analysis and creation of software in other programming languages (Python, Go, Rust, etc.), adaptation for specific domains (machine learning, cybersecurity, game development), integration with other natural languages besides Russian and English. The methodology can also be expanded to include other data modalities, such as images (diagrams, graphs, visual representations of code) and audio (explanations of concepts, lectures, discussions of design solutions), which contributes to the formation of a more versatile understanding of the subject area.

Of particular interest is the adaptation of the methodology for training specialized AI systems in the field of machine learning and artificial intelligence – a kind of "meta-learning," where the system studies the principles of its own operation. Such a recursive approach can potentially contribute to the development of a deeper self-understanding in AI systems and the formation of more advanced cognitive abilities.

Metrics for Evaluating Learning Effectiveness

For an objective assessment of the effectiveness of the proposed methodology, it is necessary to develop a comprehensive system of metrics covering various aspects of the intellectual abilities of the AI system. Traditional metrics based on the accuracy of next token prediction or model perplexity do not fully reflect the depth of understanding and reasoning ability, which are key goals of the proposed approach. It is necessary to develop more complex metrics evaluating the system's ability for structural analysis, abstract thinking, knowledge transfer between domains, and metacognitive processes.

Potential metrics for effectiveness evaluation include: the system's ability to decompose complex tasks into simpler components, the ability to identify and apply appropriate design patterns in new contexts, the ability to explain the principles of code operation in natural language, the ability to identify and correct errors in code, the ability to generate code based on high-level specifications. Of particular value are metrics evaluating the system's ability to learn in the process of interaction with the user – how effectively the system integrates new information and adjusts its models of understanding.

A comprehensive assessment of learning effectiveness should also take into account long-term aspects – how stable the acquired knowledge is, to what extent the system is capable of integrating it in new contexts, how its cognitive abilities develop over time. This requires conducting longitudinal studies tracking changes in the intellectual abilities of the system over a long period of time and multiple learning cycles.

Development Prospects and Potential Applications

The proposed methodology for staged artificial intelligence learning opens up broad prospects for the development of deeper and more contextually aware AI systems. Potential directions for further development include: integration with neurosymbolic approaches combining neural networks with symbolic reasoning systems; development of more complex architectures capable of multi-level information processing and metacognitive processes; creation of specialized educational programs for AI in various professional domains.

Practical applications of this methodology cover a wide range of areas: from the development of intelligent programming assistants capable of not only generating code but also explaining the principles of its operation and proposing architectural solutions, to the creation of educational systems adapting educational material to the individual needs of students. Of particular value is the possibility of creating AI systems capable of effectively communicating technical concepts to non-technical specialists, which can significantly increase the accessibility of technical knowledge and promote interdisciplinary collaboration.

In the long term, the development of AI systems with a deep understanding of architecture and programming can lead to a qualitative leap in the automation of software development – from automatic refactoring and optimization of existing code to the generation of full-fledged software systems based on high-level specifications. Such systems can become not just tools for automating routine tasks but full-fledged partners in the creative process of development, capable of proposing innovative solutions and participating in architectural discussions on an equal footing with humans.

Ethical Aspects and Social Implications

The development of AI systems with a deep understanding of programming and architecture raises a number of ethical questions and social implications that require careful consideration. One of the key issues is the potential impact of such systems on the labor market in the IT sphere – from changing qualification requirements for programmers to the possible reduction in the number of jobs in certain segments of the industry. A balanced analysis of the potential consequences of the widespread introduction of such systems and the development of strategies for adapting educational programs and professional trajectories to new realities is necessary.

Another important aspect is the question of control and responsibility for decisions made by AI systems in the process of software development. As systems become more autonomous in making architectural and design decisions, the question arises of who bears responsibility for potential errors or vulnerabilities in the generated code. It is necessary to develop mechanisms for auditing and validating AI decisions, ensuring transparency in the decision-making process and the possibility of human control over critical aspects of development.

Attention should also be paid to issues of accessibility and democratization of technologies – how to ensure that the advantages of advanced AI systems are available to a wide range of developers and organizations, not just large technology companies with extensive resources. This requires the development of open standards, accessible tools, and educational programs providing wide access to the benefits of AI-assisted development.

Conclusion

The proposed methodology for staged artificial intelligence learning represents a comprehensive approach that mimics the natural process of human learning – from forming a basic structural understanding to developing specialized competencies. Key aspects of this methodology include: sequential introduction of information from architecture to specific implementations, enrichment of data with meta-information, integration of different representations of the same material, a multilinguistic approach to learning, and cyclical repetition of the educational process with gradual deepening of understanding.

The application of this methodology can contribute to the creation of deeper and more contextually aware AI systems capable of not only effectively performing specific tasks but also understanding their structural foundations and relationships. Such systems can potentially become more reliable partners in solving complex problems requiring a deep understanding of the subject area and the ability for complex reasoning. In perspective, the development of such systems can lead to a qualitative leap in the automation of intellectual activity and the formation of new forms of collaboration between humans and machines.

It is important to note that the proposed methodology is not complete or exhaustive – it is rather an invitation to discussion and experiments in the field of more structured and contextually rich artificial intelligence learning. Further development of this approach requires interdisciplinary collaboration of specialists in the fields of artificial intelligence, cognitive science, education, and software engineering. Only such a comprehensive consideration of the problem will allow creating truly deep and understanding artificial intelligence systems capable of productive collaboration with humans in solving complex intellectual problems.

{
"phase_1_architecture":
    [
    {
        "path": "software-architecture-instructions.parquet",
        "field_mapping": {
            "text": "instructions",
            "lang": "static:en",
            "architecture": "instructions",
            "patterns": "none",
            "code": "none",
            "comments": "none",
            "metadata": "none",
            "ast": "none",
            "context": "none"
        }
    },
    {
        "path": "Architectural-Frameworks_Final.jsonl",
        "field_mapping": {
            "text": "input",
            "lang": "static:en",
            "architecture": "output",
            "patterns": "output",
            "code": "none",
            "comments": "none",
            "metadata": "none",
            "ast": "none",
            "context": "instruction"
        }
    },
    {
        "path": "Software_Architecture_Final.jsonl",
        "field_mapping": {
            "text": "input",
            "lang": "static:en",
            "architecture": "output",
            "patterns": "output",
            "code": "none",
            "comments": "none",
            "metadata": "none",
            "ast": "none",
            "context": "instruction"
        }
    },
    {
    "path": "dimsavva_puml-class/conversational_dataset.jsonl",
        "field_mapping": {
            "code": "conversations.value",
            "lang": "static:plantuml",
            "text": "conversations",
            "architecture": "none",
            "patterns": "none",
            "comments": "none",
            "metadata": "none",
            "ast": "none",
            "context": "none"
        }
    }
    ],
"phase_2_computer_science": 
    [
    {
    "path": "nslaughter_system_design_prompts/data/train-00000-of-00001.parquet",
        "field_mapping": {
            "text": "input",
            "lang": "static:en",
            "code": "output",
            "metadata": "metadata",
            "architecture": "output",
            "patterns": "none",
            "comments": "none",
            "ast": "none",
            "context": "instruction"
        }
    },
    {
        "path": "system_design_prompts.parquet",
        "field_mapping": {
            "text": "input",
            "lang": "static:en",
            "code": "output",
            "metadata": "metadata",
            "architecture": "output",
            "patterns": "none",
            "comments": "none",
            "ast": "none",
            "context": "instruction"
        }
    },
    {
    "path": "computer_science_synthetic_dataset*.csv",
        "field_mapping": {
            "text": "input",
            "lang": "static:en",
            "code": "none",
            "metadata": "none",
            "architecture": "none",
            "patterns": "none",
            "comments": "none",
            "ast": "none",
            "context": "output"
        }
    }
    ],
"phase_3_cpp_code": [
    {
        "path": "nguyentruong-ins_nhlcoding_cleaned_cpp_dataset/data/train-*.parquet",
        "field_mapping": {
            "code": "solution",
            "lang": "static:cpp",
            "metadata": "difficulty",
            "text": "none",
            "architecture": "none",
            "patterns": "none",
            "comments": "none",
            "ast": "none",
            "context": "none"
        }
    },
    {
        "path": "nguyentruong-ins_nhlcoding_cleaned_cpp_dataset/data/valid-00000-of-00001.parquet",
        "field_mapping": {
            "code": "solution",
            "lang": "static:cpp",
            "metadata": "difficulty",
            "text": "none",
            "architecture": "none",
            "patterns": "none",
            "comments": "none",
            "ast": "none",
            "context": "none"
        }
    },
    {
        "path": "nguyentruong-ins_nhlcoding_cleaned_cpp_dataset/data/test-00000-of-00001.parquet",
        "field_mapping": {
            "code": "solution",
            "lang": "static:cpp",
            "metadata": "difficulty",
            "text": "none",
            "architecture": "none",
            "patterns": "none",
            "comments": "none",
            "ast": "none",
            "context": "none"
        }
    }
    ],
"phase_4_java_code": [
    {
        "path": "ammarnasr_the-stack-java-clean/data/train-*.parquet",
        "field_mapping": {
            "code": "content",
            "lang": "static:java",
            "metadata": "hexsha,size",
            "text": "none",
            "architecture": "none",
            "patterns": "none",
            "comments": "none",
            "ast": "none",
            "context": "none"
        }
    },
    {
        "path": "ammarnasr_the-stack-java-clean/data/valid-00000-of-00001-3765edf410c1d15e.parquet",
        "field_mapping": {
            "code": "content",
            "lang": "static:java",
            "metadata": "hexsha,size",
            "text": "none",
            "architecture": "none",
            "patterns": "none",
            "comments": "none",
            "ast": "none",
            "context": "none"
        }
    },
    {
        "path": "ammarnasr_the-stack-java-clean/bigcode-the-stack-dedup-train.jsonl",
        "field_mapping": {
            "code": "content",
            "lang": "static:java",
            "metadata": "hexsha,size",
            "text": "none",
            "architecture": "none",
            "patterns": "none",
            "comments": "none",
            "ast": "none",
            "context": "none"
        }
    }
    ],
"phase_5_languages": [
    {
        "path": "Den4ikAI_russian_dialogues/dataset.jsonl",
        "field_mapping": {
            "text": "question,answer",
            "lang": "static:ru",
            "code": "none",
            "metadata": "relevance",
            "architecture": "none",
            "patterns": "none",
            "comments": "none",
            "ast": "none",
            "context": "none"
        }
    },
    {
        "path": "Egor-AI_Russian-Words/russian-mnemonic-words.txt",
        "field_mapping": {
            "text": "file_content",
            "lang": "static:ru",
            "code": "none",
            "metadata": "none",
            "architecture": "none",
            "patterns": "none",
            "comments": "none",
            "ast": "none",
            "context": "none"
        }
    },
    {
        "path": "Egor-AI_Dataset_of_Russian_thinking/RTD.json",
            "field_mapping": {
            "text": "prompt,response",
            "lang": "static:ru",
            "code": "none",
            "metadata": "system",
            "architecture": "none",
            "patterns": "none",
            "comments": "none",
            "ast": "none",
            "context": "none"
        }
    }
    ]
}

my repo https://github.com/lexasub/codestar

dfsafdsf
/

deepseek-r1-distill-qwen-1.5b-java