Training of useful small language models.

How full line code completion works in PyCharm and other JetBrains IDEs

The paper explains how a tiny (100M parameters) code completion model was trained. I’m fascinated by SLMs (small language models), and it’s always a joy to see them become more than toys and turn into something useful in real products. You don’t need any math or deep learning knowledge to read this paper: it explains the setup and training data in plain English, and most of it will be accessible to an engineer.

Another good read on this topic is the Falcon-H1-Tiny blog post from TII researchers. This one requires a general intuition for how LLMs are trained. What I like most is that, because the model is relatively cheap to train, the team was able to evaluate multiple strategies and explains them in great detail. Most interesting: they were able to train an instruct model in one go, without a separate SFT phase. They also experimented with a Python-specific model which you can use in VS Code.
You can get an idea of how fluent a 90M general-purpose model can be:

~ ollama run hf.co/tiiuae/Falcon-H1-Tiny-90M-Instruct-GGUF:Q8_0 "What is considered pythonic?"

Pythonic is a programming paradigm that emphasizes simplicity, readability, and ease of maintenance. It is often associated 
with writing code that is easier to understand and maintain. Here are some key characteristics of Pythonic:

### 1. **Simple Syntax**
   - **Avoids unnecessary complexity**: Pythonic code is minimal and focused on functionalities. Avoid unnecessary 
constructs like `print`, `if`, `else`, or `def`.
   - **Uses built-in functions**: Instead of writing complex code, use standard Python functions where possible.

### 2. **Minimal Code**
   - **Reduces boilerplate**: Minimizes the amount of code by eliminating unnecessary steps.
   - **Modular design**: Each function or block is self-contained and can be called independently.

### 3. **Readability**
   - **Clarity and simplicity**: Pythonic code is easier to read and understand, making it more maintainable.
   - **Consistency**: Ensures code is organized and follows a logical flow.

### 4. **Testability**
   - **Easily testable**: Pythonic code is easier to test because it follows best practices for unit testing.
   - **Integration tests**: Easier to write and run tests for complex systems.

### 5. **Performance**
   - **Optimized for speed**: Pythonic code is often more efficient because it avoids unnecessary computations.
   - **Concurrency**: Can be more predictable and easier to manage in multi-threaded environments.

### 6. **Documentation**
   - **Clear comments**: Provide meaningful documentation for functions and modules.
   - **Docstrings**: Help other developers understand the purpose of functions.

### 7. **Reusability**
   - **Modular reusable components**: Components can be reused across different parts of the program.
   - **Lambda or decorators**: Allow for creating generic code that works with various data types.

### 8. **Community and Ecosystem**
   - **Large and active community**: A robust ecosystem with extensive resources and plugins.
   - **Documentation and tutorials**: Provides tools and examples for beginners.

### Example of Pythonic Code:
def greet(name: str) -> str:
    return f"Hello, {name}!"

print(greet("Alice"))  # Output: Hello, Alice!

### Why Pythonic Matters:
- **Smaller, easier to debug**: Pythonic code is often more readable and easier to understand.
- **Faster development**: Easier to run in production environments.
- **Better maintainability**: Minimizes the risk of introducing new features or errors.

In summary, Pythonic is about writing code that is simple, efficient, and easy to maintain, making it a preferred choice 
for many projects.

Don’t get too excited: this model is 30 times smaller than Apple’s local model for Writing Tools, which in turn is 200+ times smaller than frontier models.
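
To put those ratios in perspective, here is a back-of-the-envelope sketch of the parameter counts they imply. Only the 90M figure and the two ratios come from this post; the derived sizes are rough estimates, not published numbers, and the function names are made up for illustration.

```python
# Rough scale comparison implied by the ratios quoted in this post.
# Assumption: the derived sizes are estimates, not official parameter counts.
TINY_PARAMS = 90_000_000   # Falcon-H1-Tiny-90M
APPLE_RATIO = 30           # "30 times smaller" than Apple's local model
FRONTIER_RATIO = 200       # "200+ times smaller" than frontier models (a lower bound)


def implied_sizes(tiny=TINY_PARAMS, apple_ratio=APPLE_RATIO, frontier_ratio=FRONTIER_RATIO):
    """Return (apple_local, frontier) parameter counts implied by the ratios."""
    apple = tiny * apple_ratio
    frontier = apple * frontier_ratio
    return apple, frontier


def fmt(n):
    """Format a parameter count in billions, e.g. 2700000000 -> '2.7B'."""
    return f"{n / 1e9:.1f}B"


apple, frontier = implied_sizes()
print(f"Apple local model: ~{fmt(apple)} parameters")
print(f"Frontier models:   at least ~{fmt(frontier)} parameters")
print(f"Tiny vs. frontier: at least ~{frontier // TINY_PARAMS}x gap")
```

In other words, the gap between the 90M model and a frontier model is on the order of thousands of times, which is worth keeping in mind when reading the answer above.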