Pacific-Prime committed on
Commit 06f75ea · verified · 1 Parent(s): 6974223

Upload folder using huggingface_hub

Files changed (5)
  1. .gitignore +1 -5
  2. README.md +41 -348
  3. model.safetensors +2 -2
  4. tokenizer.json +1 -6
  5. tokenizer_config.json +1 -1
.gitignore CHANGED
@@ -1,5 +1 @@
- checkpoints/*
- test_cot_now.py
- __pycache__/
- evaluate_cot_model.py
- cot_extended_training/*
+ inl_llm/*

README.md CHANGED
@@ -1,376 +1,69 @@
- ---
- language: en
- license: cc-by-nc-4.0
- tags:
- - text-generation
- - integrator-neuron
- - custom-architecture
- pipeline_tag: text-generation
- ---
-
- # INL Architecture - Integrator Neuron Layer
-
- **Production-ready neural architecture** built on **Integrator Neuron dynamics**: iterative dynamics replace the traditional FFN layers. A **universal architecture** that works for any type of model: LLMs, vision transformers, multimodal, diffusion, RL policies, etc.
-
- ### Architecture Features
-
- - **Universal** - build LLMs, vision, audio, multimodal, diffusion, and RL models with the same architecture
- - **HuggingFace ready** - drop-in replacement for the FFN in any transformer
- - **KV caching** - full support for efficient autoregressive generation
- - **Adaptive compute** - stops automatically once converged (30-50% faster)
- - **Parameter efficient** - shared controllers mean 96% fewer params than FFN
- - **Bio-inspired** - based on integrator neurons from neuroscience
- - **Configurable** - tune iterations, controllers, and equilibrium for your task
-
- ### This Checkpoint
-
- **Example implementation**: a 1.1B-parameter **language model** with the INL architecture.
- - 25 layers × 5 iterations/layer = rich iterative computation
- - The **architecture scales** from 100M to 100B+ params
- - It works in **any domain** (language, vision, audio, etc.)
-
- ## What is INL?
-
- **Traditional transformers** use static feedforward layers:
- ```python
- x_out = x + FFN(x)  # One-shot computation
- ```
-
- **INL-LLM** uses iterative integrator dynamics to find an equilibrium:
- ```python
- # Each of the 25 layers performs 5 iterations (configurable)
- # Total: 25 layers × 5 iterations = 125 computation steps
- for iteration in range(num_iterations_per_layer):  # = 5
-     error = x - mu  # Distance from learned equilibrium
-     v_next = alpha * v + (1 - alpha) * v_target - beta * error
-     x_next = x + dt * gate * v_next
- ```
-
- **Result**: the model "thinks" iteratively like biological integrator neurons, achieving better parameter efficiency through shared dynamics and adaptive early stopping.
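
To make the update concrete, here is a minimal standalone NumPy sketch of the same dynamics outside the model (all values are illustrative, not the trained parameters; `mu` and `v_target` are learned in the real layer):

```python
import numpy as np

# One INL block update, run for 5 iterations on a toy hidden state.
rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=d)      # hidden state
v = np.zeros(d)             # velocity state
mu = rng.normal(size=d)     # stand-in for the learned equilibrium
v_target = np.zeros(d)      # stand-in for the learned target velocity
alpha, beta, dt, gate = 0.5, 1.0, 0.5, 1.0

for iteration in range(5):
    error = x - mu
    v = alpha * v + (1 - alpha) * v_target - beta * error
    x = x + dt * gate * v
    print(iteration, np.linalg.norm(x - mu))  # decays toward 0 over the iterations
```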

 ## Model Details

- | Parameter | Value |
- |-----------|-------|
- | Parameters | 1.1B |
- | d_model | 1728 |
- | Layers | 25 |
- | Attention heads | 32 |
- | Iterations/layer | 5 (configurable: more = better quality but slower) |
- | Context length | 2048 |
- | Vocabulary | 50,261 |
-
- ### Key Optimizations
-
- - **Shared controllers**: one controller shared across all 25 layers (96% fewer parameters)
- - **Low-rank embeddings**: 87% fewer embedding parameters
- - **Adaptive stopping**: stops iterating once converged (30-50% faster inference)
- - **Sparse excitation**: 90% sparsity for efficiency
-
- ## Usage

 ```python
- from transformers import AutoModelForCausalLM, AutoTokenizer
-
- model = AutoModelForCausalLM.from_pretrained(
-     "Pacific-Prime/pacific-prime",
-     trust_remote_code=True,
-     torch_dtype="bfloat16"
- )
- tokenizer = AutoTokenizer.from_pretrained("Pacific-Prime/pacific-prime")
-
- # Generate with KV caching (default, much faster!)
- prompt = "The future of AI is"
- inputs = tokenizer(prompt, return_tensors="pt")
- outputs = model.generate(
-     **inputs,
-     max_new_tokens=100,
-     temperature=0.8,
-     use_cache=True  # Enable KV cache (default)
- )
- print(tokenizer.decode(outputs[0], skip_special_tokens=True))
- ```
-
- ### Chat Format
-
- ```python
- messages = [
-     {"role": "user", "content": "What is machine learning?"}
- ]
-
- chat_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
- inputs = tokenizer(chat_text, return_tensors="pt")
 outputs = model.generate(**inputs, max_new_tokens=100)
 ```

- Special tokens: `<USER>`, `<ASSISTANT>`, `<SYSTEM>`, `<ERROR>`
-
- ## vLLM Serving

 ```bash
 python -m vllm.entrypoints.openai.api_server \
-     --model Pacific-Prime/pacific-prime \
     --trust-remote-code \
-     --dtype bfloat16
- ```
-
- ## Why Integrator Neurons?
-
- **Main benefit**: similar quality with fewer parameters, achieved through parameter sharing and iterative refinement.
-
- - **Parameter efficiency**: one shared controller for all 25 layers (instead of 25 separate FFNs)
- - **Adaptive computation**: stops iterating early once converged (faster inference)
- - **Iterative refinement**: each layer "thinks" multiple times instead of computing one-shot
- - **Interpretable**: you can visualize how the model converges to solutions
- - **Bio-inspired**: mimics integrator neurons found in neuroscience
-
- ## Learn More
-
- For detailed technical documentation about the INL architecture:
- - **GitHub Repository**: [ARKITEKTURE_TRANSFORMER_ADL](https://github.com/pacific-prime777/ARKITEKTURE_TRANSFORMER_ADL)
- - **Architecture Docs**: see the repo for implementation details, training code, and benchmarks
-
- ## Convergence Theorem
-
- ### Mathematical Formulation
-
- The INL architecture implements a discrete-time dynamical system that converges to a learned equilibrium point. For each layer:
- ```python
- error = x - mu                                              # (1)
- v_next = alpha * v + (1 - alpha) * v_target - beta * error  # (2)
- x_next = x + dt * gate * v_next                             # (3)
- ```
-
- **Theorem (Asymptotic Convergence):**
-
- Given the discrete dynamics above, if the following stability conditions hold:
-
- 1. **Damping condition**: `0 < alpha < 1`
- 2. **Restoring force**: `beta > 0`
- 3. **Time step bound**: `dt < 2/(beta * sqrt(1 - alpha²))`
- 4. **Gating**: `0 ≤ gate ≤ 1`
-
- then for any initial state `(x₀, v₀)` the system converges asymptotically to the equilibrium:
- ```
- lim(n→∞) x_n = mu
- lim(n→∞) v_n = v_target
- ```
-
- **Formally**: `∀ε > 0, ∃N ∈ ℕ : ∀n > N ⟹ ||x_n - mu|| < ε`
-
- ### Proof Sketch
-
- The system behaves as a **damped harmonic oscillator** in the embedding space:
-
- 1. **Energy function**: define `E(n) = ½||x_n - mu||² + ½||v_n - v_target||²`
-
- 2. **Energy decay**: under the stability conditions, `E(n+1) < E(n)` for all `n`
-
- 3. **Lower bound**: `E(n) ≥ 0` always
-
- 4. **Conclusion**: `E(n)` is decreasing and bounded below, so by the monotone convergence theorem it has a limit; the strict contraction of the stable dynamics forces that limit to 0, thus `x_n → mu`
-
- The proof follows from discrete Lyapunov stability analysis. The parameters `alpha` (damping), `beta` (restoring force), and `dt` (discretization step) control the convergence rate and oscillation behavior.
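
The energy argument is easy to check numerically; a minimal simulation sketch (illustrative parameters inside the stability region, with `v_target = 0` so the equilibrium is a true fixed point):

```python
import numpy as np

# Simulate the discrete dynamics and verify that the energy
# E(n) = 1/2*||x_n - mu||^2 + 1/2*||v_n - v_target||^2 decays to 0.
rng = np.random.default_rng(42)
d = 16
x, v = rng.normal(size=d), rng.normal(size=d)
mu, v_target = rng.normal(size=d), np.zeros(d)
alpha, beta, dt, gate = 0.5, 1.0, 0.5, 1.0

energy = []
for n in range(50):
    energy.append(0.5 * np.sum((x - mu) ** 2) + 0.5 * np.sum((v - v_target) ** 2))
    error = x - mu
    v = alpha * v + (1 - alpha) * v_target - beta * error
    x = x + dt * gate * v

assert energy[-1] < 1e-6 * energy[0]  # E(n) -> 0, so x_n -> mu
```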
-
- ### Convergence Modes
-
- | Regime | Condition | Behavior |
- |--------|-----------|----------|
- | **Underdamped** | `alpha² < 4*beta*dt` | Oscillates, then converges |
- | **Critically damped** | `alpha² = 4*beta*dt` | Fastest convergence (no overshoot) |
- | **Overdamped** | `alpha² > 4*beta*dt` | Slow monotonic convergence |
-
- ### Practical Implications
-
- **Hybrid Discrete-Continuous Approximation:**
- ```
- Discrete (finite iterations)  ←→  Continuous (infinite time)
-              ↓                              ↓
-        GPU-friendly                 Theoretical limit
 ```

- - **5 iterations**: fast, 70-80% convergence quality
- - **10 iterations**: balanced, 85-95% convergence
- - **50+ iterations**: near-perfect, 98%+ convergence
- - **∞ iterations**: theoretical guarantee (impractical)
-
- **Adaptive Early Stopping:**

- The architecture monitors `||error||` and stops iterating when:
 ```python
- if (x_n - mu).norm() < tolerance:  # Converged!
-     break                          # Save 30-50% compute
- ```
-
- This makes the system both **theoretically grounded** (a convergence guarantee) and **practically efficient** (adaptive compute).
-
- ### Connection to Neural ODEs
-
- In the continuous limit (`dt → 0`), the dynamics become:
- ```
- dx/dt = gate * v
- dv/dt = -(1-alpha)/dt * v + (1-alpha)/dt * v_target - beta * (x - mu)
- ```
-
- This is a **second-order ODE** with learned equilibrium `mu`, combining:
- - **Physics-inspired** dynamics (momentum, damping, restoring force)
- - **Learned** target state (`mu`, `v_target` from the neural network)
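
This continuous limit can be integrated directly; a sketch treating `k = (1 - alpha)/dt` as a fixed rate constant (all values illustrative):

```python
import numpy as np
from scipy.integrate import solve_ivp

d = 4
rng = np.random.default_rng(1)
mu, v_target = rng.normal(size=d), np.zeros(d)
gate, beta, k = 1.0, 1.0, 1.0  # k stands in for (1 - alpha)/dt

def dynamics(t, state):
    x, v = state[:d], state[d:]
    dx = gate * v
    dv = -k * v + k * v_target - beta * (x - mu)
    return np.concatenate([dx, dv])

state0 = np.concatenate([rng.normal(size=d), np.zeros(d)])
sol = solve_ivp(dynamics, (0.0, 20.0), state0)
print(np.linalg.norm(sol.y[:d, -1] - mu))  # ~0: x(t) settles at mu
```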
-
- ### Why This Matters
-
- 1. **Theoretical guarantees**: not just empirical - convergence is proven
- 2. **Interpretability**: physics-based dynamics are explainable
- 3. **Robustness**: stable across wide parameter ranges
- 4. **Efficiency**: iterations can be traded for quality (5 for speed, 50 for precision)
- 5. **Universal**: the same convergence theory applies to all domains (text, vision, audio)
-
- ---
-
- ## Empirical Stability Analysis
-
- ### Stability Region Characterization
-
- We performed extensive empirical analysis to validate the theoretical convergence guarantees and characterize the practical stability region. The analysis explores the parameter space of `alpha` (damping) and `p = dt * g * beta` (effective time step × restoring force).
-
- **Key finding**: the system exhibits three distinct behavioral regimes:
-
- 1. **STABLE** (ρ < 1): green region - guaranteed convergence
- 2. **NEAR-BOUNDARY** (ρ ≈ 1): yellow region - convergence, but slower
- 3. **UNSTABLE** (ρ > 1): red region - divergence
-
- ![Stability Contour](stability_contour.png)
-
- The empirical stability boundary closely matches the theoretical sufficient condition:
- ```
- Stable if: 0 ≤ alpha < 1  AND  0 < p < 2(1 + alpha)
- ```
-
- ### Eigenvalue Analysis
-
- The spectral radius (maximum eigenvalue magnitude) determines system stability. For convergence, we need `ρ(J) < 1`, where `J` is the Jacobian of the discrete dynamics.
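
For the per-coordinate linearization, this spectral radius is straightforward to compute. A minimal sketch, assuming `v_target = 0` and the rescaled state `(error, dt·g·velocity)` so the map depends only on `α` and `p` (this reproduces the stability condition `0 ≤ alpha < 1 AND 0 < p < 2(1 + alpha)` quoted above, but it is not the authors' analysis code):

```python
import numpy as np

def spectral_radius(alpha: float, p: float) -> float:
    """Spectral radius of the 2x2 Jacobian of the discrete INL update.

    With e = x - mu, w = dt*g*v and v_target = 0:
        w' = alpha*w - p*e
        e' = e + w'    =>    J = [[1 - p, alpha], [-p, alpha]]
    """
    J = np.array([[1.0 - p, alpha],
                  [-p, alpha]])
    return float(np.max(np.abs(np.linalg.eigvals(J))))

for alpha, p in [(0.1, 0.4), (0.3, 1.6), (0.7, 0.2), (0.9, 1.0), (0.5, 3.5)]:
    rho = spectral_radius(alpha, p)
    print(f"alpha={alpha}, p={p}: rho={rho:.2f}",
          "stable" if rho < 1.0 else "unstable")
```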
-
- ![Eigenvalue Examples](eigenvalue_examples.png)
-
- **Representative parameter sets:**
- - **Safe** (α=0.1, p=0.4): ρ ≈ 0.5 - fast, stable convergence
- - **Near-bound** (α=0.3, p=1.6): ρ ≈ 0.57 - stable but approaching the boundary
- - **Unstable** (α=0.5, p=2.5): ρ ≈ 0.7 - exceeds stability bound, diverges
- - **Damped** (α=0.7, p=0.2): ρ ≈ 0.83 - high damping, slow convergence
- - **High-alpha** (α=0.9, p=1.0): ρ ≈ 0.95 - near-critical, very slow
-
- ![Spectral Radius Heatmap](spectral_radius_heatmap.png)
-
- The heatmap reveals the complete stability landscape in (α, p) space. Dark blue regions (ρ < 0.5) converge rapidly, while yellow/green regions (ρ > 1.0) are unstable.
-
- ### Convergence Dynamics
-
- Energy trajectories `E(n) = ½||x_n - mu||² + ½||v_n - v_target||²` demonstrate the convergence behavior:
-
- ![Energy Trajectories](energy_trajectories.png)
-
- **Observations:**
- - **Damped** (red, α=0.2): fastest initial decay; oscillatory but converges
- - **Safe/Near-bound** (blue/orange): smooth exponential decay to equilibrium
- - **Unstable** (green, α=0.8, p=2.5): energy fails to decay and remains elevated
- - **High-alpha** (purple, α=0.9): slowest convergence due to high damping
-
- ### Practical Parameter Selection
-
- Based on the empirical analysis, recommended parameter ranges for INL layers:
-
- | Use Case | α (damping) | p (dt×g×β) | Behavior | Iterations Needed |
- |----------|-------------|------------|----------|-------------------|
- | **Fast inference** | 0.1 - 0.3 | 0.3 - 1.0 | Quick convergence | 5-10 |
- | **Balanced** | 0.3 - 0.6 | 0.5 - 1.5 | Stable, moderate speed | 10-20 |
- | **High precision** | 0.4 - 0.7 | 0.4 - 1.2 | Slow but accurate | 20-50 |
- | **Avoid** | > 0.8 | > 2.0 | Too slow or unstable | N/A |
-
- **Safety margin**: stay well within the theoretical bound `p < 2(1+α)`. Practical recommendation: `p < 1.5(1+α)` for reliable convergence with finite iterations.
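
The two bounds fold into a tiny validation helper (a hypothetical convenience function mirroring the table and the safety margin above):

```python
def classify_inl_params(alpha: float, p: float) -> str:
    """Classify (alpha, p) against the theoretical and practical bounds."""
    if not (0.0 <= alpha < 1.0) or p <= 0.0:
        return "invalid"
    if p >= 2.0 * (1.0 + alpha):
        return "unstable"       # outside the theoretical bound
    if p >= 1.5 * (1.0 + alpha):
        return "near-boundary"  # converges, but leave more iterations
    return "safe"

assert classify_inl_params(0.5, 1.0) == "safe"      # the 'Balanced' row
assert classify_inl_params(0.9, 3.9) == "unstable"  # past 2*(1+alpha)
```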
-
- ### Connection to Model Architecture
-
- The **Pacific Prime 1.1B** model uses:
- - `alpha` ≈ 0.4-0.6 (moderate damping)
- - `p` ≈ 0.8-1.2 (safe region)
- - 5 iterations/layer (sufficient for 85-95% convergence)
-
- These parameters balance:
- - **Convergence quality**: 90%+ of the theoretical equilibrium
- - **Inference speed**: ~30-50% faster than full convergence
- - **Stability**: robust across diverse inputs and training stages
-
- ### Theoretical vs. Empirical
-
- | Aspect | Theoretical | Empirical |
- |--------|-------------|-----------|
- | **Condition** | `p < 2(1+α)` | `p < 1.8(1+α)` (practical) |
- | **Convergence** | Asymptotic (n→∞) | 85-95% in 5-10 iterations |
- | **Guarantee** | Mathematical proof | Statistical validation |
- | **Application** | Infinite time | Finite GPU budget |
-
- The empirical analysis validates the theory while providing practical guidance for finite-iteration deployment. The stability region is robust: small parameter perturbations during training do not cause instability.
-
- ### Validation Methodology
-
- **Data**: sampled 11 α values × 100 p values (1,100 parameter combinations)
-
- **Metrics**:
- - Spectral radius computation via eigenvalue analysis
- - Energy trajectory simulation (300 iterations)
- - Convergence rate measurement
-
- **Tools**: NumPy, SciPy, and Matplotlib for numerical analysis
-
- For the full analysis code, see: [stability_analysis.ipynb](link-to-notebook)
-
- ## Optimizations
-
- ### KV Caching
-
- Full KV caching support for fast autoregressive generation.
-
- ```python
- # Automatic caching with .generate()
- outputs = model.generate(
-     **inputs,
-     max_new_tokens=100,
-     use_cache=True  # Enable KV caching (default)
 )

- # Manual caching for custom generation loops
- past_key_values = None
- input_ids = inputs["input_ids"]
- for _ in range(max_tokens):
-     outputs = model(input_ids, past_key_values=past_key_values, use_cache=True)
-     past_key_values = outputs.past_key_values
-     input_ids = outputs.logits[:, -1:].argmax(dim=-1)  # greedy next token
 ```

- **Benefits**:
- - **1.1-1.3× faster** generation for long sequences (100+ tokens)
- - Compatible with HuggingFace `.generate()` and vLLM
- - Beam search supported via `_reorder_cache()`
- - Minimal memory overhead (<1%)
-
- **How it works**: unlike standard transformers, which cache K and V for attention, INL-LLM only needs to cache the attention states. The integrator dynamics (x, v) are computed fresh for each token, since they operate within each layer, not across tokens.
-
- **Performance note**: the speedup is more modest than for standard transformers (which see 10-20× gains) because the **INL architecture is dominated by integrator iterations, not attention**. Most compute (70-90%) goes to iterative dynamics (3-10 iterations per layer × 12-25 layers), while attention accounts for only ~10-30% of FLOPs. The cache optimizes that 10-30%, giving a ~1.1-1.3× overall speedup. This is an architectural tradeoff: richer dynamics at the cost of less cache benefit.
-
- ## Technical Requirements
-
- - Requires `trust_remote_code=True` (custom INL architecture)
- - Python 3.8+, PyTorch 2.0+, transformers 4.35+
-
- ## Citation
-
- ```bibtex
- @misc{inl-llm-2024,
-   author = {Boris Peyriguère},
-   title  = {INL-LLM: Integrator Neural Language Model},
-   year   = {2024},
-   url    = {https://huggingface.co/Pacific-Prime/pacific-prime}
- }
- ```
-
- **License**: CC BY-NC 4.0 (non-commercial; contact the author for commercial use)

+ # INL-LLM HuggingFace Format
+
+ This is a HuggingFace-compatible version of the INL-LLM model (1.1B parameters).
+
 ## Model Details

+ - **Architecture**: inl-llm
+ - **Parameters**: ~1.1B effective parameters
+ - **d_model**: 1728
+ - **Layers**: 25
+ - **Heads**: 32
+ - **Vocab Size**: 50261
+
+ ## Usage with HuggingFace
+
 ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ model = AutoModelForCausalLM.from_pretrained("/home/boris/vAgent/architecture/checkpoints/inl_11b_hf", trust_remote_code=True)
+ tokenizer = AutoTokenizer.from_pretrained("/home/boris/vAgent/architecture/checkpoints/inl_11b_hf")
+
+ # Generate
+ inputs = tokenizer("Hello, I am", return_tensors="pt")
 outputs = model.generate(**inputs, max_new_tokens=100)
+ print(tokenizer.decode(outputs[0]))
 ```
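
For GPU inference, the same checkpoint can also be loaded in bfloat16, mirroring the usage block from the previous revision of this README (device placement and sampling settings are illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "/home/boris/vAgent/architecture/checkpoints/inl_11b_hf"
model = AutoModelForCausalLM.from_pretrained(
    path, trust_remote_code=True, torch_dtype=torch.bfloat16
).to("cuda")
tokenizer = AutoTokenizer.from_pretrained(path)

inputs = tokenizer("Hello, I am", return_tensors="pt").to("cuda")
outputs = model.generate(
    **inputs, max_new_tokens=100, do_sample=True, temperature=0.8, use_cache=True
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```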

+ ## Usage with vLLM

 ```bash
+ # Install vLLM
+ pip install vllm
+
+ # Serve with vLLM
 python -m vllm.entrypoints.openai.api_server \
+     --model /home/boris/vAgent/architecture/checkpoints/inl_11b_hf \
     --trust-remote-code \
+     --dtype bfloat16 \
+     --max-model-len 2048
 ```

+ Then use the OpenAI-compatible API:

 ```python
+ from openai import OpenAI
+
+ client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
+
+ response = client.chat.completions.create(
+     model="inl_11b_hf",
+     messages=[
+         {"role": "user", "content": "What is machine learning?"}
+     ],
+     temperature=0.8,
+     max_tokens=100
 )
+
+ print(response.choices[0].message.content)
 ```
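
The same endpoint can also be called over raw HTTP; a sketch using `requests` (the model name must match what vLLM registered at startup):

```python
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "inl_11b_hf",
        "messages": [{"role": "user", "content": "What is machine learning?"}],
        "temperature": 0.8,
        "max_tokens": 100,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```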

+ ## Optimizations Enabled
+
+ - Low-rank embeddings: True
+ - Shared controllers: True
+ - Hierarchical equilibrium: group_size=64
+ - Sparse excitation: 10.0% sparsity
+ - Adaptive stopping: True
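
As an illustration of the first optimization, a low-rank embedding factors the `vocab × d_model` table through a rank-`r` bottleneck. A hypothetical sketch, not the repository's actual module:

```python
import torch.nn as nn

class LowRankEmbedding(nn.Module):
    """Hypothetical factorization: vocab -> rank -> d_model.

    For vocab=50261, d_model=1728, rank=256 this stores
    50261*256 + 256*1728 ≈ 13.3M parameters instead of
    50261*1728 ≈ 86.9M for a dense table (~85% fewer, in the
    spirit of the figure quoted in the previous README).
    """
    def __init__(self, vocab_size: int, d_model: int, rank: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, rank)
        self.up = nn.Linear(rank, d_model, bias=False)

    def forward(self, token_ids):
        return self.up(self.embed(token_ids))

emb = LowRankEmbedding(vocab_size=50261, d_model=1728, rank=256)
print(sum(p.numel() for p in emb.parameters()))  # ≈ 13.3M vs ≈ 86.9M dense
```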
+
+ Converted from: /home/boris/vAgent/architecture/checkpoints/inl_1b_model
model.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
- oid sha256:12164b4034153ec2f3eb299953d1da3070cc4e4748e8eaad124d82830df2ecb9
- size 4140509476
+ oid sha256:eef434843bd7d70147372e012ec7f8fc1164dd5517991013780d10aec6dd9aae
+ size 4442160156
tokenizer.json CHANGED
@@ -1,11 +1,6 @@
 {
   "version": "1.0",
-   "truncation": {
-     "direction": "Right",
-     "max_length": 65,
-     "strategy": "LongestFirst",
-     "stride": 0
-   },
+   "truncation": null,
   "padding": null,
   "added_tokens": [
     {
tokenizer_config.json CHANGED
@@ -52,7 +52,7 @@
   "clean_up_tokenization_spaces": false,
   "eos_token": "<|endoftext|>",
   "extra_special_tokens": {},
-   "model_max_length": 1024,
+   "model_max_length": 2048,
   "pad_token": "<|endoftext|>",
   "tokenizer_class": "GPT2Tokenizer",
   "unk_token": "<|endoftext|>"