| | --- |
| | license: mit |
| | language: |
| | - en |
| | - zh |
| | base_model: |
| | - Qwen/Qwen3-VL-2B-Instruct |
| | - Qwen/Qwen3-VL-4B-Instruct |
| | pipeline_tag: image-text-to-text |
| | library_name: transformers |
| | tags: |
| | - Qwen3-VL |
| | - Qwen3-VL-2B-Instruct |
| | - Qwen3-VL-4B-Instruct |
| | - Int4 |
| | - VLM |
| | - GPTQ |
| | --- |
| | |
| | # Qwen3-VL |
| |
|
| | This version of Qwen3-VL-2B-Instruct has been converted to run on the Axera NPU using **w4a16** quantization. |
| |
|
| | Compatible with Pulsar2 version: 5.0 |
| |
|
| | ## Convert tools links: |
| |
|
| | For those who are interested in model conversion, you can try to export axmodel through the original repo : |
| |
|
| | - https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct |
| | - https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct |
| |
|
| | [Pulsar2 Link, How to Convert LLM from Huggingface to axmodel](https://pulsar2-docs.readthedocs.io/en/latest/appendix/build_llm.html) |
| |
|
| | [AXera NPU HOST LLM Runtime](https://github.com/AXERA-TECH/Qwen3-VL.AXERA) |
| |
|
| |
|
| | ## Support Platform |
| |
|
| | - AX630C |
| | - AX630C DEMO Board |
| |
|
| | **Image Process** |
| | |Chips| input size | image num | image encoder | ttft(168 tokens) | w4a16 | CMM | Flash | |
| | |--|--|--|--|--|--|--|--| |
| | |AX630C| 384*384 | 1 | ms | ms | tokens/sec| 2.0 GB | 2.7 GiB | |
| | |
| | **Video Process** |
| | |Chips| input size | image num | image encoder |ttft(600 tokens) | w4a16 | CMM | Flash | |
| | |--|--|--|--|--|--|--|--| |
| | |AX630C| 384*384 | 8 | ms | ms | tokens/sec| 2.0 GB | 2.7 GB | |
| |
|
| | The DDR capacity refers to the CMM memory that needs to be consumed. Ensure that the CMM memory allocation on the development board is greater than this value. |
| |
|
| | ## How to use |
| |
|
| | Download all files from this repository to the device |
| |
|
| | **If you using AX630C Board** |
| |
|
| | ### Prepare tokenizer server |
| |
|
| | #### Install transformer |
| |
|
| | ``` |
| | pip install -r requirements.txt |
| | ``` |
| |
|
| | ### Demo Run |
| |
|
| | #### Image understand demo |
| |
|
| | ##### start tokenizer server for image understand demo |
| |
|
| | ``` |
| | python3 qwen3_tokenizer.py --port 8080 |
| | ``` |
| |
|
| | ##### run image understand demo |
| |
|
| | - input text |
| |
|
| | ``` |
| | 描述这张图片 |
| | ``` |
| |
|
| | - input image |
| |
|
| |  |
| |
|
| | ``` |
| | root@AX630C ~/Qwen3-VL-2B-Instruct-GPTQ-Int4 # bash run_image_AX630C.sh |
| | [I][ Init][ 156]: LLM init start |
| | [I][ Init][ 158]: Total CMM:4353 MB |
| | [I][ Init][ 34]: connect http://127.0.0.1:8080 ok |
| | bos_id: -1, eos_id: 151645 |
| | img_start_token: 151652 |
| | img_context_token: 151655 |
| | 3% | ██ | 1 / 31 [0.01s<0.46s, 66.67 count/s] tokenizer init ok[I][ Init][ 26]: LLaMaEmbedSelector use mmap |
| | 6% | ███ | 2 / 31 [0.02s<0.34s, 90.91 count/s] embed_selector init ok[I][ Init][ 201]: attr.axmodel_num:28 |
| | 103% | ██████████████████████████████████ | 32 / 31 [34.03s<32.96s, 0.94 count/s] init vpm axmodel ok,remain_cmm(854 MB)[I][ Init][ 266]: IMAGE_CONTEXT_TOKEN: 151655, IMAGE_START_TOKEN: 151652 |
| | [I][ Init][ 309]: image encoder output float32 |
| | |
| | [I][ Init][ 339]: max_token_len : 2047 |
| | [I][ Init][ 344]: kv_cache_size : 1024, kv_cache_num: 2047 |
| | [I][ Init][ 352]: prefill_token_num : 128 |
| | [I][ Init][ 356]: grp: 1, prefill_max_token_num : 1 |
| | [I][ Init][ 356]: grp: 2, prefill_max_token_num : 128 |
| | [I][ Init][ 356]: grp: 3, prefill_max_token_num : 256 |
| | [I][ Init][ 356]: grp: 4, prefill_max_token_num : 384 |
| | [I][ Init][ 356]: grp: 5, prefill_max_token_num : 512 |
| | [I][ Init][ 356]: grp: 6, prefill_max_token_num : 640 |
| | [I][ Init][ 356]: grp: 7, prefill_max_token_num : 768 |
| | [I][ Init][ 356]: grp: 8, prefill_max_token_num : 896 |
| | [I][ Init][ 356]: grp: 9, prefill_max_token_num : 1024 |
| | [I][ Init][ 356]: grp: 10, prefill_max_token_num : 1152 |
| | [I][ Init][ 360]: prefill_max_token_num : 1152 |
| | [I][ Init][ 372]: LLM init ok |
| | [I][ Init][ 374]: Left CMM:854 MB |
| | Type "q" to exit, Ctrl+c to stop current running |
| | prompt >> 描述这张图片 |
| | image >> images/recoAll_attractions_1.jpg |
| | [I][ EncodeImage][ 440]: pixel_values size 1 |
| | [I][ EncodeImage][ 441]: grid_h 24 grid_w 24 |
| | [I][ EncodeImage][ 489]: image encode time : 237.778000 ms, size : 1 |
| | [I][ Encode][ 532]: input_ids size:168 |
| | [I][ Encode][ 540]: offset 15 |
| | [I][ Encode][ 569]: img_embed.size:1, 294912 |
| | [I][ Encode][ 583]: out_embed size:344064 |
| | [I][ Encode][ 584]: input_ids size 168 |
| | [I][ Encode][ 586]: position_ids size:168 |
| | [I][ Run][ 607]: input token num : 168, prefill_split_num : 2 |
| | [I][ Run][ 641]: input_num_token:128 |
| | [I][ Run][ 641]: input_num_token:40 |
| | [I][ Run][ 865]: ttft: 313.60 ms |
| | 这是一张在埃及沙漠中拍摄的风景照片。画面中,三座巨大的金字塔在晴朗的天空下矗立,它们是古埃及文明的象征。这些金字塔由巨大的石块堆叠而成,表面因岁月侵蚀而显得斑驳。在金字塔的前方,有几个人影在沙地上行走,这为整个场景提供了比例感和尺度感。整个场景充满了历史的厚重感和神秘的氛围。 |
| | |
| | [N][ Run][ 992]: hit eos,avg 14.14 token/s |
| | ``` |
| |
|
| | #### Video understand demo |
| |
|
| | ##### start tokenizer server for image understand demo |
| |
|
| | ``` |
| | python qwen3_tokenizer.py --port 8080 |
| | ``` |
| |
|
| | ##### run video understand demo |
| | - input text |
| |
|
| | ``` |
| | 描述这个视频 |
| | ``` |
| |
|
| | - input video |
| |
|
| | ./video |
| |
|
| | ``` |
| | root@AX630C ~/Qwen3-VL-2B-Instruct-GPTQ-Int4 # bash run_video_AX630C.sh |
| | [I][ Init][ 156]: LLM init start |
| | [I][ Init][ 158]: Total CMM:7884 MB |
| | [I][ Init][ 34]: connect http://127.0.0.1:8080 ok |
| | bos_id: -1, eos_id: 151645 |
| | img_start_token: 151652 |
| | img_context_token: 151656 |
| | 3% | ██ | 1 / 31 [0.01s<0.34s, 90.91 count/s] tokenizer init ok[I][ Init][ 26]: LLaMaEmbedSelector use mmap |
| | 6% | ███ | 2 / 31 [0.01s<0.23s, 133.33 count/s] embed_selector init ok[I][ Init][ 201]: attr.axmodel_num:28 |
| | 103% | ██████████████████████████████████ | 32 / 31 [32.37s<31.36s, 0.99 count/s] init vpm axmodel ok,remain_cmm(4385 MB)[I][ Init][ 266]: IMAGE_CONTEXT_TOKEN: 151656, IMAGE_START_TOKEN: 151652 |
| | [I][ Init][ 309]: image encoder output float32 |
| | |
| | [I][ Init][ 339]: max_token_len : 2047 |
| | [I][ Init][ 344]: kv_cache_size : 1024, kv_cache_num: 2047 |
| | [I][ Init][ 352]: prefill_token_num : 128 |
| | [I][ Init][ 356]: grp: 1, prefill_max_token_num : 1 |
| | [I][ Init][ 356]: grp: 2, prefill_max_token_num : 128 |
| | [I][ Init][ 356]: grp: 3, prefill_max_token_num : 256 |
| | [I][ Init][ 356]: grp: 4, prefill_max_token_num : 384 |
| | [I][ Init][ 356]: grp: 5, prefill_max_token_num : 512 |
| | [I][ Init][ 356]: grp: 6, prefill_max_token_num : 640 |
| | [I][ Init][ 356]: grp: 7, prefill_max_token_num : 768 |
| | [I][ Init][ 356]: grp: 8, prefill_max_token_num : 896 |
| | [I][ Init][ 356]: grp: 9, prefill_max_token_num : 1024 |
| | [I][ Init][ 356]: grp: 10, prefill_max_token_num : 1152 |
| | [I][ Init][ 360]: prefill_max_token_num : 1152 |
| | [I][ Init][ 372]: LLM init ok |
| | [I][ Init][ 374]: Left CMM:4385 MB |
| | Type "q" to exit, Ctrl+c to stop current running |
| | prompt >> 描述这个视频 |
| | video >> video |
| | video/frame_0000.jpg |
| | video/frame_0008.jpg |
| | video/frame_0016.jpg |
| | video/frame_0024.jpg |
| | video/frame_0032.jpg |
| | video/frame_0040.jpg |
| | video/frame_0048.jpg |
| | video/frame_0056.jpg |
| | [I][ EncodeImage][ 440]: pixel_values size 4 |
| | [I][ EncodeImage][ 441]: grid_h 24 grid_w 24 |
| | [I][ EncodeImage][ 489]: image encode time : 751.481018 ms, size : 4 |
| | [I][ Encode][ 532]: input_ids size:600 |
| | [I][ Encode][ 540]: offset 15 |
| | [I][ Encode][ 569]: img_embed.size:4, 294912 |
| | [I][ Encode][ 574]: offset:159 |
| | [I][ Encode][ 574]: offset:303 |
| | [I][ Encode][ 574]: offset:447 |
| | [I][ Encode][ 583]: out_embed size:1228800 |
| | [I][ Encode][ 584]: input_ids size 600 |
| | [I][ Encode][ 586]: position_ids size:600 |
| | [I][ Run][ 607]: input token num : 600, prefill_split_num : 5 |
| | [I][ Run][ 641]: input_num_token:128 |
| | [I][ Run][ 641]: input_num_token:128 |
| | [I][ Run][ 641]: input_num_token:128 |
| | [I][ Run][ 641]: input_num_token:128 |
| | [I][ Run][ 641]: input_num_token:88 |
| | [I][ Run][ 865]: ttft: 843.36 ms |
| | 这是一段关于两只山地旱獭(也称“山地土拨鼠”)在山地环境中互动的视频。 |
| | |
| | 在画面中,两只山地旱獭正站在布满碎石的山坡上,背景是连绵起伏的山脉和蓝天。它们的毛色以灰、棕、黑相间,脸部和耳朵周围有明显的黑白条纹,显得非常可爱。 |
| | |
| | 这两只旱獭正在进行一场激烈的“拳击”或“格斗”游戏。它们的前爪高高举起,像在互相击打,但它们的姿势和动作表明它们可能是在进行一场激烈的“拳击”或“格斗”游戏。它们的嘴巴和前爪在空中挥舞,似乎在互相攻击或展示力量。 |
| | |
| | 整个场景充满了动感和活力,展现了这些小动物在自然环境中充满活力和趣味的一面。 |
| | |
| | [N][ Run][ 992]: hit eos,avg 14.16 token/s |
| | |
| | ``` |
| |
|
| | ### Gradio demo |
| |
|
| | ##### start tokenizer server for demo |
| | ```shell |
| | python qwen3_tokenizer.py --port 12345 --host 0.0.0.0 |
| | ``` |
| |
|
| |
|
| | #### start openai style api server |
| | if the tokenizer server is not run in the same machine,please modify the tokenizer server ip in shell file. |
| | ```shell |
| | # for axcl |
| | ./run_axcl_api.sh |
| | # for AX630C |
| | ./run_ax_api.sh |
| | ``` |
| |
|
| | #### start gradio demo |
| | if the api server is not run in the same machine,please modify the api url in gradio web ui. |
| | ```shell |
| | python gradio_demo.py |
| | ``` |
| |
|
| |  |
| |
|
| |
|