Performance on Apple M3 Chips (MPS)

#9
by mahdimanesh

Hi all,

I am experimenting with the model on larger sets of data (e.g., >50 MB of JSON text), running on an M3 Max.
The MPS backend does not seem to be sufficient for such a load. I was under the assumption that I would not need manual chunking, but my next approach would be to chunk the input, create embeddings, select relevant chunks via ANN search, and then run the extraction model only on that subset (rough sketch below).
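Something like this, assuming sentence-transformers for the embeddings and FAISS for the ANN search; the embedding model id and helper name are purely illustrative:

```python
# Rough sketch: chunk -> embed -> ANN search -> keep only the top chunks.
# Assumes sentence-transformers and faiss-cpu are installed; the embedding
# model id below is an arbitrary example, not a requirement.
import faiss
from sentence_transformers import SentenceTransformer

def select_relevant_chunks(chunks, query, top_k=5):
    """Embed all chunks, index them, and return the top_k closest to the query."""
    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    vectors = embedder.encode(chunks, normalize_embeddings=True)
    index = faiss.IndexFlatIP(vectors.shape[1])  # inner product = cosine on unit vectors
    index.add(vectors)
    query_vec = embedder.encode([query], normalize_embeddings=True)
    _, ids = index.search(query_vec, top_k)
    return [chunks[i] for i in ids[0]]
```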

Am I on the right path? Any hints are welcome! :-)

I'm not sure I totally understand the problem. The bottleneck should come from the length of the individual input sequences.
Can you provide some more details about what your input looks like?

So my raw input is a set of JSON files with textual data. I chunk them with a window size of 512 tokens and an overlap of 128.
That gives me anywhere between 100 and 100,000 chunks per file. So assuming I select only the top 5 chunks as input to the model, we are talking about 1,000-2,500 tokens on average.
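For reference, the chunking is a plain token-level sliding window, roughly like this (the tokenizer id is just an example):

```python
# Token-level sliding window: 512-token chunks, 128 tokens of overlap.
from transformers import AutoTokenizer

def chunk_text(text, tokenizer, size=512, overlap=128):
    """Split text into decoded windows of `size` tokens, each sharing
    `overlap` tokens with its neighbour."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    stride = size - overlap
    return [tokenizer.decode(ids[i:i + size])
            for i in range(0, max(len(ids) - overlap, 1), stride)]

tokenizer = AutoTokenizer.from_pretrained("numind/NuExtract")  # example model id
```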

What is the task? Can each file be processed individually?

Yes, there will be one extraction template for each JSON file.
At the moment I am "falling back" to a local RAG approach using Ollama and Mixtral -- maybe I am using NuExtract in a way that was not intended?

Ok, and so each individual JSON file is at least 100*512 tokens in length? Or are you merging multiple JSON files?

Yes, correct: each JSON file can be anywhere from 100 KB to 100 MB in pure text size. I do not merge JSON files, but I do extract the portions with pure text before handing them to the model, so no JSON syntax is passed in (roughly as sketched below).
Also, as mentioned, I am chunking and selecting the top 7 chunks for input.
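The stripping step is roughly this kind of thing (a simplified sketch; `min_len` is just an arbitrary filter to skip short labels and IDs, and the filename is a placeholder):

```python
# Recursively collect the plain string values from nested JSON,
# so that no JSON syntax reaches the model.
import json

def text_values(node, min_len=20):
    """Return string values of length >= min_len found anywhere in the structure."""
    if isinstance(node, str):
        return [node] if len(node) >= min_len else []
    if isinstance(node, dict):
        return [s for v in node.values() for s in text_values(v, min_len)]
    if isinstance(node, list):
        return [s for v in node for s in text_values(v, min_len)]
    return []  # numbers, booleans, null carry no text

with open("data.json") as f:  # placeholder filename
    text = "\n".join(text_values(json.load(f)))
```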

Ok, I see. Yeah, those are really long texts, so you will need some sort of chunking or retrieval. It's hard to say exactly what the best approach is without understanding the specific problem better.

You can try our continuation example if you haven't already, but I suspect you could get a lot of error propagation since you will have a huge number of chunks. Probably some sort of retrieval step like you suggest will be best, assuming only specific parts of each document will be relevant to the extraction and you don't actually need the full context.
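In simplified form, the continuation idea is to carry the extraction output forward from one window to the next so that each step can fill in fields the previous ones missed; `extract_fn` below is a stand-in for the actual model call, not the real example code:

```python
# Simplified illustration of sliding-window continuation; `extract_fn`
# is a placeholder for the actual model call, not a real API.
def continued_extraction(chunks, template, extract_fn):
    """Run extraction chunk by chunk, feeding the running output back in.

    A mistake in an early window is carried into every later one, which
    is where error propagation over many chunks comes from.
    """
    output = "{}"  # start from an empty extraction
    for chunk in chunks:
        output = extract_fn(chunk, template, previous_output=output)
    return output
```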

Agreed! I will go with the chunking approach and see how far I can get. :-) The output was quite good on one smaller example, so I just need to find a way to speed things up.
I will also compare the output on my specific prompt and data against Mixtral on Ollama.
