Package io.github.kirstenali.deepj.data
Class TextDataset
java.lang.Object
io.github.kirstenali.deepj.data.TextDataset
Simple in-memory dataset that samples random contiguous chunks from token ids.
This is intentionally minimal but correct; for large corpora, replace with a
memory-mapped or streaming implementation.
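The sampling described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the deepj implementation: the class name `ChunkSampler` and the method `nextChunk` are hypothetical, and only the constructor parameters (`tokens`, `seqLen`, `seed`) mirror the documented `TextDataset` constructor.

```java
import java.util.Arrays;
import java.util.Random;

/** Hypothetical illustration of random contiguous-chunk sampling; not part of deepj. */
final class ChunkSampler {
    private final int[] tokens;
    private final int seqLen;
    private final Random rng;

    ChunkSampler(int[] tokens, int seqLen, long seed) {
        if (seqLen <= 0 || tokens.length < seqLen) {
            throw new IllegalArgumentException("need at least seqLen tokens");
        }
        this.tokens = tokens;
        this.seqLen = seqLen;
        // A fixed seed makes the sequence of sampled offsets reproducible.
        this.rng = new Random(seed);
    }

    /** Returns a copy of one contiguous window of seqLen token ids at a random offset. */
    int[] nextChunk() {
        // Valid start offsets are 0 .. tokens.length - seqLen, inclusive.
        int start = rng.nextInt(tokens.length - seqLen + 1);
        return Arrays.copyOfRange(tokens, start, start + seqLen);
    }
}
```

Two samplers built with the same tokens and seed yield the same chunks in the same order, which is the usual reason for exposing a seed parameter.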
Constructor Details
TextDataset
public TextDataset(int[] tokens, int seqLen, long seed)
Method Details
fromFile
public static TextDataset fromFile(Path path, Tokenizer tok, int seqLen, long seed) throws IOException
Throws:
IOException
nextBatch
seqLen
public int seqLen()
size
public int size()