I have a list of objects, let's call it dataset, and I want to split it into two new lists, let's call them trainSet and testSet, while reading the input dataset list line by line. In this specific case the list object inherits from a Node.js read stream object.
Let's then define a split ratio, split_ratio = 0.75, so that trainSet receives about 75% of the dataset rows and testSet the remaining 25%. The list length limits are computed like this (for example):
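var dataSize = dataset.length,
    lim_75 = 0,        // maximum number of rows for trainSet
    lim_25 = 0,        // maximum number of rows for testSet
    lim_75_count = 0,  // rows written to trainSet so far
    lim_25_count = 0;  // rows written to testSet so far
lim_75 = Math.floor(dataSize * split_ratio); // e.g. 0.75
if (lim_75 < 1) lim_75 = 1;
lim_25 = Math.floor(dataSize * (1 - split_ratio)); // e.g. 1 - 0.75 = 0.25
if (lim_25 < 1) lim_25 = 1;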
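When reading line by line, I do:
var row;
while ((row = dataset.next())) {
    var r = Math.random(); // random draw that picks the target list
    // trainSet takes the row if the draw favors it and it still has room,
    // or if the draw favors testSet but testSet has reached its limit
    if ((r < split_ratio && lim_75_count < lim_75) ||
        (r >= split_ratio && lim_25_count >= lim_25)) {
        lim_75_count += 1;
        trainSet.write(row);
    } else {
        lim_25_count += 1;
        testSet.write(row);
    }
}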
This works: at the end I get two lists, trainSet with about 0.75 and testSet with about 0.25 of the dataset list length.
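As a minimal, self-contained sketch of the same size-based rule, the function below works on a plain in-memory array instead of a stream. The name splitDataset and the injectable rng parameter are hypothetical, introduced only for illustration. It also assumes the test limit is taken as the remainder (total minus train limit) rather than floored separately, so that the two limits always add up to the dataset size.

```javascript
// Hypothetical helper: quota-capped random split of an in-memory array.
// `rng` defaults to Math.random and can be replaced to make runs repeatable.
function splitDataset(rows, splitRatio, rng) {
  rng = rng || Math.random;
  var trainLimit = Math.max(1, Math.floor(rows.length * splitRatio));
  // Assumption: testSet gets the remainder, so both limits sum to the total.
  var testLimit = rows.length - trainLimit;
  var trainSet = [];
  var testSet = [];

  rows.forEach(function (row) {
    var wantsTrain = rng() < splitRatio;
    var trainFull = trainSet.length >= trainLimit;
    var testFull = testSet.length >= testLimit;
    // Follow the random draw while the chosen list has room;
    // once testSet is full, everything else goes to trainSet.
    if ((wantsTrain && !trainFull) || testFull) {
      trainSet.push(row);
    } else {
      testSet.push(row);
    }
  });

  return { trainSet: trainSet, testSet: testSet };
}

// Usage: 100 dummy rows split with split_ratio = 0.75
var sampleRows = [];
for (var i = 0; i < 100; i++) {
  sampleRows.push({ objectId: i });
}
var result = splitDataset(sampleRows, 0.75);
console.log(result.trainSet.length, result.testSet.length); // 75 25
```

Because the limits are hard caps, the final sizes are fixed even though the assignment of individual rows is random.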
Now, suppose that the objects in the dataset list have a structure like
{
    objectId: 12345,
    objectClass: 'CLASS_A',
    objectValue: 'The quick brown fox jumps over the lazy dog'
}
where objectClass belongs to an enumeration of values like CLASS_A, CLASS_B, etc.
I want to keep the split_ratio defined above for the trainSet and testSet length ratio, but add a new condition that lets me split the dataset with a second ratio, partitioning the list by the object key objectClass according to this partition map:
[
    {
        objectClass: 'TYPE_A',
        ratio: 0.30
    },
    {
        objectClass: 'TYPE_B',
        ratio: 0.20
    },
    {
        objectClass: 'TYPE_C',
        ratio: 0.50
    }
]
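To make the target concrete, here is a hedged sketch of the per-class limits this partition map would imply. It assumes one possible reading of the map: each ratio is the share of that objectClass inside both trainSet and testSet. The helper name classQuotas is hypothetical, and the test limit is again assumed to be the remainder after the train limit.

```javascript
// Hypothetical helper: per-class row limits for trainSet and testSet.
// Assumption: each `ratio` is the share of that class inside each list.
function classQuotas(dataSize, splitRatio, partitionMap) {
  var trainLimit = Math.max(1, Math.floor(dataSize * splitRatio));
  var testLimit = dataSize - trainLimit;
  var EPS = 1e-9; // guards Math.floor against floating-point results like 224.9999999
  var quotas = {};

  partitionMap.forEach(function (entry) {
    quotas[entry.objectClass] = {
      train: Math.floor(trainLimit * entry.ratio + EPS),
      test: Math.floor(testLimit * entry.ratio + EPS)
    };
  });

  return quotas;
}

// Usage with the partition map from the question and 1000 rows
var quotas = classQuotas(1000, 0.75, [
  { objectClass: 'TYPE_A', ratio: 0.30 },
  { objectClass: 'TYPE_B', ratio: 0.20 },
  { objectClass: 'TYPE_C', ratio: 0.50 }
]);
console.log(quotas.TYPE_A); // { train: 225, test: 75 }
```

Under this reading, a row would be admitted to a list only while both the list's overall counter and the counter for the row's objectClass are still below their limits.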
What is the right partitioning condition that would work in AND with the first, size-based one?
via loretoparisi