Wednesday, 19 April 2017

Split a list of objects into two lists by size ratio and object key partitioning

I have a list of objects, let's call it dataset, and I want to split it into two new lists, let's call them trainSet and testSet, while reading the input dataset line by line. In this specific case the list object inherits from a Node.js read stream object.

Let's then define a split ratio split_ratio=0.75, so that trainSet gets about 75% of the dataset rows and testSet the remaining 25% (for example):

var split_ratio = 0.75,
    dataSize = dataset.length,
    lim_75 = 0,
    lim_25 = 0;

// trainSet size limit, at least one row
lim_75 = Math.floor(dataSize * split_ratio); // e.g. 0.75
if (lim_75 < 1) lim_75 = 1;
// testSet size limit, at least one row
lim_25 = Math.floor(dataSize * (1 - split_ratio)); // e.g. 1 - 0.75 = 0.25
if (lim_25 < 1) lim_25 = 1;
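
As a quick check of these limits, here is a minimal sketch that wraps the same arithmetic in a helper. The name splitLimits and the unassigned field are only illustrative. It shows that, because both limits are rounded down, they do not always add up to the dataset size.

```javascript
// Hypothetical helper: same limit arithmetic as above, for a given dataset size.
function splitLimits(dataSize, splitRatio) {
    var train = Math.max(1, Math.floor(dataSize * splitRatio));
    var test = Math.max(1, Math.floor(dataSize * (1 - splitRatio)));
    // Rows not covered by either limit (can be negative for tiny datasets,
    // because each limit is forced to be at least 1).
    return { train: train, test: test, unassigned: dataSize - train - test };
}

console.log(splitLimits(100, 0.75)); // train 75, test 25, unassigned 0
console.log(splitLimits(10, 0.75));  // train 7, test 2, unassigned 1
```

With 10 rows, one row is covered by neither limit, so the final sizes can differ from the limits by a row.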

While reading line by line, I do:

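var lim_75_count = 0, // rows written to trainSet so far
    lim_25_count = 0, // rows written to testSet so far
    row;

while ((row = dataset.next())) {
    var r = Math.random(); // random draw that picks the target set
    // trainSet gets the row if the draw picks it and it still has room,
    // or if the draw picks testSet but testSet is already full
    if ((r < split_ratio && lim_75_count < lim_75) || (r >= split_ratio && lim_25_count >= lim_25)) {
        lim_75_count += 1;
        trainSet.write(row);
    } else {
        lim_25_count += 1;
        testSet.write(row);
    }
}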

This works: at the end I get two lists, trainSet with about 0.75 and testSet with about 0.25 of the dataset list length.

Now suppose that the objects in the dataset list have a structure like
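
To make the size-based split easy to try without streams, here is a minimal in-memory sketch, assuming plain arrays instead of the read stream. The function name splitBySize and the optional random parameter are illustrative only, and a set is treated as full once it reaches its limit.

```javascript
// Hypothetical in-memory variant of the size-based split, using arrays.
function splitBySize(rows, splitRatio, random) {
    random = random || Math.random; // injectable for repeatable runs
    var trainLimit = Math.max(1, Math.floor(rows.length * splitRatio));
    var testLimit = Math.max(1, Math.floor(rows.length * (1 - splitRatio)));
    var train = [], test = [];
    rows.forEach(function (row) {
        var wantsTrain = random() < splitRatio;
        var trainFull = train.length >= trainLimit;
        var testFull = test.length >= testLimit;
        // Follow the random draw while the chosen set has room;
        // otherwise fall back to the other set.
        if ((wantsTrain && !trainFull) || (!wantsTrain && testFull)) {
            train.push(row);
        } else {
            test.push(row);
        }
    });
    return { train: train, test: test };
}

var numbers = [];
for (var k = 0; k < 100; k++) numbers.push(k);
var parts = splitBySize(numbers, 0.75);
console.log(parts.train.length, parts.test.length); // 75 25
```

When the two limits add up to the dataset size, as with 100 rows here, the sizes are the same on every run and only the membership is random.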

{
    objectId: 12345,
    objectClass: 'CLASS_A',
    objectValue: 'The quick brown fox jumps over the lazy dog'
}

where objectClass belongs to an enumeration of values like CLASS_A, CLASS_B, etc.

I want to keep the split_ratio defined above for the trainSet and testSet length ratio, but add a new condition that lets me partition the dataset by the object key objectClass, with a ratio per class, according to this partition map:

[
    {
        objectClass: 'CLASS_A',
        ratio: 0.30
    },
    {
        objectClass: 'CLASS_B',
        ratio: 0.20
    },
    {
        objectClass: 'CLASS_C',
        ratio: 0.50
    }
]

What is the right partitioning condition that would work together (in AND) with the first, size-based one?
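
One possible reading, offered only as a sketch and not as a confirmed answer: treat each ratio in the map as the share of that class in the output, derive per-class train and test quotas from it, and accept a row into a set only while both the class quota and the overall size limit have room. The names stratifiedSplit, quotas and skipped are hypothetical, arrays stand in for the streams, and rows beyond a class quota are assumed to be skipped.

```javascript
// Sketch only: stratifiedSplit and its quota scheme are assumptions.
function stratifiedSplit(rows, splitRatio, partitionMap, random) {
    random = random || Math.random; // injectable for repeatable runs
    var EPS = 1e-9; // guards Math.floor against float error such as 56.99999999999999
    var n = rows.length;
    var trainLimit = Math.floor(n * splitRatio + EPS);      // overall size limits,
    var testLimit = Math.floor(n * (1 - splitRatio) + EPS); // as in the size-only split
    var quotas = Object.create(null); // per-class limits and counters
    partitionMap.forEach(function (p) {
        var classLimit = Math.floor(n * p.ratio + EPS);
        var classTrain = Math.floor(classLimit * splitRatio + EPS);
        quotas[p.objectClass] = {
            trainLimit: classTrain,
            testLimit: classLimit - classTrain,
            train: 0,
            test: 0
        };
    });
    var out = { train: [], test: [], skipped: [] };
    rows.forEach(function (row) {
        var q = quotas[row.objectClass];
        // A set is closed for this row when the class quota OR the overall
        // size limit is reached, so "has room" is the AND of both checks.
        var trainFull = !q || q.train >= q.trainLimit || out.train.length >= trainLimit;
        var testFull = !q || q.test >= q.testLimit || out.test.length >= testLimit;
        if (trainFull && testFull) {
            out.skipped.push(row); // unknown class or no quota left
            return;
        }
        var wantsTrain = random() < splitRatio;
        if (!trainFull && (testFull || wantsTrain)) {
            q.train += 1;
            out.train.push(row);
        } else {
            q.test += 1;
            out.test.push(row);
        }
    });
    return out;
}

// Usage with a 0.30 / 0.20 / 0.50 map on 200 synthetic rows (60 / 40 / 100 per class).
var classes = ['CLASS_A', 'CLASS_A', 'CLASS_A', 'CLASS_B', 'CLASS_B',
               'CLASS_C', 'CLASS_C', 'CLASS_C', 'CLASS_C', 'CLASS_C'];
var sample = [];
for (var i = 0; i < 200; i++) {
    sample.push({ objectId: i, objectClass: classes[i % 10], objectValue: 'row ' + i });
}
var partition = [
    { objectClass: 'CLASS_A', ratio: 0.30 },
    { objectClass: 'CLASS_B', ratio: 0.20 },
    { objectClass: 'CLASS_C', ratio: 0.50 }
];
var split = stratifiedSplit(sample, 0.75, partition);
console.log(split.train.length, split.test.length, split.skipped.length); // 150 50 0
```

Here the input already matches the map, so nothing is skipped. With other class distributions, or with sizes where the quotas do not divide evenly, some rows would be skipped and the resulting ratios would only be approximate.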



via loretoparisi
