
generateTrainTestSplit

Short Description

The function automatically generates a segmentation mask for each thumbnail, for use in training the deep learning model. It also splits the data into training, validation and test sets, so the output can be fed directly into the deep learning algorithm. Manually drawing the masks on the thumbnails would be the ideal approach, but automation is used for scalability.

Function

generateTrainTestSplit(thumbnailFolder, projectDir, file_extension=None, verbose=True, TruePos='TruePos', NegToPos='NegToPos', TrueNeg='TrueNeg', PosToNeg='PosToNeg')

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `thumbnailFolder` | list | List of folders containing the human-sorted Thumbnails that are used to generate the training data, which is then split into training, validation and test cohorts. A single path passed as a string is also accepted. | required |
| `projectDir` | str | Path to output directory. | required |
| `file_extension` | str | If there are non-image files in the thumbnailFolder, the user can specify a file extension so that only matching files are processed. | `None` |
| `verbose` | bool | If True, print detailed information about the process to the console. | `True` |
| `TruePos` | str | Name of the folder that holds the Thumbnails classified as True Positive. | `'TruePos'` |
| `NegToPos` | str | Name of the folder that holds the Thumbnails that were moved from True Negative to True Positive. | `'NegToPos'` |
| `TrueNeg` | str | Name of the folder that holds the Thumbnails classified as True Negative. | `'TrueNeg'` |
| `PosToNeg` | str | Name of the folder that holds the Thumbnails that were moved from True Positive to True Negative. | `'PosToNeg'` |
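As a rough illustration of how `file_extension` narrows the file selection, the sketch below mirrors the pattern logic in the source listing (`None` becomes `*`, anything else is prefixed with `*`). The helper name `build_pattern` and the file names are hypothetical, and `fnmatch` stands in for the `pathlib` globbing that the function itself uses.

```python
from fnmatch import fnmatch


def build_pattern(file_extension=None):
    """Hypothetical helper: None matches every file, otherwise '*' + extension."""
    if file_extension is None:
        return "*"
    return "*" + str(file_extension)


# Made-up folder content: two thumbnails and two non-image files
names = ["cell_1.tif", "cell_2.tif", "notes.txt", ".DS_Store"]

pattern = build_pattern(".tif")
selected = [n for n in names if fnmatch(n, pattern)]

print(pattern)   # *.tif
print(selected)  # ['cell_1.tif', 'cell_2.tif']
```

Under this logic, passing `'.tif'` or `'tif'` should both select the same thumbnails, since the extension is simply appended to a wildcard.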

Returns:

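| Name | Type | Description |
|---|---|---|
| masks | images | A segmentation mask is generated for every Thumbnail. Positive and negative Thumbnails are each split into training (60%), validation (20%) and test (20%) cohorts, and the image/mask pairs are written to `projectDir/CSPOT/TrainingData/<marker>/`. The function itself does not return a Python object. |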
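The split into cohorts can be sketched as follows. This is a minimal illustration based on the source listing on this page, which samples 60% of the files for training and then divides the remainder equally between validation and test. The function name `split_60_20_20` and the `seed` argument are assumptions added for reproducibility; the listing does not seed the random generator.

```python
import random


def split_60_20_20(items, seed=None):
    """Hypothetical helper: split items into train/validation/test cohorts."""
    rng = random.Random(seed)  # seeding is an assumption, not part of the source
    train = rng.sample(items, round(len(items) * 0.6))
    train_set = set(train)
    remaining = [x for x in items if x not in train_set]
    val = rng.sample(remaining, round(len(remaining) * 0.5))
    val_set = set(val)
    test = [x for x in remaining if x not in val_set]
    return train, val, test


# Made-up file names standing in for thumbnail paths
files = [f"cell_{i}.tif" for i in range(10)]
train, val, test = split_60_20_20(files, seed=0)
print(len(train), len(val), len(test))  # 6 2 2
```

The three cohorts are disjoint and together cover every input file; positive and negative thumbnails are split separately in the listing, so each class keeps roughly the same proportions.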

Example
# Import the package (the alias matches the `cs.` prefix used below)
import cspot as cs

# High level working directory
projectDir = '/Users/aj/Documents/cspotExampleData'

# Folder where the raw Thumbnails are stored
thumbnailFolder = [projectDir + '/CSPOT/Thumbnails/CD3D',
                   projectDir + '/CSPOT/Thumbnails/ECAD']

# The function expects the four pre-defined folders. If you have renamed them, pass the new names using the parameters below.
# If you have deleted any of the folders and are not using them, pass `None` for the corresponding parameter.
cs.generateTrainTestSplit ( thumbnailFolder, 
                            projectDir=projectDir,
                            file_extension=None,
                            TruePos='TruePos', NegToPos='NegToPos',
                            TrueNeg='TrueNeg', PosToNeg='PosToNeg')

# The same operation from the command line interface
python generateTrainTestSplit.py \
    --thumbnailFolder /Users/aj/Desktop/cspotExampleData/CSPOT/Thumbnails/CD3D /Users/aj/Desktop/cspotExampleData/CSPOT/Thumbnails/ECAD \
    --projectDir /Users/aj/Desktop/cspotExampleData/
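To see where the results of the example would land, the sketch below reconstructs the output directories from the inputs, following the source listing: the marker name is the `.stem` of each thumbnail folder, and one `training`, `validation` and `test` directory is created per marker under `CSPOT/TrainingData`. The helper name `expected_output_dirs` is hypothetical, and `PurePosixPath` is used only so that the sketch does not touch the file system.

```python
from pathlib import PurePosixPath

project_dir = PurePosixPath("/Users/aj/Documents/cspotExampleData")
thumbnail_folders = [project_dir / "CSPOT/Thumbnails/CD3D",
                     project_dir / "CSPOT/Thumbnails/ECAD"]


def expected_output_dirs(project_dir, thumbnail_folders):
    """Hypothetical helper: map each marker to its three cohort directories."""
    out = {}
    for folder in thumbnail_folders:
        marker = folder.stem  # same derivation as in the source listing
        out[marker] = [project_dir / "CSPOT" / "TrainingData" / marker / cohort
                       for cohort in ("training", "validation", "test")]
    return out


dirs = expected_output_dirs(project_dir, thumbnail_folders)
for marker, paths in dirs.items():
    print(marker, [str(p) for p in paths])

# Because `.stem` drops everything after the last dot, a folder name that
# contains a dot is shortened when used as the marker name.
print(PurePosixPath("/data/Thumbnails/HLA.DR").stem)  # HLA
```

One consequence of the `.stem` derivation is that marker folders whose names contain a dot may end up under a shortened name, so plain folder names are the safer choice.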
Source code in cspot/generateTrainTestSplit.py
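# Libraries used by the function (imported at the top of the module)
import pathlib
import random
import cv2 as cv
import numpy as np
import tifffile

def generateTrainTestSplit (thumbnailFolder, 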
                            projectDir, 
                            file_extension=None,
                            verbose=True,
                            TruePos='TruePos', NegToPos='NegToPos',
                            TrueNeg='TrueNeg', PosToNeg='PosToNeg'):
    """
Parameters:
    thumbnailFolder (list):
        List of folders containing the human-sorted Thumbnails that are used to generate
        the training data, which is then split into training, validation and test cohorts.

    projectDir (str):
        Path to output directory.

    file_extension (str, optional):
        If there are non-image files in the thumbnailFolder, the user can specify
        a file extension to only select those files for processing. The default is None.

    verbose (bool, optional):
        If True, print detailed information about the process to the console. 

    TruePos (str, optional):
        Name of the folder that holds the Thumbnails classified as True Positive.
        The default is 'TruePos'.

    NegToPos (str, optional):
        Name of the folder that holds the Thumbnails that were moved from `True Negative`
        to `True Positive`. The default is 'NegToPos'.

    TrueNeg (str, optional):
        Name of the folder that holds the Thumbnails classified as True Negative.
        The default is 'TrueNeg'.

    PosToNeg (str, optional):
        Name of the folder that holds the Thumbnails that were moved from `True Positive`
        to `True Negative`. The default is 'PosToNeg'.

Returns:
    masks (images):
        Segmentation masks are generated for every Thumbnail and split into Train,
        Test and Validation cohorts.

Example:
        ```python

        # High level working directory
        projectDir = '/Users/aj/Documents/cspotExampleData'

        # Folder where the raw Thumbnails are stored
        thumbnailFolder = [projectDir + '/CSPOT/Thumbnails/CD3D',
                           projectDir + '/CSPOT/Thumbnails/ECAD']

        # The function expects the four pre-defined folders. If you have renamed them, pass the new names using the parameters below.
        # If you have deleted any of the folders and are not using them, pass `None` for the corresponding parameter.
        cs.generateTrainTestSplit ( thumbnailFolder, 
                                    projectDir=projectDir,
                                    file_extension=None,
                                    TruePos='TruePos', NegToPos='NegToPos',
                                    TrueNeg='TrueNeg', PosToNeg='PosToNeg')

        # Same function if the user wants to run it via Command Line Interface
        python generateTrainTestSplit.py \
            --thumbnailFolder /Users/aj/Desktop/cspotExampleData/CSPOT/Thumbnails/CD3D /Users/aj/Desktop/cspotExampleData/CSPOT/Thumbnails/ECAD \
            --projectDir /Users/aj/Desktop/cspotExampleData/

        ```

    """

    # Function takes in path to two folders, processes the images in those folders,
    # and saves them into a different folder that contains Train, Validation and Test samples
    #TruePos='TruePos'; NegToPos='NegToPos'; TrueNeg='TrueNeg'; PosToNeg='PosToNeg'; verbose=True

    # convert the folder into a list
    if isinstance (thumbnailFolder, str):
        thumbnailFolder = [thumbnailFolder]

    # convert all path names to pathlib
    thumbnailFolder = [pathlib.Path(p) for p in thumbnailFolder]
    projectDir = pathlib.Path(projectDir)

    # find all markers passed
    all_markers = [i.stem for i in thumbnailFolder]

    # create directories to save
    for i in all_markers:
        for cohort in ('training', 'validation', 'test'):
            # exist_ok=True makes a separate existence check unnecessary
            (projectDir / 'CSPOT/TrainingData/' / f"{i}" / cohort).mkdir(parents=True, exist_ok=True)

    # standard format
    if file_extension is None:
        file_extension = '*'
    else:
        file_extension = '*' + str(file_extension)

    # Filter on pos cells
    def pos_filter (path):
        image = cv.imread(str(path.resolve()), cv.IMREAD_GRAYSCALE)
        blur = cv.GaussianBlur(image, ksize=(3,3), sigmaX=1, sigmaY=1)
        ret3,th3 = cv.threshold(blur,0,1,cv.THRESH_OTSU)
        mask = th3 + 1
        return [mask, image]

    # Filter on neg cells
    def neg_filter (path):
        image = cv.imread(str(path.resolve()), cv.IMREAD_GRAYSCALE)
        mask = np.ones(image.shape, dtype=np.uint8)
        return [mask, image]

    # identify the files within all the 4 folders
    def findFiles (folderIndex):
        if verbose is True:
            print ('Processing: ' + str(thumbnailFolder[folderIndex].stem))
        marker_name = str(thumbnailFolder[folderIndex].stem)

        baseFolder = thumbnailFolder[folderIndex]

        # A folder passed as None is treated as empty, so the lists below are always defined
        pos = list((baseFolder / TruePos).glob(file_extension)) if TruePos is not None else []
        negtopos = list((baseFolder / NegToPos).glob(file_extension)) if NegToPos is not None else []
        positive_cells = pos + negtopos

        neg = list((baseFolder / TrueNeg).glob(file_extension)) if TrueNeg is not None else []
        postoneg = list((baseFolder / PosToNeg).glob(file_extension)) if PosToNeg is not None else []
        negative_cells = neg + postoneg

        # prepare the Training, Validation and Test cohorts (60% / 20% / 20%)
        if len(positive_cells) > 0:
            train_pos = random.sample(positive_cells, round(len(positive_cells) * 0.6))
            remaining_pos = list(set(positive_cells) - set(train_pos))
            val_pos = random.sample(remaining_pos, round(len(remaining_pos) * 0.5)) # validation
            test_pos = list(set(remaining_pos) - set(val_pos)) # test
        else:
            train_pos = []; val_pos = []; test_pos = []
        if len(negative_cells) > 0:
            train_neg = random.sample(negative_cells, round(len(negative_cells) * 0.6))
            remaining_neg = list(set(negative_cells) - set(train_neg))
            val_neg = random.sample(remaining_neg, round(len(remaining_neg) * 0.5)) # validation
            test_neg = list(set(remaining_neg) - set(val_neg)) # test
        else:
            train_neg = []; val_neg = []; test_neg = []


        # loop through training dataset and save images and masks
        newname_train = list(range(len(train_pos) + len(train_neg))); random.shuffle(newname_train)
        train_pos_name = newname_train[:len(train_pos)]; train_neg_name = newname_train[len(train_pos):]

        if len (train_pos_name) > 0:
            for i, j in zip( train_pos_name, train_pos):
                m, im = pos_filter (j)
                # save image
                fPath = projectDir / 'CSPOT/TrainingData/' / f"{marker_name}" / 'training' / f"{i}_img.tif"
                tifffile.imwrite(fPath,im)
                # associated mask
                fPath = projectDir / 'CSPOT/TrainingData/' / f"{marker_name}" / 'training' / f"{i}_mask.tif"
                tifffile.imwrite(fPath, m)

        if len (train_neg_name) > 0:
            for k, l in zip( train_neg_name, train_neg):
                m, im = neg_filter (l)
                # save image
                fPath = projectDir / 'CSPOT/TrainingData/' / f"{marker_name}" / 'training' / f"{k}_img.tif"
                tifffile.imwrite(fPath, im)
                # associated mask
                fPath = projectDir / 'CSPOT/TrainingData/' / f"{marker_name}" / 'training' / f"{k}_mask.tif"
                tifffile.imwrite(fPath, m)


        # loop through validation dataset and save images and masks
        newname_train = list(range(len(val_pos) + len(val_neg))); random.shuffle(newname_train)
        train_pos_name = newname_train[:len(val_pos)]; train_neg_name = newname_train[len(val_pos):]

        if len (train_pos_name) > 0:
            for i, j in zip( train_pos_name, val_pos):
                m, im = pos_filter (j)
                # save image
                fPath = projectDir / 'CSPOT/TrainingData/' / f"{marker_name}" / 'validation' / f"{i}_img.tif"
                tifffile.imwrite(fPath, im)
                # associated mask
                fPath = projectDir / 'CSPOT/TrainingData/' / f"{marker_name}" / 'validation' / f"{i}_mask.tif"
                tifffile.imwrite(fPath, m)

        if len (train_neg_name) > 0:
            for k, l in zip( train_neg_name, val_neg):
                m, im = neg_filter (l)
                # save image
                fPath = projectDir / 'CSPOT/TrainingData/' / f"{marker_name}" / 'validation' / f"{k}_img.tif"
                tifffile.imwrite(fPath, im)
                # associated mask
                fPath = projectDir / 'CSPOT/TrainingData/' / f"{marker_name}" / 'validation' / f"{k}_mask.tif"
                tifffile.imwrite(fPath, m)


        # loop through test dataset and save images and masks
        newname_train = list(range(len(test_pos) + len(test_neg))); random.shuffle(newname_train)
        train_pos_name = newname_train[:len(test_pos)]; train_neg_name = newname_train[len(test_pos):]

        if len (train_pos_name) > 0:
            for i, j in zip( train_pos_name, test_pos):
                m, im = pos_filter (j)
                # save image
                fPath = projectDir / 'CSPOT/TrainingData/' / f"{marker_name}" / 'test' / f"{i}_img.tif"
                tifffile.imwrite(fPath, im)
                # associated mask
                fPath = projectDir / 'CSPOT/TrainingData/' / f"{marker_name}" / 'test' / f"{i}_mask.tif"
                tifffile.imwrite(fPath, m)

        if len (train_neg_name) > 0:
            for k, l in zip( train_neg_name, test_neg):
                m, im = neg_filter (l)
                # save image
                fPath = projectDir / 'CSPOT/TrainingData/' / f"{marker_name}" / 'test' / f"{k}_img.tif"
                tifffile.imwrite(fPath, im)
                # associated mask
                fPath = projectDir / 'CSPOT/TrainingData/' / f"{marker_name}" / 'test' / f"{k}_mask.tif"
                tifffile.imwrite(fPath, m)

    # apply function to all folders
    for folderIndex in range(len(thumbnailFolder)):
        findFiles(folderIndex=folderIndex)

    # Print
    if verbose is True:
        print('Training data has been generated, head over to "' + str(projectDir) + '/CSPOT/TrainingData" to view results')
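
The mask labels written by `pos_filter` and `neg_filter` can be illustrated with a small pure-Python sketch. In the listing, positive thumbnails are blurred, thresholded with Otsu's method to 0/1 and then incremented by one, so background pixels become 1 and foreground pixels become 2, while negative thumbnails receive a mask of all ones. The sketch below skips the Gaussian blur and uses a simplified, hand-written Otsu threshold on a flat list of made-up pixel values, so it is not expected to be bit-identical to OpenCV. The function name `otsu_threshold` is hypothetical.

```python
def otsu_threshold(pixels):
    """Return the 8-bit threshold t that maximizes between-class variance.

    Pixels with value <= t are treated as background, pixels > t as foreground.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(value * count for value, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg, sum_bg = 0, 0
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        w_fg = total - w_bg
        if w_bg == 0:
            continue  # no background pixels yet
        if w_fg == 0:
            break     # every pixel is already background
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        between_var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


# Made-up intensities: five dim pixels and three bright ones
pixels = [10, 12, 11, 200, 210, 205, 9, 13]
t = otsu_threshold(pixels)

# Positive thumbnail: binary threshold (0/1) plus one gives labels 1 and 2
pos_mask = [(1 if p > t else 0) + 1 for p in pixels]
# Negative thumbnail: every pixel is labelled 1
neg_mask = [1] * len(pixels)

print(t)         # 13
print(pos_mask)  # [1, 1, 1, 2, 2, 2, 1, 1]
print(neg_mask)  # [1, 1, 1, 1, 1, 1, 1, 1]
```

With this convention a mask never contains zeros: label 1 marks background in both classes and label 2 appears only in positive thumbnails.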