FlowKit Tutorial - Part 1 - The `Sample` Class

https://flowkit.readthedocs.io/en/latest/?badge=latest

Welcome to the series of FlowKit tutorial notebooks! I hope you find these tutorials a helpful guide to using FlowKit for your FCM analysis. Part 1 covers the Sample class, the foundational class on which most of FlowKit is built. If you have any questions about FlowKit, find any bugs, or feel something is missing from these tutorials please submit an issue to the GitHub repository here.

[71]:

import bokeh
from bokeh.plotting import show

import flowkit as fk

bokeh.io.output_notebook()

Loading BokehJS ...

[72]:

# check version so users can verify they have the same version/API
fk.__version__

[72]:

'1.3.0'

Sample Class

A Sample instance represents a single FCS sample, and is the only point of entry for FCS event data into the FlowKit library.

A Sample object can conveniently be created from a variety of data sources:

A file path to an FCS file
A pathlib Path object to an FCS file
An already instantiated FlowIO FlowData object
A NumPy array (must provide sample_id & channel_labels)
A Pandas DataFrame (with channel labels as headers, must provide sample_id)

Let’s take a look at the Sample constructor method:

Sample(
    fcs_path_or_data,
    sample_id=None,
    filename_as_id=False,
    channel_labels=None,
    compensation=None,
    null_channel_list=None,
    ignore_offset_error=False,
    ignore_offset_discrepancy=False,
    use_header_offsets=False,
    preprocess=True,
    use_flowjo_labels=False,
    subsample=10000
)

fcs_path_or_data: a data source for the FCS sample as described above
sample_id: A text string to use for the Sample’s ID. If None, the ID will be taken from the ‘fil’ keyword of the metadata. If the ‘fil’ keyword is not present, the value will be the filename if given a file. For a NumPy array or Pandas DataFrame, a text value is required.
filename_as_id: Boolean option for using the file name (as it exists on the filesystem) for the Sample’s ID, default is False. This option is only valid for file-like objects (file paths, filehandles, pathlib Paths). Note that the ‘sample_id’ kwarg takes precedence: if both are specified, the ‘filename_as_id’ option is ignored.
channel_labels: A list of strings or a list of tuples to use for the channel labels. Required if fcs_path_or_data is a NumPy array.
compensation: Compensation matrix, which can be a:
- Matrix instance
- NumPy array
- CSV file path
- pathlib Path object to a CSV or TSV file
- string of CSV text
- None (default) for no compensation (it can be applied later via the apply_compensation method)
null_channel_list: List of PnN labels for acquired channels that do not contain useful data. Note, this should only be used if no fluorochromes were used to target those detectors. Null channels do not contribute to compensation and should not be included in a compensation matrix for this sample.
ignore_offset_error: An option to ignore data offset error (see note below for more details)
ignore_offset_discrepancy: option to ignore discrepancy between the HEADER and TEXT values for the DATA byte offset location, default is False
use_header_offsets: use the HEADER section for the data offset locations, default is False. Setting this option to True also suppresses an error in cases of an offset discrepancy.
preprocess: Controls whether preprocessing is applied to the ‘raw’ data (retrievable via the get_events() method with source=’raw’). Binary events in an FCS file are stored unprocessed, meaning they have not been scaled according to channel gain, corrected for proper lin/log display, or had the time channel scaled by the ‘timestep’ keyword value (if present). Unprocessed event data is typically not useful for analysis, so the default is True. Preprocessing does not include compensation or transformation (e.g. biex, Logicle) which are separate operations.
use_flowjo_labels: FlowJo converts forward slashes (‘/’) in PnN labels to underscores. This option matches that behavior. Default is False.
subsample: The number of events to use for subsampling. The number of subsampled events can be changed after instantiation using the subsample_events method. The random seed can also be specified using that method. Subsampled events are used predominantly for speeding up plotting methods.
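
The precedence among sample_id, filename_as_id, and the ‘fil’ keyword described above can be summarized in a short sketch. This is not FlowKit’s implementation; resolve_sample_id and its arguments are hypothetical names used only to illustrate the order in which the ID sources are consulted.

```python
from pathlib import Path


def resolve_sample_id(sample_id=None, filename_as_id=False, metadata=None, file_path=None):
    """Illustrative only: pick a Sample ID using the precedence described above."""
    metadata = metadata or {}
    filename = Path(file_path).name if file_path is not None else None

    # 1. An explicit sample_id always wins.
    if sample_id is not None:
        return sample_id
    # 2. filename_as_id only applies when there is a file to take the name from.
    if filename_as_id and filename is not None:
        return filename
    # 3. Otherwise prefer the 'fil' keyword from the FCS metadata.
    if 'fil' in metadata:
        return metadata['fil']
    # 4. Fall back to the file name when the keyword is missing.
    if filename is not None:
        return filename
    # Arrays and DataFrames have neither metadata nor a file name.
    raise ValueError("sample_id is required when no file or 'fil' keyword is available")


print(resolve_sample_id(metadata={'fil': 'B07'}, file_path='data1.fcs'))  # B07
print(resolve_sample_id(filename_as_id=True, metadata={'fil': 'B07'}, file_path='data1.fcs'))  # data1.fcs
```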

Note about FCS files with a data offset error:

Some FCS files incorrectly report the location of the last data byte as the last byte exclusive of the data section rather than the last byte inclusive of the data section. Technically, these are invalid FCS files but these are not corrupted data files. To attempt to read in these files, set the ignore_offset_error option to True.
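
To make the off-by-one nature of this error concrete, here is a minimal sketch that compares a reported DATA segment against the number of bytes the events should occupy. The function name and the offsets are hypothetical, and this is not how FlowIO performs the check; it only illustrates the difference between an inclusive and an exclusive end offset.

```python
def data_segment_matches(begin, end, expected_bytes):
    """Illustrative only: classify how a DATA end offset relates to the expected size."""
    if end - begin + 1 == expected_bytes:
        return 'inclusive'   # valid: 'end' is the last byte of the data
    if end - begin == expected_bytes:
        return 'exclusive'   # off by one: 'end' points one byte past the data
    return 'mismatch'


# Hypothetical file: 13367 events x 8 channels x 2 bytes (16 bits) per value.
expected = 13367 * 8 * 2
begin = 4096  # hypothetical start of the DATA segment

print(data_segment_matches(begin, begin + expected - 1, expected))  # inclusive
print(data_segment_matches(begin, begin + expected, expected))      # exclusive
```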

Note on ignore_offset_discrepancy and use_header_offsets: The byte offset location for the DATA segment is defined in 2 places in an FCS file: the HEADER and the TEXT segments. By default, FlowIO uses the offset values found in the TEXT segment. If the HEADER values differ from the TEXT values, a DataOffsetDiscrepancyError will be raised. Setting ignore_offset_discrepancy to True overrides this error to force the loading of the FCS file. The related use_header_offsets option forces loading the file using the data offset locations found in the HEADER section rather than the TEXT section. Setting use_header_offsets to True is equivalent to setting both options to True, meaning no error will be raised for an offset discrepancy.
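
The interaction of the two options can be sketched as a small decision function. The names choose_data_offsets and OffsetDiscrepancy are hypothetical stand-ins, and the sketch assumes that the TEXT values are still the ones used when only the discrepancy is ignored; it is not FlowIO’s actual code.

```python
class OffsetDiscrepancy(Exception):
    """Hypothetical stand-in for the error raised on a HEADER/TEXT mismatch."""


def choose_data_offsets(header, text, ignore_offset_discrepancy=False, use_header_offsets=False):
    """Illustrative only: decide which (begin, end) DATA offsets to use."""
    if use_header_offsets:
        # Using the HEADER values also suppresses the discrepancy error.
        return header
    if header != text and not ignore_offset_discrepancy:
        raise OffsetDiscrepancy(f"HEADER {header} != TEXT {text}")
    # Default (and assumed behavior when the discrepancy is ignored): use TEXT.
    return text


print(choose_data_offsets((100, 200), (100, 201), ignore_offset_discrepancy=True))  # (100, 201)
print(choose_data_offsets((100, 200), (100, 201), use_header_offsets=True))         # (100, 200)
```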

Event type names in the Sample class

Several methods in the Sample class include a source argument that determines the type of events used or retrieved by the method. The options for the source argument are:

raw
comp
xform

The raw option retrieves the event data that are not compensated or transformed. With preprocess=True (the default), the raw events are preprocessed according to the channel gain and time step information. These preprocessing steps are necessary for correct interpretation of the encoded FCS event data for analysis. If a Sample is loaded with preprocess=False, then the raw events are exactly as they were encoded in the FCS file, without applying the gain (from the $PnG keywords) or the time step from the FCS metadata. Non-preprocessed event data are typically not useful for processing or analysis.
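
As a rough illustration of what preprocessing does to a single value, the sketch below applies the conventions of the FCS standard: a linear channel is divided by its gain ($PnG), and a log-encoded channel ($PnE with a non-zero first value) is mapped onto the stated number of decades over the channel range ($PnR). The function name and the stored values 323 and 220 are hypothetical; the stored values were chosen so the results line up with the first FSC-H and FL1-H values shown in the event tables later in this tutorial. Time scaling is omitted, and this is not FlowKit’s implementation.

```python
def preprocess_value(stored, pne, png, pnr):
    """Illustrative only: scale one stored channel value using its PnE, PnG, and PnR."""
    decades, log_min = pne
    if decades > 0:
        # Log-encoded channel: spread the stored range over `decades` decades.
        # A minimum of 0 is treated as 1.
        log_min = log_min or 1.0
        return log_min * 10 ** (decades * stored / pnr)
    # Linear channel: divide by the gain.
    return stored / png


# Linear channel with a gain of 3.67 (hypothetical stored value 323)
print(round(preprocess_value(323, (0.0, 0.0), 3.67, 1024), 6))

# Log channel with 4 decades over a range of 1024 (hypothetical stored value 220)
print(round(preprocess_value(220, (4.0, 0.0), 1.0, 1024), 6))
```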

The comp option specifies the event data as the raw data with a compensation matrix applied. These events are only available if a compensation matrix was specified in the Sample object instantiation or if the apply_compensation method has been called.

The xform option specifies transformed events. Transformed events will be stored post-compensation if a compensation matrix was supplied when creating a Sample instance or if the apply_compensation method has been called. Transformations can also be applied to a non-compensated Sample.

Applying compensation and transforms is covered in part 2 of the tutorial notebook series.

Information is also available via the Python help function, along with descriptions of the Sample class methods:

help(fk.Sample)

Create a Sample Instance

As stated above in the Sample docstring, a Sample instance can be created from a variety of data sources:

File path to an FCS file
pathlib Path object to an FCS file
FlowIO FlowData object
NumPy array (must provide sample_id & channel_labels)
Pandas DataFrame (with channel labels as headers, must provide sample_id)

From an FCS File

Let’s create a Sample instance from a file path to an FCS file.

[73]:

fcs_path = '../../data/gate_ref/data1.fcs'

[74]:

sample = fk.Sample(fcs_path)

[75]:

sample

[75]:

Sample(v2.0, B07, 8 channels, 13367 events)

The string representation tells us this is an FCS 2.0 file with the ‘$FIL’ keyword value of ‘B07’. There are 8 channels of event data with 13,367 total events.

From a pandas DataFrame or NumPy array

A Sample can also be created from a pandas DataFrame or a NumPy array. In both cases, a sample ID must be provided. For FCS files, this ID is read from the metadata, and without an ID other features of FlowKit will have no mechanism to reference a Sample.

Let’s get a DataFrame from the previous Sample we made. We’ll cover more on exporting events later in this tutorial, but for now we’ll just use the as_dataframe method. Then, we can use it as an example of creating a new Sample instance.

[76]:

df_events = sample.as_dataframe(source='raw')

[77]:

df_events.head()

[77]:

pnn	FSC-H	SSC-H	FL1-H	FL2-H	FL3-H	FL2-A	FL4-H	Time
pns	FSC-Height	SSC-Height	CD4 FITC	CD8 B PE	CD3 PerCP		CD8 APC	Time (102.40 sec.)
0	88.010899	27.250	7.233942	34.598917	11.039992	5.0	5.186134	0.0
1	19.073569	5.375	36.517413	1.000000	170.007762	0.0	4.293510	0.0
2	70.572207	26.000	2.480454	12.863969	3.023213	0.0	8.582104	0.0
3	98.910082	31.750	5.473703	14.989296	3.751619	0.0	6.731704	0.0
4	29.972752	34.750	2.641648	2.665516	2.641648	0.0	6.097562	0.0

[78]:

sample_from_df = fk.Sample(df_events, sample_id='my_sample_from_dataframe')

[79]:

sample_from_df

[79]:

Sample(v3.1, my_sample_from_dataframe, 8 channels, 13367 events)

Creating a Sample from a NumPy array is similar, but we must also provide the channel names. In the case of the DataFrame above, the channel names were taken from the column names.

[80]:

np_events = sample.get_events(source='raw')
channel_labels = sample.pnn_labels

[81]:

sample_from_np = fk.Sample(np_events, channel_labels=channel_labels, sample_id='my_sample_from_numpy')

[82]:

sample_from_np

[82]:

Sample(v3.1, my_sample_from_numpy, 8 channels, 13367 events)

Loading Multiple Samples

The utility function load_samples allows loading multiple FCS files. The input can be a list of file paths or a string for a directory or file path. If given a directory, any .fcs files in the directory will be loaded.

load_samples takes many of the same arguments as the Sample class constructor. These arguments require all the FCS samples to have the same set of channels.
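
The way a directory, a single path, or a list of paths might be expanded into individual FCS files can be sketched with pathlib. The helper collect_fcs_paths is hypothetical, and details such as the sort order and matching only a lowercase .fcs suffix are assumptions rather than documented FlowKit behavior.

```python
import tempfile
from pathlib import Path


def collect_fcs_paths(fcs_samples):
    """Illustrative only: expand a directory, a file path, or a list of paths."""
    if isinstance(fcs_samples, (str, Path)):
        fcs_samples = [fcs_samples]

    paths = []
    for item in fcs_samples:
        path = Path(item)
        if path.is_dir():
            # Assumption: match a lowercase '.fcs' suffix and return sorted paths.
            paths.extend(sorted(path.glob('*.fcs')))
        else:
            paths.append(path)
    return paths


# Demonstrate with a temporary directory holding two empty .fcs files and one other file.
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ('b.fcs', 'a.fcs', 'notes.txt'):
        (Path(tmp_dir) / name).touch()
    found_names = [p.name for p in collect_fcs_paths(tmp_dir)]

print(found_names)  # ['a.fcs', 'b.fcs']
```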

Let’s review the docstring for load_samples and then use it to load the 3 FCS files in the 8 color data set.

[83]:

fk.load_samples?

Signature:
fk.load_samples(
    fcs_samples,
    filename_as_id=False,
    compensation=None,
    null_channel_list=None,
    preprocess=True,
    use_flowjo_labels=False,
)
Docstring:
Returns a list of Sample instances from a variety of input types (fcs_samples), such as a file or
    directory path string, Path object, a Sample instance, or a list of the previous types. Lists
    of mixed types are not supported.

:param fcs_samples: Sample, str, or list. Allowed types: a Sample instance, list of Sample instances,
        a directory or file path, or a list of directory or file paths. If a directory, any .fcs
        files in the directory will be loaded. If a list, then it must be a list of file paths or a
        list of Sample instances. Lists of mixed types are not supported.
:param filename_as_id: Boolean option for using the file name (as it exists on the
    filesystem) for the Sample's ID, default is False. Only applies to file paths given to the
    'fcs_samples' argument.
:param compensation: Compensation matrix. The matrix must be applicable to all samples in
    'fcs_samples'. Acceptable types include a Matrix instance, NumPy array, CSV file path,
    pathlib Path object to a CSV or TSV file, or a string of CSV text
:param null_channel_list: List of PnN labels for acquired channels that do not contain
    useful data. Note, this should only be used if no fluorochromes were used to target
    those detectors. Null channels do not contribute to compensation and should not be
    included in a compensation matrix for this sample. This option is ignored if
    `fcs_path_or_data` is a FlowData object. The null channel list must be applicable
    to all samples.
:param preprocess: Controls whether preprocessing is applied to the 'raw' data (retrievable
    via Sample.get_events() with source='raw'). Binary events in an FCS file are stored
    unprocessed, meaning they have not been scaled according to channel gain, corrected for
    proper lin/log display, or had the time channel scaled by the 'timestep' keyword value
    (if present). Unprocessed event data is typically not useful for analysis, so the default
    is True. Preprocessing does not include compensation or transformation (e.g. biex, Logicle)
    which are separate operations.
:param use_flowjo_labels: FlowJo converts forward slashes ('/') in PnN labels to underscores.
    This option matches that behavior. Default is False.
:return: list of Sample instances
File:      ~/git/flowkit/src/flowkit/_utils/sample_utils.py
Type:      function

[84]:

path_to_8c_files = "../../data/8_color_data_set/fcs_files/"

samples_8c = fk.load_samples(path_to_8c_files)

[85]:

samples_8c

[85]:

[Sample(v3.1, 101_DEN084Y5_15_E01_008_clean.fcs, 15 channels, 290172 events),
 Sample(v3.1, 101_DEN084Y5_15_E03_009_clean.fcs, 15 channels, 283969 events),
 Sample(v3.1, 101_DEN084Y5_15_E05_010_clean.fcs, 15 channels, 285290 events)]

A useful tip: The Sample class supports the comparison operators on the id attribute, so sorting a list is easy.

[86]:

sorted(samples_8c, reverse=True)

[86]:

[Sample(v3.1, 101_DEN084Y5_15_E05_010_clean.fcs, 15 channels, 285290 events),
 Sample(v3.1, 101_DEN084Y5_15_E03_009_clean.fcs, 15 channels, 283969 events),
 Sample(v3.1, 101_DEN084Y5_15_E01_008_clean.fcs, 15 channels, 290172 events)]
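
For readers curious how comparison on an ID can make objects sortable, here is a minimal sketch using functools.total_ordering. SampleStub is a hypothetical stand-in and does not reflect how the Sample class is actually written.

```python
from functools import total_ordering


@total_ordering
class SampleStub:
    """Hypothetical stand-in for an object that compares by its id."""

    def __init__(self, sample_id):
        self.id = sample_id

    def __eq__(self, other):
        return self.id == other.id

    def __lt__(self, other):
        return self.id < other.id

    def __repr__(self):
        return f"SampleStub({self.id!r})"


stubs = [SampleStub('E03_009'), SampleStub('E05_010'), SampleStub('E01_008')]
sorted_ids = [s.id for s in sorted(stubs)]
print(sorted_ids)  # ['E01_008', 'E03_009', 'E05_010']
```

An explicit key, such as sorted(samples_8c, key=lambda s: s.id), expresses the same ordering without relying on the comparison operators.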

Metadata and Channel Information

Get the FCS version of the file (returns None if a Sample was created from a NumPy array or Pandas DataFrame)

[87]:

sample.version

[87]:

'2.0'

The id attribute is derived from the FCS metadata ‘fil’ keyword. This is the default behavior, but can be forced to be the actual filesystem name by setting filename_as_id=True when creating a Sample. The id is an important attribute in FlowKit, as it is used to reference samples in other classes.

[88]:

sample.id

[88]:

'B07'

Retrieve all the FCS metadata using the get_metadata() method. FlowKit converts all FCS metadata keywords to lowercase and strips the “$” from the standard FCS keywords. This is done for consistency and ease of typing when referencing them. When writing out new FCS files from a Sample instance, you don’t have to worry about formatting the keywords; FlowKit handles this automatically. Let’s look at the dictionary of metadata for our sample:

[89]:

sample.get_metadata()

[89]:

{'byteord': '4,3,2,1',
 'datatype': 'I',
 'nextdata': '0',
 'sys': 'Macintosh System Software 9.0.4',
 'creator': 'CELLQuestª 3.3',
 'tot': '13367',
 'mode': 'L',
 'par': '8',
 'p1n': 'FSC-H',
 'p1r': '1024',
 'p1b': '16',
 'p1e': '0,0',
 'p1g': '3.67',
 'p2n': 'SSC-H',
 'p2r': '1024',
 'p2b': '16',
 'p2e': '0,0',
 'p2g': '8',
 'p3n': 'FL1-H',
 'p3r': '1024',
 'p3b': '16',
 'p3e': '4,0',
 'p4n': 'FL2-H',
 'p4r': '1024',
 'p4b': '16',
 'p4e': '4,0',
 'p5n': 'FL3-H',
 'p5r': '1024',
 'p5b': '16',
 'p5e': '4,0',
 'p1s': 'FSC-Height',
 'p2s': 'SSC-Height',
 'p3s': 'CD4 FITC',
 'p4s': 'CD8 B PE',
 'p5s': 'CD3 PerCP',
 'p6n': 'FL2-A',
 'p6r': '1024',
 'p6b': '16',
 'p6e': '0,0',
 'timeticks': '100',
 'p7n': 'FL4-H',
 'p7r': '1024',
 'p7e': '4,0',
 'p7b': '16',
 'p7s': 'CD8 APC',
 'p8n': 'Time',
 'p8r': '1024',
 'p8e': '0,0',
 'p8b': '16',
 'p8s': 'Time (102.40 sec.)',
 'sample id': 'Default Patient ID',
 'src': 'Default',
 'case number': 'Default Case Number',
 'cyt': 'FACSCalibur',
 'cytnum': 'E3820',
 'btim': '16:31:33',
 'etim': '16:31:52',
 'bdacqlibversion': '3.1',
 'bdnpar': '7',
 'bdp1n': 'FSC-H',
 'bdp2n': 'SSC-H',
 'bdp3n': 'FL1-H',
 'bdp4n': 'FL2-H',
 'bdp5n': 'FL3-H',
 'bdp6n': 'FL2-A',
 'bdp7n': 'FL4-H',
 'bdword0': '24',
 'bdword1': '394',
 'bdword2': '492',
 'bdword3': '477',
 'bdword4': '566',
 'bdword5': '397',
 'bdword6': '397',
 'bdword7': '397',
 'bdword8': '398',
 'bdword9': '397',
 'bdword10': '300',
 'bdword11': '299',
 'bdword12': '551',
 'bdword13': '4',
 'bdword14': '397',
 'bdword15': '501',
 'bdword16': '481',
 'bdword17': '586',
 'bdword18': '574',
 'bdword19': '100',
 'bdword20': '100',
 'bdword21': '100',
 'bdword22': '100',
 'bdword23': '1',
 'bdword24': '1',
 'bdword25': '0',
 'bdword26': '0',
 'bdword27': '0',
 'bdword28': '136',
 'bdword29': '52',
 'bdword30': '52',
 'bdword31': '52',
 'bdword32': '52',
 'bdword33': '52',
 'bdword34': '12',
 'bdword35': '201',
 'bdword36': '6',
 'bdword37': '138',
 'bdword38': '280',
 'bdword39': '3',
 'bdword40': '3',
 'bdword41': '100',
 'bdword42': '100',
 'bdword43': '0',
 'bdword44': '1023',
 'bdword45': '1023',
 'bdword46': '1023',
 'bdword47': '53',
 'bdword48': '550',
 'bdword49': '56',
 'bdword50': '72',
 'bdword51': '52',
 'bdword52': '0',
 'bdword53': '0',
 'bdword54': '0',
 'bdword55': '0',
 'bdword56': '0',
 'bdword57': '0',
 'bdword58': '0',
 'bdword59': '0',
 'bdword60': '0',
 'bdword61': '0',
 'bdword62': '0',
 'bdword63': '0',
 'bdlasermode': '1',
 'calibfile': 'FALSE',
 'p7thresvol': '52',
 'fil': 'B07',
 'date': '23-Aug-02',
 'number well info keywords': '3',
 '&1sample': '200',
 '&2number of washes': '1',
 '&3mixing vol': '100',
 '&4number of mixes': '2',
 '&5data file prefix part #1\\\\&6data file prefix part #2\\\\&7data file prefix part #3\\\\&8acquisition doc.': 'LYMPH SUBSET ACQ',
 '&9instr. sett. file': 'E#7 Settings #1',
 '&10patient id': ' FJ#192659',
 '&11day': '35d',
 '&12sample id': 'T-cells',
 '&13analysis doc.': ''}
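
The keyword convention described above (lowercase, with the “$” removed from standard keywords) can be mimicked in a few lines. The helper normalize_keyword is hypothetical and only approximates the rule as stated in this tutorial.

```python
def normalize_keyword(keyword):
    """Illustrative only: lowercase an FCS keyword and drop a leading '$'."""
    keyword = keyword.lower()
    return keyword[1:] if keyword.startswith('$') else keyword


print(normalize_keyword('$FIL'))       # fil
print(normalize_keyword('$P1N'))       # p1n
print(normalize_keyword('SAMPLE ID'))  # sample id
```

With this convention, a single value is looked up with the lowercase key, for example the 'fil' entry of the dictionary shown above.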

Retrieve a DataFrame of channel information, including the required PnN labels & optional PnS labels for each channel. Note that Samples distinguish between channel numbers and indices, with channel numbers being indexed at 1 and channel indices being indexed at 0.

[90]:

sample.channels

[90]:

	channel_number	pnn	pns	pne	png	pnr
0	1	FSC-H	FSC-Height	(0.0, 0.0)	3.67	1024.0
1	2	SSC-H	SSC-Height	(0.0, 0.0)	8.00	1024.0
2	3	FL1-H	CD4 FITC	(4.0, 1.0)	1.00	1024.0
3	4	FL2-H	CD8 B PE	(4.0, 1.0)	1.00	1024.0
4	5	FL3-H	CD3 PerCP	(4.0, 1.0)	1.00	1024.0
5	6	FL2-A		(0.0, 0.0)	1.00	1024.0
6	7	FL4-H	CD8 APC	(4.0, 1.0)	1.00	1024.0
7	8	Time	Time (102.40 sec.)	(0.0, 0.0)	1.00	1024.0
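
The relationship between 1-based channel numbers and 0-based channel indices can be sketched with a plain list of the PnN labels from the table above. The two helper functions are hypothetical and are not the Sample class methods.

```python
pnn_labels = ['FSC-H', 'SSC-H', 'FL1-H', 'FL2-H', 'FL3-H', 'FL2-A', 'FL4-H', 'Time']


def channel_index(label_or_number):
    """Illustrative only: 0-based index for a PnN label or a 1-based channel number."""
    if isinstance(label_or_number, int):
        return label_or_number - 1
    return pnn_labels.index(label_or_number)


def channel_number(label):
    """Illustrative only: 1-based channel number for a PnN label."""
    return pnn_labels.index(label) + 1


print(channel_index('FL2-H'))   # 3
print(channel_number('FL2-H'))  # 4
print(channel_index(4))         # 3
```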

Get a list of only the PnN labels:

[91]:

sample.pnn_labels

[91]:

['FSC-H', 'SSC-H', 'FL1-H', 'FL2-H', 'FL3-H', 'FL2-A', 'FL4-H', 'Time']

The optional PnS labels are also available (empty values will be empty strings):

[92]:

sample.pns_labels

[92]:

['FSC-Height',
 'SSC-Height',
 'CD4 FITC',
 'CD8 B PE',
 'CD3 PerCP',
 '',
 'CD8 APC',
 'Time (102.40 sec.)']

The Sample class attempts to automatically identify fluorescent, scatter, and time channels. This is done by looking for simple substring values in channel names (i.e. ‘FSC’, ‘SSC’, ‘Time’). The channel indices for these channel types are available using the following attributes:

[93]:

sample.fluoro_indices

[93]:

[2, 3, 4, 5, 6]

[94]:

sample.scatter_indices

[94]:

[0, 1]
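
A substring-based classification of this kind might look like the sketch below. The function classify_channels is hypothetical, and the case-insensitive matching is an assumption; the actual rules used by the Sample class may differ.

```python
def classify_channels(pnn_labels):
    """Illustrative only: group channel indices by simple substring matches."""
    fluoro, scatter, time_index = [], [], None
    for index, label in enumerate(pnn_labels):
        upper = label.upper()  # assumption: matching ignores case
        if 'FSC' in upper or 'SSC' in upper:
            scatter.append(index)
        elif 'TIME' in upper:
            time_index = index
        else:
            # Anything that is not scatter or time is treated as fluorescence.
            fluoro.append(index)
    return fluoro, scatter, time_index


labels = ['FSC-H', 'SSC-H', 'FL1-H', 'FL2-H', 'FL3-H', 'FL2-A', 'FL4-H', 'Time']
print(classify_channels(labels))  # ([2, 3, 4, 5, 6], [0, 1], 7)
```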

[95]:

sample.time_index

[95]:

7

Look up a channel index by a label string:

[96]:

sample.get_channel_index('FL2-H')

[96]:

3

Or, look up a channel number:

[97]:

sample.get_channel_number_by_label('FL2-H')

[97]:

4

And for completeness, get a channel index via its number:

[98]:

sample.get_channel_index(4)

[98]:

3

To get the event count:

[99]:

sample.event_count

[99]:

13367

Several other Sample attributes are available including:

original_filename
is_preprocessed
acquisition_date
compensation
transform

And a few attributes for filtering events:

subsample_indices: Set by the constructor or by calling the subsample_events method (see below).
negative_scatter_indices: Set by calling the filter_negative_scatter method.
flagged_indices: Assigned manually by the user for flagging any events (anomalous events from a QC routine, etc.).

Renaming a Channel

The FCS standard requires a value for the PnN channel label. Because of this, the PnN label is most often used to identify a channel. However, the label values stored in this field are not always the most descriptive. The FCS used in this notebook is a good example. Let’s rename a channel to something more human readable.

Note: It is recommended to retain the height, width, or area postfix (e.g. the trailing ‘-A’) since some software utilizes this information.

[100]:

help(sample.rename_channel)

Help on method rename_channel in module flowkit._models.sample:

rename_channel(current_label, new_label, new_pns_label=None) method of flowkit._models.sample.Sample instance
    Rename a channel label.

    :param current_label: PnN label of a channel
    :param new_label: new PnN label
    :param new_pns_label: optional new PnS label
    :return: None

[101]:

# Channel 3 is CD4-FITC with height as the measurement, yet the PnN label of this channel is 'FL1-H'.
# Let's rename this to 'CD4-FITC-H' & set the PnS label to the simple 'CD4' marker.
sample.rename_channel('FL1-H', 'CD4-FITC-H', 'CD4')

[102]:

# Review the renamed channel
sample.channels

[102]:

	channel_number	pnn	pns	pne	png	pnr
0	1	FSC-H	FSC-Height	(0.0, 0.0)	3.67	1024.0
1	2	SSC-H	SSC-Height	(0.0, 0.0)	8.00	1024.0
2	3	CD4-FITC-H	CD4	(4.0, 1.0)	1.00	1024.0
3	4	FL2-H	CD8 B PE	(4.0, 1.0)	1.00	1024.0
4	5	FL3-H	CD3 PerCP	(4.0, 1.0)	1.00	1024.0
5	6	FL2-A		(0.0, 0.0)	1.00	1024.0
6	7	FL4-H	CD8 APC	(4.0, 1.0)	1.00	1024.0
7	8	Time	Time (102.40 sec.)	(0.0, 0.0)	1.00	1024.0

Subsampling

FlowKit is optimized for performance (or attempts to be!). However, when dealing with high-dimensional flow cytometry data and/or data containing millions of events, it can be useful to subsample events to speed up processing. This is especially true when trying to plot events. All Sample plot methods, except for plot_histogram, use subsampled events by default (we’ll see these methods in the next section).

On instantiation, the number of subsampled events can be specified (default is 10,000). The Sample class also provides a subsample_events method to change the number of subsampled events or the random seed used to generate the subsample. The subsample is drawn randomly, but in a reproducible way: providing the same random seed as an argument (default seed is 1) guarantees the same subsample indices when re-running an analysis. We can retrieve the subsampled indices using the subsample_indices attribute.

Note that subsampling does not delete any events; the subsampled indices are simply stored and used as a subset of events. Any Sample class method that plots or retrieves events will have a subsample argument that takes a Boolean value specifying whether to use the subsampled events or all events. Any method that processes events (compensating or transforming) will always use all the events.

A final note on subsampling. If you use the filter_negative_scatter() method or set the flagged_indices attribute (a list of indices), then any event with those indices will be omitted from the subsampled indices. You can think of these as a form of “pre-gating” that can be useful for getting cleaner plots of your sample’s data.
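
The ideas in this section (a seeded, reproducible draw that skips excluded events and returns indices in random order) can be sketched with the standard library. The function below is hypothetical and will not reproduce the indices FlowKit generates; it only illustrates the behavior described above.

```python
import random


def draw_subsample_indices(event_count, subsample_count=10000, random_seed=1, excluded=()):
    """Illustrative only: draw a reproducible, shuffled subsample of event indices."""
    excluded = set(excluded)
    # Flagged or filtered events are never eligible for the subsample.
    eligible = [i for i in range(event_count) if i not in excluded]
    rng = random.Random(random_seed)
    # random.sample returns the picks in selection order, so the result is not sorted.
    return rng.sample(eligible, min(subsample_count, len(eligible)))


first = draw_subsample_indices(13367, excluded=[0, 1, 2])
second = draw_subsample_indices(13367, excluded=[0, 1, 2])

print(first == second)          # True: same seed, same indices
print(len(first))               # 10000
print({0, 1, 2} & set(first))   # set(): excluded events never appear
```

Because the picks are already in random order, a slice such as first[:1000] is itself a random subsample of the eligible events.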

Retrieving subsampled indices

Our Sample instance was already subsampled by default (at 10000 events) when it was created. Note in the output below that the subsampled indices are not only randomly selected (in a reproducible way via the random seed), but are also randomly shuffled. This makes it safer to take a subsample of the subsample if even fewer events are needed.

[103]:

sample.subsample_indices

[103]:

array([ 4136, 12180, 11048, ...,  9661, 10709, 10093], shape=(10000,))

Retrieving Events

Several methods are available in the Sample class for convenient retrieval of event data in a variety of forms.

Retrieve Events as NumPy Array

The Sample methods get_events and get_channel_events return event data as a NumPy array. Both methods have similar input arguments, including the already familiar source and subsample arguments for specifying the event class (‘raw’, ‘comp’, or ‘xform’) and whether to return the subsampled events or all events.

Note: These methods return the arrays directly, not a copy of the array. Be careful if you are planning to modify the returned event data, and make a copy of the array when appropriate.
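
The consequence of receiving the data itself rather than a copy is easy to demonstrate with plain Python lists. EventStore is a hypothetical container used as an analogy; the same aliasing applies to NumPy arrays, where calling the array’s copy() method is the usual way to obtain an independent copy.

```python
class EventStore:
    """Hypothetical container that hands out its internal data without copying."""

    def __init__(self, events):
        self._events = events

    def get_events(self):
        return self._events   # the same object, not a copy


store = EventStore([88.0, 19.1, 70.6])

aliased = store.get_events()
aliased[0] = -1.0                 # also changes the stored data
print(store.get_events()[0])      # -1.0

safe = list(store.get_events())   # explicit copy
safe[1] = -1.0                    # stored data is unaffected
print(store.get_events()[1])      # 19.1
```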

[104]:

help(sample.get_events)

Help on method get_events in module flowkit._models.sample:

get_events(source='xform', subsample=False, event_mask=None, col_order=None) method of flowkit._models.sample.Sample instance
    Returns a NumPy array of event data.

    Note: This method returns the array directly, not a copy of the array. Be careful if you
    are planning to modify returned event data, and make a copy of the array when appropriate.

    :param source: Controls which version of event data to return.Valid values are:
        'raw', 'comp', or 'xform'. For 'raw', events are returned uncompensated and
        non-transformed. For 'comp', events are returned compensated according to
        the stored compensation matrix. For 'xform', events are returned transformed
        according to the stored transformations and will include any compensation
        applied beforehand. Note: In all cases, events returned will be based on
        whether pre-processing was applied when loading the Sample.
    :param subsample: Whether to return all events or just the subsampled
        events. Default is False (all events)
    :param event_mask: Filter Sample events by a given Boolean array (events marked
        True will be returned). Can be combined with the subsample option.
    :param col_order: PnN label list for the channel columns and their order
    :return: NumPy array of event data

[105]:

sample.get_events(source='raw')

[105]:

array([[ 88.01089918,  27.25      ,   7.23394163, ...,   5.        ,
          5.18613419,   0.        ],
       [ 19.07356948,   5.375     ,  36.51741273, ...,   0.        ,
          4.29351021,   0.        ],
       [ 70.57220708,  26.        ,   2.48045441, ...,   0.        ,
          8.58210354,   0.        ],
       ...,
       [ 62.1253406 ,  27.625     ,  11.75743266, ...,   0.        ,
          1.77827941, 174.        ],
       [ 36.23978202,  64.5       ,   5.42469094, ...,   0.        ,
          4.95806824, 174.        ],
       [ 66.48501362,   8.75      ,   1.43301257, ...,   0.        ,
          6.0429639 , 174.        ]], shape=(13367, 8))

[106]:

# Get a single channel's events using `get_channel_events`
sample.get_channel_events('FSC-H', source='raw')

[106]:

array([88.01089918, 19.07356948, 70.57220708, ..., 62.1253406 ,
       36.23978202, 66.48501362], shape=(13367,))

Retrieve Events as pandas DataFrame

The Sample method as_dataframe returns a pandas DataFrame of the Sample event data. This method also supports source, subsample, and event_mask arguments, but includes the extra arguments col_order and col_names for choosing the order of columns by PnN label and/or specifying new names for the columns in the returned DataFrame. There is also a col_multi_index option for controlling whether the columns are MultiIndex (default) or just a simple index using the PnN labels.

[107]:

df_multi = sample.as_dataframe(source='raw')
df_multi.head()

[107]:

pnn	FSC-H	SSC-H	CD4-FITC-H	FL2-H	FL3-H	FL2-A	FL4-H	Time
pns	FSC-Height	SSC-Height	CD4	CD8 B PE	CD3 PerCP		CD8 APC	Time (102.40 sec.)
0	88.010899	27.250	7.233942	34.598917	11.039992	5.0	5.186134	0.0
1	19.073569	5.375	36.517413	1.000000	170.007762	0.0	4.293510	0.0
2	70.572207	26.000	2.480454	12.863969	3.023213	0.0	8.582104	0.0
3	98.910082	31.750	5.473703	14.989296	3.751619	0.0	6.731704	0.0
4	29.972752	34.750	2.641648	2.665516	2.641648	0.0	6.097562	0.0

[108]:

# Turn off the multi-index & just use PnN labels
df_simple = sample.as_dataframe(source='raw', col_multi_index=False)
df_simple.head()

[108]:

	FSC-H	SSC-H	CD4-FITC-H	FL2-H	FL3-H	FL2-A	FL4-H
0	88.010899	27.250	7.233942	34.598917	11.039992	5.0	5.186134
1	19.073569	5.375	36.517413	1.000000	170.007762	0.0	4.293510
2	70.572207	26.000	2.480454	12.863969	3.023213	0.0	8.582104
3	98.910082	31.750	5.473703	14.989296	3.751619	0.0	6.731704
4	29.972752	34.750	2.641648	2.665516	2.641648	0.0	6.097562

[109]:

# Retrieve just the columns you want (and specify their order) using `col_order`.
# And rename those columns as you'd like using the `col_names` option.
df_simple_sub_renamed = sample.as_dataframe(
    source='raw',
    col_multi_index=False,
    col_order=['CD4-FITC-H', 'FL2-H', 'FL3-H', 'FL4-H'],
    col_names=['CD4', 'CD8-B', 'CD3', 'CD8']
)
df_simple_sub_renamed.head()

[109]:

	CD4	CD8-B	CD3	CD8
0	7.233942	34.598917	11.039992	5.186134
1	36.517413	1.000000	170.007762	4.293510
2	2.480454	12.863969	3.023213	8.582104
3	5.473703	14.989296	3.751619	6.731704
4	2.641648	2.665516	2.641648	6.097562

Plotting Sample Events

Histogram
Channel Plot (plot channel events in order)
Contour Plot
Interactive Scatter Plot
Interactive Scatter Plot Matrix

All plotting methods return a Bokeh figure instance. This is done so the caller can modify and/or display the plot as required.

Histogram

[110]:

help(sample.plot_histogram)

Help on method plot_histogram in module flowkit._models.sample:

plot_histogram(channel_label_or_number, source='xform', subsample=False, bins=None, data_min=None, data_max=None, x_range=None) method of flowkit._models.sample.Sample instance
    Returns a histogram plot of the specified channel events

    :param channel_label_or_number:  A channel's PnN label or number to use
        for plotting the histogram
    :param source: 'raw', 'comp', 'xform' for whether the raw, compensated
        or transformed events are used for plotting
    :param subsample: Whether to use all events for plotting or just the
        subsampled events. Default is False (all events).
    :param bins: Number of bins to use for the histogram or a string compatible
        with the NumPy histogram function. If None, the number of bins is
        determined by the square root rule.
    :param data_min: filter event data, removing events below specified value
    :param data_max: filter event data, removing events above specified value
    :param x_range: Tuple of lower & upper bounds of x-axis. Used for modifying
        plot view, doesn't filter event data.
    :return: Bokeh figure of the histogram plot.
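
As a quick illustration of the square root rule mentioned for the bins argument, the number of bins is roughly the square root of the number of events. The helper below is hypothetical, and rounding up is an assumption; the exact rounding used by plot_histogram may differ.

```python
import math


def sqrt_rule_bins(event_count):
    """Illustrative only: number of histogram bins by the square root rule."""
    # Assumption: round up to a whole number of bins.
    return math.ceil(math.sqrt(event_count))


print(sqrt_rule_bins(13367))  # 116
```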

[111]:

p = sample.plot_histogram('FSC-H', source='raw')
show(p)

Change the number of bins:

[112]:

p = sample.plot_histogram('FSC-H', source='raw', bins=256)
show(p)

Change the display range:

[113]:

p = sample.plot_histogram('FSC-H', source='raw', bins=256, x_range=(10, 200))
show(p)

You can also change the data range that is used to compute the histogram, useful if there are extreme values that you would like to exclude. There are separate arguments for data_min and data_max. Let’s exclude data above 100:

[114]:

p = sample.plot_histogram('FSC-H', source='raw', bins=50, data_max=100)
show(p)
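
The difference between filtering the data (data_min and data_max) and merely changing the view (x_range) is that filtering changes which events are binned, and therefore the bin edges and counts. The pure-Python sketch below, with a hypothetical helper and made-up values, shows how one extreme value distorts equal-width bins until it is excluded. It is not how plot_histogram is implemented.

```python
def histogram_counts(values, bins, data_min=None, data_max=None):
    """Illustrative only: count values per equal-width bin after filtering."""
    # data_min / data_max remove events before any counting happens.
    kept = [v for v in values
            if (data_min is None or v >= data_min)
            and (data_max is None or v <= data_max)]
    low, high = min(kept), max(kept)
    width = (high - low) / bins or 1.0
    counts = [0] * bins
    for v in kept:
        # The maximum value belongs to the last bin.
        counts[min(int((v - low) / width), bins - 1)] += 1
    return counts


values = [10, 20, 30, 40, 50, 500]   # made-up data with one extreme value
print(histogram_counts(values, bins=4))                # [5, 0, 0, 1]
print(histogram_counts(values, bins=4, data_max=100))  # [1, 1, 1, 2]
```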

Plot Channel

The plot_channel method creates a 2-D histogram of the specified channel data with the x-axis as the event index. This is similar to plotting a channel vs Time, except the events are equally distributed along the x-axis. As the signature in the help output shows, this method uses subsampled events by default (subsample=True).

[115]:

help(sample.plot_channel)

Help on method plot_channel in module flowkit._models.sample:

plot_channel(channel_label_or_number, source='xform', subsample=True, color_density=True, bin_width=4, event_mask=None, highlight_mask=None, x_min=None, x_max=None, y_min=None, y_max=None, width=900, aspect_ratio=3) method of flowkit._models.sample.Sample instance
    Plot a 2-D histogram of the specified channel data with the x-axis as the event index.
    This is similar to plotting a channel vs Time, except the events are equally
    distributed along the x-axis.

    :param channel_label_or_number: A channel's PnN label or number
    :param source: 'raw', 'comp', 'xform' for whether the raw, compensated
        or transformed events are used for plotting
    :param subsample: Whether to use all events for plotting or just the
        subsampled events. Default is True (subsampled events). Plotting
        subsampled events is much faster.
    :param color_density: Whether to color the events by density, similar
        to a heat map. Default is True.
    :param bin_width: Bin size to use for the color density, in units of
        event point size. Larger values produce smoother gradients.
        Default is 4 for a 4x4 grid size.
    :param event_mask: Boolean array of events to plot. Takes precedence
        over highlight_mask (i.e. events marked False in event_mask will
        never be plotted).
    :param highlight_mask: Boolean array of event indices to highlight
        in color. Non-highlighted events will be light grey.
    :param x_min: Lower bound of x-axis. If None, channel's min value will
        be used with some padding to keep events off the edge of the plot.
    :param x_max: Upper bound of x-axis. If None, channel's max value will
        be used with some padding to keep events off the edge of the plot.
    :param y_min: Lower bound of y-axis. If None, channel's min value will
        be used with some padding to keep events off the edge of the plot.
    :param y_max: Upper bound of y-axis. If None, channel's max value will
        be used with some padding to keep events off the edge of the plot.
    :param width: Width of the plot. Default is 900. By default, the width
        to height ratio is 3:1 (default height of 300 pixels).
    :param aspect_ratio: The width to height ratio of the plot. Default is
        3. Set to 1 for a square plot.
    :return: A Bokeh Figure object containing the interactive channel plot.

[116]:

f = sample.plot_channel('FSC-H', source='raw')
show(f)
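
To see why the event index differs from the Time channel, consider a toy acquisition in which events arrive in bursts. The values below are made up and the sketch uses plain NumPy rather than FlowKit: plotted against time the events bunch together, while plotted against their index they are evenly spaced.

```python
import numpy as np

# Hypothetical acquisition times (seconds): two bursts separated by a gap
time = np.array([0.0, 0.1, 0.1, 0.2, 2.5, 2.6])

# The event index is simply the position of each event in acquisition order
event_index = np.arange(len(time))

print(np.diff(time))         # uneven gaps between consecutive events
print(np.diff(event_index))  # always 1: events are equally spaced
```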

Contour Plot

The plot_contour method uses the Kernel Density Estimate function from SciPy and is computationally intensive, so the plots can take some time to create.
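
A rough idea of why Kernel Density Estimation is expensive: every location where the density is evaluated needs a contribution from every event. The sketch below is a plain NumPy Gaussian KDE written for illustration only. It is not the SciPy routine that FlowKit calls, and the function name, bandwidth, and grid size are assumptions.

```python
import numpy as np

def gaussian_kde_on_grid(points, grid, bandwidth):
    """Evaluate a 2-D Gaussian KDE of `points` (n, 2) at `grid` (m, 2)."""
    # Pairwise differences: one kernel evaluation per (grid point, event) pair
    diff = grid[:, None, :] - points[None, :, :]          # shape (m, n, 2)
    sq_dist = np.sum(diff ** 2, axis=2)                   # shape (m, n)
    kernel = np.exp(-0.5 * sq_dist / bandwidth ** 2)
    # Normalize so the estimated density integrates to 1
    return kernel.sum(axis=1) / (2 * np.pi * bandwidth ** 2 * len(points))

rng = np.random.default_rng(0)
points = rng.normal(size=(500, 2))                        # made-up events

axis = np.linspace(-6, 6, 40)
xx, yy = np.meshgrid(axis, axis)
grid = np.column_stack([xx.ravel(), yy.ravel()])          # 1600 grid points

density = gaussian_kde_on_grid(points, grid, bandwidth=0.5)
print(len(grid) * len(points))  # 800000 kernel evaluations for 500 events
```

The work grows in proportion to the number of events, which is consistent with the advice to keep the default of plotting only the subsampled events.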

[117]:

help(sample.plot_contour)

Help on method plot_contour in module flowkit._models.sample:

plot_contour(x_label_or_number, y_label_or_number, source='xform', subsample=True, plot_events=False, fill=False, x_min=None, x_max=None, y_min=None, y_max=None) method of flowkit._models.sample.Sample instance
Returns a contour plot of the specified channel events, available
as raw, compensated, or transformed data.

:param x_label_or_number: A channel's PnN label or number for x-axis
data
:param y_label_or_number: A channel's PnN label or number for y-axis
data
:param source: 'raw', 'comp', 'xform' for whether the raw, compensated
or transformed events are used for plotting
:param subsample: Whether to use all events for plotting or just the
subsampled events. Default is True (subsampled events). Running
with all events is not recommended, as the Kernel Density
Estimation is computationally demanding.
:param plot_events: Whether to display the event data points in
addition to the contours. Default is False.
:param x_min: Lower bound of x-axis. If None, channel's min value will
be used with some padding to keep events off the edge of the plot.
:param x_max: Upper bound of x-axis. If None, channel's max value will
be used with some padding to keep events off the edge of the plot.
:param y_min: Lower bound of y-axis. If None, channel's min value will
be used with some padding to keep events off the edge of the plot.
:param y_max: Upper bound of y-axis. If None, channel's max value will
be used with some padding to keep events off the edge of the plot.
:param fill: Whether to fill in color between contour lines. Default
is False.
:return: A Bokeh figure of the contour plot

[118]:

# by default, plot_contour uses subsampled events for performance
p = sample.plot_contour('FSC-H', 'SSC-H', source='raw', fill=False, plot_events=False)

[119]:

show(p)

To specify the axes ranges:

[120]:

x_min = y_min = 0
x_max = y_max = 250

p = sample.plot_contour(
    'FSC-H', 'SSC-H', source='raw', x_min=x_min, x_max=x_max, y_min=y_min, y_max=y_max
)
show(p)

Fill contours:

[121]:

p = sample.plot_contour('FSC-H', 'SSC-H', fill=True, source='raw')
show(p)

Adding events:

[122]:

p = sample.plot_contour('FSC-H', 'SSC-H', source='raw', plot_events=True)
show(p)

Scatter Plot

[123]:

p = sample.plot_scatter(
    'FSC-H', 'SSC-H',
    source='raw', y_min=0., y_max=130, x_min=0., x_max=280, color_density=True
)

[124]:

show(p)

Change the bin width to control the color density. The bin width is in units of the event point size and the default is 4 for a 4x4 grid size. Larger values will create a smoother color gradient but will lose detail. Let’s set the bin width to 8 and see how it compares.

[125]:

p = sample.plot_scatter(
    'FSC-H', 'SSC-H',
    source='raw', y_min=0., y_max=130, x_min=0., x_max=280, color_density=True, bin_width=8
)

[126]:

show(p)
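
One way to picture the effect of the bin width is to count events in a 2-D grid and color each event by the count of its bin. The sketch below does this with NumPy. It is an illustration under stated assumptions, not FlowKit's code: it takes a number of bins per axis (n_bins, a hypothetical parameter) rather than a width in units of point size, so fewer bins here plays the role of a larger bin_width.

```python
import numpy as np

def density_per_event(x, y, n_bins):
    """Return, for each event, the number of events sharing its 2-D bin."""
    counts, x_edges, y_edges = np.histogram2d(x, y, bins=n_bins)
    # digitize returns 1-based bin numbers; clip so the maximum value
    # lands in the last bin, matching how histogram2d counts it
    x_idx = np.clip(np.digitize(x, x_edges) - 1, 0, n_bins - 1)
    y_idx = np.clip(np.digitize(y, y_edges) - 1, 0, n_bins - 1)
    return counts[x_idx, y_idx]

rng = np.random.default_rng(0)
x = rng.normal(size=2000)  # made-up channel values
y = rng.normal(size=2000)

fine = density_per_event(x, y, n_bins=40)    # small bins: more detail, noisier
coarse = density_per_event(x, y, n_bins=10)  # large bins: smoother, less detail

print(fine.mean(), coarse.mean())
```

With larger bins, neighboring events are more likely to share a bin and therefore a color, which gives the smoother gradient described above at the cost of detail.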

Or, turn off the color density completely:

[127]:

p = sample.plot_scatter('FSC-H', 'SSC-H', source='raw', color_density=False)
show(p)

Apply a transform and plot fluorescent channels (raw and transformed)

Note: The transforms module will be covered in more detail in part 2 of the tutorial notebook series.

[128]:

xform = fk.transforms.LogicleTransform(param_t=1024, param_w=0.5, param_m=4.5, param_a=0)
sample.apply_transform(xform)

[129]:

# source is 'raw' so not too useful for visualization
# Note: We renamed the 'FL1-H' channel above to 'CD4-FITC-H'
p = sample.plot_scatter('CD4-FITC-H', 'FL2-H', source='raw')
show(p)

[130]:

# change source to 'xform' to visualize the transformed data
p = sample.plot_scatter('CD4-FITC-H', 'FL2-H', source='xform')
show(p)

Highlight Specific Events

You can also use a Boolean array to highlight certain events, so that the color density is applied only to them. The density calculation is still based on all the events. Let’s highlight all events with CD3 values above 0.65.

[131]:

# Look up the PnN label for CD3 (this file has slightly confusing channel labels)
sample.channels

[131]:

	channel_number	pnn	pns	pne	png	pnr
0	1	FSC-H	FSC-Height	(0.0, 0.0)	3.67	1024.0
1	2	SSC-H	SSC-Height	(0.0, 0.0)	8.00	1024.0
2	3	CD4-FITC-H	CD4	(4.0, 1.0)	1.00	1024.0
3	4	FL2-H	CD8 B PE	(4.0, 1.0)	1.00	1024.0
4	5	FL3-H	CD3 PerCP	(4.0, 1.0)	1.00	1024.0
5	6	FL2-A		(0.0, 0.0)	1.00	1024.0
6	7	FL4-H	CD8 APC	(4.0, 1.0)	1.00	1024.0
7	8	Time	Time (102.40 sec.)	(0.0, 0.0)	1.00	1024.0

[132]:

# cd3 channel has the label 'FL3-H'
cd3_xform_events = sample.get_channel_events('FL3-H', source='xform')
is_high_cd3 = cd3_xform_events > 0.65

[133]:

p = sample.plot_scatter('FL3-H', 'FSC-H', source='xform', highlight_mask=is_high_cd3)
show(p)
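
The highlight mask is an ordinary NumPy Boolean array with one entry per event. A minimal sketch with made-up transformed values (not taken from the sample above) shows what the comparison produces:

```python
import numpy as np

# Hypothetical transformed CD3 values for five events
cd3_values = np.array([0.10, 0.70, 0.66, 0.30, 0.90])

# Element-wise comparison yields one True/False per event
is_high = cd3_values > 0.65

print(is_high)        # [False  True  True False  True]
print(is_high.sum())  # 3 events would be highlighted
```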

But let’s show these events on a plot of CD4 vs CD8.

[134]:

p = sample.plot_scatter('CD4-FITC-H', 'FL2-H', source='xform', highlight_mask=is_high_cd3)
show(p)

Filter Specific Events

Or, we could omit the non-highlighted events completely by passing the same Boolean array as the event_mask:

[135]:

p = sample.plot_scatter('CD4-FITC-H', 'FL2-H', source='xform', event_mask=is_high_cd3)
show(p)

Or combine the options to hide some events and highlight others. Here we’ll highlight the events with CD8 > 0.5 while plotting only the high CD3 events from above (all other events are hidden).

[136]:

# cd8 channel 'B' has the label 'FL2-H'
cd8_xform_events = sample.get_channel_events('FL2-H', source='xform')
is_high_cd8 = cd8_xform_events > 0.5

p = sample.plot_scatter('CD4-FITC-H', 'FL2-H', source='xform', event_mask=is_high_cd3, highlight_mask=is_high_cd8)
show(p)
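
The help text for the plotting methods states that event_mask takes precedence over highlight_mask. In terms of Boolean arrays, that rule can be sketched as follows. The arrays are toy values and the code is plain NumPy, not FlowKit's internal logic.

```python
import numpy as np

# Toy masks for five events
event_mask = np.array([True, True, False, False, True])
highlight_mask = np.array([True, False, True, False, False])

# Events marked False in event_mask are never plotted
plotted = event_mask

# Only plotted events can be highlighted; the rest of the plotted events are grey
highlighted = event_mask & highlight_mask
grey = event_mask & ~highlight_mask

print(highlighted)  # [ True False False False False]
print(grey)         # [False  True False False  True]
```

Note that the third event is marked True in highlight_mask but is still not drawn, because event_mask excludes it.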

Scatterplot Matrix

Plot multiple scatterplots using the plot_scatter_matrix method. The diagonals will plot a histogram of the channel.

[137]:

help(sample.plot_scatter_matrix)

Help on method plot_scatter_matrix in module flowkit._models.sample:

plot_scatter_matrix(channel_labels_or_numbers=None, source='xform', subsample=True, event_mask=None, highlight_mask=None, color_density=False, plot_height=256, plot_width=256) method of flowkit._models.sample.Sample instance
Returns an interactive scatter plot matrix for all channel combinations
except for the Time channel.

:param channel_labels_or_numbers: List of channel PnN labels or channel
numbers to use for the scatter plot matrix. If None, then all
channels will be plotted (except Time).
:param source: 'raw', 'comp', 'xform' for whether the raw, compensated
or transformed events are used for plotting
:param subsample: Whether to use all events for plotting or just the
subsampled events. Default is True (subsampled events). Plotting
subsampled events is much faster.
:param event_mask: Boolean array of events to plot. Takes precedence
over highlight_mask (i.e. events marked False in event_mask will
never be plotted).
:param highlight_mask: Boolean array of event indices to highlight
in color. Non-highlighted events will be light grey.
:param color_density: Whether to color the events by density, similar
to a heat map. Default is False.
:param plot_height: Height of plot in pixels (screen units)
:param plot_width: Width of plot in pixels (screen units)
:return: A Bokeh Figure object containing the interactive scatter plot
matrix.

[138]:

# For the scatter matrix, subsampling is usually a good idea since there are so many plots
spm = sample.plot_scatter_matrix(
    source='xform',
    channel_labels_or_numbers=['FSC-H', 'SSC-H', 'FL3-H', 'FL4-H'],
    color_density=True
)
show(spm)
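
The number of panels grows quickly with the number of channels, which is why subsampling helps here. The standard-library sketch below only counts the distinct channel pairs and the diagonal histograms for the four channels used above; how FlowKit arranges the panels in the grid is not something this sketch claims to reproduce.

```python
from itertools import combinations

channels = ['FSC-H', 'SSC-H', 'FL3-H', 'FL4-H']

# Every unordered pair of different channels gives one scatter plot
pairs = list(combinations(channels, 2))

print(len(channels))  # 4 histograms on the diagonal
print(len(pairs))     # 6 distinct channel pairs
```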

Exporting Events

The export method exports the event data to either a new FCS file or a CSV file, with the format determined by filename extension (either ‘.fcs’ or ‘.csv’). Extra options are available for excluding certain events (negative scatter, flagged, subsample) from the exported file.

[139]:

help(sample.export)

Help on method export in module flowkit._models.sample:

export(filename, source='xform', exclude_neg_scatter=False, exclude_flagged=False, exclude_normal=False, subsample=False, include_metadata=False, directory=None) method of flowkit._models.sample.Sample instance
Export Sample event data to either a new FCS file or a CSV file. Format determined by filename extension.

:param filename: Text string to use for the exported file name. File type is determined by
the filename extension (supported types are .fcs & .csv).
:param source: 'orig', 'raw', 'comp', 'xform' for whether the original (no gain applied),
raw (orig + gain), compensated (raw + comp), or transformed (comp + xform) events are
used for exporting
:param exclude_neg_scatter: Whether to exclude negative scatter events. Default is False.
:param exclude_flagged: Whether to exclude flagged events. Default is False.
:param exclude_normal: Whether to exclude "normal" events. This is useful for retrieving all
the "bad" events (neg scatter and/or flagged events). Default is False.
:param subsample: Whether to export all events or just the subsampled events.
Default is False (all events).
:param include_metadata: Whether to include all key/value pairs from the metadata attribute
in the output FCS file. Only valid for .fcs file extension. If False, only the minimum
amount of metadata will be included in the output FCS file. Default is False.
:param directory: Directory path where the exported file will be saved. If None, the file
will be saved in the current working directory.
:return: None
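
The rule that the file format follows from the filename extension can be sketched with the standard library. The helper below, export_format, is hypothetical and is not part of FlowKit. It only mirrors the documented behavior for the two supported extensions, and what FlowKit does with other extensions or different letter casing is not stated in the help text.

```python
from pathlib import Path

def export_format(filename):
    """Hypothetical helper: map a filename to the export format its extension implies."""
    extension = Path(filename).suffix
    if extension == '.fcs':
        return 'fcs'
    if extension == '.csv':
        return 'csv'
    raise ValueError(f"Unsupported extension: {extension!r}")

print(export_format('my_sample.fcs'))  # fcs
print(export_format('my_sample.csv'))  # csv
```

Going by the signature shown in the help text, a call such as sample.export('my_sample.csv', source='raw', subsample=True) would write the subsampled raw events to a CSV file in the current working directory.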

Extracting Only FCS Metadata

FlowKit provides many useful utility functions in addition to the classes. We saw one of these earlier in the load_samples() function. Another useful utility function related to FCS samples is the extract_fcs_metadata() function. It retrieves the FCS metadata as a dictionary without parsing the event data, which is significantly faster than loading the full FCS file, especially for files with many events.

[140]:

fk.extract_fcs_metadata(fcs_path)

[140]:

{'byteord': '4,3,2,1',
 'datatype': 'I',
 'nextdata': '0',
 'sys': 'Macintosh System Software 9.0.4',
 'creator': 'CELLQuestª 3.3',
 'tot': '13367',
 'mode': 'L',
 'par': '8',
 'p1n': 'FSC-H',
 'p1r': '1024',
 'p1b': '16',
 'p1e': '0,0',
 'p1g': '3.67',
 'p2n': 'SSC-H',
 'p2r': '1024',
 'p2b': '16',
 'p2e': '0,0',
 'p2g': '8',
 'p3n': 'FL1-H',
 'p3r': '1024',
 'p3b': '16',
 'p3e': '4,0',
 'p4n': 'FL2-H',
 'p4r': '1024',
 'p4b': '16',
 'p4e': '4,0',
 'p5n': 'FL3-H',
 'p5r': '1024',
 'p5b': '16',
 'p5e': '4,0',
 'p1s': 'FSC-Height',
 'p2s': 'SSC-Height',
 'p3s': 'CD4 FITC',
 'p4s': 'CD8 B PE',
 'p5s': 'CD3 PerCP',
 'p6n': 'FL2-A',
 'p6r': '1024',
 'p6b': '16',
 'p6e': '0,0',
 'timeticks': '100',
 'p7n': 'FL4-H',
 'p7r': '1024',
 'p7e': '4,0',
 'p7b': '16',
 'p7s': 'CD8 APC',
 'p8n': 'Time',
 'p8r': '1024',
 'p8e': '0,0',
 'p8b': '16',
 'p8s': 'Time (102.40 sec.)',
 'sample id': 'Default Patient ID',
 'src': 'Default',
 'case number': 'Default Case Number',
 'cyt': 'FACSCalibur',
 'cytnum': 'E3820',
 'btim': '16:31:33',
 'etim': '16:31:52',
 'bdacqlibversion': '3.1',
 'bdnpar': '7',
 'bdp1n': 'FSC-H',
 'bdp2n': 'SSC-H',
 'bdp3n': 'FL1-H',
 'bdp4n': 'FL2-H',
 'bdp5n': 'FL3-H',
 'bdp6n': 'FL2-A',
 'bdp7n': 'FL4-H',
 'bdword0': '24',
 'bdword1': '394',
 'bdword2': '492',
 'bdword3': '477',
 'bdword4': '566',
 'bdword5': '397',
 'bdword6': '397',
 'bdword7': '397',
 'bdword8': '398',
 'bdword9': '397',
 'bdword10': '300',
 'bdword11': '299',
 'bdword12': '551',
 'bdword13': '4',
 'bdword14': '397',
 'bdword15': '501',
 'bdword16': '481',
 'bdword17': '586',
 'bdword18': '574',
 'bdword19': '100',
 'bdword20': '100',
 'bdword21': '100',
 'bdword22': '100',
 'bdword23': '1',
 'bdword24': '1',
 'bdword25': '0',
 'bdword26': '0',
 'bdword27': '0',
 'bdword28': '136',
 'bdword29': '52',
 'bdword30': '52',
 'bdword31': '52',
 'bdword32': '52',
 'bdword33': '52',
 'bdword34': '12',
 'bdword35': '201',
 'bdword36': '6',
 'bdword37': '138',
 'bdword38': '280',
 'bdword39': '3',
 'bdword40': '3',
 'bdword41': '100',
 'bdword42': '100',
 'bdword43': '0',
 'bdword44': '1023',
 'bdword45': '1023',
 'bdword46': '1023',
 'bdword47': '53',
 'bdword48': '550',
 'bdword49': '56',
 'bdword50': '72',
 'bdword51': '52',
 'bdword52': '0',
 'bdword53': '0',
 'bdword54': '0',
 'bdword55': '0',
 'bdword56': '0',
 'bdword57': '0',
 'bdword58': '0',
 'bdword59': '0',
 'bdword60': '0',
 'bdword61': '0',
 'bdword62': '0',
 'bdword63': '0',
 'bdlasermode': '1',
 'calibfile': 'FALSE',
 'p7thresvol': '52',
 'fil': 'B07',
 'date': '23-Aug-02',
 'number well info keywords': '3',
 '&1sample': '200',
 '&2number of washes': '1',
 '&3mixing vol': '100',
 '&4number of mixes': '2',
 '&5data file prefix part #1\\\\&6data file prefix part #2\\\\&7data file prefix part #3\\\\&8acquisition doc.': 'LYMPH SUBSET ACQ',
 '&9instr. sett. file': 'E#7 Settings #1',
 '&10patient id': ' FJ#192659',
 '&11day': '35d',
 '&12sample id': 'T-cells',
 '&13analysis doc.': ''}
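
Because the result is a plain dictionary of strings, the channel information can be recovered with ordinary Python. The sketch below copies a few keys from the output above into a literal dictionary so that it runs on its own. The helper channel_labels is hypothetical, and it assumes the 'par' key gives the channel count and that a missing 'pNs' key means the channel has no PnS label.

```python
# Excerpt of the metadata shown above (values are strings, as returned)
metadata = {
    'par': '8', 'tot': '13367',
    'p1n': 'FSC-H', 'p1s': 'FSC-Height',
    'p2n': 'SSC-H', 'p2s': 'SSC-Height',
    'p3n': 'FL1-H', 'p3s': 'CD4 FITC',
    'p4n': 'FL2-H', 'p4s': 'CD8 B PE',
    'p5n': 'FL3-H', 'p5s': 'CD3 PerCP',
    'p6n': 'FL2-A',
    'p7n': 'FL4-H', 'p7s': 'CD8 APC',
    'p8n': 'Time', 'p8s': 'Time (102.40 sec.)',
}

def channel_labels(meta):
    """Hypothetical helper: list (number, PnN, PnS) for every channel."""
    n_channels = int(meta['par'])
    return [
        (i, meta[f'p{i}n'], meta.get(f'p{i}s', ''))
        for i in range(1, n_channels + 1)
    ]

labels = channel_labels(metadata)
print(int(metadata['tot']))  # 13367 events, read without loading event data
print(labels[4])             # (5, 'FL3-H', 'CD3 PerCP')
```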
