Handling of Tabular Data

This document explains the basics of working with tabular data, and shows how to add, remove data from a table.

Basic Usage

Populate table with data:

from tap import *

# create table with two columns, x and y both of float type
tab=Tab(['x', 'y'], 'ff')
for x in range(1000):
  tab.add_row([x, x**2])

# create a plot
plt=tab.plot('x', 'y', save='x-vs-y.png')

Iterating over table items:

# load table from file
tab=load(...)

# iterate over all rows
for row in tab.rows:
  # print complete row
  print row

for f in tab.foo:
  print f
# iterate over all rows of selected columns
for foo, bar in tab.zip('foo','bar'):
  print foo, bar

Creating Table Objects

Tables can either be initialized with information from memory, or populated with data from load, e.g. CSV files or a pickled dump of a previously constructed table. An empty table can be easily constructed as follows

tab=Tab()

If you want to add columns directly when creating the table, column names and column types can be specified as follows

tab=Tab(['nameX','nameY','nameZ'], 'sfb')

this will create three columns called nameX, nameY and nameZ of type string, float and bool, respectively. When the second argument is omitted, the columns will all have a string type. There will be no data in the table and thus, the table will not contain any rows.

If you want to add data to the table in addition, use the following:

tab=Tab(['nameX','nameY','nameZ'],
          'sfb',
          nameX=['a','b','c'],
          nameY=[0.1, 1.2, 3.414],
          nameZ=[True, False, False])

If values for one column are omitted, they will be filled with NA, but if values are specified, all values must be specified (i.e. same number of values per column).

tab = Tab(['name, age'], 'string,float')

Column Types

Table columns have a specific type, e.g. string, float etc. Each cell in a column must either be of that type or set to not available (None). As a result, a float column can’t contain string values. The following column types exist:

long name abbreviation
string s
float f
int i
bool b

When adding new data to the table, values are automatically coerced (forced) to the column type. When coercing fails, a ValueError is thrown.

Specifying Column Types

The column types can be specified when initialing a new table. For convenience, several different formats are supported, which allow to specify the column types as strings, or list using long, or abbreviated forms. The following 5 examples initialise an empty table with a string, float, int and bool column each.

# abbreviated, compact form
tab = Tab(['x', 'y', 'z', 'u'], 'sfib')

# abbreviated, separated by coma
tab = Tab(['x', 'y', 'z', 'u'], 's, f, i, b')

# extended separated by coma
tab = Tab(['x', 'y', 'z', 'u'], 'string, float, int, bool')


# list abbreviated
tab = Tab(['x', 'y', 'z', 'u'], ['s', 'f', 'i', 'b'])

# list extended
tab = Tab(['x', 'y', 'z', 'u'], ['string', 'float', 'int', 'bool'])

Guessing Column Types

For the lazy, the table supports guessing the column type from data when initialising a new table. The detection of column types tries to convert each value to a particular type, e.g. int. When the type conversion is not successful for any value, the column type is set to string. As a special case, when the data arrays are empty, the array types are set to string.

# initialises a table with an bool and int column
t = Tab(['x','y'], x='True False False'.split(), y='1 NA 3'.split())
print t.col_types # bool int

Adding and Removing Data from a Table

The following methods allow to add and remove data in a row and column-wise manner.

Tab.add_row(data, overwrite=None)

Add a row to the table.

data may either be a dictionary or a list-like object:

  • If data is a dictionary the keys in the dictionary must match the column names. Columns not found in the dict will be initialized to None. If the dict contains list-like objects, multiple rows will be added, if the number of items in all list-like objects is the same, otherwise a ValueError is raised.
  • If data is a list-like object, the row is initialized from the values in data. The number of items in data must match the number of columns in the table. A ValuerError is raised otherwise. The values are added in the order specified in the list, thus, the order of the data must match the columns.

If overwrite is not None and set to an existing column name, the specified column in the table is searched for the first occurrence of a value matching the value of the column with the same name in the dictionary. If a matching value is found, the row is overwritten with the dictionary. If no matching row is found, a new row is appended to the table.

Parameters:
  • data (dict or list-like object) – data to add
  • overwrite (str) – column name to overwrite existing row if value in column overwrite matches
Raises :

ValueError if list-like object is used and number of items does not match number of columns in table.

Raises :

ValueError if dict is used and multiple rows are added but the number of data items is different for different columns.

Example: add multiple data rows to a subset of columns using a dictionary

# create table with three float columns
tab = Tab(['x','y','z'], 'fff')

# add rows from dict
data = {'x': [1.2, 1.6], 'z': [1.6, 5.3]}
tab.add_row(data)
print tab

'''
will produce the table

====  ====  ====
x     y     z
====  ====  ====
1.20  NA    1.60
1.60  NA    5.30
====  ====  ====
'''

# overwrite the row with x=1.2 and add row with x=1.9
data = {'x': [1.2, 1.9], 'z': [7.9, 3.5]}
tab.add_row(data, overwrite='x')
print tab

'''
will produce the table

====  ====  ====
x     y     z
====  ====  ====
1.20  NA    7.90
1.60  NA    5.30
1.90  NA    3.50
====  ====  ====
'''
Tab.add_col(col_name, col_type, data=None)

Add a column to the right of the table.

Parameters:
  • col_name (str) – name of new column
  • col_type (str) – type of new column (long versions: int, float, bool, string or short versions: i, f, b, s)
  • data (scalar or iterable) – data to add to new column.

Example:

tab=Tab(['x'], 'f', x=range(5))
tab.add_col('even', 'bool', itertools.cycle([True, False]))
print tab

'''
will produce the table

====  ====
x     even
====  ====
  0   True
  1   False
  2   True
  3   False
  4   True
====  ====
'''

If data is a constant instead of an iterable object, it’s value will be written into each row:

tab=Tab(['x'], 'f', x=range(5))
tab.add_col('num', 'i', 1)
print tab

'''
will produce the table

====  ====
x     num
====  ====
  0   1
  1   1
  2   1
  3   1
  4   1
====  ====
'''

As a special case, if there are no previous rows, and data is not None, rows are added for every item in data.

Tab.remove_col(col)

Remove column with the given name from the table

Parameters:col (str) – name of column to remove
Tab.rename_col(old_name, new_name)

Rename column old_name to new_name.

Parameters:
  • old_name – Name of the old column
  • new_name – Name of the new column
Raises :

ValueError when old_name is not a valid column

Combining Tables

Tab.extend(tab, overwrite=None)

Append each row of tab to the current table. The data is appended based on the column names, thus the order of the table columns is not relevant, only the header names.

If there is a column in tab that is not present in the current table, it is added to the current table and filled with None for all the rows present in the current table.

If the type of any column in tab is not the same as in the current table a TypeError is raised.

If overwrite is not None and set to an existing column name, the specified column in the table is searched for the first occurrence of a value matching the value of the column with the same name in the dictionary. If a matching value is found, the row is overwritten with the dictionary. If no matching row is found, a new row is appended to the table.

tap.merge(table1, table2, by, only_matching=False)

Returns a new table containing the data from both tables. The rows are combined based on the common values in the column(s) by. The option ‘by’ can be a list of column names. When this is the case, merging is based on multiple columns. For example, the two tables below

x y
1 10
2 15
3 20
x u
1 100
3 200
4 400

when merged by column x, it produces the following output:

x y u
1 10 100
2 15 None
3 20 200
4 None 400

Sorting/Querying data

Tab.to_string(float_format='%.3f', int_format='%d', rows=None)

Convert the table into a string representation.

The output format can be modified for int and float type columns by specifying a formatting string for the parameters ‘float_format’ and ‘int_format’.

The option ‘rows’ specify the range of rows to be printed. The parameter must be a type that supports indexing (e.g. a list) containing the start and end row index, e.g. [start_row_idx, end_row_idx].

Parameters:
  • float_format (str) – formatting string for float columns
  • int_format (str) – formatting string for int columns
  • rows (iterable containing ints) – iterable containing start and end row index
Tab.sort(by, order='+')

Performs an in-place sort of the table, based on column by.

Parameters:
  • by (str) – column name by which to sort
  • order (str (i.e. +, -)) – ascending (-) or descending (+) order
Tab.empty(col_name=None, ignore_nan=True)

Checks if a table is empty.

If no column name is specified, the whole table is checked for being empty, whereas if a column name is specified, only this column is checked.

By default, all NAN (or None) values are ignored, and thus, a table containing only NAN values is considered as empty. By specifying the option ignore_nan=False, NAN values are counted as ‘normal’ values.

Tab.get_unique(col, ignore_nan=True)

Extract a list of all unique values from one column

Parameters:
  • col (str) – column name
  • ignore_nan (bool) – ignore all None values
Tab.has_col(col)

Checks if the column with a given name is present in the table.

Accessing data

The data in a table can be iterated row and column-wise.

Tab.zip(*args)

Allows to conveniently iterate over a selection of columns, e.g.

tab=Tab.load('...')
for col1, col2 in tab.zip('col1', 'col2'):
  print col1, col2

is a shortcut for

tab=Tab.load('...')
for col1, col2 in zip(tab['col1'], tab['col2']):
  print col1, col2
Tab.zip_non_null(*args)

Same as zip(), but only returns rows where none of the values is None.

Tab.filter(*args, **kwargs)

Returns a filtered table only containing rows matching all the predicates in kwargs and args For example,

tab.filter(town='Basel')

will return all the rows where the value of the column “town” is equal to “Basel”. Several predicates may be combined, i.e.

tab.filter(town='Basel', male=True)

will return the rows with “town” equal to “Basel” and “male” equal to true. args are unary callables returning true if the row should be included in the result and false if not.

Tab.search_col_names(regex)

Returns a list of column names matching the regex

Parameters:regex (str) – regex pattern
Returns:list of column names (str)