About tf.data.TextLineDataset() and common dataset functions
Official docstring:

class TextLineDataset(dataset_ops.Dataset):
  """A `Dataset` comprising lines from one or more text files."""

  def __init__(self, filenames, compression_type=None, buffer_size=None):
    """Creates a `TextLineDataset`.

    Args:
      filenames: A `tf.string` tensor containing one or more filenames.
      compression_type: (Optional.) A `tf.string` scalar evaluating to one of
        `""` (no compression), `"ZLIB"`, or `"GZIP"`.
      buffer_size: (Optional.) A `tf.int64` scalar denoting the number of bytes
        to buffer. A value of 0 results in the default buffering values chosen
        based on the compression type.
    """


Meaning:
Creates a TextLineDataset() object.
Args:
----filenames: one or more file names or paths, as strings
----compression_type: optional! One of "" (no compression), "ZLIB", or "GZIP"
----buffer_size: optional! The number of bytes to buffer

Example:

# Multiple file paths can be wrapped in a list.
input_files = ['./input_file11', './input_file22']
dataset = tf.data.TextLineDataset(input_files)

The tf.data.TextLineDataset API provides a way to read from data files. You only need to supply one or more filenames, and the API automatically builds a dataset whose elements are the lines of the text: each line is one element, a string-typed tensor.
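As a minimal sketch of iterating such a dataset in TF 2.x eager mode (the file path is the hypothetical one from the example above):

import tensorflow as tf

dataset = tf.data.TextLineDataset(['./input_file11'])
for line in dataset.take(3):
    # Each element is a scalar tf.string tensor holding one line of the file.
    print(line.numpy().decode('utf-8'))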

Tip: the dataset supports the other transformations in the tf.data package. The four methods most commonly used in deep learning are listed below.

1. map(): apply an operation to each element

For the tf.string_split used in the code, see this other post: https://blog.csdn.net/xinjieyuan/article/details/90698352

Format:
map(function)
# The function passed to map determines how the data in the dataset is processed. For example:
dataset.map(lambda string: tf.string_split([string]).values)
# Each dataset element is bound to the name `string`, and the string is split.
'''
Note:
tf.string_split(
    source,
    delimiter=' ',
    skip_empty=True
)
source: the object to operate on, usually a string or a list of strings;
delimiter: the separator, which defaults to a single space;
skip_empty: defaults to True; I have not needed it so far.
'''
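A runnable version of the same idea, assuming TF 2.x, where tf.string_split is available as tf.compat.v1.string_split (the sample strings are made up):

import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices(['hello world', 'tf data text'])
# Split each line on whitespace; .values flattens the SparseTensor of tokens.
dataset = dataset.map(lambda string: tf.compat.v1.string_split([string]).values)
for tokens in dataset:
    print(tokens.numpy())  # e.g. [b'hello' b'world']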
# Official documentation
map(
    map_func,
    num_parallel_calls=None
)
Maps map_func across the elements of this dataset.
This transformation applies map_func to each element of this dataset, and returns a new dataset containing the transformed elements, in the same order as they appeared in the input.
For example:
a = Dataset.range(1, 6)  # ==> [ 1, 2, 3, 4, 5 ]
a.map(lambda x: x + 1)  # ==> [ 2, 3, 4, 5, 6 ]
The input signature of map_func is determined by the structure of each element in this dataset. For example:
# NOTE: The following examples use `{ ... }` to represent the
# contents of a dataset.
# Each element is a `tf.Tensor` object.
a = { 1, 2, 3, 4, 5 }
# `map_func` takes a single argument of type `tf.Tensor` with the same
# shape and dtype.
result = a.map(lambda x: ...)
# Each element is a tuple containing two `tf.Tensor` objects.
b = { (1, "foo"), (2, "bar"), (3, "baz") }
# `map_func` takes two arguments of type `tf.Tensor`.
result = b.map(lambda x_int, y_str: ...)
# Each element is a dictionary mapping strings to `tf.Tensor` objects.
c = { {"a": 1, "b": "foo"}, {"a": 2, "b": "bar"}, {"a": 3, "b": "baz"} }
# `map_func` takes a single argument of type `dict` with the same keys as
# the elements.
result = c.map(lambda d: ...)
The value or values returned by map_func determine the structure of each element in the returned dataset.
# `map_func` returns a scalar `tf.Tensor` of type `tf.float32`.
def f(...):
    return tf.constant(37.0)
result = dataset.map(f)
result.output_classes == tf.Tensor
result.output_types == tf.float32
result.output_shapes == []  # scalar
# `map_func` returns two `tf.Tensor` objects.
def g(...):
    return tf.constant(37.0), tf.constant(["Foo", "Bar", "Baz"])
result = dataset.map(g)
result.output_classes == (tf.Tensor, tf.Tensor)
result.output_types == (tf.float32, tf.string)
result.output_shapes == ([], [3])
# Python primitives, lists, and NumPy arrays are implicitly converted to
# `tf.Tensor`.
def h(...):
    return 37.0, ["Foo", "Bar", "Baz"], np.array([1.0, 2.0], dtype=np.float64)
result = dataset.map(h)
result.output_classes == (tf.Tensor, tf.Tensor, tf.Tensor)
result.output_types == (tf.float32, tf.string, tf.float64)
result.output_shapes == ([], [3], [2])
# `map_func` can return nested structures.
def i(...):
    return {"a": 37.0, "b": [42, 16]}, "foo"
result = dataset.map(i)
result.output_classes == ({"a": tf.Tensor, "b": tf.Tensor}, tf.Tensor)
result.output_types == ({"a": tf.float32, "b": tf.int32}, tf.string)
result.output_shapes == ({"a": [], "b": [2]}, [])
map_func can accept as arguments and return any type of dataset element.
Note that irrespective of the context in which map_func is defined (eager vs. graph), tf.data traces the function and executes it as a graph. To use Python code inside of the function you have two options:
1) Rely on AutoGraph to convert Python code into an equivalent graph computation. The downside of this approach is that AutoGraph can convert some but not all Python code.
2) Use tf.py_function, which allows you to write arbitrary Python code but will generally result in worse performance than 1). For example:
d = tf.data.Dataset.from_tensor_slices(['hello', 'world'])
# transform a string tensor to upper case string using a Python function
def upper_case_fn(t: tf.Tensor) -> str:
    return t.numpy().decode('utf-8').upper()
d.map(lambda x: tf.py_function(func=upper_case_fn,
                               inp=[x], Tout=tf.string))  # ==> [ "HELLO", "WORLD" ]
Args:
map_func: A function mapping a dataset element to another dataset element.
num_parallel_calls: (Optional.) A tf.int32 scalar tf.Tensor, representing the number of elements to process asynchronously in parallel. If not specified, elements will be processed sequentially. If the value tf.data.experimental.AUTOTUNE is used, then the number of parallel calls is set dynamically based on available CPU.
Returns:
Dataset: A Dataset.
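A common usage pattern for num_parallel_calls is to let tf.data pick the parallelism automatically via AUTOTUNE, as the docs above mention (parse_line here is a hypothetical per-element function):

import tensorflow as tf

def parse_line(line):
    # Hypothetical per-line preprocessing; replace with your own logic.
    return tf.strings.strip(line)

dataset = tf.data.TextLineDataset(['./input_file11'])
dataset = dataset.map(parse_line,
                      num_parallel_calls=tf.data.experimental.AUTOTUNE)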

 

2. shuffle()

Shuffles the order of the elements; in effect, a random reordering. A minimal sketch follows.
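A minimal sketch of a typical shuffle() call (the buffer_size value is an arbitrary choice for illustration):

import tensorflow as tf

dataset = tf.data.Dataset.range(10)
# buffer_size controls how many elements the shuffle buffer holds;
# a buffer at least as large as the dataset gives a full uniform shuffle.
dataset = dataset.shuffle(buffer_size=10)
print([int(x) for x in dataset])  # e.g. [3, 0, 7, ...] in random order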

3. zip()

Combines different datasets together.

# Create two different datasets
dataset1 = tf.data.Dataset.from_tensor_slices(tf.random_uniform([4, 10]))
dataset2 = tf.data.Dataset.from_tensor_slices(
    (tf.random_uniform([4]),
     tf.random_uniform([4, 100], maxval=100, dtype=tf.int32)))
# Combine them
dataset3 = tf.data.Dataset.zip((dataset1, dataset2))
'''
Note:
When calling zip(), wrap the datasets in a tuple (inside parentheses),
otherwise you will get:
TypeError: zip() takes 1 positional argument but 2 were given
'''
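The code above uses the TF 1.x name tf.random_uniform; in TF 2.x the same sketch would be written with tf.random.uniform:

import tensorflow as tf

dataset1 = tf.data.Dataset.from_tensor_slices(tf.random.uniform([4, 10]))
dataset2 = tf.data.Dataset.from_tensor_slices(
    (tf.random.uniform([4]),
     tf.random.uniform([4, 100], maxval=100, dtype=tf.int32)))
dataset3 = tf.data.Dataset.zip((dataset1, dataset2))
print(dataset3.element_spec)  # shapes: (10,) and ((), (100,))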
# Official documentation
@staticmethod
zip(datasets)
Creates a Dataset by zipping together the given datasets.
This method has similar semantics to the built-in zip() function in Python, with the main difference being that the datasets argument can be an arbitrary nested structure of Dataset objects. For example:
a = Dataset.range(1, 4)  # ==> [ 1, 2, 3 ]
b = Dataset.range(4, 7)  # ==> [ 4, 5, 6 ]
c = Dataset.range(7, 13).batch(2)  # ==> [ [7, 8], [9, 10], [11, 12] ]
d = Dataset.range(13, 15)  # ==> [ 13, 14 ]
# The nested structure of the `datasets` argument determines the
# structure of elements in the resulting dataset.
Dataset.zip((a, b))  # ==> [ (1, 4), (2, 5), (3, 6) ]
Dataset.zip((b, a))  # ==> [ (4, 1), (5, 2), (6, 3) ]
# The `datasets` argument may contain an arbitrary number of
# datasets.
Dataset.zip((a, b, c))  # ==> [ (1, 4, [7, 8]),
                        #       (2, 5, [9, 10]),
                        #       (3, 6, [11, 12]) ]
# The number of elements in the resulting dataset is the same as
# the size of the smallest dataset in `datasets`.
Dataset.zip((a, d))  # ==> [ (1, 13), (2, 14) ]
Args:
datasets: A nested structure of datasets.
Returns:
Dataset: A Dataset.

4. filter()

Filters the dataset, keeping only the elements that satisfy a condition.

# Define the filter predicate
def FilterLength(src_len, trg_len):
    len_ok = tf.logical_and(
        tf.greater(src_len, 1),          # True if src_len > 1
        tf.less_equal(trg_len, MAX_LEN)  # True if trg_len <= MAX_LEN
    )
    return len_ok
# Apply the filter to drop elements that do not meet the condition
dataset = dataset.filter(FilterLength)
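To make the sketch above self-contained, here is a runnable variant in which MAX_LEN and the (src_len, trg_len) pairs are made-up values:

import tensorflow as tf

MAX_LEN = 50  # hypothetical maximum target length

dataset = tf.data.Dataset.from_tensor_slices(
    ([1, 3, 60, 7], [10, 80, 20, 30]))  # (src_len, trg_len) pairs
dataset = dataset.filter(
    lambda src_len, trg_len: tf.logical_and(
        tf.greater(src_len, 1), tf.less_equal(trg_len, MAX_LEN)))
for src_len, trg_len in dataset:
    print(int(src_len), int(trg_len))  # keeps (60, 20) and (7, 30)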

Official documentation:

filter(predicate)
Filters this dataset according to predicate.
d = tf.data.Dataset.from_tensor_slices([1, 2, 3])
d = d.filter(lambda x: x < 3)  # ==> [1, 2]
# `tf.math.equal(x, y)` is required for equality comparison
def filter_fn(x):
    return tf.math.equal(x, 1)
d = d.filter(filter_fn)  # ==> [1]
Args:
predicate: A function mapping a dataset element to a boolean.
Returns:
Dataset: The Dataset containing the elements of this dataset for which predicate is True.

 
