pyspark.pandas.read_spark_io

pyspark.pandas.read_spark_io(path=None, format=None, schema=None, index_col=None, **options)

Load a DataFrame from a Spark data source.

Parameters
path : string, optional

Path to the data source.

format : string, optional

Specifies the input data source format. Some common ones are:

  • ‘delta’

  • ‘parquet’

  • ‘orc’

  • ‘json’

  • ‘csv’

schema : string or StructType, optional

Input schema. If None, Spark tries to infer the schema automatically. The schema can either be a Spark StructType or a DDL-formatted string like col0 INT, col1 DOUBLE (a StructType example is shown below).

index_col : str or list of str, optional, default: None

Index column(s) of the table in Spark. A list of names is read back as a multi-level index (see the last example below).

options : dict

All other options passed directly into Spark’s data source.

See also

DataFrame.read_table
DataFrame.read_delta
DataFrame.read_parquet

Examples

>>> ps.range(1).spark.to_spark_io('%s/read_spark_io/data.parquet' % path)
>>> ps.read_spark_io(
...     '%s/read_spark_io/data.parquet' % path, format='parquet', schema='id long')
   id
0   0
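
As noted for the schema parameter above, a Spark StructType can be passed instead of a DDL string. A minimal sketch, reusing the parquet file written above:

>>> from pyspark.sql.types import LongType, StructField, StructType
>>> ps.read_spark_io(
...     '%s/read_spark_io/data.parquet' % path, format='parquet',
...     schema=StructType([StructField('id', LongType(), True)]))
   id
0   0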
>>> ps.range(10, 15, num_partitions=1).spark.to_spark_io(
...     '%s/read_spark_io/data.json' % path, format='json', lineSep='__')
>>> ps.read_spark_io(
...     '%s/read_spark_io/data.json' % path, format='json', schema='id long', lineSep='__')
   id
0  10
1  11
2  12
3  13
4  14
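
The same pattern works for the other formats listed above. A minimal sketch for csv (the data.csv path is illustrative); the header option is forwarded to Spark's csv writer and reader through **options:

>>> ps.range(3, num_partitions=1).spark.to_spark_io(
...     '%s/read_spark_io/data.csv' % path, format='csv', header=True)
>>> ps.read_spark_io(
...     '%s/read_spark_io/data.csv' % path, format='csv',
...     schema='id long', header=True)
   id
0   0
1   1
2   2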

You can preserve the index in the round trip, as shown below.

>>> ps.range(10, 15, num_partitions=1).spark.to_spark_io(
...     '%s/read_spark_io/data.orc' % path, format='orc', index_col="index")
>>> ps.read_spark_io(
...     path=r'%s/read_spark_io/data.orc' % path, format="orc", index_col="index")
       id
index
0      10
1      11
2      12
3      13
4      14
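
index_col also accepts a list of column names, which is read back as a multi-level index. A minimal sketch (the data_multi.parquet file name is illustrative; sort_index makes the row order deterministic):

>>> ps.DataFrame({'i': [0, 1], 'j': [10, 11], 'v': [1.0, 2.0]}).spark.to_spark_io(
...     '%s/read_spark_io/data_multi.parquet' % path, format='parquet')
>>> ps.read_spark_io(
...     '%s/read_spark_io/data_multi.parquet' % path, format='parquet',
...     index_col=['i', 'j']).sort_index()
        v
i j
0 10  1.0
1 11  2.0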