I Am Trying to Read a CSV File in Python but Am Getting an Error

CSV (comma-separated value) files are a common file format for transferring and storing data. The ability to read, manipulate, and write data to and from CSV files using Python is a key skill to master for any data scientist or business analyst. In this post, we'll go over what CSV files are, how to read CSV files into Pandas DataFrames, and how to write DataFrames back to CSV files post analysis.

Pandas is the most popular data manipulation package in Python, and DataFrames are the Pandas data type for storing tabular 2D data.

  1. Load CSV files to Python Pandas
  2. 1. File Extensions and File Types
  3. 2. Data Representation in CSV files
    • Other Delimiters / Separators – TSV files
    • Delimiters in Text Fields – Quotechar
  4. 3. Python – Paths, Folders, Files
    • Finding your Python Path
    • File Loading: Absolute and Relative Paths
  5. 4. Pandas CSV File Loading Errors
  6. Advanced Read CSV Files
    • Specifying Data Types
    • Skipping and Picking Rows and Columns From File
    • Custom Missing Value Symbols
  7. CSV Format Advantages and Disadvantages
  8. Additional Reading

Load CSV files to Python Pandas

The basic process of loading data from a CSV file into a Pandas DataFrame (with all going well) is achieved using the "read_csv" function in Pandas:

    # Load the Pandas libraries with alias 'pd'
    import pandas as pd

    # Read data from file 'filename.csv'
    # (in the same directory that your Python process is based)
    # Control delimiters, rows, column names with read_csv (see later)
    data = pd.read_csv("filename.csv")

    # Preview the first 5 lines of the loaded data
    data.head()

While this code seems simple, an understanding of four fundamental concepts is required to fully grasp and debug the functioning of the data loading process if you run into issues:

  1. Understanding file extensions and file types – what do the letters CSV actually mean? What's the difference between a .csv file and a .txt file?
  2. Understanding how data is represented inside CSV files – if you open a CSV file, what does the data actually look like?
  3. Understanding the Python path and how to reference a file – what is the absolute and relative path to the file you are loading? What directory are you working in?
  4. CSV data formats and errors – common errors with the function.

Each of these topics is discussed below, and we finish this tutorial by looking at some more advanced CSV loading mechanisms and giving some broad advantages and disadvantages of the CSV format.

1. File Extensions and File Types

The first step to working with comma-separated-value (CSV) files is understanding the concept of file types and file extensions.

  1. Data is stored on your computer in individual "files", or containers, each with a different name.
  2. Each file contains data of different types – the internals of a Word document are quite different from the internals of an image.
  3. Computers decide how to read files using the "file extension", that is the code that follows the dot (".") in the filename.
  4. So, a filename is typically in the form "<random name>.<file extension>". Examples:
    • project1.DOCX – a Microsoft Word file called Project1.
    • shanes_file.TXT – a simple text file called shanes_file.
    • IMG_5673.JPG – an image file called IMG_5673.
    • Other well known file types and extensions include: XLSX – Excel, PDF – Portable Document Format, PNG – images, ZIP – compressed file format, GIF – animation, MPEG – video, MP3 – music, etc.
  5. A CSV file is a file with a ".csv" file extension, e.g. "data.csv", "super_information.csv". The "CSV" in this case lets the computer know that the data contained in the file is in "comma separated value" format, which we'll discuss below.
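The extension check described in the list above can be sketched in Python with the standard-library pathlib module. The filenames and the is_csv helper are hypothetical, chosen only for illustration; the extension tells you what a file claims to contain, not what is actually inside it.

```python
from pathlib import Path

# Hypothetical filenames, used only for illustration
filenames = ["project1.docx", "shanes_file.txt", "IMG_5673.JPG", "data.csv"]


def is_csv(filename):
    """Return True when the filename ends in a .csv extension (case-insensitive)."""
    return Path(filename).suffix.lower() == ".csv"


for name in filenames:
    # Path.suffix is everything from the last dot onwards, e.g. ".csv"
    print(name, "-> extension:", Path(name).suffix, "| CSV?", is_csv(name))
```

Lower-casing the suffix means that "DATA.CSV" and "data.csv" are treated the same way.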

File extensions are hidden by default on a lot of operating systems. The first step that any self-respecting engineer, software engineer, or data scientist will do on a new computer is to ensure that file extensions are shown in their Explorer (Windows) or Finder (Mac) windows.

Folder with file extensions showing. Before working with CSV files, ensure that you can see your file extensions in your operating system. Different file contents are denoted by the file extension, or letters after the dot, of the file name, e.g. TXT is text, DOCX is Microsoft Word, PNG are images, CSV is comma-separated value data.

To check if file extensions are showing in your system, create a new text document with Notepad (Windows) or TextEdit (Mac) and save it to a folder of your choice. If you can't see the ".txt" extension in your folder when you view it, you will have to change your settings.

  • In Microsoft Windows: Open Control Panel > Appearance and Personalization. Now, click on Folder Options or File Explorer Options, as it is now called > View tab. In this tab, under Advanced Settings, you will see the option Hide extensions for known file types. Uncheck this option and click on Apply and OK.
  • In Mac OS: Open Finder > In the menu, click Finder > Preferences, click Advanced, select the checkbox for "Show all filename extensions".

2. Data Representation in CSV files

A "CSV" file, that is, a file with a "csv" filetype, is a basic text file. Any text editor, such as Notepad on Windows or TextEdit on Mac, can open a CSV file and show the contents. Sublime Text is a wonderful and multi-functional text editor option for any platform.

CSV is a standard for storing tabular data in text format, where commas are used to separate the different columns, and newlines (carriage return / pressing enter) are used to separate rows. Typically, the first row in a CSV file contains the names of the columns for the data.

An example table data set and the corresponding CSV-format data are shown in the diagram below.

Pandas read csv function read_csv is used to process this comma-separated file into tabular format in the Python DataFrame. Here we look at the innards of a CSV file to examine how columns are specified.
Comma-separated value files, or CSV files, are simple text files where commas and newlines are used to define tabular data in a structured way.

Note that almost any tabular data can be stored in CSV format – the format is popular because of its simplicity and flexibility. You can create a text file in a text editor, save it with a .csv extension, and open that file in Excel or Google Sheets to see the table form.
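To make the text-to-table mapping concrete, here is a minimal sketch that builds CSV text in memory and loads it with read_csv. The column names and values are made up for illustration, and io.StringIO stands in for a file on disk.

```python
import io

import pandas as pd

# Made-up CSV content: first line is the header, each later line is one row
csv_text = "name,age,city\nAlice,34,Dublin\nBob,27,Cork\n"

# io.StringIO lets read_csv treat the string as if it were an open file
data = pd.read_csv(io.StringIO(csv_text))

print(data)
print(data.shape)  # (number of rows, number of columns)
```

Each comma starts a new column and each newline starts a new row, which is all the structure the format has.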

Other Delimiters / Separators – TSV files

The comma separation scheme is by far the most popular method of storing tabular data in text files.

However, the choice of the ',' comma character to delimit columns is arbitrary, and can be substituted where needed. Popular alternatives include tab ("\t") and semi-colon (";"). Tab-separated files are known as TSV (Tab-Separated Value) files.

When loading data with Pandas, the read_csv function is used for reading any delimited text file; the delimiter is changed using the sep parameter.
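As a small sketch of the sep parameter, the same made-up table is written below once with tabs and once with semicolons, and both are loaded with read_csv. The data is invented for illustration.

```python
import io

import pandas as pd

# The same made-up table, stored with two different delimiters
tsv_text = "name\tage\nAlice\t34\nBob\t27\n"
ssv_text = "name;age\nAlice;34\nBob;27\n"

# sep tells read_csv which character separates the columns
tsv_data = pd.read_csv(io.StringIO(tsv_text), sep="\t")
ssv_data = pd.read_csv(io.StringIO(ssv_text), sep=";")

# Both loads should produce identical DataFrames
print(tsv_data.equals(ssv_data))
```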

Delimiters in Text Fields – Quotechar

One complication in creating CSV files is if you have commas, semicolons, or tabs actually in one of the text fields that you want to store. In this case, it's important to use a "quote character" in the CSV file to create these fields.

The quote character can be specified in Pandas.read_csv using the quotechar argument. By default (as with many systems), it's set as the standard quotation marks ("). Any commas (or other delimiters as demonstrated below) that occur between two quote characters will be ignored as column separators.

In the example shown, a semicolon-delimited file, with quotation marks as a quotechar, is loaded into Pandas and shown in Excel. The use of the quotechar allows the "NickName" column to contain semicolons without being split into more columns.

Other than commas in CSV files, tab-separated and semicolon-separated data is popular also. Quote characters are used if the data in a column may contain the separating character. In this case, the 'NickName' column contains semicolon characters, and so this column is "quoted". Specify the separator and quote character in pandas.read_csv.
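A minimal sketch of the quoting behaviour described above, assuming a semicolon-delimited file with a NickName column. The rows are invented for illustration and are not the contents of the file in the figure.

```python
import io

import pandas as pd

# Invented semicolon-delimited data; the first NickName contains a semicolon,
# so the field is wrapped in quotation marks
text = (
    "Name;NickName;Age\n"
    'Alice;"Ally; The Great";34\n'
    'Bob;"Bobby";27\n'
)

# sep sets the column delimiter, quotechar marks fields that may contain it
data = pd.read_csv(io.StringIO(text), sep=";", quotechar='"')

print(data)
```

Without the quotes, the first data row would contain four fields instead of three.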

3. Python – Paths, Folders, Files

When you specify a filename to Pandas.read_csv, Python will look in your "current working directory". Your working directory is typically the directory that you started your Python process or Jupyter notebook from.

Pandas searches your 'current working directory' for the filename that you specify when opening or loading files. The FileNotFoundError can be due to a misspelled filename, or an incorrect working directory.

Finding your Python Path

Your Python path can be displayed using the built-in os module. The os module provides operating-system-dependent functionality to Python programs and scripts.

To find your current working directory, the function required is os.getcwd(). The os.listdir() function can be used to display all files in a directory, which is a good check to see if the CSV file you are loading is in the directory as expected.

    # Find out your current working directory
    import os
    print(os.getcwd())
    # Out: /Users/shane/Documents/blog

    # Display all of the files found in your current working directory
    print(os.listdir(os.getcwd()))
    # Out: ['test_delimted.ssv', 'CSV Blog.ipynb', 'test_data.csv']

In the example above, my current working directory is the '/Users/shane/Documents/blog' directory. Any files that are placed in this directory will be immediately available to the Python file open() function or the Pandas read_csv function.

Instead of moving the required data files to your working directory, you can also change your current working directory to the directory where the files reside using os.chdir().
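Here is a hedged sketch of os.chdir() in action. It uses a temporary directory and a made-up file name so that it can run anywhere; in practice you would pass the folder that actually holds your data.

```python
import os
import tempfile

original_dir = os.getcwd()

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create a small made-up CSV file so there is something to find
    with open(os.path.join(tmp_dir, "test_data.csv"), "w") as f:
        f.write("a,b\n1,2\n")

    try:
        # Move the Python process into the directory that holds the file
        os.chdir(tmp_dir)
        files_here = os.listdir(os.getcwd())
        found = "test_data.csv" in files_here
        print("Files in working directory:", files_here)
        print("CSV file visible by bare filename:", found)
    finally:
        # Always change back, so later code is not surprised by the new location
        os.chdir(original_dir)
```

Changing directory affects the whole Python process, which is why the sketch restores the original directory afterwards.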

File Loading: Absolute and Relative Paths

When specifying file names to the read_csv function, you can supply either absolute or relative file paths.

  • A relative path is the path to the file if you start from your current working directory. In relative paths, typically the file will be in a subdirectory of the working directory and the path will not start with a drive specifier, e.g. (data/test_file.csv). The characters '..' are used to move to a parent directory in a relative path.
  • An absolute path is the complete path from the base of your file system to the file that you want to load, e.g. c:/Documents/Shane/data/test_file.csv. Absolute paths will start with a drive specifier (c:/ or d:/ in Windows, or '/' in Mac or Linux).

It's recommended and preferred to use relative paths where possible in applications, because absolute paths are unlikely to work on different computers due to different directory structures.

absolute vs relative file paths
Loading the same file with Pandas read_csv using relative and absolute paths. Relative paths are directions to the file starting at your current working directory, whereas absolute paths always start at the base of your file system.
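The relationship between the two kinds of path can be sketched with pathlib. The data/test_file.csv path is the illustrative one from the list above and does not need to exist for this to run.

```python
from pathlib import Path

# A relative path: interpreted starting from the current working directory
relative_path = Path("data") / "test_file.csv"

# The equivalent absolute path: working directory + relative path
absolute_path = Path.cwd() / relative_path

# '..' steps up to the parent of the working directory
parent_relative_path = Path("..") / "test_file.csv"

print(relative_path, "| absolute?", relative_path.is_absolute())
print(absolute_path, "| absolute?", absolute_path.is_absolute())
print(parent_relative_path, "| absolute?", parent_relative_path.is_absolute())
```

Either form can be passed to read_csv; the absolute form simply has the working directory already baked in.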

4. Pandas CSV File Loading Errors

The most common errors you'll get while loading data from CSV files into Pandas will be:

  1. FileNotFoundError: File b'filename.csv' does not exist
    A File Not Found error is typically an issue with path setup, current directory, or file name confusion (file extension can play a part here!)
  2. UnicodeDecodeError: 'utf-8' codec can't decode byte in position : invalid continuation byte
    A Unicode Decode Error is typically caused by not specifying the encoding of the file, and happens when you have a file with non-standard characters. For a quick fix, try opening the file in Sublime Text, and re-saving with encoding 'UTF-8'. Alternatively, pass the file's actual encoding to read_csv using the encoding parameter.
  3. pandas.parser.CParserError: Error tokenizing data.
    Parse Errors can be caused in unusual circumstances to do with your data format – try to add the parameter "engine='python'" to the read_csv function call; this changes the data reading function internally to a slower but more stable method.
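The three errors above can be reproduced and caught in a short sketch. The file names and data are invented, and the example assumes a reasonably recent Pandas release, where the parse error is exposed as pandas.errors.ParserError rather than the older pandas.parser.CParserError name quoted above.

```python
import io
import os
import tempfile

import pandas as pd

# 1. FileNotFoundError: the (made-up) file name does not exist
try:
    pd.read_csv("this_file_does_not_exist.csv")
    missing_file_error = None
except FileNotFoundError as err:
    missing_file_error = err
    print("File not found, check os.getcwd():", os.getcwd())

# 2. UnicodeDecodeError: the file is Latin-1 encoded but read as UTF-8
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "latin1_example.csv")
    with open(path, "w", encoding="latin-1") as f:
        f.write("name,city\nJosé,Málaga\n")
    try:
        data = pd.read_csv(path)  # default encoding is UTF-8
    except UnicodeDecodeError:
        # Retry, telling read_csv the encoding the file was written with
        data = pd.read_csv(path, encoding="latin-1")
print(data)

# 3. Parse error: one row has more fields than the header
ragged_text = "a,b\n1,2\n3,4,5\n"
try:
    pd.read_csv(io.StringIO(ragged_text))
    parser_error = None
except pd.errors.ParserError as err:
    parser_error = err
    print("Parse error:", err)
```

Catching the specific exception class, rather than a bare except, keeps unrelated problems visible.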

Advanced Read CSV Files

There are some additional flexible parameters in the Pandas read_csv() function that are useful to have in your arsenal of data science techniques:

Specifying Data Types

As mentioned before, CSV files do not contain any type information for data. Data types are inferred through examination of the top rows of the file, which can lead to errors. To manually specify the data types for different columns, the dtype parameter can be used with a dictionary of column names and data types to be applied, for example: dtype={"name": str, "age": np.int32}.

Note that for dates and date times, the format, columns, and other behaviour can be adjusted using the parse_dates, date_parser, dayfirst, and keep_date_col parameters.
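A minimal sketch of dtype and parse_dates together, using invented data held in memory. The column names match the dtype example in the text; birth_date is an assumed extra column.

```python
import io

import numpy as np
import pandas as pd

# Invented data; dates are written in ISO format (YYYY-MM-DD)
text = "name,age,birth_date\nAlice,34,1990-02-14\nBob,27,1997-11-03\n"

data = pd.read_csv(
    io.StringIO(text),
    dtype={"name": str, "age": np.int32},  # force these column types
    parse_dates=["birth_date"],            # convert this column to datetimes
)

print(data.dtypes)
```

Without these arguments, age would typically be inferred as a 64-bit integer and birth_date would stay as plain text.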

Skipping and Picking Rows and Columns From File

The nrows parameter specifies how many rows from the top of the CSV file to read, which is useful to take a sample of a large file without loading it completely. Similarly, the skiprows parameter allows you to specify rows to leave out, either at the start of the file (provide an int), or throughout the file (provide a list of row indices). Similarly, the usecols parameter can be used to specify which columns in the data to load.
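The three parameters can be compared side by side on a small invented table. Note that row indices given to skiprows count lines of the file from zero, so the header is line 0.

```python
import io

import pandas as pd

# Invented table: a header line followed by five data rows
text = "id,name,score\n1,a,10\n2,b,20\n3,c,30\n4,d,40\n5,e,50\n"

# nrows: read only the first two data rows
first_two = pd.read_csv(io.StringIO(text), nrows=2)

# skiprows with a list: drop file lines 1 and 3 (the rows with id 1 and id 3)
without_rows = pd.read_csv(io.StringIO(text), skiprows=[1, 3])

# usecols: load only the named columns
two_cols = pd.read_csv(io.StringIO(text), usecols=["id", "score"])

print(first_two)
print(without_rows)
print(two_cols)
```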

Custom Missing Value Symbols

When data is exported to CSV from different systems, missing values can be specified with different tokens. The na_values parameter allows you to customise the characters that are recognised as missing values. The default values interpreted as NA/NaN include: '', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan', '1.#IND', '1.#QNAN', 'N/A', 'NA', 'NULL', 'NaN', 'n/a', 'nan', 'null'.
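A short sketch of the effect of na_values, using invented data in which '.' and '??' are assumed to be a source system's markers for missing scores.

```python
import io

import pandas as pd

# Invented data: '??' and '.' are custom missing-value markers, 'NA' is a default one
text = "name,score\nAlice,10\nBob,??\nCara,.\nDan,NA\n"

# Default behaviour: only 'NA' is recognised as missing
default_read = pd.read_csv(io.StringIO(text))

# Custom markers are added to the defaults
custom_read = pd.read_csv(io.StringIO(text), na_values=[".", "??"])

print(default_read["score"].isna().sum(), "missing with defaults")
print(custom_read["score"].isna().sum(), "missing with custom markers")
print(custom_read.dtypes)
```

Once the custom markers are treated as missing, the score column can be read as numbers instead of text.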

    # Advanced CSV loading example
    data = pd.read_csv(
        "data/files/complex_data_example.tsv",     # Relative Python path to subdirectory
        sep='\t',                                  # Tab-separated value file
        quotechar="'",                             # Single quote allowed as quote character
        dtype={"salary": int},                     # Parse the salary column as an integer
        usecols=['name', 'birth_date', 'salary'],  # Only load the three columns specified
        parse_dates=['birth_date'],                # Interpret the birth_date column as a date
        skiprows=10,                               # Skip the first 10 rows of the file
        na_values=['.', '??']                      # Take any '.' or '??' values as NA
    )

CSV Format Advantages and Disadvantages

As with all technical decisions, storing your data in CSV format has both advantages and disadvantages. Be aware of the potential pitfalls and problems that you will encounter as you load, store, and exchange data in CSV format:

On the plus side:

  • CSV format is universal and the data can be loaded by almost any software.
  • CSV files are simple to understand and debug with a basic text editor
  • CSV files are quick to create and load into memory before analysis.

However, the CSV format has some negative sides:

  • There is no data type information stored in the text file; all typing (dates, int vs float, strings) is inferred from the data only.
  • There's no formatting or layout information storable – things like fonts, borders, column width settings from Microsoft Excel will be lost.
  • File encodings can become a problem if there are non-ASCII compatible characters in text fields.
  • CSV format is inefficient; numbers are stored as characters rather than binary values, which is wasteful. You will find, however, that your CSV data compresses well using zip compression.
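The compression point can be checked with a small sketch that writes a DataFrame back to CSV, once plain and once compressed, and compares file sizes. Two assumptions to flag: the data is invented, and gzip is used here in place of the zip compression mentioned above because it wraps a single file directly.

```python
import os
import tempfile

import pandas as pd

# Invented, repetitive data (repetitive text compresses especially well)
data = pd.DataFrame({"id": range(1000), "value": [1.5] * 1000})

with tempfile.TemporaryDirectory() as tmp_dir:
    plain_path = os.path.join(tmp_dir, "example.csv")
    gzip_path = os.path.join(tmp_dir, "example.csv.gz")

    # index=False leaves the DataFrame's row index out of the file
    data.to_csv(plain_path, index=False)
    data.to_csv(gzip_path, index=False, compression="gzip")

    plain_size = os.path.getsize(plain_path)
    gzip_size = os.path.getsize(gzip_path)

    # read_csv infers gzip compression from the .gz extension
    reloaded = pd.read_csv(gzip_path)

print("Plain CSV bytes:", plain_size)
print("Gzipped CSV bytes:", gzip_size)
```

How much space is saved depends heavily on the data; real-world files are usually less repetitive than this example.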

As an aside, in an effort to counter some of these disadvantages, two prominent data science developers in both the R and Python ecosystems, Wes McKinney and Hadley Wickham, recently introduced the Feather Format, which aims to be a fast, simple, open, flexible and multi-platform data format that supports multiple data types natively.

Additional Reading

  1. Official Pandas documentation for the read_csv function.
  2. Python 3 Notes on file paths, working directories, and using the OS module.
  3. Datacamp Tutorial on loading CSV files, including some additional OS commands.
  4. PythonHow Loading CSV tutorial.


Source: https://www.shanelynn.ie/python-pandas-read-csv-load-data-from-csv-files/
