Python: Working with Data

Reading files with Open

  1. file1 = open('file path', 'mode') opens the file you want to load.
  2. The mode can be 'r' for reading, 'w' for writing (which creates the file or overwrites an existing one) or 'a' for appending, which adds to the end of an existing file instead of replacing its contents.
  3. Always remember to close the file with file1.close().
  4. You can use file1.name to get the file name. It is an attribute, not a method, so there are no parentheses.
  5. Using with open('file path', 'mode') as file1: fileContent = file1.read() is better practice, as it closes the file automatically when the block ends.
  6. fileContent = file1.readline() reads a single line of the file, while file1.readlines() reads all the lines and stores them in a list.
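
Here is a minimal sketch of the reading steps above. The file name example.txt and its two lines are made up for the illustration, and the file is created in a temporary folder first so the snippet can run on its own.

```python
import os
import tempfile

# Create a small sample file so the example is self-contained.
# "example.txt" and its contents are hypothetical.
path = os.path.join(tempfile.mkdtemp(), "example.txt")
with open(path, "w") as file1:
    file1.write("first line\nsecond line\n")

# read() returns the whole file as one string.
# The with block closes the file automatically when it ends.
with open(path, "r") as file1:
    file_name = file1.name          # attribute, not a method
    file_content = file1.read()

# readline() returns one line per call;
# readlines() returns the remaining lines as a list.
with open(path, "r") as file1:
    first_line = file1.readline()
    remaining_lines = file1.readlines()

print(file_name)
print(file_content)
print(first_line, remaining_lines)
```

After the with block ends, file1.closed is True, so there is no need to call file1.close() yourself.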

Writing files with Open

  1. file1 = open('file path' , 'w') is used to open the file in write mode.
  2. file1.write("line to be written\n") will write the line to the file. write() does not add a line break for you, so include the \n yourself.
  3. We can use a for loop to write the contents of one file to another by opening them in read and write modes respectively.
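
The writing steps above can be sketched as follows. The file names source.txt and copy.txt are placeholders chosen for this example, and both files live in a temporary folder so nothing on your machine gets overwritten.

```python
import os
import tempfile

# Hypothetical file names inside a throwaway folder.
folder = tempfile.mkdtemp()
source_path = os.path.join(folder, "source.txt")
copy_path = os.path.join(folder, "copy.txt")

# 'w' creates the file (or overwrites it if it already exists).
with open(source_path, "w") as source:
    source.write("line A\n")
    source.write("line B\n")

# 'a' adds to the end of the existing file instead of replacing it.
with open(source_path, "a") as source:
    source.write("line C\n")

# Copy one file to another: open the first in read mode,
# the second in write mode, and loop over the lines.
with open(source_path, "r") as source, open(copy_path, "w") as copy:
    for line in source:
        copy.write(line)

with open(copy_path, "r") as copy:
    copied_text = copy.read()

print(copied_text)
```

Looping over the file object reads one line at a time, so this approach also works for files that are too big to read in one go.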

Loading and Saving data with Pandas

  1. Pandas is a useful and common library for Data Analysis.
  2. Manual loading of data from files as seen above can be a nightmare. Keep it as the last resort for files that do not follow a regular pattern, the so-called spaghetti files.
  3. import pandas as pd imports the library under its usual short name, pd.
  4. Loading a csv into Python is as simple as df=pd.read_csv("path_to_csv")
  5. The data from the csv file is loaded into a pandas data frame, which is a table with column headers.
  6. We can create a data frame from a dictionary. The keys correspond to the column headers while the values are lists holding the entries of each column. df = pd.DataFrame(dict1), where dict1 is the dictionary itself and not a string.
  7. df[["columnName"]] can be used to select the needed columns to form a new data frame for analysis. This lets us leave out unwanted categorical data.
  8. numpy also provides numpy.loadtxt() to load homogeneous data, such as a numeric array, from files.
  9. numpy.genfromtxt() can load simple heterogeneous files with column headers and different data types.
  10. pd.read_csv() is the most flexible and the most widely used.
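  11. df.loc[row_label, col_name] and df.iloc[row_no, col_no] are used with a pandas data frame to select parts of it for analysis. loc takes row and column labels, so with the default index df.loc[0, 'columnName1'] selects from the first row (the label is the number 0, not the string '0'), while df.iloc[0, 1] takes row and column positions. Both can also be used for slicing, for example df.iloc[1:2, 0:3]. Note that a loc slice includes its end label while an iloc slice stops before its end position.
  12. df['col_name'].unique() will return all the unique values of a column.
  13. df[condition], for example df[df['col_name'] > value], will return all the rows that match the condition, therefore giving a new data frame.
  14. To save a data frame to a file we can use df.to_csv("fileName.csv"). to_csv is a method of the data frame, not of pd. Pass index=False if you do not want the row index written as an extra column.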
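
To tie the pandas steps above together, here is a small sketch. The dictionary dict1, its column names (name, city, score) and the file name scores.csv are all invented for this example, so swap in your own data.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data: keys become column headers,
# each list holds the entries of that column.
dict1 = {
    "name": ["Ann", "Bob", "Cara", "Dan"],
    "city": ["Pune", "Delhi", "Pune", "Goa"],
    "score": [82, 67, 91, 74],
}
df = pd.DataFrame(dict1)

# Double brackets select columns and return a new data frame.
subset = df[["name", "score"]]

# loc uses labels, iloc uses positions.
first_score = df.loc[0, "score"]   # row label 0, column "score"
first_city = df.iloc[0, 1]         # row position 0, column position 1
block = df.iloc[1:3, 0:2]          # rows 1 and 2, columns 0 and 1

# Unique values of one column.
cities = df["city"].unique()

# Boolean condition inside the brackets keeps only the matching rows.
high_scores = df[df["score"] > 80]

# Save to csv (without the row index) and load it back.
csv_path = os.path.join(tempfile.mkdtemp(), "scores.csv")
df.to_csv(csv_path, index=False)
reloaded = pd.read_csv(csv_path)

print(subset)
print(high_scores)
print(reloaded.head())
```

Without index=False, to_csv would write the row index as an extra unnamed column, which then shows up again when the file is read back.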

We can have a discussion here https://www.linkedin.com/in/jayeshrao

Hi good to see y'all, I am an aspiring data analyst and will be posting stuff about Statistics, Python and R and also some interesting projects I do. B-)