by Shantnu Tiwari, data-science

In this tutorial you’re going to learn how to work with large Excel files in Pandas, focusing on reading and analyzing a file that is too large for Excel to open (a CSV, in this case) and then working with a subset of the original data.

This tutorial utilizes Python (tested with 64-bit versions of v2.7.9 and v3.4.3), Pandas (v0.16.1), and XlsxWriter (v0.7.3). We recommend using the Anaconda distribution to quickly get started, as it comes pre-installed with all the needed libraries.

Reading the File

The first file we’ll work with is a compilation of all the car accidents in England from 1979 to 2004. From it, we’ll extract all accidents that happened in London in the year 2000.

Excel

Start by downloading the source ZIP file from data.gov.uk, and extract the contents. Then try to open Accidents7904.csv in Excel. Be careful. If you don’t have enough memory, this could very well crash your computer.

What happens?

You should see a “File Not Loaded Completely” error since Excel can only handle about one million rows (1,048,576) per sheet.

We tested this in LibreOffice as well and received a similar error - “The data could not be loaded completely because the maximum number of rows per sheet was exceeded.”

To solve this, we can open the file in Pandas. Before we start, note that the source code is on GitHub.

Pandas

Within a new project directory, activate a virtualenv, and then install Pandas, for example with pip install pandas.

Now let’s build the script. Create a file called pandas_accidents.py and then add the following code:
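
The original listing is not preserved here, so what follows is only a minimal sketch of such a script. To keep it runnable without the 800MB download, it reads a tiny in-memory stand-in instead of Accidents7904.csv; the sample rows and every column name other than Accident_Index are assumptions made for illustration.

```python
import io

import pandas as pd

# Tiny stand-in for Accidents7904.csv so the sketch runs anywhere.
# With the real download you would instead call something like:
#     data = pd.read_csv("Accidents7904.csv", low_memory=False)
sample_csv = io.StringIO(
    "Accident_Index,Number_of_Vehicles,Day_of_Weeks\n"
    "A001,2,1\n"
    "A002,1,4\n"
    "A003,3,1\n"
)

# Read the file
data = pd.read_csv(sample_csv)

# Output the number of rows
print("Total rows: {0}".format(len(data)))

# See which headers are available
print(list(data))
```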

Here, we imported Pandas, read in the file (which could take some time, depending on how much memory your system has), and printed the total number of rows in the file as well as the available headers (i.e., the column titles).

When you run the script, it prints the total row count followed by the list of headers.

So, there are over six million rows! No wonder Excel choked. Turn your attention to the list of headers, the first one in particular, which shows up as \xef\xbb\xbfAccident_Index.

This should read Accident_Index. What’s with the extra \xef\xbb\xbf at the beginning? The \x prefix means that each value is a hexadecimal byte, and together the three bytes EF BB BF form the UTF-8 Byte Order Mark, which indicates that the text is Unicode.

Why does it matter to us?

You cannot assume the files you read are clean. They might contain extra symbols like this that can throw your scripts off.
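
As a small illustration of how such a stray marker ends up in a header, the sketch below builds a made-up CSV header that starts with the UTF-8 byte order mark and decodes it two ways using only the standard library. The sample bytes are invented. The "utf-8-sig" codec is a standard Python codec that drops a leading byte order mark, and pd.read_csv accepts the same name through its encoding parameter; this is an alternative to the renaming approach used later in the tutorial, not the author's method.

```python
import codecs

# Made-up file contents: a UTF-8 byte order mark followed by a CSV header.
raw = codecs.BOM_UTF8 + b"Accident_Index,Date\n"

# Plain "utf-8" keeps the mark as an invisible character in the first header.
plain_first = raw.decode("utf-8").split(",")[0]

# "utf-8-sig" strips a leading byte order mark while decoding.
clean_first = raw.decode("utf-8-sig").split(",")[0]

print(repr(plain_first))
print(repr(clean_first))
```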

This file is good, in that it is otherwise clean, but many files have missing data, data in internally inconsistent formats, and so on. So any time you have a file to analyze, the first thing you must do is clean it. How much cleaning? Enough to allow you to do some analysis. Follow the KISS principle.

What sort of cleanup might you require?

  • Fix date/time. The same file might have dates in different formats, like the American (mm-dd-yy) or European (dd-mm-yy) formats. These need to be brought into a common format.
  • Remove any empty values. The file might have blank columns and/or rows, and this will come up as NaN (Not a number) in Pandas. Pandas provides a simple way to remove these: the dropna() function. We saw an example of this in the last blog post.
  • Remove any garbage values that have made their way into the data. These are values which do not make sense (like the byte order mark we saw earlier). Sometimes, it might be possible to work around them. For example, there could be a dataset where the age was entered as a floating point number (by mistake). The int() function then could be used to make sure all ages are in integer format.
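
The three cleanup steps above can be sketched roughly as follows. The DataFrame, its column names, and the dd/mm/yyyy date format are all made up for illustration. Note that for a whole column, astype(int) is the Pandas counterpart of calling int() on each value.

```python
import pandas as pd

# Made-up data: one missing date and ages stored as floats by mistake.
raw = pd.DataFrame({
    "date": ["25/12/2000", "01/02/2000", None],
    "age": [34.0, 51.0, 27.0],
})

# Remove rows that contain any empty (NaN/None) value.
cleaned = raw.dropna().copy()

# Bring the dates into one common datetime format (input assumed dd/mm/yyyy).
cleaned["date"] = pd.to_datetime(cleaned["date"], format="%d/%m/%Y")

# Make sure all ages are integers.
cleaned["age"] = cleaned["age"].astype(int)

print(cleaned)
```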

Analyzing

For those of you who know SQL, you can use the SELECT, WHERE, AND/OR statements with different keywords to refine your search. We can do the same in Pandas, and in a way that is more programmer friendly.

To start off, let’s find all the accidents that happened on a Sunday. Looking at the headers above, there is a Day_of_Weeks field, which we will use.

In the ZIP file you downloaded, there’s a file called Road-Accident-Safety-Data-Guide-1979-2004.xls, which contains extra info on the codes used. If you open it up, you will see that Sunday has the code 1.
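
The original listing is missing at this point; a minimal sketch of such a filter, run against a few made-up rows, might look like this. The column name Day_of_Weeks is taken from the text above, so check it against the headers your own script prints.

```python
import pandas as pd

# Made-up rows standing in for the accident data.
data = pd.DataFrame({
    "Accident_Index": ["A001", "A002", "A003", "A004"],
    "Day_of_Weeks": [1, 4, 1, 7],  # 1 = Sunday, per the data guide
})

# The comparison builds a boolean mask; indexing with it keeps the True rows.
sundays = data[data["Day_of_Weeks"] == 1]

print("Accidents which happened on a Sunday:")
print(sundays)
```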

That’s how simple it is.

Here, we targeted the Day_of_Weeks field and returned a DataFrame with the condition we checked for - day of week 1.

When you run the script, the output shows that 693,847 accidents happened on a Sunday.

Let’s make our query more complicated: Find out all accidents that happened on a Sunday and involved more than twenty cars:
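
A sketch of how two conditions can be combined, again on made-up rows. Number_of_Vehicles is an assumed column name. Each condition needs its own parentheses because & binds more tightly than the comparison operators.

```python
import pandas as pd

# Made-up rows; the column names are assumptions for illustration.
data = pd.DataFrame({
    "Accident_Index": ["A001", "A002", "A003", "A004"],
    "Day_of_Weeks": [1, 1, 4, 1],
    "Number_of_Vehicles": [2, 25, 31, 22],
})

# Sunday AND more than twenty vehicles.
sunday_twenty_cars = data[
    (data["Day_of_Weeks"] == 1) & (data["Number_of_Vehicles"] > 20)
]

print(sunday_twenty_cars)
```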

Run the script. Now we have 10 accidents.

Let’s add another condition: weather.

Open the Road-Accident-Safety-Data-Guide-1979-2004.xls, and go to the Weather sheet. You’ll see that the code 2 means, “Raining with no heavy winds”.

Add that to our query:
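
One possible way to write the three-part query is sketched below on made-up rows. Weather_Conditions and Number_of_Vehicles are assumed column names, not confirmed by the text.

```python
import pandas as pd

# Made-up rows; the column names are assumptions for illustration.
data = pd.DataFrame({
    "Accident_Index": ["A001", "A002", "A003", "A004"],
    "Day_of_Weeks": [1, 1, 1, 4],
    "Number_of_Vehicles": [25, 30, 3, 40],
    "Weather_Conditions": [2, 1, 2, 2],  # 2 = raining, per the data guide
})

# Sunday AND more than twenty vehicles AND raining.
rainy_sunday_pileups = data[
    (data["Day_of_Weeks"] == 1)
    & (data["Number_of_Vehicles"] > 20)
    & (data["Weather_Conditions"] == 2)
]

print(rainy_sunday_pileups)
```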

So there were four accidents that happened on a Sunday, involving more than twenty cars, while it was raining.

We could continue making this more and more complicated, as needed. For now, we’ll stop since our main interest is to look at accidents in London.

If you look at Road-Accident-Safety-Data-Guide-1979-2004.xls again, there is a sheet called Police Force. The code for 1 says, “Metropolitan Police”. This is what is more commonly known as Scotland Yard, and is the police force responsible for most (though not all) of London. For our case, this is good enough, and we can extract this info like so:
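
A hedged sketch of that extraction, using made-up rows and an assumed Police_Force column name, combined with the Sunday condition from earlier:

```python
import pandas as pd

# Made-up rows; the column names are assumptions for illustration.
data = pd.DataFrame({
    "Accident_Index": ["A001", "A002", "A003", "A004"],
    "Police_Force": [1, 1, 3, 1],  # 1 = Metropolitan Police, per the data guide
    "Day_of_Weeks": [1, 5, 1, 1],
})

# Accidents handled by the Metropolitan Police on a Sunday.
london_data = data[(data["Police_Force"] == 1) & (data["Day_of_Weeks"] == 1)]

print(london_data)
```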

Run the script. This creates a new DataFrame with the accidents handled by the “Metropolitan Police” from 1979 to 2004 on a Sunday.

What if you wanted to create a new DataFrame that only contains accidents in the year 2000?

The first thing we need to do is convert the dates to a type that Python can work with, using the pd.to_datetime() function. This parses dates written in most common formats into datetime values, which are displayed as yyyy-mm-dd. Then we can create another DataFrame that only contains accidents for 2000:
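
A minimal sketch of that conversion and date filter follows. The rows are made up, and the Date column name and its dd/mm/yyyy layout are assumptions. Passing errors="coerce" turns unparseable values into NaT, which never satisfies either comparison.

```python
import pandas as pd

# Made-up rows; "Date" and its dd/mm/yyyy layout are assumptions.
london_data = pd.DataFrame({
    "Accident_Index": ["A001", "A002", "A003", "A004", "A005"],
    "Date": ["15/06/1999", "02/01/2000", "31/12/2000", "05/03/2001", "not a date"],
})

# Convert the text dates into Pandas datetime values.
dates = pd.to_datetime(london_data["Date"], format="%d/%m/%Y", errors="coerce")

# Keep only the rows whose date falls inside the year 2000.
london_data_2000 = london_data[
    (dates >= pd.Timestamp("2000-01-01")) & (dates <= pd.Timestamp("2000-12-31"))
]

print(london_data_2000)
```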

When run, you should see the Sunday accidents handled by the Metropolitan Police in 2000.

So, this is a bit confusing at first. Normally, to filter an array you would just use a for loop with a conditional:
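
For comparison, a hand-written loop over a plain Python list might look like the sketch below. The dates are made up for illustration.

```python
from datetime import date

# Made-up accident dates.
accident_dates = [
    date(1999, 6, 15),
    date(2000, 1, 2),
    date(2000, 12, 31),
    date(2001, 3, 5),
]

start, end = date(2000, 1, 1), date(2000, 12, 31)

# Loop over every element and keep the ones inside the date range.
in_2000 = []
for accident_date in accident_dates:
    if start <= accident_date <= end:
        in_2000.append(accident_date)

print(in_2000)
```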

However, you really shouldn’t define your own loop since many high-performance libraries, like Pandas, have helper functions in place. In this case, the Pandas code above does the looping for you: it checks every element, filters out data outside the set dates, and returns the data points that fall within them.

Nice!

Converting

Chances are that, while using Pandas, everyone else in your organization is stuck with Excel. Want to share the DataFrame with those using Excel?

First, we need to do some cleanup. Remember the byte order mark we saw earlier? That causes problems when writing this data to an Excel file - Pandas throws a UnicodeDecodeError. Why? Because the rest of the text is decoded as ASCII, but the hexadecimal values can’t be represented in ASCII.

We could write everything as Unicode, but remember this byte order mark is an unnecessary (to us) extra we don’t want or need. So we will get rid of it by renaming the column header:
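
The original listing is missing here; a rough sketch of the rename on a made-up DataFrame is shown below. In Python 3 the byte order mark appears in the header as the single character "\ufeff"; under Python 2, as in the tutorial, it shows up as the bytes \xef\xbb\xbf instead. Looking the bad name up by position avoids typing the invisible character.

```python
import pandas as pd

# Made-up frame whose first header carries a stray byte order mark.
london_data_2000 = pd.DataFrame({
    "\ufeffAccident_Index": ["A002", "A003"],
    "Day_of_Weeks": [1, 1],
})

# Take the damaged header from the first column position.
bad_name = london_data_2000.columns[0]

# Rename it in place instead of creating a modified copy.
london_data_2000.rename(columns={bad_name: "Accident_Index"}, inplace=True)

print(list(london_data_2000))
```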

This is the way to rename a column in Pandas; a bit complicated, to be honest. inplace = True is needed because we want to modify the existing structure, and not create a copy, which is what Pandas does by default.

Now we can save the data to Excel:
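
A minimal sketch of the save step, assuming the XlsxWriter package is installed. The DataFrame is made up, and the file is written to a temporary directory only so the example leaves nothing behind; the tutorial itself writes London_Sundays_2000.xlsx next to the script.

```python
import os
import tempfile

import pandas as pd

# Made-up subset standing in for the London, Sunday, year-2000 data.
london_data_2000 = pd.DataFrame({
    "Accident_Index": ["A002", "A003"],
    "Day_of_Weeks": [1, 1],
})

out_path = os.path.join(tempfile.mkdtemp(), "London_Sundays_2000.xlsx")

# The context manager saves and closes the workbook on exit.
with pd.ExcelWriter(out_path, engine="xlsxwriter") as writer:
    london_data_2000.to_excel(writer, sheet_name="Sheet1")

print("Saved to", out_path)
```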

Make sure to install XlsxWriter before running, for example with pip install XlsxWriter.

If all went well, this should have created a file called London_Sundays_2000.xlsx, and then saved our data to Sheet1. Open this file up in Excel or LibreOffice, and confirm that the data is correct.

Conclusion

So, what did we accomplish? Well, we took a very large file that Excel could not open and utilized Pandas to:

  1. Open the file.
  2. Perform SQL-like queries against the data.
  3. Create a new XLSX file with a subset of the original data.

Keep in mind that even though this file is nearly 800MB, in the age of big data, it’s still quite small. What if you wanted to open a 4GB file? Even if you have 8GB or more of RAM, that might still not be possible since much of your RAM is reserved for the OS and other system processes. In fact, my laptop froze a few times when first reading in the 800MB file. If I opened a 4GB file, it would have a heart attack.

So how do we proceed?

The trick is not to open the whole file in one go. That’s what we’ll look at in the next blog post. Until then, analyze your own data. Leave questions or comments below. You can grab the code for this tutorial from the repo.

About Shantnu Tiwari

Shantnu has worked in the low level/embedded domain for ten years. Shantnu suffered at the hands of C/C++ for several years before he discovered Python, and it felt like a breath of fresh air.

Library to create spreadsheet files compatible with MS Excel 97/2000/XP/2003 XLS files, on any platform, with Python 2.6, 2.7, 3.3+

Project description

xlwt

This is a library for developers to use to generate spreadsheet files compatible with Microsoft Excel versions 95 to 2003.

The package itself is pure Python with no dependencies on modules or packages outside the standard Python distribution.

Please read this before using this package: https://groups.google.com/d/msg/python-excel/P6TjJgFVjMI/g8d0eWxTBQAJ

Quick start
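
The listing that belongs under this heading is missing. The sketch below shows typical xlwt usage (creating a workbook, writing styled cells and a formula, and saving), not necessarily the project's own quick start example. The sheet name, cell values, and styles are arbitrary, and the workbook is saved to an in-memory buffer rather than a file such as example.xls so the example leaves nothing on disk.

```python
import io
from datetime import datetime

import xlwt

# Two cell styles: a bold red font with a number format, and a date format.
style0 = xlwt.easyxf(
    "font: name Times New Roman, color-index red, bold on",
    num_format_str="#,##0.00",
)
style1 = xlwt.easyxf(num_format_str="D-MMM-YY")

wb = xlwt.Workbook()
ws = wb.add_sheet("A Test Sheet")

# write(row, column, value, style) uses zero-based indices.
ws.write(0, 0, 1234.56, style0)
ws.write(1, 0, datetime.now(), style1)
ws.write(2, 0, 1)
ws.write(2, 1, 1)
ws.write(2, 2, xlwt.Formula("A3+B3"))

# save() accepts a filename or a writable binary stream.
buffer = io.BytesIO()
wb.save(buffer)

print("Workbook size in bytes:", len(buffer.getvalue()))
```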

Documentation

Documentation can be found in the docs directory of the xlwt package. If these aren’t sufficient, please consult the code in the examples directory and the source code itself.

The latest documentation can also be found at: https://xlwt.readthedocs.org/en/latest/

Problems?

Try the following in this order:

  • Read the source
  • Ask a question on https://groups.google.com/group/python-excel/

Acknowledgements

xlwt is a fork of the pyExcelerator package, which was developed by Roman V. Kiseliov. This product includes software developed by Roman V. Kiseliov <roman@kiseliov.ru>.

xlwt uses ANTLR v 2.7.7 to generate its formula compiler.
